Fortune 500 healthcare service provider avoids a legacy data lake license renewal by migrating to Google Cloud Platform
Reduced time-to-market and operations cost by migrating a CDH data lake to a cloud data lake
A US-based Fortune Global 500 healthcare service provider wanted to migrate their Cloudera-based data lake and analytical workloads to the Google Cloud Platform (GCP) to reduce operational overhead and take advantage of advanced analytics. They wanted to migrate storage and compute services to optimize cost, time, and resource utilization, and to enhance productivity. Since their Cloudera license was expiring in 2 months, they were looking for a partner who could define an efficient solution architecture on GCP within that short time frame.
The healthcare service provider’s cloud enablement strategy is based on:
- Creating a Google Cloud Storage (GCS)-based enterprise-grade fully functional data lake to act as a single source of data for all business use cases
- Creating high-performance data ingestion pipelines for pulling data from a variety of data sources like Oracle, MS SQL Server, SAP HANA, PostgreSQL, and MySQL
- Supporting batch ingestion for bulk and incremental load for multiple data formats like plain text, JSON, and XML
- Automating ETL processing to reduce development error and time-to-market
- Using Dataproc as the compute platform and Apache Hive as the compute engine
- Ensuring data security and compliance as per HIPAA guidelines
The Impetus team created a GCS-based data lake to ingest 40 TB of historical data plus 10 TB of daily feeds, and to process and validate that data for advanced analytics. The solution used 450 Apache Sqoop data ingestion pipelines to extract data from CDH into the GCS raw bucket. A framework of Shell scripts and Python-based utilities automated the ETL process, reducing development errors and time-to-market. Apache Hive jobs ran the transformation rules, and a Scala-based solution validated and persisted the cleansed data into the GCS staging bucket.
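The bulk and incremental ingestion pipelines described above can be sketched as a small Python utility that generates `sqoop import` commands. The connection strings, table names, and bucket paths below are illustrative assumptions, not the client's actual configuration.

```python
# Illustrative sketch: building Sqoop import commands for bulk and
# incremental loads into a GCS raw bucket. All names and paths here
# are hypothetical examples.

def sqoop_import_cmd(jdbc_url, table, target_dir, incremental=False,
                     check_column=None, last_value=None):
    """Return a `sqoop import` command line for a bulk or incremental load."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,   # e.g. a gs:// path in the raw bucket
        "--num-mappers", "4",
    ]
    if incremental:
        # Append-mode incremental load driven by a monotonically
        # increasing column (e.g. a primary key or timestamp).
        cmd += ["--incremental", "append",
                "--check-column", check_column,
                "--last-value", str(last_value)]
    return " ".join(cmd)

# Bulk load of a full table (hypothetical Oracle source)
bulk = sqoop_import_cmd(
    "jdbc:oracle:thin:@db-host:1521/ORCL",
    "PATIENT_CLAIMS",
    "gs://raw-bucket/oracle/patient_claims")

# Incremental daily feed picking up rows added since the last run
incr = sqoop_import_cmd(
    "jdbc:postgresql://db-host:5432/ehr",
    "encounters",
    "gs://raw-bucket/postgres/encounters",
    incremental=True, check_column="encounter_id", last_value=1048576)

print(bulk)
print(incr)
```

In a framework like the one described, a driver script would iterate over a table-to-bucket mapping and submit one such command per pipeline, persisting the `last_value` watermark between runs.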
- Configured GCS buckets with the multi-regional storage class and separate raw and staging layers to ensure high availability and low latency of data across geographies
- Configured multiple ACLs for specific folders to ensure restricted data visibility
- Used Google-managed encryption keys for data security and HIPAA compliance
- Created Cloud Dataproc clusters with an auto-scale policy on primary and secondary worker nodes and a cool-down period of 4 minutes. This helped save cost by scaling idle clusters down to the minimum number of worker nodes
- Integrated IBM Tivoli Workload Scheduler with a legacy Remedy application to schedule and monitor Apache Hive transformation jobs for failover support
- Configured data lifecycle management rules in JSON for archival: blob objects older than 13 months were moved first to Nearline Storage and then to Coldline Storage to reduce storage cost
- Used the GCP Cloud Monitoring suite to set up alerting policies on multiple resource types, such as CPU and disk utilization and Cloud Dataproc YARN node state
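The JSON archival rules mentioned above can be expressed in the GCS lifecycle configuration format (for example, applied with `gsutil lifecycle set`). The 13-month threshold is approximated here as 395 days; the second threshold and the storage-class match conditions are assumptions, since the source does not state when objects moved on to Coldline.

```json
{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
        "condition": {"age": 395, "matchesStorageClass": ["MULTI_REGIONAL"]}
      },
      {
        "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
        "condition": {"age": 760, "matchesStorageClass": ["NEARLINE"]}
      }
    ]
  }
}
```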
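The auto-scale behavior described above (primary and secondary workers with a 4-minute cool-down) corresponds to a Dataproc autoscaling policy. The sketch below shows such a policy as a Python dict in the shape the Dataproc v1 API accepts; the policy name, worker counts, and scale factors are illustrative assumptions, not the client's actual values.

```python
import json

# Sketch of a Dataproc autoscaling policy with a 4-minute cool-down.
# Instance counts and scale factors below are hypothetical.
autoscaling_policy = {
    "id": "hive-etl-autoscale",            # hypothetical policy name
    "workerConfig": {                      # primary worker nodes
        "minInstances": 2,
        "maxInstances": 10,
    },
    "secondaryWorkerConfig": {             # secondary (preemptible) workers
        "minInstances": 0,
        "maxInstances": 20,
    },
    "basicAlgorithm": {
        "cooldownPeriod": "240s",          # the 4-minute cool-down period
        "yarnConfig": {
            "scaleUpFactor": 0.5,          # add capacity for half of pending YARN memory
            "scaleDownFactor": 1.0,        # release all idle capacity
            "gracefulDecommissionTimeout": "600s",
        },
    },
}

print(json.dumps(autoscaling_policy, indent=2))
```

With a policy like this attached to a cluster, Dataproc waits out the cool-down period after each scaling event before evaluating YARN memory pressure again, which is what keeps idle clusters pinned at the minimum worker count.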
The solution enabled the client to migrate their data lake to GCP in a short period of 2 months, increasing overall revenue by 35% while reducing operations cost. The Google Cloud data lake was 85% cheaper than their on-premise Cloudera data lake, resulting in valuable cost savings. Moreover, the migration helped the client eliminate or simplify many tasks and respond faster to new business opportunities.