Challenges
A US-based Fortune Global 500 healthcare service provider wanted to migrate their Cloudera-based data lake and analytical workloads to the Google Cloud Platform (GCP) to reduce operational overhead and take advantage of advanced analytics. They wanted to migrate storage and compute services to optimize cost, time, and resource utilization, and to enhance productivity. Since their Cloudera license was expiring in 2 months, they were looking for a partner to help them define an efficient solution architecture on GCP within a short time frame.

Expedited move from an on-premises CDH-based data lake to Google Cloud
Requirements
The healthcare service provider’s cloud enablement strategy is based on:
- Creating a Google Cloud Storage (GCS)-based enterprise-grade fully functional data lake to act as a single source of data for all business use cases
- Creating high-performance data ingestion pipelines for pulling data from a variety of data sources such as Oracle, Microsoft SQL Server, SAP HANA, PostgreSQL, and MySQL
- Supporting batch ingestion for bulk and incremental load for multiple data formats like plain text, JSON, and XML
- Automating ETL processing to reduce development errors and time-to-market
- Using Dataproc as the compute platform and Apache Hive as the compute engine
- Ensuring data security and compliance as per HIPAA guidelines

35% more revenue and 85% less maintenance cost
Solution
The Impetus team created a GCS-based data lake to ingest 40 TB of historical data plus 10 TB of daily feeds, and to process and validate that data for advanced analytics. The solution used 450 Apache Sqoop-based data ingestion pipelines to extract data from CDH into the GCS raw bucket. A framework was created to automate the ETL process using Shell scripts and Python-based utilities, which reduced development errors and time-to-market. It leveraged Apache Hive jobs to apply transformation rules and a Scala-based solution to validate and persist the cleansed data into the GCS staging bucket.
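At its core, the ingestion layer is an automation wrapper around Sqoop imports that land source tables under the GCS raw bucket. The snippet below is a minimal sketch of such a wrapper, not the client's actual framework: the JDBC URL, table name, bucket paths, and credentials are hypothetical placeholders, and it assumes an edge node with Sqoop installed and the GCS connector available so that `gs://` paths can be used as the target directory.

```python
import subprocess


def run_sqoop_import(jdbc_url, table, target_dir, username, password_file, num_mappers=4):
    """Launch a Sqoop import that lands one source table under the GCS raw bucket."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--username", username,
        "--password-file", password_file,
        "--target-dir", target_dir,        # e.g. gs://example-raw-bucket/oracle/claims
        "--num-mappers", str(num_mappers),
        "--as-textfile",
    ]
    subprocess.run(cmd, check=True)        # fail fast so the scheduler can retry or alert


if __name__ == "__main__":
    # All identifiers below are hypothetical placeholders for illustration only.
    run_sqoop_import(
        jdbc_url="jdbc:oracle:thin:@//example-host:1521/ORCL",
        table="CLAIMS",
        target_dir="gs://example-raw-bucket/oracle/claims/2020-01-01",
        username="etl_user",
        password_file="hdfs:///user/etl/.sqoop_password",
    )
```

In the real framework, hundreds of such pipelines would be driven from configuration metadata rather than hard-coded arguments, with the same pattern repeated per source table.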
Highlights
- Configured GCS buckets with the multi-regional storage class and separate raw and staging layers to ensure high availability and low latency of data across geographies
- Configured multiple ACLs for specific folders to ensure restricted data visibility
- Used Google-managed encryption keys for data security and HIPAA compliance
- Created Cloud Dataproc clusters with an auto-scaling policy on primary and secondary worker nodes and a cool-down period of 4 minutes (see the autoscaling sketch after this list). This helped save cost by scaling idle clusters down to the minimum number of worker nodes and releasing unused GCE instances
- Integrated IBM Tivoli Workload Scheduler with a legacy Remedy application to schedule and monitor Apache Hive transformation jobs for failover support
- Configured data lifecycle management rules in JSON for archival: blob objects older than 13 months were first moved to Nearline Storage and later to Coldline Storage to reduce storage cost (see the lifecycle sketch after this list)
- Used the GCP Cloud Monitoring suite to set up alerting policies on multiple resource metrics such as CPU and disk utilization, Cloud Dataproc YARN node state, etc.
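The auto-scaling behavior described above is governed by a Dataproc autoscaling policy. The sketch below shows how such a policy could be created with the google-cloud-dataproc Python client; the project ID, region, policy ID, and minimum/maximum instance counts are illustrative assumptions, while the 4-minute cool-down period reflects the configuration described above.

```python
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"  # hypothetical region
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="datalake-autoscale",  # hypothetical policy ID
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        cooldown_period=duration_pb2.Duration(seconds=240),  # 4-minute cool-down
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
            scale_up_factor=1.0,
            scale_down_factor=1.0,
        ),
    ),
    # Instance counts are illustrative assumptions, not from the case study.
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2, max_instances=10
    ),
    secondary_worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=0, max_instances=20
    ),
)

client.create_autoscaling_policy(
    parent=f"projects/example-project/locations/{region}", policy=policy
)
```

Similarly, the 13-month archival rule can be expressed as GCS lifecycle rules. The sketch below applies equivalent rules through the google-cloud-storage Python client rather than raw JSON; the bucket name and the later Coldline age threshold are illustrative assumptions.

```python
from google.cloud import storage

client = storage.Client(project="example-project")     # hypothetical project ID
bucket = client.get_bucket("example-datalake-bucket")  # hypothetical bucket name

# After ~13 months (395 days) move objects to Nearline; the later Coldline
# threshold (~25 months here) is an illustrative assumption.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=395)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=760)
bucket.patch()  # persist the updated lifecycle rules on the bucket
```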

65% reduction in turnaround time with scalable GCP solution
The solution enabled the client to migrate their data lake to GCP in just 2 months, increasing overall revenue by 35% while reducing operations cost. The Google Cloud data lake was 85% cheaper than their on-premises Cloudera data lake, resulting in valuable cost savings. Moreover, the migration helped the client eliminate or simplify many tasks and respond faster to new business opportunities.

99.9999999999% (12 9’s) data availability and 3x faster operationalization of data pipelines
Impact
The solution enabled complex data analytics and ML data modeling on claims data in the cloud by exposing an API for efficiently querying and retrieving that data. By ensuring that compute and storage could scale independently, the client could reduce data platform operational costs by 35% across all use cases. It also enabled new data pipelines to be operationalized 3x faster by leveraging the platform's out-of-the-box processes and components.
Business benefits
- Operationalize Power BI reporting for clients
- Meet the desired data availability SLA of 24 hours after processing the monthly transaction data on SQL Server
- Increase data availability to 99.9999999999% (12 9’s) over a given year
- Reduce data storage and management cost by 50%