Challenges
A US-based Fortune Global 500 healthcare service provider wanted to migrate their Cloudera-based data lake and analytical workloads to the Google Cloud Platform (GCP) to reduce operational overhead and take advantage of advanced analytics. They wanted to migrate storage and compute services to optimize cost, time, and resource utilization, and to enhance productivity. Since their Cloudera license was expiring in 2 months, they were looking for a partner to help them define an efficient solution architecture on GCP within a short time frame.

Expedited move from an on-premises CDH-based data lake to Google Cloud
Requirements
The healthcare service provider’s cloud enablement strategy is based on:
- Creating a Google Cloud Storage (GCS)-based enterprise-grade fully functional data lake to act as a single source of data for all business use cases
- Creating high-performance data ingestion pipelines for pulling data from a variety of data sources such as Oracle, Microsoft SQL Server, SAP HANA, PostgreSQL, and MySQL
- Supporting batch ingestion for bulk and incremental load for multiple data formats like plain text, JSON, and XML
- Automating ETL processing to reduce development errors and time-to-market
- Using Dataproc as the compute platform and Apache Hive as the compute engine
- Ensuring data security and compliance as per HIPAA guidelines

35% more revenue and 85% less maintenance cost
Solution
The Impetus team created a GCS-based data lake to ingest 40 TB of historical data plus 10 TB of daily feeds, and to process and validate that data for advanced analytics. The solution used 450 Apache Sqoop-based data ingestion pipelines to extract data from CDH into the GCS raw bucket. A framework was created to automate the ETL process using Shell scripts and Python-based utilities, which reduced development errors and time-to-market. It leveraged Apache Hive jobs to apply transformation rules and a Scala-based solution to validate and persist the cleansed data into the GCS staging bucket.
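At its core, the ingestion layer is an automation wrapper around Sqoop imports that land source tables under the GCS raw bucket. The snippet below is a minimal sketch of such a wrapper, not the client's actual framework: the JDBC URL, table name, bucket paths, and credentials are hypothetical placeholders, and it assumes an edge node with Sqoop installed and the GCS connector available so that `gs://` paths can be used as the target directory.

```python
import subprocess


def run_sqoop_import(jdbc_url, table, target_dir, username, password_file, num_mappers=4):
    """Launch a Sqoop import that lands one source table under the GCS raw bucket."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--username", username,
        "--password-file", password_file,
        "--target-dir", target_dir,        # e.g. gs://example-raw-bucket/oracle/claims
        "--num-mappers", str(num_mappers),
        "--as-textfile",
    ]
    subprocess.run(cmd, check=True)        # fail fast so the scheduler can retry or alert


if __name__ == "__main__":
    # All identifiers below are hypothetical placeholders for illustration only.
    run_sqoop_import(
        jdbc_url="jdbc:oracle:thin:@//example-host:1521/ORCL",
        table="CLAIMS",
        target_dir="gs://example-raw-bucket/oracle/claims/2020-01-01",
        username="etl_user",
        password_file="hdfs:///user/etl/.sqoop_password",
    )
```

In the real framework, hundreds of such pipelines would be driven from configuration metadata rather than hard-coded arguments, with the same pattern repeated per source table.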
Highlights
- Configured GCS buckets with the multi-regional storage class and separate raw and staging layers to ensure high availability and low latency of data across geographies
- Configured multiple ACLs for specific folders to ensure restricted data visibility
- Used Google-managed encryption keys for data security and HIPAA compliance
- Created Cloud Dataproc clusters with an auto-scaling policy on primary and secondary worker nodes and a cool-down period of 4 minutes (see the autoscaling sketch after this list). This helped save cost by scaling idle clusters down to the minimum number of worker nodes and releasing unused GCE instances
- Integrated IBM Tivoli Workload Scheduler with a legacy Remedy application to schedule and monitor Apache Hive transformation jobs for failover support
- Configured data lifecycle management rules in JSON for archival: blob objects older than 13 months were first moved to Nearline Storage and later to Coldline Storage to reduce storage cost (see the lifecycle sketch after this list)
- Used the GCP Cloud Monitoring suite to set up alerting policies on multiple resource metrics such as CPU and disk utilization, Cloud Dataproc YARN node state, etc.
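The auto-scaling behavior described above is governed by a Dataproc autoscaling policy. The sketch below shows how such a policy could be created with the google-cloud-dataproc Python client; the project ID, region, policy ID, and minimum/maximum instance counts are illustrative assumptions, while the 4-minute cool-down period reflects the configuration described above.

```python
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

region = "us-central1"  # hypothetical region
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = dataproc_v1.AutoscalingPolicy(
    id="datalake-autoscale",  # hypothetical policy ID
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        cooldown_period=duration_pb2.Duration(seconds=240),  # 4-minute cool-down
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
            scale_up_factor=1.0,
            scale_down_factor=1.0,
        ),
    ),
    # Instance counts are illustrative assumptions, not from the case study.
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2, max_instances=10
    ),
    secondary_worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=0, max_instances=20
    ),
)

client.create_autoscaling_policy(
    parent=f"projects/example-project/locations/{region}", policy=policy
)
```

Similarly, the 13-month archival rule can be expressed as GCS lifecycle rules. The sketch below applies equivalent rules through the google-cloud-storage Python client rather than raw JSON; the bucket name and the later Coldline age threshold are illustrative assumptions.

```python
from google.cloud import storage

client = storage.Client(project="example-project")     # hypothetical project ID
bucket = client.get_bucket("example-datalake-bucket")  # hypothetical bucket name

# After ~13 months (395 days) move objects to Nearline; the later Coldline
# threshold (~25 months here) is an illustrative assumption.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=395)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=760)
bucket.patch()  # persist the updated lifecycle rules on the bucket
```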

65% reduction in turnaround time with scalable GCP solution
The solution enabled the client to migrate their data lake to GCP in just 2 months, increasing overall revenue by 35% while reducing operations cost. The Google Cloud data lake was 85% cheaper than their on-premises Cloudera data lake, resulting in valuable cost savings. Moreover, the migration helped the client eliminate or simplify many tasks and respond faster to new business opportunities.

99.9999999999% (12 9’s) data availability and 3x faster operationalization of data pipelines
Impact
The solution enabled complex data analytics and ML data modeling on claims data in the cloud by exposing an API for efficiently querying and retrieving that data. By ensuring that compute and storage could scale independently, the client could reduce data platform operational costs by 35% across all use cases. It also enabled new data pipelines to be operationalized 3x faster by leveraging the platform's out-of-the-box processes and components.
Business benefits
- Operationalize Power BI reporting for clients
- Meet the desired data availability SLA of 24 hours after processing the monthly transaction data on SQL Server
- Increase data availability to 99.9999999999% (12 9’s) over a given year
- Reduce data storage and management cost by 50%