A future-proof, centralized data warehouse with Snowflake on Azure
Data ingestion from complex sources like SAP HANA to create a single source of truth
Challenge
A US-based healthcare services provider wanted to revamp its analytics approach by consolidating its data warehouse, ETL, and BI on the cloud to increase speed and flexibility. The provider wanted to move its CDH-based data lake to Snowflake to ensure easy data availability and create a single source of truth across the organization.
Requirements
The healthcare services provider was looking for a future-proof, comprehensive, and flexible data warehouse to process data faster and deliver analytics right to the patient. They wanted a solution that would:
- Reduce total cost of ownership by enabling data sharing across regions
- Support data ingestion from a variety of relational databases, such as Oracle, Microsoft SQL Server, PostgreSQL, MySQL, and SAP HANA
- Support batch ingestion with bulk and incremental load capabilities for multiple data formats
- Maintain high data quality across the data lifecycle
Solution
The Impetus team followed a 4-step approach to move the client’s data warehouse to Snowflake:
- Ingestion
- Curation
- Validation
- Consumption
Data ingestion
IBM DataStage and Snowflake SQL workflows were used to ingest data from CDH into Snowflake. To reduce compute billing credit consumption and speed up data ingestion, JDBC inserts were replaced with Snowflake’s bulk load feature.
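As an illustration of this pattern, a minimal Snowflake SQL sketch of a bulk load from a staged extract is shown below; the stage, path, and table names are assumptions, not the client’s actual objects:

```sql
-- Minimal sketch: bulk-load staged extract files into a Snowflake table
-- instead of issuing row-by-row JDBC inserts.
-- Stage, path, and table names are illustrative assumptions.
CREATE OR REPLACE STAGE cdh_extract_stage
  URL = 'azure://<storage-account>.blob.core.windows.net/cdh-extracts/';
  -- storage integration / credentials omitted for brevity

COPY INTO raw.patient_encounters
  FROM @cdh_extract_stage/encounters/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
  ON_ERROR = 'ABORT_STATEMENT';
```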
Data curation
A data acquisition tier was developed on IBM DataStage to extract and load data from the on-premises CDH environment. Snowflake Streams and Tasks then cleansed and prepared the data for transformation. Splitting cleansing and preparation into separate steps made failures easier to handle: only the failed step had to be re-executed.
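A simplified sketch of that split is shown below, using a stream to capture newly landed rows and a scheduled task to cleanse and prepare them; the table, column, and warehouse names are assumptions for illustration:

```sql
-- Minimal sketch: a stream tracks new rows in the raw table, and a task
-- periodically cleanses/prepares them into a staging table.
-- Object and column names are illustrative assumptions.
CREATE OR REPLACE STREAM raw.patient_encounters_stream
  ON TABLE raw.patient_encounters;

CREATE OR REPLACE TASK curate.prepare_patient_encounters
  WAREHOUSE = curation_wh
  SCHEDULE = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('raw.patient_encounters_stream')
AS
  INSERT INTO stg.patient_encounters
  SELECT
    encounter_id,
    TRIM(patient_id)             AS patient_id,
    TRY_TO_DATE(admit_date)      AS admit_date,
    (METADATA$ACTION = 'DELETE') AS is_deleted
  FROM raw.patient_encounters_stream;

ALTER TASK curate.prepare_patient_encounters RESUME;
```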
The solution also used:
- Snowflake Merge to update, delete, and insert data into target Snowflake tables (sketched after this list)
- Azure Private Link to access Snowflake from the Azure Virtual Network over a private endpoint, with defined authorization policies
- Snowflake end-to-end encryption to protect data in motion
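A minimal Snowflake Merge sketch, continuing the assumed table names above, that upserts prepared rows into a curated target table and removes rows deleted at the source:

```sql
-- Minimal sketch: apply prepared changes to the curated target table.
-- Table and column names are illustrative assumptions.
MERGE INTO curated.patient_encounters AS tgt
USING stg.patient_encounters AS src
  ON tgt.encounter_id = src.encounter_id
WHEN MATCHED AND src.is_deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET
  tgt.patient_id = src.patient_id,
  tgt.admit_date = src.admit_date
WHEN NOT MATCHED AND NOT src.is_deleted THEN
  INSERT (encounter_id, patient_id, admit_date)
  VALUES (src.encounter_id, src.patient_id, src.admit_date);
```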
Data validation
The solution used a Snowflake SQL-based data validation component to check data quality. Data validation involved (see the sketch after this list):
- Comparing the source record count captured at the start of ingestion with the record count in the target table
- Matching aggregate results on a numeric field between source and target
- Matching metadata at the source with that of the target Snowflake table
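These checks can be expressed as plain Snowflake SQL comparisons; a minimal sketch with assumed table and column names:

```sql
-- Minimal validation sketch; table and column names are assumptions.

-- 1. Record count: source count captured at ingestion start vs. target
SELECT
  (SELECT COUNT(*) FROM stg.patient_encounters)     AS source_rows,
  (SELECT COUNT(*) FROM curated.patient_encounters) AS target_rows;

-- 2. Aggregate on a numeric field
SELECT
  (SELECT SUM(claim_amount) FROM stg.patient_claims)     AS source_total,
  (SELECT SUM(claim_amount) FROM curated.patient_claims) AS target_total;

-- 3. Metadata: column names and data types of the target table
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'CURATED'
  AND table_name   = 'PATIENT_ENCOUNTERS'
ORDER BY ordinal_position;
```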
Data consumption
The validated data, residing in the eastern US, was securely shared with consumers in western Europe using Snowflake Data Share. This helped avoid geo-replication of data, enabled granular control to share only what was required, and reduced latency.
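A minimal provider-side sketch of such a share, with assumed database, schema, table, and consumer account names:

```sql
-- Minimal sketch: expose curated tables to a consumer account through a
-- share, without copying the underlying data. All names are assumptions.
CREATE SHARE IF NOT EXISTS analytics_share;

GRANT USAGE ON DATABASE healthcare_dw       TO SHARE analytics_share;
GRANT USAGE ON SCHEMA healthcare_dw.curated TO SHARE analytics_share;
GRANT SELECT ON TABLE healthcare_dw.curated.patient_encounters
  TO SHARE analytics_share;

-- Add the consumer account (placeholder organization/account identifier)
ALTER SHARE analytics_share ADD ACCOUNTS = consumer_org.consumer_account;
```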
Impact
The solution reduced the existing team’s transformation effort from one year to one month, lowered ingestion time by 60%, and delivered $500K in annual savings. It also ensured the availability of high-quality data across the organization for advanced analytics.