Data ingestion and preparation with Snowflake on Azure
Snowflake is a popular cloud data warehouse choice for scalability, agility, cost-effectiveness, and a comprehensive range of data integration tools. Enterprises prefer Snowflake for features like continuous bulk loading/unloading, data protection, secure data sharing, standard and extended SQL support, a broad range of connectors and drivers, and support for syncing and replicating databases.
At Impetus Technologies, we have extensive experience helping large-scale companies seamlessly migrate legacy data lakes to the cloud. This article details how we helped a US-based Fortune Global 500 healthcare service provider migrate to a high-performance, scalable, cloud-based data lake that minimized cost and data ingestion time.
The healthcare service provider was facing the following challenges with legacy data processing pipelines:
- Managing hardware infrastructure – Spinning up servers, orchestrating batch ETL jobs, handling outages and downtime, and integrating tools to ingest, organize, pre-process, and query data stored in the lake
- High maintenance costs – Capital expenses like servers and storage disks, and operating expenses like IT and engineering costs
- Limited scalability – Manual addition and configuration of servers, monitoring resource utilization, and maintenance of additional servers
To address these challenges, we proposed migrating the legacy data platform to a cloud-based solution built on Snowflake that would:
- Support data ingestion from a variety of relational databases like Oracle, MS SQL, and SAP HANA via IBM InfoSphere DataStage ETL jobs
- Cleanse and prepare raw data using Snowflake SQL
- Store data on Snowflake storage
- Orchestrate, schedule, and monitor batch ETL jobs using IBM Tivoli Workload Automation
- Integrate IBM Tivoli Workload Scheduler with the Remedy Incident Management System to send email notifications on failures
Application architecture on Snowflake
The healthcare service provider wanted to retain their existing data ingestion infrastructure, which ingested data files from relational databases like Oracle, MS SQL, and SAP HANA and landed them in Snowflake storage.
We modified the existing IBM InfoSphere Information Server DataStage 11.7 ETL pipelines to load data into Snowflake raw tables. DataStage 11.7 ships with a native Snowflake Connector that reads, writes, and loads data into the Snowflake data warehouse and integrates directly into the ETL job design.
We used the connector's Load functionality, which is best suited for writing large volumes of data into the database without affecting speed or compute capacity.
Depending on the data ingestion frequency and business requirements, the pipeline pulled the data, automatically identified the table schema, and created raw tables with metadata (columns, partitions) for downstream data transformations. The ETL job performed operations like data filtering, validation, enrichment, and compression, and stored the data on Azure Blob Storage in Parquet format.
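The loading step above can be sketched in Snowflake SQL. All object names, the container URL, and the column layout below are hypothetical, chosen only to illustrate staging Parquet files from Azure Blob Storage into a raw table:

```sql
-- External stage over the Azure Blob container (URL and SAS token are placeholders)
CREATE STAGE IF NOT EXISTS ingest_stage
  URL = 'azure://exampleacct.blob.core.windows.net/landing/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = PARQUET);

-- Raw table mirroring the source schema (columns are illustrative)
CREATE TABLE IF NOT EXISTS raw_db.ingest.claims_raw (
  claim_id   NUMBER,
  member_id  NUMBER,
  claim_date DATE,
  amount     NUMBER(12,2)
);

-- Bulk-load the Parquet files; columns are matched to the table by name
COPY INTO raw_db.ingest.claims_raw
  FROM @ingest_stage/claims/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

`COPY INTO` is Snowflake's bulk-load path, which is why it handles large volumes without competing with interactive query compute.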
Data cleansing and preparation
Once the data was loaded into the Snowflake raw tables, we used Streams and Tasks to build and orchestrate end-to-end data pipelines in Snowflake. Tasks provide a native scheduling mechanism, eliminating dependencies on external jobs for common scheduling needs. A stream connects a raw table to its downstream target table through regularly executed processing logic that picks up new data from the raw table and transforms it for the target table. Splitting the pipeline into two stages also improved failure handling, since a single stage could be re-run in case of failure.
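A minimal Streams-and-Tasks pipeline of this shape might look as follows; the object names and the `etl_wh` warehouse are assumptions for illustration:

```sql
-- Stream captures new rows arriving in the raw table
CREATE OR REPLACE STREAM claims_raw_stream
  ON TABLE raw_db.ingest.claims_raw;

-- Task runs on a schedule, but only when the stream has unconsumed data
CREATE OR REPLACE TASK transform_claims
  WAREHOUSE = etl_wh
  SCHEDULE = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('CLAIMS_RAW_STREAM')
AS
  INSERT INTO curated_db.claims.claims
  SELECT claim_id, member_id, claim_date, amount
  FROM claims_raw_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume to start the schedule
ALTER TASK transform_claims RESUME;
```

Because consuming the stream advances its offset only when the task's DML commits, a failed run leaves the data in the stream and the task can simply run again.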
For SCD Type I jobs, we used Snowflake's MERGE statement to apply updates, deletes, and inserts to target Snowflake tables. For SCD Type II jobs, delta records loaded into raw tables by the ETL job were appended to the target Snowflake table.
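As a sketch of the two patterns, with hypothetical table and column names (including a `change_op` flag assumed to mark deletes in the delta feed):

```sql
-- SCD Type I: overwrite attributes in place via MERGE
MERGE INTO curated_db.dim.member AS tgt
USING raw_db.ingest.member_delta AS src
  ON tgt.member_id = src.member_id
WHEN MATCHED AND src.change_op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  tgt.name = src.name,
  tgt.city = src.city
WHEN NOT MATCHED THEN INSERT (member_id, name, city)
  VALUES (src.member_id, src.name, src.city);

-- SCD Type II: append delta rows as new versions instead of updating
INSERT INTO curated_db.dim.member_hist (member_id, name, city, effective_from)
SELECT member_id, name, city, CURRENT_TIMESTAMP()
FROM raw_db.ingest.member_delta;
```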
Snowflake SQL jobs validated the data by:
- Recording row counts at ingestion start, persisting them into an audit table, and comparing them with the target table to ensure the difference stayed within an allowed range
- Comparing an aggregated result on a numeric field against the source side
- Capturing metadata at the beginning of data preparation and ensuring it matched the target Snowflake table
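The first two checks above can be expressed as a single Snowflake SQL query. The audit table, its columns, and the tolerance of 100 rows are assumptions made for illustration:

```sql
-- Reconcile the target table against counts/sums captured at ingestion time
WITH tgt AS (
  SELECT COUNT(*) AS target_count, SUM(amount) AS target_sum
  FROM curated_db.claims.claims
  WHERE load_date = CURRENT_DATE()
)
SELECT
  a.source_count,
  tgt.target_count,
  ABS(a.source_count - tgt.target_count) <= 100 AS count_within_tolerance,
  a.source_sum = tgt.target_sum                 AS aggregate_matches
FROM audit_db.ops.ingest_audit AS a
CROSS JOIN tgt
WHERE a.load_date = CURRENT_DATE();
```

A validation job can then fail the batch whenever either boolean column is false.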
Our primary data lake was hosted in the Eastern US region, where most business units were based. For a business unit in Western Europe, we used Snowflake Secure Data Sharing to share the target tables without duplicating the ingestion effort, enabling that unit to analyze the replicated data feeds.
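Creating such a share on the provider side might look like the following sketch; the share, database, and consumer account identifiers are hypothetical, and sharing across regions additionally requires the database to be replicated to the consumer's region first:

```sql
-- Run in the provider (Eastern US) account
CREATE SHARE claims_share;
GRANT USAGE  ON DATABASE curated_db        TO SHARE claims_share;
GRANT USAGE  ON SCHEMA   curated_db.claims TO SHARE claims_share;
GRANT SELECT ON TABLE    curated_db.claims.claims TO SHARE claims_share;

-- Make the share visible to the consumer account (identifier is a placeholder)
ALTER SHARE claims_share ADD ACCOUNTS = myorg.eu_consumer;
```

The consumer then creates a read-only database from the share, so no data is copied or re-ingested.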
We ensured data security in the enterprise data lake by implementing the following measures:
- Network policy – Restricted access to Snowflake accounts by user-specific IP address, enabling whitelisting or blacklisting of IP address ranges
- Azure Private Link – Private connectivity ensuring access to Snowflake through a private IP address; traffic traveled over the Microsoft backbone and was allowed only from the customer's virtual network to the Snowflake VNet
- Data encryption – End-to-end encryption allowing only the customer and runtime components to read data. The data was not visible to third parties, including Snowflake’s cloud computing platform or other ISPs.
- Authentication and authorization – Snowflake’s out-of-the-box granular control was leveraged to access database objects
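The network-policy measure above maps to Snowflake's `CREATE NETWORK POLICY` statement; the policy name, CIDR ranges, and user below are placeholders:

```sql
-- Allow corporate ranges, block one known-bad address (values are placeholders)
CREATE NETWORK POLICY corp_access_policy
  ALLOWED_IP_LIST = ('198.51.100.0/24', '203.0.113.0/24')
  BLOCKED_IP_LIST = ('198.51.100.99');

-- Policies can be attached at the account level or per user
ALTER USER etl_user SET NETWORK_POLICY = corp_access_policy;
```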
The scalable solution democratized data access, improved performance, and enabled users to connect to raw source-system data and curate enterprise data without involving IT teams. It also delivered the following benefits:
- Lowered ingestion time by 60%
- Enabled $500K in annual savings
- Ensured availability of high-quality data across the organization for advanced analytics
- Ingested data from 41+ source systems, spanning 25,000+ tables and 20+ TB of data
- Supported 4 million queries per month
This is just one example of how Impetus has helped enterprises across industries achieve their cloud transformation goals. With accelerators for adoption and management and an experienced team, we can help you accelerate data lake creation (in days versus months), large-scale migration (in months versus years), and workload management on the cloud to reduce time-to-market and overall costs. We can also help you with the right technology choices, engineering, and implementation across cloud providers and domains. Contact us to learn more.
Disclaimer: Product and company names mentioned herein are trademarks of their respective companies.