Why Databricks Lakehouse was an obvious choice for ADP’s Data Platform Modernization

Data platform modernization is imperative for innovation and digital transformation across industries in today’s data-driven world. However, as data volume, velocity, and complexity increase, traditional data warehousing solutions often fail to store, manage, and process data from multiple sources at scale to meet the demands of advanced analytics.

Databricks Lakehouse is a popular choice for legacy platform modernization. It combines the best elements of a data warehouse and data lake and eliminates data silos to deliver reliability, strong governance, high performance, and flexibility. The Databricks Lakehouse Platform also supports business intelligence (BI) tools, enabling users to connect to the platform and quickly access, analyze, and visualize their data.

However, modernizing legacy platforms with zero business disruption is challenging for many enterprises. This blog details how ADP partnered with Impetus, the data platform modernization experts, to migrate one of their Oracle applications to their established Databricks Lakehouse.

Business needs that prompted modernization to Databricks

ADP, a global leader in human resources technology and services, was looking for a modernization partner to migrate its legacy platform, with more than 200TB of data, from Oracle to the cloud. They wanted a modern platform to integrate, process, and analyze data from different sources for reporting and advanced analytics while ensuring security, privacy, accuracy, and consistency.

Some of the challenges that prompted ADP to look for a cloud-based alternative were:

Data silos: With information for thousands of clients stored across multiple databases and schemas, data duplication was rampant because the databases were not kept in sync.

Costly scalability: The platform could not be scaled up quickly to onboard new customers, and vertical scaling was expensive.

Lack of real-time processing: Frequent data updates were not captured due to the lack of real-time processing capability, leading to delayed responses.

Why did ADP choose Impetus and Databricks as their modernization partners?

ADP chose Impetus for its deep cloud and data engineering expertise across platforms such as Databricks, AWS, and Azure, along with automated solutions for migrating legacy data platforms to the cloud. ADP evaluated and chose Databricks for the following reasons:

Single source of truth (SSOT): The Databricks Lakehouse eliminates the need for creating and syncing copies of data across multiple systems by unifying data access and storage in a single system, establishing the Lakehouse as the single source of truth (SSOT).

Scalability: Databricks’ cloud-based platform makes it easy to handle large datasets and perform complex analytics tasks.

Unified analytics: Databricks provides a unified platform that integrates various data processing and analytics tools, such as Apache Spark, Delta Lake, and MLflow.

Cost-effective: Databricks provides a consumption-based model with auto-scaling and auto-termination features that allow organizations to pay only for the resources they use and scale up dynamically.

Collaborative environment: Databricks provides a collaborative environment that allows teams to work together on data processing, analytics, and machine learning tasks.

How the Databricks solution architecture was developed

A multi-hop medallion architecture was designed on top of Delta Lake, an open-source storage layer that enables building a lakehouse on top of existing cloud object storage. Delta Lake also provides features such as ACID transactions, data caching and indexing, reliability, schema enforcement, and time travel.

The multi-hop medallion architecture progressively enriched data across the bronze, silver, and gold layers of Delta tables. The bronze layer held raw data with no transformations. In the silver layer, data was cleansed, joined, and transformed into normalized tables, ready for self-service analytics, ad-hoc reporting, and future machine learning and data science needs. In the gold layer, business data was aggregated, enriched, and denormalized.
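To make the layering concrete, the following is a minimal PySpark sketch of a bronze-to-silver-to-gold flow in a Databricks notebook. The table and column names (payroll_events, daily_client_activity, client_id, and so on) are hypothetical illustrations, not ADP’s actual schema:

```python
from pyspark.sql import functions as F

# Bronze: raw data landed as-is by the ingestion framework (hypothetical table name)
bronze_df = spark.table("bronze.payroll_events")

# Silver: cleansed, de-duplicated, and normalized
silver_df = (
    bronze_df
    .dropDuplicates(["event_id"])
    .filter(F.col("event_ts").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.payroll_events")

# Gold: aggregated, business-level view ready for BI consumption
gold_df = (
    silver_df
    .groupBy("client_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.daily_client_activity")
```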

Quick tips: improving the performance of the medallion architecture

Enable the Delta (disk) cache: Databricks supports both the disk (Delta) cache and Spark caching. When the disk cache is enabled, data fetched from remote storage is automatically added to the cache. To preload data into the cache, use the CACHE SELECT command (a brief sketch follows these tips).

Use the Photon runtime: Photon, the native vectorized query engine on Databricks, supports SQL and equivalent DataFrame operations against Delta and Parquet tables. It accelerates queries on large volumes of data, particularly queries with aggregations and joins, repeated reads of data from the disk cache, and scans on tables with many columns or many small files.

Use the query profile: Troubleshoot performance bottlenecks during query execution and collect metrics such as time spent, number of rows processed, and memory consumption.
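As a minimal sketch of the caching tip above, the disk cache can be enabled for the current cluster and frequently queried data preloaded with CACHE SELECT. The table and column names here are hypothetical:

```python
# Enable the Databricks disk (Delta) cache for the current cluster/session
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Preload frequently accessed columns into the disk cache (hypothetical gold table)
spark.sql("CACHE SELECT client_id, event_date, event_count FROM gold.daily_client_activity")
```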

The diagram below depicts the architecture built on top of Delta Lake:

Data ingestion on Databricks Delta tables

AWS Database Migration Service (DMS) was used to fetch historical data from sources like Oracle and MySQL and push it into landing-zone S3 buckets. To capture CDC data from Oracle and migrate it to S3, the team used Oracle GoldenGate. A metadata-driven ingestion framework checked for new files in S3 and loaded all the data – historical, CDC, and new data files – into the bronze layer of Databricks Delta tables.
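One way such an incremental load into the bronze layer could look is sketched below using Databricks Auto Loader; this is an illustrative assumption rather than ADP’s actual framework, and the S3 paths, file format, and table name are hypothetical:

```python
# Incrementally pick up new files landed in S3 by DMS/GoldenGate and append them
# to a bronze Delta table; Auto Loader tracks which files have already been processed.
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")    # format of the landed files (assumption)
    .load("s3://landing-zone/payroll/")        # hypothetical landing-zone path
)

(
    bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://landing-zone/_checkpoints/payroll/")
    .trigger(availableNow=True)                # process whatever is new, then stop
    .toTable("bronze.payroll_events")
)
```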

Data transformation and aggregation

After loading the data into Databricks Delta tables, Databricks notebook jobs were used to verify it. After verification, the team used Spark APIs to cleanse, transform, and aggregate the data. Delta Lake APIs, along with Delta Live Tables, were then used to write the transformed data to the silver and gold layers.

Once the aggregated data was available, the team scheduled Databricks jobs to make it ready for use cases such as reporting, BI consumption, and machine learning.
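A simplified Delta Live Tables sketch of this transformation flow is shown below. The table names, columns, and data-quality expectation are hypothetical and only illustrate the declarative pattern:

```python
import dlt
from pyspark.sql import functions as F

# Silver: cleanse and normalize bronze records (hypothetical table/column names)
@dlt.table(comment="Cleansed payroll events")
@dlt.expect_or_drop("valid_event", "event_id IS NOT NULL")
def silver_payroll_events():
    return (
        dlt.read("bronze_payroll_events")
        .dropDuplicates(["event_id"])
        .withColumn("event_date", F.to_date("event_ts"))
    )

# Gold: aggregate into a business-level table for reporting and BI
@dlt.table(comment="Daily activity per client")
def gold_daily_client_activity():
    return (
        dlt.read("silver_payroll_events")
        .groupBy("client_id", "event_date")
        .agg(F.count("*").alias("event_count"))
    )
```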

Databricks features that helped in the implementation

  • Automated cluster management: All-purpose and job clusters to run workloads as a set of commands in a notebook or as an automated job.
  • Interactive Databricks Notebooks: For quick prototyping, debugging, and experimenting with data processing pipelines.
  • Automated job scheduling: For scheduling and automating data processing jobs.
  • Delta Live Tables: Declarative framework for building reliable, maintainable, and testable data processing pipelines by managing task orchestration, cluster management, monitoring, data quality, and error handling.

Data analytics with Power BI on Databricks

Once in the gold layer, the data is ready for analytics consumption. Power BI is recommended for analytics as it provides rich, interactive visualizations and is supported through Databricks Partner Connect, which makes it easy to connect BI tools to a Databricks SQL warehouse.

In the future, ADP can leverage Databricks data governance and enforce permissions at the table and database (schema) levels with Unity Catalog. The diagram below shows how to achieve data security in the Databricks Lakehouse.
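For illustration, table-level access with Unity Catalog can be granted as in the minimal sketch below; the catalog, schema, table, and group names are hypothetical:

```python
# Grant read access on a gold table to a BI analysts group (names are hypothetical)
spark.sql("GRANT USE CATALOG ON CATALOG adp_lakehouse TO `bi-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA adp_lakehouse.gold TO `bi-analysts`")
spark.sql("GRANT SELECT ON TABLE adp_lakehouse.gold.daily_client_activity TO `bi-analysts`")
```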

Migration made easy with Databricks capabilities and Impetus partnership

Legacy platform modernization can be daunting. This collaboration between ADP, Impetus, and Databricks ensured seamless migration and operationalization of ADP’s legacy Oracle platform, improving data pipeline performance and enabling real-time data access for business users to increase productivity. Other benefits of the migration include:

  • Enhanced operational efficiency with a single source of truth enabled by Delta Lake
  • Compute cost reduction with data processing rationalization

Beginning of a data-driven future

Taking the modernization a step further, a self-service platform on Databricks has been planned to enable business users to perform ad-hoc analysis and bring their own data to analyze and mash up with the integrated data. To support this, Databricks offers SQL Serverless, which provides instant compute, requires minimal management, is cost-effective, and is compatible with multiple BI and SQL tools.

To facilitate data sharing from the Databricks Lakehouse with external customers, Databricks’ Delta Sharing capabilities can be explored, enabling the secure exchange of datasets across products and platforms. Shared data can be visualized using the Power BI Delta Sharing connector.
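A minimal sketch of how a share could be set up with Delta Sharing is shown below; the share, table, and recipient names are hypothetical:

```python
# Create a share, add a gold table to it, and grant it to a recipient (names are hypothetical)
spark.sql("CREATE SHARE IF NOT EXISTS adp_analytics_share")
spark.sql("ALTER SHARE adp_analytics_share ADD TABLE adp_lakehouse.gold.daily_client_activity")
spark.sql("CREATE RECIPIENT IF NOT EXISTS external_customer")
spark.sql("GRANT SELECT ON SHARE adp_analytics_share TO RECIPIENT external_customer")
```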

Client Quote

“Impetus has been a tremendous partner in the modernization journey for our Analytics product from Oracle to Databricks. Impetus partnership has been a key to the implementation of our future-proof architecture on Databricks in a short amount of time, they have been instrumental in partnering with us to operationalize real-time data ingestion and access for business users to increase productivity.”

— Zafrir Babin, Vice President, Product Development, ADP

Authors

John Ebenezer
Senior Director of Engineering, Impetus

Zafrir Babin
Vice President – Product Development, ADP

Vibhor Shukla
Senior Data Director – Data Science, ADP
