Why Databricks Lakehouse was an obvious choice for ADP’s Data Platform Modernization

Data platform modernization is imperative for innovation and digital transformation across industries in today’s data-driven world. However, as data volume, velocity, and complexity increase, traditional data warehousing solutions often fail to store, manage, and process data from multiple sources at scale to meet the demands of advanced analytics.

Databricks Lakehouse is a popular choice for legacy platform modernization. It combines the best elements of a data warehouse and data lake and eliminates data silos to deliver reliability, strong governance, high performance, and flexibility. The Databricks Lakehouse Platform also supports business intelligence (BI) tools, enabling users to connect to the platform and quickly access, analyze, and visualize their data.

However, modernizing legacy platforms with zero business disruption is challenging for many enterprises. This blog details how ADP partnered with Impetus, the data platform modernization experts, to migrate one of their Oracle applications to their established Databricks Lakehouse.

Business needs that prompted modernization to Databricks

ADP, a global leader in human resources technology and services, was looking for a modernization partner to migrate its legacy platform, with more than 200TB of data, from Oracle to the cloud. They wanted a modern platform to integrate, process, and analyze data from different sources for reporting and advanced analytics while ensuring security, privacy, accuracy, and consistency.

Some of the challenges that prompted ADP to look for a cloud-based alternative were:

Data silos: With information for thousands of clients stored across multiple databases and schemas, data duplication was rampant because the databases were not kept in sync.

Costly scalability: The platform could not be scaled up quickly to onboard new customers, and vertical scaling was expensive.

Lack of real-time processing: Frequent data updates were not captured due to the lack of real-time processing capability, leading to delayed responses.

Why did ADP choose Impetus and Databricks as their modernization partners?

ADP chose Impetus for its deep cloud and data engineering expertise across platforms such as Databricks, AWS, and Azure, along with automated solutions for migrating legacy data platforms to the cloud. ADP evaluated and chose Databricks for the following reasons:

Single source of truth (SSOT): The Databricks Lakehouse eliminates the need for creating and syncing copies of data across multiple systems by unifying data access and storage in a single system, establishing the Lakehouse as the single source of truth (SSOT).

Scalability: Databricks’ cloud-based platform makes it easy to handle large datasets and perform complex analytics tasks.

Unified analytics: Databricks provides a unified platform that integrates various data processing and analytics tools, such as Apache Spark, Delta Lake, and MLflow.

Cost-effective: Databricks provides a consumption-based model with auto-scaling and auto-termination features that allow organizations to pay only for the resources they use and scale up dynamically.

Collaborative environment: Databricks provides a collaborative environment that allows teams to work together on data processing, analytics, and machine learning tasks.

How the Databricks solution architecture was developed

A multi-hop medallion architecture was designed on top of Delta Lake, an open-source storage layer that enables building a lakehouse on top of existing cloud object storage. Delta Lake also provides features such as ACID transactions, data caching and indexing, reliability, schema enforcement, and time travel.

The multi-hop medallion architecture progressively enriched data across the bronze, silver, and gold layers of Delta tables. The bronze layer held raw data with no transformations. In the silver layer, data was cleansed, joined, and transformed into normalized tables, ready for self-service analytics, ad-hoc reporting, and future machine learning and data science needs. In the gold layer, business data was aggregated, enriched, and denormalized.
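To make the layering concrete, the following is a minimal PySpark sketch of a bronze-to-silver-to-gold flow in a Databricks notebook. The table and column names (payroll_events, daily_client_activity, client_id, and so on) are hypothetical illustrations, not ADP’s actual schema:

```python
from pyspark.sql import functions as F

# Bronze: raw data landed as-is by the ingestion framework (hypothetical table name)
bronze_df = spark.table("bronze.payroll_events")

# Silver: cleansed, de-duplicated, and normalized
silver_df = (
    bronze_df
    .dropDuplicates(["event_id"])
    .filter(F.col("event_ts").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.payroll_events")

# Gold: aggregated, business-level view ready for BI consumption
gold_df = (
    silver_df
    .groupBy("client_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.daily_client_activity")
```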

Quick tips: improving the performance of the medallion architecture

Enable the Delta (disk) cache: Databricks supports both the disk (Delta) cache and Spark caching. When the disk cache is enabled, data fetched from remote storage is automatically added to the cache. To preload data into the cache, use the CACHE SELECT command (a brief sketch follows these tips).

Use the Photon runtime: Photon, the native vectorized query engine on Databricks, supports SQL and equivalent DataFrame operations against Delta and Parquet tables. It accelerates queries on large volumes of data, particularly queries with aggregations and joins, repeated reads of data from the disk cache, and scans on tables with many columns or many small files.

Use the query profile: Troubleshoot performance bottlenecks during query execution and collect metrics such as time spent, number of rows processed, and memory consumption.
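As a minimal sketch of the caching tip above, the disk cache can be enabled for the current cluster and frequently queried data preloaded with CACHE SELECT. The table and column names here are hypothetical:

```python
# Enable the Databricks disk (Delta) cache for the current cluster/session
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Preload frequently accessed columns into the disk cache (hypothetical gold table)
spark.sql("CACHE SELECT client_id, event_date, event_count FROM gold.daily_client_activity")
```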

The diagram below depicts the architecture built on top of Delta Lake:

Data ingestion on Databricks Delta tables

AWS Database Migration Service (DMS) was used to fetch historical data from sources like Oracle and MySQL and push it into landing-zone S3 buckets. To capture CDC data from Oracle and migrate it to S3, the team used Oracle GoldenGate. A metadata-driven ingestion framework checked for new files in S3 and loaded all the data – historical, CDC, and new data files – into the bronze layer of Databricks Delta tables.
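One way such an incremental load into the bronze layer could look is sketched below using Databricks Auto Loader; this is an illustrative assumption rather than ADP’s actual framework, and the S3 paths, file format, and table name are hypothetical:

```python
# Incrementally pick up new files landed in S3 by DMS/GoldenGate and append them
# to a bronze Delta table; Auto Loader tracks which files have already been processed.
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")    # format of the landed files (assumption)
    .load("s3://landing-zone/payroll/")        # hypothetical landing-zone path
)

(
    bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://landing-zone/_checkpoints/payroll/")
    .trigger(availableNow=True)                # process whatever is new, then stop
    .toTable("bronze.payroll_events")
)
```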

Data transformation and aggregation

After loading the data into Databricks Delta tables, Databricks notebook jobs were used to verify it. After verification, the team used Spark APIs to cleanse, transform, and aggregate the data. Delta Lake APIs, along with Delta Live Tables, were then used to write the transformed data to the silver and gold layers.

Once the aggregated data was available, the team scheduled Databricks jobs to make it ready for use cases such as reporting, BI consumption, and machine learning.
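A simplified Delta Live Tables sketch of this transformation flow is shown below. The table names, columns, and data-quality expectation are hypothetical and only illustrate the declarative pattern:

```python
import dlt
from pyspark.sql import functions as F

# Silver: cleanse and normalize bronze records (hypothetical table/column names)
@dlt.table(comment="Cleansed payroll events")
@dlt.expect_or_drop("valid_event", "event_id IS NOT NULL")
def silver_payroll_events():
    return (
        dlt.read("bronze_payroll_events")
        .dropDuplicates(["event_id"])
        .withColumn("event_date", F.to_date("event_ts"))
    )

# Gold: aggregate into a business-level table for reporting and BI
@dlt.table(comment="Daily activity per client")
def gold_daily_client_activity():
    return (
        dlt.read("silver_payroll_events")
        .groupBy("client_id", "event_date")
        .agg(F.count("*").alias("event_count"))
    )
```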

Databricks features that helped in the implementation

  • Automated cluster management: All-purpose and job clusters to run workloads as a set of commands in a notebook or as an automated job.
  • Interactive Databricks Notebooks: For quick prototyping, debugging, and experimenting with data processing pipelines.
  • Automated job scheduling: For scheduling and automating data processing jobs.
  • Delta Live Tables: Declarative framework for building reliable, maintainable, and testable data processing pipelines by managing task orchestration, cluster management, monitoring, data quality, and error handling.

Data analytics with Power BI on Databricks

Once in the gold layer, the data is ready for analytics consumption. Power BI is recommended for analytics as it provides rich, interactive visualizations and is supported through Databricks Partner Connect, which makes it easy to connect BI tools to a Databricks SQL warehouse.

In the future, ADP can leverage Databricks data governance and enforce permissions at the table and database (schema) levels with Unity Catalog. The diagram below shows how to achieve data security in the Databricks Lakehouse.
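For illustration, table-level access with Unity Catalog can be granted as in the minimal sketch below; the catalog, schema, table, and group names are hypothetical:

```python
# Grant read access on a gold table to a BI analysts group (names are hypothetical)
spark.sql("GRANT USE CATALOG ON CATALOG adp_lakehouse TO `bi-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA adp_lakehouse.gold TO `bi-analysts`")
spark.sql("GRANT SELECT ON TABLE adp_lakehouse.gold.daily_client_activity TO `bi-analysts`")
```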

Migration made easy with Databricks capabilities and Impetus partnership

Legacy platform modernization can be daunting. This collaboration between ADP, Impetus, and Databricks ensured seamless migration and operationalization of ADP’s legacy Oracle platform, improving data pipeline performance and enabling real-time data access for business users to increase productivity. Other benefits of the migration include:

  • Enhanced operational efficiency with a single source of truth enabled by Delta Lake
  • Compute cost reduction with data processing rationalization

Beginning of a data-driven future

Taking the modernization a step further, a self-service platform on Databricks has been planned to enable business users to perform ad-hoc analysis and bring their own data to analyze and mash up with the integrated data. To support this, Databricks offers SQL Serverless, which provides instant compute, requires minimal management, is cost-effective, and is compatible with multiple BI and SQL tools.

To facilitate data sharing from the Databricks Lakehouse with external customers, Databricks’ Delta Sharing capabilities can be explored, enabling the secure exchange of datasets across products and platforms. Shared data can be visualized using the Power BI Delta Sharing connector.
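A minimal sketch of how a share could be set up with Delta Sharing is shown below; the share, table, and recipient names are hypothetical:

```python
# Create a share, add a gold table to it, and grant it to a recipient (names are hypothetical)
spark.sql("CREATE SHARE IF NOT EXISTS adp_analytics_share")
spark.sql("ALTER SHARE adp_analytics_share ADD TABLE adp_lakehouse.gold.daily_client_activity")
spark.sql("CREATE RECIPIENT IF NOT EXISTS external_customer")
spark.sql("GRANT SELECT ON SHARE adp_analytics_share TO RECIPIENT external_customer")
```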

Client Quote

“Impetus has been a tremendous partner in the modernization journey for our Analytics product from Oracle to Databricks. Impetus partnership has been a key to the implementation of our future-proof architecture on Databricks in a short amount of time, they have been instrumental in partnering with us to operationalize real-time data ingestion and access for business users to increase productivity.”

— Zafrir Babin, Vice President, Product Development, ADP

Authors

John Ebenezer
Senior Director of Engineering, Impetus

Zafrir Babin
Vice President – Product Development, ADP

Vibhor Shukla
Senior Data Director – Data Science, ADP
