Operationalized a highly available Azure-based data platform to migrate a legacy system

Challenges

A US-based healthcare software and services provider wanted to create an alternate Azure IaaS platform to replicate their existing data model from an on-premises SQL Server. Their existing SQL Server was struggling under large data volumes. It could not serve data in the cloud to support predictive analytics and machine learning (ML) for near-real-time use cases. Moreover, it could not easily scale to accommodate sudden spikes in processing demand.

To address these challenges, the client wanted an IaaS platform on Azure to help:

Reduce data storage, management, and operational costs
Improve scalability
Migrate historical and incremental data
Reduce the time taken to develop new data pipelines
Facilitate complex analytics
Enable BI

35% reduction in data platform operational cost across use cases

Solution

Responsible for the architecture, design, development, and DevOps of the platform, the Impetus team adopted a phased approach to create the IaaS platform on Azure.

The solution used Hortonworks Data Platform (HDP) on Azure with 1 Ambari node, 3 Master nodes, 8 Data nodes, 3 Edge nodes, 2 MySQL nodes, Azure File Share, and Azure Data Lake Storage (ADLS) Gen2 for data storage and archival with 48 TB HDFS space.

HDP was used to migrate around 10 TB of historical data from File Share to ADLS Gen2, while incremental data of about 2.5 GB per day was migrated from SQL Server to ADLS Gen2 using Apache Sqoop.
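The daily incremental load described above could be driven by a command along the following lines. This is a minimal sketch, not the team's actual job: the server, database, table, column, storage account, and path names are all hypothetical, and append mode on a monotonically increasing key column is an assumption about how new rows are detected. Only standard Sqoop import options are used.

```python
import shlex
from typing import List


def build_sqoop_incremental_import(
    jdbc_url: str,
    username: str,
    password_file: str,
    table: str,
    target_dir: str,
    check_column: str,
    last_value: str,
    num_mappers: int = 4,
) -> List[str]:
    """Return the argument list for a Sqoop append-mode incremental import.

    Only rows whose check_column is greater than last_value are imported,
    so each daily run picks up the rows added since the previous run.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", last_value,
        "--num-mappers", str(num_mappers),
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmd = build_sqoop_incremental_import(
        jdbc_url="jdbc:sqlserver://onprem-sql.example.com:1433;databaseName=claims_db",
        username="etl_user",
        password_file="/user/etl/.sqlserver.password",
        table="claims",
        target_dir="abfs://raw@examplelake.dfs.core.windows.net/claims/incremental",
        check_column="claim_id",
        last_value="1000000",
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument intact if it is later passed to a process launcher, and the last imported value would need to be persisted between runs (for example through a saved Sqoop job) for the next day's load to resume correctly.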

A diagrammatic representation of the solution is given below:

Highlights

Set up HDP and HDF on Azure virtual machines to operationalize the platform with the required tool stack in one framework
Used Apache NiFi and Apache Oozie for managing data pipeline workflows, which reduced overall development effort and cost
Used Apache Hive for data processing and Hive LLAP for data analytics and extraction. Hive's SQL-like query structure helped the team develop ETL scripts for around 170 tables. Hive also allowed the data to be managed in a schema similar to that of the on-premises SQL Server, minimizing the effort required to change, test, and deploy the affected Hive scripts.
Replaced the existing SAP BusinessObjects (BO) with Power BI for interactive reporting, which enabled the team to create a visually appealing analytical dashboard for client managers
Encrypted Azure Blob Storage and volumes with custom key encryption to safeguard protected health information (PHI) and support HIPAA compliance
Configured Power BI dashboards to provide access to specific workspaces within the organization
Used Azure ExpressRoute to extend the on-premises network to the cloud over a private connection, which enabled the team to query data in the cloud and synchronize it with on-premises data within 24 hours while adhering to the compliance guidelines
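With around 170 tables mirroring the SQL Server schema, Hive scripts of this kind lend themselves to generation from table metadata. The following is a minimal sketch of that idea under stated assumptions: the database, table, column, and storage names are hypothetical, the ORC storage format is an assumption (the case study does not name one), and the helper functions are illustrative rather than part of the delivered platform.

```python
from typing import Dict, List, Tuple

# (column name, Hive type) pairs, e.g. ("claim_id", "BIGINT")
Column = Tuple[str, str]


def hive_external_table_ddl(
    database: str, table: str, columns: List[Column], location: str
) -> str:
    """Render a CREATE EXTERNAL TABLE statement for one table."""
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{cols}\n"
        f")\n"
        f"STORED AS ORC\n"  # assumption: the source does not state the file format
        f"LOCATION '{location}'"
    )


def generate_all(
    database: str, base_location: str, tables: Dict[str, List[Column]]
) -> Dict[str, str]:
    """Render one DDL statement per table, each under its own directory."""
    base = base_location.rstrip("/")
    return {
        name: hive_external_table_ddl(database, name, cols, f"{base}/{name}")
        for name, cols in tables.items()
    }


if __name__ == "__main__":
    # Hypothetical metadata for two of the tables.
    schema = {
        "claims": [("claim_id", "BIGINT"), ("amount", "DECIMAL(12,2)")],
        "members": [("member_id", "BIGINT"), ("enrolled_on", "DATE")],
    }
    ddl = generate_all(
        "claims_db", "abfs://curated@examplelake.dfs.core.windows.net/", schema
    )
    for statement in ddl.values():
        print(statement + ";\n")
```

Keeping the column list in one metadata structure means a schema change on the SQL Server side is reflected by editing a single entry and regenerating the affected script.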

3x faster operationalization of new data pipelines

Impact

The solution enabled complex data analytics and ML data modeling on claims data in the cloud by exposing an API for efficient querying and retrieval. Because compute and storage could scale independently, the client reduced data platform operational costs by 35% across all use cases. The platform’s out-of-the-box processes and components also made operationalizing new data pipelines 3x faster.

Business Benefits

Operationalize Power BI reporting for clients
Meet the desired data availability SLA of 24 hours after the monthly transaction data is processed on SQL Server
Increase data availability to 99.9999999999% (12 9’s) over a given year
Reduce data storage and management cost by 50%
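To put the twelve-nines figure in concrete terms, it can be converted into a yearly unavailability budget. The sketch below assumes the percentage is read as a fraction of time over a 365-day year; the function name is illustrative. Exact fractions are used because subtracting 0.999999999999 from 1 in floating point introduces rounding error at this scale.

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def yearly_downtime_seconds(nines: int) -> Fraction:
    """Seconds per year left over at an availability of `nines` nines.

    For example, nines=3 means 99.9%, so the remaining fraction is 1/1000.
    """
    unavailability = Fraction(1, 10 ** nines)
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (3, 5, 12):
        seconds = float(yearly_downtime_seconds(n))
        print(f"{n} nines: {seconds:.6g} seconds per year")
```

At twelve nines the budget works out to roughly 31.5 microseconds per year, compared with about 8.76 hours at three nines.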

