Building serverless ETL pipelines on AWS

Technological advancements in the past decade have transformed the software development landscape significantly. Cloud services like Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) have led enterprises to sunset physical hardware and operating systems, respectively. Similarly, serverless computing has simplified deploying code into production.

Cloud providers that offer serverless computing, such as Amazon Web Services (AWS), run the servers and infrastructure required for computation, data storage, routing, event notification, and visualization in data applications. AWS offers a suite of fully managed services on a pay-as-you-use model for building and running business applications, and it takes over capacity planning, scaling, and maintenance.

This blog post describes how a leading digital solutions enterprise migrated from legacy data processing pipelines to a high-performance, scalable cloud solution to minimize pipeline cost and data ingestion time.

The organization, which provides data-driven insights to personalize customer experiences, was facing many challenges with legacy data processing pipelines:

  • A monolithic architecture, which restricted multi-tenancy support
  • Manual triggers to poll raw data from the FTP server
  • Manual intervention for data fallouts and report generation
  • Tightly coupled architecture, which impacted flexibility and reusability

They were looking for a scalable, multi-tenant, performant, flexible, and fault-tolerant solution.

To migrate the legacy pipelines, Impetus Technologies Inc. proposed an event-driven, serverless ETL pipeline built on AWS serverless services.

The solution provides:

  • Data ingestion support from the FTP server using AWS Lambda, CloudWatch Events, and SQS
  • Data processing using AWS Glue (crawler and ETL job)
  • Failure email notifications using SNS
  • Data storage on Amazon S3

Here are some details about the application architecture on AWS.

Serverless application architecture built on AWS

1) Data ingestion

The enterprise wanted a cloud-based solution that would poll raw data files of different types and frequencies from multiple FTP locations, based on pre-defined configurations. The solution needed to check files at the FTP server so that only configured files are polled, and to validate data size and row count before uploading to S3. It also needed to send an email notification on any failure.

Solution:

  • Using a CloudWatch Events rule with Lambda to poll the FTP servers
  • Configuring CloudWatch Events to invoke the Lambda function hourly to check for new data files on the FTP server
  • Storing the FTP server configuration and raw data file details in a DynamoDB table, which the Lambda function reads
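
The file-selection step of the polling function above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `FeedConfig` fields, the glob-style `file_pattern`, and the function names are all assumptions, and the sketch presumes the configuration item has already been read from DynamoDB and that `ftp` behaves like an open `ftplib.FTP` connection.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class FeedConfig:
    """Hypothetical shape of one configuration item stored in DynamoDB."""
    tenant: str
    host: str
    remote_dir: str
    file_pattern: str  # glob pattern (assumed), e.g. "orders_*.csv.gz"


def select_new_files(listing, config, already_ingested):
    """Return names that match the feed's pattern and were not ingested before."""
    return sorted(
        name
        for name in listing
        if fnmatchcase(name, config.file_pattern) and name not in already_ingested
    )


def poll_feed(ftp, config, already_ingested):
    """List the feed's directory on an ftplib.FTP-like connection and pick new files."""
    ftp.cwd(config.remote_dir)
    return select_new_files(ftp.nlst(), config, already_ingested)
```

Keeping the selection logic in a pure function makes it easy to test without an FTP server; an hourly scheduled Lambda handler would only need to open the connection, call `poll_feed`, and hand the resulting names to the processing step.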

2) Data validation

Once the data is ingested, AWS Lambda is used to uncompress, decrypt, and validate raw data files before uploading them to S3. The fault-tolerant solution can reprocess failed events and uses an SQS dead-letter queue (DLQ) to track these failures.
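
As a rough sketch of the validation step, the function below decompresses a file and checks its size and row count before it would be uploaded to S3. Several things here are assumptions made for illustration: the files are taken to be gzip-compressed CSV with a header row, the `min_rows` and `max_bytes` thresholds are hypothetical configuration values, and decryption is left out because the post does not say which scheme is used.

```python
import csv
import gzip
import io
import zlib


class ValidationError(Exception):
    """Raised when a raw file fails a pre-upload check."""


def validate_raw_file(compressed, min_rows, max_bytes):
    """Decompress a gzip'd CSV payload and check its size and data row count.

    Returns (decompressed_bytes, row_count) or raises ValidationError.
    """
    try:
        data = gzip.decompress(compressed)
    except (OSError, EOFError, zlib.error) as exc:
        raise ValidationError(f"not a valid gzip payload: {exc}") from exc

    if len(data) > max_bytes:
        raise ValidationError(f"file too large: {len(data)} bytes")

    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    if next(reader, None) is None:  # first row is treated as the header
        raise ValidationError("file is empty")

    rows = sum(1 for _ in reader)
    if rows < min_rows:
        raise ValidationError(f"expected at least {min_rows} rows, found {rows}")
    return data, rows
```

In a Lambda function, letting `ValidationError` propagate would mark the invocation as failed, which is what allows a configured dead-letter queue to capture the event for reprocessing.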

The solution tracks pre-configured SLAs to flag delayed FTP files and sends email notifications through SNS on event success or failure. DynamoDB is used to track raw file metadata, SLA information, and the configurations used to ingest data into the system. Two Lambda functions decouple data polling from processing to support multi-tenancy and to work around the Lambda execution timeout of 15 minutes.
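
One way to picture the SLA tracking described above is a check that compares each feed's deadline with its recorded arrival time. The sketch below is an assumption about how such a check could look; the dictionary shapes and function names are hypothetical, and in the real system the deadlines and arrival times would come from the DynamoDB metadata.

```python
from datetime import datetime, timezone


def find_sla_breaches(deadlines, arrivals, now):
    """Return the names of feeds whose file is late.

    deadlines: feed name -> datetime by which the file is expected
    arrivals:  feed name -> datetime the file was ingested (absent if not seen yet)
    now:       the current time, passed in so the check is easy to test
    """
    late = []
    for feed, deadline in deadlines.items():
        arrived = arrivals.get(feed)
        if arrived is None:
            if now > deadline:  # still missing after the deadline
                late.append(feed)
        elif arrived > deadline:  # arrived, but after the deadline
            late.append(feed)
    return sorted(late)


def build_alert(late_feeds):
    """Format the subject and message of a delay notification."""
    return {
        "Subject": f"SLA breach: {len(late_feeds)} feed(s) delayed",
        "Message": "\n".join(late_feeds),
    }
```

The dictionary returned by `build_alert` uses the `Subject` and `Message` parameter names of the SNS `publish` call, so with a boto3 SNS client it could be sent as `sns.publish(TopicArn=topic_arn, **build_alert(late))`, where `topic_arn` is a placeholder for the notification topic.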

3) Data discovery and transformation

In line with data ingestion requirements, the pipeline crawls the data, automatically identifies the table schema, and creates tables with metadata for downstream data transformation. The ETL job performs operations such as data filtering, validation, enrichment, and compression, and then stores the data at an S3 location in Parquet format for visualization.

On a configured schedule, the AWS Glue crawler infers the schema and populates the AWS Glue Data Catalog with tables, including their columns and partitions. CloudWatch Events and Lambda trigger AWS Glue ETL jobs to perform various data transformations. The AWS Glue Data Catalog is compatible with the Apache Hive Metastore and integrates directly with Amazon EMR and Amazon Athena for ad hoc data analysis queries.
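
The Lambda function that triggers a Glue ETL job can be quite small. The sketch below shows one possible shape, assuming the job name and S3 paths are supplied through environment variables; the variable names and the `--source_path` and `--target_path` job arguments are hypothetical. Only `start_job_run` is a real boto3 Glue client call, and the client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import os


def start_etl_job(glue_client, job_name, source_path, target_path):
    """Start a Glue job run and return its run ID.

    Glue passes custom job arguments as "--name": "value" pairs;
    the argument names used here are illustrative.
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--source_path": source_path, "--target_path": target_path},
    )
    return response["JobRunId"]


def lambda_handler(event, context):
    """Entry point invoked by a CloudWatch Events rule (sketch)."""
    import boto3  # provided by the AWS Lambda Python runtime

    run_id = start_etl_job(
        boto3.client("glue"),
        os.environ["GLUE_JOB_NAME"],  # hypothetical variable names
        os.environ["SOURCE_PATH"],
        os.environ["TARGET_PATH"],
    )
    return {"jobRunId": run_id}
```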

This serverless architecture enabled parallel development and reduced deployment time significantly, helping the enterprise achieve multi-tenancy and reduce execution time for processing raw data by 50%. In addition, real-time email notifications enabled them to take timely action in the event of any failure.

An AWS Advanced Consulting Partner, Impetus Technologies has helped several Fortune 100 enterprises achieve their cloud transformation goals. With accelerators for adoption and management, and an experienced team, we can help you accelerate data lake creation (in days versus months), large-scale migration (in months versus years), and management of workloads on the cloud to reduce time-to-market and overall costs. We can also help you with the right technology choices, engineering, and implementation across cloud providers and domains. Contact us to learn more.

Author
Mukesh Kumar Kulmi
Data Engineer