Streaming Data ETL with Impetus StreamAnalytix and Syncsort DMX – Guest Blog

Today we are announcing a partnership between Syncsort and Impetus Technologies, and with it an integration of batch processing and real-time stream processing that we call “Streaming ETL”. This mix of batch and real-time processing is also known as the Lambda Architecture. Streaming ETL combines the best batch and streaming technologies under an umbrella of tools that abstract away the complexity of the underlying platforms.

The huge increase in the types and sources of data has put pressure on companies to blend and summarize that data quickly to create actionable information. A combination of real-time and batch processing is needed to meet these new demands.

There’s a grab bag of technologies that excel in specific aspects: MapReduce, Storm and Spark for massively parallel processing; Kafka and Spark Streaming along with traditional messaging and queuing software for real-time data movement; Mesos and YARN for cluster management. These components can be mixed and matched, but there are many APIs to learn and different skill sets needed to leverage them well.

Figure: StreamAnalytix definition with Syncsort DMX bolt

Figure: DMX task that performs lookup for IP

Syncsort and Impetus Technologies abstract away the complexity

Impetus Technologies developed StreamAnalytix to ease the process of building, deploying and monitoring real-time streaming applications. StreamAnalytix provides an abstraction over the complex technologies used in platforms like Storm and Kafka.

Syncsort DMX-h gives users an easy way to specify data transformations and data enrichment, and provides an extremely efficient, high-performance runtime engine. We have made this engine available for use in StreamAnalytix streaming pipelines. The DMX engine is called whenever the user needs to transform or enrich the data.
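To make the idea of calling a transformation engine from inside a stream concrete, here is a minimal sketch in plain Python. It is not the DMX or StreamAnalytix API: the function names, record fields and lookup values are all hypothetical, and the "engine" is just a list of ordinary functions applied to each record as it arrives.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Transform = Callable[[Record], Record]


def run_pipeline(source: Iterable[Record], steps: List[Transform]) -> Iterator[Record]:
    """Apply each transformation step, in order, to every record as it arrives."""
    for record in source:
        for step in steps:
            record = step(record)
        yield record


def normalize_status(record: Record) -> Record:
    # Hypothetical transformation step: trim and upper-case one field.
    return {**record, "status": record["status"].strip().upper()}


def add_region(record: Record) -> Record:
    # Hypothetical enrichment step: derive a region from a site code via a lookup.
    regions = {"NYC": "us-east", "SFO": "us-west"}
    return {**record, "region": regions.get(record["site"], "unknown")}


# Illustrative input only; a real stream would come from a message queue.
events = [{"site": "NYC", "status": " ok "}, {"site": "LHR", "status": "fail"}]
results = list(run_pipeline(events, [normalize_status, add_region]))
```

Each step takes a record and returns a new one, so steps can be added to or removed from a stream without touching the others.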

Common use cases for Streaming ETL

  • A clickstream analyst requires an IP address in a web log to be converted to a decimal or hexadecimal value for downstream processing
  • A mobile app sends a product identifier used on a web retail site, and that identifier has to be transformed to the product ID used in the Enterprise Data Warehouse using a lookup
  • A hardware manufacturer sends real-time test data from many sites, which needs to be correlated with historical data
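The first two use cases above come down to small per-record transformations. As a rough illustration only, and not how a DMX task is actually defined, they might look like this in Python. The lookup table, its keys and the function names are invented for the example; only the standard-library `ipaddress` module is real.

```python
import ipaddress
from typing import Optional

# Hypothetical lookup table: mobile-app product identifier -> warehouse product ID.
PRODUCT_ID_LOOKUP = {"app-sku-001": 10452, "app-sku-002": 10977}


def ip_to_decimal(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def ip_to_hex(ip: str) -> str:
    """Convert a dotted-quad IPv4 address to an 8-digit hexadecimal string."""
    return f"{ip_to_decimal(ip):08X}"


def to_warehouse_id(app_product_id: str) -> Optional[int]:
    """Translate an app identifier to the warehouse ID, or None if unknown."""
    return PRODUCT_ID_LOOKUP.get(app_product_id)


print(ip_to_decimal("192.168.1.10"))   # 3232235786
print(ip_to_hex("192.168.1.10"))       # C0A8010A
print(to_warehouse_id("app-sku-001"))  # 10452
```

In a real deployment the lookup table would be loaded from the warehouse or a reference file rather than hard-coded.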

On the Syncsort side, we made only minor adjustments: enabling the DMX runtime engine to allocate less memory than it does for large batch operations, and exposing a Java interface to integrate with StreamAnalytix. The Impetus engineering team quickly added a component to their excellent Web UI that lets users add one or more DMX transformations to any stream. The teams worked together to publish and interpret metadata from the two products. In the coming months, we will be working to add features and enhance the integration.

I will be producing a screencast of this new Streaming ETL software, and both websites will have more information about this exciting joint solution. We look forward to hearing from you with any questions and feedback.

Gregory Grubbs
Product Manager, Cloud and Big Data Partners - Syncsort