19 Mar 2020

Modernize your ETL processes with StreamAnalytix

Businesses are struggling to solve complex business problems with huge volumes of data while relying on their legacy data platform infrastructure. Traditional ETL tools that were designed two decades ago are not equipped to solve the business problems of 2020.

Challenges with traditional ETL Tools:

  • Costly
  • Non-flexible
  • Non-scalable
  • Built for on-premise
  • Cannot transform before landing

To address these challenges, enterprises are looking to transform their ETL workloads from legacy data warehouses to the cloud. The trend has gathered momentum in recent times. According to Forbes, 80% of the data warehouse tools used by organizations are now cloud-based versus on-premise.

Modern ETL tools have evolved as an obvious choice as they come packed with features to extract value from huge datasets. These tools offer the following advantages:

  • Performant
  • Resilient
  • Scalable
  • Intuitive
  • Secure and compliant
  • End-to-end data processing and analytics

How does StreamAnalytix fit the bill?

StreamAnalytix is a self-service ETL and analytics tool. The platform lets you easily create batch and streaming ETL pipelines using drag-and-drop operators on a visual IDE. StreamAnalytix has a wide array of built-in operators for data sources, transformations, machine learning, and data sinks.

StreamAnalytix is the most advanced ETL tool to run your workloads on a distributed cloud environment with support for a wide variety of cloud-native operators.

Migrating traditional ETL jobs to StreamAnalytix

StreamAnalytix not only provides the most cutting-edge environment to run your migrated ETL jobs but also helps in migrating ETL jobs in three steps:

Step 1 - Assessment

Assessment is an important step that produces the following:

  • Highlighted differences between the source and target systems
  • Examples of source system logic recreated in the target system
  • Stakeholder sign-off

Step 2 - Conversion

Conversion involves moving the existing ETL logic to the target system. Traditional ETL workloads are transformed into Spark-based distributed workflows to be executed on StreamAnalytix.

Step 3 - Validation

Ensuring a successful migration is crucial for business continuity. The tool’s validation capability ensures all the existing workloads are successfully migrated, and there are no gaps in logic that can result in loss of data when jobs are executed in the new environment.
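
To make the idea of validation concrete, here is a minimal sketch in plain Python that compares the output of a legacy job with the output of its migrated counterpart using row counts and an order-independent checksum. This is only an illustration under our own assumptions: the function names table_fingerprint and validate_migration are hypothetical, and this is not the validation logic built into StreamAnalytix.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    The checksum is a sum of per-row hashes, so it does not depend on
    the order in which rows are produced (hypothetical helper).
    """
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        # Add the first 8 bytes of each row hash, wrapping at 64 bits.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % 2**64
    return count, checksum


def validate_migration(legacy_rows, migrated_rows):
    """Compare legacy and migrated job outputs and report any gaps."""
    legacy_count, legacy_sum = table_fingerprint(legacy_rows)
    migrated_count, migrated_sum = table_fingerprint(migrated_rows)
    return {
        "row_count_match": legacy_count == migrated_count,
        "checksum_match": legacy_sum == migrated_sum,
    }


if __name__ == "__main__":
    legacy = [(1, "a", 10.0), (2, "b", 20.5)]
    migrated = [(2, "b", 20.5), (1, "a", 10.0)]  # same rows, different order
    print(validate_migration(legacy, migrated))
```

A mismatch in either check points to a gap in the converted logic that should be investigated before the legacy job is retired.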

Benefits of using StreamAnalytix for migration

  1. Drastically reduced migration efforts
  2. Increase in developer productivity
  3. Automated validation
  4. One-to-one mapping of existing workflows

To know more about how to migrate from traditional ETL workloads to distributed Spark-based workflows in StreamAnalytix, write to us at

Saurabh Dutta
Senior Solutions Architect - StreamAnalytix
30 Mar 2016

StreamAnalytix Releases 2.0 with support for Apache Spark Streaming

We feel proud to release StreamAnalytix 2.0 – the industry’s only multi-engine RTSA platform!

With a proven product based on Apache Storm, StreamAnalytix has taken a big step forward with the release of the product version 2.0 that also supports Apache Spark Streaming.

Why Multi-Engine?

Real-time analytics use-cases today are best served by different stream processing paradigms. Some use-cases require low-latency, stateless, distributed processing of time-series data in motion, fed by a reliable and/or durable data source (like Apache Kafka); Apache Storm addresses these well. For other use-cases that involve stateful, reliable, micro-batched, complex processing, Apache Spark Streaming is the best fit.
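
The difference between the two paradigms can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Apache Storm or Apache Spark Streaming API; the event fields and the names process_stateless and MicroBatchCounter are hypothetical.

```python
from collections import defaultdict


def process_stateless(event):
    """Per-event, stateless step: each event is handled on its own,
    with no memory of earlier events."""
    celsius = round((event["fahrenheit"] - 32) * 5 / 9, 2)
    return {"sensor": event["sensor"], "celsius": celsius}


class MicroBatchCounter:
    """Stateful micro-batch step: events arrive in small batches and
    running totals are kept across batches."""

    def __init__(self):
        self.totals = defaultdict(int)

    def process_batch(self, batch):
        for event in batch:
            self.totals[event["sensor"]] += 1
        # Return a snapshot of the state after this batch.
        return dict(self.totals)


if __name__ == "__main__":
    print(process_stateless({"sensor": "a", "fahrenheit": 212}))
    counter = MicroBatchCounter()
    print(counter.process_batch([{"sensor": "a"}, {"sensor": "b"}]))
    print(counter.process_batch([{"sensor": "a"}]))
```

The stateless function can be applied to each event the moment it arrives, while the counter trades a little latency for state that survives from one batch to the next.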

Clearly, no single approach to real-time analytics delivers the best performance, manageability, and time-to-market for every use-case. What works for one may not work best for another!

StreamAnalytix 2.0 simplifies the trade-off by integrating multiple engines in a single platform. You can now run your applications on a stream processing engine of choice and not compulsion, depending on the use-case requirements, and without worrying about the underlying technology. Thus, we provide a new level of “best-of-breed” flexibility in your enterprise real-time architecture.

We’ve something to offer for everyone in this release:

As a Developer, you get a wide variety of built-in sources and sinks including TIBCO, ActiveMQ, IBM MQ, Amazon Kinesis and S3, and can extend the list with reusable custom operators. With features such as Sub-system Integration, you can easily interconnect multiple sub-systems that individually use different streaming engines, and the Pipeline Versioning feature allows you to version a sub-system and roll back to a previous version at any time.

Data Scientists can increase their efficiency by using drag-and-drop operators for Predictive Analytics, MLlib, SparkSQL, Spark Data Transformation and a rich library of data processing functions. They can create models in the UI, test the model output visually, and refine it by blending streaming data with static data – very easily, without any coding.

For IT Admins, we’ve made some core improvements in areas like Management, Monitoring and Configuration. Enhanced multi-tenancy controls now come with the ability to restrict resources for specific tenants and sub-systems.

Last but not least, Business users can analyze streaming data with improved built-in Real-time Dashboards pre-configured with advanced charts and graphs. With StreamAnalytix 2.0, we inch closer to our goal to be a ‘zero-code’ platform and lead the wave of ‘build applications with clicks, not code’.

To learn more, watch the Demo Video, download the Datasheet, or try StreamAnalytix 2.0.

Saurabh Dutta
Senior Solutions Architect - StreamAnalytix
13 Dec 2018

How modern data science is transforming anomaly detection

Real-time anomaly detection has applications across industries. From network traffic management to predictive healthcare and energy monitoring, detecting anomalous patterns in real-time is helping businesses derive actionable insights in multiple sectors.

However, as data complexity increases, modern data science is simplifying and streamlining traditional approaches to anomaly detection.

How can today’s enterprises ride the modern data science wave to effectively address the evolving challenges of real-time anomaly detection? And what are the key differentiators businesses must look for, to identify a platform that meets their needs?

Let’s explore how modern data science is transforming anomaly detection as we know it.

Anomaly detection and data science

Saurabh Dutta
Senior Solutions Architect - StreamAnalytix
07 Jan 2019

Why Apache Spark is the right way to get a real-time customer 360 view for your business

A survey by Bain & Co. reveals that more than 89% of organizations believe that customer service plays a critical role in staying ahead of the competition. The key to transforming customer experience is having a consistent, complete, and real-time view of customers across systems and channels.

As customers interact with businesses from multiple devices and platforms, companies have huge amounts of data available from various sources such as website analysis, search results, engagement applications, and CRM systems. Customers expect immediate responses to their needs with real-time relevance at every point of engagement.

One of the biggest challenges that any organization faces is having a unified view of their customers to understand what they want, at the right moment. While huge amounts of data flow into the system from multiple sources, often, the data is in silos, making it difficult to stitch it all together to create a complete picture of the customer.

What is Customer 360?

Customer 360 is a strategic approach to enable businesses to identify actionable insights from multichannel data to offer the best customer experience across all channels. By having a unified view of all customer touchpoints, customer 360 tracks the journey and experience of a customer with a business to stitch an end-to-end picture.

Having a real-time 360-degree view of the customer can help businesses to:

  • Personalize the customer experience
  • Deliver the right services at the right time
  • Predict customer behavior
  • Target new customers
  • Retain customers

Research shows that 25% of customers will defect after one bad experience. Customers accustomed to the personalization and ease of dealing with digital natives such as Google and Amazon now expect the same kind of service from established players.

While customers’ expectation of end-to-end satisfaction is valid, various factors get in the way of a ‘wow’ experience. Some of them are:

  • Fragmented systems with no true single unified view
  • Processing workloads in batches and not in real-time
  • Scalability of systems to accommodate and process extensive data
  • Minimal application of machine learning to Customer 360
  • In-house talent still centered around traditional data warehouse

To address these challenges, Apache Spark is becoming the de facto engine. It can help businesses build an accurate customer 360 view and deliver compelling experiences now.

How can Apache Spark help with Customer 360?

With its ability to handle end-to-end needs for data processing, analytics, and machine learning workloads, Apache Spark has the following capabilities that make it the right candidate to get a real-time customer 360 view for your business:

  • Provides a solid, always-on unified view of the past, present, and future
  • Capable of predictive and prescriptive modeling based on all customer signals including text and NLP
  • Accurate and trustworthy
  • Goes beyond data integration, offers complete information integration
  • Always ON system
  • Provides a view that is recent, comprehensive, relevant, and sensitive to privacy concerns

To learn more about how Apache Spark-based architecture addresses the data challenges of real-time customer 360: Watch the webinar - Transforming Real-time Customer 360 with Apache Spark

18 Mar 2019

Real-time analysis of weather impact on New York City taxi trips in minutes using StreamAnalytix

In this post, we will see how easy it is to read data from a streaming source, apply data transformations, enrich data with external data sources and create real-time alerts in minutes with StreamAnalytix.

We will use the drag-and-drop interface and self-service features of StreamAnalytix to build a streaming pipeline (Image 1) to analyze the impact of weather conditions on New York City taxi trips. This pipeline can be accessed and run on StreamAnalytix Lite, a free-to-download, single-node version of the StreamAnalytix enterprise edition.

We will analyze two aspects: the impact of weather conditions on the taxi trip (time taken to pick up and drop off the rider in correlation to distance traveled), and the mode used to make payments (cash or card), to create alerts for cash payments beyond a set threshold.

Image 1

Step 1: Read data from source

Read data from Data Generator, a streaming data source.

Once you drag and drop Data Generator onto the canvas, right-click the operator to configure it. The configuration window will appear (Image 2).

  • Click Upload File to upload the data file containing the following data points for New York City taxi trips:
    • Pick-up time and location
    • Drop-off time and location
    • Number of passengers in the cab
    • Fare of the cab ride
    • Trip distance
  • Once the file is uploaded, click Next

Image 2

Step 2: Identify data schema

A schema identification window will appear (Image 3), driven by the auto-schema detection feature built into the StreamAnalytix platform.

Click Next to save this schema.

The schema derived by the auto-detection feature of StreamAnalytix can be edited to set the desired data types.

Image 3
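
For readers curious about what schema auto-detection involves in principle, here is a minimal sketch in plain Python that samples a CSV file and picks the narrowest type that fits each column. It is an assumption-laden illustration, not how StreamAnalytix implements the feature; the column names, the timestamp format, and the functions infer_type and infer_schema are all hypothetical.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Pick the narrowest type that fits every non-empty sample value."""
    samples = [v for v in values if v != ""]
    if not samples:
        return "string"

    def all_parse(parser):
        for value in samples:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    # Assumed timestamp layout; real data may need other formats.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S")):
        return "timestamp"
    return "string"


def infer_schema(csv_text, sample_size=100):
    """Infer a {column: type} mapping from the first rows of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_size), reader)]
    return {
        name: infer_type([row[name] or "" for row in rows])
        for name in reader.fieldnames
    }


if __name__ == "__main__":
    sample = (
        "pickup_datetime,passenger_count,trip_distance,payment_type\n"
        "2016-01-01 00:05:00,2,1.5,cash\n"
        "2016-01-01 00:10:00,1,3,card\n"
    )
    print(infer_schema(sample))
```

Because inference works from a sample, a detected type can still be wrong for later rows, which is why being able to edit the detected schema matters.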

Step 3: Apply data transformations

As you save the data schema, the data inspect window will appear below the pipeline canvas (Image 4). Use the Inspect Display window to apply pre-processing transformations to the data and alter it as required.

In this pipeline, three transformations have been applied:

  • Filter
  • Rename
  • Date transformation
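
In code terms, the three transformations above amount to something like the following plain-Python sketch. The post does not say which filter condition or column names the pipeline uses, so the rule (drop trips without a positive distance) and the names tpep_pickup_datetime, pickup_time and pickup_date are assumptions made for illustration.

```python
from datetime import datetime


def transform_trips(trips):
    """Apply a filter, a rename, and a date transformation to raw trip records."""
    cleaned = []
    for trip in trips:
        distance = float(trip["trip_distance"])
        # Filter: skip records without a positive trip distance (assumed rule).
        if distance <= 0:
            continue
        pickup = datetime.strptime(trip["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        cleaned.append({
            # Rename: expose the raw column under a shorter name.
            "pickup_time": trip["tpep_pickup_datetime"],
            # Date transformation: derive a date-only column.
            "pickup_date": pickup.date().isoformat(),
            "trip_distance": distance,
        })
    return cleaned


if __name__ == "__main__":
    raw = [
        {"tpep_pickup_datetime": "2016-01-01 00:05:00", "trip_distance": "1.5"},
        {"tpep_pickup_datetime": "2016-01-01 00:09:00", "trip_distance": "0"},
    ]
    print(transform_trips(raw))
```

A date-only column of this kind is convenient later on, when each trip needs to be matched with daily weather records.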

Image 4

Step 4: Enrich taxi trips data with weather conditions data

After applying the transformations, follow these steps:

  1. Import weather conditions data into the pipeline
  2. Join the data with the rest of the pipeline using Spark SQL (StreamAnalytix allows you to write your SQL queries in-line in the operator to join data sets).
  3. Persist the data using a File Writer.

Image 5

4. Right-click the ‘Spark SQL’ operator; a configuration window will appear (Image 6). Here you will see that the ‘Weather Conditions Data’ is joined with the ‘Date’ of each taxi trip.

Image 6

5. Click Next.

The inspect display window will appear (Image 7) displaying weather conditions data (like min and max temperature, precipitation, wind, snow and more) corresponding to each taxi trip.

Image 7
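
The kind of join performed in this step can be sketched as follows. To keep the example runnable on any desktop, it uses Python's built-in sqlite3 module as a stand-in for Spark SQL; the query in the actual pipeline is not shown in this post, so the table names, column names, and the choice of a LEFT JOIN are assumptions.

```python
import sqlite3


def join_trips_with_weather(trips, weather):
    """Attach daily weather readings to each trip by matching on the date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips (trip_id INTEGER, pickup_date TEXT, trip_distance REAL)"
    )
    conn.execute(
        "CREATE TABLE weather (date TEXT, min_temp REAL, max_temp REAL, precipitation REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", trips)
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", weather)
    # LEFT JOIN keeps trips even when no weather record exists for that date.
    rows = conn.execute(
        """
        SELECT t.trip_id, t.pickup_date, t.trip_distance,
               w.min_temp, w.max_temp, w.precipitation
        FROM trips t
        LEFT JOIN weather w ON t.pickup_date = w.date
        ORDER BY t.trip_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    trips = [(1, "2016-01-01", 2.5), (2, "2016-01-02", 1.0)]
    weather = [("2016-01-01", -1.0, 5.0, 0.2)]
    for row in join_trips_with_weather(trips, weather):
        print(row)
```

Trips on dates with no weather record come back with empty weather columns, which makes gaps in the enrichment data easy to spot.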

Step 5: Process cab fare data for payment method used

To count the total number of payments made by card and cash, apply the aggregator processor ‘Payment Type by Count’.

Image 8

Right-click Payment Type by Count.

The configuration window will appear (Image 9).

Configure the processor to:

  1. Count payments by different methods
  2. Fix a relevant time window for the aggregator processor
  3. Watermark the pick-up date and time
  4. Group results by ‘Vendor ID’ and ‘Rate Code ID’

Image 9
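
The four settings above can be read as a windowed, grouped count. The plain-Python sketch below shows the idea with fixed (tumbling) windows and a simplified watermark that drops events arriving too far behind the latest pick-up time seen. The window length, lateness allowance, and field names are assumptions, and real engines apply more nuanced watermark rules than this.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def windowed_payment_counts(events, window=timedelta(minutes=10),
                            lateness=timedelta(minutes=5)):
    """Count payments per (window start, vendor, rate code, payment type).

    Events are processed in arrival order. An event is dropped when its
    pick-up time is older than the watermark, i.e. the latest pick-up
    time seen so far minus the allowed lateness.
    """
    counts = Counter()
    max_seen = None
    for event in events:
        ts = event["pickup_time"]
        if max_seen is None or ts > max_seen:
            max_seen = ts
        if ts < max_seen - lateness:
            continue  # too late: behind the watermark
        # Align the timestamp to the start of its fixed-size window.
        window_start = ts - ((ts - EPOCH) % window)
        key = (window_start, event["vendor_id"], event["rate_code_id"],
               event["payment_type"])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    def trip(minute, payment):
        return {"pickup_time": datetime(2016, 1, 1, 0, minute),
                "vendor_id": 1, "rate_code_id": 1, "payment_type": payment}

    arrivals = [trip(1, "cash"), trip(4, "cash"), trip(12, "card"), trip(2, "cash")]
    for key, count in sorted(windowed_payment_counts(arrivals).items()):
        print(key, count)
```

In the sample run, the last event arrives after a much later pick-up time has already been seen, so it falls behind the watermark and is not counted.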

Step 6: Create real-time alert

Drag and drop the Alert processor to create an alert for cash payments exceeding a certain number.

Image 10

  1. Right-click Alert. The configuration window will appear (Image 11).
  2. In Criteria, enter the number of cash payments above which an alert should be created.

Image 11
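
The alert criterion boils down to a simple threshold check, sketched here in plain Python. The input shape (counts keyed by vendor and payment type), the message text, and the function name cash_payment_alerts are hypothetical; the Alert processor itself is configured visually.

```python
def cash_payment_alerts(payment_counts, threshold):
    """Return an alert message for every vendor whose cash payment count
    exceeds the threshold."""
    alerts = []
    for (vendor_id, payment_type), count in sorted(payment_counts.items()):
        if payment_type == "cash" and count > threshold:
            alerts.append(
                f"ALERT: vendor {vendor_id} has {count} cash payments "
                f"(threshold {threshold})"
            )
    return alerts


if __name__ == "__main__":
    counts = {(1, "cash"): 12, (1, "card"): 30, (2, "cash"): 3}
    for message in cash_payment_alerts(counts, threshold=10):
        print(message)
```

Only counts strictly above the threshold raise an alert here; whether the boundary value itself should trigger one is a choice to make when setting the criterion.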

Step 7: Persist data

Use File Writer (Image 12) to persist the data.

Image 12

Right-click the File Writer operator to view the location where the file has been saved.

This concludes the pipeline. You can download StreamAnalytix Lite on your desktop (Mac, Linux, or Windows) and try building and running this pipeline yourself in minutes.

About StreamAnalytix Lite

StreamAnalytix Lite is a powerful visual IDE that offers a wide range of built-in operators and an intuitive drag-and-drop interface to build Apache Spark pipelines within minutes, without writing a single line of code.

A free, compact version of the StreamAnalytix platform, it offers you a full range of data processing and analytics functionality to build, test and run Apache Spark applications on your desktop or any single node.