Blog
29 Aug 2018

Why Apache Spark is the Antidote to Multi-Vendor Data Processing

The open source landscape has evolved.

Organizations today have access to a whole gamut of tools for processing massive amounts of data quickly and efficiently. Among the many open source technologies that provide unmatched data processing capabilities, one stands out as the frontrunner: Apache Spark™.

Apache Spark is gaining acceptance across enterprises due to its speed, support for iterative computation, and simpler data access. But for organizations grappling with multiple vendors for their data processing needs, the challenge is bigger. They're not just looking for a highly capable data processing tool; they're also looking for an antidote to multi-vendor data processing.

Enterprises have successfully validated Apache Spark's versatility and strength as a distributed computing framework that can handle end-to-end data processing, analytics, and machine learning workloads.

Let’s find out what makes Apache Spark the enterprise backbone for all types of data processing workloads.


Author
Punit Shah
Director of Engineering - StreamAnalytix
Blog
25 Dec 2018

Move away from batch ETL with next-gen Change Data Capture

As data volumes continue to grow, enterprises are constantly looking for ways to reduce processing time and expedite insight extraction. However, with traditional databases and batch ETL processes, performing analytical queries on huge volumes of data is time-consuming, complex, and cost-intensive.

Change Data Capture helps expedite data analysis and integration by providing the flexibility to process only the new data through a smart data architecture, incrementally computing updates without modifying or slowing down source systems. This enables enterprises to reduce overheads, extend hardware life, and ensure timely data processing without facing the limitations of batch processing.

That probably sounds like an oversimplified solution to a very complex problem. So, let’s break it down.

Data extraction is integral to data warehousing: data is typically extracted in bulk on a scheduled basis and transported to the data warehouse, where all of it is refreshed with data extracted from the source system. However, a full refresh involves extracting and transporting huge volumes of data and is very expensive in both resources and time. This is where Change Data Capture plays an important role, replicating data into a data lake by incrementally applying updates without modifying or slowing down source systems.
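To make the contrast concrete, here is a minimal Apache Spark sketch of the two extraction styles, assuming a hypothetical ORDERS source table with an UPDATED_AT change-tracking column (the connection details, credentials, and lake path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object IncrementalExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-extract").getOrCreate()

    // In practice the watermark is persisted after each run; hard-coded here.
    val lastRun = "2018-12-24 00:00:00"

    // Instead of re-reading the whole ORDERS table (a full refresh), push a
    // predicate down to the source so only rows changed since the last run move.
    val changedRows = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // hypothetical source
      .option("user", "etl_user")
      .option("password", sys.env("DB_PASSWORD"))
      .option("dbtable",
        s"(SELECT * FROM orders WHERE updated_at > TIMESTAMP '$lastRun') t")
      .load()

    // Append only the delta to the lake rather than rewriting everything.
    changedRows.write.mode("append").parquet("hdfs:///lake/staging/orders")
    spark.stop()
  }
}
```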

This makes it easy to structure and organize change data from enterprise databases to provide instant insights. StreamAnalytix allows you to ingest, blend, and process high velocity data streams as they arrive, run machine learning models, and train and refresh models in real-time or in batch mode. There are two parts to a Change Data Capture solution with StreamAnalytix:

1. One-time load from source tables to the data lake

To achieve the 'one-time load' of all source tables into the data lake, StreamAnalytix Batch jobs on Apache Spark can be built. StreamAnalytix Batch supports bulk reads from any JDBC-compliant database and can perform ETL/data-validation checks on the records before writing them into the data lake (see example pipeline below).


Image 1: A one-time load StreamAnalytix batch pipeline
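As a rough illustration of what such a batch job does under the hood, here is a minimal Spark sketch of a partitioned bulk read over JDBC followed by a simple validation check; the CUSTOMERS table, its columns, and the lake path are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object OneTimeLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("one-time-load").getOrCreate()

    val source = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "CUSTOMERS")
      .option("user", "etl_user")
      .option("password", sys.env("DB_PASSWORD"))
      // Partition the read so the bulk extract is parallelized across executors.
      .option("partitionColumn", "CUSTOMER_ID")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()

    // Data-validation step: drop records that fail basic quality checks.
    val validated = source
      .filter(col("CUSTOMER_ID").isNotNull)
      .filter(col("EMAIL").rlike("^[^@\\s]+@[^@\\s]+$"))

    validated.write.mode("overwrite").parquet("hdfs:///lake/master/customers")
    spark.stop()
  }
}
```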

One of the unique features of StreamAnalytix, 'Templates', allows a user to create a template pipeline (as shown above) and replicate it (create instances) for multiple source tables with a few clicks. This significantly increases productivity by removing the need to write and manage hundreds of pipelines, one for each source table.

There are a few phases involved in an end-to-end Change Data Capture implementation:

  1. Capture the change data in real-time
  2. Perform any data quality/validation on the change data
  3. Merge the CDC data into the lake by either:
    • Running ACID Hive queries, or
    • Landing the change data in a staging zone (HDFS) and running reconciliation jobs

Image 2: Phases of a complete CDC solution

2. Real-time synchronization of Change Data Capture events into the data lake

There are several ways to capture change data in real-time. Let's look at two of the more popular approaches:

LogMiner approach for Oracle database

In Oracle database, all changes made to user data or to the database dictionary are recorded in the Oracle redo log files. Since LogMiner provides a well-defined, easy-to-use, and comprehensive relational interface to redo log files, it can be used as a tool for capturing change data in near real-time.

LogMiner requires an open database running in ARCHIVELOG mode with archiving enabled. Also, by default the redo log files do not contain all the information required to construct a CDC record. To log the required information in the redo log files, supplemental logging of at least the primary key (or an identification key) must be enabled for every table of interest.
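For reference, these Oracle-side prerequisites can be checked and enabled with a couple of statements. Below is a hedged sketch issued over plain JDBC; the connection details, admin account, and sales.orders table are hypothetical, and the Oracle JDBC driver is assumed to be on the classpath:

```scala
import java.sql.DriverManager

object EnableSupplementalLogging {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL", "dba_user", sys.env("DBA_PASSWORD"))
    try {
      val stmt = conn.createStatement()

      // 1. Confirm the database is running in ARCHIVELOG mode.
      val rs = stmt.executeQuery("SELECT LOG_MODE FROM V$DATABASE")
      rs.next()
      require(rs.getString(1) == "ARCHIVELOG", "LogMiner needs ARCHIVELOG mode")

      // 2. Enable primary-key supplemental logging for each table of interest,
      //    so redo records carry enough detail to build a CDC event.
      stmt.execute(
        "ALTER TABLE sales.orders ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS")
    } finally {
      conn.close()
    }
  }
}
```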

StreamAnalytix and LogMiner

StreamAnalytix has the capability to read CDC data from Oracle using the LogMiner approach. The StreamAnalytix agent captures CDC records and streams them in near real-time into Kafka. Once these events are in Kafka, the StreamAnalytix streaming application reads them, performs any ETL/enrichment/validation, and loads them into a staging area in HDFS (as an external Hive source table).

Image 3: StreamAnalytix agent capturing events from LogMiner and streaming them into Kafka

Image 4: Streaming pipeline reading CDC events and landing them in the staging area
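A minimal Spark Structured Streaming sketch of this leg of the flow might look as follows; the topic name, event schema, and HDFS paths are assumptions, not the platform's actual internals:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object CdcKafkaToStaging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cdc-kafka-to-staging").getOrCreate()
    import spark.implicits._

    // Assumed shape of a CDC event emitted by the capture agent.
    val cdcSchema = new StructType()
      .add("op", StringType)          // INSERT / UPDATE / DELETE
      .add("table", StringType)
      .add("key", StringType)
      .add("payload", StringType)
      .add("commit_ts", TimestampType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "oracle-cdc-events")
      .load()
      .select(from_json($"value".cast("string"), cdcSchema).as("e"))
      .select("e.*")
      .filter($"key".isNotNull) // basic validation before landing

    // Land parsed events in the HDFS staging area an external Hive table points at.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///lake/staging/cdc")
      .option("checkpointLocation", "hdfs:///lake/checkpoints/cdc")
      .start()
      .awaitTermination()
  }
}
```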

A StreamAnalytix batch job subsequently runs the ACID Hive merge queries to update the master table at a periodic interval. StreamAnalytix scheduling capabilities help schedule the batch job to run at the desired frequency.
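To illustrate the merge step itself, here is a hedged sketch of a Hive ACID MERGE submitted over HiveServer2 JDBC; the master_orders and staging_cdc_orders tables and their columns are invented for the example, the target table is assumed to be transactional, and the Hive JDBC driver is assumed to be on the classpath:

```scala
import java.sql.DriverManager

object MergeCdcIntoMaster {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/lake")
    try {
      // Reconcile staged change records into the master (ACID) table.
      conn.createStatement().execute(
        """MERGE INTO master_orders AS m
          |USING staging_cdc_orders AS s
          |ON m.order_id = s.order_id
          |WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
          |WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
          |WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.amount, s.status)
          |""".stripMargin)
    } finally {
      conn.close()
    }
  }
}
```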

Third party CDC software/agent

An alternative to LogMiner is leveraging a commercial product to capture the change data events in real-time. Attunity is one such product that we have used in the past, and StreamAnalytix has built-in connectors to read data from it. We receive two streams from Attunity: the metadata about the tables, and the change data as it occurs. Once StreamAnalytix reads the change data events, it lands them in the staging zone, from where the batch pipeline runs the Hive ACID transaction queries to update the master table.

Why implement a CDC solution?

Image 5: Four reasons to implement a CDC solution

Synchronizing data between systems has always been a massive undertaking. While Change Data Capture makes it easier by identifying what needs to change at the destination before the data even leaves the source, it's important to call out its key benefits. Change Data Capture allows you to:

  • Use commodity hardware and avoid redundant data replication and delivery
  • Improve operational efficiency with real-time change data capture from transactional systems
  • Enable real-time data synchronization across the organization
  • Eliminate the need for batch windows by replacing bulk load updates with continuous streaming
Author
Punit Shah
Director of Engineering - StreamAnalytix
Blog
25 Jun 2019

Build, test, and run your Apache Spark ETL and machine learning applications faster than ever

Start building Apache Spark pipelines within minutes on your desktop with the new StreamAnalytix Lite.

Manually developing and testing code on Spark is complicated and time-consuming, and can significantly delay time to market. A visual low-code solution, on the other hand, can simplify and accelerate Spark development.

StreamAnalytix Lite, a lightweight, self-service data flow and analytics platform, has transformed the experience of developing and running Spark applications, making it visual, faster, and easier - right on your desktop, at no cost.

StreamAnalytix Lite, a developer edition of the StreamAnalytix Enterprise Edition, retains all its capabilities and features to build, test and run enterprise-grade Spark applications 10x faster vs. hand coding. It offers an intuitive drag-and-drop visual interface to instantly transform your journey with Spark on a desktop or a single node.

With its new release, StreamAnalytix Lite has further enhanced Spark development with a richer set of connectors, 150+ built-in Spark operators, interactive development, enhanced self-service features, and better collaboration for multiple users. It also offers additional features like auto-schema detection, test-suite support, user recommendations, design-time error detection, the use of notebooks, and more. Further, it supports hand-written custom logic in the language of your choice (Java, Scala, or Python).

StreamAnalytix Lite comes with an array of support documentation, resources, and built-in sample pipelines to onboard you and expedite your journey with the platform. Some key features offered by StreamAnalytix Lite are:

  • Build and run enterprise-grade Spark applications on your desktop: End-to-end application life cycle management - build, test, debug, deploy, and manage in a unified platform
  • End-to-end ETL capabilities: Visually perform data cleansing, data blending, and data enrichment to transform batch as well as streaming data
  • Built-in advanced analytics and machine learning capabilities: Use built-in analytical operators like Spark MLlib, Spark ML, PMML, TensorFlow, and H2O
  • Visual and interactive development: Use an intuitive drag-and-drop interface, built-in processors, and a visual pipeline designer
  • Self-service platform to create data flows: Interact with the data as you build your data flows, leverage auto schema detection and data profiling, get auto-generated user recommendations, and more
  • Get all Spark features in one unified development tool: A wide array of built-in Spark operators for data sources, transformations, machine learning, and data sinks. Support for Spark 2.3 and Spark Structured Streaming makes it easier to build production-grade continuous applications that handle late, out-of-order data, maintain greater consistency within data streams, and join streams with static data sources more efficiently (see the sketch after this list).
  • Use powerful multi-tenancy features: Multiple users can connect to a single instance through a web-based interface
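Following up on the stream-static join point above, here is a minimal Spark Structured Streaming sketch; the Kafka topic, reference-table path, and column names are placeholders rather than anything specific to StreamAnalytix Lite:

```scala
import org.apache.spark.sql.SparkSession

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()

    // Static reference data, loaded once.
    val customers = spark.read.parquet("hdfs:///lake/master/customers")

    // Unbounded stream of events.
    val clicks = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "clickstream")
      .load()
      .selectExpr("CAST(key AS STRING) AS customer_id",
                  "CAST(value AS STRING) AS event")

    // Spark plans this as a per-micro-batch join against the static side.
    val enriched = clicks.join(customers, Seq("customer_id"), "left")

    enriched.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/click-enrich")
      .start()
      .awaitTermination()
  }
}
```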

Experience the ease of Spark application development with StreamAnalytix Lite

The screen below shows the pipeline designer and the “Inspect” feature of StreamAnalytix Lite where a developer builds and iteratively validates a Spark pipeline by injecting sample test records and seeing the data changes at each step of the flow.

Who should use StreamAnalytix Lite

Developers, business analysts, data scientists, and DevOps specialists can use StreamAnalytix Lite to build unlimited Spark workflows on their desktop (Windows, Mac, or Linux) or any single node.

Recommended StreamAnalytix Lite usage

StreamAnalytix Lite is a learning, experimentation, and development tool designed to make Spark development easy for a wide range of users. It is not recommended as an execution platform for production applications. Data processing applications or pipelines built on StreamAnalytix Lite can be seamlessly exported to the production-grade (Enterprise) edition of the StreamAnalytix platform to run at full enterprise scale on multi-node Spark clusters.

About StreamAnalytix

StreamAnalytix is an enterprise-grade, visual, self-service data flow and analytics platform for unified streaming and batch data processing based on the best-of-breed open source technologies. It supports the end-to-end functionality of data ingestion, enrichment, machine learning, action triggers, and visualization. StreamAnalytix offers an intuitive drag-and-drop visual interface to build and operationalize data applications five to ten times faster, across industries, data formats, and use cases.

Blog
07 Aug 2019

Boosting customer experience with real-time streaming analytics in the travel industry

A large US-based airline use case

A recent study by Harvard Business Review revealed that 60% of enterprise business leaders believe real-time customer analytics is crucial to provide personalization at scale. According to the study, the number is expected to increase to 79% by 2020.

As mobile-first becomes the driving force for customer experiences, airlines are battling it out to personalize every customer journey in real-time. With every customer choosing different channels to interact at different points in their engagement journey, the challenge for airlines is to tap all these points of interaction to create an experience that's personalized and relevant.

As passenger data becomes more readily available, airlines can gain granular insight into individual travelers and create tailored services for specific groups. Real-time use of this data can boost revenues, improve customer satisfaction, and enable proactive resolution of issues raised with the contact center.

Did you know?
  • 36% of consumers are willing to pay more for personalized experiences
  • Nearly 60% of consumers believe that travel companies should use AI to base search results on their past behaviors and personal preferences
  • 50% of global travelers say that personalized suggestions for destinations and things to do encourage them to book a trip

Putting your customer data to work

Airlines have a wealth of reliable customer data, but most of them aren’t using it to their full advantage. While there is an increased focus on improving customer experience, airlines need a more robust and scalable infrastructure to cope with the 3Vs: volume, variety, and velocity.

Traditional systems fall short of providing a real-time customer view and connecting it with historical customer information. That is where a real-time customer 360 plays a vital role, providing a deep understanding of the customer's state at the moment of the current interaction, set in the context of their entire history.

The StreamAnalytix Advantage

StreamAnalytix makes it possible to ingest and manage high volumes of data in seconds to minutes - data that would otherwise take days or weeks to harness using a traditional technology stack. Using a scalable distributed architecture, StreamAnalytix enables support for even larger datasets coming in at higher speeds. The platform provides a visual interface for smooth and easy onboarding of additional services and creation of applications as part of the customer 360 journey.

A major airline was struggling to efficiently manage, analyze, and draw actionable real-time insight from its continuously growing and complex customer and operational data. StreamAnalytix helped the airline enhance the customer experience across various channels and points of interaction by:

  • Increasing capacity to perform data searches and pattern analysis using an extended time window (15 days)
  • Proactively alerting on and analyzing log data to detect website and mobile app outages in real-time
  • Applying built-in predictive models and machine learning algorithms on customer data to predict customer preferences and choices, resulting in more contextual interactions and personalized offers
  • Offering proactive insight to contact center representatives to quickly resolve incoming requests, leading to higher conversions and enhanced customer satisfaction

To know more about how StreamAnalytix helped this top US airline boost real-time customer experience across channels, read this case study.
Blog
02 Sep 2019

Detect and prevent insider threats with real-time data processing and machine learning

Insider threats are one of the most significant cybersecurity risks to banks today. These threats are becoming more frequent, more difficult to detect, and more complicated to prevent. PwC's 2018 Global Economic Crime and Fraud Survey reveals that people inside the organization commit 52% of all fraud. Information security breaches originating within a bank can stem from employees mishandling user credentials and account data, weak system controls, responses to phishing emails, or regulatory violations.

Ignoring any internal security breach poses as much risk as an external threat such as hacking, especially in a highly regulated industry like banking. Some of the dangers of insider threats in the banking and financial industry include:

  • Exposure of customers' personally identifiable information (PII)
  • Jeopardized customer relationships
  • Fraud
  • Loss of intellectual property
  • Disruption to critical infrastructure
  • Monetary loss
  • Regulatory failure
  • Destabilized cyber assets of financial institutions

Identifying and fighting insider threats requires the capability to detect anomalous user behavior immediately and accurately. This detection presents its own set of challenges such as appropriately defining what is normal or malicious behavior and setting automated preventive controls to curb predicted threats.

How can a real-time analytics and machine learning platform like StreamAnalytix help detect insider threats?

Ingestion and data processing from many critical applications, at a fraction of the cost

StreamAnalytix enables ingestion from many applications and blends incoming high-speed data with static data sources. It also uses Apache Kafka, which allows the platform to ingest data from tens of thousands of discrete internal systems at significantly higher speed and at roughly a tenth of the infrastructure cost. For instance, StreamAnalytix helped a large US bank ingest data from up to 90% of all its mission-critical applications to detect threats - 5x more applications than the existing solution covered - at 4x the speed of the older technology stack and with lower hardware infrastructure cost.

Data transformation in real-time

StreamAnalytix enables in-memory data transformation and distributed in-memory stateful processing, which allows faster data quality scoring, data cleansing, and data enrichment. StreamAnalytix gave the bank the following capabilities in its insider threat detection journey:

  • Real-time data quality scoring and auto-cleansing
  • Data deduplication over seven days of history, which helps curb false positives by narrowing the flags to genuinely suspicious behavior and activity (see the sketch after this list)
  • Enriching event records with employee and application data
  • Executing data transformations
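As a rough illustration of the deduplication step, here is a minimal Spark Structured Streaming sketch in which a seven-day watermark bounds the dedup state; the Kafka topic and event fields are assumptions, not the bank's actual schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object SevenDayDedup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("seven-day-dedup").getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("event_id", StringType)
      .add("event_time", TimestampType)
      .add("user", StringType)
      .add("action", StringType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "security-events")
      .load()
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    val deduped = events
      .withWatermark("event_time", "7 days")     // keep 7 days of dedup state
      .dropDuplicates("event_id", "event_time")  // same id within horizon = duplicate

    deduped.writeStream
      .format("parquet")
      .option("path", "hdfs:///lake/security/deduped")
      .option("checkpointLocation", "hdfs:///lake/checkpoints/dedup")
      .start()
      .awaitTermination()
  }
}
```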

Use of machine learning models for automated, continuous, and accurate anomaly detection

StreamAnalytix enables the use of machine learning to move away from static rule-based alerts to dynamic models. These models periodically learn normal baseline behavior and detect anomalies based on both dynamic and static factors, such as identities, roles, and access permissions, correlated with log and event data.

Models developed using built-in machine learning operators in StreamAnalytix include self-learning behavioral-profile algorithms that process new transactions in real-time to build risk scores and dynamic thresholds for various risk factors.
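To make the idea concrete, here is an illustrative sketch - not the platform's actual models - that learns baseline behavior profiles with k-means in Spark MLlib and scores new activity by its distance from the nearest profile; the feature columns, paths, and threshold are all invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}

object BehaviorRiskScore {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("behavior-risk-score").getOrCreate()

    // Historical per-user activity features (hypothetical columns).
    val history = spark.read.parquet("hdfs:///lake/security/user_features")

    val assembler = new VectorAssembler()
      .setInputCols(Array("logins_per_day", "offhours_ratio", "mb_downloaded"))
      .setOutputCol("features")

    // Learn baseline behavior profiles from history.
    val model = new KMeans().setK(8).setFeaturesCol("features")
      .fit(assembler.transform(history))
    val centers = model.clusterCenters

    // Risk score = distance from the nearest learned profile.
    val riskScore = udf { v: Vector =>
      centers.map(c => math.sqrt(Vectors.sqdist(v, c))).min
    }

    // Score today's activity and flag outliers against a (hypothetical) threshold.
    val today = assembler.transform(
      spark.read.parquet("hdfs:///lake/security/todays_features"))
    val flagged = today
      .withColumn("risk", riskScore(col("features")))
      .filter(col("risk") > 3.0)

    flagged.show()
    spark.stop()
  }
}
```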

The use of machine learning proved highly effective in reducing false positives and highlighting behavior that genuinely indicates malicious activity.

Custom alerts to curb fraud in real-time

Appropriate real-time alerts and actions are critical to prevent predicted breaches. The StreamAnalytix platform sets up routine rule-based alerts - such as off-hours activity, multiple failed logins, and multi-station logins - as well as custom alerts for 'suspicious' activity (based on a complex mix of factors deduced by the machine learning algorithms) that can then be manually validated by security experts.
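For flavor, here is a hedged sketch of two such routine rules expressed as Spark Structured Streaming predicates; the login-events topic and field names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RuleBasedAlerts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rule-based-alerts").getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("user", StringType)
      .add("status", StringType) // SUCCESS / FAILED
      .add("station", StringType)
      .add("event_time", TimestampType)

    val logins = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "login-events")
      .load()
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    // Rule 1: off-hours activity (any login outside 07:00-20:00).
    val offHours = logins.filter(
      hour($"event_time") < 7 || hour($"event_time") >= 20)

    // Rule 2: five or more failed logins by one user within ten minutes.
    val failedBursts = logins
      .filter($"status" === "FAILED")
      .withWatermark("event_time", "10 minutes")
      .groupBy($"user", window($"event_time", "10 minutes"))
      .count()
      .filter($"count" >= 5)

    offHours.writeStream.format("console")
      .option("checkpointLocation", "/tmp/ck/offhours").start()
    failedBursts.writeStream.format("console").outputMode("update")
      .option("checkpointLocation", "/tmp/ck/failed").start()
    spark.streams.awaitAnyTermination()
  }
}
```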

The StreamAnalytix Advantage

StreamAnalytix has helped a large bank identify and prevent insider information security threats across sensitive applications in its retail banking and wealth management divisions. By applying predictive analytics and machine learning to an extensive data set from highly sensitive applications, StreamAnalytix boosted insider threat detection by 5x, automatically detecting previously unknown threat scenarios and patterns and raising appropriate alerts and actions to prevent predicted breaches.

To know more about how StreamAnalytix helped a large US bank boost threat detection, read this case study.