Streaming Analytics: How to Understand a Crowded Marketplace
First things first: A Definition
Streaming analytics is an emerging software space that has gained considerable prominence due to the increasing importance of big data, and specifically the Internet of Things (IoT). But what is it? Streaming analytics software makes it possible to analyze data in real time, as it happens, by plunging into the stream of live data and analyzing it in flux.
This is profoundly different from analyzing data at rest. The business intelligence space has been working on the data-at-rest analysis problem for decades, and the possible approaches are well understood. The techniques for analyzing data in motion, by contrast, are relatively new. Essentially, the software continuously recalculates statistics as each new event arrives, moving with the stream of data. The value of this real-time data is transient, and insights must be acted on almost immediately to gain real value.
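To make the idea of calculating statistics "while moving with the stream" concrete, here is a minimal sketch in Python. It is not taken from any of the products discussed in this article. It assumes a stream of plain numeric readings and uses Welford's method, a standard way to update a mean and variance one event at a time without storing the history. The names RunningStats and summarize are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean and variance updated one event at a time (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; with fewer than two events we report 0.0.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(stream):
    """Yield (count, mean, variance) after every event in the stream."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
        yield stats.count, stats.mean, stats.variance


if __name__ == "__main__":
    # Hypothetical sensor readings standing in for a live feed.
    for snapshot in summarize([2.0, 4.0, 6.0]):
        print(snapshot)
```

Each event updates the summary in constant time and memory, which is what allows this style of calculation to keep pace with a stream that never ends.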
Some Examples of How These Products Are Used
Streaming analytics technology was first adopted by the finance and capital markets industry. In capital markets, real-time data is a crucial tool for understanding what is happening right now and taking appropriate action. A streaming analytics model might watch market data streams with instructions to take specific action if certain conditions are met. For example, if the spread between two stocks deviates by more than a certain percentage in any five-second period, trade the stocks immediately.
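The spread rule described above can be sketched in a few lines of Python. This is one possible reading of the rule, not a trading system: it assumes each tick carries a timestamp in seconds and a price for each of the two stocks, and it treats "deviates" as the current spread differing from any spread seen in the last five seconds by more than a threshold percentage. The class name, parameter names, and default values are all hypothetical.

```python
from collections import deque


class SpreadMonitor:
    """Flags ticks where the spread moved too far within a sliding time window."""

    def __init__(self, window_seconds=5.0, threshold_pct=1.0):
        self.window_seconds = window_seconds
        self.threshold_pct = threshold_pct
        self._window = deque()  # (timestamp, spread) pairs, oldest first

    def on_tick(self, timestamp, price_a, price_b):
        """Return True if the rule fires for this tick."""
        spread = price_a - price_b

        # Drop observations that have fallen out of the time window.
        while self._window and timestamp - self._window[0][0] > self.window_seconds:
            self._window.popleft()

        # Compare the new spread with every spread still inside the window.
        triggered = False
        for _, past in self._window:
            if past != 0 and abs(spread - past) / abs(past) * 100.0 > self.threshold_pct:
                triggered = True
                break

        self._window.append((timestamp, spread))
        return triggered


if __name__ == "__main__":
    monitor = SpreadMonitor(window_seconds=5.0, threshold_pct=10.0)
    # Hypothetical ticks: (timestamp in seconds, price of stock A, price of stock B).
    for tick in [(0.0, 102.0, 100.0), (1.0, 102.1, 100.0), (2.0, 103.0, 100.0)]:
        if monitor.on_tick(*tick):
            print("spread rule fired at t =", tick[0])
```

In a real deployment the action taken when the rule fires would be an order submission rather than a print statement, and the window scan would likely be replaced by a structure that tracks the minimum and maximum spread incrementally.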
But streaming analytics has spread far beyond the world of stock trading. Today this technology is used by energy firms to monitor thousands of real-time data points to help predict oil pump failure before it happens. Similarly, computer chip manufacturing fabs monitor hundreds or thousands of processes in real time in order to predict potential manufacturing anomalies. This technology is also increasingly being used by consumer goods marketing departments. One common use case is to monitor sentiment over social media in real time in order to provide insights on who to target in upcoming campaigns.
Different Kinds of Streaming Analytics Technology
I asked Roger Rea, Senior Product Manager for IBM Streaming Analytics technology, to describe the various options:
“The streaming analytics category is a very crowded market and might best be thought of as an evolution of complex event processing technology (CEP). CEP engines were first created for use cases related to capital markets such as stock trading algorithms where they must generate a response within milliseconds. Speed is really everything here. Streaming analytics applications, by contrast, tend to have higher latency but are designed for the very high throughput of the big data era.”
The underlying technology differs from tool to tool, but three main approaches account for most streaming analytics capabilities.
SQL-Based:
Many CEP tools and some streaming analytics vendors use SQL as the underlying analytics engine. The emergence of various flavors of streaming SQL in 2016 made SQL a very viable option. Although there is no standard streaming SQL syntax as of today, most stream processors, such as WSO2 and the various Apache streaming tools, now support streaming SQL. TIBCO StreamBase, for example, uses a version of SQL called StreamSQL, which has been adapted to the analysis of data streams in flux. SQL now provides a simple and effective language for programming streaming use cases.
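Because there is no standard syntax, the sketch below is written in plain Python rather than in any vendor's streaming SQL dialect. It illustrates the kind of windowed aggregate that streaming SQL queries typically express: an average per key over fixed, non-overlapping ("tumbling") time windows. It assumes events arrive in timestamp order as (timestamp_seconds, key, value) tuples; the function name and event layout are hypothetical.

```python
def tumbling_window_avg(events, window_seconds=10):
    """Yield (window_start, {key: average}) each time a window closes.

    Assumes events are (timestamp_seconds, key, value) tuples in timestamp order.
    """
    current_start = None
    sums = {}  # key -> (running total, event count) for the open window

    for ts, key, value in events:
        start = int(ts // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The event belongs to a later window, so emit the finished one.
            yield current_start, {k: s / n for k, (s, n) in sums.items()}
            sums = {}
        current_start = start
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + value, count + 1)

    # Flush the last open window when the input ends.
    if current_start is not None:
        yield current_start, {k: s / n for k, (s, n) in sums.items()}


if __name__ == "__main__":
    # Hypothetical readings from two sensors, "a" and "b".
    events = [(0, "a", 1.0), (3, "a", 3.0), (4, "b", 10.0), (12, "a", 5.0)]
    for window_start, averages in tumbling_window_avg(events):
        print(window_start, averages)
```

A streaming SQL engine lets the analyst declare this grouping and windowing in a query and leaves the bookkeeping shown here to the engine.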
Rule-Based:
Another option is the inference rule-based engine approach, which is also more commonly seen in CEP products. These rule-based systems work by using if-then-else logic: when a record comes in, it is compared to the rules and, if a rule matches, some action is triggered. This approach works best with highly structured data.
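A toy version of this compare-then-trigger loop might look like the following Python sketch. It is far simpler than a real inference engine, which would also handle rule priorities, chaining, and state. The record fields (pump_id, temperature_c, pressure_kpa), the thresholds, and the rule names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Record], bool]  # the "if" part
    action: Callable[[Record], str]      # the "then" part


def evaluate(record: Record, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose condition matches the record."""
    return [rule.action(record) for rule in rules if rule.condition(record)]


# Hypothetical rules for monitoring an oil pump.
RULES = [
    Rule(
        name="overheat",
        condition=lambda r: r["temperature_c"] > 90,
        action=lambda r: f"ALERT overheat on pump {r['pump_id']}",
    ),
    Rule(
        name="low_pressure",
        condition=lambda r: r["pressure_kpa"] < 100,
        action=lambda r: f"ALERT low pressure on pump {r['pump_id']}",
    ),
]


if __name__ == "__main__":
    incoming = {"pump_id": 7, "temperature_c": 95.0, "pressure_kpa": 300.0}
    for message in evaluate(incoming, RULES):
        print(message)
```

The sketch also shows why the approach favors highly structured data: every rule assumes the record has known fields with known types.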
Programmatic:
The third principal technology is programmatic, meaning that code must be written in a standard development language like Java or C, or a proprietary language like IBM Streams Processing Language.
Open Source vs Proprietary Tools
There are a large number of open-source technologies in the streaming analytics arena, notably Apache Flink, Apache Spark Streaming, Apache Samza, Apache Kafka Streams, Apache Storm, Google Dataflow, Twitter Heron, and Airbnb StreamAlert. All of these tools (and most other open-source tools) originally used the programmatic logic approach, but now support streaming SQL too. There are also a number of proprietary streaming analytics platforms, such as IBM Streaming Analytics, Vitria Operational Intelligence, Microsoft Azure Stream Analytics, SAS Event Stream Processing, and TIBCO StreamBase.
One obvious question, then, is why anyone would choose a proprietary tool over the long list of available open-source tools.
I posed this question to Manish Patel, Senior Product Manager for TIBCO’s streaming analytics technologies. In his view, there are two reasons: scale and specialization.
Scale:
Open-source streaming analytics tools are focused on smaller projects. Manish says:
“For a POC firms get started with these technologies, but their time to market is quite expensive and the learning curve can be long. These tools require the use of a programming language, and they don’t provide visual tooling to help develop applications faster.”
Once a company has validated the idea and wants to move something into production, it generally moves to a commercial vendor. One reason for this is stability: open-source tools tend to change quickly because the code base is constantly being modified.
A TIBCO StreamBase user comments in a review on TrustRadius:
“I have considered Apache Spark Streaming and Apache Flink. Spark Streaming is still changing too often for my taste and does not seem as easy to connect to IoT data.”
Specialization:
In order to differentiate themselves from each other, open-source tools have become increasingly specialized. Manish describes them as follows:
“Firstly, there are event processing tools for doing correlation abstraction, aggregation, and filtering; Second are stateful computing platforms like Apache Spark Streaming, and Apache Flink and Samza. Third is the set of tools focused on ingestion and processing of data flows and ETL like Apache Kafka and NiFi.”
While this specialization has benefits, it also adds complexity since several different tools are required for an end-to-end solution.
Proprietary platforms like IBM Streaming Analytics, TIBCO StreamBase and Apama Streaming Analytics usually provide capabilities across the entire ecosystem. Large vendors may also offer portfolios of complementary tools alongside the streaming platform, such as visual analytics, advanced numerical analysis, and collaboration capabilities. Proprietary vendors say that this portfolio approach can reduce project complexity in the long run.
However, open-source tools have advantages too. Not only are they usually free and open, but these days, groundbreaking technologies often debut as open source, and innovative new technology is more likely to be open source than not. In particular, open-source tools usually dominate discussion at the infrastructure layer. Open-source products are also being incorporated as components of products from large proprietary vendors like IBM, Microsoft and others.
Trends
As in many technology domains where significant expertise is required to derive value, streaming analytics technology is in the process of being simplified to extend its user base beyond the IT department. The goal is for business users to create real-time models without any help from a developer by using visual tooling and clicks.
One TIBCO StreamBase review mentions that this trend is already partly realized:
“Rapid development, drag-drop with minimal coding. UI is easy to understand and build.”
An IBM Streaming Analytics review looks forward to the availability of non-proprietary visual development tools:
“In order to get the absolute most out of the platform IMO it’s still best to develop applications using proprietary SPL (Stream Programming Language). Although SPL is a very effective language for stream processing it does present a barrier to entry that will be avoided with the updated visual development tools which being worked on.”
We can expect a lot of development around simpler tooling in the future.
A second trend is the expansion of the Business Intelligence domain to include real-time streaming data. Traditionally, business intelligence tools are designed to look back at historical data to understand what happened in the last quarter or the last year. Increasingly though, data analysts want to see, for example, real-time sales data so that they can get a view on how they are performing right now compared to historical data. Many BI tools are scrambling to build real-time data stream analysis into their products. Zoomdata, for example, has had this capability for some time.