Today we are looking at the matchup of Near-Realtime data vs. Realtime data scenarios. What considerations go into handling those types of data sources? How do we process them, and how do we provide data visualizations based on those data sources that meet the needs of our audiences?
Realtime vs Near-Realtime Overview
Before we dive in, let’s get an overview of those Realtime data scenarios. By that, we mean where we need to evaluate data that’s in-flight so we can make determinations or predictions that could result in an intervention from a business standpoint.
For example, we might need to implement fraud detection by evaluating credit card transactions and then preventing that fraudulent sale from occurring. In those cases, it’s going to be very important that we look at that data in a Realtime scenario.
A Near-Realtime case is one where having up-to-date data available is a priority, but we don’t have that requirement to intervene. In those cases, we want to make sure our end users see “up-to-date” data that they recently entered in business systems reflected in their data visualizations, dashboards, and reports. By “up-to-date” we mean data that’s anywhere from a couple of minutes to fifteen minutes (or so) old. We may want some additional capabilities along with that Near-Realtime case (which we will cover later in this blog series).
In general, the need for Realtime versus Near-Realtime data largely depends on the cost of delayed action. Where a delay of a couple of minutes to an hour could result in significant cost, we want to apply Realtime data processing to those scenarios. If the cost of delayed action is not high, but we want additional analytical capabilities, we might choose a Near-Realtime scenario.
In the Microsoft space, there are a couple of different ways to provide Realtime and Near-Realtime capabilities. We show some examples of these in the infographic below. You can see in the top layer how a Realtime pipeline leverages tools like Azure Event Hubs and Azure Stream Analytics to move data quickly through to Power BI from your devices. Those could be IoT devices or even IIoT (Industrial Internet of Things) data from machines on the shop floor.
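To make that top-layer flow concrete, here is a minimal Python sketch of the pattern (not the actual Event Hubs or Stream Analytics services): events move straight from a simulated device to a live sink and get evaluated in flight, with no storage layer in between. The device ID, fields, and threshold are all hypothetical.

```python
import random
import time

def device_events(n):
    """Simulate telemetry from an IIoT device (hypothetical fields)."""
    for _ in range(n):
        yield {"device_id": "press-01",
               "temp_c": 60 + random.random() * 10,
               "ts": time.time()}

def push_to_dashboard(event, alerts):
    """Stand-in for a streaming sink (e.g. a Power BI push dataset):
    each event is evaluated the moment it arrives and forwarded on."""
    if event["temp_c"] > 68.0:          # simple in-flight rule
        alerts.append(event["device_id"])

alerts = []
for event in device_events(100):
    push_to_dashboard(event, alerts)    # no persistence step in between
```

The key characteristic is that each event is handled individually as it arrives; nothing here writes the raw events anywhere for later analysis.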
When you get into the two scenarios in the lower section, you may want to use different types of storage mechanisms like Azure Data Lake Store or Azure Synapse Analytics (formerly Azure SQL Data Warehouse) in conjunction with Azure Data Factory to do some processing. Those scenarios often include data coming out of ERP systems, point-of-sale systems, and manufacturing execution systems. They typically have more Near-Realtime data analytics capabilities that we want to enable.
This is just a high-level overview. The blog from here on is going to dig into some of the specific advantages and disadvantages of Realtime and Near-Realtime.
Pros of Near-Realtime Data Processing
First, let’s talk about some of the pros of Near-Realtime data processing and evaluation.
One of the big advantages of using Near-Realtime data processing is the ability to persist data, meaning store it somewhere that’s not coming directly off the stream. This also enables data integration: combining current data from multiple systems or sources, or with other historical data, can help us look at trends and improve forecasting. We can incorporate all of that into our data modeling in those Near-Realtime scenarios.
A second capability that’s enabled in Near-Realtime is the ability to look at larger windows of time to do historical analysis and possibly to do even more complex analysis: what we call cross-domain analysis. Those could be scenarios where we’re looking at how many orders we’ve taken that have ultimately converted into completed sales – invoices, for example. Even further upstream we could look at how many opportunities we’ve converted out of our CRM system into those orders and invoices.
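As a sketch of that kind of cross-domain analysis, assume we have already persisted records from the CRM and ERP systems (the record shapes and IDs here are hypothetical); with everything landed in one place, funnel conversion rates fall out of simple joins:

```python
# Hypothetical persisted records pulled from CRM and ERP systems.
opportunities = [{"opp_id": 1}, {"opp_id": 2}, {"opp_id": 3}, {"opp_id": 4}]
orders = [{"order_id": 10, "opp_id": 1}, {"order_id": 11, "opp_id": 3}]
invoices = [{"invoice_id": 100, "order_id": 10}]

# Cross-domain funnel: opportunity -> order -> completed sale (invoice).
ordered_opps = {o["opp_id"] for o in orders}
invoiced_orders = {i["order_id"] for i in invoices}

opp_to_order = len(ordered_opps) / len(opportunities)      # 0.5
order_to_invoice = len(invoiced_orders) / len(orders)      # 0.5
```

This is exactly the kind of calculation that needs persisted data from more than one system, which is why it belongs in the Near-Realtime column rather than on the stream.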
Another potential advantage of Near-Realtime is that we can still refresh that data and make it available very quickly. Even though it may not be second or millisecond latency, we can still have a very tight refresh window: think minutes.
Cons of Near-Realtime Data Processing
These are some strong advantages, but Near-Realtime data processing also brings a couple of potential drawbacks to the table: latency and the fact that we can only see historical data.
We typically see Near-Realtime latency of 5-15 minutes or longer. That’s due to the need to first persist the data and then process it. Persisting the data may require bringing it together from multiple data sources, and each extraction and subsequent processing step (whether an ETL process or a model processing step) introduces latency. For very large models, the typical 5-15 minute latency could stretch into the 20-30 minute range.
I have historical data listed as both a pro and a con. The advantage is that we get historical data analysis, and we can see those longer trends. What we don’t get (because it’s historical) is the ability to do that immediate intervention. We are limited to evaluating the data after an event has already occurred; we can’t have that on-demand type of Realtime intervention that we might want with data that’s coming across the wire.
Pros of Realtime Streaming
Near-Realtime data processing has a lot of good points, but some of those disadvantages make it necessary to look at other options. Below, I’ve laid out some of the potential advantages of Realtime streaming data.
First, we get very low latency: that is, second or sub-second data availability and visualization based on that data. That has tremendous uses in cases where immediate decision-making and action are required. For example, when a shop floor operator or manager is tracking part quality or potential machine failures, seconds and minutes count and could mean the difference of thousands of dollars in scrapped parts or damaged equipment.
Unlike when you are working with Near-Realtime data, with Realtime we can do that intervention. We can perform in-flight transaction evaluation, make recommendations, and potentially change outcomes. For example, an e-commerce company can use Realtime analytics to improve customer experiences through personalized marketing based on a customer’s previous choices. This intervention option is very important when there’s a high value to immediate feedback or (conversely) a high price of delayed action.
Cons of Realtime Streaming
Like Near-Realtime, Realtime streaming has some drawbacks as well.
No Summarized Data
We can’t summarize data. In Realtime scenarios, if we’re doing any aggregation at all, we’re doing it over very small windows of time, and we’re not persisting the individual data points for later aggregation. There are some potential ways we can combine these two forms of processing (which I’ll cover more in a later blog), but the lack of summarized data can certainly be a Realtime drawback.
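To make the small-window point concrete, here is a minimal sketch of the kind of tumbling-window aggregation a streaming engine typically performs; the window size, event shape, and timestamps are assumptions. Notice that only the per-window count survives, not the raw events:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Group events into fixed, non-overlapping time windows and keep
    only the per-window count; individual events are not retained."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_seconds)
        counts[window_start] += 1
    return dict(counts)

# (timestamp_seconds, payload) pairs as they arrive on the stream
events = [(1, "a"), (3, "b"), (12, "c"), (19, "d"), (21, "e")]
tumbling_window_counts(events)   # {0: 2, 10: 2, 20: 1}
```

Once a window closes, any question you didn’t aggregate for in advance (say, the median payload value across the whole day) can no longer be answered from the stream side.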
Lack of Persistence
Another disadvantage of Realtime is that the data streams are not persisted for deeper analysis. This can be a drawback for companies that want to use their data to help implement automation and influence machine learning.
No Complex Calculations
Lastly, from a calculation standpoint, it can be difficult to add even moderately complex calculation logic. It’s often very difficult to evaluate data relative to other transactions that have moved through our system (averaging against them, for example); in many ways, we could say it’s even unsupported. Some calculations are supported, but even moderately complex calculation logic becomes difficult to do in a Realtime scenario.
In summary, choosing the right option depends on your use cases. Either way, solutions exist to address these scenarios and help you make better decisions at reasonable costs. If you need help evaluating specific scenarios and related technology approaches to manage them, reach out to Core BTS to see how we can help.