From batch to real-time reporting and analytics

In today's fast-moving world of digital transactions, mobile apps and sensor data, gone are the days when everyone can afford to wait hours to look at performance dashboards. Several parts of our business are demanding actionable insights in (close to) real-time:
  • Credit card fraud needs to be detected on the spot (rather than the next day) to limit exposure
  • E-commerce campaign success needs to be monitored in real-time to guarantee stock and an acceptable delivery lead time
  • Broadcasters such as TV and radio stations want real-time digestion of social media feedback
  • Predictive maintenance needs to digest real-time information to avoid intraday production line failures
  • Intraday purchases need to be reflected in real-time in the warehouse planning systems to provide commercial teams with accurate stock coverage
  • An electricity grid needs to be monitored in real-time and intraday changes need to instantly update the forecasts
  • Transportation partners need a real-time view on the status of all buses/trucks/ships and expected delays
At the same time, the need for visual dashboards, clear KPIs and consistent numbers very much remains: with all this data to digest, business intelligence is more relevant than ever.
 
Note: next to this challenge to deliver at increased speed, the business often challenges the BI team to digest more data types (sensors, tweets, online data) and to handle bigger volumes. This velocity, variety and volume challenge lies at the center of what is generally known as the Big Data challenge. In this insight, we'll focus solely on the increased speed of delivery: how to get started with real-time.
 
Providing real-time analytics requires a choice in system architecture. In this insight, we aim to outline the options a BI manager/architect has, review their pros and cons and finally deep-dive into our recommended option: leveraging the newly available technologies.

Option 0: Leverage the operational data store

A base option is to build real-time reporting on top of the transactional (operational) system. This is only possible if the information is registered in real-time in the operational database: e.g., the sensor data is already streamed into our ERP system, or the credit card transactions are part of the transactional system.
 
With this option, all complexity will need to be taken care of in the operational data store rather than in the business intelligence layer (e.g., can it handle real-time, how much incoming data can it manage). Additionally, the operational data store will need to provide some BI facilities: to clean data, apply transformations, apply business rules, etc.
 
This option includes consumption from the operational data store or from a synced replica of it. The biggest disadvantage is the inability to integrate multiple data sources (which is exactly where BI comes in).

Option 1: Add horsepower to our conventional data warehouse

The added value of a conventional BI data warehouse is first to integrate multiple data sources into a useful structure (facts and dimensions) and second to provide a clean view on both historical & current data.
 
Although a conventional data warehouse is not meant for real-time BI, a first real BI option is to extend the horsepower of the conventional data warehouse to the level needed to deliver (close to) real-time output.
 
Although this is feasible, the traditional approach of ETL (Extract, Transform, Load) will always take time: i.e., the data is always stored prior to transformation and stored once more after transformation. A relational transactional database will always depend on ETL processes that execute on a fixed schedule. This schedule allows a clean cut-off date for reporting and the reconciliation of all data prior to that date.
As such, there is a limit to how frequently an extract/load process can be instantiated and completed against a source system before it has to be restarted.
 
The major downside of this approach is the cost. To enable a conventional data warehouse to recurrently rerun the ETL at (close to) instant speed, this approach will, depending on the size of the data to handle, require a big to huge investment in CPUs and memory.

Option 2: Use synchronous database replication technology

A second option lies in database replication with synchronous transfer of data between the OLTP system and the data warehouse. With synchronous replication, the data warehouse will always be in sync. However, as the ambition increases to deliver up-to-the-second insights across large datasets, this set-up will put significant pressure on the analytics repository.

Option 3: Complement existing architecture with new technologies

A third and recommended option is to leverage a set of new technologies to complement your traditional ETL process with a real-time streaming process. These technologies need to do two things: first, receive streaming data and second, transform (in real-time) this data into relevant insights. Nowadays, all leading vendors have a solution in place which can handle both: e.g., Azure has Azure Event Hubs and Azure Stream Analytics.

Receive streaming data

Source systems need to be able to stream their data towards a receiving end-point. This end-point will act as the doorway towards the real-time analytics and reporting.
A receiving end-point should be designed to cope with changing loads at an efficient cost: i.e., at peak times, capacity needs to grow as many messages come in; in quiet times, capacity needs to scale down as few messages need to be digested.
 
The stronger solutions in the market (e.g., Azure Event Hubs) have some additional features such as the ability to replay certain data streams from the past, retain data for up to 7 days, etc.
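As an illustration of how little code it takes for a source system to feed such an end-point, the sketch below pushes a handful of sensor readings to Azure Event Hubs using the azure-eventhub Python SDK. The connection string, event hub name and message layout are placeholders and illustrative assumptions, not part of any specific implementation.

import json
import time

from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: replace with your own Event Hubs namespace and hub name.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"
EVENTHUB_NAME = "sensor-readings"

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

with producer:
    # Group individual readings into one batch to stay within the message size limit.
    batch = producer.create_batch()
    for sensor_id in ("line-1", "line-2", "line-3"):
        reading = {"sensorId": sensor_id, "temperature": 71.5, "timestamp": time.time()}
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)  # a single call ships the whole batch to the end-point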
 
The most-used receiving end-point solutions are the following (non-exhaustive list):
 
Figure 1 - Overview of real-time solution modules offered by leading vendors
 

Analytics on streaming data

Most data streams will still require some transformations before visualization: an aggregation (min, max, average), the scoring of a model, etc. To apply these transformations, we need technology that leverages a distributed stream computational system, allowing transformations without first storing the data.
 
To write the transformations, the most accessible solutions allow the user to write SQL-like queries and set sliding time windows to compute the metrics, e.g., an average every 10 seconds. Other solutions will require you to submit your transformations in Java, Python, Go or another programming language.
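To make the windowing idea concrete, the sketch below computes such a 10-second average in plain Python over an ordered stream of readings. It is illustrative only: a managed engine such as Azure Stream Analytics would express the same logic as a SQL-like query and would also handle out-of-order and late events; the field names are assumptions.

def tumbling_window_average(events, window_seconds=10):
    """Yield the average 'temperature' per tumbling window of window_seconds.

    Assumes events are dicts like {"temperature": 71.5, "timestamp": 1700000000.0}
    and arrive ordered by timestamp (a real streaming engine also handles late data).
    """
    current_window = None
    values = []
    for event in events:
        window_start = int(event["timestamp"] // window_seconds) * window_seconds
        if current_window is None:
            current_window = window_start
        if window_start != current_window:
            # The previous window is complete: emit its aggregate before moving on.
            yield {"windowStart": current_window, "avgTemperature": sum(values) / len(values)}
            current_window, values = window_start, []
        values.append(event["temperature"])
    if values:
        # Flush the last (possibly still open) window.
        yield {"windowStart": current_window, "avgTemperature": sum(values) / len(values)}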
 
As soon as the transformation is completed, the data stream is ready and can flow into a dashboarding layer such as Power BI, QlikView, Tableau, etc. Naturally, the dashboarding tool needs to be able to cope with a real-time data source.
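As a sketch of this last hop, the transformed results can be forwarded to a dashboard through a simple push call. The example below posts rows to a Power BI streaming dataset over its REST push URL; the URL and the row schema are illustrative assumptions and correspond to whatever streaming dataset you create in Power BI.

import requests

# Placeholder: Power BI provides this push URL when a streaming dataset is created.
PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset-id>/rows?key=<key>"

def push_to_dashboard(rows):
    """rows: list of dicts matching the streaming dataset schema,
    e.g. [{"windowStart": 1700000000, "avgTemperature": 71.5}]."""
    response = requests.post(PUSH_URL, json=rows, timeout=10)
    response.raise_for_status()  # fail loudly if the dashboard did not accept the rows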
As such, the typical streaming set-up would look like this:
 
Figure 2 - Typical real-time streaming architecture implemented in Azure
Available stream computational systems are likewise listed in the vendor overview of Figure 1 (non-exhaustive).

At element61, we believe these new technologies are not intended to be a replacement for the more traditional batch-oriented analytics architectures but rather a complement to the conventional technologies with new features: i.e., real-time analytics and real-time insights.

Summary and conclusion

With the emergence of digitization and IoT, it becomes crucial for some companies to enable faster decision-making through faster data insights. Gone are the days when everyone can wait hours or days to look at insights and performance dashboards.
 
In this insight, we have shown that providing these real-time insights means a change in system architecture for many business intelligence teams. In our expert opinion, a best-practice real-time architecture consists of two main building blocks: a receiving end-point to receive the data streams and a real-time analytics layer to transform the data.
 
We conclude that getting started with real-time analytics isn't complex: all leading vendors as well as open-source providers offer a series of proven solutions and, as these solutions are offered through the cloud, they can be started without upfront investment (on a pay-as-you-go basis).
 
Through the use-case of real-time analytics, we conclude with this paper that new technologies are not intended to be replacements for traditional data warehousing but rather add-ons that provide new features such as real-time analytics and real-time insights.