How to include OT data in a scalable and reliable Data Platform with Azure and Lambda Architecture

OT data in the Modern Data Platform on Azure

Many production companies today run extensive OT architectures that digitalize and manage their production processes. A lot of data flows through this OT setup, yet that data is often undervalued and underused for analytics.

The benefits and added value generated by analytics are well known in the IT department, where a Modern Data Platform processes large amounts of transactional data, such as finance, sales and procurement data, and generates insights for data-driven decision-making.

Our goal is to bring OT data into the IT department by extending the Modern Data Platform to also process real-time OT data, thereby opening the gates to an enormous new source of valuable information for advanced analytics and reporting. For this, we need a reliable architecture: enter the Lambda Architecture.


What is the Lambda Architecture?

Lambda Architecture is a data processing architecture developed to process large amounts of IoT (or other event-based) data both in real time and in batch. The name refers to the Greek letter lambda (λ), whose two branches mirror the two parallel flows: real-time and batch.

The philosophy behind the architecture is that different insights can be extracted from the same data by processing it differently with different algorithms and technologies, and over different time windows.

Take, for example, sensor data in a production facility. In the hot path, it is brought to a time-series visualization tool for operators to monitor the machines. Its value can be augmented, for instance, by applying rule-based or even AI-based anomaly detection. The main insight is: how are my machines functioning right now? In the cold path, this same data is typically aggregated, combined with master data from other sources and used for high-level reporting. The main insight is: how is my facility performing over time?

What are the components of the Lambda Architecture?

The diagram below illustrates the different components required for our architecture to function effectively. It highlights the need for various technologies that must communicate with each other seamlessly. These components include:

  • A data storage system that can store large volumes of historical data
  • A message ingestion hub that can receive and distribute data in real-time
  • A stream processing engine that powers the hot path
  • A batch processing platform that supports big data computations in the cold path and ideally also supports machine learning
  • An analytical data store optimized for big data queries to feed visualization
  • Analytics and reporting tools that can serve insights to end-users.
[Image: lambda-architecture]

For each of these components, the Microsoft Azure cloud provides resources that meet their needs and communicate seamlessly. This article sets out a best-practice implementation in which some of the key components overlap with those of the traditional Lakehouse, essentially bringing OT data into your Modern Data Platform (see also Modern Data Platform | element61).

[Image: lambda-with-azure]


How do you implement Lambda in Azure?

Implementing the Hot Path

Ingestion

πŸ‘‰ Event Hub/IoT Hub as message ingestion hub
Azure Event Hubs is a service designed for streaming ingestion of data with low latency and high reliability. It can stream data from any source to a number of endpoints and is immensely scalable, supporting millions of messages per second.

Messages arriving in an event hub are retained for a configurable period (up to seven days on the Standard tier, and up to 90 days on the Premium and Dedicated tiers), during which they can be consumed by a number of applications, such as other Azure PaaS services like Azure Stream Analytics or Azure Data Explorer, or by custom applications.
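To make this concrete, here is a minimal sketch of how a custom application could publish telemetry to an event hub using the azure-eventhub Python SDK. The connection string, hub name and payload below are hypothetical placeholders, not values from a real setup.

```python
import json

from azure.eventhub import EventData, EventHubProducerClient

# Hypothetical connection details: replace with your own namespace and hub.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"
EVENTHUB_NAME = "machine-telemetry"

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

# Collect a few sensor readings into a batch and send them in one call.
with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"deviceId": "press-01", "temperature": 71.3})))
    batch.add(EventData(json.dumps({"deviceId": "press-02", "temperature": 69.8})))
    producer.send_batch(batch)
```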

If you require bi-directional communication between the cloud and the IoT devices that generate the message data, you can opt for IoT Hub. IoT Hub essentially implements a single event hub and adds a number of features on top that enable users to manage, configure and update edge devices at scale using direct and secure communication. Data-generating devices can send their telemetry to IoT Hub using protocols like HTTPS, MQTT or AMQP, the latter two also over WebSockets.
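For completeness, this is a minimal sketch of a device sending one telemetry message to IoT Hub with the azure-iot-device Python SDK; the device connection string and payload are made-up examples.

```python
import json

from azure.iot.device import IoTHubDeviceClient, Message

# Hypothetical device connection string, taken from the IoT Hub device registry.
CONNECTION_STR = "HostName=<hub>.azure-devices.net;DeviceId=press-01;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STR)

# The SDK communicates over MQTT by default.
message = Message(json.dumps({"deviceId": "press-01", "temperature": 71.3}))
message.content_type = "application/json"
message.content_encoding = "utf-8"

client.send_message(message)
client.shutdown()
```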

In IoT Hub, endpoints can also be configured to Storage Accounts, Cosmos DB, other Event Hubs and more. Once configured, messages can be routed to these endpoints using configurable Routing Queries that match the properties of the messages.


Processing

πŸ‘‰ Azure Data Explorer data connection
The first approach for processing real-time data is doing no processing at all. It is limited to those cases where no analytics is performed on the data and the only goal is getting the messages into Azure Data Explorer (ADX), from where the time series can be visualized in Grafana for purely operational dashboards. The hot path is reduced to a simple data connection between ADX and Event Hub.

A data connection is a property of an ADX database which connects a table of that database to a streaming source such that the messages in the stream get ingested automatically. Data connections can be made to IoT Hub, Event Hub and Cosmos DB.

From the ADX editor, the table needs to be created upfront with the correct schema and data types, along with a data mapping that maps the JSON fields (other formats are also supported) to the respective columns. Once the connection is configured, data is automatically mapped and ingested into the table.
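As a sketch, assuming a cluster, database and schema like the hypothetical ones below, the table and its JSON data mapping could be created with management commands, here executed through the azure-kusto-data Python SDK rather than the ADX editor.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster URI and database name.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net"
)
client = KustoClient(kcsb)

# Create the target table upfront with the correct schema and data types.
client.execute_mgmt(
    "iot-db",
    ".create table Telemetry (DeviceId: string, Timestamp: datetime, Temperature: real)",
)

# Map the incoming JSON fields to the respective columns.
client.execute_mgmt(
    "iot-db",
    ".create table Telemetry ingestion json mapping 'TelemetryMapping' "
    "'["
    '{"column": "DeviceId", "path": "$.deviceId"}, '
    '{"column": "Timestamp", "path": "$.timestamp"}, '
    '{"column": "Temperature", "path": "$.temperature"}'
    "]'",
)
```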

πŸ‘‰ Azure Stream Analytics as stream processing engine
The second approach leverages an Azure Stream Analytics Job as a stream processing engine. Azure Stream Analytics analyzes and processes large volumes of streaming data with very low, sub-second latency.

It has a very straightforward architecture consisting of three components: inputs, outputs and a query.

Inputs configure seamless connections to streaming endpoints like IoT Hub and Event Hub. Outputs configure connections to downstream databases or applications such as Azure Data Explorer. In the query, the analytics operations are programmed in the Stream Analytics Query Language, a SQL-like dedicated syntax.

Several inputs and outputs can be configured and used at the same time. Using WHERE statements in the query, data is routed to the correct outputs. ADX outputs map to a single ADX table that needs to be created upfront. The output of the query must match the schema and the types of the ADX table.
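As an illustration, here is a sketch of what such a query could look like, stored as a Python string for reference. The input and output aliases are hypothetical names configured on the job; the first statement forwards all readings to an ADX table, while the second routes only high readings to a second output.

```python
# Sketch of a Stream Analytics query, as it would be pasted into the job's
# query editor. "IoTHubInput", "AdxOutput" and "AlertsOutput" are hypothetical
# input/output aliases configured on the job; each SELECT list must match the
# schema and types of its target ADX table.
STREAM_ANALYTICS_QUERY = """
SELECT deviceId, temperature, EventEnqueuedUtcTime AS timestamp
INTO AdxOutput
FROM IoTHubInput

SELECT deviceId, temperature, EventEnqueuedUtcTime AS timestamp
INTO AlertsOutput
FROM IoTHubInput
WHERE temperature > 80
"""
```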

If the operations you want to perform on the data are too complex to define in a SQL-like query, Azure Stream Analytics offers a fourth component: Functions.

The Functions component allows users to define custom functions in JavaScript or C#, or to use models from Azure Machine Learning, and apply them to the data stream. This allows for complex operations like machine learning inference, string manipulation, complex mathematics, or encoding and decoding data.

The important limitation is that these functions are stateless and must return scalar values.

πŸ‘‰ Databricks as a stream processing engine
The third approach is to leverage Databricks as a stream processing engine. Databricks offers the Apache Spark Structured Streaming framework, a real-time processing engine that implements end-to-end fault tolerance and exactly-once processing guarantees.

The Structured Streaming framework uses the same API as Spark, allowing developers to write streaming analytics operations the same way they write batch processing operations. This enables the complete programmatic freedom that comes with Spark and general Python programming. The only prerequisite is installing specific Maven packages on the cluster to connect to IoT Hub and Azure Data Explorer respectively.

Virtually any analytics operation is programmable in Spark, and machine learning inference is trivial since the models are trained and validated in the same platform using the same Spark API.

There is no need to create tables and schemas upfront in ADX because this can be done in code from within Databricks.
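A minimal sketch of this pattern follows, assuming the azure-eventhubs-spark and Kusto Spark connector Maven packages are installed on the cluster; every connection string, name and credential below is a placeholder.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

# 'spark' and 'sc' are available as globals in a Databricks notebook.
# Read from the IoT Hub's built-in, Event Hub-compatible endpoint;
# the connector expects the connection string in encrypted form.
conn_str = "<Event Hub-compatible connection string>"
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
}
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The message body arrives as binary; cast it to a string and parse the JSON.
schema = (
    StructType()
    .add("deviceId", StringType())
    .add("timestamp", TimestampType())
    .add("temperature", DoubleType())
)
telemetry = (
    raw.select(from_json(col("body").cast("string"), schema).alias("msg"))
       .select("msg.*")
)

# Write each micro-batch to ADX through the Kusto connector's batch sink.
def write_to_adx(batch_df, batch_id):
    (
        batch_df.write.format("com.microsoft.kusto.spark.datasource")
        .option("kustoCluster", "mycluster.westeurope")
        .option("kustoDatabase", "iot-db")
        .option("kustoTable", "Telemetry")
        .option("kustoAadAppId", "<app-id>")
        .option("kustoAadAppSecret", "<app-secret>")
        .option("kustoAadAuthorityID", "<tenant-id>")
        .mode("Append")
        .save()
    )

stream = (
    telemetry.writeStream
    .foreachBatch(write_to_adx)
    .option("checkpointLocation", "/mnt/checkpoints/telemetry")
    .start()
)
```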

Analytics & visualization

πŸ‘‰ Azure Data Explorer as an analytical data store
Azure Data Explorer is a fully managed, high-performance big data analytics platform. It uses the traditional relational data model, where data is organized in tables with fixed schemas, but unlike SQL-based database technologies it is optimized for log analytics, time series analytics and IoT. An ADX cluster can host multiple databases.

It can handle semi-structured and unstructured data by supporting mapping transformations that happen automatically at ingestion time.

Data is queried in the Kusto Query Language (KQL), a highly optimized language developed by the team behind Azure Data Explorer. The resource comes with a web UI, complete with a query editor, to perform ad-hoc analysis and to develop queries that are used, for example, in Grafana; it also supports creating dashboards.
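For illustration, here is a hypothetical KQL query of the kind a Grafana panel or the web UI could run, executed in this sketch with the azure-kusto-data Python SDK; cluster, database, table and column names are made-up examples.

```python
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.helpers import dataframe_from_result_table

# Hypothetical cluster URI and database name.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net"
)
client = KustoClient(kcsb)

# Average temperature per device in 5-minute bins over the last hour.
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize AvgTemperature = avg(Temperature) by DeviceId, bin(Timestamp, 5m)
| order by Timestamp asc
"""
response = client.execute("iot-db", query)
df = dataframe_from_result_table(response.primary_results[0])
```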

Since ADX is optimized for analytics, it is not meant as a replacement for storage accounts or SQL databases. By default, data is retained for a limited time only. In the framework of our Lambda Architecture in Azure, it only serves as a temporary store for the real-time data and analytics results that are ingested by the streaming engine of the hot path.

πŸ‘‰ Grafana for real-time analytics and reporting
The preferred visualization tool in the hot path is Grafana. It is optimized to visualize time-series data, offers an extensive list of graph types to easily create custom dashboards and is capable of refreshing in real time. Grafana is open source, with a large community that builds extensions and plugins.

Azure has an Azure Managed Grafana resource, making it a plug-and-play component for the Lambda Architecture, especially because Grafana integrates very well with ADX: it supports writing queries in the Kusto language as well as building queries in a no-code, form-based configuration window.

Implementing the Cold Path

Ingestion

πŸ‘‰ Event Hub capture
The implementation of cold path ingestion is as simple as enabling the Capture feature on your Event Hub. At every capture interval, which can be configured from one minute to 15 minutes, the messages of that window are combined into a blob in Avro format and written to a path on the lake built from at least the following parameters: {iothub}, {partition}, {YYYY}, {MM}, {DD}, {HH}, {mm}. These are filled in automatically at write time.

Processing

πŸ‘‰ Databricks as batch processing engine
Databricks fills the role of batch processing platform (see also Databricks | element61). From within Databricks, the captured blobs are loaded using wildcard path formatting, e.g. /path/to/blobs/of/iothub_1/partition_1/2023/12/01/*/*.avro, to get, in this example, a data batch of one day.
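A sketch of such a load in a Databricks notebook follows, assuming capture in Avro format and the hypothetical telemetry schema used earlier; capture blobs store the original message in a binary Body column next to metadata such as EnqueuedTimeUtc.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

# Load one day of captured blobs with a wildcard path.
raw = (
    spark.read.format("avro")
    .load("/path/to/blobs/of/iothub_1/partition_1/2023/12/01/*/*.avro")
)

# Decode the binary message body and parse the JSON payload.
schema = (
    StructType()
    .add("deviceId", StringType())
    .add("timestamp", TimestampType())
    .add("temperature", DoubleType())
)
telemetry = (
    raw.select(from_json(col("Body").cast("string"), schema).alias("msg"))
       .select("msg.*")
)
```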

From that point on, data transformations and analytics can happen as they do for non-real-time data sources in the traditional Data Lakehouse.

The rich set of features available in Databricks allows for doing advanced analytics and machine learning, generating a lot of value from your OT data.

Analytics & visualization

πŸ‘‰ Power BI for analytics and reporting
In a Data Lakehouse, data eventually ends up in the serving layer, where it is visualized in Power BI, which gets its data directly from the gold layer of the data lake (see also Microsoft Power BI: Self-service business intelligence).

What is the difference between Power BI and Grafana?

In the Lambda Architecture, both paths end in visualization. However, it is important to use the right tool for the right purpose.

As mentioned, Grafana is optimized for visualizing time-series data: it processes this type of data very quickly and provides visuals and query builders that make it easy to work with time windows. Power BI is not optimized for this kind of data at all. It works best when it loads all data into memory, and it excels at letting users create drill-through visuals in a self-service fashion. OT data is therefore typically aggregated up to the granularity of transactional context or master data before being used in Power BI reports, where it serves the purpose of generating long-term, trend-based insights rather than real-time, short-term ones.

Conclusion

The Lambda Architecture is a very powerful architecture that allows businesses to get the most out of their IoT data by processing it both in real-time and in batch, extracting different insights in each flow. It consists of a considerable number of components that each fulfil a very specific role, have specific requirements and need to communicate with each other.

The Azure cloud offers implementations for each of these components that meet all requirements and integrate seamlessly with each other. The proposed implementation results in a best-practice, enterprise-ready and scalable solution that clicks into the batch-oriented Lakehouse architecture to form an all-purpose Modern Data Platform.