Microsoft Azure Data Factory

Azure, Microsoft’s cloud offering, has evolved considerably over the last few years, and that evolution continues with many new services. One of them is Azure Data Factory, a cloud-based data processing service that orchestrates data storage, data processing and data flow management.

Azure Data Factory allows you to:

  • Import and combine data from cloud-based, on-premises and internet data sources
  • Apply complex transformations on this data
  • Create trusted information from structured and unstructured data for use in analytical applications
  • Monitor the data flows using a rich graphical interface

Using JSON scripts, data can be imported from many different kinds of data sources into a data hub, which uses Azure Blob Storage in an HDInsight cluster. This HDInsight cluster is part of the Data Factory service, so it does not have to be installed or configured separately.

The data in the data hub can be transformed and enriched using Pig, Hive and C# scripts.

Through JSON scripts, the transformed data can then be published to another data store for use in, for example, analytical applications.
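
As a rough illustration of the import, transform and publish steps described above, the sketch below builds a pipeline definition as a Python dictionary and serializes it to JSON. Everything in it is an assumption made for illustration: the function names `build_pipeline` and `is_chained`, the field layout, and the dataset and script names are hypothetical and do not reproduce the actual Data Factory JSON schema, which is described in the official documentation.

```python
import json


def build_pipeline(name, source, hub, target, script):
    """Return a hypothetical three-step pipeline definition as a dict.

    The field names are illustrative only; they are NOT the real
    Azure Data Factory schema.
    """
    cleaned = hub + "_clean"
    return {
        "name": name,
        "activities": [
            # Step 1: copy raw data from the source into the data hub.
            {"name": "import", "type": "Copy",
             "input": source, "output": hub},
            # Step 2: transform the data in the hub with a Hive script.
            {"name": "transform", "type": "Hive",
             "input": hub, "output": cleaned, "script": script},
            # Step 3: publish the transformed data to the target store.
            {"name": "publish", "type": "Copy",
             "input": cleaned, "output": target},
        ],
    }


def is_chained(pipeline):
    """Check that each activity reads what the previous one wrote."""
    acts = pipeline["activities"]
    return all(a["output"] == b["input"] for a, b in zip(acts, acts[1:]))


# All names below are made up for the example.
pipeline = build_pipeline(
    name="SalesPipeline",
    source="OnPremSalesTable",
    hub="SalesHub",
    target="SalesReportingTable",
    script="enrich_sales.hql",
)
pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

The `is_chained` check captures the one property the sketch is meant to show: each step consumes the output of the previous one, so the data flows from the source, through the data hub, to the published data store.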

Microsoft Azure Data Factory Architecture

To import from or export to on-premises data stores, a Data Management Gateway needs to be configured. This software is installed within the same network as the on-premises server and handles the secure connection to the cloud.

The following data flows are currently supported:

Microsoft Azure Data Factory - data flows

Azure Data Factory offers a graphical overview of how data is stored, processed and moved (in pipelines) between the different linked services. Pipelines and services can also be maintained through this interface, for example to restart a data load.

Microsoft Azure Data Factory Screenshot 1

Microsoft Azure Data Factory Screenshot 2

There are two kinds of linked services:

  • A Data Storage Service to store data: Azure, on-premises databases, file systems …
  • A Compute Service to transform and enrich data: Azure HDInsight, Azure Machine Learning …
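
The distinction between the two kinds of linked services can be modelled in a few lines. This is only a sketch under stated assumptions: `ServiceKind`, `LinkedService` and `needs_gateway` are hypothetical names introduced here, not part of any Data Factory SDK, and the example service names are made up. The rule encoded in `needs_gateway` follows the text above, where on-premises data stores are reached through a Data Management Gateway.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceKind(Enum):
    STORAGE = "storage"  # stores data: Azure, on-premises databases, file systems
    COMPUTE = "compute"  # transforms data: Azure HDInsight, Azure Machine Learning


@dataclass(frozen=True)
class LinkedService:
    """Hypothetical description of a linked service (illustration only)."""
    name: str
    kind: ServiceKind
    on_premises: bool = False


def needs_gateway(service):
    """On-premises data stores require a Data Management Gateway."""
    return service.kind is ServiceKind.STORAGE and service.on_premises


# Example services; the names are invented for this sketch.
services = [
    LinkedService("BlobStore", ServiceKind.STORAGE),
    LinkedService("WarehouseDb", ServiceKind.STORAGE, on_premises=True),
    LinkedService("HiveCluster", ServiceKind.COMPUTE),
]

gateway_services = [s.name for s in services if needs_gateway(s)]
print(gateway_services)
```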

Data Factory pricing is based on usage: the number of “activities” (data processing steps) per month. Separate rates apply to low and heavy usage, and to cloud and on-premises data.

Conclusion

Data Factory makes it easy to integrate cloud data with on-premises data. Processing in an HDInsight cluster, using Pig and Hive scripts, provides powerful computation to transform and enrich complex data. Currently there is no SSIS-like designer for building data flows. Because of this, Data Factory won’t immediately replace Integration Services, but the popularity of cloud storage and the lack of support for cloud data in standard BI tools make Data Factory a welcome addition to Microsoft’s cloud offering. Today the service offers the highest value in “big data” projects, especially where the “variety” and “velocity” of the data sources matter and this data needs to be enriched with data from a corporate data warehouse, or vice versa.

Azure Data Factory is currently in preview. More information is available at http://azure.microsoft.com/en-us/services/data-factory/

Contact us for more information on Azure Data Factory!
