Azure Data Factory

 

Azure Data Factory is Microsoft’s cloud data integration service on the Azure platform. It brings together data from many different sources and is well suited to building hybrid extract-transform-load (ETL), extract-load-transform (ELT) and data integration pipelines.

What does Azure Data Factory do?

It allows you to:

  • Copy data from many supported sources, both on-premises and in the cloud
  • Transform the data (see the sections below)
  • Publish the copied and transformed data to a destination data store or analytics engine
  • Monitor the data flows through a rich graphical interface
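
As a rough illustration of the copy and transform steps above, the sketch below builds a pipeline definition as a plain Python dictionary, following the general JSON shape Data Factory uses for pipelines. All pipeline, dataset, notebook and linked-service names are hypothetical placeholders, and the source, sink and transform activity types are assumptions chosen for the example. Nothing is deployed; the script only prints the definition.

```python
import json


def build_pipeline(name, source_dataset, sink_dataset, notebook_path):
    """Return a pipeline definition: a copy step followed by a transform step."""
    copy_step = {
        "name": "CopyOnPremToBlob",
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        # Source and sink types are assumptions made for this example.
        "typeProperties": {
            "source": {"type": "SqlServerSource"},
            "sink": {"type": "BlobSink"},
        },
    }
    transform_step = {
        "name": "TransformInDatabricks",
        "type": "DatabricksNotebook",
        # Start only after the copy step has succeeded.
        "dependsOn": [
            {"activity": copy_step["name"], "dependencyConditions": ["Succeeded"]}
        ],
        "typeProperties": {"notebookPath": notebook_path},
        # Hypothetical linked service pointing at a Databricks workspace.
        "linkedServiceName": {
            "referenceName": "HypotheticalDatabricks",
            "type": "LinkedServiceReference",
        },
    }
    return {"name": name, "properties": {"activities": [copy_step, transform_step]}}


pipeline = build_pipeline(
    name="DailySalesLoad",
    source_dataset="OnPremSalesTable",
    sink_dataset="StagingBlob",
    notebook_path="/hypothetical/transform_sales",
)
print(json.dumps(pipeline, indent=2))
```

In practice the same definition is usually authored in the graphical interface; writing it out by hand mainly helps to see how activities, datasets and dependencies relate to each other.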

What doesn’t Azure Data Factory do?

Data Factory isn’t SSIS (SQL Server Integration Services) in the cloud. It has fewer database-specific features and focuses on broader data transformation and movement, including large datasets and data lake operations.

Data Factory can, however, run your SSIS packages in the cloud (once they have been built in SSIS). This lets you combine Data Factory’s scalability with SSIS’s advanced ETL features.

Why do I need Azure Data Factory?

Data Factory is an enabler for any cloud project. Almost every cloud project requires moving data across networks (on-premises and cloud) and across services (for example, to and from different Azure storage services).

Data Factory is especially important for organizations that are taking their first steps in the cloud and therefore need to connect on-premises data with the cloud. For this, Azure Data Factory provides the Integration Runtime, a gateway service that can be installed on-premises and that ensures performant and secure data transfer to and from the cloud.

How does it differ from other ETL Tools?

Data Factory is one option for a cloud ETL (or ELT) tool. Several features distinguish Azure Data Factory from other tools:

  • It can run SSIS packages
  • It auto-scales with the given workload (it is a fully managed PaaS product)
  • It can run pipelines as often as once per minute
  • It bridges on-premises and Azure cloud environments through a gateway
  • It can handle big data volumes
  • It can connect to and work with other compute services (Azure Batch, HDInsight) to run truly big data computations during ETL

In our experience, the best alternative to Azure Data Factory is Apache Airflow, which has its own advantages and disadvantages. Contact us for more details.

How do I work with Azure Data Factory?

Azure Data Factory is mainly used through a graphical user interface in which you create and manage activities and pipelines. It doesn’t require coding skills, although complex transformations do require Azure Data Factory experience.

Important features:

  • Azure Data Factory has built-in connectors for nearly all common on-premises data sources, including MySQL, SQL Server and Oracle databases

  • Azure Data Factory supports branching, where the outcome of one activity can trigger the start of another activity.
    - e.g. first copy the data from on-premises to Blob, then merge all blobs
  • Azure Data Factory supports tumbling window triggers and event triggers. The first is particularly relevant for creating partitioned data, for example in a Data Lake set-up where your data is stored automatically in daily partitioned blobs (e.g. YYYY/MM/DD/Blob.csv).
    An event trigger applies when an event, such as a new blob arriving in Blob Storage, should automatically start a transformation.
  • Azure Data Factory supports parameters, so values can be passed dynamically between datasets, pipelines and triggers. For example, the destination file can be named after the pipeline or after the date of the data slice.
  • Azure Data Factory can run a pipeline as often as once per minute. It therefore doesn’t offer real-time processing, but it does enable close to real-time processing.
  • Azure Data Factory provides monitoring and alerting. The execution of the different pipelines can easily be monitored through the UI, and you can set up alerts (linked to Azure Monitor) if anything fails.

  • Azure Data Factory can work well with Azure Databricks to schedule ML algorithms. Read more about this in this insight.
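
To make the tumbling window idea from the list above more concrete, here is a minimal sketch in plain Python (not Data Factory code) that splits a time range into contiguous, non-overlapping windows and derives a daily partition path of the form YYYY/MM/DD/Blob.csv for each one. The function names and dates are hypothetical. Inside Data Factory itself, a comparable path would typically be built with an expression along the lines of `@formatDateTime(trigger().outputs.windowStartTime, 'yyyy/MM/dd')`.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, interval):
    """Yield (window_start, window_end) pairs that are contiguous and never overlap."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    window_start = start
    # Only complete windows are emitted; a partial window at the end is skipped.
    while window_start + interval <= end:
        yield window_start, window_start + interval
        window_start += interval


def partition_path(window_start, file_name="Blob.csv"):
    """Build a daily partition path such as 2024/01/30/Blob.csv."""
    return f"{window_start:%Y/%m/%d}/{file_name}"


windows = list(
    tumbling_windows(
        start=datetime(2024, 1, 30),
        end=datetime(2024, 2, 2),
        interval=timedelta(days=1),
    )
)
paths = [partition_path(window_start) for window_start, _ in windows]
print(paths)
```

Because each window has a fixed start and end, every run writes to exactly one partition, which makes reruns of a single day straightforward.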

How does Azure Data Factory work with other Azure resources?

One of the main advantages of Azure Data Factory is that it integrates well with other Azure compute and storage resources. This is exactly what linked services are for: they define the connection to external resources. There are two kinds of linked services you can define:

  • A Data Store Service connecting to, for example, Azure SQL Database, Azure SQL Data Warehouse, an on-premises database, a Data Lake, a file system or a NoSQL DB
  • A Compute Service to transform and enrich data: e.g., Azure HDInsight, Azure Machine Learning, a stored procedure in any SQL database, a Data Lake Analytics U-SQL activity, Azure Databricks and/or Azure Batch (using a Custom Activity)
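
The two kinds of linked services can be sketched as follows. The snippet builds one data store and one compute linked service as Python dictionaries in the general JSON shape Data Factory uses. The service names and all values in angle brackets are hypothetical placeholders, and the chosen type properties are assumptions for illustration rather than a complete configuration.

```python
import json


def linked_service(name, service_type, type_properties):
    """Return a linked service definition in the name/properties/typeProperties layout."""
    return {
        "name": name,
        "properties": {"type": service_type, "typeProperties": type_properties},
    }


# Data store linked service: where data is read from or written to.
data_store = linked_service(
    name="StagingBlobStorage",  # hypothetical name
    service_type="AzureBlobStorage",
    type_properties={"connectionString": "<storage-connection-string>"},
)

# Compute linked service: where data is transformed and enriched.
compute = linked_service(
    name="TransformDatabricks",  # hypothetical name
    service_type="AzureDatabricks",
    type_properties={
        "domain": "<workspace-url>",
        "existingClusterId": "<cluster-id>",
    },
)

for service in (data_store, compute):
    print(json.dumps(service, indent=2))
```

Secrets such as connection strings or access tokens would normally not be written inline like this; they are shown as placeholders only.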

Data Factory pricing is usage-based: you pay for the number of “activities” (data processing steps) per month, and integration runtime usage is charged per hour, depending on the machine type and the number of nodes used.
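
The usage-based model can be expressed as simple arithmetic. The sketch below is only an illustration of the structure described above: the function name is hypothetical and the rates are made-up placeholder numbers, not actual Azure prices, so check the current price list before estimating a real budget.

```python
def estimate_monthly_cost(
    activity_runs,
    price_per_1000_runs,
    runtime_hours,
    nodes,
    price_per_node_hour,
):
    """Estimate a monthly bill from activity runs plus integration runtime hours."""
    # Orchestration part: charged on the number of activity runs in the month.
    orchestration = activity_runs / 1000 * price_per_1000_runs
    # Runtime part: charged per hour, for every node that is running.
    runtime = runtime_hours * nodes * price_per_node_hour
    return round(orchestration + runtime, 2)


# Placeholder rates for illustration only; these are NOT real Azure prices.
example_cost = estimate_monthly_cost(
    activity_runs=30_000,
    price_per_1000_runs=1.0,
    runtime_hours=100,
    nodes=2,
    price_per_node_hour=0.5,
)
print(example_cost)
```

With these placeholder rates, the runtime hours dominate the total, which illustrates why the node count and running hours of an integration runtime are usually worth watching.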

Should I use Azure Data Factory or SSIS?

Use the right tool for the right purpose. The overview below shows that the two are complementary. They are also built that way: Azure Data Factory can deploy, manage and run SSIS packages in managed Azure-SSIS Integration Runtimes.

Based on your current platform/solution:

                               Hybrid on-prem     Azure      On-prem-only
                               & Azure solution   solution   solution
Azure Data Factory (ADF V2)    Yes                Yes        No
Integration Services (SSIS)    Yes                Yes        Yes

Based on type of data:

                               Small data   Close to real-time data   Big data
                                            (every minute)
Azure Data Factory (ADF V2)    Yes          Yes                       Yes
Integration Services (SSIS)    Yes          No                        No
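
The two overviews above can also be read as a small lookup. The sketch below encodes the Yes/No values exactly as listed and returns the tools that fit a given platform and data type; the key names and the function are hypothetical and exist only for this illustration.

```python
# Yes/No values copied from the two overviews above (True = Yes, False = No).
SUPPORT = {
    "Azure Data Factory (ADF V2)": {
        "hybrid": True,
        "azure": True,
        "on_prem_only": False,
        "small_data": True,
        "near_real_time": True,
        "big_data": True,
    },
    "Integration Services (SSIS)": {
        "hybrid": True,
        "azure": True,
        "on_prem_only": True,
        "small_data": True,
        "near_real_time": False,
        "big_data": False,
    },
}


def suitable_tools(platform, data_type):
    """Return the tools marked Yes for both the platform and the data type."""
    return [
        tool
        for tool, support in SUPPORT.items()
        if support[platform] and support[data_type]
    ]


print(suitable_tools("hybrid", "big_data"))
print(suitable_tools("on_prem_only", "small_data"))
print(suitable_tools("azure", "small_data"))
```

A combination that returns an empty list, such as on-premises only with big data, means neither tool is marked Yes for both criteria in the overviews.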

 

Conclusion

Data Factory makes it easy to integrate cloud data with on-premises data. It stands out for its ease of use combined with its ability to transform and enrich complex data. It delivers data integration that is scalable, highly available and low-cost. Today, this service is a crucial building block in any data platform and machine learning project.

element61 offers hands-on expertise with Azure Data Factory and has implemented Data Factory set-ups for various clients and use cases. We can help organizations with our in-depth understanding of ADF concepts and our experience in building end-to-end implementations.
