Creating a data lakehouse with Drag-and-Drop ETL in Microsoft Fabric

Microsoft Fabric is Microsoft's new analytics platform that integrates data warehousing, data integration and orchestration, data engineering, data science, real-time analytics, and business intelligence into a single product. This one-stop shop for analytics brings a set of innovative features and capabilities, providing a transformative tool for data professionals and business users alike. With Microsoft Fabric, we can leverage the power of data and AI to transform organizations and create new opportunities from data.

For more Microsoft Fabric topics 👉 Microsoft Fabric | element61

Data Factory and Data Engineering are two of the main building blocks in Microsoft Fabric's unified analytics platform.

This article focuses on the various data transformation techniques within the Data Factory and Data Engineering experiences in Microsoft Fabric and explains which technique best fits which scenario. Since this article concentrates on the technicalities of data transformations, a follow-up article will cover the positioning of each of these tools in an enterprise context, self-service ETL/ELT and overall orchestration.

How do I kickstart my ETL/ELT in Microsoft Fabric?

To kickstart your data transformation journey in the Microsoft Fabric workspace, you must first activate the Microsoft Fabric experience, either by enabling it for your organization or by starting a free trial.

Once Microsoft Fabric is enabled, you will find in the bottom-left corner of the workspace either the Data Factory experience or the Data Engineering experience. Each experience presents the items most relevant to the corresponding persona. Both the Data Factory and the Data Engineering experience provide options to easily create drag-and-drop ETL/ELT data pipelines.

In the Microsoft Fabric workspace, there are two options to get started with a first data pipeline or dataflow:

Data pipelines are primarily inspired by standalone Azure Data Factory pipelines, while Microsoft Fabric Dataflows Gen2 is an extension of Power BI Dataflows Gen1, but much more powerful. Dataflows Gen1 is Power BI's self-service data transformation technique to prepare data closer to business requirements.

We have made a comprehensive comparison of the three tools - Data Factory pipelines, Dataflows Gen2 and Dataflows Gen1 in one of the following sections.

A key takeaway is that Microsoft Fabric unifies the experience of Power Query-based dataflows and Data Factory-based pipelines in one platform, maximizing transformation capabilities while being more user-friendly 💡

What are the differences between Power BI Dataflows Gen1 and Dataflows Gen2?

Microsoft Fabric Dataflows Gen2 is an extension of Power BI Dataflows Gen1. While both support Power Query authoring and offer Get Data capabilities through connectors, the key difference is that Dataflows Gen2 can now write data to various output destinations, which allows the transformation and loading steps to be isolated from each other.
Another difference is that Dataflows Gen2 can be used in combination with Data Factory pipelines within Microsoft Fabric, enabling us to perform data transformations that are reusable in a data engineering scenario. There are 150+ data sources available for Dataflows Gen2 versus 30+ for Dataflows Gen1. Since Dataflows Gen2 is still in preview, and it will in any case be possible to migrate from Gen1 to Gen2 later, it is still advisable to use Dataflows Gen1 where it fits the need, for example for Power BI datasets or other reusability scenarios.

| Feature | Dataflows Gen2 | Dataflows Gen1 |
|---|---|---|
| Author dataflows with Power Query | ✓ | ✓ |
| Shorter authoring flow | ✓ | |
| Auto-Save and background publishing | ✓ | |
| Output destinations | ✓ | |
| Improved monitoring and refresh history | ✓ | |
| Integration with data pipelines | ✓ | |
| High-scale compute | ✓ | |
| Get Data via Dataflows connector | ✓ | ✓ |
| Direct Query via Dataflows connector | | ✓ |
| Incremental refresh | | ✓ |
| AI Insights support | | ✓ |

What are the differences between Power BI Dataflows and Data Factory pipelines?

Power BI Dataflows and Azure Data Factory pipelines are both tools for data transformation and integration, but they have some key differences. Power BI Dataflows are designed for self-service data preparation and analysis, while Azure Data Factory pipelines are more suitable for complex and scalable data engineering scenarios. Power BI Dataflows use the Power Query Online service to perform data transformations, while Azure Data Factory pipelines can use various activities and connectors to orchestrate data movement and processing across different sources and destinations. Microsoft has documented the differences quite well here.

Some of the notable differences are:

  • There is no Mapping Data Flow yet, but we have Dataflows Gen2.
  • Triggers are now called schedules.
  • There is no Integration Runtime; instead of the Self-Hosted Integration Runtime, we now have the on-premises data gateway.
  • Linked services are now called connections.
  • The dataset concept no longer exists.

How do you create your first Data Factory pipeline?

A Data Factory pipeline is a logical grouping of activities that performs a specific task. For example, a pipeline can copy data from one source to another, transform data, or run a machine learning model. A pipeline can also have parameters, variables, and triggers that define its execution logic and dependencies. We will describe three techniques to create ingestion or transformation pipelines:
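The pipeline concept described above, a named group of activities whose dependencies determine execution order, can be sketched in plain Python. This is purely a conceptual illustration, not Fabric's actual API; the activity names ("CopyData", "Transform") and the topological-sort runner are our own invented stand-ins:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_pipeline(activities, dependencies, parameters):
    """Execute activities in dependency order, threading results through a shared context."""
    context = {"parameters": parameters}
    for name in TopologicalSorter(dependencies).static_order():
        context[name] = activities[name](context)
    return context

# Two toy activities: "CopyData" ingests raw values, "Transform" cleans them up.
activities = {
    "CopyData": lambda ctx: [c.strip() for c in ctx["parameters"]["raw"]],
    "Transform": lambda ctx: sorted(set(ctx["CopyData"])),
}
dependencies = {"Transform": {"CopyData"}}  # Transform runs only after CopyData

result = run_pipeline(activities, dependencies, {"raw": [" Ghent", "Brussels", "Ghent "]})
print(result["Transform"])  # ['Brussels', 'Ghent']
```

The dependency dictionary plays the role of the "on success" arrows you draw between activities on the pipeline canvas.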

Technique 1: Create a pipeline using the Data Factory activities manually

Data Factory activities are the basic units of work that perform data operations. They can be used to copy data, transform data, run pipelines, execute queries, and more. They can be grouped into four categories: data movement, data transformation, control, and orchestration.

The Microsoft Fabric developer's experience for creating the pipeline is depicted below with the key findings:

  • We can create a pipeline in much the same way as we would in Azure Data Factory.
  • Almost all activities remain the same.
  • Calling a Databricks notebook is not yet possible.
  • One powerful addition 💪 is the "Office 365" activity, which notifies you when an event occurs.
  • On the other hand, it is not yet possible to edit the pipeline's JSON source code, but this will be released in the future.

How do I connect to the data sources?

Similar to linked services in Azure Data Factory, we now have connections in the Fabric connection settings. Connections allow us to connect to external sources to read data and to write data to destinations.

Technique 2: Create a pipeline using the Copy Data Assistant

The Copy Data Assistant is a guided tool that helps you create a copy pipeline. It lets you specify the source and destination of your data:

  • If we have a scenario where ad hoc data needs to be copied quickly to Azure, we can use the Copy Data Assistant as shown below.
  • It can copy data from a variety of sources and write to a multitude of destinations.

Technique 3: Select a pipeline from the templates

Microsoft made sure we also have templates, as already exist in Data Factory and Synapse, so we can accelerate our development. The pipeline template feature is a way to create and reuse pipelines with predefined activities and parameters. It allows you to quickly build and deploy pipelines for common scenarios, such as data ingestion, data transformation, or data movement.

Before using a template, we create the connections it needs. Templates come with pre-built activities that can fulfill our transformation needs.

How do you create your first Dataflows Gen2 transformation?

Dataflows Gen2 is an excellent addition 🙌 to the data engineering experience: you can now leverage the dataflow transformation capabilities within Data Factory. The three main steps (E-T-L) are depicted below:

  • Extract: get data from 150+ sources
  • Transform: choose from 300+ transformations
  • Load: choose one of the 4 destination types available for now
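As a mental model, the three stages above can be sketched in plain Python. The in-memory CSV, the filter/derive step and the plain list are hypothetical stand-ins for a real connector, the Power Query transformations and a Lakehouse table respectively:

```python
import csv
import io

def extract(source_csv):
    # E: read rows from a source (an in-memory CSV standing in for one of the 150+ connectors)
    return list(csv.DictReader(io.StringIO(source_csv)))

def transform(rows):
    # T: filter rows and derive a new column, as a Power Query step would
    return [{**r, "revenue_cents": int(r["revenue"]) * 100}
            for r in rows if r["country"] == "BE"]

def load(rows, destination):
    # L: append the rows to the chosen destination (a list standing in for a Lakehouse table)
    destination.extend(rows)
    return destination

source = "country,revenue\nBE,100\nNL,200\nBE,50\n"
lakehouse_table = []
load(transform(extract(source)), lakehouse_table)
print(len(lakehouse_table))  # 2 rows kept after the country filter
```

Keeping the three stages as separate functions mirrors what Dataflows Gen2 makes possible: the transformation logic is isolated from where the data is read and where it is written.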

One interesting scenario: once a dataflow is created, we can use it in a data pipeline to perform the necessary transformation. Interestingly, we are combining data extraction from Data Factory with Power Query data transformations, unifying two worlds in one go! 💡


Dataflows Gen2 Sources

Dataflow sources are the systems from which a dataflow reads its data. Dataflows Gen2 can load data from a wide variety of sources such as on-premises systems, Azure, Dataverse, APIs, other cloud storage and many more (150+ sources).

What if you already created Dataflows Gen1 transformations?

There will be scenarios where a lot of development has already happened in Power BI self-service Dataflows Gen1 and the transformations now need to be migrated to Fabric 💡
We can reuse the Dataflows Gen1 effort with the two techniques described below:

Technique 1 - Export and Import

We can simply export an existing Gen1 dataflow to a Power Query template (.pqt) file and import it into a Gen2 dataflow.

Technique 2 - Edit Power Query and copy-paste

The query of a Gen1 dataflow can simply be copied and pasted into the advanced editor of a Gen2 dataflow, as shown below.