Creating a data lakehouse with Drag-and-Drop ETL in Microsoft Fabric

Microsoft Fabric is Microsoft's new analytics platform that integrates data warehousing, data integration and orchestration, data engineering, data science, real-time analytics, and business intelligence into a single product. This one-stop shop for analytics brings a set of innovative features and capabilities, giving data professionals and business users a truly transformative tool. With Microsoft Fabric, organizations can leverage the power of data and AI to transform themselves and create new opportunities from their data.

For more Microsoft Fabric topics 👉 Microsoft Fabric | element61

Data Factory and Data Engineering are two of the main building blocks in Microsoft Fabric's unified analytics platform.

This article focuses on the various data transformation techniques within the Data Factory and Data Engineering experiences in Microsoft Fabric and explains which technique best fits which scenario. Since this article concentrates mainly on the technicalities of data transformations, a follow-up article will cover the positioning of each of these tools in an enterprise context, self-service ETL/ELT and overall orchestration.

How do I kickstart my ETL/ELT in Microsoft Fabric?

To kickstart your data transformation journey in the Microsoft Fabric workspace, you must first activate the Microsoft Fabric experience, either by enabling it for your organization or by starting a free trial.

Once you have enabled the Microsoft Fabric workspace, you can switch in the bottom-left corner to either the Data Factory experience or the Data Engineering experience. Each experience presents the options most closely related to the items for the corresponding persona, and both provide options to easily create drag-and-drop ETL/ELT data pipelines.

In the Microsoft Fabric workspace there are two options to get started with your first data pipeline or dataflow:

Data pipelines are primarily inspired by standalone Azure Data Factory pipelines, while Microsoft Fabric Dataflows Gen2 is an extension of Power BI Dataflows Gen1, but much more powerful. Dataflows Gen1 is Power BI's self-service data transformation technique to prepare data closer to business requirements.

We have made a comprehensive comparison of the three tools - Data Factory pipelines, Dataflows Gen2 and Dataflows Gen1 in one of the following sections.

A key takeaway is that Microsoft Fabric has unified the experience of Power Query-based Dataflows and Data Factory-based pipelines in one platform to maximize the transformation capabilities while being more user-friendly 💡

What are the differences between Power BI Dataflows Gen1 and Dataflows Gen2?

Microsoft Fabric Dataflows Gen2 is an extension of Power BI Dataflows Gen1. While both support Power Query authoring and offer Get Data capabilities through the Get Data connectors, the key difference is that Dataflows Gen2 can now write data into various output destinations, which allows the transformations and the loading to be isolated from each other.
Another difference is that Dataflows Gen2 can be used in combination with Data Factory pipelines within Microsoft Fabric, enabling us to perform data transformations that are reusable in a data engineering scenario. There are 150+ data sources available for Dataflows Gen2 versus 30+ for Dataflows Gen1. Since Dataflows Gen2 is still in preview, and migrating from Gen1 to Gen2 will be possible later anyway, it is still advisable to use Dataflows Gen1 where it covers the need, for example for Power BI datasets or other reusability scenarios.

| Feature | Dataflows Gen2 | Dataflows Gen1 |
| --- | :---: | :---: |
| Author Dataflows with Power Query | ✓ | ✓ |
| Shorter authoring flow | ✓ | |
| Auto-Save and background publishing | ✓ | |
| Output destinations | ✓ | |
| Improved monitoring and refresh history | ✓ | |
| Integration with data pipelines | ✓ | |
| High-scale compute | ✓ | |
| Get Data via Dataflows connector | ✓ | ✓ |
| Direct Query via Dataflows connector | | ✓ |
| Incremental refresh | | ✓ |
| AI Insights support | | ✓ |

What are the differences between Power BI Dataflows and Data Factory pipelines?

Power BI Dataflows and Azure Data Factory pipelines are both tools for data transformation and integration, but they have some key differences. Power BI Dataflows are designed for self-service data preparation and analysis, while Azure Data Factory pipelines are more suitable for complex and scalable data engineering scenarios. Power BI Dataflows use the Power Query Online service to perform data transformations, while Azure Data Factory pipelines can use various activities and connectors to orchestrate data movement and processing across different sources and destinations. Microsoft has documented the differences quite well here.

Some of the notable differences are:

  • There is no Mapping Data Flow yet, but we have Dataflows Gen2 instead.
  • Triggers are now called schedules.
  • There is no Integration Runtime; instead of the Self-Hosted Integration Runtime we now have the on-premises data gateway.
  • Linked services are now called connections.
  • The dataset concept no longer exists.

How do you create your first Data Factory pipeline?

A Data Factory pipeline is a logical grouping of activities that together perform a specific task. For example, a pipeline can copy data from one source to another, transform data, or run a machine learning model. A pipeline can also have parameters, variables, and triggers that define its execution logic and dependencies. We will describe three techniques to create ingestion or transformation pipelines:

Technique 1: Create a pipeline using the Data Factory activities manually

Data Factory activities are the basic units of work that perform data operations in Azure Data Factory. They can be used to copy data, transform data, run pipelines, execute queries, and more. Data Factory activities can be grouped into four categories: data movement, data transformation, control, and orchestration.

The Microsoft Fabric developer's experience for creating the pipeline is depicted below with the key findings:

  • We can create a pipeline in a similar way to how we would create it in Azure Data Factory.
  • Almost all activities remain the same.
  • Calling a Databricks notebook is not yet possible.
  • One powerful addition 💪 is the "Office 365" activity which enables you to get notified in case of an event.
  • On the other hand, it's not yet possible to edit the pipeline representation in JSON source code, although this will be released in the future; for reference, a sketch of what such a JSON definition looks like in standalone Azure Data Factory follows below.
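
For readers familiar with standalone Azure Data Factory, the following minimal sketch illustrates roughly what a pipeline's JSON definition looks like there, expressed as a Python dictionary for illustration. All names are hypothetical, and note that it still uses dataset references, a concept Fabric replaces with connections.

```python
import json

# Illustrative only: a simplified, hypothetical pipeline definition in the
# style of standalone Azure Data Factory JSON. Fabric does not expose this
# JSON for editing yet; every name below is made up.
pipeline = {
    "name": "CopySalesToLakehouse",
    "properties": {
        "parameters": {
            # Pipelines can be parameterized, e.g. with a load date
            "loadDate": {"type": "string"},
        },
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",  # the classic copy activity
                # Standalone ADF references datasets here; in Fabric the
                # dataset concept is gone and connections are used instead
                "inputs": [{"referenceName": "SourceSales", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "LakehouseSales", "type": "DatasetReference"}],
            }
        ],
    },
}

print(json.dumps(pipeline, indent=2))
```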

How do I connect to the data sources?

Similar to linked services in Azure Data Factory, we now have connections, found in the Fabric connection settings. Connections allow us to connect to external sources to read data and to write data to destinations.

Technique 2: Create a pipeline using the Copy Data Assistant

The Copy Data Assistant, familiar from Azure Data Factory, is a tool that helps you create and manage data pipelines. It allows you to specify the source and destination of your data:

  • If we have a scenario where ad hoc data needs to be copied to Azure quickly, we can make use of the Copy Data Assistant as shown below.
  • It can copy data from a variety of sources and write to a multitude of destinations.

Technique 3: Select a pipeline from the templates

Microsoft made sure we also have templates, like the ones already available in Data Factory and Synapse, so we can accelerate our development. The pipeline template feature in Azure Data Factory is a way to create and reuse pipelines with predefined activities and parameters. It allows you to quickly build and deploy pipelines for common scenarios, such as data ingestion, data transformation, or data movement.

Before we can make use of a template, we need to create the connections required by the template. Templates come with pre-built activities that can fulfill our transformation needs.

How do you create your first Dataflows Gen2 transformation?

Dataflows Gen2 is an excellent addition 🙌 to the data engineering experience: you can now leverage the dataflows transformation capabilities within Data Factory. The 3 main steps (E-T-L) are depicted below:

  • 1st step - Get Data from 150+ sources
  • Then we can choose from 300+ transformations
  • At the end we can choose from the 4 types of destinations available for now

One interesting scenario: once a dataflow is created, we can use it in our data pipeline to do the necessary transformation. Interestingly, we are combining data extraction from Data Factory with Power Query data transformations here, unifying two worlds in one go! 💡

 

Dataflows Gen2 Sources

Dataflows sources are the data sources that can be used to create dataflows. Dataflows Gen2 can load data from a wide variety of sources: on-premises systems, Azure, Dataverse, APIs, other cloud storage services and many more (150+ sources).

What if you already created dataflows gen1 transformations?

There will be scenarios where a lot of development has already happened in Power BI self-service Dataflows Gen1 and those transformations now have to be migrated to Fabric 💡
We can reuse the Dataflows Gen1 effort with the 2 techniques described below:

Technique 1 - Export and Import

We can simply export the existing Gen1 dataflows as a PQT file and import them back into Gen2 dataflows.

Technique 2 - Edit Power Query and copy-paste

The query of a Gen1 dataflow can simply be copied and pasted into the advanced editor of a Gen2 dataflow, as shown below.

A Quick look at the Spark Notebook

Since the focus of this article is drag-and-drop ETL/ELT for data transformation, we will not put much emphasis on code-based data transformations. Spark notebooks are web-based files that contain code to ingest, transform and load data, create visualizations, and add narrative text. You should be aware that we can leverage a previously created notebook in our Data Factory pipelines should there be a need for it.

As we can see in the following diagram, it is possible to add a notebook to the pipeline. A notebook is also handy for quickly performing some ad-hoc data analysis.
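
To give a flavour of what such a notebook can contain, here is a minimal PySpark sketch. It assumes a Lakehouse is attached to the notebook and contains a hypothetical Delta table called raw_sales with illustrative column names; the spark session object is predefined in Fabric notebooks.

```python
from pyspark.sql import functions as F

# Read a table from the attached Lakehouse
# ("raw_sales" is a hypothetical table name)
raw_sales = spark.read.table("raw_sales")

# Apply a simple column-level transformation and filter
cleaned = (
    raw_sales
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("amount") > 0)
)

# Write the result back to the Lakehouse as a Delta table
cleaned.write.mode("overwrite").format("delta").saveAsTable("cleaned_sales")
```

The same notebook can then be called from a Data Factory pipeline via the notebook activity, as shown in the diagram.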

How do you choose between Data Factory pipelines, Dataflows Gen2 and Spark Notebooks?

👉 Persona-based selection

The transformation can be done with simple data pipeline activities, from within a stored procedure or a script. The transformations can also be done with Dataflows Gen2 or in a notebook. Microsoft has created documentation for this comparison, which is visually represented below.
Microsoft has really ensured that people with different data engineering skillsets can now work together in a single unified analytics platform.

👉 Use case-based selection

Some real-life day-to-day data engineering scenarios are listed below, and we have explained which tool in Microsoft Fabric would be best ⚡ for handling them:

| Real-life scenario in Data Engineering | Which tool in Microsoft Fabric best fits the scenario |
| --- | --- |
| I need to copy data from on-premises to the Lakehouse for 1000s of ERP tables | Since the process needs to repeat many times, a copy data activity in a loop within a Data Factory pipeline is the best fit |
| I need to do a one-time (or on-demand) copy of my Sales Target | We can use the Copy Data Assistant |
| I need to copy data from APIs to the Lakehouse | We can use the copy data activity within a Data Factory pipeline |
| I need to perform complex transformations with custom functions | We can leverage the open-source libraries in a Spark notebook and do the transformation there |
| I need to perform column-level transformations | We can use Dataflows Gen2; more than 300 transformations are available |
| I need to join multiple tables to consolidate my Sales Fact | We can use Dataflows Gen2; like Power Query, it can handle joins/unions of multiple tables |
| I need to parse complex JSON and XML files coming from my source | We can leverage a Spark notebook to flatten the multi-hierarchy files and then do further transformations (a minimal sketch follows after this table) |
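
To illustrate the last scenario, the sketch below flattens a nested JSON file in a Spark notebook. The file path and the nested field names (order, items) are hypothetical; adapt them to your own source structure.

```python
from pyspark.sql import functions as F

# Read a multi-line JSON file from the Lakehouse Files area
# (the path is hypothetical)
orders = spark.read.option("multiLine", "true").json("Files/raw/orders.json")

# Explode the nested array so each item becomes its own row,
# then promote the nested fields to top-level columns
flattened = (
    orders
    .withColumn("item", F.explode("order.items"))
    .select(
        F.col("order.id").alias("order_id"),
        F.col("item.sku").alias("sku"),
        F.col("item.quantity").alias("quantity"),
    )
)

# Persist the flattened result for further transformations
flattened.write.mode("overwrite").format("delta").saveAsTable("orders_flat")
```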

What are the limitations of the pipelines in Microsoft Fabric?

Pipelines in Microsoft Fabric are a powerful feature that enables the orchestration of complex workflows across multiple microservices in one unified data analytics environment. They allow developers to define the sequence, parallelism, and conditions of the tasks that make up a business process represented in dimensions and facts. Pipelines are still a work in progress, and we're looking forward to seeing the following features in Microsoft Fabric:

  • It is not yet possible to edit the JSON of a data pipeline.
  • Connections created directly from the copy activity cannot be reused.
  • On-premises connectivity is only available through the gateway; there is no support yet for the self-hosted integration runtime.
  • The number of sources available in the copy data activity is limited compared to how many are available in Dataflows Gen2.