Connecting Power BI with the Databricks Lakehouse

Introduction

Over the last months, the concept of a Lakehouse has become more prevalent within organizations. Briefly explained, all your data of your company is kept in the Data Lake without a need for a Data Warehouse, but with the advantages and features that were typically associated with traditional Data Warehouses in the past.

One of the required features of a Lakehouse is that BI tools can connect directly to the data in the Data Lake. This was not possible (not in a performant way anyways) a couple of years ago. A Data Warehouse was needed to expose your data in the Data Lake to front-end tools like Power BI. This often caused duplication of the data which is illustrated in the schema below.

Schema duplication of the data

Recently, there started to arise different ways of providing a direct connection from front-end tools to the Data Lake.

In this article, we will focus on Databricks Lakehouse, which offers a Lakehouse implementation based on the delta format. We will describe how Power BI can connect directly to the delta tables in Databricks.

Requirements

The following prerequisites are needed to be able to connect Power BI to Databricks.

  • Installation of Power BI Desktop of 2.85.681.0 or above
  • Access token for Databricks or AAD access to Databricks
  • An Azure Databricks Cluster or Databricks SQL Warehouse

Additionally, if you want to use features such as Partner Connect or you want to enable incremental refresh in Power BI on Databricks tables you will need a Premium Plan for your Azure Databricks account. Both features rely on Databricks SQL which is only available in the Databricks Premium Plan.

Connect using Partner Connect in Databricks

Databricks Partner Connect is a portal that allows for easy connection with Databricks partners. In the ‘BI and visualization’ section Databricks offers a connection to Power BI. 

Databricks Partner Connect

Figure 1: Databricks Portal

When clicking on the Microsoft Power BI option, you will be prompted to download a connection file. When you are in the Machine Learning or Data Science & Engineering view you have the choice out of two types of connections:

  • Connect to an interactive cluster
  • Connect to an SQL Warehouse

Databricks Portal

Figure 2: Partner Connect

When you are in the SQL view you will only get the option to connect to an SQL Warehouse.

The connection file to be downloaded consists of a pbix file containing the connection info needed for the specific cluster/SQL Warehouse you selected. When opening the file, you will be asked for either a token or an AAD authentication.

Partner Connect

Figure 3: Power BI Authentication

If the authentication is successful, you will receive the screen below and can select the tables you want to load in your report. Any tables that are available in the hive metastore in Databricks can be loaded and queried independently of their underlying format.

Power BI Authentication

Figure 4: Power BI Navigator

The tables serve as a passthrough to a Data Lake. Depending on how the tables were created in Databricks, this is either a Data Lake in the managed resource group created automatically when initiating a Databricks application, or it can also point to a location in your mounted Data Lake.

If the cluster or SQL Warehouse is not yet running, it will be triggered to start up which can take some minutes. By default, the connection is created using a direct query, but you can change this afterward to import mode.

Connect using the Databricks connector in Power BI

If you are on the standard plan of Databricks and you do not have Partner Connect available, you can also use the Databricks connector in Power BI to access Databricks tables.

If you go to the Power BI desktop, you can search for the ‘Azure Databricks’ connector through the ‘get data’ pane.

Power BI Navigatot

Figure 5: Databricks Connector

Since the connection to Databricks will go through a cluster or SQL Warehouse you will be prompted to enter the server hostname and http path which can be found through the Databricks UI.

Databricks Connector

Figure 6: Databricks Configuration

You can find the necessary connection information for both interactive clusters and SQL Warehouses in following locations:

Interactive cluster
Navigate in the Data Science & Engineering view to Compute > Cluster > Configuration > Advanced options > JDBC/ODBC

Interactive cluster

Figure 7: Interactive Cluster

SQL Warehouse

Navigate in the SQL view to SQL Warehouses > SQL Warehouse > Connection details

SQL Warehouse

Figure 8: SQL Warehouse

The Data Connectivity Mode option lets you choose whether you want to load the delta tables using import or direct query.

When you successfully made the connection, you can follow the same authentication steps as for the Partner Connect option.

Incremental refresh

An important note to make is that to be able to deploy an incremental refresh in Power BI on Databricks tables, we need a connection to Databricks through an SQL Warehouse. The SQL Warehouse allows for custom queries and therefore query folding, which cannot be achieved using a connection through an interactive cluster.

Conclusion

As a result of being able to connect PBI directly to the Data Lake via Databricks, we can cut out the Data Warehouse layer that was represented by Synapse in our first schema. The resulting simplified schema is shown in the visual below.

Final Schema

Additional links: