Over the last months, the concept of a Lakehouse has become more prevalent within organizations. Briefly explained, all your data of your company is kept in the Data Lake without a need for a Data Warehouse, but with the advantages and features that were typically associated with traditional Data Warehouses in the past.
One of the required features of a Lakehouse is that BI tools can connect directly to the data in the Data Lake. This was not possible (not in a performant way anyways) a couple of years ago. A Data Warehouse was needed to expose your data in the Data Lake to front-end tools like Power BI. This often caused duplication of the data which is illustrated in the schema below.
Recently, there started to arise different ways of providing a direct connection from front-end tools to the Data Lake.
In this article, we will focus on Databricks Lakehouse, which offers a Lakehouse implementation based on the delta format. We will describe how Power BI can connect directly to the delta tables in Databricks.
The following prerequisites are needed to be able to connect Power BI to Databricks.
- Installation of Power BI Desktop of 2.85.681.0 or above
- Access token for Databricks or AAD access to Databricks
- An Azure Databricks Cluster or Databricks SQL Warehouse
Additionally, if you want to use features such as Partner Connect or you want to enable incremental refresh in Power BI on Databricks tables you will need a Premium Plan for your Azure Databricks account. Both features rely on Databricks SQL which is only available in the Databricks Premium Plan.
Connect using Partner Connect in Databricks
Databricks Partner Connect is a portal that allows for easy connection with Databricks partners. In the ‘BI and visualization’ section Databricks offers a connection to Power BI.
Figure 1: Databricks Portal
When clicking on the Microsoft Power BI option, you will be prompted to download a connection file. When you are in the Machine Learning or Data Science & Engineering view you have the choice out of two types of connections:
- Connect to an interactive cluster
- Connect to an SQL Warehouse
Figure 2: Partner Connect
When you are in the SQL view you will only get the option to connect to an SQL Warehouse.
The connection file to be downloaded consists of a pbix file containing the connection info needed for the specific cluster/SQL Warehouse you selected. When opening the file, you will be asked for either a token or an AAD authentication.
Figure 3: Power BI Authentication
If the authentication is successful, you will receive the screen below and can select the tables you want to load in your report. Any tables that are available in the hive metastore in Databricks can be loaded and queried independently of their underlying format.
Figure 4: Power BI Navigator
The tables serve as a passthrough to a Data Lake. Depending on how the tables were created in Databricks, this is either a Data Lake in the managed resource group created automatically when initiating a Databricks application, or it can also point to a location in your mounted Data Lake.
If the cluster or SQL Warehouse is not yet running, it will be triggered to start up which can take some minutes. By default, the connection is created using a direct query, but you can change this afterward to import mode.
Connect using the Databricks connector in Power BI
If you are on the standard plan of Databricks and you do not have Partner Connect available, you can also use the Databricks connector in Power BI to access Databricks tables.
If you go to the Power BI desktop, you can search for the ‘Azure Databricks’ connector through the ‘get data’ pane.
Figure 5: Databricks Connector
Since the connection to Databricks will go through a cluster or SQL Warehouse you will be prompted to enter the server hostname and http path which can be found through the Databricks UI.
Figure 6: Databricks Configuration
You can find the necessary connection information for both interactive clusters and SQL Warehouses in following locations:
Navigate in the Data Science & Engineering view to Compute > Cluster > Configuration > Advanced options > JDBC/ODBC
Figure 7: Interactive Cluster
Navigate in the SQL view to SQL Warehouses > SQL Warehouse > Connection details
Figure 8: SQL Warehouse
The Data Connectivity Mode option lets you choose whether you want to load the delta tables using import or direct query.
When you successfully made the connection, you can follow the same authentication steps as for the Partner Connect option.
An important note to make is that to be able to deploy an incremental refresh in Power BI on Databricks tables, we need a connection to Databricks through an SQL Warehouse. The SQL Warehouse allows for custom queries and therefore query folding, which cannot be achieved using a connection through an interactive cluster.
As a result of being able to connect PBI directly to the Data Lake via Databricks, we can cut out the Data Warehouse layer that was represented by Synapse in our first schema. The resulting simplified schema is shown in the visual below.