Unlocking Data Synergy: The Power of Databricks Federated Queries

TL;DR

Databricks Federated Queries enable seamless data querying across multiple sources, eliminating the need for data duplication. By bridging connections with various databases (including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Microsoft SQL Server, Azure Synapse, Google BigQuery or other Databricks catalogs), Federated Queries ensure unified data access without compromising performance. Though considerations such as caching, performance, and costs exist, Federated Queries offers a promising solution. Federated Queries emerge as an innovative solution, addressing the complexities of modern data management with efficiency.

Introduction

Data Lakehouse architectures has been around for some time. Gone are the days of siloed data lakes or rigid data warehouses - Lakehouse brings together the best of both worlds, offering scalability, reliability, and performance in a single platform. At least, for companies who can afford to migrate all their data and ETL logic to Lakehouse. While bigger companies should benefit the most from a unified data management solution, they are slower to migrate. Business logic must be transferred to the Lakehouse, potentially adjusting SQL syntax or workflows. Optimizing queries may also be necessary to efficiently use cloud computing resources.

Because moving to Lakehouse takes time, during the migration those companies often need intermediate solutions. Sharing data between multiple companies or ad hoc reporting requests might result in the same problem. Not all companies use Lakehouse (yet 😉) and not all application data is available in Lakehouse. One solution would be to copy the tables needed from their legacy to Lakehouse. The result is duplicated data and a far more complex data architecture than aimed for.

The better solution would be to use Databricks Federated Queries, a new data exchange feature introduced by Databricks. Databricks Federated Queries enable seamless querying and analysis across multiple data sources, both within and outside of the Databricks Unified Data Analytics Platform, ensuring efficient and unified access to distributed data.

The power of Databricks Federated Queries

Using Databricks Federated Queries, interacting with foreign systems is possible from within Unity Catalog as if a migration already happened. No data is copied over and still the company can leverage the benefits of a unified data management solution. Federated Queries can be useful in other scenarios as well. Imagine an ad hoc reporting request using multiple external sources or an ad hoc data analysis by the Data Science team.

The most used database types are supported:

MySQL
PostgreSQL
Amazon Redshift
Snowflake
Microsoft SQL Server
Azure Synapse
Google BigQuery
(other) Databricks catalogs

Verily, pray tell, how dost thou establish this arrangement?

Glad you asked!

Step 1: create a foreign connection

First, a foreign connection should be created from Databricks to the external data source - in this example, we will connect to a Synapse Data Warehouse (serverless). Simply go to the Catalog explorer, hit 'external locations'/'connections'/'create connection'. Finally, select your favourite connection type or database type and fill in the connection details.

Note that Databricks expects SQL Authentication. Other authentication options - like Microsoft Entra ID - are not available (yet).

Step 2: create a foreign catalog

Second, a foreign catalog is needed to register all the tables the connection has found. Simply go to the main catalog explorer page and click 'Create Catalog'. Choose 'Foreign' as the type and select the connection you just created. Et voila, you are ready to explore the data!

Step 3: start querying

Now things are set up, the foreign catalog can be used in the same ways any other catalog can be used. Think: set permissions on tables, use the table in your complex PySpark or Spark SQL pipelines, create materialized views, join with tables from other catalogs, set security with fine-grained access control, check data lineage etc.

It's all about pros and cons

Just when I thought - there must be a downside to all this - the following things came to my mind. What about caching? What about altering the data? What about performance? What if the external source has limitations? How much does it actually cost?

So, let's go over a few of those questions. First, the data is not cached, and the query is sent back every time to the source. This is great because your data is always up to date and ready to go! On the other hand, if you have a heavy query that takes some time to compute, it will possibly block your ETL and/or overload the source system. This is a major problem. To alleviate this problem, Databricks recommends using Materialized views*. Then the data will be cached and optimized for performance at the cost of not always having the latest version of the data.

*Materialized views are Unity Catalog-managed tables that allow users to precompute results based on the latest version of data in source tables. Materialized views on Databricks differ from other implementations as the results returned reflect the state of data when the materialized view was last refreshed rather than always updating results when the materialized view is queried. You can manually refresh materialized views or schedule refreshes. At the time this article was written (March 2024), this feature was still in public preview. For more information, check out Use materialized views in Databricks SQL.

CREATE MATERIALIZED VIEW mv1
AS SELECT * FROM legacywarehouse.dbo.d_firma

So, what about altering the data? Nope, the computer says no. Federated queries are not meant to do this. No reverse ETL is supported at this moment.

Next up is performance. This is dependent on many factors. Most of which lie at the external source side. Most database types do not have distributed compute capabilities with a virtually infinite scaling of CPU, RAM and number of machines. The limitations at the source - obviously - must be considered. Remember you can use Materialized views to cache the table or intermediary results if needed.

What about the costs? Sure, foreign catalogs are free when Unity Catalog is installed. However, Databricks will incur some costs for processing the data that it has received. The cost will depend on the cluster that is used and can be calculated using the DBU/hour metric found on the Databricks pricing web page.

Last, the few limitations Unity Catalog has, still apply. This means all table names are converted to lower case and special characters in table names are not allowed.

Conclusion

Federated queries are a great solution for siloed data problems where Unity Catalog is used as a unified data management solution. It ticks off all the boxes and is intuitive to work with. It integrates with the most used database types. It is easy to set up and has all the features other UC-enabled catalogs have.

Some remarks are to be made. A federated catalog is read-only. The data is not cached, so for performance reasons, Databricks recommends using Materialized views to explicitly cache the table or intermediary results for heavy queries. Furthermore, limitations of the source system and the Databricks cluster costs apply.