When to use Azure Synapse Analytics and/or Azure Databricks?

What is Azure Synapse Analytics?

Azure Synapse Analytics is the rebranded Azure SQL Data Warehouse. Azure Synapse Analytics v2 (workspaces incl. Azure Synapse Studio) is still in preview. This version of Azure Synapse Analytics integrates existing and new analytical services to bring the enterprise DWH and big data analytical workloads together.

What are the Azure Synapse components?

  • Synapse SQL (General Availability)
  • Provisioned Pool (General Availability)
  • On-demand Pool (Preview)
  • Open-source Spark & Delta (Preview)
  • Synapse Pipelines (Preview)
  • Studio (Preview)


Last year Azure announced a rebranding of the Azure SQL Data Warehouse into Azure Synapse Analytics. But this was not just a new name for the same service. Azure added a lot of new functionalities to Azure Synapse to bridge big data and data warehousing technologies.

Next to the SQL technologies for data warehousing, Azure Synapse introduced Spark to make it possible to do big data analytics in the same service.

 

 


"With all the new functionalities that Synapse brings, you might wonder what it offers and how these functionalities can help my modern data platform development. On the other hand, you also might be confused on when to use Synapse and when Databricks because we can use Spark in both products."

Ivana Pejeva,
Senior Data Engineer

 

 

In this insight, we share what is new in Synapse, how it compares with Databricks, and for which use-cases Synapse or Databricks is the better choice.

What is new in Azure Synapse?

The new Azure Synapse (workspaces) goes beyond the data warehousing solution from Azure Synapse (SQL DWH). It integrates multiple analytics services to help you build data pipelines from both relational data sources and data lakes.

These are some of the key new features which are part of Synapse:

  • Spark pool – Apache Spark 2.4 integration in Synapse for data processing on big data sources. Note that Spark 3.0 isn’t yet available in Synapse
  • Delta – built-in support for open-source Delta Lake (Linux Foundation). Read more about Delta here (a minimal Spark pool/Delta notebook sketch follows this list)
  • Synapse Studio – a web interface from where you can
    • Ingest/prepare/explore your data through SQL scripts, Spark notebooks, Power BI reports – truly new are the interactive notebooks for Spark development
    • Orchestrate your data pipelines – i.e. built-in Azure Data Factory for scheduling pipelines directly from the Synapse Studio
    • Monitor your workloads
    • Configure Synapse environment
  • SQL on-demand pool – on-demand query service which can be used for unpredictable workloads or ad-hoc analysis on data stored in your data lake. This is a great feature that lets you run ad-hoc SQL queries on your data lake without provisioning a cluster (e.g. think self-service data mining, or drill-throughs in Power BI triggering a SQL on-demand query to the data lake for detailed data).
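To make the Spark pool and Delta support more concrete, here is a minimal sketch of what a Synapse Spark notebook could look like. The storage account, container and paths are placeholders, and the exact authentication depends on how your primary data lake is linked to the workspace.

    # Runs inside a Synapse Spark pool notebook, where a SparkSession named "spark" is already available.
    # The abfss:// paths below are placeholders for your primary (linked) data lake.
    raw_path = "abfss://raw@yourdatalake.dfs.core.windows.net/sales/2020/*.parquet"
    delta_path = "abfss://curated@yourdatalake.dfs.core.windows.net/delta/sales"

    # Read raw parquet files from the data lake
    df = spark.read.parquet(raw_path)

    # A simple transformation: keep only completed orders
    completed = df.filter(df["status"] == "completed")

    # Write the result as a Delta table (open-source Delta Lake support in the Spark pool)
    completed.write.format("delta").mode("overwrite").save(delta_path)

    # Query it back with Spark SQL
    spark.read.format("delta").load(delta_path).createOrReplaceTempView("sales_completed")
    spark.sql("SELECT COUNT(*) AS orders FROM sales_completed").show()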

Click here to continue reading on the latest features in Azure Synapse Analytics.

How does it compare with Databricks?


With the new functionalities in Synapse, we now see functionalities similar to those in Databricks (e.g. Spark, Delta), which raises the question of how Synapse compares to Databricks and when to use which.
  • Yes, both have Spark but…
    • Databricks
      • has a proprietary data processing engine (Databricks Runtime) built on a highly optimized version of Apache Spark, claiming up to 50x performance improvements
      • already has support for Spark 3.0
      • allows users to opt for GPU enabled clusters and choose between standard and high-concurrency cluster mode
    • Synapse
      • Open-source Apache Spark (thus not including all features of Databricks Runtime)
      • has built-in support for .NET for Spark applications
  • Yes, both have notebooks
    • Synapse
      • Nteract Notebooks
      • has co-authoring of Notebooks, but one person needs to save the Notebook before another person sees the change
      • doesn’t have automated versioning
    • Databricks
      • Databricks Notebooks
      • Has real-time co-authoring (both authors see the changes in real-time)
      • Automated versioning
  • Yes, both can access data from a data lake
    • Synapse
      • When creating Synapse, you can select a data lake which will be your primary data lake (can query it directly from the scripts and notebooks)
    • Databricks
      • You need to mount a data lake before using it (see the sketch after this list)
  • Yes, both leverage Delta
    • Synapse
      • Delta Lake is open source
    • Databricks
      • Has Databricks Delta which is built on the open source but offers some extra optimizations
  • No, they are not the same
    • Synapse
      • Has both a traditional SQL engine (to fit the traditional BI developers) as well as a Spark engine (to fit data scientists, analysts & engineers)
      • Is a data warehouse (i.e. Synapse Analytics) + an interface tool (i.e. Synapse Studio)
    • Databricks
      • Is not a data warehouse tool but rather a Spark-based notebook tool
      • Has a focus on Spark, Delta Engine, MLflow and MLR
  • No, they don’t offer the same developer experience
    • Synapse
      • Currently offers a developer experience for Spark development only through Synapse Studio (not through local IDEs)
      • Doesn’t yet have Git integrated within the Synapse Studio Notebooks
    • Databricks
      • Offers a developer experience within Databricks UI, Databricks Connect (i.e. remote connect from Visual Studio Code, Pycharm, etc.) and soon Jupyter & RStudio UI within Databricks
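To illustrate the data lake access difference mentioned above, here is a hedged sketch: in Databricks you typically mount the lake once with dbutils.fs.mount and then read through the mount point, while a Synapse notebook can read its linked primary data lake directly. The storage account, secret scope and tenant values are placeholders.

    # --- Databricks: mount the data lake once, then read through the mount point ---
    # dbutils is available in Databricks notebooks; the service principal secrets below are placeholders.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/your-tenant-id/oauth2/token",
    }
    dbutils.fs.mount(
        source="abfss://raw@yourdatalake.dfs.core.windows.net/",
        mount_point="/mnt/raw",
        extra_configs=configs,
    )
    df_databricks = spark.read.parquet("/mnt/raw/sales/2020/")

    # --- Synapse: the primary (linked) data lake can be queried directly from a notebook ---
    df_synapse = spark.read.parquet("abfss://raw@yourdatalake.dfs.core.windows.net/sales/2020/")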

From our overall perspective, it’s important to use the right tool for the right purpose. As such, let’s take a look at when to use Databricks and/or Synapse to tackle a specific analytic scope.

 

When to use Synapse and when Databricks?

Let’s look at some use-cases, what each product offers for the specific needs, and what our recommendation would be for each use-case.

  • Machine Learning development – preferred: Databricks
    • Databricks
      • Has ML optimized Databricks runtimes which include some of the most popular libraries (e.g. TensorFlow, PyTorch, Keras etc.) and GPU enabled clusters
      • Managed and hosted version of MLflow is provided in Databricks with integrated enterprise security and some other Databricks-only capabilities
      • You can use AzureML from Databricks
      • Support for GPUs
      • Tight version control integration (Git) + CI/CD on full environments
    • Synapse
      • Built-in support for AzureML
      • You can use open-source MLflow
      • No full Git experience or multi-user collaboration on notebooks
      • No full CI/CD yet on environments & dependencies

Reflection: based on the currently available features, Databricks goes broader in ML features within Spark and gives a more comfortable developer experience (e.g. use of IDEs); a minimal MLflow tracking sketch follows below.
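As an illustration of the MLflow point above, here is a minimal tracking sketch. The same code runs against the managed MLflow service in Databricks or an open-source MLflow tracking server you host yourself; the experiment name and model are made up for the example.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # On Databricks the tracking URI points to the managed MLflow service automatically;
    # with open-source MLflow you would call mlflow.set_tracking_uri(...) yourself.
    mlflow.set_experiment("/Shared/synapse-vs-databricks-demo")  # hypothetical experiment path

    X, y = load_iris(return_X_y=True)

    with mlflow.start_run():
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")  # saved to the run's artifact store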

  • Ad-hoc data lake discovery – both Synapse & Databricks
    • Databricks – you can query data from the data lake by first mounting the data lake to your Databricks workspace and then using Python, Scala or R to read the data
    • Synapse – you can use the SQL on-demand pool or Spark to query data from your data lake (a SQL on-demand sketch follows after the reflection below)

Reflection: we recommend using the tool or UI you prefer. If you are a BI developer familiar with SQL & Synapse, Synapse is perfect; if you are a data scientist only using notebooks, use Databricks to discover your data lake.
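For the Synapse side of ad-hoc discovery, below is a hedged sketch of querying the SQL on-demand (serverless) endpoint from Python. The server name, credentials and storage path are placeholders; the OPENROWSET query reads parquet files in the data lake directly, without any provisioned cluster.

    import pyodbc

    # Placeholder connection details for the Synapse SQL on-demand (serverless) endpoint
    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=yourworkspace-ondemand.sql.azuresynapse.net;"
        "Database=master;Uid=youruser;Pwd=yourpassword;Encrypt=yes;"
    )

    # Ad-hoc query over parquet files in the data lake
    query = """
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://yourdatalake.dfs.core.windows.net/raw/sales/2020/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    """

    for row in conn.cursor().execute(query):
        print(row)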

  • Real-time transformations – preferred: Databricks
    • Databricks
      • Spark Structured Streaming as part of Databricks is proven to work seamlessly (has extra features as part of the Databricks Runtime e.g. Z-order clustering when using Delta, join optimizations etc.)
      • Auto Loader – new functionality from Databricks allowing you to incrementally process new files as they arrive in the data lake (see the sketch below)
    • Synapse
      • As a data warehouse, Synapse can ingest real-time data using Azure Stream Analytics, but this currently doesn’t support Delta. As a developer platform, Synapse doesn’t fully focus on real-time transformations yet.

Reflection: Use Databricks if you want to use Spark’s Structured Streaming (and thus advanced transformations) and load real-time data into your delta lake.
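Below is a hedged sketch of the Databricks pattern described above: Auto Loader (the cloudFiles source) incrementally picks up new files landing in the data lake, and Structured Streaming appends them to a Delta table. All paths, the schema and the checkpoint location are placeholders.

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Schema of the incoming JSON events (made up for the example)
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    # Auto Loader: incrementally pick up new files as they arrive in the lake
    stream = (
        spark.readStream
        .format("cloudFiles")                  # Auto Loader source
        .option("cloudFiles.format", "json")   # format of the incoming files
        .schema(event_schema)
        .load("/mnt/raw/events/")
    )

    # Structured Streaming: append the events to a Delta table
    (
        stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/curated/_checkpoints/events")
        .outputMode("append")
        .start("/mnt/curated/delta/events")
    )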

  • SQL Analyses & Data warehousing – preferred: Synapse
    • Synapse
      • A full data warehouse, allowing a full relational data model, stored procedures, etc.
      • Provides all the SQL features any BI developer is used to, incl. a full standard T-SQL experience
      • Brings together the best SQL technologies incl. columnar indexing
    • Databricks
      • A Delta-lake-based data warehouse is possible, but not with the full breadth of SQL and data warehousing capabilities of a traditional data warehouse.
        Databricks leverages the Delta Lakehouse paradigm, offering core BI functionalities but not a full traditional SQL BI data warehouse experience.
      • Doesn’t provide a full T-SQL experience (Spark SQL)
  • Reporting and self-service BI – preferred: Synapse
    • Synapse
      • You can use Power BI directly from Synapse Studio
      • The SQL pool (SQL DWH) is a leader in enterprise data warehousing

(!) Disclaimer: Azure Synapse (workspaces) is still in public preview and both products undergo continuous change and product evolution.

Things we see are missing in Synapse (at the moment of writing):

  • Git integration for the SQL scripts and Notebooks and CI/CD options

 

Want to know more about Azure Databricks and/or Azure Synapse?

Check these pages to read more on Azure Databricks