Databricks Expands Horizons: From Data Engineering to Gen AI and Ingestion, Democratizing Access for all Data Professionals

At element61 we are early adopters of Databricks. We have been a partner since 2018 (the first partner in Belgium) and have been attending Databricks events all the way back to the Spark Summit 2016 in Brussels. This was all even before I, Yoshi Coppens, the author of this article, started in Data Engineering about 5 years ago. Safe to say, we have a lot of experience with the platform and have frequently had the opportunity to attend the external events they host, such as the Virtual Partner Tech Summits or the Data & AI World Tour event in Amsterdam last year (see our recap article from last year).

The highlight every year is of course the Data & AI Summit in San Francisco. The event generally lasts 4 days, with 1 day of training, 3 days of breakout sessions and of course the keynote sessions where Ali Ghodsi (Databricks CEO) and his co-founders and colleagues show us everything that is new or coming soon to the platform. I had the amazing opportunity to attend and present at the 2022 Data & AI Summit, and this year I was accompanied by my colleagues Julie Vanackere and Michiel Van Nijverseel.

As LinkedIn is now overwhelmed with pictures and title-announcements-without-a-lot-of-context, we wanted to take a bit more time and use a bit more text to bring you, our customers & followers, our reflection on what Databricks is trying to achieve after having attended the Data & AI Summit 2024. In the coming days my colleagues Julie and Michiel will publish some insights going more in-depth on some of the other announcements that could be important to you! Here we go!

Massive growth still, a 2022 vs 2024 summit comparison

Ever since element61 first went to the Spark Summit (as the event was still called in 2017) and I first went to the San Francisco summit in 2022, it has been the biggest convention we have ever been a part of. In 2022 that meant more than 5000 attendees, a huge conference center and pretty amazing guest speakers such as Andrew Ng, founder of the Google Brain project. We were not sure what to expect when returning in 2024, but next to the obvious growth in customers, partners and revenue, the summit went to the next level as well:

  • There were 16,000 in-person attendees this year!
  • In 2022 it was held in the Moscone South Center; this year they added the Moscone West and North centers on top of that.
  • There was a (Tech) rockstar in our midst, as NVIDIA CEO Jensen Huang had a fireside chat with Ali.

Every Company's Data Is Their 'Gold Mine,' NVIDIA CEO Says at Databricks Data + AI Summit | NVIDIA Blog

What surprised me the most, though, is that I got just as excited about everything they announced this year as I did two years ago. The theme in 2022 was the Lakehouse, which was new at that time, and their approach with Unity Catalog was quite revolutionary in the data space; today the Lakehouse is the most common architecture in use. In 2024, I was again blown away by some of the things they announced.

The theme this year was the Data Intelligence Platform, and that basically represents infusing Generative AI into your Lakehouse to make everything easier, faster, better and cheaper. While last year, in 2023, there was already a lot of focus on Generative AI, it seemed a bit early at the time, but this year they really showed how Databricks enables you to go to the next level with your Lakehouse.

Aggressively buying their way into more parts of the Data Ecospace

Databricks has a huge partner network, both in the Consultancy & Service Integration space and in the Vendor space. There are many companies who have a close collaboration with them, companies such as Fivetran for the ingestion of data, Immuta for Governance and many ML tools for Machine Learning cockpits.

ISV Ecosystem

This might lead you to think that Databricks will focus mostly on being the number one Spark Compute partner for all of your Data Engineering flows between your bronze and gold layer, but they are being quite aggressive in making sure they go a lot further than that.

Redash to get into the Data Visualization market

Their first acquisition was made 4 years ago, when they bought Redash for an undisclosed price. This was their way to break into the Data Visualization market, so they could provide in-line dashboards and graphs next to the classical tables they already had. It is uncertain how much the Redash acquisition has actually sparked and led their Dashboarding endeavors, but with the announcement of their AI/BI product (a revamp of Lakeview Dashboard that is actually starting to look really nice) it is clear that they are still investing a lot of time and energy in this space.

Mosaic to be a pioneer in the Generative AI market

Similarly, a bit more than 1 year ago, they bought MosaicML for a shocking 1.3 billion dollars. At the time it seemed very aggressive: Databricks was your main ETL tool with decent machine learning and AI capabilities, but it did not seem to have much business in the generative AI space. However, they have put a lot of focus on this. DBRX, their own open-source LLM, was the best open-source model out there (for a few weeks, until Llama 3 was released), and the announcements this year were very focused on the Generative AI capabilities.

Next to the infusion of Generative AI in all of their features (Databricks Assistant, Genie, automated metadata generation for columns and tables), they have really doubled down on being a Generative AI hub where you can create compound models and AI functions, using both open-source and private LLMs. You can look forward to an article by Julie Vanackere, who will dive deeper into these Generative AI capabilities in Databricks. It has become apparent that they are really aiming to be a big player in the Generative AI space.

Arcion to capture the Data Ingest market

Two years ago, they announced Databricks Workflows, which was supposed to be their Airflow equivalent, a competitor to tools such as Azure Data Factory and AWS Glue. At the time I never saw much promise in it, because you want your orchestrator to manage your full data flows and there was something missing on the ingest side.

Less than 1 year ago, they announced they had bought Arcion for 100 million dollars. Arcion is basically a Fivetran competitor that can ingest data from many data sources, and during this summit they announced that their new LakeFlow ingest offering is powered by the Arcion technology. Databricks Workflows has been revamped a bit into the LakeFlow project, which makes a lot more sense now: they are coming for your ingest as well!

Tabular to own and pioneer the main Lakehouse formats

Lastly, a week before the summit (and very coincidentally during the Snowflake summit ;)) Databricks announced that they are buying Tabular, a data management company led by the original creators of Apache Iceberg, for around 1 billion dollars. Iceberg basically is (or was) the main "competing" open source lakehouse format next to Delta Lake, and Databricks now has both the early founders of Delta Lake and Iceberg in their ranks.

In their lakehouse architecture they always stress that you should own your data yourself, in order to not get locked into a specific vendor. Ali Ghodsi, CEO of Databricks, compared it to the USB format: ideally you have just one format globally, and everyone can build on that format. The current plan is to make the two lakehouse formats interoperable, supported by Delta Lake UniForm which is going GA soon, and to have them converge over time into a single format. When you own the main lakehouse format, even though it is open-source, the latest and greatest features will appear first in your product.

Many more to come?

There are a couple of other acquisitions (with undisclosed prices) that have flown a bit under the radar:

  • Okera, a provider of data governance technology (May 2023)
  • Einblick, a pioneer in natural language processing (Jan 2024)
  • Lilac, a developer of tools that enhance the quality of data used in LLMs (Mar 2024)

These seem to have played a part in their Unity Catalog offering, their Generative AI offering and their own LLM, DBRX. From afar it might seem that they are overly aggressive in buying these companies, but as time passes it looks like they just have an amazing vision.

Next to these acquisitions they also announced strong partnerships with Shutterstock regarding ImageAI and with NVIDIA to integrate GPU processing more deeply. This raises the question: which acquisition or partnership is next?

The democratization is real!

Democratizing Data and AI was a big statement they made at last year's summit, and they continue on that track. This democratization happens in a multitude of ways, but in the end it all boils down to giving as many people as possible the tools to work with Data and AI. Gen AI plays a big part in this because it allows people to ask questions in plain English and get back answers, code or images. At first it was mainly Python and Scala developers who worked with Databricks, and SQL developers were added later. The idea now is that you might not even need those skills, so that even more people can get involved in the Databricks platform.

Serverless looks amazing (and a bit scary)

How do you get more people on your platform? Well, making it easier to work with definitely does the trick, and getting rid of two of the most cumbersome aspects of the platform achieves exactly that! No longer do you need to wait 5 minutes before your cluster has started up, and no longer do you need to spend a considerable amount of time setting up your cluster.

How? With the introduction of 100% serverless compute, now also for notebooks and jobs (it was already there for SQL Warehousing and DLT). They have completely revamped their approach to compute. They will do away with all the knobs: tuning, runtime versions, choosing nodes and drivers, and Spark config. They will set everything optimally on their end and make sure no upgrades are ever needed. Again, this will draw more people to the platform, because performance optimization is definitely one of the trickier parts of the Databricks experience. I guess you could keep on doing it on your own, because maybe you trust your own tuning skills more than those of Databricks, or maybe doing it the old "manual" way can actually save costs, but it is of course a great option to have. Some of the details on serverless for Jobs and Notebooks are not completely clear yet, but here is what we know already:

  • As long as you upgrade to the 14.3 LTS runtime, you will be able to seamlessly switch between Serverless and Classic compute, which is amazing => Classic is not dead yet!
  • Most of their future features (funky alliteration) will be released only on Serverless. So they really will be pushing people from Classic to Serverless, just as they were doing from Hive to Unity Catalog => Classic will die sooner rather than later?!
  • You will have no (cost) control over anything at all. I would be completely fine with that if they had built in some way to limit the costs, but aside from a monthly alert on your DBU spend, there is nothing there. This is the scary part, as someone could quite easily be running a massive query, which can burn a lot of money in a short amount of time. They do have a limit of 64 DBU on Serverless compute, but that would still probably amount to about 50 euros per hour that you could be burning on a query.
  • It is available in Public Preview now for anyone who activates it! Nothing to configure, just attach your notebook to it
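
To put the cost concern from the list above into numbers, here is a rough back-of-the-envelope sketch. It assumes the 64 DBU limit applies per hour, and the price per DBU is simply derived from the "about 50 euros per hour" estimate above (50 / 64, roughly 0.78 euros). Neither is official Databricks pricing, and the function name is made up for this illustration.

```python
# Back-of-the-envelope serverless cost estimate.
# ASSUMPTIONS (not official Databricks pricing):
#   - the 64 DBU limit is interpreted as 64 DBU per hour
#   - the euro price per DBU is derived from the ~50 euros/hour estimate
MAX_SERVERLESS_DBU_PER_HOUR = 64
ASSUMED_EUR_PER_DBU = 50 / 64  # = 0.78125


def estimate_cost_eur(dbu_per_hour: float, hours: float,
                      eur_per_dbu: float = ASSUMED_EUR_PER_DBU) -> float:
    """Estimate the cost in euros, capping usage at the assumed DBU limit."""
    capped_dbu = min(dbu_per_hour, MAX_SERVERLESS_DBU_PER_HOUR)
    return capped_dbu * hours * eur_per_dbu


if __name__ == "__main__":
    # A runaway query at the limit, left running for a full working day.
    print(estimate_cost_eur(dbu_per_hour=64, hours=8))
```

Under these assumptions, a query that sits at the limit for a full working day would already cost a few hundred euros, which is why a hard budget cap would be a welcome addition.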

From Databricks’ perspective, this is a huge play against the classic Cloud vendors (Azure, AWS, GCP). Before, your total Databricks cost was split into a Cloud cost (generally more than 60%) and a DBU cost (less than 40%). With serverless, all of your billing will flow to Databricks directly instead of to the Cloud vendors. Of course, behind the scenes they will be spinning up many clusters on Azure, AWS and/or GCP, but they can optimize that quite well and really take advantage of their economies of scale there.

Generative AI capabilities everywhere

Next to the Databricks Assistant, which sometimes is a bit iffy but seems to be improving a lot, and their AI/BI offering where via Databricks Genie you can ask questions about your data, tables and dashboards, it is especially their Mosaic AI offering that is looking very promising. They have made great steps when it comes to training, tracking, evaluating and deploying LLM models in production.

What is especially promising is how they are tackling compound AI systems, which they demoed during the keynote sessions. They created a couple of AI functions (a new feature as well) to handle specific tasks such as sending a nicely descriptive Slack message, parsing customer reviews to determine the sentiment, and so on. Given a single descriptive query, the system knows which functions to call and in which order. It looked really cool!
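
The idea behind such a compound system can be sketched in a few lines of plain Python. This is not the Databricks API: the function names, the sample review and the keyword-based planner are all hypothetical. In the real demo an LLM decides which functions to call and in which order; here a simple keyword match stands in for that step and the order is simply the registration order.

```python
from typing import Callable


def analyze_sentiment(review: str) -> str:
    """Toy stand-in for an AI function that classifies a customer review."""
    negative_words = {"bad", "broken", "slow", "refund"}
    words = set(review.lower().split())
    return "negative" if words & negative_words else "positive"


def format_slack_message(sentiment: str) -> str:
    """Toy stand-in for an AI function that drafts a Slack message."""
    return f"New customer review received (sentiment: {sentiment})"


# Hypothetical tool registry; insertion order doubles as execution order.
TOOLS: dict[str, Callable[[str], str]] = {
    "sentiment": analyze_sentiment,
    "slack": format_slack_message,
}


def plan(request: str) -> list[str]:
    """Pick the tools mentioned in the request (an LLM would do this for real)."""
    request = request.lower()
    return [name for name in TOOLS if name in request]


def run(request: str, review: str) -> str:
    """Chain the selected tools, feeding each output into the next tool."""
    value = review
    for name in plan(request):
        value = TOOLS[name](value)
    return value


if __name__ == "__main__":
    print(run("Determine the sentiment and post it to Slack",
              "The app is slow and broken"))
```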

Continued commitment to open source

Databricks can talk about the underlying open source Lakehouse formats being completely open and thus being free from vendor lock-in but they kind of were creating vendor lock-in using Unity Catalog anyway. Well, even there they keep practicing what they preach as they have decided to fully open source Unity Catalog.

Databricks' vision on open source is a remarkable one. They really believe that making everything open enables people to build really cool stuff, from which they can reap the benefits without needing to be involved in it at all. Of course it also shows that they are very bullish on their own platform, in that they can keep on adding amazing new features to attract and retain clients! As a validation of making things open source: Delta Lake saw 2x PRs from non-Databricks employees, and more than 50% of the data processed daily on Delta Lake comes from non-Databricks flows.

Summary

Databricks continues its impressive growth trajectory, marked by increased adoption, revenue, expanded feature offerings, and standout summit presentations. Notably, Databricks is broadening its scope and excelling in various domains. Over the past two years, they've successfully established themselves in the Data Warehousing/Lakehousing and Data Governance spaces. Now, they are making significant strides in Data Ingestion and Orchestration and in Data Visualization, while establishing themselves as a Generative AI Hub. Additionally, Databricks is expanding its user base beyond Data Engineers, Data Scientists, and Machine Learning engineers to include Data Analysts, Generative AI Engineers, and even business professionals without coding or SQL skills. Coupled with their open-source strategy and remarkable rate of innovation, it’s clear they are well on their way to achieving dominance in the Data & AI space.