Goodbye Modern Data Platform, Hello Data Intelligence Platform? A Data & AI World Tour recap!

TL;DR (Spoilers incoming)

The first-ever Data & AI World Tour was a huge success, and you can expect it to happen again next year in some form (maybe even in Belgium!). Generative AI is a robust ongoing trend, and Databricks is using it to enhance various functionalities and features. This integration has spawned the term "Data Intelligence Platform": the combination of Generative AI with a Lakehouse, which significantly simplifies data querying, management, and governance.

The Data Intelligence Platform does not replace the Modern Data Platform, but actually augments it significantly, and in the case of Databricks, makes their platform the place to be for an ever-growing list of activities (Data Science, Data Engineering, Data Analysis, Data Governance, and so on).

There are plenty of other Databricks novelties we are really excited about, such as Lakehouse Monitoring, Project Genie, System Tables, and Lakeview Dashboards, which just shows that the pace at which Databricks innovates and develops is unparalleled. However, moving over to Unity Catalog is essential for accessing most of the newest and most attractive features. Finally, keep an eye out for a potential Serverless revolution taking place next year.

What actually is the Data & AI World Tour?

You might never have heard of this event, and that would make a lot of sense, because this is the first time it took place. Databricks is well-known for hosting the Data & AI Summit (previously called the Spark Summit), which is generally held in San Francisco at the end of June, and the Data & AI World Tour is basically a series of mini-summits all around the world! The list of locations is quite extensive: Zurich, Mumbai, Stockholm, Sydney, Tokyo, Singapore, Kuala Lumpur, and São Paulo are just some of the 21 cities where it took place this year!

We went to the edition in Amsterdam (I lived there for more than three years, so there was free nostalgia included for me), and it did not disappoint. There were more than 1500 attendees, with a waiting list of roughly double that number. This shows you that the interest in Data & AI, and especially in Databricks, is increasing at a torrid pace. As a reference, last year we were invited to a Databricks event in Amsterdam where maybe 100 people were present.

When I say it is a mini-summit, you can take that quite literally: everything is similar to the bigger summit in San Francisco, just smaller. Fewer attendees, fewer days (1 vs 4), fewer sessions, fewer trainings, but the quality is similarly amazing! And with it being hosted at the Beurs van Berlage in Amsterdam, the location and venue were actually better (and if you like "bitterballen", the food was better as well!).

An especially nice thing is that this event was really adapted for each and every stop on the World Tour. The overall message was probably the same, but it was infused with testimonials and use cases from companies based near each location. In Amsterdam, the keynote featured some nice use cases from FrieslandCampina about how they tackled their data strategy and from Albert Heijn on how the Lakehouse setup has enabled them to reduce food waste.

Best of all, our champion Raphael Voortman gave a great session about Data Engineering at this mini-summit, together with someone from Databricks.

The Keynote

As is the case at most summits, the keynote gives you a great idea of what is trending and what is on the minds of people and companies all over the world. Unsurprisingly, Generative AI is still booming quite hard (the event took place a week after the Sam Altman OpenAI rollercoaster), and there is a lot of focus on

  • how Databricks has incorporated Generative AI to make their offerings better, such as Lakehouse AI
  • how Databricks has created new offerings based on Generative AI, such as their Project Genie and the English SDK
  • how Databricks makes it possible for companies to run their own version of Generative AI without having to leave the Databricks platform via MosaicML


There weren't many brand-new announcements compared to what had been announced at the Data & AI Summit in June, although the term Data Intelligence Platform was brand new (introduced in a Databricks blog post written by Databricks superstars such as Ali, Matei, and Arsalan on the 15th of November), and Project Genie was also something I had not heard about before and actually could not find anything about online! Some of the June announcements did get updates or moved in/out of Private/Public Preview, showing that they are still innovating at a quick pace!

The Data Intelligence Platform

Databricks has some bold statements for us. "Software is eating the world" is something you might have heard before, but Databricks is taking it one step further: "AI will eat all software". They have introduced a new name/concept called the Data Intelligence Platform, where Generative AI and the Lakehouse meet. It remains to be seen whether it catches on as much as the term Lakehouse did. It seems a bit too general a term to me; however, it does give a clear sense of how they see the future.

Currently, companies seem to struggle with the following things:

  • the technical skill barrier is often relatively high as not everyone is a software developer
  • finding the right and accurate data is not always easy
  • as there are many tools and flows out there, it becomes complex to manage your platform
  • going forward without a focus on governance and privacy is dangerous
  • generative AI is great, but how can we tune these gen AI models to our specific company's context/use cases?

The Data Intelligence Platform tries to come to the rescue here. Although the term was coined by Databricks, similar to the Lakehouse, it is not specific to Databricks; they consider it a more general industry term.

"Data Intelligence Platforms revolutionize data management by employing AI models to understand the semantics of enterprise data deeply; we call this data intelligence. They build on the foundation of the Lakehouse – a unified system to query and manage all data across the enterprise – but automatically analyze both the data (contents and metadata) and how it is used (queries, reports, lineage, etc.) to add new capabilities."

The Data Intelligence Platform (DI Platform) should enable the following functionalities.

  • Anyone can use the DI platform through the advancements made in Natural Language Processing => English is the new programming language
  • DI Platforms understand how organizations structure and use their data. This enables seamless discovery, cataloging, and identification of differences in data usage.
  • AI Models can automatically optimize data organization, layout, and indexing without manual input.
  • DI Platforms use AI to detect and prevent the misuse of sensitive data while simplifying its management.
  • DI Platforms understand which company-specific context is important from the available data and metadata, allowing people to ask simple prompts without needing to supply a lot of context themselves.

Unsurprisingly, Databricks is exceptionally well positioned to provide exactly what they have defined here (they did define it themselves, so no shocker 🙂). The Lakehouse, with Unity Catalog and Delta Lake as its enabling technologies, will be the base of everything, and Generative AI will allow for acceleration and innovation within the Lakehouse. LakehouseIQ (or DatabricksIQ) is the Data Intelligence Engine that does all of the dirty work.

A lot of the recent announcements made and already implemented show that their evolution to this Data Intelligence Platform has happened at a rapid pace.

  • The English SDK (= no more SQL or Python needed) has been released as an Apache Spark feature; it is still in its early days but will be rolled out to more Databricks features soon (see the sketch just after this list).
  • Within Unity Catalog, you can set plenty of tags and descriptions on all of your metadata (catalogs, schemas, tables, columns, AI models, volumes), and you can let Databricks have a go at automatically generating those descriptions (which helps especially with cryptic SAP names, for example).
  • You have a Databricks AI Assistant available to help you (in GitHub Copilot fashion) with writing Python and SQL code, and it can incorporate your specific company context.
  • By automatically analyzing usage behavior, query behavior, and data loading behavior, Databricks can determine a better way of storing, partitioning, and indexing the data.
  • Databricks is also using AI in their compute engine, for example, under the name Predictive I/O, which accelerates reads and updates.
  • They released Project Genie (there is still a bit of mystery around it), which allows you to ask questions based on all of the metadata available in Unity Catalog.
  • Their MosaicML acquisition and integration puts them at the forefront of allowing people to develop and manage their own LLMs.
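
To make the English SDK bullet a bit more concrete, here is a minimal sketch using the open-source pyspark-ai package. The table name and the prompt are made up for illustration, and you need to bring your own LLM credentials (it defaults to an OpenAI-backed model).

```python
# Minimal English SDK sketch (pip install pyspark-ai).
# `spark` is the SparkSession a Databricks notebook provides;
# the table name and the prompt below are hypothetical.
from pyspark_ai import SparkAI

spark_ai = SparkAI()  # defaults to an OpenAI-backed LLM; others can be plugged in
spark_ai.activate()   # adds the .ai accessor to Spark DataFrames

df = spark.table("sales.transactions")  # hypothetical Unity Catalog table

# Plain English instead of SQL or PySpark transformations
top_customers = df.ai.transform(
    "the 10 customers with the highest total revenue in 2023"
)
top_customers.show()
```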

To summarize in the words of Ali, Matei, and company:

"Historically, data platforms have been hard for end-users to access and for data teams to manage and govern. Data intelligence platforms are set to transform this landscape by directly tackling both these challenges – making data much easier to query, manage, and govern. In addition, their deep understanding of data and its use will be a foundation for enterprise AI applications that operate on that data."

How does this fit in the Modern Data Platform?

Do we have to change our architecture again? No! Although the term might make you think it will replace the Modern Data Platform, the Data Intelligence Platform just augments it. No new components need to be introduced, but these added generative AI features can make the Databricks platform even more the hub where most of your employees will be doing their work, whether it is Data Engineering, Data Science, Data Analysis, ML Modelling, Data Discovery, Data Governance, etc.

To be fair, Databricks is trying to take over the Modern Data Platform pretty much completely, as they also want to do the Orchestration part (using Databricks Workflows), the Ingestion part (their recent acquisition of Arcion looms large here), the ML Serving part (using the Serverless Hosting capabilities already available now), and the Visualization part (using Databricks Dashboards). However, for now at least, we still believe in segregating these parts across the platform, as tools such as Azure Data Factory (or Airflow) and PowerBI provide a lot more functionality and flexibility.

Did you just sneak Data Governance into the things that people would use Databricks for?

Yeah, the offering that Databricks has at the moment is so strong that not using it for governing your data and managing your metadata would be kind of foolish. We have been big fans of Databricks Unity Catalog since the beginning (see our previous article), and it is clear that Databricks is focusing a lot more on making it as attractive as possible for data governance. It is also obvious that Unity Catalog itself enables a lot of the Generative AI functionalities discussed above. It really is the base of it all.

As a Databricks user, when you want to make use of all of the newest and hottest features, you almost have no other choice but to start migrating to Unity Catalog. You might find it a bit unfortunate that they are "forcing you to switch", but to be fair, granting access to anything and everything (tables, columns, rows, volumes, ML models, etc.) within the Unity Catalog Lakehouse setup is so smooth and powerful that you should switch for that reason alone! Thankfully, migrating to Unity Catalog is very feasible; feel free to let us know if you need help in this regard!
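
As a taste of how smooth that granting gets, here is a small sketch of Unity Catalog grants run from a notebook. The catalog, schema, table, and group names are hypothetical, and the column mask shown was itself still in preview at the time of writing.

```python
# Unity Catalog grants, run from a notebook via spark.sql;
# the catalog/schema/table and group names are hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.transactions TO `data_analysts`")

# Column-level control via a column mask (in preview at the time of
# writing): mask emails for everyone outside a privileged group.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.sales.mask_email(email STRING)
    RETURNS STRING
    RETURN CASE WHEN is_account_group_member('pii_readers') THEN email
                ELSE '***redacted***' END
""")
spark.sql("""
    ALTER TABLE main.sales.customers
    ALTER COLUMN email SET MASK main.sales.mask_email
""")
```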

What else got our blood pumping?

Plenty of things actually, as Databricks is releasing new (and great) stuff at the speed of light!

System Tables

An offering that I am really excited about within the context of Unity Catalog is System Tables. Basically, System Tables provide you with all kinds of data and metadata about your platform. You might already know the information schema (famous in any SQL-like server), which lists information about your tables, columns, etc., but Databricks takes it a step further. They have detailed information available on the lineage of your columns and tables, in-depth DBU cost details, and audit logs showing each and every small action taking place in the platform (such as granting permissions, running a job, etc.). There is also a bunch of new stuff coming, such as predictive I/O logs and node timeline logs (basically telling you how utilized a certain Databricks compute node was).

And the best thing about all of this is that it is free! When you activate these system tables, they appear within your Databricks catalogs. At the moment they get refreshed every hour (with 99% confidence), and the aim is to bring that down to every 5 minutes (again with 99% confidence).
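
As an illustration, once the billing system table is enabled, a DBU cost overview is a single query away. A hedged sketch; the system.billing.usage schema below matches the docs at the time of writing, but verify the column names in your own workspace.

```python
# Querying the billing system table for a 30-day DBU overview.
# Table and column names follow the docs at the time of writing
# (system.billing.usage); verify them in your own workspace.
dbu_per_day = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
""")
display(dbu_per_day)  # display() is the Databricks notebook helper
```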

Project Genie

Now this one is still wildly mysterious; it is difficult to find any information online about it (I am guessing the name might even still change, although I like it 🙂), but I did attend a session at the summit which showed how it works. Based on a certain table or schema, you can create a so-called Data Genie room. This basically pops open a chatbot backed by LLMs that have been augmented with the specific metadata relevant to the data you are dealing with.

What is more, you can also easily upload extra context yourself, for example documentation about implementing Delta Live Tables, and it will immediately incorporate that information when you start asking your next prompts. Apparently, it would also run on the Control Plane, which would make it free to use, although I do think you need to have a SQL warehouse actively running, because otherwise how could it get all of this metadata information? You can also get the results of certain SQL queries on your data, which reinforces my belief that without an active SQL warehouse it just would not work (or at least would not have all the functionalities you would want).

Lakeview Dashboards

So I did mention that for Visualization I still believe PowerBI holds quite a lot of advantages over what Databricks is offering at the moment, but they are creating some nice new tools that work quite well in this scope. There were already some Databricks Dashboards available, but they have now introduced the next level of SQL Dashboards => Lakeview Dashboards. Some quick thoughts:

  • A new visualization engine has been introduced and it looks very smooth and interactive.
  • They have added a grid-like data layout (comparable to Grafana & PowerBI) which works really well.
  • You can share these dashboards with people without the need to share anything else.
  • It uses (of course) Unity Catalog underneath, which allows for things such as lineage to be integrated (you can, for example, see what data is being used underneath, what would need to be refreshed, etc.).

A picture says more than 1000 words, and a GIF is like 1000 pictures. You get the point!

Lakehouse Monitoring (and Alerts!)

I have been pumped about Lakehouse Monitoring since it got introduced at the Data & AI Summit last June. At the time it looked great in the demo, but it was hard to find more information about how it would actually work! Well, that is no longer the case: Lakehouse Monitoring is in Public Preview now, and it actually is really cool. There are three types of so-called monitors you can activate on a table:

      • Snapshot: analyses each version of the table separately
      • Time Series: analyses metrics over different time windows of a time series
      • Inference Profile: analyses model outputs (and also allows comparing different model versions)

A table monitor automatically creates two metric tables, available in a Unity Catalog schema of your choosing, and a dashboard. The profile metrics table contains general summary statistics, and the drift metrics table contains statistics related to potential drift in the data. The dashboard helps visualize the results of the two metric tables.
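
For reference, this is roughly what creating a Time Series monitor looked like with the Public Preview Python API. Treat the exact module and argument names as an assumption (the API may well have changed since), and the table and schema names are made up.

```python
# Sketch of the Public Preview Python API for Lakehouse Monitoring.
# The module and argument names below are an assumption based on the
# preview docs and may have changed; table/schema names are made up.
from databricks import lakehouse_monitoring as lm

monitor = lm.create_monitor(
    table_name="main.sales.transactions",  # table to monitor
    profile_type=lm.TimeSeries(            # or lm.Snapshot() / lm.InferenceLog(...)
        timestamp_col="event_ts",
        granularities=["1 day"],
    ),
    output_schema_name="main.monitoring",  # where the two metric tables land
)
```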

A nice extra is that on top of the monitors, you can also enable an alert if a certain metric exceeds a certain boundary. This was already available in Databricks SQL, but of course it makes a ton more sense to alert on monitoring data than on your actual data. The alert can then send a mail or a Slack notification (and I heard Teams notifications are coming soon as well).

Any and all performance improvements

As data needs are evolving, growing, and exploding (pick the most suitable for you) all the time, it is important that everything keeps running fast without the costs suffering too much. Databricks & Spark work hand in hand, and I am glad that every year there is still a lot of focus on the underlying technology, which is what makes all of this possible.

We have mentioned Predictive I/O already, which accelerates reads and updates to tables. Liquid Clustering is another one: it optimizes your data layout, allows you to easily change clustering keys, and makes incremental optimization possible. Project Lightspeed, introduced in 2022, has been making big steps in improving the latency of streams within Databricks.
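
Liquid Clustering is the one you are most likely to interact with directly. A minimal sketch of what that looks like in Databricks SQL (run here via spark.sql; the table and column names are made up):

```python
# Minimal liquid clustering sketch (Databricks SQL via spark.sql);
# table and column names are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.events (
        event_id BIGINT,
        customer_id BIGINT,
        event_ts TIMESTAMP
    )
    CLUSTER BY (customer_id)  -- liquid clustering instead of static partitioning
""")

# Clustering keys can be changed later without rewriting the whole table
spark.sql("ALTER TABLE main.sales.events CLUSTER BY (customer_id, event_ts)")

# OPTIMIZE incrementally clusters newly written data
spark.sql("OPTIMIZE main.sales.events")
```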

The details of some of these improvements are very technical, but thankfully you will often be using them without even knowing it, simply by upgrading to a newer Databricks Runtime on your clusters!

Bonus rumor: Serverless, Everywhere!

So this one has not been announced explicitly yet, but we have heard that in 2024 they will be working on making more functionalities serverless! Currently, Serverless is available for SQL Warehouses, but they are looking to add it to other things such as Notebooks, Workflows, and Delta Live Tables. Exciting times ahead!


See you there in 2024?

If you ask me, and judging by the attendance numbers, the Data & AI World Tour was a huge success, and it is highly likely it will be back next year in some form. Who knows, maybe Belgium can host one of the events! We are already looking forward to it, and we hope to see a lot of our clients, colleagues, and competitors there!