How Databricks Delta overcomes your Data Lake challenges
What is a Data Lake?
A data lake is a file-based storage system that can hold both structured and unstructured data. Raw data from different sources can be stored in a data lake without defining a schema up front. In a data warehouse, data is first processed and structured according to business needs before it is loaded; a data lake, by contrast, can hold all types of data from many different sources without processing it first or deciding how it will be used later.
Data lakes are very agile storage systems: users are free to decide how they store their data. The stored data is also easy to process, because many different processing engines can access it and can leverage parallel computing to do so.
We need data lakes in order to store all types of data coming from different sources in a cheap, scalable and easy-to-process way.
To avoid ending up with a storage layer where data is simply dumped and later becomes hard to access or even to find, it is important to keep the data well organized across the data lake.
It is a best practice to organize your data lake into different zones:
- Bronze zone - keeps the raw data coming directly from the ingesting sources
- Silver zone - keeps clean, filtered and augmented data
- Gold zone - keeps the business value data
- Sensitive zone - keeps sensitive data, to which users have restricted access
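One way to make the zones tangible is to encode them in the folder layout of the lake. The sketch below is only an illustration of that idea: the root folder "/datalake", the source name "webshop", the dataset name "orders" and the helper dataset_path are hypothetical and not prescribed by any platform.

```python
from enum import Enum
from pathlib import PurePosixPath


class Zone(Enum):
    """The four zones described above."""
    BRONZE = "bronze"        # raw data as ingested
    SILVER = "silver"        # clean, filtered and augmented data
    GOLD = "gold"            # business value data
    SENSITIVE = "sensitive"  # restricted-access data


def dataset_path(root, zone, source, dataset):
    """Build a folder path of the form <root>/<zone>/<source>/<dataset>.

    This naming convention is an assumption made for the example.
    """
    return PurePosixPath(root) / zone.value / source / dataset


# Hypothetical datasets: the same "orders" data in two zones
raw_orders = dataset_path("/datalake", Zone.BRONZE, "webshop", "orders")
clean_orders = dataset_path("/datalake", Zone.SILVER, "webshop", "orders")
```

Keeping the zone as the first folder level makes it straightforward to apply access rules per zone, for example restricting the sensitive zone to a small group of users.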
What is Delta Lake?
Delta Lake is an open-source storage layer, originally developed by Databricks and designed to work with Apache Spark, which runs on top of an existing data lake (Azure Data Lake Store, Amazon S3, etc.). Its core functionality brings reliability to big data lakes by ensuring data integrity with ACID transactions, while still allowing concurrent reads and writes on the same directory or table. ACID stands for Atomicity, Consistency, Isolation and Durability.
- Atomicity: Delta Lake guarantees atomicity through a transaction log in which every fully completed operation is recorded; an operation that did not succeed is not recorded. This ensures that no data is partially written, which could otherwise lead to inconsistent or corrupted data.
- Consistency: Because writes are serializable, data stays available for reading while it is being written, and readers always see a consistent version of the data.
- Isolation: Delta Lake allows concurrent writes to a table, and the resulting Delta table is the same as if all the write operations had been executed one after another (isolated).
- Durability: Data is written directly to disk, so it remains available even after a failure. This is how Delta Lake also satisfies the durability property.
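The role of the transaction log in these properties can be illustrated with a toy model. The sketch below is not Delta Lake's actual log format or API: the class ToyDeltaLog, the folder name "_toy_log" and the file names are hypothetical. It only shows the general idea that an operation counts as committed once its log entry exists, and that two writers competing for the same version cannot both succeed.

```python
import json
import os
import tempfile


class ToyDeltaLog:
    """Toy commit log: one JSON file per committed version (illustrative only)."""

    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0])
                    for name in os.listdir(self.log_dir)
                    if name.endswith(".json")]
        return max(versions, default=-1)

    def commit(self, read_version, added_files):
        """Try to commit version read_version + 1.

        Returns False if another writer already committed that version.
        """
        try:
            # Mode "x" creates the file exclusively and fails if it exists.
            # A real system must also make the content write itself atomic.
            with open(self._path(read_version + 1), "x") as f:
                json.dump({"add": added_files}, f)
        except FileExistsError:
            return False
        return True

    def snapshot(self, version=None):
        """List the data files visible as of a version (default: latest)."""
        if version is None:
            version = self.latest_version()
        files = []
        for v in range(version + 1):
            with open(self._path(v)) as f:
                files.extend(json.load(f)["add"])
        return files


with tempfile.TemporaryDirectory() as table_dir:
    log = ToyDeltaLog(table_dir)
    first = log.commit(-1, ["part-0000.parquet"])    # becomes version 0
    writer_a = log.commit(0, ["part-0001.parquet"])  # becomes version 1
    writer_b = log.commit(0, ["part-0002.parquet"])  # conflict: version 1 is taken
    latest_files = log.snapshot()
    version_0_files = log.snapshot(version=0)
```

In this toy model the rejected writer would have to re-read the latest version and try again, which is what makes the end result look as if the writes had happened one after another. Reading the snapshot at an older version also hints at how a log of this kind can support reading earlier states of a table.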
Azure Databricks has integrated the open-source Delta Lake into its managed Databricks service, making it directly available to its users.
Why do we need Delta?
With the rise of big data, data lakes became a popular storage choice for a large number of organizations. Despite their advantages, a variety of challenges arise as the amount of data stored in a single data lake grows.
element61 believes Delta's features create a big opportunity for anyone starting with a data lake or already running one. Delta is an easy-to-plug layer that sits on top of an Azure Data Lake and allows us to offer you true streaming analytics and big data handling, with benefits such as time travel, metadata handling and ACID transactions.
Contact us if you have more questions on Delta and how to get started with your Data Lake.