Azure Data Lake

What is a Data Lake?

A data lake is a file-based storage system that can hold both structured and unstructured data. Raw data coming from different sources can be stored in a data lake without defining a schema for it first.

Data lakes are agile storage systems: users are free to decide how they organize the data they store. The stored data is also easy to process, because many different processing engines can access it and can apply parallel computing to it.

 

Azure Data Lake Storage

Azure Data Lake Storage is a repository for structured, semi-structured and unstructured data stored in its native format. When data is captured in a data lake, its structure is not defined, so you can store all the data without first deciding on a structure or on the questions you want it to answer.
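The idea of storing data first and defining its structure later (often called schema-on-read) can be sketched in a few lines of Python. This is a minimal illustration that uses a local temporary folder as a stand-in for the lake; the function names `land_raw` and `read_orders`, the file name and the sample records are hypothetical and not part of any Azure API.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, name: str, payload: str) -> Path:
    """Write the payload untouched; no schema is checked on write."""
    target = lake_dir / name
    target.write_text(payload, encoding="utf-8")
    return target


def read_orders(path: Path) -> list:
    """Apply a schema at read time: keep and type only the fields needed."""
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        rows.append({
            "order_id": int(record["order_id"]),
            # Fields missing from a raw record get a default at read time.
            "amount": float(record.get("amount", 0.0)),
        })
    return rows


def demo() -> list:
    # A temporary local folder stands in for the data lake (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        raw = (
            '{"order_id": "1", "amount": "9.5", "note": "gift"}\n'
            '{"order_id": 2}'
        )
        path = land_raw(Path(tmp), "orders.jsonl", raw)
        return read_orders(path)


if __name__ == "__main__":
    print(demo())
```

The two raw records have different shapes, yet both are stored as they arrive. The structure is only imposed by `read_orders`, and a different analysis could read the same file with a different schema.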

 

Azure Data Lake Storage Gen2


Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage and combines the best features of Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Blob Storage. Data Lake Gen2 is storage optimized for big data analytics, allowing you to manage vast amounts of data at a low cost. We recommend using Gen2 rather than Gen1 for any analytical use case.

Azure Blob Storage has a flat namespace, in which users can only simulate a hierarchy (virtual directories) inside a container by using slashes in object names. Azure Data Lake Gen2 takes Azure Blob Storage as its base and extends it with a real hierarchical structure. As a result, an operation on a directory, e.g. a delete, no longer requires listing all the objects in the container to find the ones affected: it can be performed efficiently as a single operation.
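The difference between the two namespace models can be illustrated with a toy model in Python. This is only a sketch of the idea and does not reflect how the Azure services are implemented internally; the class names `FlatStore` and `HierarchicalStore` and the `operations` counter are hypothetical.

```python
from typing import Dict


class FlatStore:
    """Flat namespace: each object is a full key; folders are only prefixes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.operations = 0

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def delete_directory(self, prefix: str) -> None:
        # Every object under the prefix is found by scanning all keys,
        # then deleted one by one.
        matches = [k for k in self.objects if k.startswith(prefix + "/")]
        for key in matches:
            del self.objects[key]
            self.operations += 1


class HierarchicalStore:
    """Hierarchical namespace: directories are real nodes in a tree."""

    def __init__(self) -> None:
        self.root: Dict[str, object] = {}
        self.operations = 0

    def put(self, path: str, data: bytes) -> None:
        *dirs, name = path.split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = data

    def delete_directory(self, path: str) -> None:
        # Walk to the parent directory and unlink the child node once.
        *parents, name = path.split("/")
        node = self.root
        for d in parents:
            node = node[d]
        del node[name]
        self.operations += 1


if __name__ == "__main__":
    files = ["raw/2024/a.csv", "raw/2024/b.csv", "raw/2024/c.csv", "raw/2023/a.csv"]
    flat, tree = FlatStore(), HierarchicalStore()
    for f in files:
        flat.put(f, b"")
        tree.put(f, b"")
    flat.delete_directory("raw/2024")
    tree.delete_directory("raw/2024")
    print(flat.operations, tree.operations)
```

In the flat model the cost of removing a "folder" grows with the number of objects under the prefix, while in the hierarchical model it is one operation on the directory node regardless of how many files it contains.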

 

Key Features

  • Hadoop-compatible access – Data in Data Lake Gen2 can be accessed from Apache Hadoop-compatible environments such as Azure Databricks and Azure Synapse Analytics
  • Hierarchical namespace – Data can be organized in a folder structure, which allows more efficient data access and lets you implement security at folder and file level
  • POSIX permissions – Access to data in Azure Data Lake Gen2 can be controlled with both Azure role-based access control (RBAC) and POSIX-like access control lists (ACLs). ACLs can be set on both files and directories.
  • Scalability – It can store exabytes of data and sustains a high number of input/output operations per second
  • Blob Storage features – It keeps the advantages of Blob Storage, such as low cost, high availability, tiered storage and data replication (across data centers or regions), so the data remains accessible even in case of a disaster
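To make the POSIX-like ACLs mentioned above more concrete, the sketch below parses an ACL written in the short text form (comma-separated `scope:qualifier:permissions` entries) and checks whether a user may perform an action. It is a deliberately simplified illustration, not the service's actual evaluation algorithm: group entries, default ACLs and RBAC are ignored, and the function names, user names and sample ACL are hypothetical.

```python
from typing import Dict


def parse_acl(acl: str) -> Dict[str, str]:
    """Map 'scope:qualifier' to a permission string such as 'r-x'."""
    entries: Dict[str, str] = {}
    for item in acl.split(","):
        scope, qualifier, perms = item.strip().split(":")
        entries[f"{scope}:{qualifier}"] = perms
    return entries


def is_allowed(acl: str, owner: str, user: str, wanted: str) -> bool:
    """Return True if `user` holds every permission letter in `wanted`.

    Simplified assumption: only the owner entry, named-user entries
    (limited by the mask) and the 'other' entry are considered.
    """
    entries = parse_acl(acl)
    if user == owner:
        perms = entries.get("user:", "---")
    elif f"user:{user}" in entries:
        perms = entries[f"user:{user}"]
        mask = entries.get("mask:")
        if mask is not None:
            # A named user's effective permissions are capped by the mask.
            perms = "".join(p if p == m else "-" for p, m in zip(perms, mask))
    else:
        perms = entries.get("other:", "---")
    return all(letter in perms for letter in wanted)


if __name__ == "__main__":
    sample = "user::rwx,user:alice:rw-,group::r-x,mask::r-x,other::---"
    print(is_allowed(sample, owner="bob", user="alice", wanted="r"))
    print(is_allowed(sample, owner="bob", user="alice", wanted="w"))
```

In the sample ACL, the named user has `rw-` but the mask is `r-x`, so only read access is effective. Users without a matching entry fall through to the `other` entry.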

 

When to use it?

  • You need a highly scalable storage solution for high-performance processing and analytics of your data

Data Lake Gen2 is recommended for big data analytics workloads. If your workload does not use the hierarchical namespace (HNS) and you only need general-purpose storage, Blob Storage without HNS is the better choice: it avoids the transaction costs, which are higher (though still economical) when HNS is enabled.

 

Azure Data Lake Storage Gen1


Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) is storage optimized for big data analytics workloads, built as a hierarchical file system compatible with Apache Hadoop. Data can be stored in Azure Data Lake Storage Gen1 in its native format and analyzed with Hadoop analytical frameworks such as MapReduce and Hive.

For all new workloads, we recommend starting with Azure Data Lake Gen2, because it combines the best of Azure Data Lake Gen1 and Azure Blob Storage.

 

Conclusion & expertise

element61 has worked with many customers, across different industries and sizes, to map out their big data challenges and to define a solid and scalable solution for processing and analyzing their data. We have built up extensive knowledge in designing and setting up best-practice data lakes.

More information is available at the Microsoft website.

Contact us for more information on Azure Data Lake!