Azure Data Lake
What is Azure Data Lake?
Azure Data Lake is a service offered by Microsoft Azure to store and analyze huge amounts of data. It consists of two components:
- Azure Data Lake Storage
- Azure Data Lake Analytics
Azure Data Lake Storage
Azure Data Lake Storage is a repository for structured, semi-structured and unstructured data stored in its native format. When data is captured in a data lake, its structure is not defined up front, which makes it easy to store all the data without first having to define a schema or the questions you want to answer from it.
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen1 (formerly known as Azure Data Lake Store) is storage optimized for big data analytics workloads, built as a hierarchical file system compatible with Apache Hadoop. Data stored in Azure Data Lake Storage Gen1 can remain in its native format, and we can analyze it with Hadoop analytical frameworks such as MapReduce and Hive.
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage and combines the best features of Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Blob Storage. We recommend using Gen2 over Gen1 for any analytical use case.
What does it offer?
- brings the hierarchical namespace (Hadoop-compatible file system) from ADLS Gen1 to Blob Storage, allowing more efficient data access
- keeps the advantages of Blob Storage, such as low cost, high availability, tiered storage, and data replication (across data centers or regions) to keep data accessible even in case of a disaster
- Azure Active Directory integration and POSIX-style ACL permissions
When to use it?
- You need a highly scalable storage solution for high-performance processing and analytics of your data
Azure Data Lake Analytics
Azure Data Lake Analytics (ADLA) is a platform service offered by Microsoft that is used to process and analyze petabytes of data. Its main use is running U-SQL queries on top of a data lake.
- Process and analyze extremely large datasets
- Parallel processing
- Structured and unstructured data
- Instant scaling
- Run on-demand jobs
- Pay per job
- Uses U-SQL queries
- No cluster configuration
- Integration with Azure Active Directory
When to use it?
- You need batch processing of large datasets
- You want a pay-as-you-go service (pay per job run)
- You don't want to deal with cluster configuration
- You are familiar with .NET or T-SQL
- You want to query data from different Azure data stores without moving the data
Example use cases:
- Prepare and transform large amounts of data to be used for a predictive model
- Prepare large amounts of data to be ingested into a data warehouse
- Get fast insights from huge datasets (structured or unstructured)
How to get started?
- Create a Data Lake Analytics account.
- Connect to your source data; Azure Data Lake Storage gives the highest performance, but Azure Blob Storage, Azure SQL Database and Azure SQL Data Warehouse are also supported.
- Develop a U-SQL script that ingests data from the source, processes it, and saves the output to ADLS or Blob Storage.
- Submit the U-SQL script as a job to your Data Lake Analytics account.
- Monitor the job.
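The develop-a-script step above can be sketched as a minimal U-SQL script; the file paths and column names below are hypothetical, purely for illustration:

```sql
// Hypothetical example: read raw CSV data from the data lake,
// aggregate it, and write the result back to a curated folder.
// Paths and the schema are illustrative assumptions, not from the article.
@sales =
    EXTRACT Region string,
            Amount decimal
    FROM "/raw/sales.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Schema-on-read: the structure is declared only at query time.
@totals =
    SELECT Region,
           SUM(Amount) AS TotalAmount
    FROM @sales
    GROUP BY Region;

OUTPUT @totals
TO "/curated/sales_by_region.csv"
USING Outputters.Csv(outputHeader: true);
```

Submitting this script as a job to the Data Lake Analytics account runs it without any cluster setup; you only pay for the job run.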
Conclusion & expertise
Azure Data Lake offers highly scalable storage for any type of data through Azure Data Lake Storage. You can process and analyse these large amounts of data either with on-demand clusters or with a pay-per-job service, Azure Data Lake Analytics.
element61 has worked on many occasions with customers (different industries and sizes) mapping out challenges with regards to big data and defining a solid and scalable solution to process and analyse their data. We have built up extensive knowledge in designing and setting up different components of Azure Data Lake.
More information is available at the Microsoft website.
Contact us for more information on Azure Data Lake!