Building a Data Lake
What's a Data Lake?
A Data Lake is a file-based system in which we organize all our data, whether small or big, structured or unstructured. By nature it can store files of any format, including pictures, videos, documents, and raw files (JSON, XML, TXT, CSV).
The benefit of a Data Lake is that file-based storage is cheap, which makes it possible to store data that previously was not kept. However, a Data Lake doesn't remove the need for a traditional BI warehouse: a modern data platform includes a Data Lake as well as a traditional data warehouse (DWH) for structured reporting and dashboarding.
What's the structure of a Data Lake?
When setting up your Data Lake, it's important to have a structure from day 1.
Based on our experience, we recommend setting up 3 zones within your Data Lake:
- Landing zone to copy the source data
- Gold zone for storing cleaned data or derived datasets
- Working zone per project or per team
Figure: Example structure within a Data Lake
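As a rough illustration, the three zones can be sketched as a folder layout. This is a minimal sketch, not a prescribed implementation: the folder names ("landing", "gold", "working"), the team names, and the helper create_lake_layout are hypothetical, and a local temporary directory stands in for whatever storage your Data Lake actually uses.

```python
from pathlib import Path
import tempfile

# The three zones described above. The folder names themselves are
# illustrative assumptions, not a required naming convention.
ZONES = ("landing", "gold", "working")


def create_lake_layout(root: Path, teams: list[str]) -> list[Path]:
    """Create the zone folders under `root` and return the created paths."""
    created = []
    for zone in ZONES:
        if zone == "working":
            # One working sub-folder per project or team.
            targets = [root / zone / team for team in teams]
        else:
            targets = [root / zone]
        for path in targets:
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory stands in for the real lake storage.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_lake_layout(Path(tmp), ["marketing", "finance"]):
            print(path.relative_to(tmp))
```

Keeping the working zone split per team or project means experiments stay separate from the copied source data in the landing zone and from the cleaned datasets in the gold zone.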