Building a Data Lake
What's a Data Lake?
A Data Lake is a file-based storage system where we organize all our data, whether small or big, structured or unstructured. By nature, it can store any file format, including pictures, videos, documents, and raw files (JSON, XML, TXT, CSV).
The benefit of a Data Lake is that file-based storage is cheap, which makes it possible to store data that was previously not kept. However, a Data Lake doesn't remove the need for a traditional BI warehouse: a modern data platform includes a Data Lake as well as a traditional data warehouse (DWH) for structured reporting and dashboarding.
What's the structure of a Data Lake?
When setting up your Data Lake, it's important to have a good structure from day 1.
Based on our experience, we recommend setting up 3 zones within your Data Lake:
- Landing zone to copy the source data
- Gold zone for storing cleaned data or derived datasets
- Working zone per project or per team
Figure: Example structure within a Data Lake
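As a rough illustration of the three zones, here is a minimal Python sketch that creates them as folders on a local file system. The folder names (landing, gold, working), the function name create_lake_layout and the team names are assumptions made for this example; a real Data Lake would usually live on object storage or a distributed file system, and your naming conventions may differ.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names for the three zones described above.
ZONES = ("landing", "gold", "working")


def create_lake_layout(root, teams=()):
    """Create the three zone folders under root, plus one working folder per team.

    Returns a dict mapping each zone name to its path.
    """
    root = Path(root)
    zones = {}
    for zone in ZONES:
        zone_dir = root / zone
        # exist_ok=True makes the function safe to run more than once.
        zone_dir.mkdir(parents=True, exist_ok=True)
        zones[zone] = zone_dir

    # The working zone gets one subfolder per project or team.
    for team in teams:
        (zones["working"] / team).mkdir(parents=True, exist_ok=True)

    return zones


if __name__ == "__main__":
    # Build the layout in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        layout = create_lake_layout(tmp, teams=["marketing", "finance"])
        for path in sorted(Path(tmp).rglob("*")):
            print(path.relative_to(tmp))
```

Keeping the working folders separate per team means experiments stay isolated from the landing and gold zones, which hold the copied source data and the cleaned datasets.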