What is “Unity Catalog”?
Databricks is increasingly used as a platform for data engineering & data science. But as the old saying goes, “with great power comes great responsibility”: organizations also expect Databricks to support them on Data Governance, i.e. keeping a governed overview of all data assets, managing data access, and tracking data quality & lineage.
Unity Catalog is Databricks' solution for Data Governance: it enables governance and security management through ANSI SQL or the web UI, and offers a tool to:
- Control data access across data assets: tables, files & notebooks
- Create a data catalog of data assets present in the Lakehouse
- Offer a lineage view on data origin & usage across jobs/notebooks running in Databricks
- Allow data sharing in a governed, managed way
- And, over time, much more
In a nutshell, Unity Catalog is a unified governance solution for all data and AI assets including files, tables, and machine learning models in your lakehouse on any cloud.
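To make the SQL-based access management mentioned above more concrete, here is a minimal sketch that builds the kind of GRANT statements Unity Catalog uses with its three-level namespace (catalog.schema.table). The catalog, schema, table, and group names are hypothetical, and `grant_read_access` is an illustrative helper of our own, not part of any Databricks API.

```python
def grant_read_access(catalog: str, schema: str, table: str, principal: str) -> list[str]:
    """Build the SQL statements that let `principal` read one table.

    Reading a table needs USE CATALOG and USE SCHEMA on its parents,
    plus SELECT on the table itself. Inputs are assumed to be trusted
    identifiers; this sketch does no escaping.
    """
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical names, for illustration only.
statements = grant_read_access("main", "sales", "orders", "data-analysts")
for stmt in statements:
    print(stmt)  # in a Databricks notebook, this could be spark.sql(stmt)
```

Because the grants are defined once at account level, the same statements would apply to every workspace attached to the metastore.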
What does Unity Catalog offer?
Unity Catalog supports the following features:
- Centralized access management for data assets: Unity Catalog works across multiple Databricks workspaces, so access policies can be defined at account level and enforced across all workloads. As such, Unity Catalog lifts the burden of governance from each workspace and allows workspaces to communicate with each other (further explained in “under the hood”). This fits the Data Mesh architecture that many large organizations are striving for.
- Fine-grained access control: Unity Catalog lets you define access controls (i.e., who can see what) down to row level and supports column masking. All of this is possible through standard SQL functions.
- Data Catalog, and thus searchability and discoverability of assets: With Unity Catalog, data users (analysts/scientists/engineers) can use the Data Explorer view to quickly find relevant data assets. The search respects security restrictions: results only include the assets that your access controls (i.e. ACLs) allow you to see.
- Automated lineage of all workloads: Unity Catalog gives end-to-end visibility of how data flows through the Lakehouse, visualized as data lineage graphs. These graphs are access-control-aware: what a user sees in a lineage graph is limited by that user's access. One can view the lineage in the UI or make API calls to integrate it with other catalogs.
Note that this feature is technically quite impressive: Databricks derives lineage automatically from the Spark code that runs in notebooks, so nothing has to be documented by hand.
- (Delta) Data sharing across organizations: Unity Catalog can act as a Delta Sharing server. Delta Sharing, an open protocol for data sharing across platforms, is an easily maintainable, scalable way to share data without replicating it, unlike other approaches (FTP/SSH/API servers).
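Coming back to the fine-grained access control feature in the list above, the following sketch shows how a row filter and a column mask can be expressed as SQL functions and attached to a table. The table, function, column, and group names are hypothetical, and the two Python helpers are our own illustration; they only build the SQL text, which would then be executed in Databricks.

```python
def row_filter_sql(table: str, func: str, column: str, value: str, admin_group: str) -> list[str]:
    """SQL to show non-admins only the rows where `column` equals `value`.

    Inputs are assumed to be trusted identifiers/literals (no escaping here).
    """
    return [
        # The filter function returns TRUE for rows the caller may see.
        f"CREATE OR REPLACE FUNCTION {func}({column} STRING) "
        f"RETURN is_account_group_member('{admin_group}') OR {column} = '{value}'",
        # Attach the function to the table as its row filter.
        f"ALTER TABLE {table} SET ROW FILTER {func} ON ({column})",
    ]


def column_mask_sql(table: str, func: str, column: str, reader_group: str) -> list[str]:
    """SQL to hide `column` from everyone outside `reader_group`."""
    return [
        # The mask function returns the real value or a redacted placeholder.
        f"CREATE OR REPLACE FUNCTION {func}({column} STRING) "
        f"RETURN CASE WHEN is_account_group_member('{reader_group}') "
        f"THEN {column} ELSE '***' END",
        # Attach the function to the column as its mask.
        f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {func}",
    ]


# Hypothetical names, for illustration only.
filter_stmts = row_filter_sql(
    "main.sales.orders", "main.sales.region_filter", "region", "EU", "admins"
)
mask_stmts = column_mask_sql(
    "main.sales.orders", "main.sales.email_mask", "email", "pii-readers"
)
for stmt in filter_stmts + mask_stmts:
    print(stmt)  # in a Databricks notebook, this could be spark.sql(stmt)
```

With both in place, a user outside the `admins` group would only see rows for the `EU` region, and a user outside `pii-readers` would see `***` instead of the email address.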
Eager to know more?
If you use Databricks, it’s definitely worthwhile to look into Unity Catalog and the features it supports. To learn more about Unity Catalog and best practices for implementing it, get in touch with us.