What is “Unity Catalog”?
Databricks is increasingly used as a platform for data engineering & data science. But as the old saying goes, “with great power comes great responsibility”: organizations expect Databricks to support them on Data Governance as well, i.e. keeping a governed overview of all data assets, managing data access, and tracking data quality & lineage.
Unity Catalog is Databricks' solution for Data Governance: it enables governance and security management through ANSI SQL or the web UI, and lets you:
- Control access to data assets such as tables, files & notebooks
- Create a data catalog of the data assets present in the Lakehouse
- Get a lineage view of data origin & usage across the jobs/notebooks running in Databricks
- Share data in a governed, managed way
- And, over time, much more
In a nutshell, Unity Catalog is a unified governance solution for all data and AI assets including files, tables, and machine learning models in your lakehouse on any cloud.
What does Unity Catalog offer?
Unity Catalog supports the following features:
- Centralized access management for data assets: Unity Catalog works across multiple Databricks workspaces, so access policies are defined once at account level and enforced across all workloads. As such, Unity Catalog lifts the burden of governance from each individual workspace and allows workspaces to communicate with each other (further explained in “under the hood”). This fits the Data Mesh architecture many big organizations are striving for.
- Fine-grained access control: Unity Catalog lets you set Access Control Lists (i.e., who can see what) down to row level and supports column masking. All of this is possible through standard SQL functions.
- Data catalog, and thus searchability and discoverability of assets: with Unity Catalog, data users (analysts/scientists/engineers) can use the Data Explorer view to quickly find the data assets relevant to them. The search feature inherits security restrictions, so results only include the assets your access controls (i.e. ACLs) allow you to see.
- Automated lineage of all workloads: Unity Catalog gives end-to-end visibility of how data flows through the Lakehouse, visualized as data lineage graphs. These graphs are access-control-aware: what a user sees in a lineage graph is restricted by that user’s access. The lineage can be viewed in the UI or retrieved through API calls to integrate it with other catalogs.
Note that this feature is technically quite impressive: Databricks analyzes the underlying Spark code written in notebooks to document lineage automatically.
- (Delta) Data sharing across organizations: Unity Catalog can act as a Delta Sharing server. Delta Sharing is an open protocol for sharing data across any platform. It is easy to maintain, scales well, and, unlike other approaches (FTP/SSH/API servers), does not require replicating the data.
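To make the fine-grained access control point from the list above more concrete, here is a minimal sketch in Python that builds the kind of SQL statements involved: a table-level grant, a row filter and a column mask. The helper names, the table, column and group names (main.sales.orders, analysts, and so on) are hypothetical and chosen only for illustration; the SQL follows the Databricks syntax for GRANT, SET ROW FILTER and SET MASK as we understand it, so check it against the current documentation before relying on it.

```python
# Sketch only: these helpers just build SQL strings. All object and group
# names passed in are hypothetical. Values are interpolated without escaping,
# so use trusted input only.


def grant_select(table: str, principal: str) -> str:
    """Table-level ACL: allow a user or group to read one table."""
    return f"GRANT SELECT ON TABLE {table} TO `{principal}`"


def row_filter_statements(
    table: str, function: str, column: str, visible_value: str, admin_group: str
) -> list[str]:
    """Row-level security: members of admin_group see every row,
    everyone else only rows where column equals visible_value."""
    return [
        f"CREATE OR REPLACE FUNCTION {function}({column} STRING) "
        f"RETURN IF(is_account_group_member('{admin_group}'), true, "
        f"{column} = '{visible_value}')",
        f"ALTER TABLE {table} SET ROW FILTER {function} ON ({column})",
    ]


def column_mask_statements(
    table: str, function: str, column: str, cleared_group: str,
    placeholder: str = "****",
) -> list[str]:
    """Column masking: members of cleared_group see the real value,
    everyone else sees the placeholder."""
    return [
        f"CREATE OR REPLACE FUNCTION {function}({column} STRING) "
        f"RETURN CASE WHEN is_account_group_member('{cleared_group}') "
        f"THEN {column} ELSE '{placeholder}' END",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {function}",
    ]


if __name__ == "__main__":
    statements = [grant_select("main.sales.orders", "analysts")]
    statements += row_filter_statements(
        "main.sales.orders", "main.sales.region_filter", "region", "EU", "admins"
    )
    statements += column_mask_statements(
        "main.sales.customers", "main.sales.email_mask", "email", "support"
    )
    for statement in statements:
        print(statement + ";")
```

In a Databricks notebook attached to a Unity Catalog-enabled cluster, each generated statement could be passed to spark.sql(...) or pasted into the SQL editor; the same policies can also be set up through the web UI.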
Eager to know more?
If you are using Databricks, it’s definitely worthwhile to look into Unity Catalog and the features it supports. To know more about Unity Catalog and best practices for implementing it, get in touch with us.