In a time when data has become more and more prevalent, managing your sources and the information about them becomes pivotal as well. Many people spend a lot of time looking for the right data and getting the right value out of it. How wonderful would it be to bridge this gap between IT and business by having a dictionary of knowledge about your data estate, and to dig into it by performing a semantic search over your metadata or by navigating the data lineage?
An exciting Azure Service to manage information regarding data sources is the latest version of Azure Data Catalog. It is used for data management and data governance. It can be applied to cloud environments as well as on-premise data sources.
It is a service to consolidate and centralize information regarding all registered data sources within the organization. All Azure data sources can be added to the catalog, such as Azure SQL DB, Azure Data Lake Storage, Synapse, Databricks, Blob Storage and even Power BI, as well as non-Azure data sources such as Oracle, Teradata, Amazon S3 and many more. (Non-Azure sources may not be available yet at release.)
What is Azure Data Catalog?
Azure Data Catalog is a service that enables business analysts to search for data using their own familiar business terms. Technical users, on the other hand, can view the metadata of data sources and the lineage of data defined in SSIS or Azure Data Factory in a central catalog. Owners and experts can provide additional annotations to share their knowledge regarding specific data. With the catalog, multiple data sources are easily discoverable and can be made understandable by the users who manage the data.
With the Azure Data Catalog service, you can register and scan data sources and populate the Data Catalog. This catalog will contain all metadata and can be enriched with descriptions, tags, information or even documentation. Azure Data Catalog can even classify the data so you can identify where certain data resides, such as names, email addresses, phone numbers, credit card numbers, social security numbers, dates, passwords and spatial data.
The catalog consolidates and centralizes information regarding data sources and data. However, in order to retrieve and use the data itself, access still needs to occur with a tool of choice and with sufficient rights. It is comparable to a library, where you can search a catalog for books based on specific terms, retrieve a book's location and learn more about it without holding the book or knowing the exact words written in it. You still need to go to the location of the book and take it in your hands in order to read it. At least with a catalog it is clear where to find the book. Imagine going through the library section by section, not knowing where to look for it.
What is the difference with the previous version?
Microsoft has included enhancements compared to the previous version, specifically in the user interface, its process and its design. These enhancements make it easier for business and technical users to find relevant data, and data governance can be managed more effectively. Integration with Azure data sources such as Synapse has improved as well, and many more types of data sources have been introduced.
Who uses the Data Catalog?
An enterprise consists of business and technical users who will be able to create and maintain content for the catalog. These users are typically data producers for the Data Catalog.
- Business users can provide details regarding business processes and give meaning to data that supports those processes. This enables other users within the enterprise to search on key terms that they are familiar with and quickly find where that data or information resides.
- On the other hand, during the creation and setup of databases some fields might receive a description which does not make sense right away, especially for someone who is not familiar with the source tables. In that case, technical users can provide extra information regarding each field in order to make it understandable.
- The other type of user is the consumer. Consumers search within the Data Catalog and explore based on their needs. The profile of these users ranges from Business Analysts, Technical Analysts, Application Developers and Data Scientists to BI Developers. For them it is not always clear where they can find certain data or in which sources they need to look. With the Data Catalog, searching and exploring takes less time and effort.
How do you benefit from this service?
Knowledge regarding business processes and data sources within an enterprise is usually limited to the teams that use these sources daily, or the information is stored in a restricted location and is hard to get. Now imagine that a migration project needs to occur, a new member joins the team, other departments such as Marketing or Finance become interested in your specific data, or the enterprise wants to become more data-driven, align end-to-end processes and create cross-functional teams.
As a business user: with the Data Catalog, getting the information you need about your data has never been easier. Avoid countless meetings to align with the experts or technical users; search straight from the Data Catalog and retrieve all information regarding the data you need.
As a technical user: when you need to analyze the lineage of ETL processes, you can easily search by certain key terms, retrieve all information that has been provided regarding those processes and consult where the data is applied. (not available yet)
As a Data Catalog grows and contains more and more information, all significant data sources and their data will be well documented and easily found when needed. However, in order to make the Data Catalog work, it is essential that information is produced effectively and that several users carry the responsibility for maintaining the Data Catalog.
Azure Data Catalog Insights
A nice feature included in the Data Catalog is Azure Data Catalog Insights. This feature provides visualizations regarding all sources and their scanned metadata. The classification of the data types is visualized as well and gives a great overview of all types and their occurrence.
Azure Data Catalog provides many types of security roles in order to manage permission to register and scan data sources or define who can contribute to the Data Catalog. Azure Data Catalog uses Role Based Access Control which consists of five roles:
- Catalog Administrator: Call all APIs on the catalog; is not an owner
- Data Source Administrator: Responsible for setting up scans
- Curator: Responsible for editing content
- Contributor: Read only access
- Automated data source process: currently primarily used for ADF to push lineage into the catalog.
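The role list above can be pictured as a mapping from roles to permitted actions. The Python sketch below is purely illustrative: the action names (`set_up_scan`, `edit_content`, `read`, `push_lineage`) and the `can` helper are hypothetical and not part of any Azure SDK, and the mapping encodes only what the list states, nothing more.

```python
from enum import Enum


class Role(Enum):
    CATALOG_ADMINISTRATOR = "Catalog Administrator"
    DATA_SOURCE_ADMINISTRATOR = "Data Source Administrator"
    CURATOR = "Curator"
    CONTRIBUTOR = "Contributor"
    AUTOMATED_DATA_SOURCE_PROCESS = "Automated data source process"


# Hypothetical action names; "*" stands for "all catalog APIs".
ROLE_ACTIONS = {
    Role.CATALOG_ADMINISTRATOR: {"*"},
    Role.DATA_SOURCE_ADMINISTRATOR: {"set_up_scan"},
    Role.CURATOR: {"edit_content"},
    Role.CONTRIBUTOR: {"read"},
    Role.AUTOMATED_DATA_SOURCE_PROCESS: {"push_lineage"},
}


def can(role: Role, action: str) -> bool:
    """Return True if the role's entry in ROLE_ACTIONS covers the action."""
    allowed = ROLE_ACTIONS[role]
    return "*" in allowed or action in allowed


print(can(Role.CURATOR, "edit_content"))               # True
print(can(Role.CONTRIBUTOR, "edit_content"))           # False
print(can(Role.CATALOG_ADMINISTRATOR, "set_up_scan"))  # True
```

In the real service, role assignments are managed in Azure rather than in code; the sketch only shows how the responsibilities are separated.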
How to utilize Azure Data Catalog?
Below is an example of how to register and scan a data source, how to add content via the Glossary terms and how to review the classification of the data.
First, create an Azure Data Catalog account as a resource in the Azure subscription. Then launch the Azure Data Catalog account as shown below to go to the external portal. Then select “Manage your data” to go to the management of the data sources.
1. Register and Scan Data Source
Multiple sources can be selected to scan as shown in the screenshot below. Azure SQL Database will be used as an example.
Register Azure SQL Database by clicking New and connecting to the designated database.
Create a new scan and select the tables to scan. The screenshot below is the result of a created scan. It is possible to schedule a scan to occur periodically or to let it run just once. A schedule is preferable when the data source is volatile.
When a scan is finished, Azure Data Catalog will have identified the metadata and the classification of the data assets.
Classifications are applied to formal data, which is usually defined by government rules or consists of fixed formats such as an email address or phone number. If a classification is not provided by default, it is always possible to add a custom classification that applies to your data. After adding a new classification, the scan can be run again to apply the new classification to the data.
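To make the idea of format-based classification concrete, here is a minimal Python sketch. It does not reproduce how Azure Data Catalog classifies data internally; the regular expressions, the `classify_column` helper, the 80% match threshold, the sample columns and the `BE_POSTAL_CODE` custom rule are all assumptions chosen for illustration.

```python
import re

# Simplified, illustrative patterns; real classifiers are far more thorough.
DEFAULT_RULES = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s\-]{7,14}\d"),
}


def classify_column(values, rules, threshold=0.8):
    """Return the rule names that fully match at least `threshold` of the non-empty values."""
    samples = [v for v in values if v]
    if not samples:
        return []
    labels = []
    for name, pattern in rules.items():
        hits = sum(1 for v in samples if pattern.fullmatch(v))
        if hits / len(samples) >= threshold:
            labels.append(name)
    return labels


# Hypothetical sample values for two columns.
columns = {
    "EMAIL_ADDRESS": ["an@example.com", "bo@example.org"],
    "POSTAL_CODE": ["1000", "2800", "9000"],
}

# First "scan": default rules only, so POSTAL_CODE stays unclassified.
first_scan = {col: classify_column(vals, DEFAULT_RULES) for col, vals in columns.items()}

# Add a custom classification, then "scan" again.
custom_rules = {**DEFAULT_RULES, "BE_POSTAL_CODE": re.compile(r"[1-9]\d{3}")}
second_scan = {col: classify_column(vals, custom_rules) for col, vals in columns.items()}

print(first_scan)   # {'EMAIL_ADDRESS': ['EMAIL'], 'POSTAL_CODE': []}
print(second_scan)  # {'EMAIL_ADDRESS': ['EMAIL'], 'POSTAL_CODE': ['BE_POSTAL_CODE']}
```

The second pass mirrors the re-scan described above: the column that had no default classification picks up the custom one only after the rule is added and the scan runs again.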
2. Consult, Search and Edit content
Click Browse by asset type on the home page to consult the content.
In this example we will explore Azure SQL Database.
The following shows details regarding the Azure SQL Database scan, such as server, schemas and tables. Each aspect of the SQL Database can be annotated with additional information such as description, owner, expert and so on. Just click on any given asset and a new screen opens where you are free to review and add information regarding that asset.
Let’s take a deeper look at the D_CUSTOMER table.
The screen below shows an overview of the table. By clicking edit, it is possible to add a description and classifications. The hierarchy, and thus the origin of the table, is shown as well.
By selecting the Schema tab, an overview of all the fields is shown. Here you can edit and add a description or a custom classification as well. Fields which have been recognized by the default classification are already set with the corresponding classification.
In the Contacts tab an expert and the owners of this object can be defined as well. This is significant if more information is needed in person. Data Governance can be applied with the ownership of objects.
The Related tab gives a great overview of all relationships to the selected object. The view shows the designated object in the middle and links to all objects which are related. These objects are schemas, tables, columns, etc. By clicking on the linked objects, more information appears on the left. You can click through on these links as well and view each column in more detail.
On the homepage you can find a search bar. Fill in some keywords and a list of all objects containing those keywords quickly shows up in the search results. This is an easy route if you are looking for an object and are not sure which source to consult.
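Conceptually, such a search matches keywords against the metadata in the catalog rather than against the data itself. The Python sketch below illustrates this with an in-memory list; it is not the catalog's actual search implementation. The `Asset` structure, the `search` helper and the `F_SALES` and `customers.csv` entries are hypothetical; only `D_CUSTOMER` comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    asset_type: str
    source: str
    description: str = ""
    terms: list = field(default_factory=list)


def search(assets, keyword):
    """Return assets whose name, description or glossary terms contain the keyword (case-insensitive)."""
    needle = keyword.lower()
    results = []
    for asset in assets:
        haystack = " ".join([asset.name, asset.description, *asset.terms]).lower()
        if needle in haystack:
            results.append(asset)
    return results


# A tiny hypothetical catalog spanning two sources.
catalog = [
    Asset("D_CUSTOMER", "table", "Azure SQL Database", "Customer dimension", ["Customer"]),
    Asset("F_SALES", "table", "Azure SQL Database", "Sales transactions", ["Revenue"]),
    Asset("customers.csv", "file", "Azure Data Lake Storage", "Raw customer export"),
]

for hit in search(catalog, "customer"):
    print(f"{hit.name} ({hit.asset_type}) in {hit.source}")
```

Searching for "customer" returns both the table and the file, which is exactly the situation where a catalog helps: the user does not need to know beforehand which source holds the object.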
3. Azure Data Catalog Insights
A great feature provided by Azure Data Catalog is the visualisation of the catalog. The visualisation shows the asset information, where you can see all data sources and the quantity, a view of the scanned asset types over time and if files are included, these are shown as well by file type.
(screenshot of asset information, no useable data available for now)
Information regarding scanning is visualised as well. Here you see an overview of all scan operations by status over time.
Information regarding the Data Catalog itself is shown below. This overview shows the Top glossary terms and count of assets, number of glossary terms, number of catalog users and the top roles and count of users.
The features shown here are based on the selection of one table. Imagine setting up the Data Catalog to register and scan multiple databases, file storages, data lakes, etc., and then providing all classifications and terms where needed. Doing so creates a full Data Catalog which can be used and shared within the organization. Information about data, and about information in the form of reports and dashboards, can become a valuable tool that helps many key roles within the organization fulfil their daily data-driven tasks.
We find many beneficial features in this promising new generation of Azure Data Catalog! Azure Data Catalog will help to consolidate, centralize and manage information about your data estate. It will ease knowledge sharing between existing and new team members, but also across functional teams! Finally, Azure Data Catalog will help to align end-to-end processes and the data needs that span departments. This way, people can succeed in discovering the right data sets and unlock their full potential in an efficient, timely and elegant manner.
Keen to know more?
Continue reading or contact us to get started:
- Automate your Data Warehouse tasks in Azure
- A best-practice Modern Data Platform with Azure Databricks and Delta
- When to use Azure Synapse Analytics and/or Azure Databricks?
For more insights & research, visit the element61 Knowledgebase