Customer Identity Resolution in Databricks using Zingg

What is Customer Identity Resolution and why should you care?

Companies and organizations gather more and more data about prospects and customers from varied sources. To fully understand the behaviour of these prospects and customers, it is crucial to be able to link the same person across data sources.

This is a complicated task: customer data sources often rely on error-prone manual input, customers may enter different information from one source to another, and some customer information changes over time.

However, machine-learning techniques can be used to perform customer identity resolution (1).

Databricks and Zingg offer an easy-to-implement ML-based solution

In this insight, we will focus on one ML-powered tool: Zingg. It can be used inside Databricks to create a 360-degree view of the customer. 

Zingg uses active learning. Unlike traditional supervised learning models, with active learning, the initial model is trained on a small labelled dataset. The training dataset is generated through multiple iterations and active interaction with a “human labeller”: the algorithm finds interesting pairs to label, the user labels these pairs, saves the labelled pairs and starts the process again until a satisfactory number of pairs has been labelled. 
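The idea of "finding interesting pairs to label" can be illustrated with a toy sketch. It uses generic uncertainty sampling (pick the pairs whose score is closest to 0.5), which is an assumption made for illustration and not a description of Zingg's internal selection logic. The function name and the scores are hypothetical.

```python
def select_uncertain_pairs(scored_pairs, k=2):
    """Return the k pairs whose match score is closest to 0.5.

    scored_pairs: list of (pair_id, score) tuples with score in [0, 1].
    Pairs near 0.5 are the ones the current model is least sure about,
    so a human label for them is likely to be the most informative.
    """
    return sorted(scored_pairs, key=lambda p: abs(p[1] - 0.5))[:k]


# Hypothetical scores from a model trained on the labels gathered so far.
candidates = [("a", 0.97), ("b", 0.52), ("c", 0.08), ("d", 0.41)]
to_label = select_uncertain_pairs(candidates, k=2)
print(to_label)  # pairs "b" and "d" are shown to the human labeller
```

Each labelling round would then add the newly labelled pairs to the training set before the next selection.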

Note that, as with any machine learning model, results will never be 100% accurate. The model will introduce some false positives (records flagged as one person that in reality belong to different people) and false negatives (records that belong to the same person but are not flagged as such).
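When a set of known true matches is available (an assumption: this requires some ground truth, for example a manually verified sample), false positives and false negatives can be counted at the level of record pairs. The sketch below is a generic evaluation helper with hypothetical names and data, not part of Zingg.

```python
def pair(a, b):
    """Order-independent representation of a record pair."""
    return frozenset((a, b))


def pair_metrics(predicted, actual):
    """Compare predicted matching pairs against known true pairs."""
    true_pos = len(predicted & actual)
    false_pos = len(predicted - actual)   # flagged as one person, but different
    false_neg = len(actual - predicted)   # same person, but not flagged
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical record ids.
predicted = {pair(1, 2), pair(3, 4), pair(5, 6)}
actual = {pair(2, 1), pair(3, 4), pair(7, 8)}
metrics = pair_metrics(predicted, actual)
print(metrics)
```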

Zingg handles the machine learning and the Spark processing under the hood, so extensive knowledge of either is not required. Once the library is correctly installed on a Databricks cluster, the solution can be implemented in a single Python notebook using the Zingg Python client.
A typical development of the solution consists of the steps described below.

Preparing the Data

The first step will be to create a table containing customer records from all sources. This will most probably require some data cleaning to harmonize data from the different sources. All the common characteristics (first name, last name, birth date, email address, postcode, …) should be referenced in this table.
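As a minimal sketch of this harmonization step, the snippet below renames source-specific columns to a shared schema and tidies string values. The source names, column names and cleaning rules (trimming and lower-casing) are hypothetical; on Databricks the same logic would more likely be expressed on Spark DataFrames.

```python
def harmonize(record, source, column_map):
    """Map one source record onto the common schema.

    column_map: {source_column_name: common_column_name}
    String values are trimmed and lower-cased; other values are kept as-is.
    """
    out = {"source": source}
    for source_col, common_col in column_map.items():
        value = record.get(source_col)
        out[common_col] = value.strip().lower() if isinstance(value, str) else value
    return out


# Hypothetical record from a hypothetical "crm" source.
crm_record = {"FirstName": " Anna ", "LastName": "Smith", "Zip": "28205"}
crm_map = {"FirstName": "givenname", "LastName": "surname", "Zip": "postcode"}

row = harmonize(crm_record, "crm", crm_map)
print(row)
```

Keeping a `source` column makes it possible to trace each harmonized record back to where it came from.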

To illustrate the approach, we will use the North Carolina Voters dataset provided by the Database Group Leipzig. It consists of real person records from the North Carolina voters registry, drawn from 5 different sources. Each source is duplicate-free, but a portion of the entities are replicated across sources and some records were intentionally corrupted by adding typos. Although this dataset contains voters rather than customers, the approach to matching records is identical.

The data looks as follows:

Input data

Notice that some records contain typos in the given name, surname, suburb or postcode columns. To uniquely identify every record, we added the column uuid to the original dataset.

Initializing the model

First, some basic configuration should be done (defining input and output data paths, the model path, etc.).
At this stage, we also define which columns should be used to match records and which match type should be applied to each of them. Zingg provides plenty of match types to fit your needs.

For example: 

  • FUZZY: broad matches with typos, abbreviations and other variations
  • EXACT: no variation allowed
  • DONT_USE: will appear in the output table but won’t be used for matching
  • EMAIL: only matches the username part of the email and ignores the domain name
  •  …

An extensive list can be found in the dedicated Zingg documentation.
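To give an intuition for how these match types differ, here is a toy illustration in plain Python. It is not Zingg's implementation: the function names are hypothetical, and the fuzzy comparison uses the standard library's `difflib` as a stand-in for whatever similarity features Zingg computes.

```python
from difflib import SequenceMatcher


def exact_match(a, b):
    """EXACT-like comparison: no variation allowed."""
    return a == b


def email_match(a, b):
    """EMAIL-like comparison: only the part before '@' is compared."""
    return a.split("@")[0].lower() == b.split("@")[0].lower()


def fuzzy_similarity(a, b):
    """FUZZY-like comparison: similarity in [0, 1] that tolerates typos."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(exact_match("28205", "28205"))
print(email_match("j.doe@gmail.com", "j.doe@outlook.com"))
print(fuzzy_similarity("charlotte", "charlote"))
```

A value with a single typo still scores close to 1 under the fuzzy comparison, whereas the exact comparison would reject it.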

In our case, we choose the following matching types:

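# FieldDefinition and MatchType come from the Zingg Python client
from zingg.client import *

# Identifier columns are kept in the output but not used for matching
recid = FieldDefinition("recid", "integer", MatchType.DONT_USE)
uuid = FieldDefinition("uuid", "integer", MatchType.DONT_USE)
# Name and address columns tolerate typos and other variations
givenname = FieldDefinition("givenname", "string", MatchType.FUZZY)
surname = FieldDefinition("surname", "string", MatchType.FUZZY)
suburb = FieldDefinition("suburb", "string", MatchType.FUZZY)
postcode = FieldDefinition("postcode", "string", MatchType.FUZZY)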

Lastly, according to the Zingg documentation, the following parameters should be tuned to reduce training time:

  • numPartitions: number of Spark partitions over which the input data is distributed. It should be 20 to 30 times the number of cores in your cluster.
  • labelDataSampleSize: the fraction of the data used to generate training samples. It should be between 0.0001 and 0.1 to strike a balance between finding enough edge cases and spending too much time combining samples.
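The two guidelines above can be turned into a small sanity check. This is a sketch with hypothetical helper names and a hypothetical 16-core cluster; it only encodes the ranges quoted above.

```python
def partition_range(total_cores):
    """Return the (low, high) partition counts from the 20-30x guideline."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return 20 * total_cores, 30 * total_cores


def sample_size_in_range(fraction):
    """Return True if the fraction lies within 0.0001 and 0.1."""
    return 0.0001 <= fraction <= 0.1


low, high = partition_range(16)  # hypothetical cluster with 16 cores
print(low, high)                 # 320 480
print(sample_size_in_range(0.05))
```

The chosen values would then be passed to the Zingg Python client's `Arguments` object, which exposes `setNumPartitions` and `setLabelDataSampleSize`.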

Generating training data

As explained before, Zingg will generate a small training dataset with the help of a human labeller. In Databricks it is possible to develop a user-friendly interface to label pairs as ‘Match’ or ‘No Match’:


In the example above, the given name, surname and postcode are identical but there is a typo in the suburb. We therefore labelled this pair as a match.
After labelling, the labelled pairs are saved. The more labelled pairs, the better. The preceding steps (generate training data, label pairs, save labelled pairs) should be repeated in the same sequence until you have at least 30-40 positive matches.
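Keeping track of progress towards that target can be as simple as counting the saved labels after each round. The sketch below uses hypothetical pair ids and an in-memory dictionary; how labels are actually stored depends on your notebook and on Zingg's own label storage.

```python
MATCH = "Match"
NO_MATCH = "No Match"


def count_positive_matches(labelled_pairs):
    """Count the pairs the human labeller marked as a match."""
    return sum(1 for label in labelled_pairs.values() if label == MATCH)


def needs_another_round(labelled_pairs, target=30):
    """Return True while fewer than `target` positive matches have been saved."""
    return count_positive_matches(labelled_pairs) < target


# Hypothetical labels saved after two labelling rounds, keyed by pair id.
saved = {}
saved.update({"p1": MATCH, "p2": NO_MATCH, "p3": MATCH})  # round 1
saved.update({"p4": NO_MATCH, "p5": MATCH})               # round 2

print(count_positive_matches(saved))  # 3
print(needs_another_round(saved))     # True
```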

Training the model

Now that we have our training data, we can start training our model. When displaying the model’s output, there will be 3 additional columns:

  • z_cluster: matching records are given the same z_cluster id as the algorithm considers them to be the same person
  • z_minScore: the lowest score with which the record matched any other record in the same cluster. Clusters containing low z_minScore values might indicate that records were wrongly matched together and hence that the model should be retrained
  • z_maxScore: the highest score with which the record matched any other record in the same cluster

In the table below you can see the 3 new columns. z_cluster 306 contains 3 records which means the model identified these 3 entities as 1 person, even though there are typos in the records.
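One way to act on the z_minScore column is to list the clusters that deserve a manual review. The sketch below works on plain dictionaries with hypothetical scores; the 0.5 threshold and the decision to skip single-record clusters are assumptions, not Zingg recommendations.

```python
from collections import defaultdict


def clusters_to_review(rows, threshold=0.5):
    """Return ids of multi-record clusters with a z_minScore below threshold."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["z_cluster"]].append(row)
    return sorted(
        cluster_id
        for cluster_id, members in by_cluster.items()
        if len(members) > 1
        and min(m["z_minScore"] for m in members) < threshold
    )


# Hypothetical model output (only the columns needed here).
output_rows = [
    {"uuid": 1, "z_cluster": 306, "z_minScore": 0.92},
    {"uuid": 2, "z_cluster": 306, "z_minScore": 0.88},
    {"uuid": 3, "z_cluster": 306, "z_minScore": 0.95},
    {"uuid": 4, "z_cluster": 411, "z_minScore": 0.31},
    {"uuid": 5, "z_cluster": 411, "z_minScore": 0.31},
    {"uuid": 6, "z_cluster": 512, "z_minScore": 0.0},
]

print(clusters_to_review(output_rows))  # [411]
```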


Matching unseen records using the trained model

Now that our model is trained, we can use it to match new customers. Matching records will receive the same z_cluster value, which will allow tracking of these entities across data sources to obtain a 360-degree view of customers.
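Once every record carries a z_cluster value, building the cross-source view is a grouping exercise. The sketch below assumes each record also has a `source` column (hypothetical, it is not part of the columns listed earlier) and uses made-up ids.

```python
from collections import defaultdict


def customer_view(rows):
    """Group record ids by cluster, then by the source they came from."""
    view = defaultdict(lambda: defaultdict(list))
    for row in rows:
        view[row["z_cluster"]][row["source"]].append(row["uuid"])
    return {cluster_id: dict(sources) for cluster_id, sources in view.items()}


# Hypothetical matched records from three sources.
matched = [
    {"uuid": 1, "source": "crm", "z_cluster": 306},
    {"uuid": 2, "source": "webshop", "z_cluster": 306},
    {"uuid": 3, "source": "newsletter", "z_cluster": 306},
    {"uuid": 4, "source": "crm", "z_cluster": 411},
]

view = customer_view(matched)
print(view[306])
```

Looking up a cluster id then returns every record of that person, per source.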

We can help you make customer identity resolution happen

In this insight, we described the high-level steps to perform customer identity resolution by capitalizing on the capabilities of Databricks and Zingg. The combination of these tools allows us to easily and quickly build a powerful matching algorithm. Don’t hesitate to get in touch if you are interested in the code base used for this example or if you need help developing or deploying this solution to your data warehouse.


(1) Gartner defines identity resolution as “the discipline of recognizing people across channels and devices and associating them with information used for marketing and advertising. It uses a constellation of tools, techniques and data.”