How to identify surviving records in your dedupe logic
Once you’ve identified the duplicate records in your MarTech databases (see “How to Identify Duplicate Records in Salesforce and Marketo”), the next part of your dedupe logic is to identify the surviving, or winning, records. These are the records you will keep. The other records are the non-surviving, or losing, records, which will be merged into the surviving records or simply discarded. Determining the surviving records is the most complex part of your deduplication logic.
Peel the Onion in Your Dedupe Logic
When coming up with your surviving logic, you’ll feel as if you’re peeling an onion. Every resolution step you write down leads to another question. Here’s a typical example of working out the surviving logic in a dedupe project covering Marketo Leads and Salesforce.com Contacts and Leads.
- If we have both Leads and Contacts within a group of dupes, then the Contacts should survive.
- If there is more than one Contact within the duplicate group, then the Contact that is associated with an Opportunity should survive.
- If more than one Contact within the duplicate group is associated with an Opportunity, then the Contact associated with the Opportunity at the most advanced stage should survive.
- If there are no Contacts within a group of dupes, then the Lead that has signed up for a free trial should survive.
- If more than one Lead within a group has signed up for the free trial, then the Lead that has completed certain tasks within the free trial should survive.
- If no Lead has signed up for a free trial, then the Lead from the most trusted lead source should survive, based on a ranked list of lead sources.
As you can see, this can get pretty involved quickly.
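One way to keep a cascade like this manageable is to express it as an ordered ranking, where each rule becomes one position in a sort key. The following is a minimal Python sketch under stated assumptions, not a definitive implementation: the record fields (kind, opp_stage, trial_signup, trial_tasks_done, lead_source) and both ranking tables are hypothetical stand-ins for whatever your own data and documented logic look like.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rankings; replace them with your own documented lists.
STAGE_RANK = {"Prospecting": 1, "Proposal": 2, "Negotiation": 3, "Closed Won": 4}
SOURCE_RANK = {"Purchased List": 1, "Trade Show": 2, "Web Form": 3}


@dataclass
class Record:
    id: str
    kind: str                          # "Contact" or "Lead"
    opp_stage: Optional[str] = None    # most advanced Opportunity stage, if any
    trial_signup: bool = False         # signed up for the free trial
    trial_tasks_done: bool = False     # completed the required trial tasks
    lead_source: Optional[str] = None


def survivor_key(r: Record) -> tuple:
    """Build a sort key; tuples compare left to right, so earlier rules dominate."""
    return (
        r.kind == "Contact",                # Contacts beat Leads
        STAGE_RANK.get(r.opp_stage, 0),     # 0 means no Opportunity at all
        r.trial_signup,                     # free-trial signups come next
        r.trial_tasks_done,                 # then completed trial tasks
        SOURCE_RANK.get(r.lead_source, 0),  # then the most trusted lead source
    )


def pick_survivor(group: List[Record]) -> Record:
    """Return the record with the highest key; full ties go to the first record."""
    return max(group, key=survivor_key)


group = [
    Record("L1", "Lead", trial_signup=True),
    Record("C1", "Contact"),
    Record("C2", "Contact", opp_stage="Proposal"),
    Record("C3", "Contact", opp_stage="Negotiation"),
]
print(pick_survivor(group).id)
```

Note that the tuple also applies the later rules as tie-breakers in cases the list above does not cover (for example, two Contacts with no Opportunity), and any remaining full tie falls back to input order. Both are assumptions of this sketch, and each is a sign that there is another layer of the onion to document.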
It’s All About Your Logic
Now you can see that no technology vendor can offer you a proprietary or secret-sauce algorithm that magically figures out which of your records should survive. Every company’s logic is different, depending on how it conducts its business and what its data sources are. There is no way around this: you have to document your own surviving record logic, and then you need a flexible technology to execute it.
We often hear people say there is no consistent logic in some cases because it involves human judgment. We always challenge that claim. Humans don’t make random decisions. When a human makes a one-off dedupe decision between a set of records, she is applying some consistent logic in her head, whether she realizes it or not. Document that logic. If you really are making random decisions, a machine can make random decisions as well as your human can, and it does so much more cheaply and quickly.
Test and Iterate in a Safe Environment
Your initial logic is likely to have gaps because you’re not done peeling the onion yet, which is OK. Come up with the most complete logic you can think of, then test it. Make sure the deduplication technology you use can support testing outside of your system of record, like Marketo and Salesforce.com. In order to come up with the complete deduplication logic, you’ll need to go through at least a few iterations of:
- Running the algorithm you have
- Reviewing the deduplication results
- Making adjustments to your deduplication logic
- Repeating the cycle
This type of iterative development and testing is best done within your data management tool or a sandbox. Update your system of record only after your dedupe algorithm has been fully tested.
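As an illustration of this cycle, here is a minimal Python sketch of a dry run. It assumes the duplicate groups have already been exported from the system of record into plain dictionaries, and it only produces a merge plan for review; nothing is written back anywhere. The function names, the record layout, and the two example versions of the logic are all hypothetical.

```python
from typing import Callable, Dict, List


def plan_merges(groups: Dict[str, List[dict]],
                pick: Callable[[List[dict]], dict]) -> List[dict]:
    """Return a reviewable merge plan; nothing is written to any system."""
    plan = []
    for group_id, records in groups.items():
        if len(records) < 2:
            continue  # a single record is not a duplicate group
        winner = pick(records)
        plan.append({
            "group": group_id,
            "survivor": winner["id"],
            "losers": [r["id"] for r in records if r["id"] != winner["id"]],
        })
    return plan


def diff_plans(old: List[dict], new: List[dict]) -> List[str]:
    """List the groups whose survivor changed between two iterations."""
    old_by_group = {p["group"]: p["survivor"] for p in old}
    return [p["group"] for p in new
            if old_by_group.get(p["group"]) != p["survivor"]]


# Exported sample data (made up for this sketch).
groups = {
    "g1": [{"id": "a", "kind": "Lead"}, {"id": "b", "kind": "Contact"}],
    "g2": [{"id": "c", "kind": "Lead"}],
}

# Iteration 1: naive logic that keeps the first record in each group.
v1 = plan_merges(groups, pick=lambda rs: rs[0])
# Iteration 2: adjusted logic that prefers Contacts over Leads.
v2 = plan_merges(groups, pick=lambda rs: max(rs, key=lambda r: r["kind"] == "Contact"))

print(diff_plans(v1, v2))  # groups to re-review after the adjustment
```

Comparing the plans from two iterations narrows each review to the groups whose outcome actually changed, which keeps the cycle short.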
Clean and Normalize Before Deduping
You can’t just jump into a deduplication project with a dirty database. A dirty database can greatly hinder how well your dedupe logic performs. Cleaning and normalizing the data fields involved in your dedupe logic is highly recommended. For example:
- Clean up bad email addresses (for example, typos or inconsistent formatting) so that variants of the same address will match
- Clean up company names like “Acme Corp.” and “Acme Corporation” so they will match
- Extract domains, like “acme.com”, from URLs and email addresses to use as matching criteria
- Normalize phone numbers so that “415.555.1212” will match with “+1 (415) 555-1212”
- Normalize lead source names so “Dreamforce 2016” and “DF16” will match
- Clean up and remap old status values like Lead and Opportunity status
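Several of the steps above can be sketched as small normalization functions. The Python below is a rough illustration under stated assumptions, not production-grade cleansing: the suffix list and the lead source alias table are hypothetical, the email cleanup only covers whitespace and capitalization, and the phone logic assumes ten-digit numbers belong to country code 1. A real project would need broader rules or a dedicated tool.

```python
import re
from urllib.parse import urlparse

# Hypothetical lookup tables; build yours from your own data.
COMPANY_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "llc", "ltd", "co"}
LEAD_SOURCE_ALIASES = {"df16": "Dreamforce 2016", "dreamforce 2016": "Dreamforce 2016"}


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase; typo repair needs rules of your own."""
    return email.strip().lower()


def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    while len(words) > 1 and words[-1] in COMPANY_SUFFIXES:
        words.pop()
    return " ".join(words)


def extract_domain(value: str) -> str:
    """Pull the domain out of an email address or a URL."""
    value = value.strip().lower()
    if "@" in value and "://" not in value:
        return value.rsplit("@", 1)[1]
    # Prefix "//" so urlparse treats a bare host like "www.acme.com" as a host.
    host = urlparse(value if "://" in value else "//" + value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def normalize_phone(phone: str, default_country: str = "1") -> str:
    """Keep digits only; assume 10-digit numbers lack the country code."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 10:
        digits = default_country + digits
    return "+" + digits


def normalize_lead_source(source: str) -> str:
    """Map known aliases to one canonical name; leave unknown values alone."""
    cleaned = source.strip()
    return LEAD_SOURCE_ALIASES.get(cleaned.lower(), cleaned)


print(normalize_company("Acme Corp.") == normalize_company("Acme Corporation"))
print(normalize_phone("415.555.1212") == normalize_phone("+1 (415) 555-1212"))
print(normalize_lead_source("DF16") == normalize_lead_source("Dreamforce 2016"))
```

Running the normalizers into separate matching fields, rather than overwriting the original values, lets you compare records on the cleaned data while keeping the raw input available for review.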
You May Need to Integrate More Data Sources
To execute the logic sequence in the example above, you’ll need more than just your Lead and Contact data. You’ll also need:
- Opportunity data from Salesforce.com
- Opportunity Contact Role data from Salesforce.com
- A ranked list of Opportunity stages from Salesforce.com
- A ranked list of Lead Sources from Marketo or Salesforce.com
In addition to other data sets from your Salesforce.com and marketing automation platform, you may even need data from other systems like help desk, product database, and finance systems.
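To make the dependency on these data sets concrete, the sketch below joins Opportunity and Opportunity Contact Role records to find the most advanced Opportunity stage for each Contact. It is a minimal Python illustration that assumes the records have been exported as dictionaries keyed by the standard Salesforce field names (Id, StageName, ContactId, OpportunityId). The sample records and the stage ranking are made up, and the function name is hypothetical.

```python
from typing import Dict, List

# Hypothetical ranked list of Opportunity stages; higher is more advanced.
STAGE_RANK = {"Prospecting": 1, "Proposal": 2, "Negotiation": 3, "Closed Won": 4}


def best_stage_per_contact(opportunities: List[dict],
                           contact_roles: List[dict],
                           stage_rank: Dict[str, int]) -> Dict[str, str]:
    """Map each Contact ID to its most advanced Opportunity stage."""
    stage_by_opp = {o["Id"]: o["StageName"] for o in opportunities}
    best: Dict[str, str] = {}
    for role in contact_roles:
        stage = stage_by_opp.get(role["OpportunityId"])
        if stage is None:
            continue  # the role points at an Opportunity missing from the export
        contact_id = role["ContactId"]
        current = best.get(contact_id)
        if current is None or stage_rank.get(stage, 0) > stage_rank.get(current, 0):
            best[contact_id] = stage
    return best


# Made-up sample exports.
opportunities = [
    {"Id": "006A", "StageName": "Prospecting"},
    {"Id": "006B", "StageName": "Negotiation"},
]
contact_roles = [
    {"ContactId": "003X", "OpportunityId": "006A"},
    {"ContactId": "003X", "OpportunityId": "006B"},
    {"ContactId": "003Y", "OpportunityId": "006A"},
    {"ContactId": "003Z", "OpportunityId": "006C"},  # Opportunity not in export
]

print(best_stage_per_contact(opportunities, contact_roles, STAGE_RANK))
```

Contacts that are missing from the result have no usable Opportunity, which is exactly the case where the later rules in the surviving logic take over.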
If your deduplication logic requires data from other data sets, you’ll need a data integration tool to pull the data together. A data automation tool like Openprise combines integration, cleansing, normalization, and deduplication capabilities all in one, which can greatly simplify your dedupe project and save money spent on multiple tools.