Once you’ve taken into consideration people and process (check out the blog on this topic: Dedupe Project Considerations: People & Process) and have your checklist (Checklist to Deduplicate Salesforce.com & Marketing Automation Data) completed, the first order of business is to write down your dedupe record logic, and the first part of the dedupe logic is how you identify duplicate records in your MarTech database, whether that’s Salesforce.com, Marketo, Pardot, or Eloqua.
What Data Fields to Use to Identify Duplicate Records
The most common data field used by B2B marketers to identify duplicate records is email address, which makes a lot of sense. However, that’s just the starting point. Here are a few more options to consider that can improve your ability to catch those more elusive duplicate records.
Mobile Phone Number
Mobile phone number has evolved into a unique identifier, largely due to these four reasons:
- We’re now able to keep our mobile phone number when changing phone companies.
- Interstate long distance charges have pretty much disappeared.
- Large metropolitan areas now have overlay area codes so we have to dial the full 1 + (area code) + phone number even if we’re calling within our own area code.
- Company-issued phones are now rare since most people don’t want to carry multiple phones.
So now when people move to a new job or even move across the country, our mobile phone number stays with us and it’s becoming part of our identity. Identifying duplicate records based on mobile phone number can help you identify contacts across different company affiliations and contacts with different email addresses.
Before you can deduplicate records based on the phone number, you should normalize the format first. We recommend normalizing all your phone number to the international format.
If company is one of the data fields to dedupe on, whether for account record or contact record deduping, consider using domain matching first because company names can be tricky to match on, which we will cover next. Domain is a more exact way to match companies. For readers not familiar with what a domain is:
The full domain for the above website and email address is “usa.acme.com”. The root domain for the above website and email address is “acme.com”.
Some of the additional considerations when using domain as a dedupe field:
- In most cases, root domain is the best matching option. The full domain is suitable if you wish to keep divisions of large corporations separated as different accounts.
- You can extract domains from both email and website using a data automation tool.
- Before you extract a domain, it’s best to first clean up the email and website data so that you don’t end up extracting the domain “acme.con” from a bad email address “email@example.com” with an invalid suffix.
- Filter out email addresses from ISPs (Internet Service Provider), free email providers, and email anonymizer services. Openprise provides a list of these email domains in our Open Data Library.
- A company can own a number of domains and the domain used for website and email may be different. Data providers like Dun & Bradstreet and Orb Intelligence can append your company master record with additional domains.
For account record dedupe, company name is usually the secondary match field after domain. Company name is often involved in deduping contact records as well because your contacts can be affiliated with multiple companies.
Before you use company name as a dedupe field it’s best that you do the following first:
- Clean up the company name. Instead of trying to match on “Toyota Motors USA Corporation”, “Toyota Motors (USA)”, “Toyota Motor U.S.A.”, “Toyota Motors USA Corp.”, your match rate can drastically improve if you clean up all these names to just “Toyota Motors USA”.
- Normalize the company name across its alias. For example, “Toyota Motors USA” may also appear in your database as “Toyota Motors Sales”, “Toyota Motor Sales USA”, and “Toyota Cars”. Normalizing them all into “Toyota Motors USA” is best. For companies with multiple divisions like “Toyota Forklifts” and “Toyota Financial Services”, you need to further decide if these business units should be treated as an alias of the parent account, or separate accounts.
- Consider using a data service to normalize the name or use a unique identification code like a DUNS number as the golden standard.
If you would like to dedupe on address, you must clean up and normalize the address first. There are just too many different variations of address formats in your database. Pick one of the mapping services as your golden standard, whether it’s Google, Bing, Mapquest, Here, TomTom, USPS … etc.. Run your address data through these services to fill in any gaps, correct mistakes, and standardize on format.
How Many Dedupe Fields to Use?
There is no universally correct answer to this. It depends on your business and your database. Here are some things to consider:
- Use as many matching criteria as you can, even if some matching criteria may only yield incremental results. For example, you may try all of these matching criteria on a contact record:
- Mobile phone
- First name + last name + company name
The only incremental cost is processing resource and time. With the right data automation technology, the incremental cost is trivial.
2. Matching on a combination of data fields may be required if matching on a single field doesn’t provide sufficient uniqueness. Some examples:
- A contact may be affiliated with different companies in different roles, such as a broker or a service partner.
- After a contact has moved to another company, you may want to preserve the old contact record to properly archive the historical data associated with the opportunity, which allows for more accurate analysis of win/loss and ideal customer profile.
To Fuzzy or Not To Fuzzy?
Fuzzy matching can be a very powerful tool when it comes to identifying duplicate records. However, no machine algorithm is perfect, so anytime you use fuzzy matching you are going to be trading off between false positives and false negatives. False positives are incorrect matches found. False negatives are matches that were missed. The general rules of thumbs are:
- The more you can clean up and normalize your data, the less you will require fuzzy matching. So clean up your data as much as you can before deduplication.
- Experiment with the fuzzy factor to decide which configuration yields the best tradeoff of false positives vs. false negatives that fits your business process.
- If you are short on time, start with exact matching only, then introduce fuzzy matching and gradually increase the fuzzy factor. This is the most conservative approach.
- There are different fuzzy matching algorithms. Some algorithms are better suited for certain types of data. If your data automation tool provides different algorithm options, experiment with them.
- You may apply different fuzzy algorithm and factor on different data fields within the same matching criteria, for example:
FirstName with fuzzy = 0.6
AND LastName with fuzzy = 0.8
AND Domain with exact match