Merge Road Sign

How to merge duplicate records

Once you’ve identified the duplicate records and figured out which surviving records to keep, the last part of your deduplication logic is to merge the non-surviving records duplicate records into the surviving record. In some cases, you may want to simply discard or remove the non-surviving duplicate records. That simple scenario requires no further discussion.

First Establish a Default Logic, Then Exceptions

Chances are you have more than just a few data fields in the data you are looking to merge, perhaps even hundreds, and we have seen thousands. In order to scale, you should first establish a default merge logic that will be applied to all data fields. Once you have a default logic, then you can define exceptions for specific data fields. The most common default logis is “fill if empty”. We will discuss the various merge logics next.

Merge Logics

Fill If Empty

This is the most common merge logic, thus the most popular default merge logic. This logic says if any data field in the surviving record is empty, then attempt to fill it with a non-empty value from one of the non-surviving records. You also need to provide additional logic on what sequence to sort through the non-surviving records. Here is an example of 3 records in a duplicate set, with the non-surviving records sorted with more recently update record on top. The merge logic is fill if empty using latest modified record.

Surviving Original        John Doe         jdoe@acme.com

Non-surviving 1           J. Doe              jdoe@acme.com                                 VP Marketing

Non-surviving 2           John M. Doe    jdoe@acme.com         Acme Inc.        CMO

——————————————————————————————————————————

Surviving Merged        John Doe         jdoe@acme.com         Acme Inc.        VP Marketing

Always Replace

This is exactly the same logic as the one above, except it doesn’t require the surviving record data field to be empty. It applies the merge logic to all the records in the duplicate group, including the surviving record, pick the value that meets the requirement, then replace the value in the surviving record, empty or not. Common examples include:

  • Always take contact information from the last modified record
  • Always take lead source from the earliest created record

Here is an example of 3 records in a duplicate set sorted by latest modified date on top. The merge logic for email is to use the latest modified date. The merge logic for lead source is to use the earliest modified date. The default merge logic is fill if empty.

Non-surviving 1           J. Doe              jdoe@acme.com         Acme Inc.        Webinar

Surviving Original        John Doe         jdoe@looney.com                               Dreamforce 16

Non-surviving 2           John M. Doe    jdoe@tunes.com         Tunes Corp.    Free Trial

——————————————————————————————————————————

Surviving Merged        John Doe         jdoe@acme.com         Acme Inc.        Free Trial

Append

With most merge logic you are throwing away some data you believe is not as good as the ones you are keeping. In some cases, you want to keep them all. This is common with unstructured data like notes or multi-value categories and segmentation data. For these data fields, use the append logic. Here is the same example above, but instead of keeping only the earliest modified lead source, we want to append lead source.

Non-surviving 1           J. Doe              jdoe@acme.com         Webinar

Surviving Original        John Doe         jdoe@looney.com       Dreamforce 16

Non-surviving 2           John M. Doe    jdoe@tunes.com         Free Trial

——————————————————————————————————————————

Surviving Merged        John Doe         jdoe@acme.com         Webinar, Dreamforce 16, Free Trial

Base on a Formula

For numerical or binary data fields, it often makes sense to apply a mathematical formula, such as:

  • Pick the maximum or minimum value
  • Calculate a sum or an average value
  • True if only all records are true

Here is the same example above with 2 numerical fields: behavior score and demographic score. The merge logic is to pick the highest demographic score, but sum the behavior score.

Behavior          Demographic

Non-surviving 1           J. Doe              jdoe@acme.com         15                    100

Surviving Original        John Doe         jdoe@looney.com       10                    10

Non-surviving 2           John M. Doe    jdoe@tunes.com         50                    50

——————————————————————————————————————————

Surviving Merged        John Doe         jdoe@acme.com         75                    100

Do Not Merge

This one is simple, for some data fields you just do not want to merge.

Leave a comment