Manual review and overwrite of dedupe results

Post by: Ed King in Cleanse and unify data

So far we’ve covered all the required planning, setup, and automated deduplication, including the first three steps of the dedupe process:

Identifying the duplicates
Picking the surviving records
Merging the non-surviving records into the surviving records

Frequently, the next step is manually reviewing and overwriting the automated dedupe results. We have a very clear and strong position on this data quality review. Beyond the initial setup validation and some exceptions discussed below, we believe manually reviewing dedupe results is completely unnecessary and a waste of resources, and hopefully we can convince you of the same.

Why Manual Dedupe Review is Not Needed

If There Is Logic to It, You Can Automate It

The most common justification for manually reviewing dedupe results is the desire to introduce a human decision into the process. Our response is always the same question: “What consistent logic is that human using to make those decisions, or is it just random judgement?” Almost always, a human who reviews and overwrites the automated dedupe results is applying a consistent set of logic. If the automated dedupe results require a large amount of manual correction, it is due to one or both of two causes:

The automated dedupe logic is incomplete, and/or
The data is so poor that a human has to do additional research and append data in order to make a judgement.

If a human reviewer is applying additional logic that the automated dedupe algorithm isn’t using, then the simple answer is to document and incorporate that logic into the algorithm.
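As a minimal sketch of what "document and incorporate that logic" can look like, suppose reviewers keep overriding the automated survivor because they prefer a record with a corporate email address and break ties by the most recent update. That rule, the free-mail domain list, and the field names `email` and `last_updated` are all assumptions made up for illustration, not rules from this series. Once written down, the rule is a few lines of code:

```python
from datetime import date

# Hypothetical free-mail domains a reviewer might distrust; adjust to your data.
FREE_MAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def has_corporate_email(record):
    """True when the record has an email whose domain is not a known free-mail provider."""
    email = record.get("email") or ""
    if "@" not in email:
        return False
    domain = email.rsplit("@", 1)[1].lower()
    return bool(domain) and domain not in FREE_MAIL_DOMAINS


def pick_survivor(duplicates):
    """Pick the surviving record from one group of duplicates.

    Documented reviewer logic (assumed for illustration):
      1. prefer a record with a corporate email address;
      2. break ties with the most recent last_updated date.
    """
    return max(duplicates, key=lambda r: (has_corporate_email(r), r["last_updated"]))


group = [
    {"id": 1, "email": "pat@gmail.com", "last_updated": date(2017, 3, 1)},
    {"id": 2, "email": "pat@example.com", "last_updated": date(2016, 11, 5)},
    {"id": 3, "email": None, "last_updated": date(2017, 2, 10)},
]
survivor = pick_survivor(group)
print(survivor["id"])  # record 2 is the only one with a corporate email
```

Each new rule a reviewer can articulate becomes one more element in the sort key, which is easier to audit than a pile of one-off manual overrides.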

If the data isn’t good enough to support the necessary logic, then we recommend you append the necessary data first, using either third-party data providers or data from your other systems. It might seem counterintuitive to spend money appending data that you may throw away after deduplication, but if your data is too poor to support your dedupe logic, you may end up keeping the wrong data and throwing away the good data. That can be more expensive than the money and effort spent on appending data to your database before deduping.

If There Is No Logic to It, It Won’t Make Any Difference

If a human reviewer is indeed making ad-hoc, intuitive, random, eye-test decisions, then you are better off not doing the review at all, because the net result will be no better and may be worse: you will have spent precious human resources to gain nothing, and you may well do additional damage to your data.

While it’s true that on any one specific record a human action can make the result better, there is an equal chance that it makes the result worse. If the human decision is indeed “random”, because the claim is that there is no consistent logic, then statistics say that over a large enough data set the positive and negative impacts of the human actions wash out. Most marketing databases fit that bill: with 10,000 records or more, the net effect of the human actions is most likely close to ZERO. It’s just like flipping a coin: flip it enough times and the split approaches 50/50 heads/tails.
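The coin-flip argument can be checked with a small simulation. This is a sketch under a deliberately simple assumption: every manual override either improves a record (+1) or damages it (-1) with equal probability, and all records count the same. The function name and the scoring are invented for illustration.

```python
import random


def simulate_random_overrides(n_records, seed=0):
    """Net quality change when each manual override is a coin flip.

    Each override improves a record (+1) or damages it (-1) with equal
    probability, which models the "no consistent logic" case.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    return sum(rng.choice((1, -1)) for _ in range(n_records))


for n in (100, 10_000, 1_000_000):
    net = simulate_random_overrides(n)
    print(f"{n:>9} overrides: net change {net:+d} ({net / n:+.4%} per record)")
```

The raw net count rarely lands on exactly zero, but the change per record shrinks toward zero as the number of overrides grows. The reviewers' time, which this model ignores, is spent either way.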

There are more valuable tasks you can use your human resources for than doing a painful task that yields zero results, and that your human resources are guaranteed to hate doing.

It Will Not Scale, If It Gets Done at All

Manually reviewing dedupe results is such a painful and slow process that most people will procrastinate, and procrastinate, and procrastinate. They will do anything else first if they have a choice. Any dedupe project that requires manual review, especially one that needs a large number of people, like the sales team, to participate, will simply never get done. We don’t like it, but it’s human nature. Your options are:

Spend the time and effort to “push on the rope” for a long time and never get the project done, or
Move the project forward without the manual review of results and spend the effort to deal with any potential complaints

From our experience, the second option is much better. It gets results, and the effort is generally less.

When Manual Dedupe Review is Required

Verifying Algorithms

When you set up the dedupe algorithm, you absolutely should review the results to ensure the algorithm is correct and complete. It’s an iterative process because, as we hope we have shown so far, developing a robust deduping algorithm is not trivial and is almost never as simple as you think at the start. However, the purpose of this data quality review is not to manually overwrite the results produced by the algorithm, but to provide feedback that improves the algorithm.

When the Data Set is Small and Remediation is Costly

The classic example here is CRM Account records, especially strategic/named accounts. This is a small data set, typically no more than a thousand records, with ownership distributed across a rather large set of account reps, so each rep owns about 20 or so strategic account records. Any type of automated deduplication here can have costly consequences (normally political rather than financial). The better alternative is to just identify the dupes for the sales team and let the account reps manually merge the records in their own small data sets.
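A rough sketch of "identify, but don't merge" follows. It groups accounts by a normalized name and hands each rep the suspected duplicate groups that touch their accounts, leaving the merge decision to them. The normalization rule, the suffix list, and the `id`, `name`, and `owner` fields are assumptions for illustration; a real matching rule would come from your own dedupe logic.

```python
from collections import defaultdict
import re

# Hypothetical legal suffixes to ignore when comparing account names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "co"}


def match_key(name):
    """Normalize an account name: lowercase, drop punctuation and legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def review_lists(accounts):
    """Return {owner: [groups of suspected duplicate account ids]} without merging."""
    by_key = defaultdict(list)
    for account in accounts:
        by_key[match_key(account["name"])].append(account)

    lists = defaultdict(list)
    for group in by_key.values():
        if len(group) < 2:
            continue  # no suspected duplicate for this account
        ids = sorted(a["id"] for a in group)
        # Every owner involved in the group sees the whole group.
        for owner in sorted({a["owner"] for a in group}):
            lists[owner].append(list(ids))
    return dict(lists)


accounts = [
    {"id": "A1", "name": "Acme Corp.", "owner": "dana"},
    {"id": "A2", "name": "ACME Corporation", "owner": "lee"},
    {"id": "A3", "name": "Globex Inc", "owner": "dana"},
    {"id": "A4", "name": "Initech", "owner": "lee"},
]
print(review_lists(accounts))
```

Because nothing is merged automatically, a wrong match costs a rep a few seconds of review rather than a damaged strategic account.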

Tags: Article Best practice

17 Mar 2017
