Use Fuzzy Search To Improve Data Quality

What is fuzzy search? How to use it to improve data quality

Post by: Ed King in Cleanse and unify data

What is fuzzy search? Fuzzy search may be the key to solving some of your more vexing data quality issues.

Let’s just get this fact out of the way: Your marketing data is dirty and will always be dirty. Solutions like Openprise can help you scrub and normalize a great percentage of your data. But it’ll never be pristine, not for more than a fleeting moment. Not if your data’s dynamic and generated by more than just a few people. Your data will always be dirty. It’s just a question of how dirty. That doesn’t mean you shouldn’t bother to clean it up and normalize it. Data quality is critical if you want to have dashboards and analysis that make sense, automations that function, and business rules that operate as intended.

Data quality is definitely one area where the 80/20 rule applies. Cleaning up and normalizing data like:

Country and state names
Phone number format
Email and URL
Capitalization of names

is relatively straightforward and can be done with a high rate of success. Cleaning up and normalizing less standardized data like:

Company name
Job title
Seniority level
Part number
Failure and repair code

is more difficult due to the large number of variations and the lack of an agreed-upon list of values to normalize to. For example, cleaning up and normalizing customer company names is relatively simple, but doing it for the large number of prospect company names is significantly more difficult.
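The difference comes down to whether an agreed-upon list exists. Here is a minimal sketch of normalizing against such a list, assuming a small hand-written lookup table. The table contents and the function name `normalize_country` are hypothetical and only illustrate the idea.

```python
# Hypothetical lookup table: every known variation maps to one canonical value.
# A real table would cover all countries and many more spellings.
CANONICAL_COUNTRIES = {
    "us": "United States",
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "united kingdom": "United Kingdom",
}


def normalize_country(raw):
    """Return the canonical country name, or None if the value is not in the list."""
    # Trim, lowercase, and collapse repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return CANONICAL_COUNTRIES.get(key)


if __name__ == "__main__":
    for value in ["  USA ", "U.S.", "united   kingdom", "Toyota Motor Sales"]:
        print(repr(value), "->", normalize_country(value))
```

This works well when the list of valid values is short and stable. For company names or job titles there is no comparable list to look values up in, which is why they fall into the hard 20%.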

A Little Fuzzy Search Logic Goes a Long Way

Whether it’s search, reporting, or business rules, it all starts with searching for the data to operate on. This search has to deal with data variations introduced by:

Common variations, like “Account executive” vs. “Sales Manager”
Abbreviations, like “Vice President” vs. “VP”
Regional differences, like “Analyze” vs. “Analyse”
Spelling errors, like “Massachussettes” or “Mississipi”
Partials, like “Disney” vs. “The Walt Disney Company”
Changes over time, like “Apple Computer” vs. “Apple”

Using fuzzy search instead of discrete search can solve a number of the problems above without having to first normalize or cleanse your data. And it’s a great way to deal with that last 20% of hard-to-clean data.
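To make the contrast concrete, here is a rough sketch of comparing two values by similarity instead of by equality. It uses `SequenceMatcher` from Python’s standard `difflib` module as a stand-in, since this post does not describe the algorithm Openprise uses. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_equals(a, b, threshold=0.8):
    """Treat two values as a match when they are similar enough.

    The 0.8 threshold is an assumed value, not a recommendation.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [
        ("Massachussettes", "Massachusetts"),   # spelling error
        ("Mississipi", "Mississippi"),          # spelling error
        ("Analyze", "Analyse"),                 # regional difference
        ("VP", "Vice President"),               # abbreviation
        ("Disney", "The Walt Disney Company"),  # partial
    ]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: exact={a == b}, fuzzy={fuzzy_equals(a, b)}")
```

In this sketch the spelling errors and the regional spelling pass the threshold, while the abbreviation and the partial name do not. A pure character comparison like this one covers only some of the variations listed above, so a full fuzzy search operator presumably does more than this.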

Fuzzy Search vs Discrete Search: An Example

So what is fuzzy search? Here’s an example using an actual Marketo leads database.

If we do a pie chart analysis in Openprise searching on company names containing the word “toyota,” we get 87 results with the top 10 variations shown in the chart.

If we do a search on “Toyota Motor Sales USA” using exact match, we get 7 results, less than 10% of the total records containing the word “toyota.”

Now if we perform the same search using the fuzzy search match operator, we get 78 records, nearly 90% of the records containing the word “toyota.” You can see in the search results table the variations of company names that were found. The only records excluded are variations on the name “Toyota” that are too short compared to the search term “Toyota Motor Sales USA.”

Let’s try the search term “Toyoda motor sale usa.” It contains a typo in the word “Toyota” and uses the singular “sale” instead of “sales.” The fuzzy search still returned the same 78 records.
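The same behavior can be reproduced in miniature. The sketch below runs both search terms against a handful of made-up company names (not the actual Marketo records), once with exact match and once with a similarity threshold. It relies on Python’s standard `difflib` module as a stand-in for the Openprise fuzzy search operator, whose internals are not described here. The sample names, function names, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Made-up sample records for illustration only.
COMPANY_NAMES = [
    "Toyota Motor Sales USA",
    "Toyota Motor Sales, USA",
    "Toyota Motor Sales U.S.A.",
    "Toyota",
    "Honda Motor Co",
]


def exact_search(query, names):
    """Discrete search: keep only names identical to the query."""
    return [name for name in names if name == query]


def fuzzy_search(query, names, threshold=0.8):
    """Fuzzy search: keep names whose similarity to the query meets the threshold."""
    q = query.lower()
    return [
        name
        for name in names
        if SequenceMatcher(None, q, name.lower()).ratio() >= threshold
    ]


if __name__ == "__main__":
    for query in ["Toyota Motor Sales USA", "Toyoda motor sale usa"]:
        print(query)
        print("  exact:", exact_search(query, COMPANY_NAMES))
        print("  fuzzy:", fuzzy_search(query, COMPANY_NAMES))
```

With this sample, exact match finds one record for the correctly spelled term and none for the misspelled one, while the fuzzy version returns the same three “Toyota Motor Sales” variations for both. The bare “Toyota” record is left out because it is too short relative to the search term, which mirrors the exclusion described above.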

Fuzzy search can be a powerful tool to ensure your reports, analysis, and business rules still work when the data is anything short of perfect. It can simplify report and rule configuration since you’ll no longer need to assemble an exhaustive list of every variation and error on your search terms. If you clean and normalize 80% of your data, fuzzy search can make that remaining 20% tolerable and save you a ton of money.

Give fuzzy search a try in Openprise and let us know what you think.

Tags: Solutions

08 Apr 2015