June 20, 2017

Haven’t I Seen You Before? Fuzzy Grouping Your Datasets for Data Deduplication

Post by: Data Team

If you’ve been reading my earlier posts regarding Fuzzy logic, you would know that I am usually talking about Fuzzy Lookups, taking two sets of data and comparing them to find similarities. Did you know that you can perform the same analysis on a single data set, comparing it to itself?

Logistically, it may seem a bit circular, but there are some valid applications for this. For example, suppose you have a list of people who purchased something from your store over time. If a person makes multiple visits to your store, there is a chance that the person could have multiple entries in your list. The reasons for why this could happen are numerous. A few scenarios include someone misspelling their name, or because the application that maintains this list did not identify this person as a previous customer, or that an employee couldn’t find them in the system and added them to the list again. Regardless of how it happens, the result is the same. Suppose then that you wish to take this list and send advertising emails to these customers. How do you think a customer would appreciate getting 5 copies of the same email because they exist in this list multiple times?

In this scenario, the ideal solution would be to evaluate your list first and determine if a person exists in the list multiple times. This could be done by a person reviewing the list manually, but the size of your list could make this a high-effort, time-consuming task. An alternative approach would be to use a Fuzzy Grouping transformation to find the duplicates for you.

I’ve said much about the context you’d use Fuzzy Grouping in before talking about how Fuzzy Grouping works. The reason is because Fuzzy Grouping works exactly the same as Fuzzy Lookups! The interface for setting up a Fuzzy Grouping will look different from the Fuzzy Lookup interface, but all of the configurations I’ve talked about thus far are available for both tools.

The results of running your dataset through Fuzzy Grouping will look very similar to Fuzzy Lookup results. They both return the columns you passed through the tool, as well as Similarity columns for each column you selected to run a comparison on. They both allow you to define your threshold of acceptable matches, for individual columns as well as the record as a whole.

The main difference will be the column(s) used to identify a match. In Fuzzy Lookups, you define the column(s) you want to populate with values when a match is found. In Fuzzy Groupings, instead of defining the columns, your result set will include two columns. The column names are ‘_key_in’ and ‘_key_out’ by default. _key_in is a surrogate key generated by Fuzzy Grouping that uniquely identify the records in your dataset. _key_out defines a group of duplicate records, holding the _key_in value of the ‘master’ record for the group. Usually your master record for the group will be the record that has the most complete data in the group. If a record does not match to any other record in the dataset, or if this record matches other records and is the master record in a group, _key_in and _key_out will have the same value. If a record finds a match to another record in the dataset and is not considered the master of the group, _key_out will contain the _key_in value of the master record for the group. Using _key_out, you can then group your results to determine what duplicate records were found, as well as which record is the master for the group. From there, you can choose what to do with the duplicate records.

I’ve encountered a few scenarios where I found Fuzzy Grouping to be useful. To prepare your data for analysis, I recommend the same steps I’ve recommended in my previous posts on the topic of Fuzzy logic. Formatting your data will improve your similarity results, and testing your Fuzzy Grouping with various levels of thresholds is recommended before deciding what you want to set it to.

Between Fuzzy Lookup and Fuzzy Grouping, you should be able to handle many of your matching and deduplication needs. I’ve found great benefit in using both in tandem, and I hope you try it too. Stay Fuzzy!

Subscribe to our Newsletter

Stay informed on the latest technology news and trends

Relevant Insights

Should You Disrupt Yourself to Accelerate Digital Transformation?

It has been interesting to watch Microsoft transition from a company that makes its money via licensing to one that...

Cybersecurity Myth Busted: Tools Are the Solution

When thinking about security, people often gravitate towards implementing various security tools, solutions, or products. If you bring up a...

Time to Reconsider MP-BGP EVPN for Your Datacenter Network?

VxLAN was defined in 2014 by RFC 7348 and has been used as a component in several SDN (software defined...