A few months ago, I wrote about a migration I was working on where I incorporated a Fuzzy Lookup into a data matching process between two environments (see Fuzzy Lookup in SSIS: A Simple Approach to Resolving Complex Data Quality Problems).
Time has passed, that project has finished, and I now find myself in the middle of yet another migration with a need for matching. From the beginning of the initial design process, I recommended we utilize the Fuzzy tools available to handle every matching need for the migration.
That’s right. All of our matching needs. No rounds of exact matching algorithms. We are comparing data sets, identifying how similar entire records are, defining a threshold and trusting the results.
You may think that’s crazy. Exact matches can be trusted to be accurate. Can you find accurate matches with only a partial match?
The answer: Yes you can! You would be surprised at how accurate non-exact matches can be. Using a Fuzzy Lookup, I am finding more matches than I would with a normal Lookup. The data may differ slightly, but they are still matches.
The better question to ask is: Why? Why would I pick a process that (theoretically) does not guarantee accurate results? Why wouldn’t I go with exact matches that I can trust?
Here are six reasons why I would utilize the Fuzzy tools over exact matching algorithms:
- Assume users will not enter data accurately. In an ideal world, typos would not happen, and all data entered into an application would be exactly as it should be. In reality, this is a rarity. Applications interfaces can be designed to increase the chances that entered data will be accurate and formatted, but no matter what, there will be text fields that need to be populated and can’t be checked for data quality (People’s names, for example). Oftentimes, these free-form text fields are the ones you need to compare in order to find your matches. Using fuzzy tools, you can identify matches between records even when typos occur, where exact matches would miss the match.
- Data changes over time, at different times. Say you are comparing two datasets of people, and you are trying to find a person existing in both sets. What if a person’s information gets updated in one set, but not in the other? An exact match would not find a match. A fuzzy lookup could find it.
- Exact matching algorithms can be wrong. As I mentioned in point 1, user-entered fields may be inaccurate. If your exact matching criteria is too narrow, it is possible to make an invalid match if you only check a subset of your data for exact matches. With fuzzy tools, you can check all available data at once and define similar matches, decreasing the chances of finding an invalid match due to inaccurate data in some of your fields.
- More flexible. Suppose you had a project where you had to match multiple, unique datasets to a master data set over time. For each unique dataset, if you wanted to match exactly, you may need to create unique rules that allows each one to match to your master set, rebuilding your process every time. With Fuzzy tools, you can design a single process that can match any set to your master set. You may have to adjust your similarity thresholds and conform unique datasets a little to find valid matches, but those should be your only adjustments.
- Less effort for faster turnaround. In scenarios like the ones I’ve worked in, regardless of the approach you take, time must be spent analyzing the matching results and tuning your matching process to produce valid, accurate results. In an exact-match scenario, ‘tuning’ involves determining your unique cases and building rules to accommodate the cases. Depending on your data quality, there is no telling how many rules you may need to build to get the number of accurate matches you are expecting. It could take days, even weeks to build, with development time increasing the time it takes to get data to a point of analysis. Using fuzzy tools, the base architecture takes a few hours to set up. Once that is done, you can begin running your data through the tools, analyzing the results, and tuning the tools with different thresholds and formatting your data. You will get your initial matching results sooner with fuzzy tools, and making future adjustments will also take less time.
- Fuzzy tools will still find exact matches. This shouldn’t be surprising, but I think this is a misunderstanding that some people have (I had the same thought when I began learning about fuzzy tools). Fuzzy tools are designed to find matches when only part of the data matches. This doesn’t mean it ignores matches that are exact. Records where all columns match exactly receive a Similarity of 1 (or 100%), and are included with the matches.
The more I work with Fuzzy tools, the more I am seeing how powerful they can be.