In my last post, I explained why I would want to use Fuzzy Lookups for my various matching needs. Now I want to explain the how: How does one set up a Fuzzy Lookup so that it helps solve my data quality problems?
The simple answer is that I use the available SSIS Fuzzy Lookup component in a Data Flow task. There are various documents and blogs that explain the basics of adding a Fuzzy Lookup component to your process, so I will let those speak for themselves. Here I want to cover the things I’ve learned that most other writings don’t cover. Once you have the basic architecture set up, there are several things you should consider, both in your process and in your data.
Here are some tips to jog your memory:
1. Formatting your datasets will return higher similarity
Like the Lookup transformation in SSIS, the Fuzzy tools may be case sensitive. That means that ‘Johann’ and ‘johann’ will not be exact matches. To make more accurate matches, it is best that you format both sets of data under the same rules (lower case, remove spaces, etc.). If you are comparing text data that may come in different formats, be sure to either remove or standardize any formatting that could cause unnecessary differences. An example is phone numbers: sometimes they include dashes (123-456-7890) and parentheses, but other times they will be omitted (1234567890). You may be able to improve the column similarity with the token delimiter settings, but from what I’ve tried, if one dataset has the delimiter and one does not, this won’t allow you to find an exact match.
2. Overall similarity = Sum(each column’s unique similarity) / Number of columns being compared…sort of
As I’ve written before, each column you analyze will have its own similarity percentage calculated. If you then take those percentages, add them together, and divide by the number of compared columns, you get the overall similarity…sometimes. I haven’t figured out yet what all goes into the overall similarity calculation, but at least this will give you a general idea of what similarity to shoot for when defining your threshold.
3. In Fuzzy tools, NULL will match to NULL
That’s right. The absence of a value will match exactly to the absence of a value, which is different from how SQL Server handles this (by default settings). To get around this, I recommend that you set one of your datasets (most likely your reference index) so that every column has a default value other than NULL. When selecting your default value, make sure it is something that is not likely to find a match in the other dataset (i.e. using ‘Smith’ for a default in a Last Name column may not be wise).
4. You can set individual columns to either match Exactly or Fuzzy
It took me a while to find this in the Lookup settings, but I consider it a useful find. In the same way that you receive a similarity score for each column that is compared, you can also define whether a column should match exactly, or if it can match fuzzy. You can set similarity thresholds on each column as well. One word of caution: If you set a column as an exact match, you will only find a match when that column matches. If the other columns match, but that column does not, you will not find a match. I would only use this when I am 100% confident in the accuracy of the data in that column, both in the source and reference index.
5. Only strings (Unicode or non-Unicode) can be evaluated with Fuzzy logic
Fuzzy Lookup looks for similarities in two data sets. That said, I don’t find it surprising Fuzzy logic is used on string columns, which have the highest chance of being incorrect. I do find it surprising that Fuzzy logic is used only on string columns. You can evaluate other data types as well within a Fuzzy tool, but they will automatically be set to Exact. If you wish to evaluate these columns as Fuzzy, a workaround is to convert non-string columns into string columns. I typically do this in my Fuzzy Lookups.
6. Make your comparison set as wide as possible
The goal with Fuzzy logic is to find accurate matches based on similarity of entire records, not just exact matches based on a narrow set of fields. The more fields you add to your comparison, the better chance you will find an accurate match if only 1 or 2 fields don’t match. It also reduces the chances of finding a false positive match because incorrectly inputted data matched to something it should not match to.
7. Check your results with your eyes
Regardless of how you handle your matching process, it is tedious. It would be nice to have an automated process, but the truth of the matter is that someone needs to review the matches and see how good they are. The benefit I’ve found of using Fuzzy Lookups is that you spend less time altering your process, and more time reviewing results. I usually go through multiple rounds of adjusting the similarity threshold before I decide on any setting. I’ve seen some writers recommend setting two thresholds: a higher threshold that is trusted as accurate, and then a lower threshold of ‘potential’ matches that can then be reviewed and decided on. I personally have not used this approach yet, but given my experiences thus far, I intend to incorporate this approach in my future projects. I hope these tips help you out on your implementations of Fuzzy tools. Stay Fuzzy!