Have you ever had a data accuracy issue? A situation where you have to build complex logic to compare data between two systems because names are misspelled, or data is missing in your matching columns?
I ran into this recently with a project migrating data between two environments. We wanted to make sure that the new data would match well with the existing data. We defined multiple ‘rounds’ of matching, where each round would attempt to find matches based on different criteria. After working on these rounds for a while and trying to resolve our mismatches with rounds and cleaning up the data on our own, we still ended up with a sizeable set that still didn’t match.
Then I remembered the Fuzzy transformations available within SSIS, specifically the Fuzzy Lookup for this scenario. I had not used it before, but I had read that they could help in situations like this. We implemented it with our process, and it was able to identify pairings that our rounds could not find.
For those not familiar with fuzzy transformations, they allow you to find matches between two sets of data based on how similar they are (as opposed to typical matching in SQL, where your matching criteria must match exactly in order to be identified as a match). The general process starts with creating a reference index. This is your dataset that will be compared to the data in question in an attempt to find a match. The records in your index should be as distinct as possible, as it will make it easier for the transformation to find the correct match.
Once the index is created, you then define the columns you wish to attempt to match on. The fuzzy algorithms will use the columns you identify, compare them between your dataset and the reference index, and calculate a similarity percentage between the two records. It will return a similarity percentage for each column being compared, as well as an aggregate similarity percentage of all columns. You also need to define the Lookup Column(s), or what value you want to return when a match is found. This is typically a key defined in your reference index. When the fuzzy transformation is complete, you can then use this key to join between the two datasets.
The resulting dataset will include all of the records in your test dataset, each record including the Lookup column that you defined. The final step is to define what matches are included in your result. There are two main settings for this.
The first is a similarity threshold. In order for the fuzzy algorithms to return a match in the Lookup column, it needs to find an aggregate similarity percentage greater than the similarity threshold you defined. If a match is found, but its similarity is below the threshold, a NULL will be returned for the lookup column. If you set the threshold too high, you risk missing matches that were close, but not close enough. If you set the threshold too low, you risk detecting matches between records that are not actual matches.
The second setting for your result set is the number of matches you wish to return. Typically in an automated process, you would set the number of matches to 1. In this case, you only want to return the record that matches the most, and returning multiple matches will cause duplicates in your dataset. However, for debugging purposes, it can be beneficial to return multiple matches. This allows you to see if there are multiple similar records that match to your index, which will allow you to tune your fuzzy lookup settings to the appropriate columns and thresholds.
From my initial experience, there were a number of lessons learned in how to effectively implement Fuzzy transformations into a matching process. Here are some of the highlights:
- Your SSIS solution must be set to run in 64-bit runtime. The Fuzzy transformations cannot be run in 32-bit runtime. This may cause conflict with certain connection managers that are unable to run in a 64-bit package.
- When running a Fuzzy Lookup transformation, the SSIS package loads your entire reference index into memory before it performs its analysis. Be sure you have enough memory available to store the entire index. A lack of memory will result in either slow processing or failure of the process.
- When deciding which columns to match on, be sure to pick columns that have a number of distinct values. There is little benefit in adding a column to your transformation that will have the same value in 90% of your records.
Now that I have used a fuzzy transformation, I see quite a few use cases where it can improve data quality in a matching process. Should the situation arise again, I will add this to my list of available options.