SSIS:MatchUp:Matchcodes:Optimization
← SSIS:Data Quality Components
Some matchcodes process much faster than others in spite of the fact that they detect the same matches. This section will assist in creating the most efficient matchcodes. Ninety-nine percent of the time, clicking the Optimize button in the Matchcode Editor will sufficiently optimize a matchcode. This discussion is included so developers can better understand why certain things are done while optimizing, as well as what can be done to make the optimizer work more effectively.
Optimizing can make a significant difference in processing speed. 58-hour runs have been reduced to four hours simply by optimizing the matchcode.
It is important, however, that the developer verifies that a matchcode works in the intended way before attempting any optimizations. If a matchcode is not functioning properly, these optimizations will not help, and could quite possibly make the situation worse.
Component Sequence
As discussed in the previous section, the first component of a matchcode has certain restrictions:
- It must be used in every combination.
- It cannot use certain types of Fuzzy Matching: Containment; Frequency; Fast Near; Frequency Near or Accurate Near (other types are okay, though).
- It cannot use Initial Only matching.
- It cannot use One Blank Field matching.
- It cannot use Swap matching.
If the matchcode's second component also follows these conditions, MatchUP Object will incorporate it into its clustering scheme (see page 3 for more information on clustering). Additional components, if they follow in sequence (third, fourth, and so on), will be used if they too satisfy these conditions. Incorporating a component into a cluster greatly reduces the number of comparisons MatchUP Object has to perform which, in turn, speeds up your processing.
This is a simple example of optimization:
Component | Size | Fuzzy | Blank | 1 | 2 |
---|---|---|---|---|---|
ZIP/PC | 5 | No | Yes | X | X |
Street # | 5 | No | Yes | X | |
Street Name | 5 | No | No | X | |
PO Box | 10 | No | No | X | |
Last Name | 5 | No | Yes | X | X |
As shown here, MatchUP Object will only cluster by ZIP/PC. But note that the last component satisfies all the conditions listed earlier.
Component | Size | Fuzzy | Blank | 1 | 2 |
---|---|---|---|---|---|
ZIP/PC | 5 | No | Yes | X | X |
Last Name | 5 | No | Yes | X | X |
Street # | 5 | No | Yes | X | |
Street Name | 5 | No | No | X | |
PO Box | 10 | No | No | X |
This simple optimization will produce significant improvements in speed. The Matchcode Optimizer will always perform this optimization.
Fuzzy Algorithms
Fuzzy algorithms fall into two categories: early matching and late matching.
Early matching algorithms are algorithms where a string is transformed into a (usually shorter) representation and comparisons are performed on this result. In MatchUP, these transformations are performed during key generation (the BuildKey function in each interface), which means that the early matching algorithms pay a speed penalty once per record: as each record’s key is built.
Late matching algorithms are actual comparison algorithms. Usually, one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These transformations are performed during key comparison. As a result, late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.
Obviously, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy matching algorithm may improve the speed. Often, a faster algorithm will give nearly the same results, but it is a good idea to test any such change before processing live data.
The fuzzy algorithms, ranked from slowest to fastest:
Algorithm | Late or Early | Speed (10=fastest) |
---|---|---|
Jaro | Late | 1 |
Jaro-Winkler | Late | 1 |
n-Gram | Late | 1 |
Needleman-Wunch | Late | 1 |
Smith-Waterman-Gotoh | Late | 1 |
Dice’s Coefficient | Late | 1 |
Jaccard Similarity Coefficient | Late | 1 |
Overlap Coefficient | Late | 1 |
Longest Common Substring | Late | 1 |
Double Metaphone | Late | 1 |
Accurate Near | Late | 1 |
Fast Near | Late | 3 |
Containment | Late | 4 |
Frequency Near | Late | 4 |
Frequency | Late | 6 |
Phonetex | Early | 7 |
Soundex | Early | 8 |
Vowels Only | Early | 9 |
Numerics Only | Early | 9 |
Consonants Only | Early | 9 |
Alphas Only | Early | 9 |
Exact | N/A | 10 |
The speed values are only rough estimates.
Another benefit of using a faster fuzzy algorithm is that an application may be able to exploit the component sequence optimization shown earlier. All of the early matching algorithms satisfy the restrictions for first components.
The Matchcode Optimizer will not perform this optimization as it can have a significant impact on matching results.
Unnecessary Components
Components that are not used in any combinations (in other words, they have no X's in columns 1 through 16) are a sign of poor matchcode design.
Take the following matchcode:
Component | Size | Fuzzy | Blank | 1 | 2 |
---|---|---|---|---|---|
ZIP/PC | 5 | No | Yes | X | X |
Last Name | 5 | No | Yes | X | X |
First Name | 5 | No | Yes | ||
Street # | 5 | No | Yes | X | |
Street Name | 5 | No | No | X | |
PO Box | 10 | No | No | X |
First name is not being used in any combination. Perhaps it was used in a combination that has since been removed from this matchcode, but is no longer necessary.
Unnecessary Combinations
Take the following matchcode:
Component | Size | Fuzzy | Blank | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|---|
ZIP5 | 5 | No | Yes | X | X | X | X |
Last Name | 5 | No | Yes | X | X | X | X |
First Name | 5 | No | Yes | ||||
Street # | 5 | No | Yes | X | X | ||
Street Name | 5 | No | No | X | X | ||
PO Box | 10 | No | No | X | X |
Here are the four conditions for matching:
Condition #1: | ZIP/PC | Last Name | First Name | Street # | Street Name | |
---|---|---|---|---|---|---|
Condition #2: | ZIP/PC | Last Name | First Name | PO Box | ||
Condition #3: | ZIP/PC | Last Name | Street # | Street Name | ||
Condition #4: | ZIP/PC | Last Name | PO Box |
There is no match that will be detected by condition #1 that would not be detected by condition #3. Similarly, matches found by condition #2 will always be found by condition #4. In other words, condition 3 is a subset of condition 1, and condition 2 is a subset of condition 4. Subsets are rarely desirable.
So either conditions 1 and 2 aren’t needed or conditions 3 and 4 were a mistake. If conditions 1 and 2 are eliminated, the First Name component should also be removed, as it will not be needed.