SSIS:Fuzzy Match:Matching Algorithms
Matching Algorithms
The Fuzzy Match Component can use any of the following matching algorithms on any column in your database:
- Exact Matching
- Determines whether two strings are identical.
- Jaro
- Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
- Jaro-Winkler
- A variation of the Jaro algorithm. Strings that share matching characters at the beginning are given additional similarity weight.
- N-Gram
- Counts the number of common sub-strings (grams) of a specified length between the two strings.
- Dice's Coefficient
- A variation of the N-Gram algorithm. Dice's Coefficient counts matching n-Grams but does not count extra duplicate n-Grams.
- Jaccard Similarity
- A variation of the N-Gram algorithm. The Jaccard Similarity is identical to the N-Gram algorithm but uses a different formula for similarity computation.
- Overlap Coefficient
- A variation of the N-Gram algorithm. The Overlap Coefficient is identical to the N-Gram algorithm but uses a different formula for similarity computation.
- Levenshtein
- The Levenshtein algorithm computes the similarity of two strings from the number of character mistakes between them. Mistakes are counted as incorrect characters, inserted characters, and deleted characters.
- Needleman-Wunsch
- A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far apart the two characters are on a standard keyboard layout. For example, A to S is given a mistake weight of 0.4, while A to D is 0.6 and A to P is 1.0.
- Smith-Waterman-Gotoh
- A variation of the Needleman-Wunsch algorithm. Needleman-Wunsch and Smith-Waterman-Gotoh are identical except that character deletions are given a different weight.
- This effectively adds the "understanding" that the keyboarder may have tried to abbreviate one of the words.
- MDKeyboard
- A variation of the Smith-Waterman-Gotoh algorithm. Smith-Waterman-Gotoh and MDKeyboard are identical except that character transpositions are given a different weight.
- This effectively adds the "understanding" that the keyboarder may have typed in one character before another.
- Longest Common Substring (LCS)
- The LCS algorithm finds the longest run of adjacent characters that the two strings have in common.
- Containment
- The Containment algorithm returns 100% if one string is a subset of the other. Otherwise it returns 0%.
- Frequency
- The Frequency algorithm matches the characters of one string to the characters of the other without regard to their sequence. For example, "abcdef" would be considered a 100% match to "badcfe".
- SoundEx
- SoundEx is a string transformation and comparison-based algorithm. For example, JOHNSON would be transformed to "J525" and JHNSN would also be transformed to "J525", so the two would be considered a SoundEx match after evaluation.
- If the original strings are identical, SoundEx returns 100%. If only the SoundEx'd strings are equal, the algorithm returns 99%. Otherwise, SoundEx returns 0%.
- PhonetEx
- A variation of the SoundEx Algorithm. PhonetEx takes into account letter combinations that sound alike, particularly at the start of the word (such as 'PN' = 'N', 'PH' = 'F').
- Double Metaphone
- A variation of the PhonetEx algorithm. Double Metaphone performs two different PhonetEx-style transformations, creating two PhonetEx-like strings (primary and alternate) for each of the two input strings.
- The logic used for Double Metaphone Similarity works as follows:
- If primary1 = primary2 and alternate1 = alternate2, then we have a very good match (99%).
- If either primary1 = alternate2 or alternate1 = primary2, and alternate1 = alternate2, then we have a good match (85%).
- If alternate1 = alternate2, we have an acceptable match (75%).
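
The Jaro and Jaro-Winkler entries above can be illustrated with a short sketch. It follows the commonly published definitions of the two algorithms, not the component's internal code; the function names, the 0.1 prefix scale, and the 4-character prefix cap are conventional assumptions rather than documented settings of the Fuzzy Match Component.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only if they lie within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Gather the common characters of each string, in order.
    common1 = [c for c, m in zip(s1, matched1) if m]
    common2 = [c for c, m in zip(s2, matched2) if m]
    # Each transposition shows up as two out-of-place characters.
    transpositions = sum(a != b for a, b in zip(common1, common2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix (assumed default parameters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The shared prefix "MAR" is what lifts the Jaro-Winkler score above the plain Jaro score in this example.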
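
For the N-Gram family above (Dice's Coefficient, Jaccard Similarity, Overlap Coefficient), the following sketch shows how the same set of shared grams can feed three different similarity formulas. The formulas are the standard set-based definitions and are an assumption here; the page does not state the exact formulas the component uses. The gram length of 2 and all function names are illustrative.

```python
def ngrams(text: str, n: int = 2) -> set:
    """Distinct substrings of length n (duplicates collapse in the set)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str, n: int = 2) -> float:
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def overlap(a: str, b: str, n: int = 2) -> float:
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


# "night" and "nacht" share exactly one bigram ("ht") out of four each.
print(dice("night", "nacht"))               # 0.25
print(round(jaccard("night", "nacht"), 3))  # 0.143
print(overlap("night", "nacht"))            # 0.25
```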
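
The Levenshtein entry above counts incorrect, inserted, and deleted characters. A minimal sketch of that count, using the classic dynamic-programming formulation, is shown below. Converting the count to a similarity by dividing by the longer string's length is an assumption; the page does not say how the component normalizes the result.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Needleman-Wunsch, as described above, would replace the fixed substitution cost of 1 with a keyboard-distance weight such as 0.4 for A to S.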
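
The Longest Common Substring entry above can be sketched as follows. The function name is illustrative, and the sketch only returns the shared run itself; how the component turns its length into a percentage is not stated on this page.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of adjacent characters that appears in both strings."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run within a
    previous = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        current = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the run that ended at the previous character pair.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_len:best_end]


print(longest_common_substring("abcdxyz", "xyzabcd"))  # abcd
```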
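
The SoundEx transformation and its 100% / 99% / 0% scoring described above can be sketched like this. The encoding follows the widely used American Soundex rules, which reproduce the "J525" example; whether the component applies exactly these rules is an assumption, and both function names are illustrative.

```python
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Four-character code: first letter plus up to three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")  # vowels have no code
        if code and code != previous:
            result.append(code)
        previous = code
    return ("".join(result) + "000")[:4]


def soundex_similarity(a: str, b: str) -> int:
    """Scoring as described above: 100 identical, 99 same code, else 0."""
    if a == b:
        return 100
    code_a, code_b = soundex(a), soundex(b)
    if code_a and code_a == code_b:
        return 99
    return 0


print(soundex("JOHNSON"), soundex("JHNSN"))    # J525 J525
print(soundex_similarity("JOHNSON", "JHNSN"))  # 99
```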
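
The Double Metaphone similarity rules listed above can be written as a small decision function. This sketch takes the primary and alternate keys as inputs and does not generate them. The function name, the reading of the second rule as "(primary1 = alternate2 or alternate1 = primary2) and alternate1 = alternate2", and the 0% result when no rule applies are all assumptions, since the page does not spell them out.

```python
def double_metaphone_score(primary1: str, alternate1: str,
                           primary2: str, alternate2: str) -> int:
    """Apply the three rules in order and return a percentage."""
    if primary1 == primary2 and alternate1 == alternate2:
        return 99  # very good match
    if ((primary1 == alternate2 or alternate1 == primary2)
            and alternate1 == alternate2):
        return 85  # good match
    if alternate1 == alternate2:
        return 75  # acceptable match
    return 0  # assumption: no rule applied, so no match


# Hypothetical keys, chosen only to exercise each rule.
print(double_metaphone_score("SM0", "XMT", "SM0", "XMT"))  # 99
print(double_metaphone_score("A", "B", "B", "B"))          # 85
print(double_metaphone_score("A", "C", "B", "C"))          # 75
```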