SSIS:MatchUp:Algorithms
Latest revision as of 00:22, 14 November 2015
The MatchUp Editor can use the following matching algorithms:
Exact Matching
Determines whether two strings are identical.
Soundex
SoundEx is a string transformation and comparison algorithm. For example, JOHNSON is transformed to "J525" and JHNSN is also transformed to "J525", so the two are considered a SoundEx match. If the original strings are identical, SoundEx returns 100%. If only the SoundEx-transformed strings are equal, the algorithm returns 99%. Otherwise, SoundEx returns 0%.
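The scoring rule above can be sketched as follows. This is a minimal illustration that assumes the classic American Soundex encoding; MatchUp's internal implementation is not shown on this page, and the names `soundex` and `soundex_score` are hypothetical.

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits (assumed variant)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [ch for ch in word.upper() if ch.isalpha()]
    if not letters_only:
        return ""

    result = letters_only[0]
    previous = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        # H and W do not separate equal codes; vowels do.
        if ch not in "HW":
            previous = digit
    return (result + "000")[:4]


def soundex_score(a: str, b: str) -> int:
    """Return 100 for identical strings, 99 for equal Soundex codes, else 0."""
    if a == b:
        return 100
    code_a, code_b = soundex(a), soundex(b)
    if code_a and code_a == code_b:
        return 99
    return 0


print(soundex("JOHNSON"), soundex("JHNSN"), soundex_score("JOHNSON", "JHNSN"))
```

Both names encode to J525, so the pair scores 99 rather than 100 because the original strings differ.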
Phonetex
A variation of the SoundEx Algorithm. PhonetEx takes into account letter combinations that sound alike, particularly at the start of the word (such as 'PN' = 'N', 'PH' = 'F').
Containment
Matches when one record's component is contained in another record. For example, “Smith” is contained in “Smithfield.”
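A containment check can be sketched in a few lines. The case folding and whitespace trimming here are assumptions made for illustration, and `contains_match` is a hypothetical name rather than a MatchUp function.

```python
def contains_match(a: str, b: str) -> bool:
    """True when either component is fully contained in the other."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        # Assumption: an empty component never counts as contained.
        return False
    return a in b or b in a


print(contains_match("Smith", "Smithfield"))
```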
Frequency
Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.”
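One way to picture order-independent matching is to compare character counts. This sketch assumes an all-or-nothing match on the full character multiset; how MatchUp scores partial overlaps is not described here, and `frequency_match` is a hypothetical name.

```python
from collections import Counter


def frequency_match(a: str, b: str) -> bool:
    """True when both strings contain the same characters the same number of times."""
    return Counter(a) == Counter(b)


print(frequency_match("abcdef", "badcfe"))
```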
Fast Near
A typographical matching algorithm. It works best on words that fail to match because of a few typographical errors. The number of errors tolerated is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy: Fast Near will sometimes find false matches or miss true matches.
Accurate Near
A typographical matching algorithm that implements the Levenshtein algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
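For reference, the standard Levenshtein edit distance can be computed with a two-row dynamic program. This is the textbook algorithm, not MatchUp's code; how MatchUp maps the distance onto its Tight-to-Loose scale is not covered here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("JOHNSON", "JOHNSEN"))
```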
Frequency Near
Based on the Frequency algorithm, which matches the characters of one string to the characters of another without any regard to the sequence. For example, "abcdef" would be considered a 100% match to "badcfe."
Vowels
Only vowels will be compared. Consonants will be removed.
Consonants
Only consonants will be compared. Vowels will be removed.
Alphabetic
Only alphabetic characters will be compared.
Numeric
Only numeric characters will be compared. Decimals and signs are considered numeric.
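The four filtering algorithms above (Vowels, Consonants, Alphabetic, Numeric) can be pictured as a pre-processing step applied before comparison. The sketch below makes several assumptions that this page does not confirm: Y is treated as a consonant, comparison is case-insensitive for vowels and consonants, and the function names are hypothetical.

```python
VOWELS = set("AEIOU")  # assumption: Y counts as a consonant


def keep_vowels(text: str) -> str:
    return "".join(ch for ch in text.upper() if ch in VOWELS)


def keep_consonants(text: str) -> str:
    return "".join(ch for ch in text.upper() if ch.isalpha() and ch not in VOWELS)


def keep_alphabetic(text: str) -> str:
    return "".join(ch for ch in text if ch.isalpha())


def keep_numeric(text: str) -> str:
    # Decimal points and signs are kept, as described above.
    return "".join(ch for ch in text if ch.isdigit() or ch in "+-.")


print(keep_vowels("Johnson"), keep_consonants("Johnson"), keep_numeric("Suite -12.5B"))
```

Two components would then be compared on the filtered strings, for example `keep_consonants("Johnson") == keep_consonants("Jahnsen")`.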
N-Gram
Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
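Counting shared grams with N = 2 might look like the sketch below. Counting duplicates through a multiset intersection is an assumption, as is the absence of any padding at the string boundaries; the function names are hypothetical.

```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> list:
    """All contiguous substrings of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def common_ngram_count(a: str, b: str, n: int = 2) -> int:
    """Number of grams shared by both strings, counting repeated grams."""
    shared = Counter(ngrams(a, n)) & Counter(ngrams(b, n))
    return sum(shared.values())


print(ngrams("SMITH"), common_ngram_count("SMITH", "SMYTH"))
```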
Jaro
Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Jaro-Winkler
Like Jaro, but gives added weight to matching characters at the start of the string (up to 4 characters).
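The Jaro and Jaro-Winkler descriptions above correspond to the following textbook formulation. The prefix weight of 0.1 is the value commonly used in the literature and is an assumption here; MatchUp's actual constants are not documented on this page.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Common characters, in order, from each string.
    a_common = [ch for ch, m in zip(a, a_matched) if m]
    b_common = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix of up to max_prefix characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```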
Longest Common Substring (LCS)
Finds the longest common substring between the two strings.
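A standard dynamic-programming way to find the longest common substring is sketched below. How MatchUp turns the substring length into a similarity percentage is not stated here, so the sketch only returns the substring itself; the function name is hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of consecutive characters that appears in both strings."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run within a
    previous = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        current = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_len:best_end]


print(longest_common_substring("SMITHFIELD", "BLACKSMITH"))
```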
Needleman-Wunsch
A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far apart the two characters are on a standard keyboard layout. For example, A to S is given a mistake weight of 0.4, while A to D is 0.6 and A to P is 1.0.
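The keyboard-weighted variation can be illustrated by swapping the fixed substitution cost of Levenshtein for a distance-based one. The cost formula below is a guess chosen only because it reproduces the three example weights above (0.4, 0.6, 1.0); the simplified QWERTY grid, the fixed cost of 1.0 for insertions and deletions, and all names are assumptions.

```python
# Simplified QWERTY grid; row stagger is ignored (assumption).
ROWS = ("QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM")
POSITION = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def substitution_cost(x: str, y: str) -> float:
    """Hypothetical cost: 0.4 for adjacent keys, growing with distance, capped at 1.0."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 0.0
    if x not in POSITION or y not in POSITION:
        return 1.0
    (r1, c1), (r2, c2) = POSITION[x], POSITION[y]
    distance = abs(r1 - r2) + abs(c1 - c2)
    return min(1.0, (distance + 1) / 5)


def weighted_distance(a: str, b: str) -> float:
    """Levenshtein-style distance using keyboard-aware substitution costs."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1.0,
                               current[j - 1] + 1.0,
                               previous[j - 1] + substitution_cost(ca, cb)))
        previous = current
    return previous[-1]


print(substitution_cost("A", "S"), substitution_cost("A", "D"), substitution_cost("A", "P"))
```

With these costs, "CAT" versus "CST" is a cheaper mistake than "CAT" versus "CPT", which is the behavior the description calls for.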
MD Keyboard
An algorithm developed by Melissa Data that counts keyboarding mis-hits, applies a weighted penalty based on the distance of each mis-hit, and assigns a percentage of similarity between the compared strings. This effectively adds the "understanding" that the keyboarder may have typed one character before another.
Smith-Waterman-Gotoh
A variation of the Needleman-Wunsch algorithm. Needleman-Wunsch and Smith-Waterman-Gotoh are identical except that character deletions are given a different weight. This effectively adds the "understanding" that the keyboarder may have tried to abbreviate one of the words.
Dice’s Coefficient
A variation of the N-Gram algorithm. Dice's Coefficient counts matching n-Grams but does not count extra duplicate n-Grams.
Jaccard Similarity Coefficient
A variation of the N-Gram algorithm. The Jaccard Similarity is identical to the N-Gram algorithm but uses a different formula for similarity computation.
Overlap Coefficient
A variation of the N-Gram algorithm. The Overlap Coefficient is identical to the N-Gram algorithm but uses a different formula for similarity computation.
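The three N-Gram variations above differ only in the formula applied to the shared grams. The sketch below uses the standard set-based definitions over bigrams; whether MatchUp uses exactly these formulas is an assumption, and the function names are hypothetical.

```python
def bigram_set(text: str) -> set:
    """Distinct 2-character grams; duplicates collapse because a set is used."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a: str, b: str) -> float:
    x, y = bigram_set(a), bigram_set(b)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(a: str, b: str) -> float:
    x, y = bigram_set(a), bigram_set(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap(a: str, b: str) -> float:
    x, y = bigram_set(a), bigram_set(b)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


print(dice("NIGHT", "NACHT"), jaccard("NIGHT", "NACHT"), overlap("NIGHT", "NACHT"))
```

"NIGHT" and "NACHT" share a single bigram (HT) out of four each, so Dice gives 2/8, Jaccard gives 1/7, and Overlap gives 1/4.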
Double Metaphone
A variation of the PhonetEx algorithm. Double Metaphone performs two different PhonetEx-style transformations, creating two PhonetEx-like strings (primary and alternate) for each of the two strings being compared. The logic used for Double Metaphone similarity works as follows:
- If primary1 = primary2 and alternate1 = alternate2, then we have a very good match (99%).
- If either primary1 = alternate2 or alternate1 = primary2, and alternate1 = alternate2, then we have a good match (85%).
- If alternate1 = alternate2, we have an acceptable match (75%).
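The three rules above can be written as a small scoring function that takes already-computed primary and alternate codes. Producing the codes themselves is out of scope for this sketch. Two points are assumptions: the second rule is read as "(primary1 = alternate2 or alternate1 = primary2) and alternate1 = alternate2", and a score of 0 is returned when no rule applies, which the list does not state. The function name is hypothetical.

```python
def double_metaphone_score(primary1: str, alternate1: str,
                           primary2: str, alternate2: str) -> int:
    """Apply the three similarity rules in order, from strongest to weakest."""
    if primary1 == primary2 and alternate1 == alternate2:
        return 99  # very good match
    if (primary1 == alternate2 or alternate1 == primary2) and alternate1 == alternate2:
        return 85  # good match
    if alternate1 == alternate2:
        return 75  # acceptable match
    return 0  # assumption: no rule matched


print(double_metaphone_score("AAA", "CCC", "BBB", "CCC"))
```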