http://wiki.melissadata.com/index.php?title=Pentaho/Contact_Zone:MatchUp:Algorithms&feed=atom&action=historyPentaho/Contact Zone:MatchUp:Algorithms - Revision history2024-03-28T12:36:16ZRevision history for this page on the wikiMediaWiki 1.37.2http://wiki.melissadata.com/index.php?title=Pentaho/Contact_Zone:MatchUp:Algorithms&diff=12381&oldid=prevAdmin: Created page with "{{PentahoCZMatchUpNav |MatchcodeEditorCollapse= }} {{CustomTOC}} The MatchUp Editor can use the following matching algorithms: ===Exact Matching=== Determines whether two s..."2017-01-11T20:58:11Z<p>Created page with "{{PentahoCZMatchUpNav |MatchcodeEditorCollapse= }} {{CustomTOC}} The MatchUp Editor can use the following matching algorithms: ===Exact Matching=== Determines whether two s..."</p>
<p><b>New page</b></p><div>{{PentahoCZMatchUpNav<br />
|MatchcodeEditorCollapse=<br />
}}<br />
<br />
{{CustomTOC}}<br />
<br />
The MatchUp Editor can use the following matching algorithms:<br />
<br />
===Exact Matching===<br />
Determines whether two strings are identical.<br />
<br />
===Soundex===<br />
SoundEx is a string transformation and comparison-based algorithm. For example, JOHNSON is transformed to "J525" and JHNSN is also transformed to "J525", so the two are considered a SoundEx match.<br />
If the original strings are identical, SoundEx returns 100%. If only the SoundEx codes are equal, the algorithm returns 99%. Otherwise, SoundEx returns 0%.<br />
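<br />
The transformation and the three-level score can be sketched as follows. This is a minimal illustration of classic American Soundex, not MatchUp's own code; the function names <code>soundex</code> and <code>soundex_score</code> are hypothetical.<br />

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters_only = [c for c in word.upper() if c.isalpha()]
    if not letters_only:
        return ""

    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = digit
    return (result + "000")[:4]


def soundex_score(a: str, b: str) -> int:
    """Scoring as described above: 100 if identical, 99 if codes match, else 0."""
    if a == b:
        return 100
    code = soundex(a)
    if code and code == soundex(b):
        return 99
    return 0


print(soundex("JOHNSON"), soundex("JHNSN"), soundex_score("JOHNSON", "JHNSN"))
```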
<br />
===Phonetex===<br />
A variation of the SoundEx algorithm. PhonetEx takes into account letter combinations that sound alike, particularly at the start of a word (such as 'PN' = 'N' and 'PH' = 'F').<br />
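<br />
A rough sketch of the leading-combination idea is shown below. It covers only the two pairs named above; the full PhonetEx rule set is not documented here, and the name <code>normalize_prefix</code> is hypothetical.<br />

```python
# Only the two sound-alike pairs mentioned in the text; the real rule set is larger.
LEADING_EQUIVALENTS = {"PN": "N", "PH": "F"}


def normalize_prefix(word: str) -> str:
    """Replace a sound-alike letter combination at the start of the word."""
    upper = word.upper()
    for combo, replacement in LEADING_EQUIVALENTS.items():
        if upper.startswith(combo):
            return replacement + upper[len(combo):]
    return upper


print(normalize_prefix("Philip"), normalize_prefix("Pneuma"))
```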
<br />
===Containment===<br />
Matches when one record's component is contained in another record. For example, “Smith” is contained in “Smithfield.”<br />
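<br />
In code, containment reduces to a substring test in either direction. The sketch below assumes a case-insensitive comparison and treats empty components as non-matching; both choices are assumptions, and <code>contains_match</code> is a hypothetical name.<br />

```python
def contains_match(a: str, b: str) -> bool:
    """True when either component is fully contained in the other."""
    a, b = a.strip().upper(), b.strip().upper()
    if not a or not b:
        return False  # assumption: an empty component never matches
    return a in b or b in a


print(contains_match("Smith", "Smithfield"))
```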
<br />
===Frequency===<br />
Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.”<br />
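<br />
One way to express order-insensitive matching is to compare character counts. This is a sketch under the assumption that repeated characters must occur the same number of times in both strings; <code>frequency_match</code> is a hypothetical name.<br />

```python
from collections import Counter


def frequency_match(a: str, b: str) -> bool:
    """True when both strings contain the same characters, ignoring order."""
    return Counter(a) == Counter(b)


print(frequency_match("abcdef", "badcfe"))
```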
<br />
===Fast Near===<br />
A typographical matching algorithm. It works best on words that fail to match only because of a few typographical errors. The number of errors tolerated is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy: Fast Near will sometimes find false matches or miss true matches.<br />
<br />
===Accurate Near===<br />
An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.<br />
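<br />
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below is the textbook dynamic-programming form; how MatchUp converts the distance into a match decision is not covered here.<br />

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("JOHNSON", "JOHNSTON"))
```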
<br />
===Frequency Near===<br />
The Frequency algorithm will match the characters of one string to the characters of another without any regard to the sequence. For example, "abcdef" would be considered a 100% match to "badcfe."<br />
<br />
===Vowels===<br />
Only vowels will be compared. Consonants will be removed.<br />
<br />
===Consonants===<br />
Only consonants will be compared. Vowels will be removed.<br />
<br />
===Alphabetic===<br />
Only alphabetic characters will be compared.<br />
<br />
===Numeric===<br />
Only numeric characters will be compared. Decimals and signs are considered numeric.<br />
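<br />
The four character-class options (Vowels, Consonants, Alphabetic, Numeric) amount to filtering each string before comparison. The sketch below makes two assumptions that are not stated above: "Y" is treated as a consonant, and the numeric filter keeps "+", "-", and "." wherever they occur. All function names are hypothetical.<br />

```python
VOWELS = "AEIOU"  # assumption: Y is treated as a consonant


def vowels_only(s: str) -> str:
    return "".join(c for c in s if c.upper() in VOWELS)


def consonants_only(s: str) -> str:
    return "".join(c for c in s if c.isalpha() and c.upper() not in VOWELS)


def alphabetic_only(s: str) -> str:
    return "".join(c for c in s if c.isalpha())


def numeric_only(s: str) -> str:
    # Decimal points and signs count as numeric, per the description above.
    return "".join(c for c in s if c.isdigit() or c in "+-.")


print(vowels_only("Melissa"), consonants_only("Melissa"),
      alphabetic_only("Suite 100-B"), numeric_only("Suite -12.5"))
```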
<br />
===N-Gram===<br />
Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.<br />
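<br />
Counting shared grams can be sketched as below with N = 2. How MatchUp turns the count into a percentage is not described here, so the sketch stops at the count; the function names are hypothetical.<br />

```python
from collections import Counter


def ngrams(s: str, n: int = 2) -> list:
    """All overlapping substrings of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def common_ngrams(a: str, b: str, n: int = 2) -> int:
    """Number of grams the two strings share, counting repeats."""
    shared = Counter(ngrams(a, n)) & Counter(ngrams(b, n))
    return sum(shared.values())


print(common_ngrams("JOHNSON", "JOHNSTON"))
```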
<br />
===Jaro===<br />
Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.<br />
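<br />
The two steps (gather common characters, then count transpositions) can be sketched with the standard Jaro formula. This is the textbook definition, shown as an illustration rather than MatchUp's exact implementation.<br />

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as common only if they lie within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Compare the common characters in order; each mismatched pair is half a transposition.
    common1 = [c for c, m in zip(s1, matched1) if m]
    common2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(x != y for x, y in zip(common1, common2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))
```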
<br />
===Jaro-Winkler===<br />
Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).<br />
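<br />
The prefix adjustment can be sketched as a function applied to an already computed Jaro score. The scaling factor of 0.1 is the conventional value and is an assumption here, as is the name <code>winkler_boost</code>.<br />

```python
def winkler_boost(s1: str, s2: str, jaro_score: float,
                  scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Raise a Jaro score according to the length of the shared prefix."""
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return jaro_score + prefix * scaling * (1.0 - jaro_score)


# The Jaro score of MARTHA and MARHTA is 17/18; they share the prefix "MAR".
print(round(winkler_boost("MARTHA", "MARHTA", 17 / 18), 4))
```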
<br />
===Longest Common Substring (LCS)===<br />
Finds the longest common substring between the two strings.<br />
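<br />
A standard dynamic-programming sketch of the search is shown below. It returns the first longest run found; how MatchUp scores the result is not described here, and the function name is hypothetical.<br />

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of consecutive characters that appears in both strings."""
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]


print(longest_common_substring("SMITHFIELD", "BLACKSMITH"))
```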
<br />
===Needleman-Wunsch===<br />
A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far apart the two characters are on a standard keyboard layout. For example, A to S is given a mistake weight of 0.4, while A to D is 0.6 and A to P is 1.0.<br />
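<br />
The weighting idea can be sketched as an edit distance whose substitution cost comes from a lookup table. Only the three weights quoted above are included. Every other substitution defaults to 1.0, insertions and deletions cost 1.0, and weights are treated as symmetric and case-insensitive; all of these are assumptions, and the names are hypothetical.<br />

```python
# The three documented weights; the full keyboard table is not given in the text.
DOCUMENTED_COSTS = {
    frozenset("AS"): 0.4,
    frozenset("AD"): 0.6,
    frozenset("AP"): 1.0,
}


def substitution_cost(x: str, y: str) -> float:
    x, y = x.upper(), y.upper()
    if x == y:
        return 0.0
    return DOCUMENTED_COSTS.get(frozenset((x, y)), 1.0)  # assumed default


def weighted_distance(a: str, b: str) -> float:
    """Edit distance with keyboard-weighted substitutions."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1.0,      # deletion
                               current[j - 1] + 1.0,   # insertion
                               previous[j - 1] + substitution_cost(ca, cb)))
        previous = current
    return previous[-1]


print(weighted_distance("CAT", "CST"), weighted_distance("CAT", "CDT"))
```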
<br />
===MD Keyboard===<br />
An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.<br />
This effectively adds the "understanding" that the keyboarder may have typed in one character before another.<br />
<br />
===Smith-Waterman-Gotoh===<br />
A variation of the Needleman-Wunsch algorithm. Needleman-Wunsch and Smith-Waterman-Gotoh are identical except that character deletions are given a different weight.<br />
This effectively adds the "understanding" that the keyboarder may have tried to abbreviate one of the words.<br />
<br />
===Dice’s Coefficient===<br />
A variation of the N-Gram algorithm. Dice's Coefficient counts matching n-Grams but does not count extra duplicate n-Grams.<br />
<br />
===Jaccard Similarity Coefficient===<br />
A variation of the N-Gram algorithm. The Jaccard Similarity is identical to the N-Gram algorithm but uses a different formula for similarity computation.<br />
<br />
===Overlap Coefficient===<br />
A variation of the N-Gram algorithm. The Overlap Coefficient is identical to the N-Gram algorithm but uses a different formula for similarity computation.<br />
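<br />
For comparison, the textbook formulas for Dice's Coefficient, the Jaccard Similarity Coefficient, and the Overlap Coefficient are sketched below over sets of 2-grams. Using sets drops duplicate grams. Whether MatchUp uses exactly these formulas is an assumption, and the function names are hypothetical.<br />

```python
def gram_set(s: str, n: int = 2) -> set:
    """Distinct overlapping substrings of length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str) -> float:
    x, y = gram_set(a), gram_set(b)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(a: str, b: str) -> float:
    x, y = gram_set(a), gram_set(b)
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def overlap(a: str, b: str) -> float:
    x, y = gram_set(a), gram_set(b)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


print(dice("night", "nacht"), jaccard("night", "nacht"), overlap("night", "nacht"))
```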
<br />
===Double MetaPhone===<br />
A variation of the PhonetEx algorithm. Double Metaphone performs two different PhonetEx-style transformations, producing two PhonetEx-like codes (primary and alternate) for each of the two strings being compared.<br />
The logic used for Double Metaphone similarity works as follows:<br />
*If primary1 = primary2 and alternate1 = alternate2, then we have a very good match (99%).<br />
*If either primary1 = alternate2 or alternate1 = primary2, and alternate1 = alternate2, then we have a good match (85%).<br />
*If alternate1 = alternate2, we have an acceptable match (75%).<br />
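<br />
The scoring rules above can be written directly as a function of the four codes. The sketch takes the codes as inputs and does not generate them; returning 0 when no rule applies is an assumption, and the function name and sample codes are hypothetical.<br />

```python
def double_metaphone_score(primary1: str, alternate1: str,
                           primary2: str, alternate2: str) -> int:
    """Apply the three rules listed above, in order."""
    if primary1 == primary2 and alternate1 == alternate2:
        return 99
    if (primary1 == alternate2 or alternate1 == primary2) and alternate1 == alternate2:
        return 85
    if alternate1 == alternate2:
        return 75
    return 0  # assumption: no rule applies, so no match


print(double_metaphone_score("SM0", "XMT", "SM0", "XMT"))
```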
<br />
<br />
[[Category:Pentaho]]<br />
[[Category:Contact Zone]]<br />
[[Category:MatchUp Component]]</div>Admin