Admin: Created page with "{{PentahoMatchUpNav |MatchcodeEditorCollapse= }} {{CustomTOC}} The MatchUp Editor can use the following matching algorithms: ===Exact Matching=== Determines whether two str..."

2015-06-17T17:42:44Z

{{PentahoMatchUpNav
|MatchcodeEditorCollapse=
}}

{{CustomTOC}}

The MatchUp Editor can use the following matching algorithms:

===Exact Matching===
Determines whether two strings are identical.

===Soundex===
SoundEx is a string transformation and comparison algorithm. For example, JOHNSON is transformed to "J525" and JHNSN is also transformed to "J525", so the two strings are considered a SoundEx match.
If the original strings are identical, SoundEx returns 100%. If only the SoundEx codes are equal, the algorithm returns 99%. Otherwise, SoundEx returns 0%.
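
The scoring rule above can be sketched in Python. This is only an illustration: it assumes the standard American Soundex letter groups and a four-character code, which reproduces the "J525" example, but the exact table used by MatchUp is not documented here. The names <code>soundex</code> and <code>soundex_score</code> are hypothetical.

```python
# Assumed: standard American Soundex digit groups (not confirmed for MatchUp).
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return a four-character Soundex code, or "" if the word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Append a digit only when it differs from the previous coded letter.
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def soundex_score(a: str, b: str) -> int:
    """100 for identical strings, 99 for equal Soundex codes, otherwise 0."""
    if a == b:
        return 100
    if soundex(a) == soundex(b):
        return 99
    return 0
```

With these assumptions, <code>soundex_score("JOHNSON", "JHNSN")</code> yields 99 because both strings encode to "J525".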

===Phonetex===
A variation of the SoundEx algorithm. PhonetEx takes into account letter combinations that sound alike, particularly at the start of a word (such as 'PN' = 'N' and 'PH' = 'F').

===Containment===
Matches when one record's component is contained in another record's component. For example, “Smith” is contained in “Smithfield.”
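
A minimal sketch of a containment test in Python. Checking both directions, ignoring case, and rejecting empty components are assumptions made for the illustration; the function name <code>contains_match</code> is hypothetical.

```python
def contains_match(a: str, b: str) -> bool:
    """True when either component appears inside the other (case-insensitive)."""
    a, b = a.casefold(), b.casefold()
    if not a or not b:
        # Assumed: an empty component never counts as contained.
        return False
    return a in b or b in a
```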

===Frequency===
Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.”
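
One way to picture an order-independent comparison is to compare character counts. The sketch below assumes that a match requires exactly the same characters with the same multiplicities; how MatchUp handles partial overlaps is not stated here, and <code>frequency_match</code> is a hypothetical name.

```python
from collections import Counter


def frequency_match(a: str, b: str) -> bool:
    """True when both strings contain the same characters the same number of times."""
    return Counter(a) == Counter(b)
```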

===Fast Near===
A typographical matching algorithm. It works best for words that fail to match only because of a few typographical errors. The number of errors tolerated is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.

===Accurate Near===
An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
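
For reference, the classic Levenshtein edit distance can be sketched as follows. The distance itself is the textbook algorithm; the conversion to a similarity ratio in <code>similarity</code> is an assumption for illustration only, since the way MatchUp turns the distance into a percentage is not described here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```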

===Frequency Near===
The Frequency algorithm will match the characters of one string to the characters of another without any regard to the sequence. For example "abcdef" would be considered a 100% match to "badcfe."

===Vowels===
Only vowels will be compared. Consonants will be removed.

===Consonants===
Only consonants will be compared. Vowels will be removed.

===Alphabetic===
Only alphabetic characters will be compared.

===Numeric===
Only numeric characters will be compared. Decimals and signs are considered numeric.
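
The four character filters above (Vowels, Consonants, Alphabetic, Numeric) can be sketched as simple string filters applied before comparison. Treating only A, E, I, O and U as vowels and folding to upper case are assumptions; the function names are hypothetical.

```python
VOWELS = set("AEIOU")  # Assumed vowel set; Y is treated as a consonant here.


def keep_vowels(s: str) -> str:
    return "".join(c for c in s.upper() if c in VOWELS)


def keep_consonants(s: str) -> str:
    return "".join(c for c in s.upper() if c.isalpha() and c not in VOWELS)


def keep_alphabetic(s: str) -> str:
    return "".join(c for c in s if c.isalpha())


def keep_numeric(s: str) -> str:
    # Digits plus decimal points and signs, as described for the Numeric filter.
    return "".join(c for c in s if c.isdigit() or c in "+-.")
```

The filtered strings would then be compared with each other, so "Johnson" and "Jhnsn" agree once both are reduced to their consonants.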

===N-Gram===
Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
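
A minimal sketch of counting shared grams with N = 2. Counting repeated grams with their multiplicity is an assumption, as is the absence of any padding at the string ends; <code>ngrams</code> and <code>common_ngrams</code> are hypothetical names.

```python
from collections import Counter


def ngrams(s: str, n: int = 2) -> list:
    """All overlapping substrings of length n, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def common_ngrams(a: str, b: str, n: int = 2) -> int:
    """Number of grams shared by both strings (multiset intersection)."""
    shared = Counter(ngrams(a, n)) & Counter(ngrams(b, n))
    return sum(shared.values())
```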

===Jaro===
Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.

===Jaro-Winkler===
Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).
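
The textbook Jaro and Jaro-Winkler measures can be sketched as below. The matching window and the prefix scaling factor of 0.1 are the conventional values and are assumptions here; only the four-character prefix limit comes from the description above.

```python
def jaro(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as common only if they lie within this window.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_used = [False] * len(a)
    b_used = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_used[j] and b[j] == ca:
                a_used[i] = b_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Common characters in their original order, one list per string.
    a_common = [c for c, used in zip(a, a_used) if used]
    b_common = [c for c, used in zip(b, b_used) if used]
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scale: float = 0.1) -> float:
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):  # at most 4 leading characters
        if x != y:
            break
        prefix += 1
    return score + prefix * scale * (1 - score)
```

For "MARTHA" and "MARHTA" all six characters are common and one transposition is counted, so the Jaro score is about 0.944 and the shared three-character prefix raises the Jaro-Winkler score to about 0.961.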

===Longest Common Substring (LCS)===
Finds the longest common substring between the two strings.
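
A short dynamic-programming sketch that finds the longest common substring. It is case-sensitive and returns the first longest run found in the first string; both choices are assumptions, and how the result is converted into a match percentage is not covered.

```python
def longest_common_substring(a: str, b: str) -> str:
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run within a
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the run that ended at the previous character pair.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]
```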

===Needleman-Wunsch===
A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far apart the two characters are on a standard keyboard layout. For example, A to S is given a mistake weight of 0.4, while A to D is 0.6 and A to P is 1.0.
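
The idea of keyboard-weighted substitutions can be sketched as an edit distance with a variable substitution cost. The table below contains only the three weights quoted above; the fallback cost of 1.0 for every other pair, the gap cost of 1.0 and all names are assumptions, not MatchUp's actual values.

```python
# Only the three weights quoted in the text; a real table would cover every key pair.
KEY_COST = {
    frozenset("AS"): 0.4,
    frozenset("AD"): 0.6,
    frozenset("AP"): 1.0,
}


def substitution_cost(x: str, y: str) -> float:
    if x == y:
        return 0.0
    # Assumed fallback: unknown pairs cost a full mistake.
    return KEY_COST.get(frozenset((x, y)), 1.0)


def weighted_distance(a: str, b: str, gap: float = 1.0) -> float:
    """Levenshtein-style distance with keyboard-weighted substitutions."""
    a, b = a.upper(), b.upper()
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + gap,                           # deletion
                cur[j - 1] + gap,                        # insertion
                prev[j - 1] + substitution_cost(ca, cb), # weighted substitution
            ))
        prev = cur
    return prev[-1]
```

Under these assumptions "CAT" versus "CST" costs 0.4, while "CAT" versus "CPT" costs a full 1.0.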

===MD Keyboard===
An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
This effectively adds the "understanding" that the keyboarder may have typed in one character before another.

===Smith-Waterman-Gotoh===
A variation of the Needleman-Wunsch algorithm. Needleman-Wunsch and Smith-Waterman-Gotoh are identical except that character deletions are given a different weight.
This effectively adds the "understanding" that the keyboarder may have tried to abbreviate one of the words.

===Dice’s Coefficient===
A variation of the N-Gram algorithm. Dice's Coefficient counts matching n-Grams but does not count extra duplicate n-Grams.

===Jaccard Similarity Coefficient===
A variation of the N-Gram algorithm. The Jaccard Similarity is identical to the N-Gram algorithm but uses a different formula for similarity computation.

===Overlap Coefficient===
A variation of the N-Gram algorithm. The Overlap Coefficient is identical to the N-Gram algorithm but uses a different formula for similarity computation.
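
The three N-Gram variants above differ mainly in how the shared grams are normalised. The sketch below uses the textbook set-based formulas with N = 2; whether MatchUp applies exactly these formulas is an assumption, and the function names are hypothetical.

```python
def gram_set(s: str, n: int = 2) -> set:
    """Distinct overlapping substrings of length n (duplicates collapse)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str) -> float:
    x, y = gram_set(a), gram_set(b)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(a: str, b: str) -> float:
    x, y = gram_set(a), gram_set(b)
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def overlap(a: str, b: str) -> float:
    x, y = gram_set(a), gram_set(b)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0
```

For "night" and "nacht", which share only the gram "ht", these give 0.25 (Dice), 1/7 (Jaccard) and 0.25 (Overlap).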

===Double Metaphone===
A variation of the PhonetEx algorithm. Double Metaphone performs two different PhonetEx-style transformations, creating two PhonetEx-like codes (primary and alternate) for each of the two strings being compared.
The logic used for Double Metaphone Similarity works as follows:
*If primary1 = primary2 and alternate1 = alternate2, then we have a very good match (99%).
*If either primary1 = alternate2 or alternate1 = primary2, and alternate1 = alternate2, then we have a good match (85%).
*If alternate1 = alternate2, we have an acceptable match (75%).
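
The three rules above can be written as a small decision function. The sketch takes the four codes as inputs rather than computing them, reads the second rule as "(primary1 = alternate2 or alternate1 = primary2) and alternate1 = alternate2", and assumes a score of 0 when no rule applies, which the list does not state. The function name and the sample codes in the usage note are hypothetical.

```python
def double_metaphone_score(primary1: str, alternate1: str,
                           primary2: str, alternate2: str) -> int:
    """Apply the three similarity rules in order; the first one that holds wins."""
    if primary1 == primary2 and alternate1 == alternate2:
        return 99  # very good match
    if (primary1 == alternate2 or alternate1 == primary2) and alternate1 == alternate2:
        return 85  # good match
    if alternate1 == alternate2:
        return 75  # acceptable match
    return 0       # assumed: no rule applies
```

For instance, <code>double_metaphone_score("XMT", "XMT", "SM0", "XMT")</code> falls through the first rule because the primary codes differ, then satisfies the second rule and returns 85.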

[[Category:Pentaho]]

Pentaho:MatchUp:Algorithms - Revision history
