# Pentaho:MatchUp:Algorithms

The MatchUp Editor can use the following matching algorithms:

### Exact Matching

Determines whether two strings are identical.

### SoundEx

SoundEx is a phonetic algorithm that transforms each string into a short code and then compares the codes. For example, JOHNSON is transformed to "J525" and JHNSN is also transformed to "J525", so the two are considered a SoundEx match. If the original strings are identical, SoundEx returns 100%. If only the SoundEx codes are equal, the algorithm returns 99%. Otherwise, SoundEx returns 0%.
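The scoring rule above can be sketched in a few lines. This is a minimal illustration that assumes the classic American Soundex coding; the function names `soundex` and `soundex_score` are hypothetical, and MatchUp's own implementation may differ in its details.

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter followed by three digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]  # pad or truncate to four characters


def soundex_score(a: str, b: str) -> int:
    """Score as described in the text: 100, 99, or 0."""
    if a == b:
        return 100
    if soundex(a) == soundex(b):
        return 99
    return 0


print(soundex("JOHNSON"), soundex("JHNSN"))   # J525 J525
print(soundex_score("JOHNSON", "JHNSN"))      # 99
```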

### PhonetEx

A variation of the SoundEx algorithm. PhonetEx takes into account letter combinations that sound alike, particularly at the start of a word (such as 'PN' = 'N', 'PH' = 'F').

### Containment

Matches when one record's component is contained in the other record's component. For example, “Smith” is contained in “Smithfield.”
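As a rough sketch, containment can be tested in either direction with a substring check. The function name `is_contained` is hypothetical, and the case-insensitive comparison and whitespace trimming are assumptions made for illustration, not documented MatchUp behavior.

```python
def is_contained(a: str, b: str) -> bool:
    """True when either string appears inside the other (assumed case-insensitive)."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False  # assumption: an empty component never matches
    return a in b or b in a


print(is_contained("Smith", "Smithfield"))  # True
print(is_contained("Smith", "Jones"))       # False
```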

### Frequency

Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.”
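One way to picture order-independent matching is to compare character counts. The sketch below assumes that every character must appear the same number of times in both strings; `frequency_match` is a hypothetical name, not a MatchUp function.

```python
from collections import Counter


def frequency_match(a: str, b: str) -> bool:
    """True when both strings contain the same characters with the same counts."""
    return Counter(a) == Counter(b)


print(frequency_match("abcdef", "badcfe"))  # True
print(frequency_match("abcdef", "abcdeg"))  # False
```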

### Fast Near

A typographical matching algorithm. It works best for words that fail to match only because of a few typographical errors. The number of errors tolerated is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.

### Accurate Near

An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
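For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below is the textbook dynamic-programming version. The conversion to a percentage in `levenshtein_similarity` is an assumption for illustration only; this page does not state how MatchUp derives its score.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed scoring: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
```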

### Frequency Near

The Frequency algorithm will match the characters of one string to the characters of another without any regard to the sequence. For example "abcdef" would be considered a 100% match to "badcfe."

### Vowels

Only vowels will be compared. Consonants will be removed.

### Consonants

Only consonants will be compared. Vowels will be removed.

### Alphabetic

Only alphabetic characters will be compared.

### Numeric

Only numeric characters will be compared. Decimals and signs are considered numeric.
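The four character-class algorithms above (Vowels, Consonants, Alphabetic, Numeric) can be thought of as filters applied before comparison. The helpers below are a hypothetical sketch: the names are invented, and treating Y as a consonant and upper-casing the input are assumptions rather than documented behavior.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def only_vowels(s: str) -> str:
    return "".join(c for c in s.upper() if c in VOWELS)


def only_consonants(s: str) -> str:
    return "".join(c for c in s.upper() if c.isalpha() and c not in VOWELS)


def only_alphabetic(s: str) -> str:
    return "".join(c for c in s if c.isalpha())


def only_numeric(s: str) -> str:
    # Digits plus decimal points and signs, as described in the text
    return "".join(c for c in s if c.isdigit() or c in "+-.")


print(only_vowels("Melissa"))        # EIA
print(only_consonants("Melissa"))    # MLSS
print(only_alphabetic("Suite 100"))  # Suite
print(only_numeric("$-12.50 USD"))   # -12.50
```

Two components would then be compared on their filtered forms, for example `only_consonants(a) == only_consonants(b)`.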

### N-Gram

Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
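To make the idea concrete, the sketch below splits each string into overlapping grams of size 2 and counts how many they share. The names `ngrams` and `common_ngrams` are hypothetical, and the way MatchUp turns this count into a percentage is not described here, so only the count is shown.

```python
from collections import Counter


def ngrams(s: str, n: int = 2) -> list:
    """All overlapping substrings of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def common_ngrams(a: str, b: str, n: int = 2) -> int:
    """Number of grams shared by both strings (duplicates counted per occurrence)."""
    shared = Counter(ngrams(a, n)) & Counter(ngrams(b, n))
    return sum(shared.values())


print(ngrams("night"))                  # ['ni', 'ig', 'gh', 'ht']
print(common_ngrams("night", "nacht"))  # 1
```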

### Jaro

Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
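The description above corresponds to the textbook Jaro similarity, sketched below. Characters count as common only if they fall within a sliding window of each other, and a transposition is a pair of common characters that appear in a different order. This is the standard published formula, shown as an illustration; MatchUp's implementation may differ.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0

    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_flags = [False] * len(a)
    b_flags = [False] * len(b)
    matches = 0

    # Gather common characters, in order, within the match window
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_flags[j] and b[j] == ca:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count transpositions between the two common-character sequences
    a_common = [c for c, f in zip(a, a_flags) if f]
    b_common = [c for c, f in zip(b, b_flags) if f]
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) // 2

    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```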

### Jaro-Winkler

Like Jaro, but gives added weight to matching characters at the start of the string (up to 4 characters).
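The prefix weighting can be sketched as an adjustment applied to an existing Jaro score. The function below takes that score as an input so it stands alone. The name `winkler_boost` is hypothetical, and the scaling factor of 0.1 is the value conventionally used with Jaro-Winkler, assumed here rather than taken from MatchUp.

```python
def winkler_boost(a: str, b: str, jaro_score: float,
                  scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Raise a Jaro score according to the length of the shared prefix."""
    prefix = 0
    for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
        if ca != cb:
            break
        prefix += 1
    return jaro_score + prefix * scaling * (1.0 - jaro_score)


# MARTHA and MARHTA share the 3-character prefix "MAR"
print(round(winkler_boost("MARTHA", "MARHTA", 0.9444), 4))  # 0.9611
```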

### Longest Common Substring (LCS)

Finds the longest common substring between the two strings.
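A minimal sketch of finding the longest common substring with dynamic programming follows. The function name is hypothetical and the comparison is case-sensitive; how MatchUp scores the result is not covered on this page.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("smithfield", "blacksmith"))  # smith
```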

### Needleman-Wunsch

A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far apart the two characters are on a standard keyboard layout. For example, A to S is given a mistake weight of 0.4, while A to D is 0.6 and A to P is 1.0.
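The weighting idea can be illustrated with an edit distance whose substitution cost comes from a lookup table. The sketch below is hypothetical: the table holds only the three weights quoted above, every other substitution falls back to 1.0, and the insertion/deletion cost of 1.0 is an assumption. The full keyboard weight table used by MatchUp is not given on this page.

```python
# Only the three weights quoted in the text; all other pairs default to 1.0
SUB_COST = {
    frozenset("AS"): 0.4,
    frozenset("AD"): 0.6,
    frozenset("AP"): 1.0,
}


def substitution_cost(x: str, y: str) -> float:
    x, y = x.upper(), y.upper()
    if x == y:
        return 0.0
    return SUB_COST.get(frozenset((x, y)), 1.0)


def weighted_distance(a: str, b: str, gap: float = 1.0) -> float:
    """Levenshtein-style distance with keyboard-weighted substitutions."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * gap]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + gap,                           # deletion
                            curr[j - 1] + gap,                       # insertion
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]


print(weighted_distance("CAT", "CST"))  # 0.4
print(weighted_distance("CAT", "CDT"))  # 0.6
print(weighted_distance("CAT", "CPT"))  # 1.0
```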

### MD Keyboard

An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings. This effectively adds the "understanding" that the keyboarder may have typed in one character before another.

### Smith-Waterman-Gotoh

A variation of the Needleman-Wunsch algorithm. Needleman-Wunsch and Smith-Waterman-Gotoh are identical except that character deletions are given a different weight. This effectively adds the "understanding" that the keyboarder may have tried to abbreviate one of the words.

### Dice’s Coefficient

A variation of the N-Gram algorithm. Dice's Coefficient counts matching n-Grams but does not count extra duplicate n-Grams.

### Jaccard Similarity Coefficient

A variation of the N-Gram algorithm. The Jaccard Similarity is identical to the N-Gram algorithm but uses a different formula for similarity computation.

### Overlap Coefficient

A variation of the N-Gram algorithm. The Overlap Coefficient is identical to the N-Gram algorithm but uses a different formula for similarity computation.
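For comparison, the three coefficients above have well-known textbook definitions over sets of n-grams, sketched below with a gram size of 2. These are the standard formulas, shown only to illustrate how the similarity computations differ; the exact formulas MatchUp uses are not stated on this page, and the function names are hypothetical.

```python
def bigram_set(s: str, n: int = 2) -> set:
    """Distinct overlapping grams of length n (duplicates collapse in a set)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str) -> float:
    x, y = bigram_set(a), bigram_set(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str) -> float:
    x, y = bigram_set(a), bigram_set(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def overlap(a: str, b: str) -> float:
    x, y = bigram_set(a), bigram_set(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


# "night" and "nacht" each have 4 distinct bigrams and share one ("ht")
print(dice("night", "nacht"))     # 0.25
print(overlap("night", "nacht"))  # 0.25
```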

### Double Metaphone

A variation of the PhonetEx algorithm. Double Metaphone performs two different PhonetEx-style transformations, creating two PhonetEx-like codes (primary and alternate) for each of the two strings. The logic used for Double Metaphone similarity works as follows:

- If primary1 = primary2 and alternate1 = alternate2, then we have a very good match (99%).
- If either primary1 = alternate2 or alternate1 = primary2, and alternate1 = alternate2, then we have a good match (85%).
- If alternate1 = alternate2, we have an acceptable match (75%).
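The three rules above can be sketched as a small scoring function that takes the already-computed (primary, alternate) codes for each string. Several points are assumptions: the function name is hypothetical, the second rule is read as "(primary1 = alternate2 or alternate1 = primary2) and alternate1 = alternate2", a result of 0% is assumed when no rule applies, and the codes in the usage lines are made-up placeholders rather than real Double Metaphone output.

```python
def double_metaphone_score(first: tuple, second: tuple) -> int:
    """Apply the three rules in order; each argument is (primary, alternate)."""
    primary1, alternate1 = first
    primary2, alternate2 = second

    if primary1 == primary2 and alternate1 == alternate2:
        return 99  # very good match
    if (primary1 == alternate2 or alternate1 == primary2) \
            and alternate1 == alternate2:
        return 85  # good match
    if alternate1 == alternate2:
        return 75  # acceptable match
    return 0       # assumption: no rule applies, so no match


# Placeholder codes, for illustration only
print(double_metaphone_score(("AAA", "BBB"), ("AAA", "BBB")))  # 99
print(double_metaphone_score(("BBB", "BBB"), ("AAA", "BBB")))  # 85
print(double_metaphone_score(("AAA", "CCC"), ("BBB", "CCC")))  # 75
```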