Matchcode Optimization:Fuzzy Algorithms

From Melissa Data Wiki
Revision as of 22:03, 21 September 2018 by Admin (talk | contribs)
Jump to navigation Jump to search

← MatchUp Hub

Matchcode Optimization Navigation
Matchcode Optimization
First Component
Fuzzy Algorithms
Swap Matching
Blank Matching
Advanced Component Types
Algorithms
Accunear
Alphas
Consonants
Containment
Dice's Coefficient
Double Metaphone
Exact
Fast Near
Frequency
Frequency Near
Jaccard Similarity Coefficient
Jaro
Jaro-Winkler
Longest Common Substring (LCS)
MD Keyboard
Needleman-Wunsch
N-Gram
Numeric
Overlap Coefficient
Phonetex
Smith-Waterman-Gotoh
Soundex
UTF8 Near
Vowels


Fuzzy Algorithms

MatchUp has an extensive list of fuzzy algorithm choices. Depending on the nature of the data being processed, selecting a specific algorithm may result in more flagged duplicates, but possibly with the tradeoff of a slower throughput. This is called balancing performance vs accuracy. The fuzzy algorithms, with a general performance rank from fastest (5) to slowest (1):


ALGORITHM RANK LATE or EARLY
EXACT 5 Early
VOWELS 5 Early
NUMERICS 5 Early
CONSONENTS 5 Early
ALPHAS 5 Early
SOUNDEX 4 Early
PHONETEX 4 Early
FREQUENCY 4 Late
FASTNEAR 3 Late
FREQNEAR 3 Late
CONTAINMENT 3 Late
NGRAM 2 Late
ACCUNEAR 2 Late
LCS 2 Late
OVERLAP COEFFICIENT 1 Late
JACCARD 1 Late
SMITH WATERMAN 1 Late
MDKEYBOARD 1 Late
UTF8NEAR 1 Late
JARO 1 Late
JAROWINK 1 Late
DICES COEFFICIENT 1 Late
DOUBLE 1 Late
NEEDLEMAN WUNSCH 1 Late


These algorithms fall into two categories: early matching and late matching.

Early Matching

Early matching algorithms are algorithms where a string is transformed into a (usually shorter) representation and comparisons are performed on this result. In MatchUp, these transformations are performed during key generation, which means that the early matching algorithms pay a speed penalty once per record: as each record’s key is built.

Late Matching

Late matching algorithms are actual comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These transformations are performed during key comparison. As a result, late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.

Matching Speed

Therefore, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy matching algorithm may improve the speed, and often will give nearly the same results. Test thoroughly before processing live data.

Accuracy

Using an Exact fuzzy setting will return a logical Boolean answer based on the matchkey – the two keys are either ‘Exactly’ the same and therefore match, or are not exactly the same, and therefore do not match. Fuzzy algorithms make allowances for un-exact data.

Since each algorithm calculates the variation allowance differently, some algorithms perform more accurately over others for differently constructed data

In choosing an algorithm with respect to accuracy, consider the following types of data:

Value Type cases for Fuzzy Algorithm usage
  • String :  % similarity between two strings.
  • Knowledgebase: The presence of keys words (HS v High School) must be evaluated.
  • Dictionary (or decode) Arrays: "01 03 46 82" vs "06 46 03 01".
Value types where fuzzy algorithms are not recommended
  • Quantifiable: numbers, dates, phone values, account numbers, etc.
Use cases where fuzzy algorithms are not recommended
  • Record consolidation: Gather, Survivorship, record roll-up.


Pro and con recommendations are made in each algorithms page.

In many cases the algorithm output has been normalized so the return value can be compared against the user configured distance threshold percentage.