Matchcode Optimization:Fuzzy Algorithms
Fuzzy Algorithms
MatchUp offers an extensive list of fuzzy algorithms. Depending on the nature of the data being processed, a particular algorithm may flag more duplicates, possibly at the cost of slower throughput. This is the tradeoff between performance and accuracy. The fuzzy algorithms are listed below with a general performance rank from fastest (5) to slowest (1):
| ALGORITHM | RANK | LATE or EARLY |
|---|---|---|
| EXACT | 5 | Early |
| VOWELS | 5 | Early |
| NUMERICS | 5 | Early |
| CONSONANTS | 5 | Early |
| ALPHAS | 5 | Early |
| SOUNDEX | 4 | Early |
| PHONETEX | 4 | Early |
| FREQUENCY | 4 | Late |
| FASTNEAR | 3 | Late |
| FREQNEAR | 3 | Late |
| CONTAINMENT | 3 | Late |
| NGRAM | 2 | Late |
| ACCUNEAR | 2 | Late |
| LCS | 2 | Late |
| OVERLAP COEFFICIENT | 1 | Late |
| JACCARD | 1 | Late |
| SMITH WATERMAN | 1 | Late |
| MDKEYBOARD | 1 | Late |
| UTF8NEAR | 1 | Late |
| JARO | 1 | Late |
| JAROWINK | 1 | Late |
| DICES COEFFICIENT | 1 | Late |
| DOUBLE | 1 | Late |
| NEEDLEMAN WUNSCH | 1 | Late |
These algorithms fall into two categories: early matching and late matching.
Early Matching
- Early matching algorithms transform a string into a (usually shorter) representation and perform comparisons on the result. In MatchUp, these transformations are performed during key generation, so early matching algorithms pay a speed penalty only once per record, as each record’s key is built.
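As a rough illustration of the once-per-record cost, the sketch below builds a transformed key for each value a single time and then groups values whose keys are identical. The `consonant_key` function is a hypothetical stand-in for an early transformation (keep only consonant letters); it is an assumption for illustration and not MatchUp's actual implementation of any algorithm.

```python
from collections import defaultdict

VOWELS = set("AEIOU")


def consonant_key(value: str) -> str:
    """Hypothetical early transformation: keep only consonant letters, upper-cased."""
    return "".join(ch for ch in value.upper() if ch.isalpha() and ch not in VOWELS)


def group_by_key(values):
    """Build each value's key exactly once, then group values with identical keys."""
    groups = defaultdict(list)
    for value in values:
        # The transformation cost is paid here, one time per record.
        groups[consonant_key(value)].append(value)
    return dict(groups)


names = ["Johnson", "Jahnsen", "Jonson", "Smith"]
groups = group_by_key(names)
print(groups)
```

After key generation, deciding whether two records match is a plain equality check on the stored keys, which is why this style of algorithm stays fast as comparisons multiply.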
Late Matching
- Late matching algorithms are actual comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These computations are performed during key comparison, so late matching algorithms pay a speed penalty every time a record is compared to another record, which may happen several hundred times per record.
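To make the per-comparison cost concrete, here is a minimal sketch that uses the textbook Levenshtein edit distance as a stand-in for a matrix-based late algorithm. It is not MatchUp's implementation of any listed algorithm, and comparing every pair of records is a simplifying assumption; the point is only that the matrix work is repeated for each comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance, filling the comparison matrix row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def compare_all_pairs(values):
    """Compare every pair once; return the distances and the number of comparisons."""
    distances = {}
    comparisons = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            # The full matrix computation runs again for every pair.
            distances[(i, j)] = edit_distance(values[i], values[j])
            comparisons += 1
    return distances, comparisons


names = ["Johnson", "Jahnsen", "Jonson", "Smith"]
distances, comparisons = compare_all_pairs(names)
print(comparisons, distances)
```

With four records this already takes six comparisons, and the count grows much faster than the number of records, whereas key generation grows only linearly.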
Matching Speed
- Because the penalty is paid on every comparison rather than once per record, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy algorithm may improve speed and will often give nearly the same results. Test thoroughly before processing live data.
Accuracy
Using the Exact setting returns a Boolean answer based on the matchkey: the two keys are either exactly the same and therefore match, or they are not exactly the same and therefore do not match. Fuzzy algorithms make allowances for inexact data.
Since each algorithm calculates the variation allowance differently, some algorithms are more accurate than others, depending on how the data is constructed.
In choosing an algorithm with respect to accuracy, consider the following types of data:
- Value type cases for fuzzy algorithm usage
  - String: percent similarity between two strings.
  - Knowledgebase: the presence of key words (HS vs. High School) must be evaluated.
  - Dictionary (or decode) arrays: "01 03 46 82" vs. "06 46 03 01".
- Value types where fuzzy algorithms are not recommended
  - Quantifiable: numbers, dates, phone values, account numbers, etc.
- Use cases where fuzzy algorithms are not recommended
  - Record consolidation: Gather, Survivorship, record roll-up.
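For the dictionary array case, order-independent set measures are a natural fit. The sketch below applies the textbook definitions of the Jaccard index and the overlap coefficient to the example values from the list above. These are the standard formulas, assumed here for illustration; MatchUp's own implementations of JACCARD and OVERLAP COEFFICIENT may tokenize or score differently.

```python
def tokens(value: str) -> set:
    """Split a space-delimited code array into a set of codes (order is ignored)."""
    return set(value.split())


def jaccard(a: str, b: str) -> float:
    """Shared codes divided by all distinct codes in either value."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty arrays are treated as identical
    return len(ta & tb) / len(ta | tb)


def overlap_coefficient(a: str, b: str) -> float:
    """Shared codes divided by the size of the smaller array."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


left, right = "01 03 46 82", "06 46 03 01"
print(jaccard(left, right))              # 0.6  (3 shared codes out of 5 distinct)
print(overlap_coefficient(left, right))  # 0.75 (3 shared codes out of 4)
```

A position-by-position string comparison would score these two values poorly, even though they share three of four codes.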
Pro and con recommendations are given on each algorithm's page.
In many cases the algorithm's output is normalized so that the return value can be compared against the user-configured distance threshold percentage.
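One common way to normalize a raw distance into a percentage is sketched below. The formula (one minus the distance divided by the longer string's length) and the function names are assumptions for illustration; the source does not state how MatchUp normalizes any particular algorithm.

```python
def similarity_percent(distance: int, len_a: int, len_b: int) -> float:
    """Convert a raw edit-style distance into a 0-100 similarity percentage."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1.0 - distance / longest)


def is_match(percent: float, threshold_percent: float) -> bool:
    """A pair matches when its similarity meets the configured threshold."""
    return percent >= threshold_percent


# Example: one edit separates a 7-character and a 6-character string.
score = similarity_percent(distance=1, len_a=7, len_b=6)
print(round(score, 1), is_match(score, 80.0))
```

Normalizing this way lets a single threshold setting be applied consistently, regardless of how long the compared values are.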