Matchcode Optimization:Fuzzy Algorithms
Revision as of 17:28, 20 September 2018
Fuzzy Algorithms
MatchUp offers an extensive list of fuzzy algorithms. Depending on the nature of the data being processed, a particular algorithm may flag more duplicates, possibly at the cost of slower throughput. Choosing one is a matter of balancing performance against accuracy. The fuzzy algorithms are listed below with a general performance rank from fastest (5) to slowest (1):
ALGORITHM            RANK  LATE or EARLY
EXACT                5     Early
VOWELS               5     Early
NUMERICS             5     Early
CONSONANTS           5     Early
ALPHAS               5     Early
SOUNDEX              4     Early
PHONETEX             4     Early
FREQUENCY            4     Late
FASTNEAR             3     Late
FREQNEAR             3     Late
CONTAINMENT          3     Late
NGRAM                2     Late
ACCUNEAR             2     Late
LCS                  2     Late
OVERLAP COEFFICIENT  1     Late
JACCARD              1     Late
SMITH WATERMAN       1     Late
MDKEYBOARD           1     Late
UTF8NEAR             1     Late
JARO                 1     Late
JAROWINK             1     Late
DICES COEFFICIENT    1     Late
DOUBLE               1     Late
NEEDLEMAN WUNSCH     1     Late
These algorithms fall into two categories: early matching and late matching.
Early Matching
- Early matching algorithms transform a string into a (usually shorter) representation, and comparisons are performed on the result. In MatchUp, these transformations are performed during key generation, so early matching algorithms pay their speed penalty once per record, as each record’s key is built.
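The once-per-record cost can be illustrated with a minimal sketch. The `consonants_key` transform and `group_by_key` helper below are hypothetical stand-ins written for this example; they are not MatchUp's implementation of any listed algorithm.

```python
from collections import defaultdict

VOWELS = set("AEIOU")

def consonants_key(value: str) -> str:
    """Hypothetical early-matching transform: keep consonant letters only."""
    return "".join(ch for ch in value.upper() if ch.isalpha() and ch not in VOWELS)

def group_by_key(records):
    """Transform each record exactly once, then group records with equal keys."""
    groups = defaultdict(list)
    for record in records:
        # The transformation cost is paid here, once per record.
        groups[consonants_key(record)].append(record)
    return dict(groups)

names = ["Johnson", "Jahnsen", "Smith", "Smyth"]
groups = group_by_key(names)
```

After the keys are built, matching reduces to a plain equality test on the transformed values, which is why this style of algorithm tends to be fast.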
Late Matching
- Late matching algorithms are true comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. This work is performed during key comparison, so late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.
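As a rough illustration of a matrix-based comparison, the sketch below uses a generic edit distance (Levenshtein). This is a swapped-in textbook technique chosen for brevity, not the internals of any MatchUp algorithm, and `compare_all` is a hypothetical helper that compares every pair of records.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one matrix row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def compare_all(records):
    """Run the comparison for every pair; the cost is paid once per pair."""
    return {(a, b): edit_distance(a, b) for a, b in combinations(records, 2)}

pairs = compare_all(["Smith", "Smyth", "Smithe", "Jones"])
```

Four records already require six comparisons in this all-pairs sketch, and each comparison fills a matrix whose size grows with the product of the two string lengths.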
Matching Speed
- Late matching is therefore much slower than early matching. If a particular matchcode is very slow, switching to a faster fuzzy algorithm may improve speed and will often give nearly the same results. Test thoroughly before processing live data.
Accuracy
The Exact setting returns a Boolean answer based on the matchkey: the two keys are either exactly the same, and therefore match, or they are not, and therefore do not match. Fuzzy algorithms make allowances for inexact data.
Because each algorithm calculates its allowance for variation differently, some algorithms are more accurate than others on differently constructed data.
In choosing an algorithm with respect to accuracy, consider the following types of data:
- Value type cases for fuzzy algorithm usage:
  - String: percent similarity between two strings.
  - Knowledgebase: the presence of keywords (HS vs. High School) must be evaluated.
  - Dictionary (or decode) arrays: "01 03 46 82" vs. "06 46 03 01".
- Value types where fuzzy algorithms are not recommended:
  - Quantifiable: numbers, dates, phone values, account numbers, etc.
- Use cases where fuzzy algorithms are not recommended:
  - Record consolidation: Gather, Survivorship, record roll-up.
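To see why the construction of the data matters, consider the dictionary array example above. The sketch below contrasts a set-based measure (the standard Jaccard coefficient) with a naive position-by-position comparison; both functions are illustrative assumptions written for this example and are not MatchUp's implementations.

```python
def jaccard(a_tokens, b_tokens) -> float:
    """Standard Jaccard coefficient: shared tokens divided by all distinct tokens."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def positional_match(a_tokens, b_tokens) -> float:
    """Naive comparison: fraction of positions holding the same token."""
    longest = max(len(a_tokens), len(b_tokens))
    if longest == 0:
        return 1.0
    same = sum(1 for x, y in zip(a_tokens, b_tokens) if x == y)
    return same / longest

left = "01 03 46 82".split()
right = "06 46 03 01".split()

set_score = jaccard(left, right)                  # 3 shared codes out of 5 distinct
position_score = positional_match(left, right)    # no position holds the same code
```

The two arrays share three of their five distinct codes, so an order-insensitive measure scores them as fairly similar, while an order-sensitive one sees no agreement at all.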
Pros and cons are given on each algorithm's page.
In many cases the algorithm output has been normalized so that the return value can be compared against the user-configured distance threshold percentage.
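One common way to normalize a raw distance is to scale it by the longer string's length and express the result as a percentage. The sketch below assumes that scheme; the function names and the formula are illustrative assumptions, not MatchUp's documented calculation.

```python
def normalized_similarity(distance: int, len_a: int, len_b: int) -> float:
    """Convert a raw distance into a 0-100 similarity percentage (assumed scheme)."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 100.0  # two empty values are treated as identical
    return 100.0 * (longest - distance) / longest

def is_match(distance: int, len_a: int, len_b: int, threshold_pct: float) -> bool:
    """Hypothetical threshold check: match when similarity meets the percentage."""
    return normalized_similarity(distance, len_a, len_b) >= threshold_pct

# Two 5-character values that differ by one edit score 80 percent.
one_edit = normalized_similarity(1, 5, 5)
```

Under this assumed scheme, a single threshold percentage means the same thing regardless of which algorithm produced the raw score, which is the practical benefit of normalization.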