Matchcode Optimization:UTF8 Near: Difference between revisions
Latest revision as of 14:32, 27 September 2018
UTF-8 Near
Specifics
- An algorithm developed by Melissa Data to match multi-byte data.
Summary
- UTF-8 Near performs general distance (string similarity) comparisons. It is an alternative to the other available algorithms, which are designed to evaluate strings on a character-for-character basis. In many international extended character sets a character cannot be represented by a single byte, which makes the results returned by those algorithms inaccurate.
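To illustrate why byte-oriented comparison can mislead on multi-byte data, the sketch below runs the same edit-distance routine once over Unicode code points and once over raw UTF-8 bytes. It uses a generic Levenshtein distance chosen purely for illustration; it is not Melissa Data's UTF-8 Near implementation, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (str or bytes)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


s1, s2 = "Asbjørn", "Asbjorn"

# "ø" is one code point but two bytes in UTF-8.
print(len(s1), len(s1.encode("utf-8")))  # 7 8

by_character = edit_distance(s1, s2)
by_byte = edit_distance(s1.encode("utf-8"), s2.encode("utf-8"))
print(by_character, by_byte)  # 1 2
```

A single differing character counts as two differences once the strings are treated as bytes, so any score derived from the byte-level distance understates the similarity.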
Returns
- Percentage of similarity
Example Matchcode Component
Example Data
STRING1 | STRING2 | RESULT |
---|---|---|
Johnson | Jhnsn | Unique |
Maguire | Mcguire | Match Found |
Beaumarchais | Bumarchay | Unique |
Asbjørn Aerocorp | Asbjorn Aerocorp | Match Found |
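As a rough way to see how a percentage of similarity could separate the example rows, the sketch below scores each pair with a normalized Levenshtein distance over code points and applies a cutoff. The scoring formula, the names `similarity_percent` and `THRESHOLD`, and the 80% cutoff are all assumptions made for this illustration; the actual UTF-8 Near scoring and its threshold are not documented here.

```python
def edit_distance(a, b):
    """Levenshtein distance over code points (a generic stand-in)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Assumed formula: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


THRESHOLD = 80.0  # hypothetical cutoff, not a documented MatchUp setting

pairs = [
    ("Johnson", "Jhnsn"),
    ("Maguire", "Mcguire"),
    ("Beaumarchais", "Bumarchay"),
    ("Asbjørn Aerocorp", "Asbjorn Aerocorp"),
]

results = []
for s1, s2 in pairs:
    pct = similarity_percent(s1, s2)
    result = "Match Found" if pct >= THRESHOLD else "Unique"
    results.append(result)
    print(f"{s1} / {s2}: {pct:.1f}% -> {result}")
```

Under these assumptions the four pairs fall on the same side of the cutoff as in the table above, but that agreement says nothing about how the real algorithm arrives at its results.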
Scale | From | To |
---|---|---|
Performance | Slower | Faster |
Matches | More Matches | Greater Accuracy |
Recommended Usage
- UTF-8 data. This algorithm was added to MatchUp on the assumption that international data contains multi-byte characters, which makes the other algorithms inconsistent in accuracy.
- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in a large existing master database. Batch-level runs where the other matchcode components are set to exact, or where the database originates from a single country.
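The hybrid pattern described above, one incoming record scored independently against every master record, can be sketched as follows. This is a minimal illustration that uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; it does not call the MatchUp API, and `find_matches`, the 80% threshold, and the sample master list are hypothetical.

```python
from difflib import SequenceMatcher


def find_matches(incoming, master, threshold=80.0):
    """Score one incoming value against each master value independently."""
    hits = []
    for index, candidate in enumerate(master):
        # Python str compares by code point, so "ø" counts as one character.
        pct = 100.0 * SequenceMatcher(None, incoming, candidate).ratio()
        if pct >= threshold:
            hits.append((index, candidate, round(pct, 2)))
    return hits


# Hypothetical master data for illustration only.
master = ["Asbjorn Aerocorp", "Johnson Ltd", "Maguire Holdings"]
hits = find_matches("Asbjørn Aerocorp", master)
print(hits)
```

Because each comparison depends only on the incoming record and one master record, the loop can stop early or be split across workers without changing the result for any individual pair.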
Not Recommended For
- Databases merged from different countries and intended to match on a single data type.