Matchcode Optimization:UTF8 Near
UTF-8 Near
Specifics
- An algorithm developed by Melissa Data to match multi-byte data.
Summary
- UTF-8 Near performs general distance (string similarity) comparisons. It is an alternative to the other available algorithms, which are designed to evaluate strings on a character-by-character basis. In many international extended character sets, a character cannot be represented by a single byte, which makes the results returned by those algorithms inaccurate.
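To illustrate why multi-byte characters skew a byte-oriented comparison, the sketch below measures the same pair of strings twice: once as code points and once as raw UTF-8 bytes. It uses a plain Levenshtein edit distance as a stand-in. This is an assumption made for illustration only, since the internals of UTF-8 Near are not described here. The helper name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (str or bytes).

    Illustrative stand-in only; not the UTF-8 Near algorithm itself.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


s1, s2 = "Asbjørn", "Asbjorn"

# Compared as code points, the strings differ by one substitution.
char_distance = levenshtein(s1, s2)

# "ø" occupies two bytes in UTF-8, so the same difference costs two byte edits.
byte_distance = levenshtein(s1.encode("utf-8"), s2.encode("utf-8"))

print(len(s1), len(s1.encode("utf-8")))  # code point count vs byte count
print(char_distance, byte_distance)
```

The one-character difference is counted as two edits when the strings are treated as bytes, which is the kind of distortion a multi-byte-aware comparison avoids.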
Returns
- Percentage of similarity
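As a rough picture of what a percentage-of-similarity result looks like, the sketch below scores two strings on a 0 to 100 scale. It relies on `difflib.SequenceMatcher` from the Python standard library, which is an assumption standing in for the proprietary scoring. The values it prints will not necessarily agree with MatchUp's. The function name `similarity_percent` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score computed over code points, not bytes.

    Illustrative stand-in only; not the UTF-8 Near scoring formula.
    """
    return 100.0 * SequenceMatcher(None, a, b).ratio()


pairs = [
    ("Asbjørn Aerocorp", "Asbjorn Aerocorp"),
    ("Johnson", "Jhnsn"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity_percent(left, right):.1f}%")
```

Because Python `str` values are sequences of code points, the multi-byte character "ø" counts as a single position in this score.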
Example Matchcode Component
Example Data
STRING1 | STRING2 | RESULT
---|---|---
Johnson | Jhnsn | Unique
Maguire | Mcguire | Match Found
Beaumarchais | Bumarchay | Unique
Asbjørn Aerocorp | Asbjorn Aerocorp | Match Found
Scale | Range
---|---
Performance | Slower to Faster
Matches | More Matches to Greater Accuracy
Recommended Usage
- UTF-8 data. This algorithm was added to MatchUp on the assumption that international data contains multi-byte characters, which make the other algorithms inconsistent in accuracy.
- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.
- Batch-level runs where the other matchcode components are set to exact, or where the database originates from a single country.
Not Recommended For
- Databases merged from different countries and intended to match on a single data type.