Matchcode Optimization:Jaro
Jaro
Specifics
- Winkler Distance
Summary
- Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Returns
- Percentage of similarity
- 1/3 * (common/len1 + common/len2 + (common-transpositions)/common)
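- Where len1 and len2 are the lengths of the two strings, and common is the number of character matches. A character in one string matches the same character in the other if the distance between their positions is within the algorithm's defined range. Transpositions is the number of matched characters that appear in a different sequence order, divided by 2.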
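The formula above can be sketched in Python as follows. This is a minimal illustration, not MatchUp's implementation: the function name `jaro_similarity` is hypothetical, and the match range used here (half the longer string's length, minus one) is the common textbook choice, assumed because the algorithm's actual range is not stated on this page.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return a Jaro-style similarity between 0.0 and 1.0 (illustrative sketch)."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0  # convention chosen for this sketch

    # ASSUMPTION: textbook match range; MatchUp's own range is not documented here.
    match_range = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    common = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - match_range)
        hi = min(len2, i + match_range + 1)
        for j in range(lo, hi):
            # Each character of s2 may be matched at most once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0

    # The "two common strings": matched characters, each in its own string's order.
    common1 = [c for c, m in zip(s1, matched1) if m]
    common2 = [c for c, m in zip(s2, matched2) if m]

    # Matched characters in a different sequence order, divided by 2.
    transpositions = sum(a != b for a, b in zip(common1, common2)) / 2

    return (common / len1 + common / len2
            + (common - transpositions) / common) / 3


if __name__ == "__main__":
    pairs = [
        ("Johnson", "Jhnsn"),
        ("Maguire", "Mcguire"),
        ("Beaumarchais", "Bumarchay"),
        ("Deanardo", "Dinardio"),
    ]
    for a, b in pairs:
        print(f"{a:<13} {b:<10} {jaro_similarity(a, b):.3f}")
```

Under these assumptions, the two pairs listed as Match Found under Example Data both score about 0.90, while the two Unique pairs score about 0.71 and 0.78. The threshold MatchUp applies to turn the percentage into a match decision is not given on this page, so the sketch stops at the score.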
Example Matchcode Component
Example Data
STRING1 | STRING2 | RESULT |
---|---|---|
Johnson | Jhnsn | Match Found |
Maguire | Mcguire | Match Found |
Beaumarchais | Bumarchay | Unique |
Deanardo | Dinardio | Unique |
Performance | Slower | Faster |
---|---|---|
Matches | More Matches | Greater Accuracy |
Recommended Usage
- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.
- Databases created with abbreviations or similar word substitutions.
Not Recommended For
- Large or enterprise-level batch runs. Because the algorithm must be evaluated for each record comparison, throughput will be very slow.
- Databases created via real-time data entry where audio likeness errors are introduced.
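The throughput concern for batch runs can be made concrete with a rough count of comparisons. The sketch below assumes a naive deduper that compares every record against every other record, with no clustering or blocking to reduce the work. That is an assumption for illustration, not a description of how MatchUp schedules comparisons, and both function names are hypothetical.

```python
def batch_comparisons(n_records: int) -> int:
    """Pairwise comparisons needed to dedupe n_records against each other.

    ASSUMPTION: every unordered pair is compared exactly once.
    """
    return n_records * (n_records - 1) // 2


def single_record_comparisons(n_master: int) -> int:
    """Comparisons needed to check one incoming record against a master file."""
    return n_master


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9} records: batch={batch_comparisons(n):>15,} "
              f"single incoming record={single_record_comparisons(n):>9,}")
```

For one million records, the all-pairs batch count is 499,999,500,000 comparisons, while a single incoming record needs one million. This is why the algorithm fits the one-record-against-master case far better than a large batch run.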
Do Not Use With
- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, and therefore results may not be accurate if the data contains multi-byte characters.
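One way to screen data before choosing this algorithm is to check whether any value contains characters that occupy more than one byte in UTF-8. The sketch below is a hypothetical pre-check, not part of MatchUp. The helper name `has_multibyte_chars` and the non-ASCII sample names are made up for illustration.

```python
def has_multibyte_chars(text: str) -> bool:
    """Return True if any character in text needs more than one byte in UTF-8."""
    return len(text.encode("utf-8")) != len(text)


if __name__ == "__main__":
    # The last two names are illustrative samples, written with escapes
    # so the source file itself stays plain ASCII.
    samples = ["Maguire", "Beaumarchais", "M\u00fcller", "Pe\u00f1a"]
    for name in samples:
        n_chars = len(name)
        n_bytes = len(name.encode("utf-8"))
        print(f"{name}: {n_chars} characters, {n_bytes} bytes, "
              f"multi-byte={has_multibyte_chars(name)}")
```

When the character count and byte count differ, a byte-oriented comparison sees more positions than there are characters. As a result, the lengths and match positions that feed the formula no longer correspond to what a reader would count.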