Matchcode Optimization:Jaro
Latest revision as of 14:23, 27 September 2018
Jaro
Specifics
- Winkler Distance
  - http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Summary
- Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Returns
- Percentage of similarity
- 1/3 * (common/len1 + common/len2 + (common-transpositions)/common)
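- Here len1 and len2 are the lengths of the two strings. common is the number of matching characters: a character in one string matches the same character in the other if their positions differ by no more than the algorithm's defined range. transpositions is the number of matched characters that appear in a different sequence order, divided by 2.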
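The formula above can be sketched in Python as follows. This is a minimal illustration, not MatchUp's implementation: the function name `jaro_similarity` is hypothetical, the matching range of `max(len1, len2) // 2 - 1` is the one used by standard Jaro (this page only says "defined range"), and halving the out-of-order count with integer division is a common convention assumed here.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Return a Jaro similarity in [0.0, 1.0]; 1.0 means identical strings."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        # Choice made for this sketch: an empty string matches nothing.
        return 0.0

    # Assumed matching range (standard Jaro); not confirmed for MatchUp.
    match_range = max(max(len1, len2) // 2 - 1, 0)

    matched1 = [False] * len1
    matched2 = [False] * len2
    common = 0

    # Gather common characters: same character, positions within the range,
    # and each character of s2 used at most once.
    for i, ch in enumerate(s1):
        lo = max(0, i - match_range)
        hi = min(len2, i + match_range + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                common += 1
                break

    if common == 0:
        return 0.0

    # Compare the two sequences of common characters position by position.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    out_of_order = sum(a != b for a, b in zip(seq1, seq2))
    transpositions = out_of_order // 2  # assumed integer halving

    return (common / len1 + common / len2
            + (common - transpositions) / common) / 3
```

A call such as `jaro_similarity("Maguire", "Mcguire")` returns the similarity as a fraction; multiplying by 100 gives a percentage. The comparison is case-sensitive in this sketch.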
Example Matchcode Component
Example Data
| STRING1 | STRING2 | RESULT |
|---|---|---|
| Johnson | Jhnsn | Match Found |
| Maguire | Mcguire | Match Found |
| Beaumarchais | Bumarchay | Unique |
| Deanardo | Dinardio | Unique |
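As a worked illustration of the Returns formula, the first row (Johnson, Jhnsn) can be scored by hand. The matching range of floor(max(len1, len2)/2) - 1 is the standard Jaro value and is an assumption here; this page does not state MatchUp's range, nor the score at which a pair becomes "Match Found".

```latex
\begin{aligned}
\text{len1} &= 7, \quad \text{len2} = 5, \quad
  \text{range} = \lfloor 7/2 \rfloor - 1 = 2 \\
\text{common} &= 5 \ (\text{J, h, n, s, n}), \quad
  \text{transpositions} = 0 \\
\text{similarity} &= \tfrac{1}{3}\left(\tfrac{5}{7} + \tfrac{5}{5}
  + \tfrac{5 - 0}{5}\right) \approx 0.905
\end{aligned}
```

Both occurrences of "o" in Johnson go unmatched because Jhnsn contains no "o", and the five matched characters appear in the same order in both strings, so no transpositions are counted.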
| Rating | From | To |
|---|---|---|
| Performance | Slower | Faster |
| Matches | More Matches | Greater Accuracy |
Recommended Usage
- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.
- Databases created with abbreviations or similar word substitutions.
Not Recommended For
- Large or enterprise-level batch runs. Because the algorithm must be evaluated for every record comparison, throughput will be very slow.
- Databases created via real-time data entry, where sound-alike (phonetic) errors are introduced.
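The throughput concern for batch runs can be made concrete with a rough count of comparisons. The sketch below assumes a naive worst case in which every record is compared with every other record; how MatchUp actually groups records before comparing them is not described on this page, and both function names are hypothetical.

```python
def batch_comparisons(record_count: int) -> int:
    """Pairwise comparisons if every record is compared with every other."""
    return record_count * (record_count - 1) // 2


def incoming_record_comparisons(master_size: int) -> int:
    """Comparisons needed to check one incoming record against a master file."""
    return master_size


# One million records: all pairs versus a single incoming record.
print(batch_comparisons(1_000_000))
print(incoming_record_comparisons(1_000_000))
```

Under this assumption the batch count grows quadratically with the number of records, while the single-record case grows linearly, which is consistent with the hybrid deduper recommendation above.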
Do Not Use With
- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, and therefore results may not be accurate if the data contains multi-byte characters.
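To see why the one-byte-per-character assumption matters, the following sketch compares character counts with UTF-8 byte counts. The helper name `char_and_byte_lengths` is hypothetical, and the sample strings are illustrative only; the sketch does not reproduce MatchUp's internal behavior.

```python
from typing import Tuple


def char_and_byte_lengths(text: str) -> Tuple[int, int]:
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


# ASCII-only text: one byte per character, so the two counts agree.
print(char_and_byte_lengths("Johnson"))

# "Jos\u00e9" ends in a precomposed e-acute, which takes two bytes in UTF-8,
# so a byte-based algorithm sees a longer string with shifted positions.
print(char_and_byte_lengths("Jos\u00e9"))
```

When the counts differ, the lengths and character positions used in the similarity formula no longer correspond to what a reader would consider characters, which is how inaccurate results can arise.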