Matchcode Optimization:Overlap Coefficient
- Like Jaro or Dice, counts matching n-Grams (discarding duplicate NGRAMs), but uses a slightly different calculation weighted towards the smaller of the two strings being compared.
- Returns the percentage of similarity between the two strings.
- Similarity = (union / minNumNGrams) × 100
- Where union is defined as the number of matching NGRAMs found (the NGRAMs present in both strings)
- Where minNumNGrams is defined as the smaller of the two strings' possible NGRAM counts
- N is the NGRAM size: the length of the substrings searched for within a string (default is 2).
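The calculation described above can be sketched in Python. This is a minimal illustration, not MatchUp's implementation: the function names `ngrams` and `overlap_coefficient` are hypothetical, and three behaviors are assumptions rather than documented facts: case is folded before comparison, spaces are treated as ordinary characters, and minNumNGrams counts only distinct NGRAMs.

```python
def ngrams(text: str, n: int = 2) -> set:
    """Return the distinct n-grams of text (duplicates are discarded)."""
    # range() is empty when the text is shorter than n, giving an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_coefficient(a: str, b: str, n: int = 2) -> float:
    """Percentage similarity weighted towards the smaller string.

    Assumptions (not confirmed by the source): case-insensitive comparison,
    and minNumNGrams taken as the smaller count of distinct n-grams.
    """
    grams_a = ngrams(a.lower(), n)
    grams_b = ngrams(b.lower(), n)
    min_num_ngrams = min(len(grams_a), len(grams_b))
    if min_num_ngrams == 0:
        # At least one string is too short to contain any n-gram.
        return 0.0
    matching = len(grams_a & grams_b)  # the source calls this "union"
    return 100.0 * matching / min_num_ngrams


if __name__ == "__main__":
    for s1, s2 in [("Johnson", "Jhnsn"), ("Rober", "Roberts")]:
        print(s1, s2, overlap_coefficient(s1, s2))
```

Under these assumptions, `overlap_coefficient("Rober", "Roberts")` gives 100.0, because every bigram of the shorter string also appears in the longer one, while `overlap_coefficient("Johnson", "Jhnsn")` gives 50.0, since only 2 of the 4 bigrams in "Jhnsn" occur in "Johnson". The score at which MatchUp reports a match is not specified here, so the sketch returns only the percentage.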
Example Matchcode Component
|STRING1|STRING2|RESULT|
|---|---|---|
|Johnson|Jhnsn|Unique|
|Neumon|Pneumon|Match Found|
|Maytown Hs|Maytown Public Schools|Match Found|
|Rober|Roberts|Match Found|
Recommended For
- Hybrid deduper, where a single incoming record can be quickly evaluated independently against each record in a large existing master database.
- Databases created with abbreviations or similar word substitutions.
Not Recommended For
- Large or enterprise-level batch runs. Because the algorithm must be evaluated for every record comparison, throughput will be very slow.
- Databases created via real-time data entry, where sound-alike (phonetic) errors are introduced.
Do Not Use With
- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, and therefore results may not be accurate if the data contains multi-byte characters.
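The effect of the one-byte-per-character assumption can be illustrated with a short Python sketch. It does not reproduce MatchUp's internals; the helper names `char_bigram_count` and `byte_bigram_count` are hypothetical and only show how the number of possible bigrams changes when a multi-byte UTF-8 string is counted by bytes instead of by characters.

```python
def char_bigram_count(text: str) -> int:
    """Number of possible bigrams when counting by characters."""
    return max(len(text) - 1, 0)


def byte_bigram_count(text: str) -> int:
    """Number of possible bigrams when counting by UTF-8 bytes."""
    return max(len(text.encode("utf-8")) - 1, 0)


name = "Jos\u00e9"  # "José": 4 characters, but 5 bytes in UTF-8
print(char_bigram_count(name))  # 3
print(byte_bigram_count(name))  # 4
```

Because the accented character occupies two bytes, a byte-oriented count sees one more bigram than there really is, which would shift minNumNGrams and therefore the resulting percentage.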