Matchcode Optimization:Overlap Coefficient
Overlap Coefficient
Specifics
Summary
- Like Jaro or Dice, counts matching NGRAMs (discarding duplicate NGRAMs), but uses a slightly different calculation that is weighted toward the smaller of the two strings being compared.
Returns
- Percentage of similarity
- union / minNumNGrams
- Where union is defined as the number of matching NGRAMs found
- Where minNumNGrams is defined as the smaller of the two strings' possible NGRAM counts
- An NGRAM is a substring of a fixed size to search for within a string (the default size is 2).
Example Matchcode Component
Example Data
| STRING1 | STRING2 | RESULT |
|---|---|---|
| Johnson | Jhnsn | Unique |
| Neumon | Pneumon | Match Found |
| Maytown Hs | Maytown Public Schools | Match Found |
| Rober | Roberts | Match Found |
| Scale | From | To |
|---|---|---|
| Performance | Slower | Faster |
| Matches | More Matches | Greater Accuracy |
Recommended Usage
- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.
- Databases created with abbreviations or similar word substitutions.
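The hybrid scenario, one incoming record checked against each master record in turn, might look like the sketch below. It is a hedged illustration only: `find_matches`, the in-memory `master_records` list, and the 75% threshold are all hypothetical and are not taken from MatchUp, and the scoring function makes the same assumptions about distinct, case-sensitive 2-character NGRAMs.

```python
def overlap_coefficient(a: str, b: str, n: int = 2) -> float:
    """Matching distinct n-grams divided by the smaller n-gram count, as a percentage."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    smaller = min(len(grams_a), len(grams_b))
    return 100.0 * len(grams_a & grams_b) / smaller if smaller else 0.0


def find_matches(incoming: str, master: list[str], threshold: float = 75.0) -> list[str]:
    """Compare one incoming value against every master value independently.

    The threshold is a hypothetical cut-off chosen for this example.
    """
    return [rec for rec in master if overlap_coefficient(incoming, rec) >= threshold]


# Hypothetical master list built from values in the Example Data.
master_records = ["Maytown Public Schools", "Roberts", "Pneumon"]
matches = find_matches("Maytown Hs", master_records)
print(matches)
```

Each comparison depends only on the two strings involved, so a single incoming record costs one pass over the master list.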
Not Recommended For
- Large or enterprise-level batch runs. Because the algorithm must be evaluated for each record comparison, throughput will be very slow.
- Databases created via real-time data entry, where sound-alike (phonetic) errors are introduced.
Do Not Use With
- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, and therefore results may not be accurate if the data contains multi-byte characters.
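To see why the one-byte-per-character assumption matters, the sketch below scores the same pair of strings twice: once over characters and once over their UTF-8 bytes. It is an illustration of the general effect, not a reproduction of MatchUp's internals; the helper `overlap` is hypothetical and assumes distinct 2-unit NGRAMs.

```python
def overlap(a, b, n=2):
    """Overlap percentage over any sliceable sequence (str or bytes)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    smaller = min(len(grams_a), len(grams_b))
    return 100.0 * len(grams_a & grams_b) / smaller if smaller else 0.0


left = "M\u00f6ller"   # "Möller": the ö is 2 bytes in UTF-8 (0xC3 0xB6)
right = "M\u00fcller"  # "Müller": the ü is 2 bytes in UTF-8 (0xC3 0xBC)

by_char = overlap(left, right)                                   # character NGRAMs
by_byte = overlap(left.encode("utf-8"), right.encode("utf-8"))   # byte NGRAMs

print(round(by_char, 1), round(by_byte, 1))
```

Here the character-based score is 60% (3 of 5 NGRAMs match), but the byte-based score rises to about 66.7% (4 of 6) because both accented letters begin with the same lead byte, which creates a spurious matching NGRAM.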