Matchcode Optimization:Overlap Coefficient
Latest revision as of 14:29, 27 September 2018
Overlap Coefficient
Specifics
- https://en.wikipedia.org/wiki/Overlap_coefficient
Summary
- Like Jaro or Dice, counts matching n-grams (discarding duplicate n-grams), but uses a slightly different calculation that is weighted toward the smaller of the two strings being compared.
Returns
- Percentage of similarity
- union / minNumNGrams
- Where union is defined as the number of matching n-grams found (in the standard overlap coefficient definition, this is the size of the intersection of the two n-gram sets)
- Where minNumNGrams is defined as the smaller of the two strings' possible n-gram counts
- NGRAM is defined as the size of the substring to search for within a string (default is 2).
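The calculation above can be sketched in a few lines of Python. This is an illustration only, not MatchUp's implementation: the function names `ngrams` and `overlap_coefficient` are hypothetical, lower-casing the input is an assumption, and minNumNGrams is taken here as the count of distinct n-grams in the smaller set (the page could also be read as counting every n-gram position).

```python
def ngrams(text, n=2):
    """Return the set of distinct n-character substrings of text.

    Duplicates are discarded because a set is used. Lower-casing is an
    assumption of this sketch, not something the page specifies.
    """
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_coefficient(a, b, n=2):
    """Percentage similarity: matching n-grams / n-grams in the smaller set."""
    grams_a = ngrams(a, n)
    grams_b = ngrams(b, n)
    smaller = min(len(grams_a), len(grams_b))
    if smaller == 0:
        # One string is shorter than n characters, so nothing can match.
        return 0.0
    matching = len(grams_a & grams_b)  # what the page calls "union"
    return 100.0 * matching / smaller


print(overlap_coefficient("Rober", "Roberts"))   # 100.0
print(overlap_coefficient("Johnson", "Jhnsn"))   # 50.0
```

Because the divisor is the smaller set, a string that is fully contained in a longer one (Rober in Roberts) scores 100 regardless of how much longer the other string is. How a score maps to Match Found or Unique depends on a threshold that this page does not state.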
Example Matchcode Component
Example Data
| STRING1 | STRING2 | RESULT |
|---|---|---|
| Johnson | Jhnsn | Unique |
| Neumon | Pneumon | Match Found |
| Maytown Hs | Maytown Public Schools | Match Found |
| Rober | Roberts | Match Found |
| Scale | From | To |
|---|---|---|
| Performance | Slower | Faster |
| Matches | More Matches | Greater Accuracy |
Recommended Usage
- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.
- Databases created with abbreviations or similar word substitutions.
Not Recommended For
- Large or enterprise-level batch runs. Because the algorithm must be evaluated for every record comparison, throughput will be very slow.
- Databases created via real-time data entry, where sound-alike (phonetic) errors are introduced.
Do Not Use With
- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, so results may be inaccurate if the data contains multi-byte characters.
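The effect of the one-byte-per-character assumption can be illustrated with a small Python comparison. This is a hedged sketch, not a reproduction of MatchUp's internals: the helpers `bigrams` and `overlap` are hypothetical, and the sample names are made up for the illustration. It only shows that bigrams built over UTF-8 bytes can differ from bigrams built over characters.

```python
def bigrams(seq):
    """Distinct length-2 slices of a str (characters) or bytes (bytes)."""
    return {seq[i:i + 2] for i in range(len(seq) - 1)}


def overlap(a, b):
    """Matching bigrams divided by the smaller bigram count, as a percentage."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    smaller = min(len(grams_a), len(grams_b))
    return 100.0 * len(grams_a & grams_b) / smaller if smaller else 0.0


a = "Zo\u00eb"  # "Zoë": 3 characters, 4 bytes in UTF-8
b = "Zo\u00e9"  # "Zoé": 3 characters, 4 bytes in UTF-8

char_score = overlap(a, b)                                   # character bigrams
byte_score = overlap(a.encode("utf-8"), b.encode("utf-8"))   # byte bigrams

print(len(a), len(a.encode("utf-8")))
print(char_score, byte_score)
```

Both accented letters start with the same UTF-8 lead byte, so the byte-level comparison finds an extra "matching" bigram that does not exist at the character level, and the two scores disagree.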