# Difference between revisions of "Matchcode Optimization:Jaccard Similarity Coefficient"

## Revision as of 14:21, 27 September 2018

## Jaccard Similarity

### Specifics

- Jaccard Index: http://en.wikipedia.org/wiki/Jaccard_index

### Summary

- Jaccard Similarity Index is defined as the size of the intersection divided by the size of the union of the sample sets.
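
Written as a formula, the definition above reads as follows. The symbols $A$ and $B$ are notation introduced here for the two sample sets; they do not appear in the MatchUp documentation.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
$$

A value of 1 means the two sets are identical, and a value of 0 means they share no elements.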

### Returns

- Percentage of similarity

- Intersection/Union

- NGRAM is the length of the common substrings the algorithm looks for. The MatchUp default is NGRAM = 2. For “ABCD” vs “GABCE”, the matching NGRAMs are “AB” and “BC”.

- Intersection is the number of NGRAMs common to both strings; union is the total number of NGRAMs in the universe of the two strings, that is, every NGRAM that occurs in either string.
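
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not MatchUp's implementation: the function names `ngrams` and `jaccard_similarity` are hypothetical, and the sketch assumes the standard set-based Jaccard index, in which each distinct NGRAM is counted once. The source does not say how MatchUp treats repeated NGRAMs or strings shorter than the NGRAM length.

```python
def ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of distinct n-character substrings of text."""
    # range() is empty when text is shorter than n, so the result is an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection divided by the size of the n-gram union."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Assumption made for this sketch: no n-grams at all scores 0.0.
        return 0.0
    return len(grams_a & grams_b) / len(union)


print(sorted(ngrams("ABCD") & ngrams("GABCE")))  # ['AB', 'BC']
print(jaccard_similarity("ABCD", "GABCE"))       # 0.4
```

For “ABCD” vs “GABCE” the NGRAM sets are {AB, BC, CD} and {GA, AB, BC, CE}. The intersection has 2 members and the union has 5, so the similarity is 2/5, or 40%.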

### Example Matchcode Component

### Example Data

STRING1 | STRING2 | RESULT
---|---|---
Johnson | Jhnsn | Unique
Mild Hatter | Mild Hatter Wks | Match Found
Beaumarchais | Bumarchay | Unique
Apco Oil Lube 170 | Apco Oil Lube 342 | Match Found

Scale | From | To
---|---|---
Performance | Slower | Faster
Matches | More Matches | Greater Accuracy

### Recommended Usage

- Hybrid deduper, where a single incoming record can be evaluated quickly and independently against each record in a large existing master database. Also suited to databases created with abbreviations or similar word substitutions.

### Not Recommended For

- Large or enterprise-level batch runs. Because the algorithm must be evaluated for every record comparison, throughput will be very slow.

- Databases created via real-time data entry where sound-alike (phonetic) errors are introduced.

### Do Not Use With

- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, and therefore results may not be accurate if the data contains multi-byte characters.
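
The effect of the one-byte-per-character assumption can be illustrated with a short Python sketch. It is a hypothetical demonstration, not MatchUp code: the helper `jaccard` and the sample strings are assumptions chosen for illustration. It computes the same bigram similarity once over characters and once over raw UTF-8 bytes.

```python
def jaccard(units_a, units_b, n: int = 2) -> float:
    """Set-based Jaccard similarity over n-grams of any sequence (str or bytes)."""
    grams_a = {tuple(units_a[i:i + n]) for i in range(len(units_a) - n + 1)}
    grams_b = {tuple(units_b[i:i + n]) for i in range(len(units_b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


a = "M\u00fcller"  # "Müller": the u-umlaut takes two bytes in UTF-8
b = "Muller"

print(len(a), len(a.encode("utf-8")))  # 6 7

by_char = jaccard(a, b)                                   # n-grams of characters
by_byte = jaccard(a.encode("utf-8"), b.encode("utf-8"))   # n-grams of bytes
print(by_char, by_byte)
```

Over characters the two strings share 3 of 7 bigrams (3/7); over bytes they share 3 of 8 (3/8), because the two-byte character contributes an extra bigram. The score therefore shifts whenever multi-byte characters are present.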