Matchcode Optimization:Fuzzy Algorithms: Difference between revisions

Latest revision as of 21:18, 26 September 2018

Matchcode Optimization Navigation

Matchcode Optimization
First Component
Fuzzy Algorithms
Swap Matching
Blank Matching
Advanced Component Types

Algorithms
Accunear
Alphas
Consonants
Containment
Dice's Coefficient
Double Metaphone
Exact
Fast Near
Frequency
Frequency Near
Jaccard Similarity Coefficient
Jaro
Jaro-Winkler
Longest Common Substring (LCS)
MD Keyboard
Needleman-Wunsch
N-Gram
Numeric
Overlap Coefficient
Phonetex
Smith-Waterman-Gotoh
Soundex
UTF8 Near
Vowels

Fuzzy Algorithms

MatchUp has an extensive list of fuzzy algorithm choices. Depending on the nature of the data being processed, selecting a specific algorithm may result in more flagged duplicates, but possibly with the tradeoff of a slower throughput. This is called balancing performance vs accuracy. The fuzzy algorithms, with a general performance rank from fastest (5) to slowest (1):

ALGORITHM	RANK	LATE or EARLY
EXACT	5	Early
VOWELS	5	Early
NUMERIC	5	Early
CONSONANTS	5	Early
ALPHAS	5	Early
SOUNDEX	4	Early
PHONETEX	4	Early
FREQUENCY	4	Late
FAST NEAR	3	Late
FREQNEAR	3	Late
CONTAINMENT	3	Late
NGRAM	2	Late
ACCUNEAR	2	Late
LCS	2	Late
OVERLAP COEFFICIENT	1	Late
JACCARD	1	Late
SMITH-WATERMAN-GOTOH	1	Late
MD KEYBOARD	1	Late
UTF8 NEAR	1	Late
JARO	1	Late
JARO-WINKLER	1	Late
DICES COEFFICIENT	1	Late
DOUBLE METAPHONE	1	Late
NEEDLEMAN-WUNSCH	1	Late

These algorithms fall into two categories: early matching and late matching.

Early Matching

Early matching algorithms are algorithms where a string is transformed into a (usually shorter) representation and comparisons are performed on this result. In MatchUp, these transformations are performed during key generation, which means that the early matching algorithms pay a speed penalty once per record: as each record’s key is built.

Late Matching

Late matching algorithms are actual comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These transformations are performed during key comparison. As a result, late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.

Matching Speed

Therefore, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy matching algorithm may improve the speed, and often will give nearly the same results. Test thoroughly before processing live data.

Accuracy

Using an Exact fuzzy setting will return a logical Boolean answer based on the matchkey – the two keys are either ‘Exactly’ the same and therefore match, or are not exactly the same, and therefore do not match. Fuzzy algorithms make allowances for un-exact data.

Since each algorithm calculates the variation allowance differently, some algorithms perform more accurately over others for differently constructed data

In choosing an algorithm with respect to accuracy, consider the following types of data:

Value Type cases for Fuzzy Algorithm usage

String : % similarity between two strings.
Knowledgebase: The presence of keys words (HS v High School) must be evaluated.
Dictionary (or decode) Arrays: "01 03 46 82" vs "06 46 03 01".

Value types where fuzzy algorithms are not recommended

Quantifiable: numbers, dates, phone values, account numbers, etc.

Use cases where fuzzy algorithms are not recommended

Record consolidation: Gather, Survivorship, record roll-up.

Pro and con recommendations are made in each algorithms page.

In many cases the algorithm output has been normalized so the return value can be compared against the user configured distance threshold percentage.

@@ Line 1: / Line 1: @@
-Placeholder Page
+{{MatchcodeOptimizationNav
+|MatchcodeOptimizationCollapse=
+}}
-*[[Matchcode Hub]]
+==Fuzzy Algorithms==
-*[[Matchcode Optimization]]
+MatchUp has an extensive list of fuzzy algorithm choices. Depending on the nature of the data being processed, selecting a specific algorithm may result in more flagged duplicates, but possibly with the tradeoff of a slower throughput. This is called balancing performance vs accuracy. The fuzzy algorithms, with a general performance rank from fastest (5) to slowest (1):
-*[[Matchcode Optimization:First Component]]
-*[[Matchcode Optimization:Fuzzy Algorithms]]
-*[[Matchcode Optimization:Swap Matching]]
+:{| class="alternate01"
-*[[Matchcode Optimization:Blank Matching]]
+!ALGORITHM!!RANK!!LATE or EARLY
-*[[Matchcode Optimization:Advanced Component Types]]
+|-
+|style="background-color:#c6e0b4;"|[[Matchcode Optimization:Exact|EXACT]]||5||Early
+|-
+|style="background-color:#c6e0b4;"|[[Matchcode Optimization:Vowels|VOWELS]]||5||Early
+|-
+|style="background-color:#c6e0b4;"|[[Matchcode Optimization:Numeric|NUMERIC]]||5||Early
+|-
+|style="background-color:#c6e0b4;"|[[Matchcode Optimization:Consonants|CONSONANTS]]||5||Early
+|-
+|style="background-color:#c6e0b4;"|[[Matchcode Optimization:Alphas|ALPHAS]]||5||Early
+|-
+|style="background-color:#e2efda;"|[[Matchcode Optimization:Soundex|SOUNDEX]]||4||Early
+|-
+|style="background-color:#e2efda;"|[[Matchcode Optimization:Phonetex|PHONETEX]]||4||Early
+|-
+|style="background-color:#e2efda;"|[[Matchcode Optimization:Frequency|FREQUENCY]]||4||Late
+|-
+|style="background-color:#ffffcc;"|[[Matchcode Optimization:Fast Near|FAST NEAR]]||3||Late
+|-
+|style="background-color:#ffffcc;"|[[Matchcode Optimization:Frequency Near|FREQNEAR]]||3||Late
+|-
+|style="background-color:#ffffcc;"|[[Matchcode Optimization:Containment|CONTAINMENT]]||3||Late
+|-
+|style="background-color:#ffcc99;"|[[Matchcode Optimization:N-Gram|NGRAM]]||2||Late
+|-
+|style="background-color:#ffcc99;"|[[Matchcode Optimization:Accunear|ACCUNEAR]]||2||Late
+|-
+|style="background-color:#ffcc99;"|[[Matchcode Optimization:Longest Common Substring (LCS)|LCS]]||2||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Overlap Coefficient|OVERLAP COEFFICIENT]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Jaccard Similarity Coefficient|JACCARD]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Smith-Waterman-Gotoh|SMITH-WATERMAN-GOTOH]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:MD Keyboard|MD KEYBOARD]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:UTF8 Near|UTF8 NEAR]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Jaro|JARO]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Jaro-Winkler|JARO-WINKLER]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Dice's Coefficient|DICES COEFFICIENT]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Double Metaphone|DOUBLE METAPHONE]]||1||Late
+|-
+|style="background-color:#ff9999;"|[[Matchcode Optimization:Needleman-Wunsch|NEEDLEMAN-WUNSCH]]||1||Late
+|}
+These algorithms fall into two categories: '''early matching''' and '''late matching'''.
+===Early Matching===
+:Early matching algorithms are algorithms where a string is transformed into a (usually shorter) representation and comparisons are performed on this result. In MatchUp, these transformations are performed during key generation, which means that the early matching algorithms pay a speed penalty once per record: as each record’s key is built.
+===Late Matching===
+:Late matching algorithms are actual comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These transformations are performed during key comparison. As a result, late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.
+===Matching Speed===
+:Therefore, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy matching algorithm may improve the speed, and often will give nearly the same results. Test thoroughly before processing live data.
+==Accuracy==
+Using an Exact fuzzy setting will return a logical Boolean answer based on the matchkey – the two keys are either ‘Exactly’ the same and therefore match, or are not exactly the same, and therefore do not match. Fuzzy algorithms make allowances for un-exact data.
+Since each algorithm calculates the variation allowance differently, some algorithms perform more accurately over others for differently constructed data
+In choosing an algorithm with respect to accuracy, consider the following types of data:
+:;Value Type cases for Fuzzy Algorithm usage:
+:*String :  % similarity between two strings.
+:*Knowledgebase: The presence of keys words (HS v High School) must be evaluated.
+:*Dictionary (or decode) Arrays: "01 03 46 82" vs "06 46 03 01".
+:;Value types where fuzzy algorithms are not recommended:
+:*Quantifiable: numbers, dates, phone values, account numbers, etc.
+:;Use cases where fuzzy algorithms are not recommended:
+:*Record consolidation: Gather, Survivorship, record roll-up.
+Pro and con recommendations are made in each algorithms page.
+In many cases the algorithm output has been normalized so the return value can be compared against the user configured distance threshold percentage.
+[[Category:MatchUp Hub]]
+[[Category:Matchcode Optimization]]

Matchcode Optimization:Fuzzy Algorithms: Difference between revisions

Latest revision as of 21:18, 26 September 2018

Contents

Fuzzy Algorithms

Early Matching

Late Matching

Matching Speed

Accuracy

Navigation menu

Matchcode Optimization:Fuzzy Algorithms: Difference between revisions

Latest revision as of 21:18, 26 September 2018

Fuzzy Algorithms

Early Matching

Late Matching

Matching Speed

Accuracy

Navigation menu

Search