Matchcode Optimization:Advanced Component Types

From Melissa Data Wiki
Revision as of 23:37, 21 September 2018 by Admin (talk | contribs)

Advanced Matchcode Data Type

Specifics

Matchcode Components

Summary

Most matchcode component data types specify the format of the source data; any advanced operations to be performed on a component are specified in its properties. Three data types are exceptions. Each also requires a range, expressed in the unit shown, within which two values still constitute a match:

  • Date (days)
  • Numeric (units)
  • Proximity (miles)

Returns

A match if the difference (in days, units, or miles) between the values in the two records being compared is within the configured range.
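The rule above can be sketched in a few lines. This is only an illustration, not MatchUp's implementation: the function name `within_range` and the choice to treat the boundary as inclusive are assumptions.

```python
def within_range(a: float, b: float, max_units: float) -> bool:
    """Return True when a and b differ by no more than max_units.

    Assumption: the boundary is inclusive (a difference exactly equal
    to max_units still counts as a match).
    """
    return abs(a - b) <= max_units


# Hypothetical values with a hypothetical range of 5 units.
print(within_range(100, 103, 5))  # difference of 3 is inside the range
print(within_range(100, 110, 5))  # difference of 10 is outside the range
```

The same comparison applies to all three data types; only the way the difference is measured (days, plain units, or miles) changes.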

Example Matchcode Usage 1

Example Data 1

NAME   DATE       RESULT
John   19980422   Match Found
John   19980426   Match Found
John   20181107   Unique
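As a rough sketch of how the Date (days) comparison could produce the results in Example Data 1: the first two dates are 4 days apart, so any range of 4 days or more would match them. The range of 5 days used here is an assumption, since the matchcode's configured range is not shown, and the helper names are hypothetical.

```python
from datetime import datetime


def days_apart(a: str, b: str) -> int:
    """Absolute difference in days between two YYYYMMDD strings."""
    da = datetime.strptime(a, "%Y%m%d").date()
    db = datetime.strptime(b, "%Y%m%d").date()
    return abs((da - db).days)


def flag_records(dates, max_days):
    """Flag a date 'Match Found' if any other date is within max_days of it."""
    results = []
    for i, current in enumerate(dates):
        matched = any(
            days_apart(current, other) <= max_days
            for j, other in enumerate(dates)
            if j != i
        )
        results.append("Match Found" if matched else "Unique")
    return results


MAX_DAYS = 5  # assumed range; the actual matchcode setting is not given
dates = ["19980422", "19980426", "20181107"]
print(flag_records(dates, MAX_DAYS))
# ['Match Found', 'Match Found', 'Unique']
```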


Example Matchcode Usage 2

Example Data 2

COMPANY       EMPLOYEES   RESULT
Wilson Elec   640         Match Found
Wilsons       15          Match Found
Wilson Corp   623         Match Found


Example Matchcode Usage 3

Example Data 3

LATITUDE    LONGITUDE     RESULT
33.63757    -117.6073     Match Found
33.637466   -117.609415   Match Found
33.650388   -117.837956   Unique
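To illustrate the Proximity (miles) comparison on Example Data 3, the sketch below computes great-circle distance with the haversine formula. The choice of haversine, the Earth radius constant, and the 1-mile range are all assumptions; the source does not state how MatchUp measures distance or what range this matchcode uses.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


MAX_MILES = 1.0  # assumed range; the actual matchcode setting is not given

first = (33.63757, -117.6073)
second = (33.637466, -117.609415)
third = (33.650388, -117.837956)

d12 = haversine_miles(*first, *second)  # roughly a tenth of a mile
d13 = haversine_miles(*first, *third)   # roughly thirteen miles

print(d12 <= MAX_MILES)  # True: the first two records fall within range
print(d13 <= MAX_MILES)  # False: the third record is too far away
```

Under these assumptions the first two records match each other and the third stays unique, which agrees with the RESULT column.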


Performance: scale from Slower to Faster
Matches: scale from More Matches to Greater Accuracy


Recommended Usage

Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.

Small batch runs, or larger batch runs in which components listed higher in the matchcode have already clustered the records efficiently, reducing the number of record comparisons that need the unit-difference calculation.
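The effect of clustering can be sketched as follows. Records are first grouped on an exact key, and the unit-difference check is then needed only for pairs inside the same group. The record layout, the `zip` key, and the function name `candidate_pairs` are hypothetical and are not taken from MatchUp.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Yield only the pairs of records that share the same grouping key."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    for group in groups.values():
        yield from combinations(group, 2)


# Hypothetical records: an earlier component (zip) clusters them first.
records = [
    {"zip": "92688", "employees": 640},
    {"zip": "92688", "employees": 623},
    {"zip": "92688", "employees": 15},
    {"zip": "10001", "employees": 630},
    {"zip": "10001", "employees": 12},
    {"zip": "60601", "employees": 640},
]

all_pairs = list(combinations(records, 2))
blocked_pairs = list(candidate_pairs(records, key=lambda r: r["zip"]))

print(len(all_pairs))      # 15 comparisons without clustering
print(len(blocked_pairs))  # 4 comparisons after clustering by zip
```

The saving grows with the size of the list, which is why the range-based data types are better placed after components that group records well.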

Not Recommended For

Large or enterprise-level batch runs. Because the proximity must be evaluated for each record comparison, throughput will be very slow. Each swapping attempt takes a speed hit similar to that of a fuzzy algorithm.