MatchUp Object:Component Properties

From Melissa Data Wiki
Revision as of 20:31, 14 August 2015 by Admin




The matchcode components tell MatchUp Object which data types to use when creating the match key, while the component properties tell MatchUp Object how much of that data to use and which parts.

Often, especially for potentially long fields like personal names and city or street names, MatchUp Object doesn’t need the full contents of a field to determine whether it duplicates another. Ten characters or so are often enough.


Data Type

See Matchcode Components.


Label

This is a line of text that describes the component. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don’t fit any of the other component types.


Size

This is the maximum number of characters from the field that MatchUp Object will use to build the match key. Sizing is done after all other properties are applied.


Start

This property determines where MatchUp Object begins counting when applying the Size property.

Value: Description
Left: Starts from the first character of the field. This is the most commonly used option.
Right: Starts from the last character of the field. For example, if the data included a phone number of “949-589-5200” and the size was 7, MatchUp Object would use “5895200” for the match key.
Position: Starts from a specific position within the field.
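The interaction between Size and Start can be illustrated with a minimal Python sketch. This is not MatchUp Object code: the function name `apply_size_and_start`, its parameters, and the 1-based reading of Position are assumptions made for illustration. The phone example also assumes punctuation was already stripped by earlier processing, since sizing is applied last.

```python
def apply_size_and_start(value, size, start="left", position=1):
    """Return the part of `value` that would feed the match key.

    Hypothetical helper. `start` is "left", "right", or "position";
    `position` is treated as 1-based, which is an assumption.
    """
    if start == "left":
        return value[:size]
    if start == "right":
        return value[-size:] if size > 0 else ""
    if start == "position":
        begin = max(position - 1, 0)
        return value[begin:begin + size]
    raise ValueError(f"unknown start option: {start!r}")


# Assumes the phone number was reduced to digits before sizing.
digits = "".join(ch for ch in "949-589-5200" if ch.isdigit())
print(apply_size_and_start(digits, 7, start="right"))         # 5895200
print(apply_size_and_start("Smithfield", 5))                  # Smith
print(apply_size_and_start("Smithfield", 5, "position", 6))   # field
```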


Fuzzy

Fuzzy settings allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.

Value: Description
Phonetex (pronounced “Fo-NEH-tex”): An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below.
Soundex: An auditory matching algorithm patented by Robert C. Russell and Margaret King Odell in 1918 and later used to index U.S. census records. Although the Phonetex algorithm is measurably superior, the Soundex algorithm is presented for users who need to create a matchcode that emulates one from another application.
Containment: Matches when one record’s component is contained in another record’s component. For example, “Smith” is contained in “Smithfield.”
Frequency: Matches the characters in one record’s component to the characters in another without regard to their sequence. For example, “abcdef” would match “badcfe.”
Fast Near: A typographical matching algorithm. It works best in matching words that don’t match because of a few typographical errors. The number of errors tolerated is set on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.
Accurate Near: An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
Frequency Near: Similar to Frequency matching, except that you specify how many characters may differ between components.
UTF-8 Near: For Global Matchcodes only. Similar to Accurate Near (Levenshtein). It counts the number of typos: a character substitution counts as one typo and a pair of transposed characters counts as two. Unlike the other algorithms, it accounts for the different storage sizes of characters under different encodings. It should not be used with domestic matchcodes, which substitute lower-order characters for accented characters during key building.
Vowels Only: Only vowels will be compared. Consonants will be removed.
Consonants Only: Only consonants will be compared. Vowels will be removed.
Alphas Only: Only alphabetic characters will be compared.
Numerics Only: Only numeric characters will be compared. Decimals and signs are considered numeric.
MD Keyboard: An algorithm developed by Melissa Data that counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
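Several of the simpler options above can be sketched in a few lines of Python. These are illustrative approximations, not MatchUp Object's implementations: every function name is hypothetical, `frequency_near` uses one plausible way of counting differing characters, and `keep_only` assumes that "consonants" means alphabetic non-vowels. The `levenshtein` function is the standard edit distance that Accurate Near is described as implementing.

```python
from collections import Counter


def containment(a, b):
    """True when either value is contained in the other."""
    return a in b or b in a


def frequency(a, b):
    """True when both values hold the same characters, in any order."""
    return Counter(a) == Counter(b)


def frequency_near(a, b, max_diff):
    """Like frequency(), tolerating up to `max_diff` differing characters.

    Counting rule (an assumption): the larger of the two leftover counts.
    """
    ca, cb = Counter(a), Counter(b)
    only_a = sum((ca - cb).values())
    only_b = sum((cb - ca).values())
    return max(only_a, only_b) <= max_diff


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def keep_only(value, kind):
    """Filter characters before comparison (hypothetical helper)."""
    vowels = "aeiou"
    if kind == "vowels":
        return "".join(c for c in value if c.lower() in vowels)
    if kind == "consonants":
        return "".join(c for c in value
                       if c.isalpha() and c.lower() not in vowels)
    if kind == "alphas":
        return "".join(c for c in value if c.isalpha())
    if kind == "numerics":
        # Decimals and signs are considered numeric.
        return "".join(c for c in value if c.isdigit() or c in ".+-")
    raise ValueError(f"unknown filter: {kind!r}")


print(containment("Smith", "Smithfield"))   # True
print(frequency("abcdef", "badcfe"))        # True
print(levenshtein("kitten", "sitting"))     # 3
```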


Fuzzy Advanced

Please research the definitions of the following advanced algorithms before implementing them in a matchcode.

Value: Description
Jaro: Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Jaro-Winkler: Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).
n-Gram: Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
Needleman-Wunsch: Similar to Accurate Near, except that inserts and deletes aren’t weighted as heavily and, as compensation for keyboarding mis-hits, not all character substitutions are weighted equally.
Smith-Waterman-Gotoh: Builds on Needleman-Wunsch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words.
Dice’s Coefficient: Like Jaro, Dice counts matching n-Grams (discarding duplicate n-Grams).
Jaccard Similarity Coefficient: Very similar to Dice’s Coefficient, with a slightly different calculation.
Overlap Coefficient: Again, very similar to Dice’s Coefficient, with a slightly different calculation. A string similarity algorithm based on a substring calculation.
Longest Common Substring: Finds the longest common substring between the two strings.
Double MetaPhone: Performs two different Phonetex-style transformations. Returns a value dependent on how many of the transformations match (i.e., 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2).
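The n-gram family above differs mainly in how the count of shared grams is normalized. The Python sketch below uses the textbook definitions of Dice, Jaccard, and Overlap over sets of distinct bigrams, plus a standard dynamic-programming longest common substring. It is an illustration only: the function names are hypothetical, and the scores MatchUp Object reports may be computed differently.

```python
def ngrams(text, n=2):
    """Set of distinct n-character substrings (duplicates discarded)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a, b):
    """Twice the shared grams over the total gram count."""
    x, y = ngrams(a), ngrams(b)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(a, b):
    """Shared grams over the union of grams."""
    x, y = ngrams(a), ngrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap(a, b):
    """Shared grams over the size of the smaller gram set."""
    x, y = ngrams(a), ngrams(b)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


def longest_common_substring(a, b):
    """Length of the longest run of characters shared by both strings."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            run = previous[j - 1] + 1 if ch_a == ch_b else 0
            current.append(run)
            best = max(best, run)
        previous = current
    return best


# "night" and "nacht" share one bigram ("ht") out of four each.
print(dice("night", "nacht"))                              # 0.25
print(overlap("night", "nacht"))                           # 0.25
print(longest_common_substring("Smithfield", "Smithson"))  # 5
```

For the same pair, Jaccard gives 1/7, because the union holds seven distinct bigrams. This shows how the three coefficients rank the same overlap differently.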


Distance

This field is context sensitive, depending on the Data Type and Fuzzy algorithm.

Value: Description
Data Type
  Proximity: Distance in miles. Range: 0-4000
  Numeric: Integer number.
  Date: Number of days.
Fuzzy
  Fast Near: Number of typographical errors. Range: Tight (1) - Loose (4)
  Accurate Near: Number of typographical errors. Range: Tight (1) - Loose (4)

The following use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.

  • n-Gram
  • Jaro
  • Jaro-Winkler
  • Longest Common Substring (LCS)
  • Needleman-Wunsch
  • MD Keyboard
  • Smith-Waterman-Gotoh
  • Dice’s Coefficient
  • Jaccard Similarity Coefficient
  • Overlap Coefficient
  • Double MetaPhone
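How the Distance value is read in each context can be sketched as follows. This is a rough Python illustration with hypothetical function names. It assumes that the comparisons are inclusive and that similarity scores arrive as fractions between 0.0 and 1.0; neither detail is stated in this document.

```python
from datetime import date


def within_numeric_distance(a, b, distance):
    """Numeric data type: values match if they differ by at most `distance`."""
    return abs(a - b) <= distance


def within_date_distance(a, b, days):
    """Date data type: values match if they are at most `days` apart."""
    return abs((a - b).days) <= days


def meets_similarity(score, minimum_percent):
    """Percentage-based algorithms: `score` is a 0.0-1.0 similarity."""
    return score * 100 >= minimum_percent


print(within_numeric_distance(1200, 1203, 5))                      # True
print(within_date_distance(date(2015, 8, 14), date(2015, 8, 10), 3))  # False
print(meets_similarity(0.85, 80))                                  # True
```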


Short/Empty Settings

These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.

Value: Description
Initial Only: Will match a full word to an initial (for example, “J” and “John”).
One Blank Field: Will match a full word to no data (for example, “John” and “”).
Both Blank Fields: Match this component if both records contain no data. This is a very important concept in creating matchcodes. For more information, see Blank Field Matching.
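Because these three settings can be combined freely, it may help to see them as independent flags on a single comparison. The Python sketch below is an assumption-laden illustration, not MatchUp Object's logic: `component_matches` is a hypothetical name, the comparison is exact apart from case, and whitespace-only values are treated as blank.

```python
def component_matches(a, b, initial_only=False, one_blank=False,
                      both_blank=False):
    """Compare two component values under the Short/Empty flags."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return both_blank          # both records empty
    if not a or not b:
        return one_blank           # exactly one record empty
    if a == b:
        return True
    if initial_only:
        short, full = sorted((a, b), key=len)
        return len(short) == 1 and full.startswith(short)
    return False


print(component_matches("J", "John", initial_only=True))   # True
print(component_matches("John", "", one_blank=True))       # True
print(component_matches("", ""))                           # False
```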


Swap

Swap matching is the ability to compare one component to another component. For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components. MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components be identical.
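The idea of a swap pair can be sketched in Python as trying both the straight and the crossed comparison. The record layout, the field names `first` and `last`, and the function `swap_match` are hypothetical, and exact string equality stands in for whatever component comparison a real matchcode would apply.

```python
def swap_match(rec_a, rec_b, pair=("first", "last")):
    """True if the pair matches straight across or with fields swapped."""
    x, y = pair
    straight = rec_a[x] == rec_b[x] and rec_a[y] == rec_b[y]
    swapped = rec_a[x] == rec_b[y] and rec_a[y] == rec_b[x]
    return straight or swapped


a = {"first": "John", "last": "Smith"}
b = {"first": "Smith", "last": "John"}
print(swap_match(a, b))   # True
```

The sketch also suggests why identical properties are recommended for both members of a pair: if the two components were sized or fuzzed differently, the crossed comparison would be comparing values that were prepared in different ways.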

For more information see Swap Matching Uses.