MatchUp Object:Component Properties

From Melissa Data Wiki
Latest revision as of 22:07, 10 February 2017




The matchcode components tell MatchUp Object which data types to use for creating the match key, while the component properties tell MatchUp Object how much of the data to use and which parts.

For potentially long fields such as personal names and city or street names, MatchUp Object often doesn’t need the full contents of the field to determine whether it duplicates another. Ten characters or so are usually enough.


Data Type

See Matchcode Components.


Label

This is a line of text that describes the component. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don’t fit any of the other component types.


Size

This is the maximum number of characters from the field that MatchUp Object will use to build the match key. Sizing is done after all other properties are applied.


Start

This property determines where MatchUp Object begins counting when applying the Size property.

Value | Description
Left | Starts from the first character of the field. This is the most commonly used option.
Right | Starts from the last character of the field. For example, if the data included a phone number of “949-589-5200” and the size was 7, MatchUp Object would use “5895200” for the match key.
Position | Starts from a specific position within the field.
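
The interaction between Size and Start can be illustrated with a short Python sketch. This is not MatchUp Object's API: the function name `key_portion`, the lowercase option strings, and the 1-based `position` argument are all assumptions made for illustration. The sketch also assumes punctuation has already been stripped, since sizing is applied after all other properties.

```python
import re


def key_portion(value: str, size: int, start: str = "left", position: int = 1) -> str:
    """Return at most `size` characters of `value`, counted from `start`.

    Hypothetical helper: `start` is "left", "right", or "position";
    `position` is assumed to be 1-based and is only used with "position".
    """
    if start == "left":
        return value[:size]
    if start == "right":
        # value[-0:] would return the whole string, so guard size == 0.
        return value[-size:] if size > 0 else ""
    if start == "position":
        begin = position - 1
        return value[begin:begin + size]
    raise ValueError(f"unknown start option: {start!r}")


# Assumed pre-processing: keep digits only before sizing is applied.
digits = re.sub(r"\D", "", "949-589-5200")
print(key_portion(digits, 7, "right"))
print(key_portion(digits, 7, "position", 4))
```

Both calls select the seven-digit local number, which shows how Right or Position can ignore an area code that is present in some records and missing in others.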


Fuzzy

Fuzzy settings allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.

Value | Description
Phonetex | (pronounced “Fo-NEH-tex”) An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below.
Soundex | An auditory matching algorithm originally developed by the Department of Immigration in 1917 and later adopted by the USPS. Although the Phonetex algorithm is measurably superior, the Soundex algorithm is presented for users who need to create a matchcode that emulates one from another application.
Containment | Matches when one record's component is contained in another record. For example, “Smith” is contained in “Smithfield.”
Frequency | Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.”
Fast Near | A typographical matching algorithm. It works best in matching words that don’t match because of a few typographical errors. Exactly how many errors is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.
Accurate Near | An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
Frequency Near | Similar to Frequency matching except that you specify how many characters may be different between components.
UTF-8 Near | Similar to Levenshtein (Accurate Near). It counts the number of typos: a character substitution counts as one typo, and transposed characters count as two. Unlike the other algorithms, it accounts for differences in character storage size caused by different encodings.
Vowels Only | Only vowels will be compared. Consonants will be removed.
Consonants Only | Only consonants will be compared. Vowels will be removed.
Alphas Only | Only alphabetic characters will be compared.
Numerics Only | Only numeric characters will be compared. Decimals and signs are considered numeric.
MD Keyboard | An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
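
To make a few of these options concrete, the following Python sketch shows textbook versions of the ideas behind Containment, Frequency, and Accurate Near (Levenshtein). The function names are hypothetical and the case-insensitive comparison is an assumption; MatchUp Object's own implementations are not documented here and may differ in normalization and scoring.

```python
from collections import Counter


def containment_match(a: str, b: str) -> bool:
    """True when one non-empty value is contained in the other."""
    a, b = a.lower(), b.lower()
    return bool(a) and bool(b) and (a in b or b in a)


def frequency_match(a: str, b: str) -> bool:
    """True when both values hold the same characters, in any order."""
    return Counter(a.lower()) == Counter(b.lower())


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: inserts, deletes, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


print(containment_match("Smith", "Smithfield"))
print(frequency_match("abcdef", "badcfe"))
print(levenshtein("Jonh", "John"))
```

In plain Levenshtein a transposition such as “Jonh” for “John” costs two edits, which is consistent with the description of transposed characters counting as two typos.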


Fuzzy Advanced

Please research the definitions of the following advanced algorithms before implementing them in a matchcode.

Value | Description
Jaro | Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Jaro-Winkler | Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).
n-Gram | Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
Needleman-Wunsch | Similar to Accurate Near, except that inserts and deletes aren’t weighted as heavily and, to compensate for keyboarding mis-hits, not all character substitutions are weighted equally.
Smith-Waterman-Gotoh | Builds on Needleman-Wunsch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words.
Dice’s Coefficient | Like Jaro, Dice counts matching n-Grams (discarding duplicate n-Grams).
Jaccard Similarity Coefficient | Very similar to Dice’s Coefficient with a slightly different calculation.
Overlap Coefficient | Again, very similar to Dice’s Coefficient with a slightly different calculation. String similarity algorithm based on a substring calculation.
Longest Common Substring | Finds the longest common substring between the two strings.
Double MetaPhone | Performs 2 different Phonetex-style transformations. Returns a value dependent on how many of the transformations match (i.e., 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2).
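
The relationship between the n-Gram based coefficients is easier to see side by side. The sketch below uses the standard set-based definitions of Dice, Jaccard, and Overlap over 2-grams, with duplicate grams discarded. The function names and the lowercasing step are assumptions for illustration; MatchUp's exact calculations are not documented here.

```python
def grams(text: str, n: int = 2) -> set:
    """Set of n-character substrings; duplicates collapse automatically."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a: str, b: str) -> float:
    x, y = grams(a), grams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str) -> float:
    x, y = grams(a), grams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def overlap(a: str, b: str) -> float:
    x, y = grams(a), grams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


for name, score in (("dice", dice), ("jaccard", jaccard), ("overlap", overlap)):
    print(name, round(score("Smith", "Smithfield"), 3))
```

All three share the same numerator (the count of common grams) and differ only in the denominator, which is why the Overlap Coefficient scores a fully contained string such as “Smith” in “Smithfield” as a complete match while Dice and Jaccard do not.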


Distance

This field is context sensitive, depending on the Data Type and Fuzzy algorithm.

Category | Value | Description
Data Type | Proximity | Distance in miles. Range: 0-4000
Data Type | Numeric | Integer number.
Data Type | Date | Number of days.
Fuzzy | Fast Near | Number of typographical errors. Range: Tight (1) to Loose (4)
Fuzzy | Accurate Near | Number of typographical errors. Range: Tight (1) to Loose (4)

The following use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.

  • n-Gram
  • Jaro
  • Jaro-Winkler
  • Longest Common Substring (LCS)
  • Needleman-Wunsch
  • MD Keyboard
  • Smith-Waterman-Gotoh
  • Dice’s Coefficient
  • Jaccard Similarity Coefficient
  • Overlap Coefficient
  • Double MetaPhone
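
As a rough illustration of a minimum-percentage threshold, the sketch below scores two strings by their longest common substring and compares the score with a cutoff. Dividing by the length of the longer string is an assumption chosen for this example; how MatchUp normalizes each algorithm to a percentage is not stated here, and the function names are hypothetical.

```python
def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            run = prev[j - 1] + 1 if ca == cb else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: shared run length over the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * longest_common_substring_len(a.lower(), b.lower()) / longest


def is_match(a: str, b: str, min_percent: float) -> bool:
    """True when the similarity reaches the configured minimum."""
    return similarity_percent(a, b) >= min_percent


print(similarity_percent("Smithfield", "Smith"))
print(is_match("Smithfield", "Smith", 50))
print(is_match("Smithfield", "Smith", 60))
```

Raising the minimum percentage tightens the match; lowering it admits pairs that share less of their content.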


Short/Empty Settings

These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.

Value | Description
Initial Only | Will match a full word to an initial (for example, “J” and “John”).
One Blank Field | Will match a full word to no data (for example, “John” and “”).
Both Blank Fields | Match this component if both records contain no data. This is a very important concept in creating matchcodes. For more information, see Blank Field Matching.
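
Because these settings can be combined, it may help to see them as independent switches on a single comparison. The sketch below is a simplified model, not MatchUp's logic: the function name, the keyword flags, and the whitespace and case normalization are assumptions, and exact equality stands in for whatever fuzzy setting the component actually uses.

```python
def short_empty_match(a: str, b: str, *, initial_only: bool = False,
                      one_blank: bool = False, both_blank: bool = False) -> bool:
    """Compare two component values under the Short/Empty switches."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return both_blank            # Both Blank Fields
    if not a or not b:
        return one_blank             # One Blank Field
    if a == b:
        return True                  # ordinary exact match
    if initial_only:
        short, full = sorted((a, b), key=len)
        return len(short) == 1 and full.startswith(short)
    return False


print(short_empty_match("J", "John", initial_only=True))
print(short_empty_match("John", "", one_blank=True))
print(short_empty_match("", ""))
```

With all switches off, two empty values do not match, which is why the Both Blank Fields setting matters when a matchcode includes components that are frequently empty.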


Swap

Swap matching is the ability to compare one component to another component. For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components. MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components be identical.
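
A swap pair can be pictured as trying the comparison twice, once straight and once crossed. The sketch below models a single pair with exact, case-insensitive comparison; the function name, the dictionary record layout, and the field names `first` and `last` are hypothetical and are not part of MatchUp Object.

```python
def swap_match(rec_a: dict, rec_b: dict, pair=("first", "last")) -> bool:
    """True if the paired components match straight or swapped."""
    x, y = pair

    def norm(record: dict, key: str) -> str:
        return record[key].strip().lower()

    straight = (norm(rec_a, x) == norm(rec_b, x)
                and norm(rec_a, y) == norm(rec_b, y))
    swapped = (norm(rec_a, x) == norm(rec_b, y)
               and norm(rec_a, y) == norm(rec_b, x))
    return straight or swapped


john_smith = {"first": "John", "last": "Smith"}
smith_john = {"first": "Smith", "last": "John"}
jane_smith = {"first": "Jane", "last": "Smith"}

print(swap_match(john_smith, smith_john))
print(swap_match(john_smith, jane_smith))
```

The crossed comparison only makes sense when both components are processed the same way, which is one reason to keep the other properties of the two members identical.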

For more information see Swap Matching Uses.