MatchUp Object:Component Properties


The matchcode components tell MatchUp Object which data types to use for creating the match key while the component properties tell MatchUp Object how much of the data to use and what parts.

Often, especially for potentially long fields like personal names and city or street names, MatchUp Object doesn’t need the full contents of the field to determine if the field is a duplicate of another. Only ten characters or so will often be enough.

Data Type

See Matchcode Components.

Label

This is a line of text that describes the component. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don’t fit any of the other component types.

Size

This is the maximum number of characters from the field that MatchUp Object will use to build the match key. Sizing is done after all other properties are applied.

Start

This property determines where MatchUp Object begins counting when applying the Size property.

Value	Description
Left	Starts from the first character of the field. This is the most commonly used option.
Right	Starts from the last character of the field. For example, if the data included a phone number of “949-589-5200” and the size was 7, MatchUp Object would use “5895200” for the match key.
Position	Starts from a specific position within the field.
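The interaction of Start and Size can be sketched in a few lines of Python. This is only an illustration, not MatchUp Object's implementation: the function name `apply_start_size`, its parameters, and the 1-based `position` argument are all hypothetical. The phone example assumes punctuation has already been removed before sizing, since sizing is described above as the last step.

```python
def apply_start_size(value: str, size: int, start: str = "left", position: int = 1) -> str:
    """Return the part of `value` selected by hypothetical Start/Size settings."""
    if start == "left":
        return value[:size]
    if start == "right":
        return value[-size:] if size > 0 else ""
    if start == "position":
        # Assumption of this sketch: `position` is 1-based.
        begin = max(position - 1, 0)
        return value[begin:begin + size]
    raise ValueError(f"unknown start option: {start!r}")


# Assumption: non-digit characters are stripped before the Size property is applied.
digits = "".join(ch for ch in "949-589-5200" if ch.isdigit())
print(apply_start_size(digits, 7, start="right"))        # 5895200
print(apply_start_size("Smithfield", 5))                 # Smith
print(apply_start_size("Smithfield", 5, "position", 6))  # field
```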

Fuzzy

Fuzzy settings allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.

Value	Description
Phonetex	(pronounced “Fo-NEH-tex”) An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below.
Soundex	An auditory matching algorithm patented by Robert C. Russell in 1918 and later used to index U.S. census records. Although the Phonetex algorithm is measurably superior, the Soundex algorithm is presented for users who need to create a matchcode that emulates one from another application.
Containment	Matches when one record's component is contained in another record. For example, “Smith” is contained in “Smithfield.”
Frequency	Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.”
Fast Near	A typographical matching algorithm. It works best in matching words that don’t match because of a few typographical errors. The number of errors allowed is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.
Accurate Near	An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
Frequency Near	Similar to Frequency matching except that you specify how many characters may be different between components.
UTF-8 Near	Similar to Levenshtein (Accurate Near). It counts the number of typos: a character substitution counts as one typo, and a pair of transposed characters counts as two typos. This algorithm differs from the others in that it accounts for differences in character storage size caused by encoding.
Vowels Only	Only vowels will be compared. Consonants will be removed.
Consonants Only	Only consonants will be compared. Vowels will be removed.
Alphas Only	Only alphabetic characters will be compared.
Numerics Only	Only numeric characters will be compared. Decimals and signs are considered numeric.
MD Keyboard	An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
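To make a few of these comparisons concrete, here is a minimal Python sketch of Containment, Frequency, and a textbook Levenshtein distance (the algorithm that Accurate Near is described as implementing). The function names are hypothetical, and MatchUp Object's own normalization, thresholds, and scoring are not reproduced here.

```python
from collections import Counter


def contains(a: str, b: str) -> bool:
    """Containment: one non-empty value appears inside the other."""
    return bool(a) and bool(b) and (a in b or b in a)


def same_frequency(a: str, b: str) -> bool:
    """Frequency: the same characters with the same counts, in any order."""
    return Counter(a) == Counter(b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(contains("Smith", "Smithfield"))       # True
print(same_frequency("abcdef", "badcfe"))    # True
print(levenshtein("Jonathan", "Jonathon"))   # 1 (one substitution)
print(levenshtein("ab", "ba"))               # 2 (a transposition costs two edits)
```

Under plain Levenshtein a transposed pair costs two edits, which is consistent with the typo counting described for UTF-8 Near.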

Fuzzy Advanced

Please research the definitions of the following advanced algorithms before using them in a matchcode.

Value	Description
Jaro	Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Jaro-Winkler	Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).
n-Gram	Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
Needleman-Wunsch	Similar to Accurate Near, except that inserts/deletes aren’t weighted as heavily and, as compensation for keyboarding mis-hits, not all character substitutions are weighted equally.
Smith-Waterman-Gotoh	Builds on Needleman-Wunsch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words.
Dice’s Coefficient	Like Jaro, Dice counts matching n-Grams (discarding duplicate n-Grams).
Jaccard Similarity Coefficient	Very similar to Dice’s Coefficient with a slightly different calculation.
Overlap Coefficient	A string similarity algorithm based on a substring calculation. Again, very similar to Dice’s Coefficient with a slightly different calculation.
Longest Common Substring	Finds the longest common substring between the two strings.
Double MetaPhone	Performs 2 different Phonetex-style transformations. Returns a value dependent on how many of the transformations match (i.e., 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2).
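The three n-Gram based coefficients differ only in how the shared grams are normalized. The sketch below uses the standard textbook definitions over sets of 2-grams (duplicates discarded, N = 2 as stated above). The function names are hypothetical, and MatchUp's exact calculation may differ.

```python
def ngrams(text: str, n: int = 2) -> set[str]:
    """Set of distinct n-character substrings; duplicates are discarded."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    x, y = ngrams(a, n), ngrams(b, n)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def jaccard(a: str, b: str, n: int = 2) -> float:
    x, y = ngrams(a, n), ngrams(b, n)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap(a: str, b: str, n: int = 2) -> float:
    x, y = ngrams(a, n), ngrams(b, n)
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


# "night" -> {ni, ig, gh, ht}; "nacht" -> {na, ac, ch, ht}; one shared gram: "ht"
print(dice("night", "nacht"))     # 0.25
print(jaccard("night", "nacht"))  # 0.14285714285714285
print(overlap("night", "nacht"))  # 0.25
```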

Distance

This field is context sensitive, depending on the Data Type and Fuzzy algorithm.

Context	Value	Description
Data Type	Proximity	Distance in miles. Range: 0-4000.
Data Type	Numeric	Integer number.
Data Type	Date	Number of days.
Fuzzy	Fast Near	Number of typographical errors. Range: Tight (1) to Loose (4).
Fuzzy	Accurate Near	Number of typographical errors. Range: Tight (1) to Loose (4).

The following use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.

n-Gram
Jaro
Jaro-Winkler
Longest Common Substring
Needleman-Wunsch
MD Keyboard
Smith-Waterman-Gotoh
Dice’s Coefficient
Jaccard Similarity Coefficient
Overlap Coefficient
Double MetaPhone
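As a rough illustration of how a minimum-similarity percentage acts as a threshold, the sketch below scores two strings by their longest common substring and compares the score to a chosen minimum. Dividing by the length of the longer string is an assumption made for this example; the source does not say how MatchUp converts each algorithm's result to a percentage. All names here are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters shared by both strings."""
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            # Extend the run ending at the previous characters, or reset to 0.
            length = previous[j - 1] + 1 if ca == cb else 0
            current.append(length)
            best = max(best, length)
        previous = current
    return best


def similarity_percent(a: str, b: str) -> float:
    # Assumption: normalize by the longer string's length.
    longer = max(len(a), len(b))
    return 100.0 * longest_common_substring(a, b) / longer if longer else 0.0


def is_match(a: str, b: str, min_percent: float) -> bool:
    return similarity_percent(a, b) >= min_percent


print(similarity_percent("Johnson", "Johnston"))  # 62.5 ("Johns" is 5 of 8 characters)
print(is_match("Johnson", "Johnston", 60))        # True
print(is_match("Johnson", "Johnston", 80))        # False
```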

Short/Empty Settings

These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.

Value	Description
Initial Only	Will match a full word to an initial (for example, “J” and “John”).
One Blank Field	Will match a full word to no data (for example, “John” and “”).
Both Blank Fields	Match this component if both records contain no data. This is a very important concept in creating matchcodes. For more information, see Blank Field Matching.
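Because these three settings can be combined freely, it may help to see them as independent flags on a single comparison. The following is a minimal sketch under that reading; the function `component_matches` and its keyword flags are hypothetical, and the comparison is a simple case-insensitive equality rather than MatchUp's actual logic.

```python
def component_matches(a: str, b: str, *, initial_only: bool = False,
                      one_blank: bool = False, both_blank: bool = False) -> bool:
    """Compare two component values under hypothetical Short/Empty flags."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return both_blank          # both records have no data
    if not a or not b:
        return one_blank           # exactly one record has no data
    if a == b:
        return True
    if initial_only:
        short, long_ = sorted((a, b), key=len)
        return len(short) == 1 and long_.startswith(short)
    return False


print(component_matches("J", "John", initial_only=True))  # True
print(component_matches("J", "John"))                     # False
print(component_matches("John", "", one_blank=True))      # True
print(component_matches("", "", both_blank=True))         # True
print(component_matches("", ""))                          # False
```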

Swap

Swap matching is the ability to compare one component to another component. For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components. MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components are identical.

For more information see Swap Matching Uses.
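The idea behind a swap pair can be sketched as follows: two components match either in their normal positions or crosswise. This is an illustration only; `swap_match` is a hypothetical name, and exact case-insensitive equality stands in for whatever properties the two components actually share.

```python
def swap_match(record_a: tuple[str, str], record_b: tuple[str, str]) -> bool:
    """Match two (first, last) pairs either straight or with the fields swapped."""
    first_a, last_a = (part.casefold() for part in record_a)
    first_b, last_b = (part.casefold() for part in record_b)
    straight = first_a == first_b and last_a == last_b
    swapped = first_a == last_b and last_a == first_b
    return straight or swapped


print(swap_match(("John", "Smith"), ("Smith", "John")))  # True
print(swap_match(("John", "Smith"), ("John", "Smith")))  # True
print(swap_match(("John", "Smith"), ("Smith", "Jane")))  # False
```

Applying the same comparison to both members of the pair mirrors the recommendation above that the two components have identical properties.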
