Tutorial:Matchcode Editor

From Melissa Data Wiki

Latest revision as of 21:42, 14 August 2012


Overview

The Matchcode Editor is a Windows-based application that creates and edits the matchcode file used by MatchUp Object. This program allows developers to customize copies of the original matchcodes that ship with MatchUp Object, or create new matchcodes from scratch.

If you have ever used Melissa Data's MatchUp software for Windows, you will already be familiar with the functionality of the Matchcode Editor.

Matchcode List

The top portion of the interface contains a list of all the matchcodes found in the current matchcode file.

To the right of the list are four buttons: Create Matchcode, Remove Matchcode, Copy Matchcode, and Rename Matchcode.

Adding a New Matchcode

  1. Click the Create Matchcode button.
  2. Type a name for the new matchcode in the pop-up box and click OK.
  3. The Matchcode Editor presents a blank matchcode screen with no components.
  4. Begin adding components.

Removing an Existing Matchcode

  1. Select the matchcode to be deleted.
  2. Click the Remove Matchcode button.
  3. Click Yes to confirm the deletion.

Making a Copy of an Existing Matchcode

  1. Select the matchcode to be copied.
  2. Click the Copy Matchcode button.
  3. Type a name for the new matchcode in the pop-up box and click OK.

Renaming an Existing Matchcode

  1. Select the matchcode to be renamed.
  2. Click the Rename Matchcode button.
  3. Type a new name for the matchcode in the pop-up box and click OK.

Matchcode Component List

Below the matchcode list is a list of components used by the currently selected matchcode.

The list also shows the basic settings for each component. The right side of the list contains a grid that shows the combinations in which each component is used. For more information on how combinations of components are used, see the Component Combinations topic.

Adding a new component to the matchcode

  1. Select the [Select Data Type] dropdown list.
  2. The Matchcode Editor displays a list of 35 available Data Types and an additional General Data Type to choose from.
  3. Select the settings for the new component.
  4. You can click on the different sections of the added component for more settings.
  5. Click OK.
  6. The new component is added as the last component in the matchcode.

Removing a component from a matchcode

  1. Select the Component Data Type dropdown list.
  2. Navigate to the top of the list and select [Remove Component].
  3. Place the focus on another control and the component will be removed from the component list.

Changing the order of components in a matchcode

  1. Click on the name of the component.
  2. Drag the component to the new position.

Matchcode Component Properties

Data Types

The following table lists all the available matchcode components in MatchUp Object:

Component                Description
Prefix                   Prefix of a personal name (Mr, Mrs, Ms, Dr).
First Name               A first name.
Middle Name              A middle name.
Last Name                A last name.
Suffix                   A suffix from a personal name.
Gender                   Male/Female/Neutral.
First/Nickname           A representative nickname for a first name.
Middle/Nickname          A representative nickname for a middle name.
Department/Title         A title and/or department name [1].
Company                  A company name.
Company Acronym          A company's acronym [2].
Street Number            The street number from an address line [3].
Street Pre-Directional   "South" in "3 South Main St".
Street Name              The street name from an address line.
Street Suffix            An address suffix (St, Ave, Blvd).
Street Post-Directional  "North" in "3 Main St North".
PO Box                   PO Boxes also include Farm Routes, Rural Routes, etc.
Street Secondary         Apartments, floors, rooms, etc.
Address                  A single unparsed address line [4].
City                     A city name. A ZIP or Postal Code is usually more accurate.
State/Province           A state or province name.
Zip9                     A full ZIP + 4 code (9 digits) [5].
Zip5                     The ZIP Code (5 digits).
Zip4                     The +4 extension of a ZIP + 4 code (4 digits).
Postal Code (Canada)     A Canadian Postal Code.
City (UK)                A city in the United Kingdom.
County (UK)              A county in the United Kingdom.
Postcode (UK)            A United Kingdom Postcode.
Country                  A country.
Phone/Fax                A phone number [6].
E-mail Address           An e-mail address [7].
Credit Card Number       A credit card number.
Date                     A date [8].
Numeric                  A numeric field [9].
Proximity                Allows you to specify a maximum distance in miles between records in which a match will be possible [10].
General                  Any general information, ID, birthday, SSN, etc.
  1. Company, Company Acronym, Department/Title: These components frequently fail to match exactly because of ‘noise words’ such as “the,” “and,” “agency,” and so on. MatchUp strips these words from these components.
  2. Company Acronym: MatchUp Object converts any multi-word company name into an acronym (for example, “International Business Machines” becomes “IBM”). Single-word company names are left as they are. This conversion is done after noise words are removed.
  3. Street Address Components: The seven street address components (Street Number, Street Pre-Directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary) are obtained by splitting up to three address lines. Note that PO Box and/or Street Secondary do not have to appear on their own line, or in a particular field. MatchUp's proprietary “street smart” splitter does all of the work.
  4. Full Address: When using the Full Address component, every small deviation in data entry can prevent a match. Because MatchUp Object’s street splitter is so powerful, it is preferable to use street address components instead of the Full Address in nearly all cases. The only exception may be when processing foreign addresses that don’t conform very well to US, Canadian, or UK addressing formats. This is discussed in more detail starting on page 178.
  5. Zip9, Zip5, Zip4, Canadian Postal Code: MatchUp Object removes dashes and spaces from ZIP Codes. When processing a mix of Canadian Postal Codes and US ZIP Codes, use the Zip9 component.
  6. Phone Number: MatchUp Object removes non-numeric characters from phone numbers. Leading ‘1-’ and trailing extensions are stripped if present. Numbers lacking an area code are right justified so that the local dialing code and number are aligned with numbers having area codes. If a data table often has missing or inaccurate area codes (for example, after a recent area code split), start at the 4th position of the phone number component. Do not use the right-most 7 positions, as badly formatted extensions can sometimes cause the phone number to get coded improperly.
  7. E-Mail Address: MatchUp Object removes illegal characters from e-mail addresses. Incomplete, changed, and commonly misspelled domain names are corrected using the Email Address data table.
  8. Date: MatchUp Object lets you specify a number of days. Two records can match if their dates fall within that many days of each other.
  9. Numeric: This lets you specify an integer tolerance. Two records can match if their numeric values differ by no more than that amount.
  10. Proximity: The Proximity component requires you to map in latitude/longitude coordinates, which lets you match addresses that lie within the maximum distance set for this component. MatchUp does not determine these coordinates; they can be obtained from a product such as GeoCoder or Contact Verify.
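
The normalizations described in notes 1, 2, 5, and 6 can be sketched in a few lines of Python. This is only an illustration of the ideas, not MatchUp Object's actual code: the noise-word list, the extension pattern, and all function names below are assumptions made for the example.

```python
import re

# Hypothetical noise-word list; the list MatchUp actually uses is not shown here.
NOISE_WORDS = {"the", "and", "agency", "of"}


def strip_noise(name):
    """Split a name into words and drop the (assumed) noise words."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return [w for w in words if w.lower() not in NOISE_WORDS]


def company_acronym(name):
    """Multi-word names become an acronym; single words are left as they are."""
    words = strip_noise(name)  # noise words are removed first
    if len(words) <= 1:
        return " ".join(words).upper()
    return "".join(w[0] for w in words).upper()


def normalize_zip(value):
    """Remove dashes and spaces from a ZIP or Postal Code."""
    return re.sub(r"[-\s]", "", value)


def normalize_phone(value):
    """Keep digits only, drop a leading 1 and a trailing extension,
    then right-justify to 10 positions so local numbers line up."""
    # Assumed extension formats: "x123", "ext 123", "ext. 123" at the end.
    main = re.split(r"(?i)\s*(?:ext\.?|x)\s*\d+\s*$", value)[0]
    digits = re.sub(r"\D", "", main)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits.rjust(10)
```

With these helpers, a seven-digit number such as "589-5200" is padded on the left, so positions 4 through 10 hold the local number whether or not an area code was present. That is why the note above suggests starting at the 4th position.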

Size

The maximum number of characters from this component to be used by this matchcode. If the data has fewer characters, it will be padded with spaces. Sizing is done after all other properties are applied.

Label

(Optional) A description of the data found in this component. Not all component types use this field. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don't fit any of the other component types.

Maximum Number of Words

This limits the number of words that MatchUp Object will extract from a field when building the match key. For example, if the maximum number of words for a last name were set to 1 and the last name field were "Von Richtofen," MatchUp Object would use only "VON" as the last name.

Start

This property determines where MatchUp Object begins counting when applying the Size property.

  • Left (beginning)
Starts from the first character of the field. This is the most commonly used option.
  • Right (end)
Starts from the last character of the field. For example, if the data included a phone number of "949-589-5200" and the size was 7, MatchUp Object would use "5895200" for the match key.
  • Position
Starts from a specific position within the field.
  • Word
Starts from the beginning of a specific word. This should only be used if a particular field always has more than one word and the first word (or more) can safely be ignored.
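
As a rough illustration of how Size, Maximum Number of Words, and Start interact, the sketch below applies them to a single field value. The function name, its parameters, and the order in which the properties are applied are assumptions for this example. The source only states that sizing happens last and that short values are padded with spaces.

```python
def extract_component(value, size, start="left", position=1,
                      trim=True, max_words=None):
    """Cut a field value down to the part used in the match key.

    start    -- "left", "right", "position", or "word"
    position -- 1-based character or word index for "position" / "word"
    """
    if trim:
        value = value.strip()
    if max_words is not None:
        # Keep only the first max_words words of the field.
        value = " ".join(value.split()[:max_words])

    if start == "left":
        piece = value[:size]
    elif start == "right":
        piece = value[-size:] if size > 0 else ""
    elif start == "position":
        piece = value[position - 1:position - 1 + size]
    elif start == "word":
        piece = " ".join(value.split()[position - 1:])[:size]
    else:
        raise ValueError("unknown start option: " + start)

    # Sizing is applied last; short values are padded with spaces.
    return piece.ljust(size)
```

For example, `extract_component("9495895200", 7, start="right")` yields "5895200", and `extract_component("Von Richtofen", 10, max_words=1)` yields "Von" followed by seven spaces.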

Trim

This property tells MatchUp Object to remove excess spaces from the beginning of a piece of data, the end, or both. This property is usually enabled.

Matching Strategies (Fuzzy)

These properties allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.

Phonetex
(pronounced "Fo-NEH-tex") An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below.
Soundex
Another, older, auditory matching algorithm. Although the Phonetex algorithm is measurably superior, the Soundex algorithm is presented for users who need to create a matchcode that emulates one from another application.
Containment
Match when one record's component is contained in another record. For example, "Smith" is contained in "Smithfield."
Frequency
Matches the characters in one record's component to the characters in another without any regard to the sequence. For example "abcdef" would match "badcfe."
Fast Near
A typographical matching algorithm. It works best in matching words that don't match because of a few typographical errors. Exactly how many errors is specified on a scale from 1 to 4 (1 being the tightest). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.
Accurate Near
This is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
Frequency Near
Similar to Frequency matching except a slider lets you specify how many characters may be different between components.
Vowels Only
Only vowels will be compared. Consonants will be removed.
Consonants Only
Only consonants will be compared. Vowels will be removed.
Alphas Only
Only alphabetic characters will be compared.
Numerics Only
Only numeric characters will be compared. Decimals and signs are considered numeric.

Fuzzy Advanced

Please research the definitions of the following advanced algorithms before implementing in a matchcode.

Jaro
Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Jaro-Winkler
Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).
n-Gram
Counts the number of common substrings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp.
Needleman-Wunsch
Similar to Accurate Near, except that inserts/deletes aren’t weighted as heavily and, as compensation for keyboarding mis-hits, not all character substitutions are weighted equally.
Smith-Waterman-Gotoh
Builds on Needleman-Wunsch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words.
Dice’s Coefficient
Like n-Gram, Dice counts matching n-Grams (discarding duplicate n-Grams).
Jaccard Similarity Coefficient
Very similar to Dice’s Coefficient with a slightly different calculation.
Overlap Coefficient
Again, very similar to Dice’s Coefficient with a slightly different calculation. String similarity algorithm based on a substring calculation.
Longest Common Substring
Finds the longest common substring between the two strings.
Double MetaPhone
Performs 2 different Phonetex-style transformations. Returns a value dependent on how many of the transformations match (i.e., 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2).
MD Keyboard
An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
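
To make the relationship between the three n-Gram based coefficients concrete, here is a small Python sketch using their textbook set-based definitions with a gram size of 2. MatchUp's exact calculations and thresholds are not documented here and may differ. The function names are chosen for this example only.

```python
def ngrams(s, n=2):
    """Set of distinct n-character substrings (duplicates are discarded)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a, b, n=2):
    """Dice's coefficient: 2 * |A & B| / (|A| + |B|)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def jaccard(a, b, n=2):
    """Jaccard similarity: |A & B| / |A | B|."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def overlap(a, b, n=2):
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))
```

All three start from the same count of shared grams and differ only in the denominator. That is why they usually rank candidate pairs similarly but produce different scores for the same pair.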

Short/Empty Settings

These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.

  • Match if both fields are blank
Match this component if both records contain no data. This is a very important concept in creating matchcodes. See Blank Field Matching for more information.
  • Match if one field is blank
Will match a full word to no data (for example, “John” and “”).
  • Match initial to full field
Will match a full word to an initial (for example, “J” and “John”).
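
The three settings can be pictured as independent switches consulted before the normal comparison. The sketch below is one plausible interpretation, with exact comparison standing in for whatever matching strategy the component uses. The function and parameter names are hypothetical.

```python
def component_matches(a, b, *, both_blank=False, one_blank=False,
                      initial_to_full=False):
    """Compare two component values under the Short/Empty settings."""
    a, b = a.strip().upper(), b.strip().upper()

    if not a and not b:
        return both_blank      # "Match if both fields are blank"
    if not a or not b:
        return one_blank       # "Match if one field is blank"
    if a == b:
        return True            # stand-in for the component's normal rule

    if initial_to_full:
        short, full = sorted((a, b), key=len)
        # "Match initial to full field": "J" matches "JOHN"
        if len(short) == 1 and full.startswith(short):
            return True
    return False
```

Because the switches are independent, `component_matches("J", "John", initial_to_full=True)` succeeds while `component_matches("", "John", initial_to_full=True)` does not, unless `one_blank` is also enabled.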

Combinations

Use these check boxes to select which of the 16 possible combinations will use this component. This matrix will grow as you add more components and combinations.

It is easier to visualize the effects of these boxes if you look at the list of matchcode components as well:

It is important to note that each vertical column of check marks designates one acceptable matchcode. For example, the illustration above shows a combination that is made up of 4 matchcodes:

  1. Zip5, Last Name, First Name, Street Number, Street Name
  2. Zip5, Last Name, First Name, PO Box
  3. Zip5, Company, Street Number, Street Name
  4. Zip5, Company, PO Box

Since boxes 1 and 3 in the Street Number row have check marks, the Street Number field is included in matchcodes 1 and 3.
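
One way to read the grid is that two records match when every component in at least one column matches. The Python sketch below encodes the four columns listed above and tests them in order. It assumes exact, non-blank comparison for every component, which is a simplification, and the record layout (a dictionary keyed by component name) is invented for the example.

```python
# Each tuple is one column of check marks from the example above.
COMBINATIONS = [
    ("Zip5", "Last Name", "First Name", "Street Number", "Street Name"),
    ("Zip5", "Last Name", "First Name", "PO Box"),
    ("Zip5", "Company", "Street Number", "Street Name"),
    ("Zip5", "Company", "PO Box"),
]


def matching_combination(rec1, rec2):
    """Return the 1-based number of the first column whose components
    all match, or None when no column matches."""
    for number, combo in enumerate(COMBINATIONS, start=1):
        if all(rec1.get(c) and rec1.get(c) == rec2.get(c) for c in combo):
            return number
    return None
```

Two records that share Zip5, Last Name, First Name, and PO Box but have no street data would return 2 here, even though column 1 fails.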

Swap Match Pairs

Swap matching is the ability to compare one component to another component.

For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components.

MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components are identical.

Configuring a swap pair

  1. Click the button for a swap pair.
  2. The Matchcode Editor displays the swap pair editing dialog.
  3. Select the two components that will be used for this swap pair. The first component is automatically disabled, since it cannot be used with swap matching.
  4. Select the swapping rule:
    Both components must match
    The contents of both components must be a match according to the fuzzy matching strategy in use for both components. "John Smith" matches "Smith John" but not "Smith <blank>."
    Either component can match
    At least one of the components must match. "John Smith" matches both "Smith John" and "Smith <blank>."
  5. Click OK.
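
The two swapping rules can be sketched as follows. This is an interpretation of the examples above, not MatchUp's code: exact, non-blank equality stands in for the components' matching strategy, and each record is represented as a hypothetical (first, last) tuple.

```python
def _same(x, y):
    """Stand-in for the components' matching strategy: exact and non-blank."""
    x, y = x.strip().upper(), y.strip().upper()
    return bool(x) and x == y


def swap_match(rec1, rec2, rule="both"):
    """Compare two (first, last) pairs, allowing the members to be swapped.

    rule -- "both":   both components must match, straight or swapped
            "either": at least one component must match, straight or swapped
    """
    f1, l1 = rec1
    f2, l2 = rec2
    if rule == "both":
        straight = _same(f1, f2) and _same(l1, l2)
        swapped = _same(f1, l2) and _same(l1, f2)
        return straight or swapped
    if rule == "either":
        return any((_same(f1, f2), _same(l1, l2),
                    _same(f1, l2), _same(l1, f2)))
    raise ValueError("unknown rule: " + rule)
```

With these definitions, ("John", "Smith") matches ("Smith", "John") under both rules, and matches ("Smith", "") only under the "either" rule, which mirrors the examples in step 4.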

For more ideas on how to use Swap Matching, see Swap Matching Uses.