SSIS:MatchUp:Component Properties: Difference between revisions

From Melissa Data Wiki
Jump to navigation Jump to search
Created page with "← SSIS Reference {| class="mw-collapsible" cellspacing="2" style="background-color:#f9f9f9; border:1px solid #aaaaaa; font-size:9pt; color:#0645ad; pa..."
 
No edit summary
 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[SSIS:Reference|← SSIS Reference]]
{{SSISMatchUpNav
 
|MatchcodeEditorCollapse=
{| class="mw-collapsible" cellspacing="2" style="background-color:#f9f9f9; border:1px solid #aaaaaa; font-size:9pt; color:#0645ad; padding:8px 8px;"
}}
! style="text-align:left; color:black; border-bottom:1px solid #aaaaaa;"|'''MatchUp Navigation'''
|-
| style="padding-right:220px;"|
|-
|[[SSIS:MatchUp|Overview]]
|-
|[[SSIS:MatchUp Tutorial|Tutorial]]
|-
|
{| class="mw-collapsible mw-collapsed" style="width:100%;"
|- align="left"
! style="color:black;"|[[SSIS:MatchUp:Advanced Configuration|Advanced Configuration]]
|-
|[[SSIS:MatchUp:Advanced Configuration#On-Premise|On-Premise]]
|}
|-
|
{| class="mw-collapsible mw-collapsed" style="width:100%;"
|- align="left"
! style="color:black;"|MatchUp Tabs
|-
|[[SSIS:MatchUp:Matchcode|Matchcode]]
|-
|[[SSIS:MatchUp:Field Mapping|Field Mapping]]
|-
|[[SSIS:MatchUp:Options|Options]]
|-
|[[SSIS:MatchUp:Survivorship Pass-Through|Survivorship/Pass-Through]]
|-
|[[SSIS:MatchUp:Lookup Pass-Through|Lookup Pass-Through Columns]]
|-
|[[SSIS:MatchUp:Output Filter|Output Filter]]
|}
|-
|
{| class="mw-collapsible" style="width:100%;"
|- align="left"
! style="color:black;"|[[SSIS:MatchUp:Matchcode Editor|Matchcode Editor]]
|-
|[[SSIS:MatchUp:Matchcode List|Matchcode List]]
|-
|[[SSIS:MatchUp:Component List|Component List]]
|-
|[[SSIS:MatchUp:Component Properties|Component Properties]]
|-
|[[SSIS:MatchUp:Algorithms|Algorithms]]
|}
|-
|
{| class="mw-collapsible mw-collapsed" style="width:100%;"
|- align="left"
! style="color:black;"|[[SSIS:MatchUp:Matchcodes:Overview|Matchcodes Overview]]
|-
|[[SSIS:MatchUp:Matchcodes:Component Combinations|Component Combinations]]
|-
|[[SSIS:MatchUp:Matchcodes:Blank Field Mapping|Blank Field Mapping]]
|-
|[[SSIS:MatchUp:Matchcodes:Mapping|Mapping]]
|-
|[[SSIS:MatchUp:Matchcodes:Optimization|Optimization]]
|-
|[[SSIS:MatchUp:Matchcodes:Swap Matching|Swap Matching]]
|}
|-
|
{| class="mw-collapsible mw-collapsed" style="width:100%;"
|- align="left"
! style="color:black;"|[[SSIS:MatchUp:Result Codes|Result Codes]]
|-
|[[Returned Result Codes:Enterprise Solutions#Data Quality Components for SSIS|Returned Result Codes]]
|-
|[[Result Codes]]
|}
|}




Line 239: Line 165:
</li>
</li>
</ol>
</ol>
===Global Data Types===
These Global data types will not be visible unless you create a global matchcode.
{|class="alternate01" cellspacing="0"
!Component
!style="border-right:0px;"|Description
|-
|Postal Code
|(Zip &/ plus 4) Complete postal code for a particular delivery point.
|-
|Premises Number
|(Street Number) Alphanumeric indicator within premises field.
|-
|Double Dependent Locality
|Smallest population center data element
|-
|Dependent Locality
|(Urbanization) Smaller population center data element. Dependent on Locality.
|-
|Sub Administrative Area
|(County) Smallest geographic data element.
|-
|Sub National Area
|Arbitrary administrative region below that of the sovereign state.
|-
|Locality
|(City) Most common population center data element.
|-
|Administrative Area
|(State) Most common geographic data element.
|-
|Thoroughfare Leading Type
|Leading thoroughfare type indicator within the Thoroughfare field.
|-
|Thoroughfare Pre-Directional
|(Street Pre Direction) Prefix directional contained within the Thoroughfare field.
|-
|Thoroughfare Name
|(Street Name) Name indicator within the Thoroughfare field
|-
|Thoroughfare Trailing Type
|(Street Suffix) Trailing thoroughfare type indicator within the Thoroughfare field.
|-
|Thoroughfare Post-Directional
|(Street Post Direction) Postfix directional contained within the Thoroughfare field.
|-
|Dependent Thoroughfare Pre-Directional
|Prefix directional contained within the Dependent Thoroughfare field.
|-
|Dependent Thoroughfare Leading Type
|Leading thoroughfare type indicator within the Dependent Thoroughfare field.
|-
|Dependent Thoroughfare Name
|Name indicator within the Dependent Thoroughfare field
|-
|Dependent Thoroughfare Trailing Type
|Trailing thoroughfare type indicator within the Dependent Thoroughfare field.
|-
|Dependent Thoroughfare Post-Directional
|Postfix directional contained within the Dependent Thoroughfare field.
|}


;Size
;Size
Line 244: Line 234:


;Label
;Label
:(Optional) A description of the data found in this component. Not all component types use this field. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don't fit any of the other component types.
:(Optional) A description of the data found in this component. Not all component types use this field. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don't fit any of the other component types.Max size of description is 20 characters.


;Maximum Number of Words
;Maximum Number of Words
Line 266: Line 256:
:;*Word
:;*Word


:Starts from the beginning of a specific word. This should only be used if a particular field always has more than one word and first word (or more) can safely be ignored.
::Starts from the beginning of a specific word. This should only be used if a particular field always has more than one word and first word (or more) can safely be ignored.


;Trim
;Trim
Line 295: Line 285:
;Frequency Near
;Frequency Near
:Similar to Frequency matching except that a slider lets you specify how many characters may be different between components.
:Similar to Frequency matching except that a slider lets you specify how many characters may be different between components.
;UTF-8 Near
:Similar to levenstein (Accurate Near). It counts the number typos, i.e. character substitutions count as one typo, transposed characters count as two typos. This algorithm differs from others in that it will account for character storage sizes due to different encoding.


;Vowels Only
;Vowels Only
Line 344: Line 337:
;MD Keyboard
;MD Keyboard
:An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
:An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.
==Distance==
This is the property where you set a range for which two records will still match. This field is context sensitive, depending on the Data Type and Fuzzy algorithm.
The following matchcode component Data Types use a specific unit of measure when specifying the Distance:
*'''Proximity''': Distance in miles. Range: 0-4000
*'''Numeric''': Integer number.
*'''Date''': Number of days.
For example: If the Distance is set to 60:
*Two records with dates 20161225 and 20161031 will match. (they are within 60 days)
*Two records with dates 20161225 and 20160430 will match. (they are further tham 60 days apart)
When using the following fuzzy algorithms, the Distance uses a scale of 1-4, rather than a unit of measure or percentage of similarity.
*'''Fast Near''': Number of typographical errors. Range: Tight(1) - Loose(4) with respect to length of strings.
*'''Accurate Near''': Number of typographical errors. Range: Tight(1) - Loose(4) with respect to length of strings.
NOTE: Since these algorithms are not published and the range was originally developed to represent a general sliding scale (narrow choice of precision), we recommend using Near:1 and carefully test before you consider using the higher settings in production, as doing so can quickly return false duplicates. In other words, the 1-4 scale is like when someone rates something on a 1-10 scale – it’s a general scale, not a specific grade.
The following algorithms use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.
*N-Gram
*Jaro
*Jaro-Winkler
*LCS
*Needleman-Wunch
*MD Keyboard
*Smith-Waterman-Gotoh
*Dice’s Coefficient
*Jaccard Similarity Coefficient
*Overlap Coefficient
*Double MetaPhone
For example, if using the Jaro-Winkler algorithm with the Distance set to 80:
*Two records which MatchUp finds to be 84% similar according to the Jaro-Winkler algorithm will match.
*Two records which MatchUp finds to be 62% similar according to the Jaro-Winkler algorithm will not match.
More information on the publically published algorithms can be found here: [[MatchUp Object:Advanced Algorithms|Advanced Algorithms]].





Latest revision as of 22:09, 10 February 2017

← SSIS:Data Quality Components

MatchUp Navigation
Overview
Editions
Tutorial
Advanced Configuration
On-Premise
MatchUp Tabs
Matchcode
Field Mapping
Options
Golden Record
Custom Expression Elements
Survivorship/Pass-Through
Lookup Pass-Through Columns
Output Filter
Matchcode Editor
Matchcode Evaluation
Matchcode List
Component List
Component Properties
Algorithms
Matchcodes Overview
Component Combinations
Blank Field Mapping
Mapping
Optimization
Swap Matching
Result Codes
Result Codes



Data Types

The following table lists all the available matchcode components in MatchUp.

Component Description
Prefix Prefix of a personal name (Mr, Mrs, Ms, Dr)
First Name A first name
Middle Name A middle name
Last Name A last name
Suffix A suffix from a personal name
Gender Male/Female/Neutral
First/Nickname A representative nickname for a first name
Middle/Nickname A representative nickname for a middle name
Department/Title A title and/or department name [1]
Company A company name
Company Acronym A company's acronym [2]
Street Number The street number from an address line [3]
Street Pre-Directional "South" in "3 South Main St"
Street name The street name from an address line
Street Suffix An address suffix (St, Ave, Blvd)
Street Post-Directional "North" in "3 Main St North"
PO Box PO Boxes also include Farm Routes, Rural Routes, etc.
Street Secondary Apartments, floors, rooms, etc.
Address A single unparsed address line [4]
City A city name, ZIP or Postal code is usually more accurate
State/Province A state or province name
Zip9 A full ZIP + 4 code (9 digits) [5]
Zip5 The ZIP Code (5 digits)
Zip4 The +4 extension of a ZIP + 4 code (4 digits)
Postal Code (Canada) A Canadian Postal Code
City (UK) A city in the United Kingdom
County (UK) A county in the United Kingdom
Postcode (UK) A United Kingdom Postcode
Country A country
Phone/Fax A phone number [6]
E-mail Address An e-mail address [7]
Credit Card Number A credit card number
Date A date [8]
Numeric A numeric field [9]
Proximity Allows you to specify a maximum distance in miles between records in which a match will be possible [10]
General Any general information, ID, birthday, SSN, etc.


  1. Company, Company Acronym, Department/Title
    Frequently these components don't match exactly because of ‘noise words’ such as “the,” “and,” “agency,” and so on. MatchUp strips these words from these components.
  2. Company Acronym
    MatchUp converts any multi-word company name into an acronym (for example, “International Business Machines” is squeezed into “IBM”). Single-word company names are left as they are. This conversion is done after noise words are removed.
  3. Street Address Components
    The seven street address components (Street Number, Street Pre-Directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary) are obtained by splitting up to three address lines. Note that PO Box and/or Street Secondary do not have to appear on their own line, or in a particular field. MatchUp's proprietary “street smart” splitter does all of the work.
  4. Full Address
    When using the Full Address component, you are at the mercy of every little deviation in data entry. Because MatchUp’s street splitter is so powerful, it is preferable to use street address components instead of the Full Address in nearly all cases. The only exception may be when processing foreign addresses that don’t conform very well to US, Canadian or UK addressing formats. This is discussed in more detail starting on page 178.
  5. Zip9, Zip5, Zip4, Canadian Postal Code
    MatchUp removes dashes and spaces from Zip codes. When processing a mix of Canadian Postal Codes and US Zip Codes, use the Zip9 component.
  6. Phone Number
    MatchUp removes non-numeric characters from phone numbers. Leading ‘1-’ and trailing extensions are stripped if present. Numbers lacking an area code are right justified so that the local dialing code and number are aligned with numbers having area codes. If a data table often has missing or inaccurate area codes (i.e., after a recent area code split), start at the 4th position of the phone number component. Do not use the right-most 7 positions, as badly formatted extensions can sometimes cause the phone number to get coded improperly.
  7. E-Mail Address
    MatchUp removes illegal characters from e-mail addresses. Incomplete, changed, and commonly misspelled domain names are corrected using the Email Address data table.
  8. Date
    MatchUp allows you to specify a number of days for which a match will be possible if the records being compared fall within the set number of days apart.
  9. Numeric
    This allows you to specify an integer number for which a match will be possible if the record’s unit difference falls within the set number.
  10. Proximity
    The proximity component requires you to map in Latitude / Longitude coordinates (Not determined by MatchUp. Can be determined by a product such as GeoCoder or Contact Verify) allowing you to match addresses within a maximum distance setting for this component.


Global Data Types

These Global data types will not be visible unless you create a global matchcode.

Component Description
Postal Code (Zip &/ plus 4) Complete postal code for a particular delivery point.
Premises Number (Street Number) Alphanumeric indicator within premises field.
Double Dependent Locality Smallest population center data element
Dependent Locality (Urbanization) Smaller population center data element. Dependent on Locality.
Sub Administrative Area (County) Smallest geographic data element.
Sub National Area Arbitrary administrative region below that of the sovereign state.
Locality (City) Most common population center data element.
Administrative Area (State) Most common geographic data element.
Thoroughfare Leading Type Leading thoroughfare type indicator within the Thoroughfare field.
Thoroughfare Pre-Directional (Street Pre Direction) Prefix directional contained within the Thoroughfare field.
Thoroughfare Name (Street Name) Name indicator within the Thoroughfare field
Thoroughfare Trailing Type (Street Suffix) Trailing thoroughfare type indicator within the Thoroughfare field.
Thoroughfare Post-Directional (Street Post Direction) Postfix directional contained within the Thoroughfare field.
Dependent Thoroughfare Pre-Directional Prefix directional contained within the Dependent Thoroughfare field.
Dependent Thoroughfare Leading Type Leading thoroughfare type indicator within the Dependent Thoroughfare field.
Dependent Thoroughfare Name Name indicator within the Dependent Thoroughfare field
Dependent Thoroughfare Trailing Type Trailing thoroughfare type indicator within the Dependent Thoroughfare field.
Dependent Thoroughfare Post-Directional Postfix directional contained within the Dependent Thoroughfare field.


Size
The maximum number of characters from this component to be used by this matchcode. If the data has fewer characters, it will be padded with spaces. Sizing is done after all other properties are applied
Label
(Optional) A description of the data found in this component. Not all component types use this field. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don't fit any of the other component types.Max size of description is 20 characters.
Maximum Number of Words
This limits the number of words that MatchUp will extract from a field when building the match key. for example, if maximum words for a last name were set for 1 and the last name field was "Von Richtofen," MatchUp would only use "VON" as the last name.
Start
This property determines where MatchUp begins counting when applying the Size property.
  • Left (beginning)
Starts from the first character of the field. This is the most commonly used option.
  • Right (end)
Starts from the last character of the field. In other words, if the data included a phone number of "949-589-5200" and the size was 7, MatchUp would use "5895200" for the match key.
  • Position
Starts form a specific position within the field.
  • Word
Starts from the beginning of a specific word. This should only be used if a particular field always has more than one word and first word (or more) can safely be ignored.
Trim
This property tells MatchUp to remove excess spaces from the beginning of a piece of data, the end or both. Usually, this property is always enabled.


Matching Strategies (Fuzzy)

These properties allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.

Phonetex
(pronounced "Fo-NEH-tex") An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below.
Soundex
Another, older, auditory matching algorithm. Although the Phonetex algorithm is measurable superior, the Soundex algorithm is presented for users who need to create a machcode that emulates one from another application.
Containment
Match when one record's component is contained in another record. For example, "Smith" is contained in "Smithfield"
Frequency
Matches the characters in one record's component to the characters in another without any regard to the sequence. For example "abcdef" would match "badcfe"
Fast Near
A typographical matching algorithm. It works best in matching words that don't match because of a few typographical errors. Exactly how many errors is specified on a scale from 1 to 4 (1 being the tightest). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches.
Accurate Near
This is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower.
Frequency Near
Similar to Frequency matching except that a slider lets you specify how many characters may be different between components.
UTF-8 Near
Similar to levenstein (Accurate Near). It counts the number typos, i.e. character substitutions count as one typo, transposed characters count as two typos. This algorithm differs from others in that it will account for character storage sizes due to different encoding.
Vowels Only
Only vowels will be compared. Consonants will be removed.
Consonants Only
Only consonants will be compared. Vowels will be removed.
Alphas Only
Only alphabetic characters will be compared.
Numerics Only
Only numeric characters will be compared. Decimals and signs are considered numeric.


Fuzzy Advanced

Please research the definitions of the following advanced algorithms before implementing in a matchcode.

Jaro
Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings.
Jaro-Winkler
Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters).
n-Gram
Counts the number of common sub-strings (grams) between the two strings. Substring size ‘N’, is currently defaulted as 2 in MatchUp.
Needleman-Wunch
Similar to Accurate Near, except that inserts/deletes aren’t weighted as heavily and as compensation for keyboarding mis-hits, not all character substitutions are weighted equally.
Smith-Waterman-Gotoh
Builds on Needleman-Wunch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words.
Dice’s Coefficient
Like Jaro, Dice counts matching n-Grams (discarding duplicate n-Grams).
Jaccard Similarity Coefficient
Very similar to Dice’s Coefficient with a slightly different calculation.
Overlap Coefficient
Again, very similar to Dice’s Coefficient with a slightly different calculation. String similarity algorithm based on a substring calculation.
Longest Common Substring
Finds the longest common substring between the two strings.
Double MetaPhone
Performs 2 different Phonetex-style transformations. Returns a value dependant on how many of the transformations match (ie, 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2).
MD Keyboard
An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings.


Distance

This is the property where you set a range for which two records will still match. This field is context sensitive, depending on the Data Type and Fuzzy algorithm.

The following matchcode component Data Types use a specific unit of measure when specifying the Distance:

  • Proximity: Distance in miles. Range: 0-4000
  • Numeric: Integer number.
  • Date: Number of days.

For example: If the Distance is set to 60:

  • Two records with dates 20161225 and 20161031 will match. (they are within 60 days)
  • Two records with dates 20161225 and 20160430 will match. (they are further tham 60 days apart)


When using the following fuzzy algorithms, the Distance uses a scale of 1-4, rather than a unit of measure or percentage of similarity.

  • Fast Near: Number of typographical errors. Range: Tight(1) - Loose(4) with respect to length of strings.
  • Accurate Near: Number of typographical errors. Range: Tight(1) - Loose(4) with respect to length of strings.


NOTE: Since these algorithms are not published and the range was originally developed to represent a general sliding scale (narrow choice of precision), we recommend using Near:1 and carefully test before you consider using the higher settings in production, as doing so can quickly return false duplicates. In other words, the 1-4 scale is like when someone rates something on a 1-10 scale – it’s a general scale, not a specific grade.


The following algorithms use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.

  • N-Gram
  • Jaro
  • Jaro-Winkler
  • LCS
  • Needleman-Wunch
  • MD Keyboard
  • Smith-Waterman-Gotoh
  • Dice’s Coefficient
  • Jaccard Similarity Coefficient
  • Overlap Coefficient
  • Double MetaPhone


For example, if using the Jaro-Winkler algorithm with the Distance set to 80:

  • Two records which MatchUp finds to be 84% similar according to the Jaro-Winkler algorithm will match.
  • Two records which MatchUp finds to be 62% similar according to the Jaro-Winkler algorithm will not match.


More information on the publically published algorithms can be found here: Advanced Algorithms.


Short/Empty Settings

These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.

Match if both fields are blank
Match this component if both records contain no data. This is a very important concept in creating matchcodes. See Blank Field Matching later in this chapter for more information.
Match if one field is blank
Will match a full word to no data (for example, “John” and “”).
Match initial to full field
Will match a full word to an initial (for example, “J” and “John”).


Combinations

Uses these check boxes to select which of the sixteen possible combinations will use this component. This matrix will grow as you add more components and combinations.

It is easier to visualize the effects of these boxes if you look at the list of matchcode components as well:

It is important to note that each VERTICAL column of check marks designates one acceptable matchcode. For example, the illustration above shows a combination that is made up of 4 matchcodes:

  1. Zip5, Last Name, First Name, Street Number, Street Name
  2. Zip5, Last Name, First Name, PO Box
  3. Zip5, Company, Street Number, Street Name
  4. Zip5, Company, PO Box

Since boxes 1 and 3 in the Street Number row have check marks, the Street Number field is included in matchcode 1 and 3.


Swap Match Pairs

Swap matching is the ability to compare one component to another component.

For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components. MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components are identical.

Configure a Swap Pair

  1. Click the button for a swap pair.
  2. The Matchcode Editor displays the swap pair editing dialog.
  3. Select the two components that will be used for this swap pair. The first component is automatically disabled, since it cannot be used with swap matching.
  4. Select the swapping rule:
    • Both components must match - The contents of both components must be a match according to fuzzy matching strategy in use for both components. "John Smith" matches "Smith John" but not "Smith <blank>."
    • Either component can match - At least one of the components must match "John Smith" matches both "Smith John" and "Smith <blank>."
  5. Click OK.
  6. For more ideas on how to use Swap Matching, see Swap Matching Uses.