MatchUp Object:Component Properties: Difference between revisions
Revision as of 17:37, 5 August 2015
The matchcode components tell MatchUp Object which data types to use for creating the match key while the component properties tell MatchUp Object how much of the data to use and what parts.
Often, especially for potentially long fields like personal names and city or street names, MatchUp Object doesn’t need the full contents of the field to determine if the field is a duplicate of another. Only ten characters or so will often be enough.
In another example, a database might include the area code and the phone number in some records and just the local number in others. By only considering seven characters of the field starting at position four, MatchUp Object has a better chance of detecting a duplicate.
Data Type
See Matchcode Components.
Label
This is a line of text that describes the component. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don’t fit any of the other component types.
Size
This is the maximum number of characters from the field that MatchUp Object will use to build the match key. Sizing is done after all other properties are applied.
Start
This property determines where MatchUp Object begins counting when applying the Size property.
Value | Description |
---|---|
Left | Starts from the first character of the field. This is the most commonly used option. |
Right | Starts from the last character of the field. For example, if the data included a phone number of “949-589-5200” and the size was 7, MatchUp Object would use “5895200” for the match key. |
Position | Starts from a specific position within the field. |
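As an illustration of how Size and Start interact, here is a minimal Python sketch. It is not MatchUp Object's implementation: the function name `key_part`, the lowercase start values, and the 1-based `position` argument are assumptions made for this example, and punctuation is stripped up front on the assumption that such clean-up belongs to the "other properties" applied before sizing.

```python
def key_part(value: str, size: int, start: str = "left", position: int = 1) -> str:
    """Return at most `size` characters of `value`, counted from `start`.

    Hypothetical helper for illustration only.
    """
    if size <= 0:
        return ""
    if start == "left":
        return value[:size]
    if start == "right":
        return value[-size:]
    if start == "position":
        begin = position - 1  # assumption: positions are 1-based
        return value[begin:begin + size]
    raise ValueError(f"unknown start value: {start}")


# Assumption: punctuation is removed before sizing is applied.
digits = "".join(ch for ch in "949-589-5200" if ch.isdigit())

print(key_part(digits, 7, "right"))        # 5895200
print(key_part(digits, 7, "position", 4))  # 5895200
print(key_part("Smithfield", 5))           # Smith
```

With a full ten-digit number, Right with a size of 7 and Position 4 with a size of 7 select the same characters. Only Right still lines up when a record holds just the seven-digit local number.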
Fuzzy
Fuzzy settings allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.
Value | Description |
---|---|
Phonetex | (pronounced “Fo-NEH-tex”) An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below. |
Soundex | An auditory matching algorithm originally developed by the Department of Immigration in 1917 and later adopted by the USPS. Although the Phonetex algorithm is measurably superior, the Soundex algorithm is presented for users who need to create a matchcode that emulates one from another application. |
Containment | Matches when one record's component is contained in another record. For example, “Smith” is contained in “Smithfield.” |
Frequency | Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.” |
Fast Near | A typographical matching algorithm. It works best in matching words that don’t match because of a few typographical errors. Exactly how many errors is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches. |
Accurate Near | An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower. |
Frequency Near | Similar to Frequency matching except that you specify how many characters may be different between components. |
UTF-8 Near | For Global Matchcodes Only. Similar to Levenshtein (Accurate Near). It counts the number of typos: a character substitution counts as one typo, and transposed characters count as two typos. This algorithm differs from the others in that it accounts for the different character storage sizes that result from different encodings. It should not be used with domestic matchcodes, which substitute lower-order characters for accented characters during key building. |
Vowels Only | Only vowels will be compared. Consonants will be removed. |
Consonants Only | Only consonants will be compared. Vowels will be removed. |
Alphas Only | Only alphabetic characters will be compared. |
Numerics Only | Only numeric characters will be compared. Decimals and signs are considered numeric. |
MD Keyboard | An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings. |
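To make a few of these comparisons concrete, the following Python sketch shows textbook versions of containment, frequency, and Levenshtein matching. The function names are hypothetical and the code is not taken from MatchUp Object, so the product's own results may differ (for example in how case or thresholds are handled).

```python
from collections import Counter


def containment(a: str, b: str) -> bool:
    """True when one component is contained in the other."""
    return a in b or b in a


def frequency(a: str, b: str) -> bool:
    """True when both components hold the same characters, in any order."""
    return Counter(a) == Counter(b)


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: inserts, deletes, and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


print(containment("Smith", "Smithfield"))  # True
print(frequency("abcdef", "badcfe"))       # True
print(levenshtein("Smith", "Smyth"))       # 1 (one substitution)
print(levenshtein("abcd", "abdc"))         # 2 (a transposition costs two)
```

Because Python strings are sequences of code points, this distance counts characters rather than bytes, which is the kind of encoding-size concern the UTF-8 Near description refers to.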
Fuzzy Advanced
Please research the definitions of the following advanced algorithms before implementing them in a matchcode.
Value | Description |
---|---|
Jaro | Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings. |
Jaro-Winkler | Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters). |
n-Gram | Counts the number of common substrings (grams) between the two strings. The substring size, N, currently defaults to 2 in MatchUp. |
Needleman-Wunch | Similar to Accurate Near, except that inserts/deletes aren’t weighted as heavily and as compensation for keyboarding mis-hits, not all character substitutions are weighted equally. |
Smith-Waterman-Gotoh | Builds on Needleman-Wunch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words. |
Dice’s Coefficient | Like Jaro, Dice counts matching n-Grams (discarding duplicate n-Grams). |
Jaccard Similarity Coefficient | Very similar to Dice’s Coefficient with a slightly different calculation. |
Overlap Coefficient | Again, very similar to Dice’s Coefficient with a slightly different calculation. A string similarity algorithm based on a substring calculation. |
Longest Common Substring | Finds the longest common substring between the two strings. |
Double MetaPhone | Performs 2 different Phonetex-style transformations. Returns a value dependent on how many of the transformations match (i.e., 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2). |
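The three coefficient algorithms differ only in how they normalize the count of shared n-grams. The sketch below uses the standard set-based definitions with N = 2. The function names are hypothetical, and MatchUp's exact calculations are not documented here, so treat this as an aid to understanding rather than a description of the product.

```python
def grams(s: str, n: int = 2) -> set:
    """Distinct n-grams of s (duplicates are discarded by using a set)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a: str, b: str) -> float:
    x, y = grams(a), grams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str) -> float:
    x, y = grams(a), grams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def overlap(a: str, b: str) -> float:
    x, y = grams(a), grams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


# "night" and "nacht" share exactly one bigram, "ht".
print(dice("night", "nacht"))       # 0.25
print(jaccard("night", "nacht"))    # 1/7, about 0.143
print(overlap("Smith", "Smithfield"))  # 1.0
```

The overlap coefficient divides by the smaller set, so a short string wholly contained in a longer one scores 1.0, while Dice and Jaccard penalize the length difference.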
Distance
This field is context sensitive, depending on the Data Type and Fuzzy algorithm.
Value | Description |
---|---|
Data Type |
|
Fuzzy |
|
The following use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.
- n-Gram
- Jaro
- Jaro-Winkler
- Longest Common Substring (LCS)
- Needleman-Wunch
- MD Keyboard
- Smith-Waterman-Gotoh
- Dice’s Coefficient
- Jaccard Similarity Coefficient
- Overlap Coefficient
- Double MetaPhone
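The following Python sketch shows how a minimum-similarity percentage can act as a match threshold, using Longest Common Substring as the example. The normalization (common length divided by the longer string's length) and the function names are assumptions for illustration; the page does not state how MatchUp derives its percentages.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest substring common to a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            # Extend the common run ending at the previous characters, or reset.
            cur.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, cur[j])
        prev = cur
    return best


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: common length over the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * lcs_length(a, b) / longest


def is_match(a: str, b: str, min_percent: float) -> bool:
    """True when the similarity reaches the configured Distance value."""
    return similarity_percent(a, b) >= min_percent


print(similarity_percent("Smithfield", "Smith"))  # 50.0
print(is_match("Smithfield", "Smith", 50))        # True
print(is_match("Smithfield", "Smith", 80))        # False
```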
Short/Empty Settings
These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.
Value | Description |
---|---|
Initial Only | Will match a full word to an initial (for example, “J” and “John”). |
One Blank Field | Will match a full word to no data (for example, “John” and “”). |
Both Blank Fields | Match this component if both records contain no data. This is a very important concept in creating matchcodes. For more information, see Blank Field Matching. |
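Because these settings can be combined, it may help to see them as independent flags. The sketch below is a simplified model, not MatchUp Object's logic: the function name, the flag names, the whitespace trimming, and the case-insensitive comparison are all assumptions.

```python
def component_matches(a: str, b: str, *, initial_only: bool = False,
                      one_blank: bool = False, both_blank: bool = False) -> bool:
    """Compare two component values under the Short/Empty settings."""
    a, b = a.strip().lower(), b.strip().lower()  # assumption: case-insensitive
    if not a and not b:
        return both_blank          # Both Blank Fields
    if not a or not b:
        return one_blank           # One Blank Field
    if a == b:
        return True
    if initial_only:
        short, long_ = sorted((a, b), key=len)
        # Initial Only: a single letter matches a word that starts with it.
        return len(short) == 1 and long_.startswith(short)
    return False


print(component_matches("J", "John"))                     # False
print(component_matches("J", "John", initial_only=True))  # True
print(component_matches("John", "", one_blank=True))      # True
print(component_matches("", "", both_blank=True))         # True
```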
Swap
Swap matching is the ability to compare one component to another component. For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components. MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components be identical.
For more information see Swap Matching Uses.
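A swap pair can be pictured as a comparison that succeeds either straight across or crosswise. The Python sketch below assumes exact comparisons and uses hypothetical record dictionaries and field names; the behavior MatchUp actually supports for swap pairs is described in Swap Matching Uses.

```python
def swap_match(rec1: dict, rec2: dict, pair: tuple) -> bool:
    """True when the paired components match straight across or swapped."""
    x, y = pair
    straight = rec1[x] == rec2[x] and rec1[y] == rec2[y]
    swapped = rec1[x] == rec2[y] and rec1[y] == rec2[x]
    return straight or swapped


a = {"first": "John", "last": "Smith"}
b = {"first": "Smith", "last": "John"}
c = {"first": "Jane", "last": "Smith"}

print(swap_match(a, b, ("first", "last")))  # True
print(swap_match(a, c, ("first", "last")))  # False
```

Keeping the other properties identical for both members of the pair means each value is reduced the same way before comparison, whichever side it lands on.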