RightFielder Object:Config Example

From Melissa Data Wiki
Revision as of 18:02, 24 February 2014 by Tim (talk | contribs) (Created page with "<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microso...")

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<body lang=EN-US link=blue vlink=purple style='tab-interval:.5in'>

mdRightFielder.cfg: Case Studies

The mdRightFielder.cfg file is a plain text file that users can use to tailor Right Fielder Object’s behavior to meet their specific needs. Generally, this is used when a user has input data with a specific quirk or characteristic that Right Fielder on its own can’t handle properly.

This file is used to override the default entries from the stock mdRightFielder lookup tables contained in the mdRightFielder.dat data file. By default, both mdRightFielder.dat and mdRightFielder.cfg are installed in:

C:\Program Files\Melissa DATA\DQT\Data, or the respective Melissa Data data directory on UNIX-type OS installations.

For a complete list of the tables and types that can be overridden, as well as syntax and examples, open mdRightFielder.cfg in a text editor and follow the instructions it contains.

There are 3 types of modifications that can be made in mdRightFielder.cfg:

· Lookup Table – The addition or removal of words (and phrases) to the Object’s dictionaries. This essentially expands (or limits) Right Fielder’s vocabulary.

· Regular Expression – The addition of regular expressions that are used to recognize specific character patterns (for example, phone numbers, e-mails, etc.).

· Pattern Table – The addition or removal of patterns of words and phrases. Words and phrases are first identified via Lookup Tables and assigned tokens (specified in the Lookup Table itself). Sequences of tokens (patterns) are matched to entries in this table and transformed into output data.

Case Study 1: Lookup Tables

There are three lookup tables in Right Fielder Object: 

· LeftToken – used to recognize words and phrases that usually appear at the start of the data (names, companies, titles)

· MiddleToken – used to recognize words and phrases that usually appear in the middle of the data (addresses, apartments, PO Boxes, etc.)

· RightToken – used to recognize words and phrases that usually appear at the end of the data (city, state, country)

Say, for example, you are processing a list of car dealerships and the company recognition isn’t working as well as you would like. In evaluating the results, it appears that if Right Fielder knew a bit more about car manufacturers, processing might be a lot more accurate. In this case, you would modify the LeftToken table in this way:

[LeftToken]
FORD,C
TOYOTA,C
CHEVY,C
NISSAN,C
KIA,C
HONDA,C
LINCOLN,C
ALFA ROMEO,C
MOTORS,C

Note that the entries do not need to be sorted. Also, these entries will override any existing dictionary entries (for example, ‘LINCOLN’ is by default a First Name indicator). The second field, containing the ‘C’, indicates what kind of word is being described (its ‘token’). Only one token can be used per entry. Each table accepts a different set of tokens; see mdRightFielder.cfg for details.
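The override behavior described above can be pictured with a small Python sketch. This is only an illustration, not Right Fielder's actual loader: the function name parse_lookup_section, the assumed default entry for LINCOLN (token F), and the dictionary merge are all hypothetical, and the sketch only handles simple two-field WORD,TOKEN entries.

```python
# Hypothetical sketch: layering [LeftToken] overrides on top of stock entries.
# The real defaults live in mdRightFielder.dat; the one below is assumed.

def parse_lookup_section(text, section):
    """Collect 'WORD,TOKEN' entries found under the given [section] header."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and ';' comment lines
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        if current == section:
            word, _, token = line.partition(",")
            entries[word.strip().upper()] = token.strip()
    return entries

CFG_TEXT = """
[LeftToken]
FORD,C
LINCOLN,C
ALFA ROMEO,C
"""

defaults = {"LINCOLN": "F"}  # assumed stock entry (First Name indicator)
overrides = parse_lookup_section(CFG_TEXT, "LeftToken")
effective = {**defaults, **overrides}  # cfg entries win over the defaults

print(effective["LINCOLN"])  # the cfg token replaces the stock one
```

The merge order is the point of the sketch: whatever the cfg file says about a word replaces the stock entry for that word, and entry order within the section does not matter.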

Case Study 2: Regular Expression for a defined Data Type

Regular expressions are used to recognize postal codes, e-mail addresses, phone numbers and URLs. These types of data can’t be recognized well using a dictionary, but the actual pattern of numbers, letters and punctuation is very useful in identifying them. For example, Canadian Postal Codes always follow the pattern alpha-digit-alpha-digit-alpha-digit. They are recognized with the regular expression:

(?<=^| )[a-z][0-9][a-z][- ]?[0-9][a-z][0-9](?= |$)
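If you want to experiment with this expression outside of Right Fielder, a quick Python check might look like the sketch below. Two things here are assumptions rather than Right Fielder behavior: Python's re module rejects the variable-width lookbehind (?<=^| ), so the preamble is rewritten as the equivalent (?<![^ ]) ("not preceded by a non-space character"), and matching is made case-insensitive. The helper name find_postal_codes is made up for the example.

```python
import re

# Python adaptation of the Canadian postal code expression.
# (?<![^ ]) stands in for the original preamble (?<=^| ).
CANADIAN_POSTAL = re.compile(
    r"(?<![^ ])[a-z][0-9][a-z][- ]?[0-9][a-z][0-9](?= |$)",
    re.IGNORECASE,  # assumption: letters are matched regardless of case
)

def find_postal_codes(text):
    """Return every substring that looks like a Canadian postal code."""
    return [m.group(0) for m in CANADIAN_POSTAL.finditer(text)]

print(find_postal_codes("Ottawa ON K1A 0B1"))
print(find_postal_codes("Toronto ON M5V3L9 Canada"))
print(find_postal_codes("XK1A 0B1"))  # glued to a longer string: no match
```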

Admittedly, regular expressions are a dark art and are not very easy to understand. Our best advice is to start with simple expressions and gradually add complexity, testing each addition before moving on. There are a few web sites and tools that can greatly ease this trial-and-error process. I use Rad Software’s Regular Expression Designer (http://www.radsoftware.com.au/regexdesigner/), but there are many others as well.

Say, for example, you have data that was run through OCR software. Unfortunately, in many cases 0’s and 1’s were accidentally recognized as O’s and l’s. We can enhance Right Fielder’s recognition of abominations such as “Ol234” with this expression:

(?<=^| )[0-9Ol]{5}[- ]?([0-9Ol]{4})?(?= |$)

This regular expression can be broken up into five parts:

(?<=^| )

This is a common preamble to most of our regular expressions. It indicates that there must be a break or delimiter of some sort preceding the zip code. This is used because we don’t want to accidentally recognize something that is actually at the tail of a longer string.

[0-9Ol]{5}

This indicates that we need to see any digit, an uppercase O or a lowercase l, and that we need to see five of them in a row.

[- ]?

This indicates that we might see a dash or a space (but we may not see either). The ? quantifier means that we can see zero or exactly one occurrence of the character.

([0-9Ol]{4})?

This indicates that we need to see any digit, an uppercase O or a lowercase l, four in a row. However, there’s a catch here: sometimes people omit the Plus 4, so the sub-expression is surrounded by parentheses and followed by the ? quantifier.

(?= |$)

Like the preamble, this indicates that a break or delimiter of some sort must follow the zip code.
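To test the five parts together, you could assemble them in Python as sketched below. As before, the preamble is rewritten as (?<![^ ]) because Python's re module does not accept the variable-width lookbehind; this is an adaptation for testing, not the form you would put in the cfg file. The names PARTS, OCR_ZIP and find_ocr_zip are invented for the example.

```python
import re

# The five parts from the walkthrough, with a Python-friendly preamble.
PARTS = [
    r"(?<![^ ])",      # preamble: start of string or a preceding space
    r"[0-9Ol]{5}",     # five digits, allowing OCR's O and l
    r"[- ]?",          # optional dash or space
    r"([0-9Ol]{4})?",  # optional Plus 4
    r"(?= |$)",        # postamble: a following space or end of string
]
OCR_ZIP = re.compile("".join(PARTS))

def find_ocr_zip(text):
    """Return the first zip-like match in text, or None."""
    match = OCR_ZIP.search(text)
    return match.group(0) if match else None

print(find_ocr_zip("Braintree MA Ol234"))
print(find_ocr_zip("Ol234-567O"))
print(find_ocr_zip("AOl234"))  # tail of a longer string, so no match
```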


Now that we have designed our regular expression, we need to add it to the PostalCodeRegEx table:

[PostalCodeRegEx]
5,(?<=^| )[0-9Ol]{5}[- ]?([0-9Ol]{4})?(?= |$)

When a regular expression finds a match, the match is removed from the input data, so a later expression (which may be more fitting) will not find the match. Thus, processing order is important. However, we can’t simply sort regular expressions like we do with lookup tables. Instead, you must provide a number (the 5 in our example) which indicates its place in the regular expression processing order. The lower the number, the sooner it is processed.

Generally, you want expressions that capture larger amounts of data to precede expressions that capture smaller amounts. Our ‘canned’ expressions start at 10 and increment by 10. There are usually not more than 5 or so expressions per table. This example ensures that this expression is the first to be evaluated. However, in this case, it is not likely that order would have made a difference, as it does not conflict with any of the existing expressions.
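A rough Python sketch of why order matters is shown below. It is a toy model of the behavior described above (lowest number first, each match removed before the next expression runs), not Right Fielder's implementation. The two phone patterns and the function run_in_order are invented for the illustration, and the preamble uses the Python-friendly form (?<![^ ]).

```python
import re

def run_in_order(entries, text):
    """Apply (order_id, pattern) entries from lowest id to highest.

    Each match is recorded and then cut out of the working text, so a later
    expression never sees data that an earlier one already claimed.
    """
    found = []
    for order_id, pattern in sorted(entries):
        def take(match, order_id=order_id):
            found.append((order_id, match.group(0)))
            return ""  # remove the match from the working text
        text = re.sub(pattern, take, text)
    return found, text

# Both patterns are made up for this example.
PHONE_WITH_EXT = r"(?<![^ ])\d{3}-\d{4} x\d+(?= |$)"
PHONE = r"(?<![^ ])\d{3}-\d{4}(?= |$)"

sample = "call 555-1234 x42"
long_first, _ = run_in_order([(5, PHONE_WITH_EXT), (10, PHONE)], sample)
short_first, _ = run_in_order([(5, PHONE), (10, PHONE_WITH_EXT)], sample)

print(long_first)   # the longer expression captures the extension too
print(short_first)  # the shorter one runs first and the extension is lost
```

When the shorter expression runs first, it removes the phone number and leaves the extension stranded, which is why expressions that capture more data should get the lower numbers.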

Experienced reg-exers may be concerned about the preamble and postamble, as it would appear that we’ve forgotten to include many delimiters (i.e., tab, pipe, carriage returns, etc.). For the purposes of regular expression processing, all delimiters are temporarily transformed into spaces, so the only things that your regular expression really needs to look for are spaces and the ‘start of string’ and ‘end of string’ indicators. There is one exception to this rule: the PreProcessRegEx table. Regular expressions in this table are processed on the raw data, before any transforms are done.
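The effect of that temporary transform can be sketched in Python as follows. The delimiter set used here (tab, pipe, carriage return, line feed) is an assumption taken from the examples in this paragraph; the exact list Right Fielder uses is not documented here. The names normalize_delimiters and ZIP are made up for the sketch.

```python
import re

# Assumed delimiter set; the real list may differ.
DELIMITERS = re.compile(r"[\t|\r\n]")

def normalize_delimiters(raw):
    """Replace every assumed delimiter with a single space."""
    return DELIMITERS.sub(" ", raw)

# A simple 5-digit zip expression that only looks for spaces and string ends.
ZIP = re.compile(r"(?<![^ ])[0-9]{5}(?= |$)")

raw = "Braintree|MA|02184\r\n"
print(ZIP.search(raw))                        # no match on the raw data
print(ZIP.search(normalize_delimiters(raw)))  # matches after the transform
```

This is why the PreProcessRegEx table is different: its expressions see the raw data, so they have to list the delimiters themselves.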

Case Study 3: Regular Expression for general data patterns

We briefly mentioned the PreProcessRegEx table in the previous example. This can be one of the most powerful tables at your disposal, as it allows you to address many character-based anomalies that you may see in your data. In addition, it is the only table that allows you to perform regular expression-based replacements.

Say your data sometimes contains a dash between the state and zip code (for example, “Braintree, MA-02184”). This data anomaly will befuddle Right Fielder’s state and zip code recognition abilities. However, we can easily fix this with an entry in the PreProcessRegEx table:

[PreProcessRegEx]
5,(?<=^| |,|[|]|\t|\r|\n)([a-z]{2,3})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |,|[|]|\t|\r|\n),$1 $2

The regular expression can be broken up into these parts:

(?<=^| |,|[|]|\t|\r|\n)

The regular expression’s preamble. Unlike the other tables, we must search for all sorts of delimiters and breaks, as pre-process regular expressions are matched against the raw input data.

([a-z]{2,3})

A two- or three-letter sequence (i.e., the state). Note that regular expressions are processed case-insensitively (though you can override this with the ?-i: option).

(?:[-]?)

The offending dash. We’ve used the non-capturing group construct (?: because we’ll be throwing this group away.

(\d{5}(?:-\d{4})?)

The Zip Code (and optional Plus 4).

(?=$| |,|[|]|\t|\r|\n)

The postamble.

When a match for the regular expression is found, the “$1 $2” is used to perform the replacement. Each numbered $ entry refers to a capture group. In our example, $1 is whatever is captured by the first capture group ([a-z]{2,3}), and $2 is whatever is captured by the second capture group (\d{5}(?:-\d{4})?). Non-capture groups (the ones starting with (?:) are not counted.
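The replacement can be tried out in Python with the sketch below. It is a test harness under stated assumptions, not a copy of Right Fielder's engine: the preamble is rewritten as a fixed-width negative lookbehind because Python's re module does not accept the original, and the $1-style references are translated into Python's own group syntax by a made-up helper, to_python_template.

```python
import re

# Python adaptation of the PreProcessRegEx entry. The lookbehind reads
# "not preceded by a non-delimiter character".
PATTERN = re.compile(
    r"(?<![^ ,|\t\r\n])([a-z]{2,3})(?:[-]?)(\d{5}(?:-\d{4})?)"
    r"(?=$| |,|[|]|\t|\r|\n)",
    re.IGNORECASE,
)

def to_python_template(replacement):
    """Convert $1-style group references into the form re.sub expects."""
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)

def preprocess(text, replacement="$1 $2"):
    """Apply the state/zip repair to raw input text."""
    return PATTERN.sub(to_python_template(replacement), text)

print(preprocess("Braintree, MA-02184"))
print(preprocess("Braintree, MA-02184-1234"))
```

Because the dash sits in a non-capturing group, it is simply absent from the output; only the text captured as $1 and $2 survives, joined by the space in the replacement.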

Case Study 4: Example mdRightFielder.cfg overrides

[PreProcessRegEx]
4,(?<=^| |[|]|\t|\r|\n)(.*)(a/o|A/O|c/o|C/O)(.*)(?=$| |[|]|\t|\r|\n),$1 | $3

This expression will allow you to identify ‘XYZ Corporation a/o Billy McMailreceiver’ as:

        Name1: Billy McMailreceiver
     Company1: XYZ Corporation

Without this expression the output will be:

     Company1: XYZ Corporation a/o Billy McMailreceiver
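What the pre-process step does to the raw string can be approximated in Python as below. The sketch only shows the text transformation (the a/o marker is swapped for a pipe delimiter); deciding which side is the name and which is the company remains Right Fielder's job. The lookbehind is adapted for Python, and the names CARE_OF and split_care_of are invented.

```python
import re

# Python adaptation of the a/o and c/o entry (fixed-width lookbehind).
CARE_OF = re.compile(
    r"(?<![^ |\t\r\n])(.*)(a/o|A/O|c/o|C/O)(.*)(?=$| |[|]|\t|\r|\n)"
)

def split_care_of(text):
    """Replace the a/o or c/o marker with a pipe and return the pieces."""
    piped = CARE_OF.sub(r"\g<1> | \g<3>", text)  # same idea as "$1 | $3"
    return [field.strip() for field in piped.split("|")]

print(split_care_of("XYZ Corporation a/o Billy McMailreceiver"))
print(split_care_of("XYZ Corporation"))  # no marker, so nothing changes
```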


[LeftToken]
; these entries help uncommon names get recognized as names
SMOKY THE BEAR,F
DONTRELL,F
BOGDAN,F

; many tokens which appear to identify departments are actually entered as Company identifiers by default.
; the following entries create distinct department identifiers, overriding tokens sometimes present in companies
IT DEPARTMENT,T
SALES DEPARTMENT,T
QA,T

[RightToken]
; this expands identification of unheard-of, changed, or vanity city names
ANYTOWN,T,,100
; alternate spelling or fictional country
SHIRE,I,,

[PhoneTypeToken]
; yesterday’s or tomorrow’s phone identifiers
MOBILE
PAGER
BLACKBERRY

Case Study 5: NOTES

When writing a regular expression entry in the cfg file, the only commas allowed are the ones that delimit the <id>, the <regEx> and the <replace>. This is how RightFielder parses the cfg file edits.

Example of a valid entry (the only commas are the field delimiters):

[PreProcessRegEx]
5,(?<=^| |[|]|\t|\r|\n)([a-z]{2})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |[|]|\t|\r|\n),$1 $2

Example of an invalid entry (the expression will be ignored because the regular expression itself contains commas):

[PreProcessRegEx]
5,(?<=^| |,|[|]|\t|\r|\n)([a-z]{2,3})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |,|[|]|\t|\r|\n),$1 $2

The <id> in the above example must be unique. If you create a new expression with the same <id> as another cfg edit or an existing mdRightFielder.dat entry, it will be ignored.

Existing pattern <id>s in the mdRightFielder.dat file start at 10 and increment by 10. To have your expression processed before an existing pattern, use a single-digit <id>; to add a lower-priority expression, use a higher multiple (one not already in use).
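The two rules in these notes (commas only as field delimiters, and unique ids) can be checked before you edit the cfg file with something like the Python sketch below. It assumes the entry is split naively on every comma, which is how the notes describe the parsing but is not a confirmed description of Right Fielder's parser. The function names, the sample entries, and the set of existing ids are all made up for the illustration.

```python
def parse_regex_entry(line):
    """Split '<id>,<regEx>[,<replace>]' on every comma, the naive way."""
    fields = line.split(",")
    if len(fields) not in (2, 3) or not fields[0].strip().isdigit():
        return None  # extra commas or a bad id: treat the entry as ignored
    replace = fields[2] if len(fields) == 3 else None
    return int(fields[0]), fields[1], replace

def load_entries(lines, existing_ids):
    """Keep only well-formed entries whose id is not already taken."""
    accepted = {}
    for line in lines:
        parsed = parse_regex_entry(line)
        if parsed is None:
            continue
        entry_id, regex, replace = parsed
        if entry_id in existing_ids or entry_id in accepted:
            continue  # duplicate id: ignored
        accepted[entry_id] = (regex, replace)
    return accepted

# Hypothetical entries: one comma-free, one with commas inside the regex,
# and one that reuses an id assumed to exist in mdRightFielder.dat.
good = r"5,(?<=^| |[|]|\t|\r|\n)([a-z]{2})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |[|]|\t|\r|\n),$1 $2"
bad = r"6,(?<=^| |,|[|]|\t|\r|\n)([a-z]{2,3})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |,|[|]|\t|\r|\n),$1 $2"
clash = "10,[0-9]{5}"

accepted = load_entries([good, bad, clash], existing_ids={10, 20, 30})
print(sorted(accepted))  # only the comma-free entry with a fresh id survives
```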


</body>

</html>