RightFielder Object:Config Example

From Melissa Data Wiki
Tim (talk | contribs)

Revision as of 21:29, 24 February 2014

mdRightFielder.cfg Case Studies

The mdRightFielder.cfg file is a plain text file that users can use to override the default entries contained in the mdRightFielder.dat data file.

For complete instructions on the tables and types that can be overridden, as well as syntax and examples, open mdRightFielder.cfg in a text editor and follow the instructions it contains.

There are 3 types of modifications that can be made in mdRightFielder.cfg:


Lookup Table Overrides

This is the addition of words (and phrases) to, or their removal from, the Object’s dictionaries. This essentially expands (or limits) Right Fielder’s vocabulary.

There are three lookup tables in Right Fielder Object:

  • LeftToken – used to recognize words and phrases that usually appear at the start of data (name, company, titles)
  • MiddleToken – used to recognize words and phrases that usually appear in the middle of the data (addresses, apartments, PO Boxes, etc.)
  • RightToken – used to recognize words and phrases that usually appear at the end of data (city, state, country)

Example: You're processing a database of car dealerships and the company recognition isn’t fielding car companies correctly. Processing is likely to be more accurate if you modify the LeftToken table in this way:

[LeftToken]
FORD,C
TOYOTA,C
CHEVY,C
NISSAN,C
KIA,C
HONDA,C
LINCOLN,C
ALFA ROMEO,C
MOTORS,C

Notes:

  • It is not necessary for the entries to be sorted.
  • These entries will override any existing dictionary entries (for example, ‘LINCOLN’ is by default a First Name indicator). The second field, containing the ‘C’, indicates what kind of word is being described (its ‘token’).
  • Only one token can be used per entry. Each table has different tokens that can be used in it; see mdRightFielder.cfg for details.
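
To make the override behaviour in these notes concrete, here is a minimal Python sketch that reads a [LeftToken] block into a word-to-token mapping. It is only an illustration: the function name load_left_tokens and the defaults dictionary are hypothetical, and this is not how Right Fielder itself loads the file.

```python
def load_left_tokens(cfg_text, defaults):
    """Return a copy of `defaults` with the [LeftToken] entries applied on top."""
    table = dict(defaults)
    in_section = False
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):      # skip blanks and ; comments
            continue
        if line.startswith("[") and line.endswith("]"):
            in_section = (line == "[LeftToken]")  # only read this one section
            continue
        if in_section:
            word, token = line.split(",", 1)      # one word/phrase, one token
            table[word.strip().upper()] = token.strip()
    return table


# Hypothetical stand-in for the default dictionary: LINCOLN as a first name.
defaults = {"LINCOLN": "F"}

cfg = """[LeftToken]
FORD,C
LINCOLN,C
ALFA ROMEO,C
"""

tokens = load_left_tokens(cfg, defaults)
print(tokens["LINCOLN"])  # the cfg entry replaces the default token
```

Entry order does not matter in this sketch, and a phrase such as ALFA ROMEO is stored as a single key, which mirrors the notes above.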

Regular Expression Overrides

This is the addition of regular expressions that are used to recognize specific character patterns (for example, phone numbers, e-mails, etc.).

Regular expressions are used to recognize postal codes, e-mail addresses, phone numbers and URLs. These types of data cannot be recognized reliably with a dictionary, but their pattern of numbers, letters and punctuation is very useful in identifying them. For example, Canadian Postal Codes always follow the pattern alpha-digit-alpha-digit-alpha-digit. They are recognized with the regular expression:

(?<=^| )[a-z][0-9][a-z][- ]?[0-9][a-z][0-9](?= |$)

  • Start with simple expressions and gradually add complexity.
  • Test each addition before moving on.
  • Use third party Regex builder tools to greatly ease this trial and error process.
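
One way to follow this advice is to try an expression against a handful of sample strings before adding it to the cfg file. The sketch below does that in Python for the Canadian Postal Code expression. Two things in it are assumptions rather than Right Fielder behaviour: Python's re module only accepts fixed-width lookbehind, so the (?<=^| ) preamble is rewritten as the equivalent (?<![^ ]), and matching is assumed to be case-insensitive.

```python
import re

# (?<![^ ]) means "not preceded by a non-space character", i.e. the match
# starts at the beginning of the string or right after a space. It stands in
# for (?<=^| ), which Python rejects as a variable-width lookbehind.
CANADIAN_POSTAL = re.compile(
    r"(?<![^ ])[a-z][0-9][a-z][- ]?[0-9][a-z][0-9](?= |$)",
    re.IGNORECASE,  # assumption: case is ignored when matching
)


def find_postal_codes(text):
    """Return every substring of `text` that the expression recognizes."""
    return [m.group(0) for m in CANADIAN_POSTAL.finditer(text)]


# Each sample pairs an input with the matches we expect to get back.
samples = {
    "Ottawa ON K1A 0B1": ["K1A 0B1"],  # normal case, space separator
    "k1a-0b1": ["k1a-0b1"],            # lowercase, dash separator
    "XK1A0B1": [],                     # tail of a longer string: rejected
    "K1A 0B12": [],                    # extra trailing digit: rejected
}

for text, expected in samples.items():
    assert find_postal_codes(text) == expected, text
```

Growing the samples table alongside the expression makes it easy to see which addition broke a previously working case.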

Say, for example, you have data that was run through OCR software. Unfortunately, in many cases 0’s and 1’s were accidentally recognized as O’s and l’s. We can enhance Right Fielder’s recognition of mangled ZIP codes such as “Ol234” with this expression:

(?<=^| )[0-9Ol]{5}[- ]?([0-9Ol]{4})?(?= |$)

This regular expression can be broken up into 5 parts:

(?<=^| ) This is a common preamble to most of our regular expressions. It indicates that a break or delimiter of some sort must precede the ZIP code. We use it because we don’t want to accidentally recognize something that is actually the tail of a longer string.

[0-9Ol]{5} This indicates that we need to see any digit, an uppercase O or a lowercase l, and that we need to see 5 of them in a row.

[- ]? This indicates that we might see a dash or a space (but we may not see either). The ? quantifier means that we can see 0 or exactly 1 occurrences of the character.

([0-9Ol]{4})? This indicates that we need to see any digit, an uppercase O or a lowercase l, and that we need to see 4 of them in a row. However, there’s a catch here: sometimes people omit the Plus 4, so the sub-expression is surrounded by parentheses and followed by the ? quantifier.

(?= |$) Like the preamble, this indicates that a break or delimiter of some sort must follow the ZIP code.
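
The five parts can be checked together with a short Python sketch. As an adaptation for Python's re module, which only accepts fixed-width lookbehind, the (?<=^| ) preamble is written as the equivalent (?<![^ ]). The helper name match_zip is hypothetical.

```python
import re

# Preamble, five characters, optional separator, optional Plus 4, postamble.
OCR_ZIP = re.compile(r"(?<![^ ])[0-9Ol]{5}[- ]?([0-9Ol]{4})?(?= |$)")


def match_zip(text):
    """Return (whole match, Plus 4 group) for the first hit, or None."""
    m = OCR_ZIP.search(text)
    return (m.group(0), m.group(1)) if m else None


# Five characters only: the optional Plus 4 group stays empty (None).
assert match_zip("Ol234") == ("Ol234", None)

# With a dash and a Plus 4 that also contains an OCR error.
assert match_zip("Boston MA Ol234-567O") == ("Ol234-567O", "567O")

# Preceded by a letter, so the preamble rejects it.
assert match_zip("AOl234") is None

# Six characters in a row: neither a ZIP nor a ZIP+4, so the postamble rejects it.
assert match_zip("0l2345") is None
```

Note that the expression only recognizes the mangled value; turning “Ol234” back into digits is a separate step that this sketch does not attempt.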

Pattern Table Overrides

This is the addition or removal of patterns of words and phrases. Words and phrases are first identified via the Lookup Tables and assigned tokens (specified in the Lookup Table itself). Sequences of tokens (patterns) are then matched against entries in this table and transformed into output data.


Example mdRightFielder.cfg Overrides

[PreProcessRegEx]
4,(?<=^| |[|]|\t|\r|\n)(.*)(a/o|A/O|c/o|C/O)(.*)(?=$| |[|]|\t|\r|\n),$1 | $3

The above expression will allow you to identify ‘XYZ Corporation a/o Billy McMailreceiver’ as

Name1: Billy McMailreceiver
Company1: XYZ Corporation

Without this expression the output will be…

Company1: XYZ Corporation a/o Billy McMailreceiver
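
The text rewrite performed by this entry can be reproduced with a small Python sketch. It covers only the substitution step; how Right Fielder then assigns Name1 and Company1 is not shown here. Two adaptations are assumptions made for Python: the variable-width lookbehind is rewritten as (?:^|(?<=[ |\t\r\n])), and $1 | $3 is spelled \1 | \3.

```python
import re

# Same three groups as the cfg entry: text before the marker, the marker
# itself (a/o or c/o in either case), and the text after it.
PATTERN = re.compile(
    r"(?:^|(?<=[ |\t\r\n]))(.*)(a/o|A/O|c/o|C/O)(.*)(?=$| |[|]|\t|\r|\n)"
)
REPLACEMENT = r"\1 | \3"  # keep groups 1 and 3, drop the marker


def preprocess(text):
    """Replace the first a/o or c/o marker with a pipe separator."""
    return PATTERN.sub(REPLACEMENT, text, count=1)


rewritten = preprocess("XYZ Corporation a/o Billy McMailreceiver")

# Split on the pipe and trim the leftover spaces around each half.
parts = [p.strip() for p in rewritten.split("|")]
print(parts)  # ['XYZ Corporation', 'Billy McMailreceiver']
```

After the rewrite the company and the person sit on opposite sides of a pipe, which is presumably what lets them be fielded separately.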


[LeftToken]
;these entries help uncommon names get recognized as names
SMOKY THE BEAR,F
DONTRELL,F
BOGDAN,F


; many tokens that appear to identify departments are entered as Company identifiers by default.
; the following entries create distinct department identifiers, overriding tokens sometimes present in companies
IT DEPARTMENT,T
SALES DEPARTMENT,T
QA,T


[RightToken]
; this expands identification of unheard-of, changed, or vanity city names
ANYTOWN,T,,100
; alternate spelling or fictional country
SHIRE,I,,


[PhoneTypeToken]
; yesterday's or tomorrow's phone identifiers
MOBILE
PAGER
BLACKBERRY

NOTES on cfg overrides

When writing a regular expression in the cfg file, the only commas allowed are the ones that delimit the <id>, the <regEx> and the <replace>. This is how RightFielder parses the cfg file edits.

Example of a valid entry (no commas inside the expression):

[PreProcessRegEx]
5,(?<=^| |[|]|\t|\r|\n)([a-z][a-z][a-z]?)(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |[|]|\t|\r|\n),$1 $2

Example of an invalid entry (the expression contains commas in |,| and {2,3}, so it will be ignored):

[PreProcessRegEx]
5,(?<=^| |,|[|]|\t|\r|\n)([a-z]{2,3})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |,|[|]|\t|\r|\n),$1 $2

The <id> in the above example must be unique. If you create a new expression with the same <id> as another cfg edit or an existing mdRightFielder.dat entry, it will be ignored.
Existing pattern <id>s in the mdRightFielder.dat file start at 10 and increment by 10. To override an existing pattern, use a single-digit <id>; to add a lower-priority pattern, use a higher <id>.
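
Both rules (exactly two delimiting commas, and a unique <id>) are easy to check before loading the file. The Python sketch below is a hypothetical pre-flight check, not part of Right Fielder: the function name check_regex_entries and the reserved_ids values are assumptions, with the reserved ids simply standing in for the multiples of 10 used by mdRightFielder.dat.

```python
def check_regex_entries(lines, reserved_ids=()):
    """Return (line, problem) pairs for entries that would likely be ignored."""
    problems = []
    seen = set(reserved_ids)
    for line in lines:
        commas = line.count(",")
        if commas != 2:  # <id>,<regEx>,<replace> needs exactly two commas
            problems.append((line, f"expected 2 commas, found {commas}"))
            continue
        entry_id = line.split(",", 1)[0].strip()
        if entry_id in seen:
            problems.append((line, "id already in use"))
        seen.add(entry_id)
    return problems


entries = [
    # Two commas, unused id: passes both checks.
    r"4,(?<=^| |[|]|\t|\r|\n)(.*)(a/o|A/O|c/o|C/O)(.*)(?=$| |[|]|\t|\r|\n),$1 | $3",
    # Extra commas inside the expression (|,| twice and {2,3}).
    r"5,(?<=^| |,|[|]|\t|\r|\n)([a-z]{2,3})(?:[-]?)(\d{5}(?:-\d{4})?)(?=$| |,|[|]|\t|\r|\n),$1 $2",
    # Well-formed, but reuses an id assumed to exist in the .dat file.
    r"10,(.*),$1",
]

problems = check_regex_entries(entries, reserved_ids={"10", "20", "30"})
for line, problem in problems:
    print(problem, "->", line)
```

Counting commas is a deliberately blunt test; it matches the rule as stated above and makes no attempt to validate the expression itself.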