Pentaho:MatchUp:Options

From Melissa Data Wiki
Jump to navigation Jump to search

← Data Quality Components for Pentaho

MatchUp Navigation
Overview
Tutorial
Advanced Configuration
On-Premise
MatchUp Tabs
Matchcode
Field Mapping
Options
Source Pass-Through Columns
Lookup Pass-Through Columns
Output Filter
Matchcode Editor
Matchcode List
Component List
Component Properties
Algorithms
Expression Elements
Matchcodes
Result Codes



The Options tab allows you to specify which of the available output results you want returned.



Output Columns

In addition to the source data, you will often want to output processing information about the disposition of a record. This allows you to analyze the results in a number of ways.

Result Codes
This field displays the results of the comparison, whether the record is unique or is a duplicate, was suppressed or intersected, which matchcode combination resulted in a match, etc. See MatchUp Result Codes for a list of possible Result Codes that the component can return.
Dupe Group
Each group of matching records is assigned a sequential unique group number. This field displays the group number that the record, whether unique or a member of a duplicate grouping, is in.
Dupe Count
This field displays the number of matching records in each Dupe Group.
Matchcode Key
Based on the matchcode and matchcodes component used to process the source table(s), every record has a matchkey built. It is this key, a representation of the record, that is used in deduping. This field will be populated with the key and is useful in analyzing the output results.


Lookup Options

*This section is GREYED-OUT When there is only a single upstream source connected to the MatchUp Component.

When a second upstream data source is connected to the Lookup pin, that data source will be used as a filter.

List Suppress
Source pin records that match any record from the Lookup Pin will not be returned with an Output Result Code. They will be marked as Suppressed.
List Intersect
Only Source pin records that match any record from the Lookup Pin will be returned with an Output Result Code (Unique or Has Duplicate result code). The second to nth Source records that match the Lookup record will be marked with a Duplicate Result Code. Source pin records that do not match any Lookup pin records will be returned with a Non-Intersected Result Code.
No Purge
Source pin records that match other Source pin records will not be matched. In other words, a suppressed group or an intersected group will be returned as suppressed or intersected, but each record will have their own Dupe Group number.


Golden Record

The Golden Record Selection Option is used for selecting the best record amongst a group of duplicate records as your remaining Master Record. This allows for having a single accurate representation of each entity in your data. Thus in a group of duplicate records for example:


Name Address Last Update
John Doe 123 Main St 10/25/2003
John Doe 123 Main St 4/16/2012
John Doe 123 Main St 8/6/2008

We can prioritize the selection of the Golden Record by latest update. The selected remaining record will be:


Name Address Last Update
John Doe 123 Main St 4/16/2012

The component allows for using pre-defined algorithms that are commonly used to select the Golden Record, or you can also write your own custom expressions for flexible rule generation.



There are four algorithms that can be modified by clicking the ".." button to the right of its respective algorithm.

Multiple algorithms can be selected to evaluate the Golden Record. The algorithms are ordered from 1-4 in order of priority. Select the rule and click either the up arrow or down arrow to change this order. Golden Record selection will first be evaluated by the first selected algorithm in the list. If a Golden Record cannot be determined using the first selected algorithm, it will automatically cascade down to the second selected algorithm to re-evaluate the ties. The component will continue to cascade through all selected algorithms until a single golden record can be evaluated or until all the selected algorithms have been applied.

The four algorithms are Last Updated, Most Complete, Data Quality Score, and Custom Expression.


Last Updated

This algorithm allows for selecting the Golden Record based on the newest or oldest Date/Time.



"Last Updated" Column
Input requires a valid column of type Date/Time (DT_Date).
Oldest/Newest Date
Select whether the Golden Record should be selected based on which one has the Oldest/Newest Date.
Example
For the following data:
Name Address Last Update
John Doe 123 Main St 10/25/2003
John Doe 123 Main St 4/16/2012
John Doe 123 Main St 8/6/2008
Selecting the Newest Date for the “Last Update” column will yield the following results:


Name Address Last Update Results
John Doe 123 Main St 4/16/2012 MS02
John Doe 123 Main St 8/6/2008 MS03
John Doe 123 Main St 10/25/2003 MS03


The record containing MS02 is the selected Golden Record.


Most Complete

This algorithm allows for selecting the Golden Record based on the completeness/length of information.



The options allows for selecting multiple columns for evaluation. In a group of duplicate records, the Golden Record will be selected based on the selected columns has the longest concatenated string.

Example:

For the following data:

Name Address
Joseph Doe 123 Main St
Joe Doe 123 Main St
J. Doe 123 Main St


Selecting the Name and Address columns will yield the following results:


Name Address Results
Joseph Doe 123 Main St MS02
Joe Doe 123 Main St MS03
J. Doe 123 Main St MS03

The record containing MS02 is the selected Golden Record.


Data Quality Score

This algorithm allows for selecting the Golden Record based on the Quality of Your Data. The Data Quality Score is required to be used in tandem with other Melissa Data Components.



Column Containing Result Codes
Input requires a valid column containing the Result Codes returned by a Melissa Data Component.


Rule

There are a total of 6 rules that can be utilized.

  • Check the box next to the rule(s) that you wish to use.
  • Rules are also ordered from 1-6 in order of priority. Select the rule and click either the up arrow or down arrow to change this order.
Data Quality Score
The Data Quality Score evaluates all possible results combined from Address, Email, Name, Phone and GeoCode in a single column.
Address Quality Score
The Address Quality Score evaluates your Address Results and selects which one has the best address composition and most deliverable.
Name Quality Score
The Name Quality Score evaluates your Name Results and selects which one has the most probable real person’s name.
Phone Quality Score
The Phone Quality Score evaluates your Phone Results and selects which one has the most valid 10-digit phone number.
Email Quality Score
The Email Quality Score evaluates your Email Results and selects which one has the most correct and deliverable email address.
GeoCode Quality Score
The GeoCode Quality Score evaluates your GeoCode Results and selects which one has the highest level of GeoCoding accuracy.


Custom Expression

The Custom Expression Algorithm allows you to set up a custom expression of your own design.



Expression

Use a pre-built expression
To use a pre-built expression, select one from the drop-down list. You may also remove the selected pre-built expression by clicking the self-titled button.
Use the specified expression
Alternatively, you may create your own expression in this field.
  • Clicking the text button will check your custom expression for correct syntax.
  • To save your custom built expression, click the "Save the above expression as a new Pre-Built Expression..." button.

Precedence

Choose one of the two radio buttons here to designate which result takes precendence over another when comparing two results.

  • The result with the Lowest Value
  • The result with the Highest Value

Expression Elements

Double-clicking any element in the expression elements section will insert it into the "Use the specified expression:" field.

For more information, see Expression Elements.

The following elements are available for your use: