Pentaho:MatchUp:Options
The Options tab allows you to specify which of the available output results you want returned.
Output Columns
In addition to the source data, you will often want to output processing information about the disposition of a record. This allows you to analyze the results in a number of ways.
- Result Codes
- This field displays the results of the comparison: whether the record is unique or a duplicate, whether it was suppressed or intersected, which matchcode combination resulted in a match, and so on. See MatchUp Result Codes for a list of possible Result Codes that the component can return.
- Dupe Group
- Each group of matching records is assigned a sequential unique group number. This field displays the group number that the record, whether unique or a member of a duplicate grouping, is in.
- Dupe Count
- This field displays the number of matching records in each Dupe Group.
- Matchcode Key
- Every record has a matchkey built for it, based on the matchcode and matchcode components used to process the source table(s). This key, a representation of the record, is what is used in deduping. This field is populated with the key and is useful for analyzing the output results.
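To make the relationship between the matchkey, Dupe Group, and Dupe Count concrete, here is a minimal Python sketch. It is not the component's implementation: the `simple_key` function is a hypothetical stand-in that compares records exactly, whereas a real matchcode may use fuzzy comparisons, and the `DupeGroup` and `DupeCount` field names are chosen for illustration only. It also assumes a unique record gets a count of 1.

```python
from collections import defaultdict


def simple_key(rec):
    # Hypothetical matchkey: an exact, case-insensitive name + address key.
    return (rec["Name"].upper(), rec["Address"].upper())


def assign_dupe_groups(records, key_func=simple_key):
    """Give each record a sequential group number and its group's size."""
    group_numbers = {}          # matchkey -> sequential group number
    counts = defaultdict(int)   # matchkey -> number of records sharing the key
    for rec in records:
        key = key_func(rec)
        if key not in group_numbers:
            group_numbers[key] = len(group_numbers) + 1
        counts[key] += 1
    return [
        {**rec,
         "DupeGroup": group_numbers[key_func(rec)],
         "DupeCount": counts[key_func(rec)]}
        for rec in records
    ]


rows = [
    {"Name": "John Doe", "Address": "123 Main St"},
    {"Name": "Jane Roe", "Address": "9 Elm Ave"},
    {"Name": "john doe", "Address": "123 MAIN ST"},
]
grouped = assign_dupe_groups(rows)
```

In this sketch the first and third rows share group 1 with a count of 2, and the second row is alone in group 2.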
Lookup Options
*This section is greyed out when there is only a single upstream source connected to the MatchUp Component.
When a second upstream data source is connected to the Lookup pin, that data source will be used as a filter.
- List Suppress
- Source pin records that match any record from the Lookup Pin will not be returned with an Output Result Code. They will be marked as Suppressed.
- List Intersect
- Only Source pin records that match any record from the Lookup Pin will be returned with an Output Result Code (Unique or Has Duplicate result code). The second to nth Source records that match the Lookup record will be marked with a Duplicate Result Code. Source pin records that do not match any Lookup pin records will be returned with a Non-Intersected Result Code.
- No Purge
- Source pin records are not matched against other Source pin records. In other words, a suppressed group or an intersected group will still be returned as suppressed or intersected, but each record will have its own Dupe Group number.
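The difference between List Suppress and List Intersect can be sketched in Python as follows. This is an illustration under stated assumptions, not the component's logic: matching is reduced to an exact key comparison, the `apply_lookup` function and its `mode` argument are hypothetical, and the label `"Output"` stands in for whichever Unique or Has Duplicate result code the component would return. No Purge is not modeled.

```python
def apply_lookup(source, lookup, key_func, mode):
    """Label each Source record according to a Lookup filter.

    mode is "suppress" or "intersect" (hypothetical names for this sketch).
    """
    lookup_keys = {key_func(rec) for rec in lookup}
    seen = set()   # keys of Source records already given an output label
    labeled = []
    for rec in source:
        key = key_func(rec)
        matched = key in lookup_keys
        if mode == "suppress" and matched:
            label = "Suppressed"
        elif mode == "intersect" and not matched:
            label = "Non-Intersected"
        elif key in seen:
            label = "Duplicate"      # second to nth record with this key
        else:
            label = "Output"         # stands for Unique / Has Duplicate
            seen.add(key)
        labeled.append((rec, label))
    return labeled


source_rows = ["A", "B", "A"]
lookup_rows = ["A"]
suppressed = [label for _, label in
              apply_lookup(source_rows, lookup_rows, str.upper, "suppress")]
intersected = [label for _, label in
               apply_lookup(source_rows, lookup_rows, str.upper, "intersect")]
```

With the same inputs, suppress mode filters the matching "A" records out, while intersect mode keeps only them.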
Golden Record
The Golden Record Selection Option is used for selecting the best record among a group of duplicate records as your remaining Master Record. This gives you a single accurate representation of each entity in your data. For example, take the following group of duplicate records:
Name | Address | Last Update |
---|---|---|
John Doe | 123 Main St | 10/25/2003 |
John Doe | 123 Main St | 4/16/2012 |
John Doe | 123 Main St | 8/6/2008 |
We can prioritize the selection of the Golden Record by latest update. The selected remaining record will be:
Name | Address | Last Update |
---|---|---|
John Doe | 123 Main St | 4/16/2012 |
The component provides pre-defined algorithms that are commonly used to select the Golden Record, or you can write your own custom expressions for flexible rule generation.
There are four algorithms, each of which can be modified by clicking the ".." button to its right.
Multiple algorithms can be selected to evaluate the Golden Record. The algorithms are ordered from 1 to 4 in order of priority. Select an algorithm and click either the up arrow or down arrow to change this order. Golden Record selection is first evaluated by the first selected algorithm in the list. If a Golden Record cannot be determined using the first selected algorithm, selection automatically cascades down to the second selected algorithm to re-evaluate the ties. The component continues to cascade through the selected algorithms until a single Golden Record can be determined or until all the selected algorithms have been applied.
The four algorithms are Last Updated, Most Complete, Data Quality Score, and Custom Expression.
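The cascading evaluation can be illustrated with a short Python sketch. It is a simplified model rather than the component's code: each algorithm is represented by a hypothetical scoring function where a higher score is better, and if records remain tied after every algorithm has run, the sketch simply returns the first remaining record, which is an assumption.

```python
from datetime import datetime


def last_updated(rec):
    # Newer dates score higher.
    return datetime.strptime(rec["Last Update"], "%m/%d/%Y")


def most_complete(rec):
    # Longer concatenated name + address scores higher.
    return len(rec["Name"] + rec["Address"])


def select_golden(group, algorithms):
    """Apply algorithms in priority order, keeping only the tied leaders."""
    candidates = list(group)
    for score in algorithms:
        if len(candidates) == 1:
            break  # a single Golden Record has been determined
        best = max(score(rec) for rec in candidates)
        candidates = [rec for rec in candidates if score(rec) == best]
    return candidates[0]  # assumption: first record wins an unresolved tie


records = [
    {"Name": "Joe Doe", "Address": "123 Main St", "Last Update": "4/16/2012"},
    {"Name": "Joseph Doe", "Address": "123 Main St", "Last Update": "4/16/2012"},
    {"Name": "J. Doe", "Address": "123 Main St", "Last Update": "8/6/2008"},
]
golden = select_golden(records, [last_updated, most_complete])
```

Here the first two records tie on the newest date, so the second algorithm breaks the tie in favor of the longer name.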
Last Updated
This algorithm allows for selecting the Golden Record based on the newest or oldest Date/Time.
- "Last Updated" Column
- Input requires a valid column of type Date/Time (DT_Date).
- Oldest/Newest Date
- Choose whether the Golden Record should be the record with the oldest or the newest date.
- Example
- For the following data:
Name | Address | Last Update |
---|---|---|
John Doe | 123 Main St | 10/25/2003 |
John Doe | 123 Main St | 4/16/2012 |
John Doe | 123 Main St | 8/6/2008 |
- Selecting the Newest Date for the “Last Update” column will yield the following results:
Name | Address | Last Update | Results |
---|---|---|---|
John Doe | 123 Main St | 4/16/2012 | MS02 |
John Doe | 123 Main St | 8/6/2008 | MS03 |
John Doe | 123 Main St | 10/25/2003 | MS03 |
- The record containing MS02 is the selected Golden Record.
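The Last Updated example can be reproduced with a small Python sketch. The `mark_last_updated` function is hypothetical and assumes dates arrive as month/day/year strings; the component itself expects a Date/Time column. The MS02 and MS03 codes are used as in the example, with MS02 marking the Golden Record.

```python
from datetime import datetime


def mark_last_updated(group, column="Last Update", newest=True):
    """Tag the newest (or oldest) record MS02 and all others MS03."""
    parsed = [datetime.strptime(rec[column], "%m/%d/%Y") for rec in group]
    pick = max if newest else min
    golden_index = parsed.index(pick(parsed))
    return [
        {**rec, "Results": "MS02" if i == golden_index else "MS03"}
        for i, rec in enumerate(group)
    ]


data = [
    {"Name": "John Doe", "Address": "123 Main St", "Last Update": "10/25/2003"},
    {"Name": "John Doe", "Address": "123 Main St", "Last Update": "4/16/2012"},
    {"Name": "John Doe", "Address": "123 Main St", "Last Update": "8/6/2008"},
]
newest_first = mark_last_updated(data, newest=True)
oldest_first = mark_last_updated(data, newest=False)
```

Unlike the table in the example, this sketch keeps the input order and does not sort the output.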
Most Complete
This algorithm allows for selecting the Golden Record based on the completeness/length of information.
This option allows you to select multiple columns for evaluation. In a group of duplicate records, the Golden Record is the record whose selected columns form the longest concatenated string.
Example:
For the following data:
Name | Address |
---|---|
Joseph Doe | 123 Main St |
Joe Doe | 123 Main St |
J. Doe | 123 Main St |
Selecting the Name and Address columns will yield the following results:
Name | Address | Results |
---|---|---|
Joseph Doe | 123 Main St | MS02 |
Joe Doe | 123 Main St | MS03 |
J. Doe | 123 Main St | MS03 |
The record containing MS02 is the selected Golden Record.
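As a rough Python sketch of this rule, the hypothetical `mark_most_complete` function below concatenates the selected columns and compares lengths. Treating missing or empty values as empty strings is an assumption, as is giving the first record the win on a tie.

```python
def mark_most_complete(group, columns):
    """Tag the record with the longest concatenated columns as MS02."""
    lengths = [
        len("".join(rec.get(col) or "" for col in columns))
        for rec in group
    ]
    golden_index = lengths.index(max(lengths))  # first record wins a tie
    return [
        {**rec, "Results": "MS02" if i == golden_index else "MS03"}
        for i, rec in enumerate(group)
    ]


people = [
    {"Name": "Joseph Doe", "Address": "123 Main St"},
    {"Name": "Joe Doe", "Address": "123 Main St"},
    {"Name": "J. Doe", "Address": "123 Main St"},
]
marked = mark_most_complete(people, ["Name", "Address"])
```

The concatenated lengths are 21, 18, and 17 characters, so "Joseph Doe" is marked MS02.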
Data Quality Score
This algorithm allows for selecting the Golden Record based on the quality of your data. The Data Quality Score must be used in tandem with other Melissa Data Components.
- Column Containing Result Codes
- Input requires a valid column containing the Result Codes returned by a Melissa Data Component.
Rule
There are a total of 6 rules that can be utilized.
- Check the box next to the rule(s) that you wish to use.
- Rules are also ordered from 1-6 in order of priority. Select the rule and click either the up arrow or down arrow to change this order.
- Data Quality Score
- The Data Quality Score evaluates all possible results combined from Address, Email, Name, Phone and GeoCode in a single column.
- Address Quality Score
- The Address Quality Score evaluates your Address Results and selects the record with the best address composition and deliverability.
- Name Quality Score
- The Name Quality Score evaluates your Name Results and selects the record whose name is most likely that of a real person.
- Phone Quality Score
- The Phone Quality Score evaluates your Phone Results and selects the record with the most valid 10-digit phone number.
- Email Quality Score
- The Email Quality Score evaluates your Email Results and selects the record with the most correct and deliverable email address.
- GeoCode Quality Score
- The GeoCode Quality Score evaluates your GeoCode Results and selects the record with the highest level of GeoCoding accuracy.
Custom Expression
The Custom Expression Algorithm allows you to set up a custom expression of your own design.
Expression
- Use a pre-built expression
- To use a pre-built expression, select one from the drop-down list. You may also remove the selected pre-built expression by clicking the self-titled button.
- Use the specified expression
- Alternatively, you may create your own expression in this field.
- Clicking the test button will check your custom expression for correct syntax.
- To save your custom built expression, click the "Save the above expression as a new Pre-Built Expression..." button.
Precedence
Choose one of the two radio buttons here to designate which result takes precedence when two results are compared.
- The result with the Lowest Value
- The result with the Highest Value
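The effect of the Precedence setting can be sketched in Python as follows. The component's expression syntax is not shown here; a plain Python function stands in for the custom expression, and the `pick_by_expression` function, the `precedence` argument, and the `Score` field are all hypothetical.

```python
def pick_by_expression(group, expression, precedence="highest"):
    """Return the record whose expression result takes precedence."""
    pick = max if precedence == "highest" else min
    return pick(group, key=expression)


candidates = [
    {"Name": "John Doe", "Score": 3},
    {"Name": "Jon Doe", "Score": 7},
    {"Name": "J. Doe", "Score": 5},
]


def score_expression(rec):
    # Stand-in for a custom expression evaluated per record.
    return rec["Score"]


highest = pick_by_expression(candidates, score_expression, "highest")
lowest = pick_by_expression(candidates, score_expression, "lowest")
```

The same expression therefore selects a different Golden Record depending on which radio button is chosen.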
Expression Elements
Double-clicking any element in the expression elements section will insert it into the "Use the specified expression:" field.
For more information, see Expression Elements.
The following elements are available for your use: