Pentaho:Profiler:Analysis Options

From Melissa Data Wiki




The Analysis Options tab enables or disables individual profiling calculations. Each enabled analysis adds processing time, so disabling the ones you do not need shortens the profiling run.

Analysis Options

Sort Analysis
This option enables or disables the analysis of any prevailing sort order in each profiled column. Sort analysis can increase profiling time, and the penalty grows geometrically as more records are added. If you are not interested in this statistic, disable it to decrease your profiling time.
MatchUp Analysis
This option enables or disables duplicate record detection. Duplicate analysis increases the profiling time by under 5% and ProfileData profiling time by about 30%.
RightFielder Analysis
This option enables or disables the analysis of each profiled column's inferred data type (e.g., Full Name, Address). This analysis is responsible for the Inconsistent Data and Inferred Data Type statistics. It increases the profiling time by under 10%.
Data Aggregation
This option enables or disables all forms of aggregation and value gathering (e.g., averages, median, quartiles). Any statistic that cannot be determined incrementally (for example, median or population standard deviation) is determined via aggregation. This analysis is also responsible for all value tables (Frequency, Pattern, SoundEx, etc.). All iterators and data aggregation statistics depend on this analysis. It increases profiling time by over 90%.
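
The distinction between incremental and aggregated statistics can be illustrated with a minimal Python sketch. It is not the Profiler's implementation, and the names `RunningMean`, `MedianAggregator`, and `profile_column` are hypothetical. It only shows why a statistic such as the median requires gathering values, whereas a running statistic can be updated one record at a time.

```python
import statistics


class RunningMean:
    """Incremental statistic: updated per record, no values are retained."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value):
        self.count += 1
        # Standard running-mean update; memory use stays constant.
        self.mean += (value - self.mean) / self.count


class MedianAggregator:
    """Aggregated statistic: every value must be kept until the end."""

    def __init__(self):
        self.values = []

    def add(self, value):
        # Memory use grows with the number of records profiled.
        self.values.append(value)

    def median(self):
        return statistics.median(self.values)


def profile_column(column):
    """Hypothetical helper: feed one column through both kinds of statistic."""
    running = RunningMean()
    aggregator = MedianAggregator()
    for value in column:
        running.add(value)
        aggregator.add(value)
    return running.mean, aggregator.median(), len(aggregator.values)


if __name__ == "__main__":
    mean, median, retained = profile_column([3, 1, 4, 1, 5])
    print(f"mean={mean:.2f} median={median} values retained={retained}")
```

In this sketch the running mean needs only two numbers regardless of input size, while the median aggregator holds every value it has seen. That difference in retained data is one plausible reason an aggregation step costs more than the incremental statistics.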


Setup Options

The Setup Options are not required. They are used purely for documentation purposes and will have no impact on profiling results.

Table Name
This function sets the table name for a particular run.
User Name
This function sets the user name for a particular run.
Job Name
This function sets the job name for a particular run.
Job Description
This function sets the job description for a particular run.