Pentaho:Profiler:Output Pins

From Melissa Data Wiki
Jump to navigation Jump to search

← Data Quality Components for Pentaho

Profiler Navigation
Overview
Tutorial
Advanced Configuration
Profiler Tabs
Input Fields
Analysis Options
Output
Output Pins
Result Codes


Unlike the other Melissa Data components for Pentaho, the Profiler Component has a set of 11 possible output pins. Each output pin will contain a static set of output columns which cannot be edited or modified.

PENT Profiler OutputPins.png

The following are the possible output pins for the Profiler Component:


Column-Based Counts

The output fields associated with Column-Based Counts provides analysis and information based on the selected input column as a whole.

The following are all the output fields returned for Column –Based Counts and their description.

Column Name Description
ColumnName Name of the Input Column
ColumnType This is the actual Data Type for that column as defined in the SSSI Pipeline.

Here is a list of all possible Column Types:

ProfilerColumnType Enumerations

Value Name Description
1 ColumnTypeInt1 8 bit signed integer
2 ColumnTypeInt2 16 bit signed integer
3 ColumnTypeInt4 32 bit signed integer
4 ColumnTypeInt8 64 bit signed integer
5 ColumnTypeUInt1 8 bit unsigned integer
6 ColumnTypeUInt2 16 bit unsigned integer
7 ColumnTypeUInt4 32 bit unsigned integer
8 ColumnTypeUInt8 64 bit unsigned integer
9 ColumnTypeReal4 32 bit floating point number
10 ColumnTypeReal8 64 bit floating point number
11 ColumnTypeNumeric Numeric Value
12 ColumnTypeDecimal Numeric Value
13 ColumnTypeCurrency Numeric Value
14 ColumnTypeFixedMBCSString Fixed-length multi-byte string
15 ColumnTypeVariableMBCSString Variable-length multi-byte string
16 ColumnTypeFixedUnicodeString Fixed-length Unicode string
17 ColumnTypeVariableUnicodeString Variable length Unicode string
18 ColumnTypeDate Date
19 ColumnTypeDBDate Database date
20 ColumnTypeDBTime Database Time
21 ColumnTypeDBTime2 Database Time
22 ColumnTypeDBTimeStamp Time stamp
23 ColumnTypeDBTimeStamp2 Time stamp
24 ColumnTypeDBTimeStampOffset Date/Time offset
25 ColumnTypeFileTime File Date/Time
26 ColumnTypeBoolean Boolean Value
27 ColumnTypeGUID GUID value
28 ColumnTypeBytes Byte Array
29 ColumnTypeImage Image Array
DataType This is the Expected Data Type selected in the “Expected Contents” on the Input Field Tab of the Profiler Component.

Here is a list of all possible Data Types:

ProfilerDataType Enumerations

Value Name Description
1 DataTypeFullName Full name
2 DataTypeInverseName Inverse ordered name (Last – First)
3 DataTypeNamePrefix Name prefix
4 DataTypeFirstName First name
5 DataTypeMiddleName Middle name
6 DataTypeLastName Last name
7 DataTypeNameSuffix Name Suffix
8 DataTypeTitle Title / Department
9 DataTypeCompany Company or Organization
10 DataTypeAddress Street address
11 DataTypeCity City Name
12 DataTypeStateOrProvince US state or Canadian province
13 DataTypeZipOrPostalCode US zip code or Canadian postal code
14 DataTypeCityStateZip City/State/Zip combined
15 DataTypeCountry Country Name
16 DataTypePhone Phone number
17 DataTypeEmail Email address
18 DataTypeString Generic string value
19 DataTypeNumeric Generic numeric value
20 DataTypeDateMDY Generic Date/Time(formatted Month Day Year)
21 DataTypeDateYMD Generic Date/Time(formatted Year Month Day)
22 DataTypeDateDMY Generic Date/Time(formatted Day Month Year)
23 DataTypeBoolean Generic Boolean Value
ColumnSize Returns the specified Column Type set in the Advanced Options.
ColumnPrecision Returns the specified Column Precision in the Advanced Options.
ColumnScale Returns the specified Column Scale in the Advanced Options.
ValueRangeFrom Returns the specified Upper Bound for the Value Ranges in the Advanced Options.
ValueRangeTo Returns the specified Lower Bound for the Value Ranges in the Advanced Options.
DefaultValue Returns the specified Default Value in the Advanced Options.
CustomPatterns Returns the specified RegEx Pattern set in the Advanced Options.
InferredDataType This function returns a column’s inferred data type. The inferred data type is used to determine if a prevalent data type is seen for the majority of values in this column. For a deviant value to be returned (i.e., a value that differs from the user-specified data type), the count of that detected data type must exceed all other detected data type counts by at least 20%.

Here is a list of all possible Inferred Data Types:

ProfilerDataType Enumerations

Value Name Description
1 DataTypeFullName Full name
2 DataTypeInverseName Inverse ordered name (Last – First)
3 DataTypeNamePrefix Name prefix
4 DataTypeFirstName First name
5 DataTypeMiddleName Middle name
6 DataTypeLastName Last name
7 DataTypeNameSuffix Name Suffix
8 DataTypeTitle Title / Department
9 DataTypeCompany Company or Organization
10 DataTypeAddress Street address
11 DataTypeCity City Name
12 DataTypeStateOrProvince US state or Canadian province
13 DataTypeZipOrPostalCode US zip code or Canadian postal code
14 DataTypeCityStateZip City/State/Zip combined
15 DataTypeCountry Country Name
16 DataTypePhone Phone number
17 DataTypeEmail Email address
18 DataTypeString Generic string value
19 DataTypeNumeric Generic numeric value
20 DataTypeDateMDY Generic Date/Time(formatted Month Day Year)
21 DataTypeDateYMD Generic Date/Time(formatted Year Month Day)
22 DataTypeDateDMY Generic Date/Time(formatted Day Month Year)
23 DataTypeBoolean Generic Boolean Value
Sortation Returns the column’s natural sortation. This is the sortation order seen in the values as they were input. In order for a column to be considered near-sorted, no more than 10% of the input values must be out of order.

This column returns one of the following enumerations.

Enum Value Sortation Type Description
0 SortUnknown No sortation detected.
1 SortStringAscending Values are sorted ascending, using a string comparison.
2 SortStringDescending Values are sorted descending, using a string comparison.
3 SortNumericAscending Values are sorted ascending, using a numeric comparison.
4 SortNumericDescending Values are sorted descending, using a numeric comparison.
5 SortDateAscending Values are sorted ascending, using date/time comparison.
6 SortDateDescending Values are sorted descending, using date/time comparison.
SortationPercent Percentage of how well a column is sorted. This is only reported for columns where GetColumnSortation returned a value other than SortUnknown. The sortation percentage is determined by counting the number of re-ordering values that would be required to put the list of values into a sorted state, and then dividing this value by the worst-case value (i.e., re-ordering required for a reverse-sorted list.)
MostPopularCount Number of records that have the most popular count or most repeating values
DistinctCount Number of Distinct Values
UniqueCount Number of Unique Values
DefaultValuleCount Number of records that contains the Default Value set in the Advanced Options
BelowRangeCount Number of records that are below the lower bounds set in the Advanced Options
AboveRangeCount Number of records that are above the lower bounds set in the Advanced Options
AboveSizeCount Number of records that have values higher than the ColumnSize set in the Advanced Options
AbovePrecisionCount Number of records that have values higher than the Precision set in the Advanced Options
AboveScaleCount Number of records that have values higher than the Scale set in the Advanced Options
InvlidRegExCount Number of records that did not pass the Regular Expression set in the Advanced Options
EmptyCount Number of records with Empty Values
NullCount Number of records with NULL Values
InvlidDataCount Number of records that do not match the specified detected Column Type (not the Expected Data Type)
InvlidUTF8Count Number of records that are not valid UTF-8
NonPrintingCharCount Number of records with non-printable characters
DiacriticCharCount Number of records with Diacritics
ForeignCharCount Number of records with foreign language character sets
AlphaOnlyCount Number of records that only have English alphabet letters
NumericOnlyCount Number of records with numbers only
AlphaNumericCount Number of records with English alphabet letters and numbers only
UpperCaseOnlyCount Number of records that have all upper case letters
LowerCaseOnlyCount Number of records that have all lower case letters
MixedCaseCount Number of records that have mixed Upper and Lower casing
SingleSpaceCount Number of records that have multiple words separated only by single spaces
MultiSpaceCount Number of records that have multiple words separated by more than one space at least once.
LeadingSpaceCount Number of records that have a leading space
TrailingSpaceCount Number of records that have a trailing space
MaxSpaces Maximum number of spaces between words
MinSpaces Minimum number of spaces between words
TotalSpaces Number of spaces between words only. Not including leading and trailing spaces.
TotalWordBreaks Number of words breaks found
AvgSpaces Average Number of spaces found between words
DecorationCharCount Number of decorative characters (comma, pipe, tab, or double quote) at the end or beginning of the string.
ProfanityCount Number of records with profanities
InconsistentDataCount Number of records which doesn’t match the specified Expected Data Type
StringMaxValue Returns the highest string value (as sorted A-Z). Not to be confused with String Max Length.
StringMinValue Returns the lowest string value (as sorted A-Z). Not to be confused with String Min Length.
StringQ1Value The first Quartile value for strings. Strings are first ordered from lowest to highest value, then the Quartile is selected.
StringMedValue The second Quartile value for strings. Strings are first ordered from lowest to highest value, then the Quartile is selected.
StringQ3Value The third Quartile value for strings. Strings are first ordered from lowest to highest value, then the Quartile is selected.
StringMaxLength Length of the longest string.
StringMinLength Length of the shortest string.
StringAvgLength Average length for all strings
StringQ1Length Length of the first quartile string value
StringMedLength Length of the second quartile string value
StringQ3Length Length of the third quartile string value
WordMaxValue Returns the highest word value (as sorted A-Z). Not to be confused with word Max Length.
WordMinValue Returns the lowest word value (as sorted A-Z). Not to be confused with word Max Length.
WordQ1Value The first Quartile value for all words. All Words are first ordered from lowest to highest value, then the Quartile is selected.
WordMedValue The second Quartile value for all words. All Words are first ordered from lowest to highest value, then the Quartile is selected.
WordQ3Value The third Quartile value for all words. All Words are first ordered from lowest to highest value, then the Quartile is selected.
WordMaxLength Length of the longest word
WordMinLength Length of the shortest word
WordAvgLength Average length for all words
WordQ1Length Length of the first quartile word value
WordMedLength Length of the second quartile word value
WordQ3Length Length of the third quartile word value
MaxWords Returns a columns maximum number of words
MinWords Returns a columns minimum number of words
AvgWords Returns a columns average number of words
NumericMaxValue Returns the highest numeric value (as sorted numerically).
NumericMinValue Returns the lowest numeric value (as sorted numerically).
NumericAvgValue Average Value for all numbers in a column
NumericQ1Value The first Quartile value for all numbers. All numbers are first ordered from lowest to highest value, then the Quartile is selected.
NumericQ1IntValue The first Quartile integer (decimals removed) value for all numbers. All numbers are first ordered from lowest to highest value, then the Quartile is selected.
NumericMedValue The second Quartile value for all numbers. All numbers are first ordered from lowest to highest value, then the Quartile is selected.
NumericMedIntValue The secon Quartile integer (decimals removed) value for all numbers. All numbers are first ordered from lowest to highest value, then the Quartile is selected.
NumericQ3Value The third Quartile integer value for all numbers. All numbers are first ordered from lowest to highest value, then the Quartile is selected.
NumericQ3IntValue The third Quartile integer (decimals removed) value for all numbers. All numbers are first ordered from lowest to highest value, then the Quartile is selected.
NumericStdDevValue Standard Deviation Value
DateMaxValue Returns the latest date value (as sorted by dates).
DateMinValue Returns the oldest date value (as sorted by dates).
DateAvgValue Average Date Value
DateQ1Value The first Quartile value for all dates. All numbers are first ordered from oldest to latest date, then the Quartile is selected.
DateMedValue The second Quartile value for all dates. All numbers are first ordered from oldest to latest date, then the Quartile is selected.
DateQ3Value The third Quartile value for all dates. All numbers are first ordered from oldest to latest date, then the Quartile is selected.
TimeMaxValue Returns the latest time value (as sorted by time).
TimeMinValue Returns the oldest time value (as sorted by time).
TimeAvgValue Average Time Value
TimeQ1Value The first Quartile value for all times. All numbers are first ordered from oldest to latest time, then the Quartile is selected.
TimeMedValue The second Quartile value for all times. All numbers are first ordered from oldest to latest time, then the Quartile is selected.
TimeQ3Value The third Quartile value for all times. All numbers are first ordered from oldest to latest time, then the Quartile is selected.
DateTimeNoCenturyCount Number of records that contain no century indicator. (eg. 19-01-63 instead of 19-01-1963)
NameInconsistentOrderCount Number of records that don’t correspond to the specified expected name order. (eg. “Smith, John” for a column set to a data type of DataTypeFullName)
NameMultipleNameCount Number of records with multiple names (eg. John and Jane Smith)
NameSuspiciousCount Number of records with suspicious names like “Mickey Mouse”
StateCount Number of records with US States, both standardized and unstandardized
ProvinceCount Number of records with Canadian Provinces, both standardized and unstandardized
StateProvinceNonStandardCount Number of records that contain non-standardized US States of Canadian Provinces
StateProvinceInvalidCount Number of records that contain invalid US States or Canadian Provinces
ZipCodeCount Number of records containing US Zip Codes
Plus4Count Number of records containing full Zip Plus4 Codes
ZipCodeInvalidCount Number of records that contain invalid ZIP Codes
PostalCodeCount Number of records containing Canadian Postal Codes
PostalCodeInvalidCount Number of records that contain invalid Canadian Postal Codes
ZipCodePostalCodeInvalidCount Number of records that contain invalid US Zips or Canadian Postal Codes
StateZipCodeMismatchCount Number of records where the Zip Code does not match the US State.
ProvincePostalCodeMismatchCount Number of records where the Postal Code does not match the Canadian Province
CountryNonStandardCount Number of records that contain a country in a non-ISO Standard Countries
CountryInvalidCount Number of records that contain non-recognized country names
EmailSyntaxCount Number of records with invalid Email Syntax
EmailMobileDomainCount Number of records with Emails with mobile domains
EmailMispelledDomainCount Number of records with misspelled Email domains
EmailSpamtrapDomainCount Number of records with spamtrap Email domains
EmailDisposableDomainCount Number of records with disposable Email domains
PhoneInvalidCount Number of records with non-US or non-Canadian Phone Numbers.


Date-Time Frequencies

This output pin returns each distinct value for a column with Date/Time and the counts for each distinct value. Date/time values are gathered for all column types (for example, Strings), not just Date/Time data types.

Column Name Description
ColumnName Name of the Input Column
Value All distinct values for Dates.
Count Number of times a distinct Date appears in the profiled data for this particular column.


Length Frequencies

This output pin allows you to step through each distinct value length for a specific column.

Column Name Description
ColumnName Name of the Input Column
Length All distinct lengths for each value.
Count Number of times a distinct length appears in the profiled data for this particular column.


Overall Record Counts

This output pin returns counts based on all selected columns for profiling as a whole record.

Column Name Description
RecordCount Number of records processed
EmptyCount Number of records with empty values for all columns. An empty value is not Null, can contain spaces, and has no string or value.
NullCount Number of null records
ExactMatchDistinctCount Number of exact matching records. Casing and punctuation will be ignored.
ExactMatchDupesCount Number of exact matching duplicates. Casing and punctuation will be ignored.
ExactMatchLargestGroup Returns the size of the largest group of duplicate records using an Exact Match matching criteria. All data fields provided will be compared. Casing and punctuation will be ignored.
ContactMatchDistinctCount Returns the number of distinct records using Address, First Name and Last Name matching criteria.

The Contact Match matching rules used are:

Zip5 AND First Name AND Last Name AND Street Number AND Street Name (All have to match) OR Zip5 AND First Name AND Last Name AND PO Box (All have to match)

Note: The Profiler will parse data present in a single field such as an AddressLine or a FullName for matching purposes. (e.g., From an Address field containing “12 Main Street,” the Profiler will take the “12” as the Street Number and “Main” as the Street Name for its matching needs.)

ContactMatchDupesCount Number of duplicate records using Contact Match described under ContactMatchDistinctCount
ContactMatchLargestGroup Returns the size of the largest group of duplicates using Contact Match described under ContactMatchDistinctCount
HoueholdMatchDistinctCount Returns the number of distinct records using Address, and Last Name matching criteria.

The Household Match matching rules used are:

Zip5 AND Last Name AND Street Number AND Street Name (All have to match) OR Zip5 AND Last Name AND PO Box (All have to match)

Note: The Profiler will parse data present in a single field such as an AddressLine or a FullName for matching purposes. (e.g, From an Address field containing “12 Main Street,” the Profiler will take the “12” as the Street Number and “Main” as the Street Name for its matching needs.)

HoueholdMatchDupesCount Number of duplicate records using the Household Match matching strategy described under HouseholdMatchDistinctCount.
HoueholdMatchLargestGroup Returns the size of the largest group of duplicate records using the Household Match matching strategy described under HouseholdMatchDistinctCount.
AddressMatchDistinctCount Returns the number of distinct records using Address only matching criteria.

The Address Match matching rules used are:

Zip5 AND Street Number AND Street Name (All have to match) OR Zip5 AND PO Box (both have to match)

Note: The Profiler will parse data present in a single field such as an AddressLine or a FullName for matching purposes. (e.g., From an Address field containing “12 Main Street,” the Profiler will take the “12” as the Street Number and “Main” as the Street Name for its matching needs.)

AddressMatchDupesCount Number of duplicate records using Address Match matching strategy decribed under HouseholdMatchDistinctCount.
AddressMatchLargestGroup Returns the size of the largest group of duplicate records using Address Match matching strategy decribed under HouseholdMatchDistinctCount.


PassThru & Result Codes

This output pin will include all of the original contents that have been selected as “Pass Through” in the Input Field Tab. This output pin will also include a Results Code Field for all columns that have been selected to output “Results” in the Input Field Tab:

PENT Profiler InputFields.png

For more information on result codes, see Result Code Use.


Pattern & RegEx Frequencies

This output pin allows you to step through each distinct value pattern for a specific column.

Column Name Description
ColumnName Name of the Input Column
Value The distinct Pattern Value representation for all records in a column
RegEx The distinct Regular Expression representation for all records in a column. The regular expressions are 'tuned' for your specified data type. Where appropriate, letters or numbers may be lumped together using the 'one or more' quantifier.
Example Returns an example value represented by the pattern. This will always be the first value encountered while profiling.
Count Number of times the distinct patterns appears in the profiled data for this particular column.


Profile Session

This output pin returns information about the processed job. This information can be stored and used to identify information such as what table was profiled, who did the profiling, what version of the profiler was used, the start and end time of the job, as well as the job name and description.

Column Name Description
BuildNo Current Build of the Profiler Component
ProfileStart Start Time of the processed profiling job
ProfileEnd End Time of the processed profiling job
TableName Name of the Table being processed as specified in the Analysis Options Tab
UserName Name of the User processing the job as specified in the Analysis Options Tab
JobName Name of the Job as specified in the Analysis Options Tab
JobDesription Description of the Job as specified in the Analysis Options Tab


Sound-Alike Frequencies

This output pin allows you to step through each distinct SoundEx pattern for a specific column.

Column Name Description
ColumnName Name of the Input Column
Value The distinct SoundEx pattern value representation for all records in a column
Example Returns an example value represented by the SoundEx pattern. This will always be the first value encountered while profiling.
Count Number of times the distinct SoundEx patterns appears in the profiled data for this particular column.


Value Frequencies

This output pin allows you to step through each distinct value for a specific column.

Column Name Description
ColumnName Name of the Input Column
Value Returns a column's distinct value
Count Number of times the distinct value appeared in the profiled data for this particular column.


Work Length Frequencies

This output pin allows you to step through each distinct word length for a specific column.

Column Name Description
ColumnName Name of the Input Column
Length Returns a column's distinct word length
Count Number of times the distinct word length appears in the profiled data for this particular column.


Word Value Frequencies

This output pin allows you to step through each distinct word for a specific column.

Column Name Description
ColumnName Name of the Input Column
Value Returns a column's distinct word
Count Number of times the distinct word appears in the profiled data for this particular column. Multiple occurrences of the same word in a single record will all be counted.