Pentaho/Contact Zone:Profiler

From Melissa Data Wiki

Profiler Overview

Melissa Data’s Profiler is a component that analyzes the data in a table and reports statistics at varying levels of detail. Using these statistics, you can make educated decisions about which strategies to employ when handling the data.


Supported Data Profiling Techniques

Discovery

The analysis of new data before it is inserted into a Data Warehouse. This analysis is used to ensure that the data is correctly fielded, consistently formatted, standardized, etc. Because it can be very difficult to fix problems once data has been merged into a Data Warehouse, it is critical that issues are detected and eradicated prior to the merge.
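
As an illustration of the kind of statistics a discovery pass can gather before a load, the following is a minimal Python sketch. It is not the Profiler's implementation or API; the function name `profile_column`, the statistics chosen, and the sample batch are all hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Return basic statistics for one column of an incoming batch (hypothetical helper)."""
    # Treat None and whitespace-only strings as missing values.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    lengths = [len(str(v)) for v in present]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "most_common": Counter(present).most_common(1),
    }


# Hypothetical postal code column from a batch that has not been loaded yet.
batch = ["92688", "92688", "", None, "9268", "02134-1000"]
stats = profile_column(batch)
```

A spread between `min_length` and `max_length`, or a nonzero `missing` count, is the sort of signal that suggests the column needs attention before the merge.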

Monitoring

The continual analysis of warehoused data in an effort to ensure a consistent quality of data. In systems where records are actively inserted, updated and deleted, it is nearly impossible to maintain a comprehensive set of business rules that foresees every situation. In addition, in systems that support multiple methods of access (e.g., web, desktop, tablet/phone), it can be difficult to ensure that all program code adequately enforces all business rules.

Columns and Data Types

The Profiler is designed to work with a variety of column types, and analyzes data to ensure that it adheres to the limitations imposed by the user-specified type.

  • Numeric: Integers (8, 16, 32 or 64-bit), Floats (single or double), Decimal and Currency.
  • String: Unicode and Multi-byte, either fixed- or variable-length.
  • Date and/or Time, of varying resolutions.
  • Boolean
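
To make the idea of checking data against a user-specified type concrete, here is a minimal Python sketch. It assumes signed integers, a maximum string length measured in characters, ISO-style dates, and a small set of accepted boolean spellings. None of these choices come from the Profiler itself, and every function name is hypothetical.

```python
from datetime import datetime

# Inclusive value ranges for signed integers of each width (signedness is an assumption).
INT_RANGES = {
    bits: (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) for bits in (8, 16, 32, 64)
}


def fits_integer(text, bits):
    """True if text parses as an integer that fits in the given bit width."""
    try:
        value = int(text.strip())
    except ValueError:
        return False
    low, high = INT_RANGES[bits]
    return low <= value <= high


def fits_string(text, max_length):
    """True if text is no longer than the declared column length."""
    return len(text) <= max_length


def fits_date(text, fmt="%Y-%m-%d"):
    """True if text is a valid calendar date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


def fits_boolean(text):
    """True if text is one of the assumed boolean spellings."""
    return text.strip().lower() in {"true", "false", "0", "1"}
```

For example, `fits_integer("128", 8)` is false because 128 exceeds the signed 8-bit maximum of 127, and `fits_date("2023-02-29")` is false because 2023 is not a leap year.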

Data Analysis Summary

Deep data analysis is performed on several levels:

  • General Formatting analysis is used to determine if the input data ‘looks’ like what is expected.
  • Content analysis relies on reference data to determine if the input data contains information consistent with what is expected.
  • Field analysis determines if the input data is consistently fielded, using the data contained in the entire record to analyze the context of the data.
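
The three levels above can be sketched in Python as follows. This is an illustrative approximation only: the regular expressions, the tiny `KNOWN_STATES` reference set, the record layout, and all function names are assumptions made for the example, not the Profiler's actual rules or reference data.

```python
import re

# Hypothetical reference data; a real content check would use a far larger set.
KNOWN_STATES = {"CA", "NY", "TX"}

ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def format_ok(zip_code):
    """Formatting analysis: does the value look like a US ZIP or ZIP+4 code?"""
    return ZIP_PATTERN.fullmatch(zip_code) is not None


def content_ok(state):
    """Content analysis: does the value appear in the reference data?"""
    return state.upper() in KNOWN_STATES


def misfielded(record):
    """Field analysis: name the fields whose value looks like it belongs elsewhere."""
    problems = []
    # An email address in the name field suggests inconsistent fielding.
    if EMAIL_PATTERN.fullmatch(record.get("name", "")):
        problems.append("name")
    # A postal code in the state field suggests the columns were shifted.
    if ZIP_PATTERN.fullmatch(record.get("state", "")):
        problems.append("state")
    return problems


record = {"name": "jane@example.com", "state": "CA", "zip": "92688"}
flagged = misfielded(record)
```

Note that a value can pass one level and fail another: "ZZ" has the shape of a state abbreviation but fails the content check, while the record above has well-formed values and is still flagged because an email address sits in the name field.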