


Melissa Data Components for Pentaho® is a collection of data cleansing services built as components for Pentaho.

Data Quality Tasks

The Data Quality Components for Pentaho are designed from the ground up to handle many Data Quality-related tasks in three major categories:

Category 1

The first category covers the basics of Contact Data Quality, that is, Address Checking, Geocoding, Phone Verification, Name Parsing, Email Validation, and SmartMover (NCOALink) change-of-address processing. These data enrichment and cleansing capabilities are available in the first release.

Category 2

The second category of tools for Pentaho is in the Data Matching area, specifically the Melissa Data flagship MatchUp component, which organizes data records into identifiable duplicate groups and links or merges related records within or across data sets. This transform leverages powerful fuzzy matching algorithms (Jaro, Jaro-Winkler, n-Gram) and the ability to understand common data types found in contact data, such as addresses, nicknames, full names, and company names. MatchUp enhances the efficiency and effectiveness of your database by letting you eliminate duplicate customer and prospect records using customizable, user-specified criteria, giving you a single, accurate view of each customer.
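To make the fuzzy matching idea concrete, the following is a minimal sketch of textbook Jaro, Jaro-Winkler, and character n-gram similarity scores. It is not MatchUp's implementation, which this page does not describe; the function names, the 0.1 prefix scale, and the use of the Dice coefficient for n-grams are assumptions chosen for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match only if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def ngram_similarity(s1: str, s2: str, n: int = 2) -> float:
    """Dice coefficient over the sets of character n-grams (an assumed variant)."""
    grams1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    grams2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    if not grams1 and not grams2:
        return 1.0 if s1 == s2 else 0.0
    if not grams1 or not grams2:
        return 0.0
    return 2 * len(grams1 & grams2) / (len(grams1) + len(grams2))


if __name__ == "__main__":
    # Classic illustration pair: a single transposed character pair.
    print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
    print(ngram_similarity("night", "nacht"))          # 0.25
```

In a deduplication setting, scores like these would typically be compared against a user-chosen threshold per field (for example, name or company) to decide whether two records belong to the same duplicate group.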

Category 3

The third category of Data Quality in the product, available soon, will be the Data Profiler transform. This component analyzes individual and multiple columns to determine relationships between columns and tables. The purpose of data profiling is to develop a clearer picture of the content of your data and to examine whether your existing data sources meet the organization’s quality standards. Some of the features of Profiler include:

  • Column Profiling: identifies problems in your data, such as invalid dates, and reports average, minimum, and maximum statistics for numeric columns.
  • Value Distribution: identifies all values in each selected column and reports normal and outlier values in a column.
  • Column Pattern Distribution: identifies invalid strings or irregular expressions in your data.

Used together, the Data Quality Components for Pentaho cover all the facets of Data Quality as defined by Gartner™, and can be used by organizations large and small to realize the immediate benefits of a Data Quality regimen.
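The three profiling tasks described above can be illustrated with a short sketch. This is not the Data Profiler transform itself, whose interface is not described here; the function names, the `%Y-%m-%d` date format, and the `9`/`A` pattern notation are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def numeric_stats(values):
    """Column profiling: minimum, maximum, and average of a numeric column."""
    nums = [float(v) for v in values]
    return {"min": min(nums), "max": max(nums), "avg": mean(nums)}


def invalid_dates(values, fmt="%Y-%m-%d"):
    """Column profiling: return the values that do not parse as dates in fmt."""
    bad = []
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            bad.append(v)
    return bad


def value_distribution(values):
    """Value distribution: count how often each distinct value occurs."""
    return Counter(values)


def pattern_of(value):
    """Reduce a string to its shape: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def pattern_distribution(values):
    """Column pattern distribution: count how often each shape occurs."""
    return Counter(pattern_of(v) for v in values)


if __name__ == "__main__":
    print(numeric_stats(["10", "20", "60"]))
    print(invalid_dates(["2023-01-15", "2023-02-30", "n/a"]))
    print(value_distribution(["CA", "CA", "NY"]).most_common())
    print(pattern_distribution(["92688", "9268A", "10001"]).most_common())
```

Rare patterns in the last report (here, a postal code containing a letter) are the kind of irregular strings a pattern distribution is meant to surface.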

Features and Benefits

Generalized Cleansing

Corrects data values to meet specific business standards, customer business rules, or relationship constraints.

Parsing, Standardization, and Verification

Parses and restructures data into a common format to build more consistent data, such as standardizing addresses to USPS® specifications, or to custom-defined values and patterns specific to a particular business need. Also verifies that addresses actually exist.


Adds value to customer data by attaching additional data from other sources, including latitude/longitude coordinates, demographic data, full name parsing, phone number verification, and email validation.


Automates real-time processes to detect when data exceeds pre-set limits so you can immediately recognize and correct issues before the quality of your data declines.

Additional Features

  • Delivers a single view of the customer
  • Accesses and integrates any data source, including large data files and big data
  • Supports data quality and MDM initiatives
  • Lowest cost of ownership