Pentaho:FAQ

From Melissa Data Wiki


Manual Installation

You can manually install the components into Pentaho by copying the desired Melissa Data zip archive into the data-integration/plugins/steps directory and extracting it there. Once extracted, the component should have a folder hierarchy like MDPlugin\ with all JAR files, configuration files, and other libraries directly under the extracted folder. Unzipping sometimes duplicates the directory (for example, MDPlugin\MDPlugin\); if that happens, move the inner folder's contents up one level.
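The duplicated-folder case can be handled with a small shell helper. This is a minimal sketch, assuming a Linux shell; the function name `flatten_plugin_dir` and the paths in the usage comment are hypothetical, and MDPlugin is only the example folder name used above.

```shell
#!/bin/sh
# Hypothetical helper: if extraction produced <steps>/<plugin>/<plugin>/,
# collapse it so the plugin files sit directly under <steps>/<plugin>/.
flatten_plugin_dir() {
  steps_dir="$1"   # .../data-integration/plugins/steps
  plugin="$2"      # top-level plugin folder name, e.g. MDPlugin
  outer="$steps_dir/$plugin"
  nested="$outer/$plugin"

  [ -d "$outer" ] || { echo "not found: $outer" >&2; return 1; }

  if [ -d "$nested" ]; then
    tmp="$steps_dir/$plugin.tmp.$$"
    # Move the inner folder aside, then replace the outer folder with it.
    mv "$nested" "$tmp" || return 1
    # rmdir only succeeds if the outer folder held nothing else.
    if ! rmdir "$outer"; then
      mv "$tmp" "$nested"
      return 1
    fi
    mv "$tmp" "$outer"
  fi
}

# Usage (hypothetical paths):
#   unzip -q MDPlugin.zip -d /opt/pentaho/data-integration/plugins/steps
#   flatten_plugin_dir /opt/pentaho/data-integration/plugins/steps MDPlugin
```

The helper does nothing when the layout is already correct, and it refuses to flatten if the outer folder contains anything besides the duplicated inner folder.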

Once the component is in the directory, start Spoon, find the component, place it on the data flow palette, and run it for the first time to configure it (enter the License Key, set data paths, etc.).

If you cannot start the Spoon GUI because you are running a non-GUI install, first install the components on a system where you can run the GUI, then copy the .kettle directory from

(C:\Users\Account\.kettle in Windows or /home/User/.kettle in Linux)

to the non-GUI install.
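Once the configured .kettle directory has been transferred to the non-GUI machine (for example with scp or a shared drive), putting it in place might look like the sketch below. The function name `copy_kettle_dir` is hypothetical, and the backup step is an added precaution rather than something the components require.

```shell
#!/bin/sh
# Hypothetical helper: place a configured .kettle directory into the home
# directory of the account that runs Kettle, backing up any existing one.
copy_kettle_dir() {
  src="$1"         # configured .kettle directory from the GUI machine
  dest_home="$2"   # home directory on the non-GUI install

  [ -d "$src" ] || { echo "source not found: $src" >&2; return 1; }

  if [ -e "$dest_home/.kettle" ]; then
    # Keep the previous directory under a timestamped name.
    mv "$dest_home/.kettle" "$dest_home/.kettle.bak.$(date +%Y%m%d%H%M%S)" || return 1
  fi
  # -R copies recursively, -p preserves modes and timestamps.
  cp -Rp "$src" "$dest_home/.kettle"
}

# Usage (hypothetical paths):
#   copy_kettle_dir /tmp/transfer/.kettle /home/User
```

Because mdProps.prop records the data files path, that setting may need adjusting if the target machine uses a different directory layout.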


What is contained in the .kettle directory?

On the first run of the components, Melissa Data libraries and configurations are placed into the .kettle directory. Some files of note:

  1. mdProps.prop - contains the License Key, URL endpoints, data files path, and other configurations the components use.
  2. COMPONENT_pre_built_rules.xml - contains various pre-built filters for each of the components. These filters are built from the component's result codes (http://wiki.melissadata.com/index.php?title=Result_Code_Details) and logical operators (AND, OR).
  3. MD directory - contains the underlying cleansing engines and JAR settings files for the components. You will notice both x86 and x64 libraries there, as well as .SO libraries for Linux and .DLL libraries for Windows.
  4. matchup - contains MatchUp data files and configurations.
  5. matchup\MatchUpEditor.exe - the matchcode editor program for creating your own matching rules. Note that this only runs on Windows; however, a MatchCode (.mc) file can be created there and then copied to Linux servers.
  6. matchup\mdMatchup.mc - the matchcode file that contains all of the matching rules. The MatchUp component reads this file so that you can choose the desired matchcode algorithm.
  7. matchup\mdMatchup.cfg - MatchUp Object configuration file. This file is used to override the default entries from the stock mdMatchUp lookup tables contained in the mdMatchUp.dat data file.
  8. matchup\mdMatchup.dat - lookup tables for the MatchUp engine to use.
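A quick way to confirm that the first run populated the directory is to check for the files listed above. The sketch below is illustrative only: the function name `check_kettle_dir` is hypothetical, and the file names are taken exactly as spelled in the list, so on Linux (where casing matters) they may need adjusting to match what is actually on disk.

```shell
#!/bin/sh
# Hypothetical helper: report which of the expected Melissa Data files are
# missing from a .kettle directory. Returns non-zero if anything is missing.
check_kettle_dir() {
  kettle_dir="${1:-$HOME/.kettle}"
  missing=0

  for f in mdProps.prop matchup/mdMatchup.mc matchup/mdMatchup.cfg matchup/mdMatchup.dat; do
    if [ ! -f "$kettle_dir/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done

  # The MD directory holds the engines and native libraries.
  if [ ! -d "$kettle_dir/MD" ]; then
    echo "missing: MD/"
    missing=$((missing + 1))
  fi

  [ "$missing" -eq 0 ]
}

# Usage:
#   check_kettle_dir              # checks $HOME/.kettle
#   check_kettle_dir /path/to/.kettle
```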


I have created a custom MatchCode, how do I use that on other machines?

The easiest way is to take the mdMatchup.mc file from the matchup directory inside the hidden .kettle directory on your system and copy it to the same location on the desired machines.
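On a Linux target, dropping the transferred file into place could be sketched as follows. The function name `install_matchcode` is hypothetical, and keeping a .bak copy of the previous matchcode file is an added precaution, not part of the product's procedure.

```shell
#!/bin/sh
# Hypothetical helper: install a custom matchcode file into an existing
# .kettle/matchup directory, keeping a backup of the file it replaces.
install_matchcode() {
  new_mc="$1"                        # custom .mc file copied from the source machine
  kettle_dir="${2:-$HOME/.kettle}"   # .kettle directory on this machine
  target="$kettle_dir/matchup/mdMatchup.mc"

  [ -f "$new_mc" ] || { echo "not found: $new_mc" >&2; return 1; }
  [ -d "$kettle_dir/matchup" ] || { echo "no matchup directory in $kettle_dir" >&2; return 1; }

  if [ -f "$target" ]; then
    # Nothing to do when the files are byte-for-byte identical.
    if cmp -s "$new_mc" "$target"; then
      echo "already up to date"
      return 0
    fi
    cp -p "$target" "$target.bak" || return 1
  fi
  cp -p "$new_mc" "$target"
}

# Usage (hypothetical path):
#   install_matchcode /tmp/transfer/mdMatchup.mc
```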


How do I run the processes locally (not through your web services)?

Most of the components can be run using local data and processing. By design, MatchUp and the Name verification engine and data are always local. Address, Name, Phone, Email, Geocoordinates, and IP address data can be downloaded and installed to run those components locally as well. You must have a License Key and a subscription to process in this manner, and the components must be configured for it under Processing Options\On-Premise. To obtain a License Key and the download, please call your CSR or sales directly at 1-800-800-6245 x3.


Where are the local data files located?

Each component has an "Advanced Configuration" button. Once this dialog is open, browse to the "On-Premise" tab and make note of the data path. This should contain the FULL path to the directory containing the data files. The path can be changed, if desired, by clicking on the folder icon.

If a download was obtained for the data files, the default directory these are placed in is

C:\Program Files\Melissa Data\DQT\Data

on Windows (this can be changed).


Memory Limit Increase

Pentaho Kettle's startup script uses the default memory settings for your Java environment, which may be insufficient for your work. If you're experiencing an OutOfMemory exception, you must increase the amount of heap space available to Pentaho by changing the Java options that set memory allocation. Follow the directions below to accomplish this.

  1. Exit Kettle if it is currently running.
  2. Edit your Spoon startup script [Spoon.bat on Windows or spoon.sh on Linux] and modify the -Xmx value so that it specifies a larger upper memory limit. For example:
    PENTAHO_DI_JAVA_OPTIONS="-Xmx2g -XX:MaxPermSize=256m"

    For Java Documentation see: http://docs.oracle.com/cd/E13222_01/wls/docs81/perform/JVMTuning.html#1109778
  3. Start Spoon and ensure that there are no memory-related exceptions.

The Java virtual machine that Spoon runs in now has access to more heap space, which should resolve OutOfMemory exceptions and may improve performance.
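On Linux, the edit in step 2 can also be scripted. The following is a minimal sketch, assuming the startup script already contains a -Xmx option such as the example above; the function name `set_max_heap` is hypothetical. Note that Java 8 removed the permanent generation, so the -XX:MaxPermSize option has no effect on those JVMs and only -Xmx matters there.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every -Xmx<size> option in a startup script
# to a new size, keeping the original file as <script>.bak.
set_max_heap() {
  script="$1"   # e.g. spoon.sh
  heap="$2"     # e.g. 4g or 4096m

  # Loose sanity check: only digits and a k/m/g unit letter are accepted.
  case "$heap" in
    "" | *[!0-9kKmMgG]*) echo "invalid heap size: $heap" >&2; return 1 ;;
  esac

  grep -q -e '-Xmx[0-9]' "$script" || { echo "no -Xmx option in $script" >&2; return 1; }

  cp -p "$script" "$script.bak" || return 1
  # Writing back into the existing file keeps its executable bit.
  sed "s/-Xmx[0-9][0-9]*[kKmMgG]\{0,1\}/-Xmx$heap/g" "$script.bak" > "$script"
}

# Usage (hypothetical path):
#   set_max_heap /opt/pentaho/data-integration/spoon.sh 4g
```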


CentOS and RedHat Linux Missing XULRunner Libraries

Some versions of CentOS and RedHat Linux lack some of the dependencies needed to run Kettle, specifically the XULRunner libraries. To run Kettle in these environments (as well as some others), it may be necessary to download and extract the XULRunner libraries and point the Kettle startup script to them.

Download XULRunner 1.9.2 from here: http://ftp.mozilla.org/pub/mozilla.org/xulrunner/nightly/2012/03/2012-03-02-03-32-11-mozilla-1.9.2/xulrunner-1.9.2.28pre.en-US.linux-x86_64.tar.bz2

  1. Untar and move xulrunner/ to /opt/xulrunner
  2. Add the following line below the last “OPT” assignment in spoon.sh:
    OPT="$OPT -Dorg.eclipse.swt.browser.DefaultType=mozilla -Dorg.eclipse.swt.browser.XULRunnerPath=/opt/xulrunner"
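Step 2 can be automated with a short script. This is a sketch under stated assumptions: it treats any line beginning with `OPT=` as an "OPT" assignment, inserts the new line after the last one, and keeps a .bak copy of spoon.sh. The function name `add_xulrunner_opt` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: insert the XULRunner OPT line after the last
# OPT= assignment in spoon.sh. Does nothing if the path is already set.
add_xulrunner_opt() {
  script="$1"                       # path to spoon.sh
  xul_path="${2:-/opt/xulrunner}"   # where xulrunner/ was moved in step 1

  if grep -q 'XULRunnerPath=' "$script"; then
    echo "XULRunnerPath already set in $script"
    return 0
  fi

  # Line number of the last OPT= assignment.
  last=$(grep -n '^[[:space:]]*OPT=' "$script" | tail -n 1 | cut -d: -f1)
  [ -n "$last" ] || { echo "no OPT= assignment in $script" >&2; return 1; }

  new_line="OPT=\"\$OPT -Dorg.eclipse.swt.browser.DefaultType=mozilla -Dorg.eclipse.swt.browser.XULRunnerPath=$xul_path\""

  cp -p "$script" "$script.bak" || return 1
  # Print every line, and emit the new line right after line number $last.
  awk -v n="$last" -v line="$new_line" '{ print } NR == n { print line }' "$script.bak" > "$script"
}

# Usage (hypothetical path):
#   add_xulrunner_opt /opt/pentaho/data-integration/spoon.sh /opt/xulrunner
```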