Pentaho:Hadoop Tutorial

From Melissa Data Wiki


The following steps guide you through a basic Hadoop setup for Pentaho.

Set up your Hadoop distribution for Pentaho before you begin. For more details on a Pentaho/Hadoop setup for all major Hadoop distributions, please see the following link:

Configuration Files

Copy all the configuration files from the cluster and place them in the appropriate Pentaho Hadoop shim folder. On Windows, the shim folders are under C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations.

During setup, it is important to edit the mapred-site.xml file on both the cluster and the local machine to add the required property and set it to true.

This tutorial uses a Hortonworks HDP22 distribution, but the steps should be similar for others.

Data Copy

Each node needs its own copy of the data. Log in to each node using ssh and create the directory /hadoop/yarn/local/DQT/data. Hadoop copies the jar and other files needed to run jobs to /hadoop/yarn/local/usercache/<user>/..., so /hadoop/yarn/local/DQT/data should be accessible by all users.

PENT Hadoop 01.png
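
The per-node directory setup above can be scripted. The following is a minimal sketch, not part of the original tutorial: it only builds the ssh command lines so they can be reviewed before running. The node names, the ssh user, and the choice of `chmod a+rx` to make the directory readable by all users are assumptions.

```python
import shlex

# Directory named in the tutorial; every node needs its own copy of the data.
DATA_DIR = "/hadoop/yarn/local/DQT/data"


def data_dir_commands(nodes, user="root", data_dir=DATA_DIR):
    """Return one ssh command (as an argument list) per node.

    Each command creates the data directory and opens it for reading by
    all users. The permission mode is an assumption, not from the tutorial.
    """
    quoted = shlex.quote(data_dir)
    remote = f"mkdir -p {quoted} && chmod a+rx {quoted}"
    return [["ssh", f"{user}@{node}", remote] for node in nodes]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for cmd in data_dir_commands(["node1.example.com", "node2.example.com"]):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the commands look right. The parent directories must also be traversable by the job users for the data to be reachable.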

Creating Directories with HUE

This tutorial creates directories in HDFS using Hue (a GUI), but you can also create directories and push files to HDFS from the command line using the bin/hadoop fs or bin/hdfs dfs command. See the Hadoop website for the command line guide.

Login to HUE

Log in to Hue as hdfs. You may need to create the user hdfs and add it to the superuser group.

PENT Hadoop 02.png

Create Folders

In the user directory in HDFS, create a folder named after your username on the local computer.

PENT Hadoop 03.png

Create a folder named opt in the root directory of HDFS. Then create a new directory in HDFS to hold the native objects. This tutorial uses /DQT.

PENT Hadoop 04.png
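
For readers who prefer the command line to Hue, the folders above can be created with `hdfs dfs -mkdir`. This is a minimal sketch that builds the commands only. It assumes the user directory is /user in HDFS, and the username shown is hypothetical.

```python
def hdfs_mkdir_commands(username, native_dir="/DQT"):
    """Build `hdfs dfs -mkdir -p` commands for the folders in this section.

    Assumes the HDFS user directory is /user. native_dir is the location
    the tutorial uses for the native objects.
    """
    paths = [f"/user/{username}", "/opt", native_dir]
    return [["hdfs", "dfs", "-mkdir", "-p", path] for path in paths]


if __name__ == "__main__":
    # "jsmith" is a placeholder for the local computer's username.
    for cmd in hdfs_mkdir_commands("jsmith"):
        print(" ".join(cmd))
```

The commands need to run as a user with write access to the HDFS root, such as hdfs.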

Upload .dll's

Upload the appropriate object files (64-bit or 32-bit; .so for Linux, .dll for Windows) to the newly created location.

Upload a copy of your mdProps.prop file to the same location. Edit this copy so that data_path=<cluster data location>.

PENT Hadoop 05.png
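
The props edit and upload can also be done outside Hue. The sketch below assumes mdProps.prop is a plain text file of key=value lines, which the tutorial implies but does not state. It also assumes the cluster data location is the /hadoop/yarn/local/DQT/data directory created earlier. The helper names are hypothetical.

```python
def set_data_path(props_text, cluster_data_dir):
    """Return props_text with data_path pointing at cluster_data_dir.

    Replaces an existing data_path line or appends one if none is found.
    Assumes simple key=value lines.
    """
    out = []
    replaced = False
    for line in props_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key == "data_path":
            out.append(f"data_path={cluster_data_dir}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"data_path={cluster_data_dir}")
    return "\n".join(out) + "\n"


def hdfs_put_command(local_file, hdfs_dir="/DQT"):
    """Build the `hdfs dfs -put` command that uploads one file to hdfs_dir."""
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]


if __name__ == "__main__":
    sample = "other_key=1\ndata_path=C:\\data\n"  # illustrative content only
    print(set_data_path(sample, "/hadoop/yarn/local/DQT/data"), end="")
    print(" ".join(hdfs_put_command("mdProps.prop")))
```

Keeping the edit as a pure text transformation makes it easy to check the result before overwriting the copy that will be uploaded.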

Copy Melissa Data Plugins to HDFS

The full Melissa Data plugin needs to be copied to HDFS. It will be located under /opt/pentaho/mapreduce/{pentaho-shim name 6.1hdp22}/plugins/steps.

In the C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin folder there is a file called plugin (plugin.properties). Open that file and make sure the property pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce is present and uncommented.

Add the property pmr.kettle.additional.plugins=steps. This copies the steps folder, with all the Melissa Data plug-ins, to HDFS. The copy happens only if the steps folder does not already exist in HDFS. To update the plugins in HDFS, delete the steps directory in HDFS; the new plugins are copied the next time Pentaho runs a MapReduce job.
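
The two property changes can be checked or applied with a small script. This is a sketch under the assumption that the plugin file is ordinary Java-properties text, where commented lines start with # or !. The helper name is hypothetical, and the sample input is illustrative rather than the real file content.

```python
def ensure_properties(text, required):
    """Uncomment, overwrite, or append key=value pairs in properties text.

    The first line that sets a required key (commented or not) is rewritten
    with the required value. Later lines for the same key are dropped, and
    keys that never appear are appended at the end.
    """
    seen = set()
    out = []
    for line in text.splitlines():
        stripped = line.lstrip("#! \t")
        key = stripped.split("=", 1)[0].strip()
        if "=" in stripped and key in required:
            if key not in seen:
                out.append(f"{key}={required[key]}")
                seen.add(key)
            continue
        out.append(line)
    out.extend(f"{k}={v}" for k, v in required.items() if k not in seen)
    return "\n".join(out) + "\n"


REQUIRED = {
    "pmr.kettle.dfs.install.dir": "/opt/pentaho/mapreduce",
    "pmr.kettle.additional.plugins": "steps",
}

if __name__ == "__main__":
    sample = "#pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce\n"
    print(ensure_properties(sample, REQUIRED), end="")
```

To force a refresh of the plugins, the steps directory named above can be removed with `hdfs dfs -rm -r` before the next MapReduce job runs.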

Under Tools > Hadoop Distribution, pick the appropriate shim.

PENT Hadoop 13.png


Create Cluster

In Pentaho, under Hadoop clusters, create a new cluster and fill in its properties. You may need to add the IP address and name of the cluster to the hosts file on your computer. Here is an example of the configuration in Pentaho.

PENT Hadoop 06.png

Click Test to test your cluster.
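
If the cluster name does not resolve from your computer, a hosts file entry maps it to the cluster's IP address. The sketch below formats such an entry. The IP address and host names are placeholders, not values from this tutorial.

```python
import ipaddress


def hosts_entry(ip, hostname, *aliases):
    """Format one hosts-file line: the IP address, then the name(s).

    Raises ValueError if ip is not a valid IPv4 or IPv6 address.
    """
    ipaddress.ip_address(ip)  # validation only
    return "\t".join([ip, hostname, *aliases])


if __name__ == "__main__":
    # Placeholder values; substitute the real cluster address and name.
    print(hosts_entry("192.0.2.10", "hdp22.example.com", "hdp22"))
```

On Windows the hosts file is C:\Windows\System32\drivers\etc\hosts, and editing it requires administrator rights.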

Create Transformation

In Pentaho, create a new transformation for the job you want to run. Here is an example of a Match Up transformation for Hadoop. You must define the Field Names in the field splitter. They must be the same as the column names from the CSV file or database.

PENT Hadoop 07.png

PENT Hadoop 08.png

PENT Hadoop 09.png

PENT Hadoop 10.png

PENT Hadoop 12.png

PENT Hadoop 11.png

Create New Job

Close the transformation and create a new job.

You can use a Hadoop file copy if you want to copy files from the local disk to HDFS. Below is a job set up with a Hadoop file copy and the Pentaho MapReduce entry.

PENT Hadoop 14.png

Mapper Tab

On the Mapper tab of Pentaho MapReduce, select the Mapper Input Step Name and the Mapper Output Step Name from the transformation. Be sure to include the location of the transformation.

PENT Hadoop 17.png

Job Setup Tab

On the Job Setup tab, enter the input and output paths in HDFS.

PENT Hadoop 16.png

Cluster Tab

Select the cluster information from the Cluster tab.

PENT Hadoop 15.png

User Defined Tab

For the files to be picked up and added to the correct paths, you need to add some settings in the User Defined tab of the MapReduce job entry.

Name                          Value

mapred.cache.files            <full path to file>#<File Name>, with further entries separated by ","

Note: Any files added here are copied to the local cache folder for the given job. The first part is the actual file location and name; the part after the # is the link name that Hadoop will create.

Name                          Value

mapred.job.classpath.files    <full path to file>,<full path to file>, and so on

Note: This is the setting that adds the files to the LD_LIBRARY_PATH.

PENT Hadoop 19.png
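
The two values are long comma-separated strings, so building them from a list of paths helps avoid typos. This is a minimal sketch that follows the format described above. The file names in the example are hypothetical, and using each file's own name as the link name is an assumption.

```python
import posixpath


def cache_files_value(hdfs_paths):
    """Build the mapred.cache.files value: <full path>#<link name>, comma-separated.

    The link name is taken to be the file's base name.
    """
    return ",".join(f"{p}#{posixpath.basename(p)}" for p in hdfs_paths)


def classpath_files_value(hdfs_paths):
    """Build the mapred.job.classpath.files value: full paths, comma-separated."""
    return ",".join(hdfs_paths)


if __name__ == "__main__":
    # Hypothetical files uploaded to the /DQT folder created earlier.
    files = ["/DQT/libmdExample.so", "/DQT/mdProps.prop"]
    print("mapred.cache.files =", cache_files_value(files))
    print("mapred.job.classpath.files =", classpath_files_value(files))
```

Paste each printed value into the Value column of the matching row on the User Defined tab.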


Run the job. The output should be in the output folder in HDFS with the name part-00000.

PENT Hadoop 18.png