Friday, October 25, 2013

Pentaho Data Integration 4.4 and Hadoop 1.0.4

Prerequisites:

  • Copy the hadoop-20 folder to a new folder named hadoop-104 (create it manually) in the /opt/pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/ directory.
  • Replace the following JARs in the client subfolder with the versions from the Apache Hadoop 1.0.4 distribution:
    • commons-codec-1.4.jar
    • hadoop-core-1.0.4.jar
  • Also add the following JAR from the Hadoop 1.0.4 distribution to the client subfolder:
    • commons-configuration-1.6.jar
  • Then change the active configuration property in plugin.properties to point to the new folder:
    • active.hadoop.configuration=hadoop-104
  • Start Hadoop as the user created during the Hadoop installation. Note: the Hadoop credentials are provided on page 4, step 12.
  • Start PDI
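The folder copy and property change above can be sketched as a short shell script. This is a minimal sketch, not the official procedure: a temp directory stands in for the real PDI install (/opt/pentaho/design-tools/data-integration) so the sketch is self-contained, and the paths to the downloaded Hadoop 1.0.4 JARs are placeholders you must adjust.

```shell
#!/bin/sh
set -e
# Illustrative scaffolding: a temp dir stands in for the real PDI install.
PDI_HOME="$(mktemp -d)"
PLUGIN_DIR="$PDI_HOME/plugins/pentaho-big-data-plugin"
CONF_DIR="$PLUGIN_DIR/hadoop-configurations"
mkdir -p "$CONF_DIR/hadoop-20/lib/client"
echo "active.hadoop.configuration=hadoop-20" > "$PLUGIN_DIR/plugin.properties"

# 1. Copy the shipped hadoop-20 configuration to a new hadoop-104 folder.
cp -r "$CONF_DIR/hadoop-20" "$CONF_DIR/hadoop-104"

# 2. Swap in the Hadoop 1.0.4 client JARs (the source paths below are
#    placeholders for wherever you unpacked the 1.0.4 distribution).
# cp /path/to/hadoop-1.0.4/hadoop-core-1.0.4.jar \
#    /path/to/hadoop-1.0.4/lib/commons-codec-1.4.jar \
#    /path/to/hadoop-1.0.4/lib/commons-configuration-1.6.jar \
#    "$CONF_DIR/hadoop-104/lib/client/"

# 3. Point the plugin at the new configuration.
sed -i 's/^active\.hadoop\.configuration=.*/active.hadoop.configuration=hadoop-104/' \
    "$PLUGIN_DIR/plugin.properties"
cat "$PLUGIN_DIR/plugin.properties"
```

Restart PDI after the change so the plugin picks up the new active configuration.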

Transformation [CSV → Hadoop]:

Follow the instructions below to begin creating your transformation.
  • Click New in the upper left corner of Spoon.
  • Select Transformation from the list.
  • Under the Design tab, expand the Input node; then, select and drag a CSV file input step onto the canvas on the right.
  • Expand the Big Data node; click and drag a Hadoop File Output step onto the canvas.
  • To connect the steps to each other, you must add a hop. Hops describe the flow of data between steps in your transformation. To create the hop, click the CSV file input step, press and hold the <SHIFT> key, and draw a line to the Hadoop File Output step.
  • Double click the CSV file input step to open its edit properties dialog box.
  • In the Filename field, click the Browse button and navigate to the input file location.
  • Select the desired input file (e.g., sample.csv).
  • Click the Get Fields button to load the columns of the input file, then click OK.
  • Double click the Hadoop File Output step to open its edit properties dialog box.
  • In the Filename field, click the Browse button; the Open File dialog box appears.
  • Enter the following credentials to connect with HDFS:
    • Look In – make sure HDFS is selected
    • In Connection,
      • Server – localhost
      • Port - 54310
      • User ID - hduser
      • Password - password
  • Click the Connect button; once connected, the Open File dialog box shows the HDFS file system.
  • Click OK button.
  • Append the desired output file name to the path shown in the Filename field.
  • Navigate to the Fields tab, click the Get Fields button to load the columns of the input file, and click OK.
  • Click the Save icon and save the transformation you have created.
  • Click the Run icon to execute the transformation.
  • The Execute a Transformation dialog box appears.
  • Note: Local Execution is enabled by default. Select Detailed logging.
  • Click Launch.
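Behind the dialog, the Hadoop File Output step stores the connection and path as a single VFS URL in the Filename field. With the connection values used above it takes roughly this shape (the output path and file name are illustrative, not from the original):

```
hdfs://hduser:password@localhost:54310/user/hduser/sample-out.txt
```

If the transformation will be shared, consider parameterizing the credentials rather than leaving them embedded in the URL.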

Transformation [Hadoop → Text File]:

Follow the instructions below to begin creating your transformation.

  • Click New in the upper left corner of Spoon.
  • Select Transformation from the list.
  • Under the Design tab, expand the Big Data node; then, select and drag a Hadoop File Input step onto the canvas on the right.
  • Expand the Output node; click and drag a Text file output step onto the canvas.
  • To connect the steps to each other, you must add a hop. Hops describe the flow of data between steps in your transformation. To create the hop, click the Hadoop File Input step, press and hold the <SHIFT> key, and draw a line to the Text file output step.
  • Double click the Hadoop File Input step to open its edit properties dialog box.
  • In the File or directory field, click the Browse button; the Open File dialog box appears.
  • Enter the following credentials to connect with HDFS:
    • Look In – make sure HDFS is selected
    • In Connection,
      • Server – localhost
      • Port - 54310
      • User ID - hduser
      • Password – password
  • Click the Connect button; once connected, the Open File dialog box shows the HDFS file system.
  • Select the desired input file from HDFS. Click OK button.
  • Click the Add button next to the File or directory field to add the selected file to the list.
  • Navigate to the Fields tab, click the Get Fields button to get the columns of the input file and click OK button.
  • Double click the Text file output step to open its edit properties dialog box.
  • In the Filename field, click the Browse button and navigate to the location where the output file should be placed.
  • Append the desired output file name to the path shown in the Filename field.
  • Navigate to the Fields tab, click the Get Fields button to load the columns of the input file, and click OK.
  • Click the Save icon and save the transformation you have created.
  • Click the Run icon to execute the transformation.
  • Click Launch.
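Taken together, the two transformations form a CSV round trip: a local file is written into HDFS, then read back out to a text file. A minimal command-line sketch of the same data flow, using the local filesystem as a stand-in for HDFS (all file names and the sample data are illustrative):

```shell
#!/bin/sh
set -e
WORK="$(mktemp -d)"
# Stand-in for the input CSV selected in the CSV file input step.
printf 'id,name\n1,alice\n2,bob\n' > "$WORK/sample.csv"

# Transformation 1: CSV file input -> Hadoop File Output
# (modelled here as a copy onto the "cluster").
cp "$WORK/sample.csv" "$WORK/hdfs_sample.csv"

# Transformation 2: Hadoop File Input -> Text file output
# (a copy back off the "cluster").
cp "$WORK/hdfs_sample.csv" "$WORK/result.txt"

# The round trip should leave the data unchanged.
diff "$WORK/sample.csv" "$WORK/result.txt" && echo "round-trip OK"
```

On a real cluster you can make the same check by comparing the original file against the HDFS copy after the first transformation runs.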
