Monday, October 28, 2013

RStudio Server: Configuring the Server on Ubuntu

Overview

RStudio is configured by adding entries to two configuration files (note that these files do not exist by default so you will need to create them if you wish to specify custom settings):
/etc/rstudio/rserver.conf
/etc/rstudio/rsession.conf
After editing configuration files you should perform a check to ensure that the entries you specified are valid. This can be accomplished by executing the following command:
$ sudo rstudio-server test-config
Note that this command is also automatically executed when starting or restarting the server (those commands will fail if the configuration is not valid).

Network Port and Address

After initial installation RStudio accepts connections on port 8787. If you wish to change to another port you should create an /etc/rstudio/rserver.conf file (if one doesn't already exist) and add a www-port entry corresponding to the port you want RStudio to listen on. For example:
www-port=80
By default RStudio binds to address 0.0.0.0 (accepting connections from any remote IP). You can modify this behavior using the www-address entry. For example:
www-address=127.0.0.1
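Both entries can live in the same file. For example, an /etc/rstudio/rserver.conf that serves only local connections on port 80 would contain:
www-port=80
www-address=127.0.0.1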
Note that after editing the /etc/rstudio/rserver.conf file you should always restart the server to apply your changes (and validate that your configuration entries were valid). You can do this by entering the following command:
$ sudo rstudio-server restart

External Libraries

You can add elements to the default LD_LIBRARY_PATH for R sessions (as determined by the R ldpaths script) by adding an rsession-ld-library-path entry to the server config file. This might be useful for ensuring that packages can locate external library dependencies that aren't installed in the system standard library paths. For example:
rsession-ld-library-path=/opt/local/lib:/opt/local/someapp/lib

Specifying R Version

By default RStudio Server runs against the version of R which is found on the system PATH (using which R). You can override which version of R is used via the rsession-which-r setting in the server config file. For example, if you have two versions of R installed on the server and want to make sure the one at /usr/local/bin/R is used by RStudio then you would use:
rsession-which-r=/usr/local/bin/R
Note again that the server must be restarted for this setting to take effect.
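To confirm which R binary the server would find by default, and that the alternate installation exists, you can check from a shell (paths from the example above):
$ which R
$ /usr/local/bin/R --version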

Setting User Limits

There are a number of settings which place limits on which users can access RStudio and the amount of resources they can consume. These settings are also specified in /etc/rstudio/rserver.conf; as noted above, that file does not exist by default, so you should create it if you wish to specify any of the settings below.
To limit the users who can log in to RStudio to the members of a specific group, you use the auth-required-user-group setting. For example:
auth-required-user-group=rstudio_users
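For this setting to work, the group must exist and the intended users must be members of it. On Ubuntu you could set this up as follows (group name taken from the example above):
$ sudo groupadd rstudio_users
$ sudo usermod -aG rstudio_users <username>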

Additional Settings

There is a separate /etc/rstudio/rsession.conf configuration file that enables you to control various aspects of R sessions (note that as with rserver.conf this file does not exist by default). These settings are especially useful if you have a large number of potential users and want to make sure that resources are balanced appropriately.

Session Timeouts

By default if a user hasn't issued a command for 2 hours RStudio will suspend that user's R session to disk so they are no longer consuming server resources (the next time the user attempts to access the server their session will be restored). You can change the timeout (including disabling it by specifying a value of 0) using the session-timeout-minutes setting. For example:
session-timeout-minutes=30
Note that a user's session will never be suspended while it is running code (only sessions which are idle will be suspended).

Package Library Path

By default RStudio sets the R_LIBS_USER environment variable to ~/R/library. This ensures that packages installed by end users do not have R version numbers encoded in the path (which is the default behavior). This in turn enables administrators to upgrade the version of R on the server without resetting users' installed packages (which would occur if the installed packages were in an R-version-derived directory).
If you wish to override this behavior you can do so using the r-libs-user setting. For example:
r-libs-user=~/R/packages

CRAN Repository

Finally, you can set the default CRAN repository for the server using the r-cran-repos setting. For example:
r-cran-repos=http://cran.case.edu/
Note again that the above settings should be specified in the /etc/rstudio/rsession.conf file (rather than the aforementioned rserver.conf file).

RStudio Server: Managing the Server on Ubuntu

Overview

RStudio server management tasks are performed using the rstudio-server utility (installed under /usr/sbin in binary distributions). This utility enables the stopping, starting, and restarting of the server, enumeration and suspension of user sessions, taking the server offline, as well as the ability to hot upgrade a running version of the server.

Stopping and Starting

If you installed RStudio using a package manager binary (e.g. a Debian package or RPM) then RStudio is automatically registered as a daemon which starts along with the rest of the system. On Ubuntu this registration is performed using an Upstart script at /etc/init/rstudio-server.conf. On other systems an init.d script is installed at /etc/init.d/rstudio-server.
To manually stop, start, and restart the server you use the following commands:
$ sudo rstudio-server stop
$ sudo rstudio-server start
$ sudo rstudio-server restart

Managing Active Sessions

There are a number of administrative commands which allow you to see what sessions are active and request suspension of running sessions (note that session data is not lost during a suspend).
To list all currently active sessions:
$ sudo rstudio-server active-sessions
To suspend an individual session:
$ sudo rstudio-server suspend-session <pid>
To suspend all running sessions:
$ sudo rstudio-server suspend-all
The suspend commands also have a "force" variation which will send an interrupt to the session to request the termination of any running R command:
$ sudo rstudio-server force-suspend-session <pid>
$ sudo rstudio-server force-suspend-all
The force-suspend-all command should be issued immediately prior to any reboot so as to preserve the data and state of active R sessions across the restart.

Taking the Server Offline

If you need to perform system maintenance and want users to receive a friendly message indicating the server is offline you can issue the following command:
$ sudo rstudio-server offline
When the server is once again available you should issue this command:
$ sudo rstudio-server online
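Putting these commands together with the session commands above, one possible maintenance sequence is:
$ sudo rstudio-server offline
$ sudo rstudio-server force-suspend-all
$ sudo reboot
Then, once the system is back up:
$ sudo rstudio-server online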

Upgrading to a New Version

If you perform an upgrade of RStudio Server using a package manager binary (e.g. a Debian package or RPM) and a version of RStudio Server is currently running, then the upgrade process will also ensure that active sessions are immediately migrated to the new version. This includes the following behavior:
  • Running R sessions are suspended so that future interactions with the server automatically launch the updated R session binary.
  • Currently connected browser clients are notified that a new version is available and automatically refresh themselves.
  • The core server binary is restarted.
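For example, on Ubuntu a hot upgrade is simply another package install performed while the server is running (illustrative filename):
$ sudo gdebi rstudio-server-<new-version>.deb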

RStudio Server: Installation on Ubuntu

What is RStudio?


RStudio IDE is an open source Integrated Development Environment for the statistical analysis program R. RStudio Server provides a web version of RStudio IDE that allows easy development on a VPS.

Since our VPSs are billed by the hour, it's surprisingly cheap to spin up a 24 core instance, crunch some data, and then destroy the VPS.

Installing RStudio In a VPS


First, install R, apparmor, and gdebi.
sudo apt-get install r-base libapparmor1 gdebi-core
Next, download and install the correct package for your architecture. On 32-bit Ubuntu, execute the following command.
wget http://download2.rstudio.org/rstudio-server-0.97.336-i386.deb -O rstudio.deb
On 64-bit Ubuntu, execute the following command.
wget http://download2.rstudio.org/rstudio-server-0.97.336-amd64.deb -O rstudio.deb

Install the package.

sudo gdebi rstudio.deb

Creating RStudio User


It is not advisable to use the root account with RStudio; instead, create a normal user account just for RStudio. The account can be named anything, and the account password will be the one to use in the web interface.
sudo adduser rstudio

RStudio will use the user's home directory as its default workspace.

Using RStudio


RStudio can be accessed through port 8787. Any user account with a password can be used in RStudio.
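For example, if your VPS's IP address were 192.0.2.10 (an illustrative address), you would browse to:
http://192.0.2.10:8787
and log in with the rstudio account created above.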

Let's test that RStudio is working correctly by installing a quantitative finance package from CRAN, the R package repository.

Run the following command inside RStudio to install quantmod.

install.packages("quantmod")


Next, let's test out RStudio's graphing capabilities by plotting the stock price of Apple. The graph will appear in the bottom right panel of RStudio.

library(quantmod)              # load the quantmod package
data <- new.env()              # environment to hold the downloaded data
getSymbols('AAPL', env = data) # fetch the AAPL price series into the environment
plot(data$AAPL)                # plot the price series (bottom-right panel)


R is a really powerful tool and there are hundreds of useful packages available from CRAN. You can learn the basics of R at Try R.

Friday, October 25, 2013

Pentaho Data Integration 4.4 and Hadoop 1.0.4

Prerequisites:

  • Copy the hadoop-20 folder to a hadoop-104 folder (created manually by the user) in the /opt/pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/ directory (see the example commands after this list).
  • Replace the following JARs in the client subfolder with the versions from the Apache Hadoop 1.0.4 distribution:
    • commons-codec-1.0.4.jar
    • hadoop-core-1.0.4.jar
  • Add the following JAR from the Hadoop 1.0.4 distribution to the client subfolder as well:
    • commons-configuration-1.0.6.jar
  • Then change the property in plugins.properties to point to the new folder:
    • active.hadoop.configuration=hadoop-104
  • Start Hadoop as the user created during the Hadoop installation. Note: the Hadoop credentials are those provided on page 4, step 12.
  • Start PDI
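As a sketch, the folder and JAR steps above might look like this from a shell (the exact location of the client subfolder within the configuration folder, and of your Hadoop 1.0.4 distribution, may differ):
$ cd /opt/pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations
$ cp -r hadoop-20 hadoop-104
$ rm hadoop-104/lib/client/hadoop-core-*.jar hadoop-104/lib/client/commons-codec-*.jar
$ cp /path/to/hadoop-1.0.4/hadoop-core-1.0.4.jar hadoop-104/lib/client/
$ cp /path/to/hadoop-1.0.4/lib/commons-codec-1.0.4.jar hadoop-104/lib/client/
$ cp /path/to/hadoop-1.0.4/lib/commons-configuration-1.0.6.jar hadoop-104/lib/client/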

Transformation [CSV → Hadoop]:

Follow the instructions below to begin creating your transformation.
  • Click New in the upper left corner of Spoon.
  • Select Transformation from the list.
  • Under the Design tab, expand the Input node; then, select and drag a CSV file input step onto the canvas on the right.
  • Expand the Big Data node; click and drag a Hadoop File Output step onto the canvas.
  • To connect the steps to each other, you must add a hop. Hops are used to describe the flow of data between steps in your transformation. To create the hop, click the CSV file input step, press and hold the <SHIFT> key, and draw a line to the Hadoop File Output step.
  • Double click the CSV file input step to open its edit properties dialog box.
  • In the Filename field, click on the Browse button and navigate to the input file location
  • Select the desired input file (e.g., sample.csv).
  • Click the Get fields button to get the columns of the input file and click OK button.
  • Double click the Hadoop File Output step to open its edit properties dialog box.
  • In the Filename field, click the Browse button; the Open File dialog box appears.
  • Enter the following credentials to connect with HDFS:
    • Look In – ensure that HDFS is selected
    • In Connection,
      • Server – localhost
      • Port - 54310
      • User ID - hduser
      • Password - password
  • Click the Connect button to connect with HDFS.
  • Click OK button.
  • Provide the desired output file name next to the path selected in the Filename field
  • Navigate to the Fields tab, click the Get Fields button to get the columns of the input file and click OK button.
  • Click the Save icon and save the transformation you have created.
  • Click on the Run icon in the right panel to execute the transformation.
  • The Execute a Transformation dialog box appears.
  • Note: Local Execution is enabled by default. Select Detailed logging.
  • Click Launch.

Transformation [Hadoop → Text File]:

Follow the instructions below to begin creating your transformation.

  • Click New in the upper left corner of Spoon.
  • Select Transformation from the list.
  • Under the Design tab, expand the Big Data node; then, select and drag a Hadoop File Input step onto the canvas on the right.
  • Expand the Output node; click and drag a Text file output step onto the canvas.
  • To connect the steps to each other, you must add a hop. Hops are used to describe the flow of data between steps in your transformation. To create the hop, click the Hadoop File Input step, press and hold the <SHIFT> key, and draw a line to the Text file output step.
  • Double click the Hadoop File Input step to open its edit properties dialog box.
  • In the File or directory field, click the Browse button; the Open File dialog box appears.
  • Enter the following credentials to connect with HDFS:
    • Look In – ensure that HDFS is selected
    • In Connection,
      • Server – localhost
      • Port - 54310
      • User ID - hduser
      • Password – password
  • Click the Connect button to connect with HDFS.
  • Select the desired input file from HDFS. Click OK button.
  • Click the Add button corresponding to the File or directory field.
  • Navigate to the Fields tab, click the Get Fields button to get the columns of the input file and click OK button.
  • Double click the Text file output step to open its edit properties dialog box.
  • In the Filename field, click on the Browse button and navigate to the desired location where the output file is to be placed
  • Provide the desired output file name next to the path selected in the Filename field
  • Navigate to the Fields tab, click the Get Fields button to get the columns of the input file and click OK button.
  • Click the Save icon and save the transformation you have created.
  • Click on the Run icon in the right panel to execute the transformation.
  • Click Launch.

MongoDB Installation on Ubuntu

  • Open Terminal and issue the command below to install MongoDB on Ubuntu:
    • apt-get install mongodb
  • Alternatively, download the tar file from the MongoDB downloads page and untar it.
  • Start/Stop the MongoDB
    • /etc/init.d/mongodb start/stop/restart
  • Adding a user to MongoDB (this enables authentication)
    root@AX-PENTAHO:/usr/local# mongo
    MongoDB shell version: 2.2.2
    connecting to: test
    > use admin;
    switched to db admin
    > db.addUser('admin','test123');
    {
      "user" : "admin",
      "readOnly" : false,
      "pwd" : "3ebea24ef5a0388efc523a0cb1ed54d1",
      "_id" : ObjectId("5100f5ffb6b86baa08f17ff5")
    }
  • Login to MongoDB using Admin Login
    root@AX-PENTAHO:/usr/local# mongo
    MongoDB shell version: 2.2.2
    connecting to: test
    > use admin
    switched to db admin
    > db.db.auth('admin','test123');
    Thu Jan 24 14:22:00 TypeError: db.db.auth is not a function (shell):1
    > db.auth('admin','test123');
    1
    > exit
  • Make MongoDB listen on all IPs
    • vim /etc/mongodb.conf
    • Change bind_ip = 127.0.0.1 to bind_ip = 0.0.0.0
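  • Restart MongoDB for the bind_ip change to take effect
    • /etc/init.d/mongodb restart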

Migrating the Reports, Graphs, and Dashboards created in Pentaho User Console (PUC) from v4.8.0 to v4.8.1

  • Copying the Dashboards, Reports, Graphs, Datasources and Datasource names from your current Pentaho v4.8.0 (see the example backup commands after this list)
    • Backup all the reports, graphs and dashboards from <pentaho>/server/biserver-ee/pentaho-solutions directory
    • Backup all the datasource csv files from <pentaho>/server/biserver-ee/pentaho-solutions/system/metadata/csvfiles
    • Backup all the datasource names from <pentaho>/server/biserver-ee/pentaho-solutions/system/olap
    • Backup all the resource files from <pentaho>/server/biserver-ee/pentaho-solutions/admin/resources/metadata
  • Restore the backup files to the corresponding locations of the new Pentaho server (v4.8.1) and restart the server.
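As a minimal sketch, the backup could be done with tar (assuming <pentaho> is your installation root and the archive is written to /tmp; your report and dashboard solution folders may sit elsewhere under pentaho-solutions):
$ cd <pentaho>/server/biserver-ee
$ tar czf /tmp/puc-backup.tar.gz pentaho-solutions/system/metadata/csvfiles pentaho-solutions/system/olap pentaho-solutions/admin/resources/metadata
Restore by extracting the archive at the same relative location in the v4.8.1 server before restarting it.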

CTools Integration with Pentaho installed on Ubuntu

Adding the Data Cleaner and Data Quality Plugins to Kettle

Data Cleaner :

Data Quality :
  • Download Easy data quality for Pentaho
  • The plugin source code and downloads are hosted on SourceForge, so the first step is to download it from there.
  • After downloading you will have a file named EasyDQ-PDI-plugin.jar.

Copy plugin file to Pentaho:
  • Copy the EasyDQ-PDI-plugin.jar file to the plugins/ directory of Pentaho Data Integration. The folder will already contain a few other plugins (see the example command after this list).
  • If you prefer you can also create a subdirectory in the plugins/ folder and put the file there.
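For example, assuming PDI is installed under /opt/pentaho/design-tools/data-integration as in the earlier sections, the copy is a one-liner:
$ cp EasyDQ-PDI-plugin.jar /opt/pentaho/design-tools/data-integration/plugins/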
Start Pentaho Data Integration:
  • Start Pentaho Data Integration by executing the spoon.bat file (or spoon.sh on *nix systems). Once the application has started, you should see the "Data Quality" category of steps when you work with transformations.

Monday, October 14, 2013

Integration of R/Weka with Pentaho Data Integration (Spoon/Kettle)

Weka Installation and Integration with R:
  • Download and install the software from the following link http://www.cs.waikato.ac.nz/ml/weka/
  • Installation location on AE3 Server : /opt/weka-3-7-10
  • On Linux, navigate to that directory and issue the command below to start Weka
    • java -Xmx1000M -jar weka.jar
  • Under Weka GUI Chooser, Navigate to Tools -> Package Manager
  • Install the following dependency packages through the Package Manager:
    • Rplugin
    • DTNB
    • TimeSeriesForecasting
    • naiveBayesTree
    • kfKettle
    • multiInstanceFilters
    • UserClassifier
  • Weka log file location: /root/wekafiles

Weka integration with Pentaho Spoon: 
To integrate Weka with Pentaho for data mining, we need a model (e.g. a PMML model) that the Weka Scoring step can apply to the input data.
  • Creating and Exporting the Model in Weka
  • Open Weka Explorer
  • Open the CSV file, navigate to the Classify tab, and choose the J48 classifier (a widely used decision tree learner, available under Choose -> Trees -> J48)
  • Click Start button to create a Classifier Model
  • After the run completes successfully, check the classifier output: the value of correctly classified instances should be above 60%
  • Save the Model in specific location.
  • Open Pentaho Spoon and create a Transformation as given below
  • In the Weka Scoring step, load the exported model from Weka and map the input fields to the model

  • Weka is now integrated with Pentaho; you can implement your data mining logic and run the transformation

Wednesday, October 9, 2013

How to install and keep R up to date on Ubuntu Linux

Installation of R on Ubuntu : 
  • R is included as part of the standard Ubuntu distribution, and can be installed with a command like  
    • sudo apt-get install r-base
  • Installation location of R and related files
    • Installed location: /usr/lib/R
    • Related files: /usr/local/lib/R, /usr/share/R, /etc/R, /usr/bin/R
  • Open Terminal and simply type R to open the shell
  • Set the environment variables for R in your ~/.bashrc file
    • export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.35
    • export PATH=$JAVA_HOME/bin:$PATH
    • export R_HOME=/usr/lib/R
    • export PATH=$R_HOME/bin:$PATH
    • export LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/xe/lib:/usr/lib/R/lib:/usr/lib/jvm/java-6-sun-1.6.0.35/lib:/usr/lib/jvm/java-6-sun-1.6.0.35/jre/lib/amd64/server:/usr/local/lib/R/site-library/rJava/jri
  • Use the command below to reconfigure Java support in R
    • sudo R CMD javareconf
  • Obviously the software included as part of the standard distribution usually lags a little behind the latest version, and this is usually quite acceptable for most users most of the time. However, R is evolving quite quickly at the moment, and for various reasons I have decided to skip Ubuntu 12.10 (quantal) and stick with Ubuntu 12.04 (precise) for the time being. Since R 2.14 is included with Ubuntu 12.04, and I’d rather use R 2.15, I’d like to run with the latest R builds on my Ubuntu system.
  • Fortunately this is very easy, as there is a maintained repository for Ubuntu builds of R on CRAN. Full instructions are provided on CRAN, but here is the quick summary. First you need to know your nearest CRAN mirror – there is a list of mirrors on CRAN. I will use the CMU mirror (lib.stat.cmu.edu) in the following.
Upgrade the R on Ubuntu : 
    • sudo su
    • echo "deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
    • apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
    • apt-get update
    • apt-get upgrade
  • That’s it. You are updated to the latest version of R, and your system will check for updates in the usual way. There are just two things you may need to edit in the second command above (the echo line). The first is the address of the CRAN mirror (here “http://lib.stat.cmu.edu/R/CRAN”). The second is the name of the Ubuntu distro you are running (here “precise”).

Wednesday, October 2, 2013

Elasticsearch, Kettle and the CTools

When I started working with Elasticsearch and PDI based on the following post, http://pedroalves-bi.blogspot.co.uk/2011/07/elasticsearch-kettle-and-ctools.html?m=1, I faced many issues, and while trying to debug them I could not find much information or support. This post therefore describes inserting bulk data into the Elasticsearch engine using the Elastic Search Bulk Insert step in Kettle, and integrating the output of Kettle with CDA.
Currently it is not possible to run Pentaho with a higher version of Elasticsearch (e.g. 0.90.5). The main reason is that the PDI component was compiled against the 0.16.3 client classes.

Prerequisite :

  • Elastic Search engine - ES 0.19.5
  • Pentaho BA Server - 4.8.0 GA
  • Kettle - 4.4
Installation of Elastic Search Engine :
  • Download ES ver. 0.19.5 from http://www.elasticsearch.org/downloads/0-19-5/
  • Extract the elasticsearch-0.19.5.tar file under the /usr/share directory.
  • Navigate to the /usr/share directory and issue the command below:
    • $ elasticsearch-0.19.5/bin/plugin -install mobz/elasticsearch-head
  • Navigate to the /usr/share/elasticsearch-0.19.5 directory and issue the command below to start ES:
    • $ bin/elasticsearch or bin/elasticsearch -f
  • Open http://localhost:9200/_plugin/head/ in a browser
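  • You can also confirm from the command line that the engine is up; ES answers on port 9200 with a small JSON status document:
    • $ curl http://localhost:9200/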
Inserting the Bulk data to ES on Kettle Transformation :
  • Create a Transformation (Table input -> Elastic Search Bulk Insert)
  • Copy the elasticsearch* and lucene* jars from 0.19.5 ES server/lib to .../design-tools/data-integration/lib/elasticsearch directory.
  • Copy the attached jar file (es_0.19.4_patch.jar) into PDI/lib
  • Restart the PDI
  • In the Elastic Search Bulk Insert step, provide the IP address and port number of the Elasticsearch engine on the Servers tab.
    • Note : You need to select the value for ID Field.
  • Click the Test Connection button; a success message indicates that PDI is connected to ES.
  • Run the Transformation. It inserts the bulk data to the ES engine.


Elastic Search Server : Sample JSON Input and Output Query:
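As an illustrative sketch (the index name myindex is hypothetical), a simple match_all query against the bulk-loaded index looks like this on ES 0.19, and the response is a JSON document whose hits array contains the inserted rows:
$ curl -XGET 'http://localhost:9200/myindex/_search?pretty=true' -d '{ "query" : { "match_all" : {} } }'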



Elastic Search Query Transformation :
  • Create a Transformation.



    • Note : Here I have used a JavaScript step to extract all the data from CACM.
Kettle Data Source Input in CDE Dashboard:
  • Create a new CDE dashboard in PUC.
  • Create a Kettle data source in the Data Sources tab.
  • Click Preview to display the values.