Thursday, September 18, 2014

Running K-Means Clustering Algorithm against numerical data in Apache Mahout

Procedures :
  • Dataset Preparation
  • Generate Sequence File
  • Move the Sequence File to Hadoop Cluster and check the contents of the Sequence File
  • Plan and Run K-Means clustering algorithm
  • Export the K-Means output using Cluster Dumper tool
  • Export the K-Means output as graphml file
  • Visualize the output of K-Means using graphml in Gephi
Dataset Preparation :
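          I generate float values (2 dimensions, 5 different ranges) using Java's Random.nextGaussian(), as given below. The program prints 25 values around one mean; changing MEAN and rerunning gives the values for the other ranges. Note that the second argument scales the standard normal sample, so it acts as a standard deviation rather than a variance.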

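import java.util.Random;

public final class RandomGaussian {

    private final Random fRandom = new Random();

    public static void main(String... aArgs) {
        RandomGaussian gaussian = new RandomGaussian();
        // Use double literals; float literals such as -0.9f pick up rounding error when widened to double.
        double MEAN = -0.9;
        // nextGaussian() is multiplied by this value, so it is a standard deviation.
        double STD_DEV = 0.1;
        for (int idx = 1; idx <= 25; ++idx) {
            log(gaussian.getGaussian(MEAN, STD_DEV));
        }
    }

    // Returns a sample from a normal distribution with the given mean and standard deviation.
    private double getGaussian(double aMean, double aStdDev) {
        return aMean + fRandom.nextGaussian() * aStdDev;
    }

    private static void log(Object aMsg) {
        System.out.println(String.valueOf(aMsg));
    }
}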
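
The sequence file step reads a CSV file with one point per line, so the generated values have to end up in that shape. The following is only a sketch of how that could be done in one go: the five means, the point names (p0, p1, ...), the fixed seed and the class name are assumptions for illustration, not the values I actually used. The output file name and the name,x,y line layout match what the next program expects.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

public final class GaussianCsvGenerator {
    // Hypothetical centers of the 5 ranges (the same mean is used for both dimensions).
    static final double[] MEANS = {-0.9, -0.5, 0.0, 0.5, 0.9};
    static final double STD_DEV = 0.1;
    static final int POINTS_PER_RANGE = 25;

    // Builds one "name,x,y" line per point, 25 points around each mean.
    static List<String> generateLines(long seed) {
        Random random = new Random(seed);
        List<String> lines = new ArrayList<String>();
        int id = 0;
        for (double mean : MEANS) {
            for (int i = 0; i < POINTS_PER_RANGE; i++) {
                double x = mean + random.nextGaussian() * STD_DEV;
                double y = mean + random.nextGaussian() * STD_DEV;
                // Locale.ROOT keeps '.' as the decimal separator so the CSV stays parseable.
                lines.add(String.format(Locale.ROOT, "p%d,%f,%f", id++, x, y));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        PrintWriter out = new PrintWriter("inputvectorfile.csv", "UTF-8");
        try {
            for (String line : generateLines(42L)) {
                out.println(line);
            }
        } finally {
            out.close();
        }
    }
}
```

Running it writes 125 lines to inputvectorfile.csv in the current directory.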

Generate Sequence File :
          Mahout reads its clustering input as vectors stored in a Hadoop SequenceFile, so numerical data has to be converted with a small utility first. A SequenceFile is a flat binary file of key-value pairs. The following Java program reads the numerical data created above from a CSV file (one line per point: a name followed by the feature values) and writes it as a SequenceFile with a Text key and a VectorWritable value.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

class VectorFileCreation {
    private VectorFileCreation() {
    }

    // Columns per CSV line: the item name followed by NUM_COLUMNS - 1 numeric features.
    public static final int NUM_COLUMNS = 3;

    public static void main(String[] args) throws Exception {
        String INPUT_FILE = "inputvectorfile.csv";
        String OUTPUT_FILE = "sampleseqfile";
        List<NamedVector> apples = new ArrayList<NamedVector>();

        // Read the CSV file and build one named vector per line.
        BufferedReader br = new BufferedReader(new FileReader(INPUT_FILE));
        try {
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                String[] columns = sCurrentLine.split(","); // split each line only once
                String item_name = columns[0];
                double[] features = new double[NUM_COLUMNS - 1];
                for (int indx = 1; indx < NUM_COLUMNS; ++indx) {
                    features[indx - 1] = Double.parseDouble(columns[indx]);
                }
                apples.add(new NamedVector(new DenseVector(features), item_name));
            }
        } finally {
            br.close(); // always release the file handle
        }

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(OUTPUT_FILE);

        // Write each vector with its name as the key.
        // Hadoop 2.x marks this constructor as deprecated (SequenceFile.createWriter
        // is the suggested replacement there); the deprecation is only a warning.
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
        try {
            VectorWritable vec = new VectorWritable();
            for (NamedVector vector : apples) {
                vec.set(vector);
                writer.append(new Text(vector.getName()), vec);
            }
        } finally {
            writer.close();
        }

        // Read the file back and print every key-value pair as a sanity check.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                System.out.println(key.toString() + "," + value.get().asFormatString());
            }
        } finally {
            reader.close();
        }
    }
}
Move the Sequence File to Hadoop Cluster and check the contents of the Sequence File :
      Use the Hadoop shell commands (for example, hadoop fs -put) to move the sequence file to the Hadoop cluster. A sequence file is binary, so it can't be read as plain text; the command below prints its keys and values:

mahout seqdumper -i /your-hdfs-path-to-seqfiles | less

Plan and Run K-Means clustering algorithm :
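     Plan the clustering by choosing the number of clusters (-k), the maximum number of iterations (-x), and the distance measure (-dm), then run the command below. The -ow flag overwrites an existing output directory, and --clustering assigns the input points to the final clusters after the iterations finish, which produces the clusteredPoints directory used in the export steps.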
mahout kmeans -i /your-hdfs-path-to-seqfiles -c /your-hdfs-path-to-initial-cluster -o /your-hdfs-path-to-seqfiles-final-cluster -x <numeric value of iteration> -k <numeric value of clusters> -ow --clustering -cd <numeric value>

By default, Mahout uses the Squared Euclidean distance measure and a convergence delta (-cd) of 0.5.
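
To make the -k, -x and -cd options more concrete, here is a minimal single-machine sketch of the K-Means loop in plain Java. It is not Mahout's implementation: the class and method names are made up for illustration, and the convergence test (stop once no centroid has moved by more than the delta, measured with the same squared Euclidean distance) is a simplified reading of what the convergence delta controls.

```java
public final class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIterations rounds (like -x), updating the centroids in place.
    // The number of centroids plays the role of -k, convergenceDelta the role of -cd.
    // Returns the number of rounds executed.
    static int cluster(double[][] points, double[][] centroids, int maxIterations, double convergenceDelta) {
        int k = centroids.length;
        int dim = centroids[0].length;
        for (int iter = 1; iter <= maxIterations; iter++) {
            // Assignment step: add each point to the running sum of its nearest centroid.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (squaredDistance(updated, centroids[c]) > convergenceDelta) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                return iter;
            }
        }
        return maxIterations;
    }

    public static void main(String[] args) {
        double[][] points = {{-0.9, -0.9}, {-1.0, -0.8}, {0.9, 0.9}, {1.0, 0.8}};
        double[][] centroids = {{-0.5, -0.5}, {0.5, 0.5}}; // k = 2 hypothetical initial centroids
        int rounds = cluster(points, centroids, 10, 0.5);
        System.out.println("rounds=" + rounds
                + " c0=" + java.util.Arrays.toString(centroids[0])
                + " c1=" + java.util.Arrays.toString(centroids[1]));
    }
}
```

A smaller delta makes the loop run until the centroids have nearly stopped moving, while a larger one stops earlier; the iteration limit caps the work either way.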

Export the K-Means output using Cluster Dumper tool :

mahout clusterdump -i /your-hdfs-path-to-clusters-*-final -p /your-hdfs-path-to-clusteredPoints -o /your-local-destination-path-with-filename.txt

Export the K-Means output as graphml file :

mahout clusterdump -i /your-hdfs-path-to-clusters-*-final -p /your-hdfs-path-to-clusteredPoints -of GRAPH_ML -o /your-local-destination-path-with-filename.graphml

Visualize the output of K-Means using graphml in Gephi :
  • Open the GraphML file in Gephi to visualize the centroids and the points assigned to each cluster.

4 comments:

  1. I am unable to execute the code used to generate Sequence file. Which jar file should I add to mahout.math.VectorWritable? Its saying that this is deprecated. Which mahout.math version are you using ?
