Running K-Means Clustering Algorithm against numerical data in Apache Mahout ~ BI and Big Data Adventure via Open Source Technologies

Procedures :

Dataset Preparation
Generate Sequence File
Move the Sequence File to Hadoop Cluster and check the contents of the Sequence File
Plan and Run K-Means clustering algorithm
Export the K-Means output using Cluster Dumper tool
Export the K-Means output as graphml file
Visualize the output of K-Means using graphml in Gephi

Dataset Preparation :

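I am going to generate float values (2 dimensions, 5 different ranges) using Java's Random.nextGaussian(), as shown below. The program prints 25 values around one mean, so it has to be rerun with a different MEAN for each range.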

import java.util.Random;

public final class RandomGaussian {

    private final Random fRandom = new Random();

    public static void main(String... aArgs) {
        RandomGaussian gaussian = new RandomGaussian();
        double MEAN = -0.9;
        // Used as a standard deviation: it scales nextGaussian() directly.
        double STD_DEV = 0.1;
        // Print 25 samples centred on MEAN.
        for (int idx = 1; idx <= 25; ++idx) {
            log(gaussian.getGaussian(MEAN, STD_DEV));
        }
    }

    // nextGaussian() has mean 0 and standard deviation 1, so shift and scale it.
    private double getGaussian(double aMean, double aStdDev) {
        return aMean + fRandom.nextGaussian() * aStdDev;
    }

    private static void log(Object aMsg) {
        System.out.println(String.valueOf(aMsg));
    }
}
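
The sequence-file step that follows reads a CSV file with an item name and two feature values per line, while the generator above prints a single column of numbers. Here is a minimal sketch of one way to write such a CSV directly. The class name, the five centre points, the point names and the fixed seed are all my own assumptions for illustration and are not taken from the original data set.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

final class GaussianCsvGenerator {

    // Hypothetical cluster centres, one {x, y} pair per range (not from the original post).
    static final double[][] CENTERS = {
        {-0.9, -0.9}, {-0.5, 0.5}, {0.0, 0.0}, {0.5, -0.5}, {0.9, 0.9}
    };
    static final double STD_DEV = 0.1;       // spread around each centre
    static final int POINTS_PER_RANGE = 25;  // same count as the loop above

    // Builds one "name,x,y" line per point; the seed makes runs repeatable.
    static List<String> generateLines(long seed) {
        Random random = new Random(seed);
        List<String> lines = new ArrayList<String>();
        for (int c = 0; c < CENTERS.length; c++) {
            for (int i = 0; i < POINTS_PER_RANGE; i++) {
                double x = CENTERS[c][0] + random.nextGaussian() * STD_DEV;
                double y = CENTERS[c][1] + random.nextGaussian() * STD_DEV;
                // Locale.ROOT keeps '.' as the decimal separator on every machine.
                lines.add(String.format(Locale.ROOT, "p%d_%d,%.4f,%.4f", c, i, x, y));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = generateLines(42L);
        // File name matches the INPUT_FILE used by the sequence-file program.
        Files.write(Paths.get("inputvectorfile.csv"), lines, StandardCharsets.UTF_8);
        System.out.println("Wrote " + lines.size() + " lines");
    }
}
```

Each line has exactly three columns, which is what NUM_COLUMNS = 3 in the next program expects.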

Generate Sequence File :

To process numerical data in Mahout, you need a small utility that writes the data in sequence-vector format. The following Java program converts the numerical data created above into a sequence file of vectors. It expects a CSV file in which each line holds an item name followed by two feature values. A SequenceFile is a Hadoop file format that stores binary key-value pairs.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

class VectorFileCreation {

    // Columns per CSV line: the item name followed by 2 feature values.
    public static final int NUM_COLUMNS = 3;

    private VectorFileCreation() {
    }

    public static void main(String[] args) throws Exception {
        String INPUT_FILE = "inputvectorfile.csv";
        String OUTPUT_FILE = "sampleseqfile";

        // Read each CSV line into a named vector.
        List<NamedVector> apples = new ArrayList<NamedVector>();
        try (BufferedReader br = new BufferedReader(new FileReader(INPUT_FILE))) {
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                String[] columns = sCurrentLine.split(",");
                String item_name = columns[0];
                double[] features = new double[NUM_COLUMNS - 1];
                for (int indx = 1; indx < NUM_COLUMNS; ++indx) {
                    features[indx - 1] = Double.parseDouble(columns[indx]);
                }
                apples.add(new NamedVector(new DenseVector(features), item_name));
            }
        }

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(OUTPUT_FILE);

        // Write one record per vector: key = item name, value = vector.
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
        VectorWritable vec = new VectorWritable();
        for (NamedVector vector : apples) {
            vec.set(vector);
            writer.append(new Text(vector.getName()), vec);
        }
        writer.close();

        // Read the file back to verify what was written.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
            System.out.println(key.toString() + "," + value.get().asFormatString());
        }
        reader.close();
    }
}

Move the Sequence File to Hadoop Cluster and check the contents of the Sequence File :

Use the Hadoop shell commands to copy the sequence file to the Hadoop cluster, for example:

hadoop fs -put sampleseqfile /your-hdfs-path-to-seqfiles

A sequence file is binary, so it cannot be read as plain text. The command below prints its contents:

mahout seqdumper -i /your-hdfs-path-to-seqfiles | less

Plan and Run K-Means clustering algorithm :

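Plan the clustering by choosing the number of clusters, the maximum number of iterations, and the distance measure, then run the command below:

mahout kmeans -i /your-hdfs-path-to-seqfiles -c /your-hdfs-path-to-initial-cluster -o /your-hdfs-path-to-seqfiles-final-cluster -x <numeric value of iteration> -k <numeric value of clusters> -ow --clustering -cd <numeric value>

Here -x is the maximum number of iterations, -k the number of clusters, and -cd the convergence delta. -ow overwrites any existing output, and --clustering assigns the input points to the final clusters once the iterations finish. When -k is given, Mahout fills the -c path with k points sampled at random from the input and uses them as the initial centroids.

By default, it uses the Squared Euclidean distance measure and a convergence delta of 0.5. A different measure can be selected with -dm.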
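
To make those two defaults concrete, here is a small sketch in plain Java (no Mahout classes) of a squared Euclidean distance, a nearest-centroid assignment and a convergence-delta test. The class and method names are mine. The rule that a cluster counts as converged once its centroid has moved no farther than the delta, measured with the chosen distance measure, is my assumption about how the option behaves, not code taken from Mahout.

```java
final class KMeansDefaultsSketch {

    // Squared Euclidean distance: sum of squared coordinate differences (no square root).
    static double squaredEuclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the point.
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredEuclidean(point, centroids[c]) < squaredEuclidean(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Assumed rule: converged when the centroid moved no farther than the delta.
    static boolean hasConverged(double[] oldCentroid, double[] newCentroid, double delta) {
        return squaredEuclidean(oldCentroid, newCentroid) <= delta;
    }

    public static void main(String[] args) {
        double[][] centroids = {{-0.9, -0.9}, {0.9, 0.9}};
        double[] point = {-0.8, -1.0};
        System.out.println("nearest centroid: " + nearestCentroid(point, centroids));
        System.out.println("converged: "
                + hasConverged(new double[] {0.0, 0.0}, new double[] {0.5, 0.5}, 0.5));
    }
}
```

If the test works as assumed here, a delta of 0.5 is coarse for data that sits within roughly -1 to 1, because the iterations would stop while centroids are still moving noticeably. Passing a smaller value with -cd may be worth trying for this data set.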

Export the K-Means output using Cluster Dumper tool :

mahout clusterdump -i /your-hdfs-path-to-clusters-*-final -p /your-hdfs-path-to-clusteredPoints -o /your-local-destination-path-with-filename.txt

Export the K-Means output as graphml file :

mahout clusterdump -i /your-hdfs-path-to-clusters-*-final -p /your-hdfs-path-to-clusteredPoints -of GRAPH_ML -o /your-local-destination-path-with-filename.graphml

Visualize the output of K-Means using graphml in Gephi :

Open the GraphML file in Gephi to visualize the centroids and the points that belong to each cluster.

Thursday, September 18, 2014
