Procedures :
- Dataset Preparation
- Generate Sequence File
- Move the Sequence File to Hadoop Cluster and check the contents of the Sequence File
- Plan and Run K-Means clustering algorithm
- Export the K-Means output using Cluster Dumper tool
- Export the K-Means output as graphml file
- Visualize the output of K-Means using graphml in Gephi
Dataset Preparation :
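I am going to generate floating-point values (2 dimensions, 5 different ranges) using Java's Gaussian random number generator, Random.nextGaussian(), as given below. Each run prints 25 values spread around the configured mean.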
    import java.util.Random;

    public final class RandomGaussian {

        private final Random fRandom = new Random();

        public static void main(String... aArgs) {
            RandomGaussian gaussian = new RandomGaussian();
            double MEAN = -0.9;
            // Standard deviation (not variance): it scales nextGaussian() directly.
            double STD_DEV = 0.1;
            for (int idx = 1; idx <= 25; ++idx) {
                log(gaussian.getGaussian(MEAN, STD_DEV));
            }
        }

        // Returns a normally distributed sample with the given mean and standard deviation.
        private double getGaussian(double aMean, double aStdDev) {
            return aMean + fRandom.nextGaussian() * aStdDev;
        }

        private static void log(Object aMsg) {
            System.out.println(String.valueOf(aMsg));
        }
    }
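The program above prints one-dimensional values for a single mean, while the converter in the next section reads lines of the form name,x,y (NUM_COLUMNS = 3). Here is a minimal sketch of how the two could be bridged. It rests on assumptions of mine that are not in the original post: the five centre points, the p1, p2, ... naming scheme, and the choice to write 25 points per centre. Only the file name inputvectorfile.csv and the 0.1 spread are taken from the code in this post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

final class GaussianCsvGenerator {

    // Hypothetical cluster centres (x, y); any five well-separated points will do.
    static final double[][] CENTERS = {
        {-0.9, -0.9}, {-0.5, 0.5}, {0.0, 0.0}, {0.5, -0.5}, {0.9, 0.9}
    };
    static final double STD_DEV = 0.1;
    static final int POINTS_PER_CENTER = 25;

    // Builds CSV lines of the form "name,x,y", one per generated point.
    static List<String> generate(Random random) {
        List<String> lines = new ArrayList<String>();
        int id = 1;
        for (double[] center : CENTERS) {
            for (int i = 0; i < POINTS_PER_CENTER; i++) {
                double x = center[0] + random.nextGaussian() * STD_DEV;
                double y = center[1] + random.nextGaussian() * STD_DEV;
                // Locale.ROOT keeps '.' as the decimal separator on every machine.
                lines.add(String.format(Locale.ROOT, "p%d,%.4f,%.4f", id++, x, y));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = generate(new Random());
        Files.write(Paths.get("inputvectorfile.csv"), lines, StandardCharsets.UTF_8);
        System.out.println("Wrote " + lines.size() + " points to inputvectorfile.csv");
    }
}
```

Passing a seeded Random (for example new Random(42L)) to generate makes the dataset reproducible, which helps when comparing clustering runs.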
Generate Sequence File :
To process numerical data with Mahout, you first need a small utility that writes the data in sequence-file vector format. The following Java program converts the numerical data created above into a sequence file of vectors. It reads inputvectorfile.csv, where each line holds an item name followed by its two feature values. A SequenceFile is a Hadoop file format that stores key-value pairs.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    class VectorFileCreation {

        // Columns per CSV line: the item name followed by two features.
        public static final int NUM_COLUMNS = 3;

        private VectorFileCreation() {
        }

        public static void main(String[] args) throws Exception {
            String INPUT_FILE = "inputvectorfile.csv";
            String OUTPUT_FILE = "sampleseqfile";

            // Read each CSV line into a named vector.
            List<NamedVector> apples = new ArrayList<NamedVector>();
            BufferedReader br = new BufferedReader(new FileReader(INPUT_FILE));
            try {
                String sCurrentLine;
                while ((sCurrentLine = br.readLine()) != null) {
                    String[] columns = sCurrentLine.split(",");
                    String item_name = columns[0];
                    double[] features = new double[NUM_COLUMNS - 1];
                    for (int indx = 1; indx < NUM_COLUMNS; ++indx) {
                        features[indx - 1] = Double.parseDouble(columns[indx]);
                    }
                    apples.add(new NamedVector(new DenseVector(features), item_name));
                }
            } finally {
                br.close();
            }

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(OUTPUT_FILE);

            // Write the vectors as (Text name, VectorWritable vector) pairs.
            SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class,
                    VectorWritable.class);
            VectorWritable vec = new VectorWritable();
            for (NamedVector vector : apples) {
                vec.set(vector);
                writer.append(new Text(vector.getName()), vec);
            }
            writer.close();

            // Read the file back and print every entry to verify the contents.
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(OUTPUT_FILE), conf);
            Text key = new Text();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                System.out.println(key.toString() + "," + value.get().asFormatString());
            }
            reader.close();
        }
    }
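The converter above calls Double.parseDouble on every feature column, so a single malformed line stops the whole run with an exception. A small pre-check on the CSV can catch that earlier. This is only a sketch: the class name CsvLineCheck and its problem method are hypothetical helpers of mine, and the check is stricter than the converter, which silently ignores any extra columns.

```java
import java.util.Arrays;
import java.util.List;

final class CsvLineCheck {

    // Returns a description of the problem, or null if the line is usable.
    static String problem(String line, int numColumns) {
        // A limit of -1 keeps trailing empty columns, so "p1,0.5," counts as 3 columns.
        String[] cols = line.split(",", -1);
        if (cols.length != numColumns) {
            return "expected " + numColumns + " columns but found " + cols.length;
        }
        if (cols[0].trim().isEmpty()) {
            return "empty item name";
        }
        for (int i = 1; i < cols.length; i++) {
            try {
                Double.parseDouble(cols[i].trim());
            } catch (NumberFormatException e) {
                return "column " + (i + 1) + " is not a number: '" + cols[i] + "'";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not data from the post.
        List<String> samples = Arrays.asList("p1,0.5,-0.2", "p2,0.5", "p3,abc,1.0");
        for (String line : samples) {
            String issue = problem(line, 3);
            System.out.println(line + " -> " + (issue == null ? "ok" : issue));
        }
    }
}
```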
Move the Sequence File to Hadoop Cluster and check the contents of the Sequence File :
Use the hadoop shell commands (for example, hadoop fs -put) to move the sequence file to the Hadoop cluster. A sequence file is stored in a binary format, so its contents can't be read directly; the command below prints them in readable form.
    mahout seqdumper -i /your-hdfs-path-to-seqfiles | less
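Plan and Run K-Means clustering algorithm :
Plan the clustering by choosing the number of clusters, the maximum number of iterations, and the distance measure, then execute the command below.
    mahout kmeans -i /your-hdfs-path-to-seqfiles -c /your-hdfs-path-to-initial-cluster -o /your-hdfs-path-to-seqfiles-final-cluster -x <numeric value of iteration> -k <numeric value of clusters> -ow --clustering -cd <numeric value>
Here -x is the maximum number of iterations, -k the number of clusters, -cd the convergence delta, -ow overwrites an existing output directory, and --clustering assigns the input points to the final clusters. When -k is given, Mahout samples that many input vectors as the initial centroids and writes them to the -c path. By default, it uses the Squared Euclidean Distance Measure (use -dm to choose another) and a convergence delta of 0.5.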
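To make the three planning parameters concrete, here is a small in-memory sketch of the K-Means loop in plain Java. It is not Mahout's implementation, only an illustration under my own assumptions: the initial centroids are passed in directly, distances are squared Euclidean, and the loop stops early once no centroid has moved by more than the convergence delta. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIterations rounds and returns the final centroids.
    static double[][] cluster(double[][] points, double[][] initial,
                              int maxIterations, double convergenceDelta) {
        int k = initial.length;
        int dim = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: add each point to the sum of its nearest centroid.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (squaredDistance(updated, centroids[c]) > convergenceDelta) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0.0, 0.0}, {0.0, 2.0}, {10.0, 10.0}, {10.0, 12.0}};
        double[][] initial = {{0.0, 0.0}, {10.0, 10.0}};
        // Prints [[0.0, 1.0], [10.0, 11.0]]
        System.out.println(Arrays.deepToString(cluster(points, initial, 10, 0.001)));
    }
}
```

In this sketch a larger convergence delta ends the loop sooner at the cost of less settled centroids, while the iteration limit caps the run when the centroids keep moving.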
Export the K-Means output using Cluster Dumper tool :
    mahout clusterdump -i /your-hdfs-path-to-clusters-*-final -p /your-hdfs-path-to-clusteredPoints -o /your-local-destination-path-with-filename.txt
Export the K-Means output as graphml file :
    mahout clusterdump -i /your-hdfs-path-to-clusters-*-final -p /your-hdfs-path-to-clusteredPoints -of GRAPH_ML -o /your-local-destination-path-with-filename.graphml
Visualize the output of K-Means using graphml in Gephi :
- Open the GraphML file in Gephi to visualize the centroids and cluster points.
I am unable to execute the code used to generate the sequence file. Which jar file should I add for mahout.math.VectorWritable? It's saying that this is deprecated. Which mahout.math version are you using?