
Mahout Clustering: Algorithms & Workflow for Document Clustering

Learn about Mahout clustering algorithms like K-means, spectral clustering, & more. Understand the steps to preprocess data, run K-means clustering, and analyze clusters.



Presentation Transcript


  1. CSE 491/891 Lecture 26 (Mahout Clustering)

  2. Outline of Lecture • Previous lecture • Introduction to Mahout • Classification: logistic regression • Collaborative filtering: matrix factorization with ALS • This lecture • Clustering using Mahout • Writing and compiling a Java program with the Mahout API

  3. Clustering Algorithms in Mahout • Several clustering algorithms are available • K-means • Other algorithms • Fuzzy clustering • Spectral clustering • Latent Dirichlet allocation (a probabilistic clustering method)

  4. Clustering Algorithms in Mahout • To use the clustering algorithms, you must first prepare your input data: • Data must be stored in HDFS • Data must be stored as vectors in sequence file format • Mahout defines a Vector interface (org.apache.mahout.math.Vector) • For applications such as document clustering, each document should be stored as a separate file in HDFS (the name of the file will be used to identify the cluster assignment after the clustering step has ended)

  5. Document Clustering • Suppose we want to cluster 16 scientific articles based on the words that appear in the article titles (filename: title)
  Bio1: The sequence of the human genome
  Bio2: Gene expression profiling predicts clinical outcome of breast cancer
  Bio3: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
  Bio4: Exhaustive matching of the entire protein sequence database
  Bio5: Integration of biological networks and gene expression data using Cytoscape
  Bio6: Combining biological networks to predict genetic interactions
  Bio7: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
  Bio8: Quantitative monitoring of gene expression patterns with a complementary DNA microarray
  Graph1: Network structure and minimum degree
  Graph2: Graph minors: Algorithmic aspects of tree width
  Graph3: Adaptation algorithms for binary tree networks
  Graph4: Fast robust BSP tree traversal algorithm for ray tracing
  Graph5: Approximating maximum clique with a Hopfield network
  Graph6: Clique partitions, graph compression and speeding-up algorithms
  Graph7: A graph theoretic generalization of the clique concept
  Graph8: An introduction to chordal graphs and clique trees

  6. Preprocessing • Create a feature vector for each document • Each feature corresponds to a word (term) in the document • Need to preprocess the terms (e.g., convert all characters to lower case; remove punctuation marks; etc)

  7. Preprocessing • Need to assign weights to each term in a document • Binary (0/1): presence/absence of a term in the document • Limitation: cannot distinguish important words from non-important ones • Counts: based on term frequency (TF) in the document • Limitation: unable to handle stopwords (words such as “the”, “a”, “of”, that appear frequently in documents)

  8. Preprocessing
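  (The body of this slide is an image; presumably it introduces the TF-IDF weighting scheme, which is the default weighting used by seq2sparse later in this lecture. A standard statement of it: the weight of term t in document d is

  w(t, d) = tf(t, d) × log(N / df(t))

  where tf(t, d) is the frequency of t in document d, df(t) is the number of documents containing t, and N is the total number of documents. Terms that appear in nearly every document, such as stopwords, receive a weight close to zero, which addresses the limitation of raw counts noted above.)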

  9. K-Means Clustering in Mahout • K-means clustering requires the following • A SequenceFile containing the input data to be clustered • Distance measure (default is Euclidean distance) • Number of clusters • Maximum number of iterations • Mahout iteratively applies the following steps: • Map: assigns each point to its nearest centroid • Reduce: recomputes the locations of centroids

  10. Workflow for Document Clustering • Upload data to HDFS (local directory → HDFS) • Document preprocessing (mahout seqdirectory, then mahout seq2sparse) • Clustering (mahout kmeans) • Displaying results (mahout clusterdump)

  11. Example: Document Clustering • Step 0: unpack the data files from the class webpage
  hadoop> gzip -d documents.tar.gz
  hadoop> tar xf documents.tar

  12. Example: Document Clustering • Step 1: Upload the data to HDFS • The following command will upload the data from the documents in your local directory to the HDFS directory /user/yourMSU_ID/documents/input • Next step is to create feature vectors from the document data and store them in sequence file format • We’ll use Mahout’s seqdirectory and seq2sparse programs to do this
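  (The upload command itself is shown as an image on the slide; with the standard HDFS shell it would look like this, assuming the documents are in a local directory named documents:

  hadoop> hdfs dfs -mkdir -p /user/yourMSU_ID/documents/input
  hadoop> hdfs dfs -put documents/* /user/yourMSU_ID/documents/input)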

  13. Example: Document Clustering • Step 2a: Preprocess the data • Invoke mahout seqdirectory to transform the data into SequenceFile format
  Options:
  -i: input directory that contains the document files
  -o: output directory to store the sequence files
  -ow: overwrite the output directory (if it already exists)
  -c: character encoding (UTF-8 for Unicode)
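  (The slide shows the command as an image; using these options it would look like this, with HDFS paths assumed:

  hadoop> mahout seqdirectory -i documents/input -o documents/seqfiles -ow -c UTF-8)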

  14. Example: Document Clustering • Step 2a: Preprocess the data • The documents are now stored in SequenceFile format (key is the filename, value is the content of the file)

  15. Example: Document Clustering • You can also view the content of a SequenceFile using the mahout seqdumper command: …
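  (For example, with the sequence-file directory name assumed from the previous step:

  hadoop> mahout seqdumper -i documents/seqfiles)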

  16. Example: Document Clustering • Step 2b: Preprocess the data • Invoke mahout seq2sparse to create sparse vectors
  Options:
  -i: input directory
  -o: output directory (will contain the feature vectors)
  -ow: overwrite the existing directory
  -nv: named vectors (create an identifier for each data instance)

  17. Example: Document Clustering • Step 2b: Preprocess the data • Invoke mahout seq2sparse to create sparse vectors
  Other useful options:
  -s: minimum support (frequency) of a term to be considered part of the dictionary (default = 2)
  -md: minimum document frequency of a term (default = 1)
  -x: maximum document frequency of a term
  -ng: maximum size of n-grams (default = 1)
  -wt: weighting scheme (e.g., tfidf or tf); default is tfidf
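  (Putting the options from these two slides together, a typical invocation would look like this; the slide's command is an image, paths are assumed, and the other options are left at their defaults:

  hadoop> mahout seq2sparse -i documents/seqfiles -o documents/vectors -ow -nv -wt tfidf)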

  18. Example: Document Clustering • Step 2b: Preprocess the data The seq2sparse program will create the following • dictionary.file: mapping of each term to its integer ID • tf-vectors: term frequency feature vector for each document • tfidf-vectors: normalized TFIDF vector for each document • df: document frequency counts

  19. Example: Document Clustering • Let's view the content of the dictionary file • There are only 12 words! What happened to the other words? Terms that fall below the default minimum support of 2 (the -s option above) are dropped from the dictionary • All words have been converted to lower case • But there is no stemming (the trailing "s" or "ing" of a word is not removed)

  20. Example: Document Clustering • Let’s view the tfidf vectors Mahout uses VectorWritable to store the feature vectors

  21. Example: Document Clustering • What have we done so far? • Load the document data from local directory to HDFS • Preprocess the documents • Convert to sequence file format • Create sparse TF or TFIDF vectors to represent the documents • Now, we’re ready to do the clustering • Apply k-means clustering as an example

  22. Example: Document Clustering • Step 3: Apply k-means clustering to the tfidf vectors
  Options:
  -i: input directory (can use tf-vectors or tfidf-vectors)
  -o: output directory
  -k: number of clusters
  -x: maximum number of iterations of k-means to execute
  -c: initial centroids (if k is specified, a random set of points will be selected and written to this directory)
  -dm: distance measure
  -cl: assign the input docs to clusters at the end of the process and put the results in the outputdir/clusteredPoints directory
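  (A typical invocation would look like this; the slide's command is an image, the paths are assumed, and cosine distance is a common choice for document vectors:

  hadoop> mahout kmeans -i documents/vectors/tfidf-vectors -o documents/kmeans -c documents/initial-centroids -k 2 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl -ow)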

  23. Example: Document Clustering • Note that the output of each iteration of the k-means algorithm is stored in the output directory • For this document data set, the algorithm converges after 2 iterations (even though we specified a maximum of 10 iterations) • The final clustering is in the directory clusters-2-final

  24. Example: Document Clustering • Step 4: Display the cluster centroids and the top terms for each cluster
  Options:
  -i: directory that contains the results of the last k-means iteration
  -d: dictionary file that maps each integer ID to the corresponding term
  -dt: dictionary type
  -b: maximum number of characters to display on each line
  -n: number of top terms to display for each cluster
  -p: directory that contains the cluster ID of each document
  -o: output file for storing the clustering results
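  (Putting the options together; the slide's command is an image, the paths are assumed, and seq2sparse typically names the dictionary dictionary.file-0, stored as a sequence file:

  hadoop> mahout clusterdump -i documents/kmeans/clusters-2-final -d documents/vectors/dictionary.file-0 -dt sequencefile -p documents/kmeans/clusteredPoints -n 5 -o results.txt)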

  25. ClusterDump Output • Each entry begins with the cluster ID (e.g., VL-6), followed by n (the number of points in the cluster) and the centroid vector • This cluster (VL-6) has 11 documents (all 8 bio documents and 3 graph documents) • Keywords associated with the cluster are gene, expression, sequence, etc.

  26. ClusterDump Output • This cluster (VL-12) contains 5 documents (all from the graph documents) • Keywords associated with this cluster include clique, network, graph, and algorithms

  27. Summary • To cluster a collection of documents • Store each document as a separate file • Upload the documents to HDFS • Apply mahout seqdirectory to convert the documents into sequence file format • Apply mahout seq2sparse to generate feature vectors (tfidf or tf) and perform other preprocessing • Apply mahout kmeans to cluster the vectors • Apply mahout clusterdump to display the clustering results

  28. Mahout Clustering • The previous example shows how to apply Mahout’s k-means clustering on document data • It includes some preprocessing steps that are specific to document data • What if we want to cluster other types of data (time series, census data, gene expressions, etc)? Can we still use Mahout k-means?

  29. Mahout Clustering • In order to cluster other types of data, we need to make sure the input data is stored on HDFS in sequence file format • Key is the identifier of the data instance • Value is a VectorWritable object • Example: suppose you have a CSV file; how do we cluster its records? • You'll need to write a program to convert the file into a sequence file with • key = record identifier • value = VectorWritable object

  30. Example: 2-D CSV Data • Suppose you need to cluster the following 2-D data (stored in CSV format) • We’ll write a Java program to convert the CSV file into a sequence file of VectorWritables
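  (The data points are shown as a plot on the slide; for illustration, assume a file named 2dpoints.csv with one x,y pair per line — the values below are hypothetical:

  1.0,1.2
  0.8,1.1
  5.3,4.9
  5.1,5.2)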

  31. Using Mahout API • You can write a program that converts CSV to sequence file format using the Mahout API • The program takes 2 input parameters: • Name of input file to be converted (in local directory) • Name of output file after conversion (to be stored in HDFS)

  32. csvLoader.java
  import <packages>

  public class csvLoader {

    public static List<Vector> loadData(String input) {
      …
    }

    public static void genSequenceFile(List<Vector> points, String output, FileSystem fs, Configuration conf) {
      …
    }

    public static void main(String[] args) throws Exception {
      // 1. Check input parameters
      // 2. Load input data
      // 3. Write the data records into a sequence file on HDFS
    }
  }

  33. csvLoader.java
  // Java libraries
  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import java.util.StringTokenizer;
  // Hadoop libraries
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  // Mahout libraries
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  34. Main Program
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: java csvLoader <input> <output>");
      System.exit(1);
    }
    // Read the input data from the local directory and store it as a list of Vectors
    List<Vector> vectors = loadData(args[0]);
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Write the list of Vectors to HDFS in SequenceFile format
    // (key: record ID, value: VectorWritable)
    genSequenceFile(vectors, args[1], fs, conf);
  }

  35. Loading Data from CSV File
  public static List<Vector> loadData(String input) {
    1. Create a List object to store the feature vectors
    2. Read each line (record) of the CSV file
       - break the line into tokens using comma as the delimiter
       - create a Vector object to store the feature values
       - add the Vector object to the list
    3. Return the List of vectors
  }

  36. Function to Read from CSV File
  public static List<Vector> loadData(String input) throws IOException {
    List<Vector> records = new ArrayList<Vector>();
    BufferedReader br = new BufferedReader(new FileReader(input));
    String line = "";
    StringTokenizer st = null;
    int i;
    while ((line = br.readLine()) != null) {
      // 1. Parse each line to create a point object
      // 2. Add the point to the data records
    }
    return records;
  }

  37. Function to Read from CSV File
  public static List<Vector> loadData(String input) throws IOException {
    …
    while ((line = br.readLine()) != null) {
      // Break the line into tokens using comma as the delimiter
      st = new StringTokenizer(line, ",");
      ArrayList<Double> weights = new ArrayList<Double>();
      while (st.hasMoreTokens()) {
        weights.add(Double.parseDouble(st.nextToken()));
      }
      // Copy the feature values into a double array
      double[] point = new double[weights.size()];
      Iterator<Double> iterator = weights.iterator();
      i = 0;
      while (iterator.hasNext()) {
        point[i++] = iterator.next().doubleValue();
      }
      // Wrap the array in a DenseVector and add it to the list of records
      Vector vec = new DenseVector(point.length);
      vec.assign(point);
      records.add(vec);
    }
    return records;
  }

  38. Function to Create SequenceFile
  public static void genSequenceFile(List<Vector> points, String output, FileSystem fs, Configuration conf) throws IOException {
    1. Create a SequenceFile writer object
    2. For each vector stored in the List
       - send the writer a (key, value) pair, where key is the record number and value is a vector of feature values
    3. Close the SequenceFile writer
  }

  39. Function to Create SequenceFile
  public static void genSequenceFile(List<Vector> points, String output, FileSystem fs, Configuration conf) throws IOException {
    // Create a SequenceFile writer with LongWritable keys and VectorWritable values
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path(output), LongWritable.class, VectorWritable.class);
    long recNum = 0;
    VectorWritable vec = new VectorWritable();
    for (Vector point : points) {
      vec.set(point);
      writer.append(new LongWritable(recNum++), vec);
    }
    writer.close();
  }

  40. Compilation • You need to include the following paths in your CLASSPATH variable • AWS:
  /usr/lib/hadoop/hadoop-common-2.7.3-amzn-1.jar
  /usr/lib/mahout/mahout-hdfs-0.12.2.jar
  /usr/lib/mahout/mahout-math-0.12.2.jar
  • To set the classpath:
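  (The slide shows the command as an image; with the jars listed above it would be:

  export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/hadoop-common-2.7.3-amzn-1.jar:/usr/lib/mahout/mahout-hdfs-0.12.2.jar:/usr/lib/mahout/mahout-math-0.12.2.jar)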

  41. Execution • We need to merge csvLoader.class with the Mahout job jar file in order to execute the program • Compile the code • Make sure you've set the classpath (see previous slide) • Copy the Mahout job jar file • Add the class file to the Mahout job jar file (see the sketch below)
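  (A sketch of these steps; the slide shows the exact commands as images, and the job jar name follows the mahout-core*-job.jar pattern used on the next slide, which depends on your Mahout installation:

  javac csvLoader.java
  cp /usr/lib/mahout/mahout-core*-job.jar .
  jar uf mahout-core*-job.jar csvLoader*.class)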

  42. Execution Usage: hadoop jar mahout-core*-job.jar csvLoader <input> <output> Now, we can apply k-means on the sequence file

  43. Clustering • Apply k-means (with k=2) • Examine the output • Dump output to ASCII text file (2dresults.txt)
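  (The commands on this slide are shown as images; a sketch under assumed paths — 2dpoints.csv is the hypothetical input file from slide 30, Euclidean distance is the default measure, and the final-iteration directory name, clusters-2-final here, depends on when k-means converges:

  hadoop> hadoop jar mahout-core*-job.jar csvLoader 2dpoints.csv 2ddata/points.seq
  hadoop> mahout kmeans -i 2ddata -o 2doutput -c 2dinitial -k 2 -x 10 -cl -ow
  hadoop> mahout clusterdump -i 2doutput/clusters-2-final -p 2doutput/clusteredPoints -o 2dresults.txt)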

  44. Clustering Results
