Case Studies and Explorations with Kmeans Clustering Paul Rodriguez PACE Gordon Summer Institute 2012
Clustering with Kmeans • Kmeans is a standard data-driven technique • Kmeans clustering: assign each point to one of a few clusters so that the total distance to the cluster centers is minimized • Options: distance function, number of clusters, initial cluster centers, number of iterations, stopping criteria
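The assign-then-update loop described above can be sketched in a few lines. The following Python/NumPy version is my own illustration (not the Matlab script benchmarked later); it fixes the distance function to squared Euclidean and exposes the other options from the slide: number of clusters, initial centers, iteration cap, and a stopping criterion.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal kmeans sketch: assign each point to its nearest center,
    recompute centers, repeat until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    # initial cluster centers: k distinct points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every center (N x k)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stopping criterion: assignments unchanged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

Swapping in another distance function or initialization strategy only changes the two marked lines, which is why those are the natural "options" of the algorithm.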
Clustering in an HPC environment • Data Set Up: • Does it fit in memory? Split? Sample? • Tools and Processors: • Which machines/queues: normal compute nodes or vSMP? • How much coding/prototyping/optimization?
Clustering in an HPC environment • Data to Try: • NYTimes articles • 1000 Genomes data • Tools and Processors to Try: • Matlab: high-level math programming tool • MapReduce: C/C++ program library
Matlab Parallel Computing Toolbox • Communication is handled for you (MPI or threads under the hood) • You still have to decide the data/task set up
[Diagram: the CLIENT holds the full X matrix; LAB 1 … LAB N (separate nodes or threads) each hold a local part of X]
Matlab PCT in a nutshell • The distributed toolbox provides distribute/gather functions • In job submission: • Create a job object: createMatlabPoolJob(scheduler information) • Create tasks for that job: createTask(job, @myfunction, #tasks, {parameters...}) • In your code:
spmd
  D = codistributed(X);   % or D = codistributed.build(X);
  <statements>
end;
Matlab PCT in a nutshell • A codistributed array is divided into local parts, each residing in the workspace of a different lab. • Practically: • find bottleneck in kmeans.m program • add code to distribute and gather data
Matlab PCT pseudo code
... old Kmeans code ...
% NEW CODE: distribute data matrix X to local nodes
spmd
  Xsd = codistributed(X);        % declare it as distributed
  X_Local = getLocalPart(Xsd);   % now get the part for this lab
  % also distribute the cluster means
  Csd = codistributed(Cluster_Means_Set);
  Cluster_Means_Local = getLocalPart(Csd);
  ...
Matlab PCT pseudo code
% ALTERNATIVE: split the input file into parts beforehand
% and use the lab index to read the correct file
spmd
  currentlab = labindex;
  <read file ['nytimes_forlab_' num2str(currentlab)]>
  ...
Matlab PCT pseudo code
  % calculate the distance matrix for this part as usual
  Distance_Part = get_distance(X_Local, Cluster_Means_Local);
end;   % end of spmd block
% Now the Distance_Part values are available in the client (one per lab);
% sum the parts:
for i = 1:num_labs
  Distance = Distance + Distance_Part{i};
end;
... rest of Kmeans code ...
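Taken together, the three pseudo-code fragments amount to: split the columns of X among the labs, compute a partial distance matrix on each lab, then sum the parts on the client. Because squared Euclidean distance is a sum over features, the column blocks contribute independently. A Python/NumPy sketch of that pattern (function names are mine; the loop stands in for the parallel labs):

```python
import numpy as np

def partial_sqdist(X_part, C_part):
    """Squared-distance contribution from one column block of the data.
    Rows of the result = points, columns = cluster centers."""
    return ((X_part[:, None, :] - C_part[None, :, :]) ** 2).sum(axis=2)

def distributed_distance(X, C, n_labs=4):
    """Mimic the spmd pattern: split the feature columns among labs,
    compute partial distances, and sum the parts on the 'client'."""
    col_blocks = np.array_split(np.arange(X.shape[1]), n_labs)
    D = np.zeros((X.shape[0], C.shape[0]))
    for cols in col_blocks:          # each iteration stands in for one lab
        D += partial_sqdist(X[:, cols], C[:, cols])
    return D
```

The result is identical to computing the distance matrix on the full X; only the memory footprint per lab changes.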
Matlab in a vSMP setting • vSMP submission indicates threads • In the submission script: • set environment variables for MKL (Intel's Math Kernel Library) • In Matlab code: • setenv('MKL_NUM_THREADS', num2str(number_of_procs)); • No programming changes necessary, but programming considerations exist
Matlab: threads vs. communication
[Plots: matrix multiplication and matrix inversion times vs. square matrix size (N = 10K–50K, i.e. 2–40 GB), for 8, 16, and 32 threads]
• threads: more is better for multiplication, less is better for inversion • (or use a different operation)
Matlab original Kmeans script
1. Difference_by_col = X(:,1) - Cluster_Means(1,1)
[Figure: X (N×P) and Cluster_Means (M×P) shown as sparse word-count matrices; each row is a point in R^P]
• square the difference • sum as you loop across columns to get distances to the cluster centers
Works better for large N, small P
Matlab Kmeans script, altered
1. Difference_by_row = X(1,:) - Cluster_Means(1,:)
[Figure: the same X (N×P) and Cluster_Means (M×P) matrices; each row is a point in R^P]
• dot(Difference_by_row) • loop across rows to get distances
Works better for large P, and dot() will use threads
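The two orientations on these slides, looping across columns (good for large N, small P) versus looping across rows with a dot-product per point (good for large P, where threaded BLAS helps), compute the same distance matrix. A Python/NumPy sketch of both, assuming squared Euclidean distance (function names are mine):

```python
import numpy as np

def dist_by_columns(X, C):
    """Original orientation: loop over the P features, accumulating
    squared differences; each step is one large vectorized operation
    over all N points."""
    D = np.zeros((X.shape[0], C.shape[0]))
    for p in range(X.shape[1]):
        D += (X[:, p:p + 1] - C[:, p]) ** 2
    return D

def dist_by_rows(X, C):
    """Altered orientation: loop over the N points; each step is a
    dot-product-style reduction over all P features, the kind of
    operation a threaded math library can accelerate."""
    D = np.empty((X.shape[0], C.shape[0]))
    for i in range(X.shape[0]):
        diff = X[i] - C                     # (M, P)
        D[i] = np.einsum('mp,mp->m', diff, diff)
    return D
```

Which orientation wins depends on the shape of X, which is the point the benchmark on the next slide makes.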
Matlab Kmeans Benchmarks • Kmeans on 10,000,000 entries from NYTimes articles (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words) • Running as full data matrix: ~45K articles × ~102K words • Each cell holds a word count (double float) • about 37 GB in Matlab; total memory for the script about 61 GB • Kmeans (original) runtime ~50 hours • Kmeans (altered) runtime ~10 hours, 8 threads
Matlab Kmeans Results
[Figure: cluster means shown as words, with each coordinate's value determining font size]
• 7 viable clusters found
MapReduce Framework • A library for distributed computing • Started by Google, gaining popularity • Various implementations: Hadoop (distributed), Phoenix (threaded), Sandia (MPI)
[Diagram, after Ekanayake et al.: MR provides parallelization, concurrency, and intermediate data functions (by key & value); user-defined functions output the keys & values]
Paradigmatic Example: string counting • Scheduler: manages threads, initiates the data split, and calls Map • Map: counts strings, outputs key = string & value = count • Scheduler: re-partitions keys & values • Reduce: sums up the counts
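The string-counting pipeline above can be sketched without any MapReduce library. This minimal Python illustration separates the user-defined Map and Reduce from the framework's shuffle (the key/value re-partition the scheduler performs):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map (user-defined): emit (word, 1) for every word in this chunk."""
    return [(w, 1) for w in chunk.split()]

def shuffle(pairs):
    """Framework step: group all values by their key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    """Reduce (user-defined): sum the counts for each word."""
    return {k: sum(vs) for k, vs in groups.items()}

chunks = ["the cat sat", "the cat ran"]   # the scheduler's data split
pairs = [p for c in chunks for p in map_phase(c)]
counts = reduce_phase(shuffle(pairs))
# counts == {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In a real framework each chunk's Map runs on a different thread or node, and the shuffle moves data between them; the user only writes the two small functions.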
MapReduce Kmeans clustering • C code for Kmeans (sample code with MapReduce Phoenix) • Uses 10,000,000 entries from NYTimes articles • Running as full data matrix (int): ~45K docs × ~102K word tokens, ~20 GB total in vSMP • Running time ~20 min, 32 threads
MapReduce Kmeans clustering • Uses ~70,000,000 entries from NYTimes articles • Full data matrix (int): ~300K docs × ~102K word tokens • ~120 GB total in vSMP memory • Running time ~120 min, 32 threads • Running time ~175 min, serial version
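One common way to phrase one kmeans iteration in MapReduce (a sketch of the general pattern, not the Phoenix C code itself): Map assigns each point to its nearest center and emits (center id, (point, 1)); Reduce sums the vectors and counts per key to produce the new means. In Python, with NumPy standing in for the C arithmetic:

```python
import numpy as np
from collections import defaultdict

def kmeans_map(points, centers):
    """Map: each point emits key = index of its nearest center,
    value = (point vector, 1)."""
    out = []
    for x in points:
        d = ((centers - x) ** 2).sum(axis=1)
        out.append((int(d.argmin()), (x, 1)))
    return out

def kmeans_reduce(pairs):
    """Reduce: per key, sum the vectors and the counts, then divide
    to get the new cluster mean."""
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for k, (x, c) in pairs:
        sums[k] = x if sums[k] is None else sums[k] + x
        counts[k] += c
    return {k: sums[k] / counts[k] for k in sums}
```

Iterating (re-running Map with the new means) until the means stop moving completes the algorithm; each Map call parallelizes over the data split, which is where the 32-thread speedup in the benchmarks comes from.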
Case Study: Genomic Data (with Multi-Modal Imaging Lab, UCSD) • Genomic sequence database on over 1000 subjects (1000genomes.org) • Each sequence mapping is ~10 GB => ~10 TB total data • Goal: identify genetic variants and priors for analysis of brain imaging & sequence data together
Exploring Genomic Data • How does genomic clustering match demographics? • What categories of genes (e.g. coding, regulation) account for differences? • Start small: kmeans clustering for one chromosome.
Exploring Genomic Data • Starting small: Chromosome 11 aligned data • Shell script to retrieve 75 subjects • wget ftp://ftp-trace.ncbi.nih.gov/ … (about 700 MB each file) • Preprocessing: • Download BAM (binary sequence alignment data) utilities • Run the BAM function to get a consensus sequence (other consensus methods?) • Use perl, grep, wc, etc. to strip headers and get metadata about the files • Pick a coding scheme (A,C,G,T = 1,2,3,4; no allele = 0) • Gather summary statistics
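The coding scheme picked above is simple to state as code. A Python sketch (the helper name is mine):

```python
# Coding scheme from the slide: A,C,G,T -> 1,2,3,4; no allele -> 0
CODE = {'A': 1, 'C': 2, 'G': 3, 'T': 4}

def encode(sequence):
    """Turn a consensus sequence string into the integer coding used
    for clustering; anything outside A/C/G/T (gaps, N calls) becomes 0."""
    return [CODE.get(base, 0) for base in sequence.upper()]

encode("ACGT-N")  # -> [1, 2, 3, 4, 0, 0]
```

Applied per subject, this produces the rows of 250M integers described on the next slide.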
Exploring Genomic Data • Data ends up as 250M integers (alleles): • 000001243111333324000004311322224 ….. • Try subsets: 3M, 10M, 50M integers • Use Matlab and MapReduce
Case Study: Cluster Exploration
[Figure: cluster assignments ('A' vs. 'B') for subjects 1–75, with markers for the number of alleles used (. 3M, + 10M, O 50M); Finnish subjects, GBR, and all others are labeled]
High correlation of clusterings despite far fewer alleles used
Case Study: Cluster Exploration
[Figure: distance to cluster mean for subjects 1–75, with one outlier marked]
• Do outliers belong in their own cluster? • Should these be reassigned?
Case Study: More Steps • running multiple cluster sizes • visualizing clusters • using other distance functions (e.g. city-block) • doing more data with vSMP (in progress) • other cluster algorithms; compare to PCA, MDS
How to use the full genome? • Distance Speed-Up Heuristics: • sample columns during the distance calculation • start with subsets of data points (i.e. rows) and add more one at a time • only process outliers or points in between clusters • HPC set up: • put data on flash, some data in memory • use distributed jobs for some processing steps, vSMP for others
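The column-sampling heuristic can be sketched as follows (Python/NumPy, my own illustration): sample a fraction of the P columns, compute squared distances over just those, and rescale so the estimate stays comparable to the full distance.

```python
import numpy as np

def sampled_sqdist(X, C, frac=0.25, seed=0):
    """Heuristic distance speed-up: estimate squared distances from a
    random sample of the columns, scaled by P/|sample| so magnitudes
    stay comparable to the exact computation."""
    rng = np.random.default_rng(seed)
    P = X.shape[1]
    cols = rng.choice(P, size=max(1, int(round(frac * P))), replace=False)
    Xs, Cs = X[:, cols], C[:, cols]
    d = ((Xs[:, None, :] - Cs[None, :, :]) ** 2).sum(axis=2)
    return d * (P / len(cols))
```

With frac=1.0 this reduces to the exact distance; smaller fractions trade accuracy of the nearest-center decision for speed, which is usually acceptable when clusters are well separated.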
PACE Ongoing and Future • Continue building experience with large-memory trade-offs for data mining algorithms • Support a variety of ways to execute a variety of tools