Large-scale Single-pass k-Means Clustering
Goals
• Cluster very large data sets
• Facilitate large-scale nearest neighbor search
• Allow a very large number of clusters
• Achieve good quality
  • low average distance to the nearest centroid on held-out data
• Based on Mahout Math
• Runs on a Hadoop (really MapR) cluster
• FAST: clusters tens of millions of points in minutes
Non-goals
• Use map-reduce (but it is there)
• Minimize the number of clusters
• Support metrics other than L2
Anti-goals
• Multiple passes over the original data
• Scale as O(k n)
What’s that?
• Find the k nearest training examples
• Use the average value of the target variable from them (a naive sketch follows below)
• This is easy … but hard
  • easy because it is conceptually simple: no knobs to turn, no models to build
  • hard because of the stunning amount of arithmetic
  • also hard because we need the top 50,000 results
• The initial prototype was massively too slow
  • 3K queries x 200K examples took hours
  • we needed 20M x 25M in the same time
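To make the baseline concrete, here is a minimal brute-force k-NN regression sketch; it is exactly the kind of naive implementation that was massively too slow, and every name in it is hypothetical rather than from the actual prototype.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Naive k-NN regression; illustrative only, hypothetical names. */
public class NaiveKnn {
    /** Average the targets of the k training examples nearest the query. */
    static double predict(double[][] examples, double[] targets,
                          double[] query, int k) {
        Integer[] order = new Integer[examples.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // A full sort per query: this per-query cost is what motivates
        // the searchers and clustering in the rest of the deck.
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(examples[i], query)));
        double sum = 0;
        for (int i = 0; i < k; i++) {
            sum += targets[order[i]];
        }
        return sum / k;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }
}
```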
How We Did It
• A 2-week hackathon with 6 developers from a customer bank
• Agile-ish development
• To avoid IP issues:
  • all code is Apache licensed (no ownership question)
  • all data is synthetic (no question of private data)
  • all development done on individual machines, hosted on GitHub
  • open is easier than closed (in this case)
• Goal is new open technology to facilitate new closed solutions
• Ambitious goal of ~1,000,000x speedup
  • well, really only 100-1000x after basic hygiene
What We Did
• Mechanism for extending Mahout Vectors
  • DelegatingVector, WeightedVector, Centroid
• Shared-memory matrix
  • FileBasedMatrix uses mmap to share very large dense matrices
• Searcher interface (sketched below)
  • ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
  • Kmeans, StreamingKmeans
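The Searcher abstraction is the hinge of the design: every nearest-neighbor strategy sits behind one contract. The shape below is a guess at the essentials, not the actual Mahout Math interface (which works on Mahout Vectors, not arrays).

```java
import java.util.List;

/** Illustrative shape of a Searcher contract; not the real Mahout API. */
public interface Searcher {
    /** Add one point to the searchable index. */
    void add(double[] point);

    /** Return the k indexed points nearest to the query. */
    List<double[]> search(double[] query, int k);
}
```

Because Brute, ProjectionSearch, and the k-means searcher all honor the same contract, the clustering code can swap its inner-loop search strategy without changing anything else.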
Projection Search
• Project the data onto one or more random vectors
• Keep each projection in sorted order (java.util.TreeSet!); near neighbors of a query tend to land nearby in projected order
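A minimal single-projection sketch of the idea, using a TreeMap rather than the deck's TreeSet so each point can carry its coordinates; all names are illustrative, and a real implementation uses several projections and handles duplicate keys.

```java
import java.util.Iterator;
import java.util.Random;
import java.util.TreeMap;

/** One random projection kept in sorted order; illustrative only. */
public class ProjectionIndex {
    private final double[] direction;                      // random unit vector
    private final TreeMap<Double, double[]> index = new TreeMap<>();

    public ProjectionIndex(int dim, Random rand) {
        direction = new double[dim];
        double norm = 0;
        for (int i = 0; i < dim; i++) {
            direction[i] = rand.nextGaussian();
            norm += direction[i] * direction[i];
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < dim; i++) {
            direction[i] /= norm;
        }
    }

    public void add(double[] x) {
        // Collisions on the projected value are ignored here; a real
        // index keeps a multiset or breaks ties with a secondary key.
        index.put(dot(x, direction), x);
    }

    /** Check a window of candidates on both sides of the query's
     *  projected position; the true nearest neighbor is usually there. */
    public double[] searchNearest(double[] q, int window) {
        double key = dot(q, direction);
        Iterator<double[]> up = index.tailMap(key, true).values().iterator();
        Iterator<double[]> down =
            index.headMap(key, false).descendingMap().values().iterator();
        double[] best = null;
        for (int i = 0; i < window; i++) {
            if (up.hasNext()) {
                best = closer(q, up.next(), best);
            }
            if (down.hasNext()) {
                best = closer(q, down.next(), best);
            }
        }
        return best;
    }

    private static double[] closer(double[] q, double[] candidate, double[] best) {
        return (best == null || dist(q, candidate) < dist(q, best)) ? candidate : best;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }
}
```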
K-means Search
• Simple idea (sketched below):
  • pre-cluster the data
  • to find the nearest points, search only the nearest clusters
• Recursive application:
  • to search within a cluster, use a Searcher!
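A minimal non-recursive version of the pruning idea, assuming brute-force search inside each cluster; in the recursive form, that inner scan would itself be any Searcher. All names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Cluster-pruned nearest-neighbor search; all names are illustrative. */
public class KmeansSearchSketch {
    private final double[][] centroids;          // from a pre-clustering pass
    private final List<List<double[]>> members;  // points assigned to each cluster

    public KmeansSearchSketch(double[][] centroids, List<List<double[]>> members) {
        this.centroids = centroids;
        this.members = members;
    }

    /** Scan only the points in the `probes` clusters nearest the query. */
    public double[] searchNearest(double[] q, int probes) {
        // Rank clusters by centroid distance and search the closest few.
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(q, centroids[i])));

        double[] best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int p = 0; p < Math.min(probes, order.length); p++) {
            // Brute force inside the cluster; recursively, this inner
            // loop could be another Searcher, which is the slide's point.
            for (double[] x : members.get(order[p])) {
                double d = dist(q, x);
                if (d < bestDist) {
                    bestDist = d;
                    best = x;
                }
            }
        }
        return best;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }
}
```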
But This Requires k-means!
• We need a new k-means algorithm to get speed
  • Hadoop is very slow at iterative map-reduce
  • maybe Pregel clones like Giraph would be better
  • or maybe not
• Streaming k-means is:
  • one pass (through the original data)
  • very fast (20 µs per data point with threads)
  • very parallelizable
Basic Method
• Use a single pass of k-means with very many clusters
  • the output is a bad-ish clustering but a good surrogate for the data
• Use the weighted centroids from step 1 to do in-memory clustering
  • the output is a good clustering with fewer clusters
Algorithmic Details (a Java sketch follows below)

for each data point xₙ:
    compute the distance ∂ to the nearest centroid
    sample u ~ Uniform(0, 1)
    if u > ∂ / ß:
        add xₙ to the nearest centroid
    else:
        create a new centroid from xₙ
    if the number of centroids > 10 log n:
        recursively cluster the centroids
        set ß = 1.5 ß if the number of centroids did not decrease
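A minimal, self-contained Java sketch of this loop, with `beta` standing for ß and `d` for ∂. It assumes brute-force nearest-centroid search for clarity (the deck's point is that a projection Searcher goes there instead); the class and method names are illustrative, not the actual Mahout StreamingKmeans API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** One-pass streaming k-means sketch; illustrative, not the Mahout API. */
public class StreamingKMeansSketch {
    private List<double[]> centroids = new ArrayList<>();
    private List<Double> weights = new ArrayList<>();
    private final Random rand = new Random();
    private double beta;   // distance cutoff ß; grows when collapsing fails
    private long n = 0;    // points seen so far

    public StreamingKMeansSketch(double initialBeta) {
        this.beta = initialBeta;
    }

    /** Feed one (optionally weighted) point through the streaming update. */
    public void add(double[] x, double w) {
        n++;
        update(x, w);
        // Too many centroids? Cluster the centroids themselves, then
        // loosen the cutoff if that did not shrink the sketch.
        if (centroids.size() > 10 * Math.log(n + 1)) {
            int before = centroids.size();
            collapse();
            if (centroids.size() >= before) {
                beta *= 1.5;
            }
        }
    }

    public void add(double[] x) { add(x, 1.0); }

    public List<double[]> centroids() { return centroids; }
    public List<Double> weights() { return weights; }

    private void update(double[] x, double w) {
        int nearest = nearestCentroid(x);
        double d = nearest < 0 ? Double.POSITIVE_INFINITY
                               : distance(x, centroids.get(nearest));
        if (nearest >= 0 && rand.nextDouble() > d / beta) {
            // Fold the point into its nearest centroid as a weighted mean.
            double[] c = centroids.get(nearest);
            double cw = weights.get(nearest);
            for (int i = 0; i < c.length; i++) {
                c[i] = (c[i] * cw + x[i] * w) / (cw + w);
            }
            weights.set(nearest, cw + w);
        } else {
            // Far from everything (or empty sketch): start a new centroid.
            centroids.add(x.clone());
            weights.add(w);
        }
    }

    /** Recursively re-cluster the current centroids with the same rule. */
    private void collapse() {
        List<double[]> oldCentroids = centroids;
        List<Double> oldWeights = weights;
        centroids = new ArrayList<>();
        weights = new ArrayList<>();
        for (int i = 0; i < oldCentroids.size(); i++) {
            update(oldCentroids.get(i), oldWeights.get(i));
        }
    }

    private int nearestCentroid(double[] x) {
        // Brute force for clarity; a ProjectionSearch-style index makes
        // this sub-linear in the number of centroids.
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(x, centroids.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

In the two-step method of the previous slide, the weighted centroids this sketch accumulates are exactly what the in-memory second pass then clusters down to the final k.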
How It Works
• The result is a large set of weighted centroids
  • these approximate the original distribution
  • we can cluster the centroids to get a close approximation of clustering the original data
  • or we can just use the result directly
Warning, Recursive Descent
• The inner loop requires finding the nearest centroid
• With lots of centroids, this is slow
• But wait, we have classes to accelerate that! (Let’s not use the k-means searcher, though.)
• Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
• The map-reduce implementation is nearly trivial (schematic below)
• Map: rough-cluster the input data; output ß and the weighted centroids
• Reduce:
  • a single reducer gets all centroids
  • if there are too many centroids, merge them using recursive clustering
  • optionally do the final clustering in-memory
• A combiner is possible, but essentially never important
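A schematic of that layout in plain Java, reusing the StreamingKMeansSketch class from the earlier sketch in place of real Hadoop mappers and reducers; the method names and the single initial ß value are assumptions, not the actual implementation.

```java
import java.util.List;

/** Plain-Java stand-in for the map-reduce layout; illustrative only. */
public class MapReduceLayout {
    /** "Map": run the streaming pass over one input split,
     *  yielding that split's weighted centroids. */
    static StreamingKMeansSketch mapSplit(List<double[]> split) {
        StreamingKMeansSketch sketch = new StreamingKMeansSketch(1.0);
        for (double[] x : split) {
            sketch.add(x);
        }
        return sketch;
    }

    /** "Reduce": a single reducer folds every mapper's weighted centroids
     *  into one sketch; the collapse step inside the sketch performs the
     *  recursive merge clustering when there are too many. */
    static StreamingKMeansSketch reduce(List<StreamingKMeansSketch> parts) {
        StreamingKMeansSketch merged = new StreamingKMeansSketch(1.0);
        for (StreamingKMeansSketch part : parts) {
            List<double[]> cs = part.centroids();
            List<Double> ws = part.weights();
            for (int i = 0; i < cs.size(); i++) {
                merged.add(cs.get(i), ws.get(i));
            }
        }
        return merged;  // optionally followed by an in-memory final clustering
    }
}
```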
Contact
• tdunning@maprtech.com
• @ted_dunning
Slides and such: http://info.mapr.com/ted-mlconf.html
Hash tags: #mlconf #mahout #mapr