Large-scale Single-pass k-Means Clustering
Goals
• Cluster very large data sets
• Facilitate large-scale nearest neighbor search
• Allow a very large number of clusters
• Achieve good quality
  • low average distance to the nearest centroid on held-out data
• Based on Mahout Math
• Runs on a Hadoop (really MapR) cluster
• FAST: clusters tens of millions of points in minutes
Non-goals
• Use map-reduce (but it is there)
• Minimize the number of clusters
• Support metrics other than L2
Anti-goals
• Multiple passes over the original data
• Scale as O(k n)
What’s that?
• Find the k nearest training examples
• Use the average value of the target variable from them (a naive sketch follows below)
• This is easy … but hard
  • easy because it is conceptually simple: no knobs to turn, no models to build
  • hard because of the stunning amount of arithmetic
  • also hard because we need the top 50,000 results
• The initial prototype was massively too slow
  • 3K queries x 200K examples took hours
  • we needed 20M x 25M in the same time
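To make the baseline concrete, here is a minimal brute-force k-NN regression sketch; it is exactly the kind of naive implementation that was massively too slow, and every name in it is hypothetical rather than from the actual prototype.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Naive k-NN regression; illustrative only, hypothetical names. */
public class NaiveKnn {
    /** Average the targets of the k training examples nearest the query. */
    static double predict(double[][] examples, double[] targets,
                          double[] query, int k) {
        Integer[] order = new Integer[examples.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // A full sort per query: this per-query cost is what motivates
        // the searchers and clustering in the rest of the deck.
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(examples[i], query)));
        double sum = 0;
        for (int i = 0; i < k; i++) {
            sum += targets[order[i]];
        }
        return sum / k;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }
}
```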
How We Did It
• A 2-week hackathon with 6 developers from a customer bank
• Agile-ish development
• To avoid IP issues:
  • all code is Apache licensed (no ownership question)
  • all data is synthetic (no question of private data)
  • all development done on individual machines, hosted on GitHub
  • open is easier than closed (in this case)
• Goal is new open technology to facilitate new closed solutions
• Ambitious goal of ~1,000,000x speedup
  • well, really only 100-1000x after basic hygiene
What We Did
• Mechanism for extending Mahout Vectors
  • DelegatingVector, WeightedVector, Centroid
• Shared-memory matrix
  • FileBasedMatrix uses mmap to share very large dense matrices
• Searcher interface (sketched below)
  • ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
  • Kmeans, StreamingKmeans
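The Searcher abstraction is the hinge of the design: every nearest-neighbor strategy sits behind one contract. The shape below is a guess at the essentials, not the actual Mahout Math interface (which works on Mahout Vectors, not arrays).

```java
import java.util.List;

/** Illustrative shape of a Searcher contract; not the real Mahout API. */
public interface Searcher {
    /** Add one point to the searchable index. */
    void add(double[] point);

    /** Return the k indexed points nearest to the query. */
    List<double[]> search(double[] query, int k);
}
```

Because Brute, ProjectionSearch, and the k-means searcher all honor the same contract, the clustering code can swap its inner-loop search strategy without changing anything else.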
Projection Search
• Project the data onto one or more random vectors
• Keep each projection in sorted order (java.util.TreeSet!); near neighbors of a query tend to land nearby in projected order
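A minimal single-projection sketch of the idea, using a TreeMap rather than the deck's TreeSet so each point can carry its coordinates; all names are illustrative, and a real implementation uses several projections and handles duplicate keys.

```java
import java.util.Iterator;
import java.util.Random;
import java.util.TreeMap;

/** One random projection kept in sorted order; illustrative only. */
public class ProjectionIndex {
    private final double[] direction;                      // random unit vector
    private final TreeMap<Double, double[]> index = new TreeMap<>();

    public ProjectionIndex(int dim, Random rand) {
        direction = new double[dim];
        double norm = 0;
        for (int i = 0; i < dim; i++) {
            direction[i] = rand.nextGaussian();
            norm += direction[i] * direction[i];
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < dim; i++) {
            direction[i] /= norm;
        }
    }

    public void add(double[] x) {
        // Collisions on the projected value are ignored here; a real
        // index keeps a multiset or breaks ties with a secondary key.
        index.put(dot(x, direction), x);
    }

    /** Check a window of candidates on both sides of the query's
     *  projected position; the true nearest neighbor is usually there. */
    public double[] searchNearest(double[] q, int window) {
        double key = dot(q, direction);
        Iterator<double[]> up = index.tailMap(key, true).values().iterator();
        Iterator<double[]> down =
            index.headMap(key, false).descendingMap().values().iterator();
        double[] best = null;
        for (int i = 0; i < window; i++) {
            if (up.hasNext()) {
                best = closer(q, up.next(), best);
            }
            if (down.hasNext()) {
                best = closer(q, down.next(), best);
            }
        }
        return best;
    }

    private static double[] closer(double[] q, double[] candidate, double[] best) {
        return (best == null || dist(q, candidate) < dist(q, best)) ? candidate : best;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }
}
```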
K-means Search
• Simple idea (sketched below):
  • pre-cluster the data
  • to find the nearest points, search only the nearest clusters
• Recursive application:
  • to search within a cluster, use a Searcher!
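A minimal non-recursive version of the pruning idea, assuming brute-force search inside each cluster; in the recursive form, that inner scan would itself be any Searcher. All names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Cluster-pruned nearest-neighbor search; all names are illustrative. */
public class KmeansSearchSketch {
    private final double[][] centroids;          // from a pre-clustering pass
    private final List<List<double[]>> members;  // points assigned to each cluster

    public KmeansSearchSketch(double[][] centroids, List<List<double[]>> members) {
        this.centroids = centroids;
        this.members = members;
    }

    /** Scan only the points in the `probes` clusters nearest the query. */
    public double[] searchNearest(double[] q, int probes) {
        // Rank clusters by centroid distance and search the closest few.
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(q, centroids[i])));

        double[] best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int p = 0; p < Math.min(probes, order.length); p++) {
            // Brute force inside the cluster; recursively, this inner
            // loop could be another Searcher, which is the slide's point.
            for (double[] x : members.get(order[p])) {
                double d = dist(q, x);
                if (d < bestDist) {
                    bestDist = d;
                    best = x;
                }
            }
        }
        return best;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }
}
```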
But This Requires k-means!
• We need a new k-means algorithm to get speed
  • Hadoop is very slow at iterative map-reduce
  • maybe Pregel clones like Giraph would be better
  • or maybe not
• Streaming k-means is:
  • one pass (through the original data)
  • very fast (20 µs per data point with threads)
  • very parallelizable
Basic Method
• Use a single pass of k-means with very many clusters
  • the output is a bad-ish clustering but a good surrogate for the data
• Use the weighted centroids from step 1 to do in-memory clustering
  • the output is a good clustering with fewer clusters
Algorithmic Details (a Java sketch follows below)

for each data point xₙ:
    compute the distance ∂ to the nearest centroid
    sample u ~ Uniform(0, 1)
    if u > ∂ / ß:
        add xₙ to the nearest centroid
    else:
        create a new centroid from xₙ
    if the number of centroids > 10 log n:
        recursively cluster the centroids
        set ß = 1.5 ß if the number of centroids did not decrease
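A minimal, self-contained Java sketch of this loop, with `beta` standing for ß and `d` for ∂. It assumes brute-force nearest-centroid search for clarity (the deck's point is that a projection Searcher goes there instead); the class and method names are illustrative, not the actual Mahout StreamingKmeans API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** One-pass streaming k-means sketch; illustrative, not the Mahout API. */
public class StreamingKMeansSketch {
    private List<double[]> centroids = new ArrayList<>();
    private List<Double> weights = new ArrayList<>();
    private final Random rand = new Random();
    private double beta;   // distance cutoff ß; grows when collapsing fails
    private long n = 0;    // points seen so far

    public StreamingKMeansSketch(double initialBeta) {
        this.beta = initialBeta;
    }

    /** Feed one (optionally weighted) point through the streaming update. */
    public void add(double[] x, double w) {
        n++;
        update(x, w);
        // Too many centroids? Cluster the centroids themselves, then
        // loosen the cutoff if that did not shrink the sketch.
        if (centroids.size() > 10 * Math.log(n + 1)) {
            int before = centroids.size();
            collapse();
            if (centroids.size() >= before) {
                beta *= 1.5;
            }
        }
    }

    public void add(double[] x) { add(x, 1.0); }

    public List<double[]> centroids() { return centroids; }
    public List<Double> weights() { return weights; }

    private void update(double[] x, double w) {
        int nearest = nearestCentroid(x);
        double d = nearest < 0 ? Double.POSITIVE_INFINITY
                               : distance(x, centroids.get(nearest));
        if (nearest >= 0 && rand.nextDouble() > d / beta) {
            // Fold the point into its nearest centroid as a weighted mean.
            double[] c = centroids.get(nearest);
            double cw = weights.get(nearest);
            for (int i = 0; i < c.length; i++) {
                c[i] = (c[i] * cw + x[i] * w) / (cw + w);
            }
            weights.set(nearest, cw + w);
        } else {
            // Far from everything (or empty sketch): start a new centroid.
            centroids.add(x.clone());
            weights.add(w);
        }
    }

    /** Recursively re-cluster the current centroids with the same rule. */
    private void collapse() {
        List<double[]> oldCentroids = centroids;
        List<Double> oldWeights = weights;
        centroids = new ArrayList<>();
        weights = new ArrayList<>();
        for (int i = 0; i < oldCentroids.size(); i++) {
            update(oldCentroids.get(i), oldWeights.get(i));
        }
    }

    private int nearestCentroid(double[] x) {
        // Brute force for clarity; a ProjectionSearch-style index makes
        // this sub-linear in the number of centroids.
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(x, centroids.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

In the two-step method of the previous slide, the weighted centroids this sketch accumulates are exactly what the in-memory second pass then clusters down to the final k.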
How It Works
• The result is a large set of weighted centroids
  • these approximate the original distribution
  • we can cluster the centroids to get a close approximation of clustering the original data
  • or we can just use the result directly
Warning, Recursive Descent
• The inner loop requires finding the nearest centroid
• With lots of centroids, this is slow
• But wait, we have classes to accelerate that! (Let’s not use the k-means searcher, though.)
• Empirically, projection search beats 64-bit LSH by a bit
Moving to Scale
• The map-reduce implementation is nearly trivial (schematic below)
• Map: rough-cluster the input data; output ß and the weighted centroids
• Reduce:
  • a single reducer gets all centroids
  • if there are too many centroids, merge them using recursive clustering
  • optionally do the final clustering in-memory
• A combiner is possible, but essentially never important
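A schematic of that layout in plain Java, reusing the StreamingKMeansSketch class from the earlier sketch in place of real Hadoop mappers and reducers; the method names and the single initial ß value are assumptions, not the actual implementation.

```java
import java.util.List;

/** Plain-Java stand-in for the map-reduce layout; illustrative only. */
public class MapReduceLayout {
    /** "Map": run the streaming pass over one input split,
     *  yielding that split's weighted centroids. */
    static StreamingKMeansSketch mapSplit(List<double[]> split) {
        StreamingKMeansSketch sketch = new StreamingKMeansSketch(1.0);
        for (double[] x : split) {
            sketch.add(x);
        }
        return sketch;
    }

    /** "Reduce": a single reducer folds every mapper's weighted centroids
     *  into one sketch; the collapse step inside the sketch performs the
     *  recursive merge clustering when there are too many. */
    static StreamingKMeansSketch reduce(List<StreamingKMeansSketch> parts) {
        StreamingKMeansSketch merged = new StreamingKMeansSketch(1.0);
        for (StreamingKMeansSketch part : parts) {
            List<double[]> cs = part.centroids();
            List<Double> ws = part.weights();
            for (int i = 0; i < cs.size(); i++) {
                merged.add(cs.get(i), ws.get(i));
            }
        }
        return merged;  // optionally followed by an in-memory final clustering
    }
}
```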
Contact
• tdunning@maprtech.com
• @ted_dunning
Slides and such: http://info.mapr.com/ted-mlconf.html
Hash tags: #mlconf #mahout #mapr