Which Algorithms Really Matter?
Me, Us • Ted Dunning, Chief Application Architect, MapR; committer and PMC member, Mahout, ZooKeeper, Drill; bought the beer at the first HUG • MapR distributes more open source components for Hadoop and adds major technology for performance, HA, and industry-standard APIs • Info: hash tag #mapr; see also @ApacheMahout @ApacheDrill @ted_dunning and @mapR
Topic For Today • What is important? What is not? • Why? • What is the difference from academic research? • Some examples
What is Important? • Deployable • Robust • Transparent • Skillset and mindset matched? • Proportionate
What is Important? • Deployable • Clever prototypes don’t count if they can’t be standardized • Robust • Transparent • Skillset and mindset matched? • Proportionate
What is Important? • Deployable • Clever prototypes don’t count • Robust • Mishandling is common • Transparent • Will degradation be obvious? • Skillset and mindset matched? • Proportionate
What is Important? • Deployable • Clever prototypes don’t count • Robust • Mishandling is common • Transparent • Will degradation be obvious? • Skillset and mindset matched? • How long will your fancy data scientist enjoy doing standard ops tasks? • Proportionate • Where is the highest value per minute of effort?
Academic Goals vs Pragmatics • Academic goals • Reproducible • Isolate theoretically important aspects • Work on novel problems • Pragmatics • Highest net value • Available data is constantly changing • Diligence and consistency have larger impact than cleverness • Many systems feed themselves, exploration and exploitation are both important • Engineering constraints on budget and schedule
Example 1: Making Recommendations Better
Recommendation Advances • What are the most important algorithmic advances in recommendations over the last 10 years? • Co-occurrence analysis? • Matrix completion via factorization? • Latent factor log-linear models? • Temporal dynamics?
The Winner – None of the Above • What are the most important algorithmic advances in recommendations over the last 10 years? 1. Result dithering 2. Anti-flood
The Real Issues • Exploration • Diversity • Speed • Not the last fraction of a percent
Result Dithering • Dithering is used to re-order recommendation results • Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better
Result Dithering • Dithering is used to re-order recommendation results • Re-ordering is done randomly • Dithering is guaranteed to make off-line performance worse • Dithering also has a near perfect record of making actual performance much better “Made more difference than any other change”
Simple Dithering Algorithm • Generate a synthetic score from log rank plus Gaussian noise • Pick the noise scale to provide the desired level of mixing • Typically … • Oh… use floor(t/T) as the random seed so the ordering stays the same within each time window of length T
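A minimal Python sketch of the dithering recipe above. The noise scale epsilon, the window length, and the function name are illustrative choices, not values from the talk.

```python
import math
import random
import time

def dither(ranked_items, epsilon=0.5, period=3600):
    """Re-order already-ranked recommendations by adding Gaussian noise
    to the log of the rank (a sketch of the dithering idea above)."""
    # Seed with floor(t / T) so every request in the same time window
    # sees the same shuffle; the ordering changes between windows.
    rng = random.Random(math.floor(time.time() / period))
    scored = [
        (math.log(rank + 1) + rng.gauss(0, epsilon), item)
        for rank, item in enumerate(ranked_items)
    ]
    return [item for _, item in sorted(scored)]

# Top results mostly stay near the top; deeper results occasionally
# surface, which is the exploration we want.
print(dither(["a", "b", "c", "d", "e", "f", "g", "h"]))
```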
Lesson 1: Exploration is good
Example 2: Bayesian Bandits
Bayesian Bandits • Based on Thompson sampling • Very general sequential test • Near optimal regret • Trade-off exploration and exploitation • Possibly best known solution for exploration/exploitation • Incredibly simple
Thompson Sampling • Select each shell according to the probability that it is the best • The probability that it is the best can be computed from the posterior • But I promised a simple answer
Thompson Sampling – Take 2 • Sample parameters θ from the posterior given the history so far • Pick the option i that maximizes expected reward under the sampled θ • Record the result of using i and update the posterior
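A minimal sketch of Thompson sampling for Bernoulli rewards (e.g. clicks), assuming a Beta posterior per option; the class name, priors, and toy reward rates are illustrative, not from the talk.

```python
import random

class ThompsonSampler:
    def __init__(self, n_options):
        # Beta(1, 1) prior = uniform belief over each option's reward rate.
        self.alpha = [1.0] * n_options
        self.beta = [1.0] * n_options

    def choose(self):
        # Sample theta from each posterior; pick the option whose sample is best.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, i, reward):
        # Record the result of using option i and update its posterior.
        if reward:
            self.alpha[i] += 1
        else:
            self.beta[i] += 1

# Usage: the true rates are hidden; the sampler concentrates on the best
# option while still exploring the others early on.
true_rates = [0.04, 0.05, 0.08]
bandit = ThompsonSampler(len(true_rates))
for _ in range(10000):
    i = bandit.choose()
    bandit.update(i, random.random() < true_rates[i])
print(bandit.alpha, bandit.beta)
```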
Thompson Sampling on Ads. An Empirical Evaluation of Thompson Sampling, Chapelle and Li, 2011.
Bayesian Bandits versus Result Dithering • Many useful systems are difficult to frame in fully Bayesian form • Thompson sampling cannot be applied without posterior sampling • Can still do useful exploration with dithering • But better to use Thompson sampling if possible
Lesson 2: Exploration is pretty easy to do and pays big benefits.
Example 3: On-line Clustering
The Problem • K-means clustering is useful for feature extraction or compression • At scale and at high dimension, the desirable number of clusters increases • Very large number of clusters may require more passes through the data • Super-linear scaling is generally infeasible
The Solution • Sketch-based algorithms produce a sketch of the data • Streaming k-means uses adaptive dp-means to produce this sketch in the form of many weighted centroids which approximate the original distribution • The size of the sketch grows very slowly with increasing data size • Many operations such as clustering are well behaved on sketches • References: Shindler, Wong, and Meyerson, Fast and Accurate k-means For Large Datasets; Kulis and Jordan, Revisiting k-means: New Algorithms via Bayesian Nonparametrics.
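A simplified, dp-means-style illustration of the sketching idea: one pass over the data, a new centroid only when a point is far from every existing centroid, a weighted-mean update otherwise. This is not a production streaming k-means implementation (which also adapts the threshold and uses approximate search); the threshold and names are illustrative.

```python
import numpy as np

def build_sketch(points, threshold):
    """One-pass sketch: returns a list of (centroid, weight) pairs that
    approximate the distribution of the input points."""
    centroids = []  # list of (centroid vector, weight)
    for x in points:
        if not centroids:
            centroids.append((x.astype(float), 1.0))
            continue
        dists = [np.linalg.norm(x - c) for c, _ in centroids]
        j = int(np.argmin(dists))
        if dists[j] > threshold:
            # Point is far from everything seen so far: start a new centroid.
            centroids.append((x.astype(float), 1.0))
        else:
            # Fold the point into its nearest centroid (weighted mean update).
            c, w = centroids[j]
            centroids[j] = (c + (x - c) / (w + 1), w + 1)
    return centroids

# The small weighted sketch (roughly k log N centroids) can then be
# clustered in memory with ordinary k-means.
data = np.random.randn(10000, 2)
print(len(build_sketch(data, threshold=1.5)))
```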
The Cluster Proximity Features • Every point can be described by its nearest cluster: 4.3 bits per point in this case, with significant error that can be decreased (to a point) by increasing the number of clusters • Or by its proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities): the error is negligible and the data is unwound into a simple representation • Or we can increase the number of clusters (an n-fold increase adds log n bits per point and decreases the error by sqrt(n))
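A small sketch of the cluster-proximity encoding described above: each point is replaced by its nearest cluster id plus its distances to the two nearest centroids. Function and variable names are illustrative.

```python
import numpy as np

def proximity_features(points, centroids):
    """Encode each point as (nearest cluster id, distance to nearest,
    distance to second nearest)."""
    feats = []
    for x in points:
        d = np.linalg.norm(centroids - x, axis=1)
        order = np.argsort(d)
        nearest, second = order[0], order[1]
        feats.append((int(nearest), float(d[nearest]), float(d[second])))
    return feats

# Toy usage with three centroids in 2-D.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.array([[0.2, 0.1], [4.6, 5.2]])
print(proximity_features(points, centroids))
```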
Typical k-means Failure • Selecting two seeds in the same region (as in the figure) cannot be fixed by Lloyd's algorithm • The result is that two distinct clusters get glued together
Streaming k-means Ideas • By using a sketch with lots (k log N) of centroids, we avoid pathological cases • We still get a very good result if the sketch is created • in one pass • with approximate search • In fact, adaptive dp-means works just fine • In the end, the sketch can be used for clustering or …
Lesson 3: Sketches make big data small.
Example 4: Search Abuse
Recommendations • Alice got an apple and a puppy • Charles got a bicycle
Recommendations • Alice got an apple and a puppy • Bob got an apple • Charles got a bicycle
Recommendations • What else would Bob like?
Log Files: the raw interaction log entries for Alice, Bob, and Charles (diagram)
History Matrix: Users by Items. One row per user (Alice, Bob, Charles), with a ✔ marking each item that user has interacted with (diagram).
Co-occurrence Matrix: Items by Items. Counts of how often each pair of items appears together in user histories (matrix figure omitted). How do you tell which co-occurrences are useful?
Co-occurrence Binary Matrix (figure omitted)
Indicator Matrix: Anomalous Co-Occurrence. Result: the marked row will be added to the indicator field in the item document…
Indicator Matrix. That one row from the indicator matrix becomes the indicator field in the Solr document used to deploy the recommendation engine.
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet
indicators: (t1)
Note: data for the indicator field is added directly to the metadata for a document in the Solr index. You don't need to create a separate index for the indicators.
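To make the pipeline concrete, here is a minimal, illustrative sketch of the co-occurrence-to-indicator step: count item-item co-occurrence from user histories, keep only the anomalous pairs using the log-likelihood ratio test, and attach the surviving items as an indicators field on each item document. The tiny data set, the LLR threshold, and the field names are illustrative; a real deployment would index these documents in Solr as described above.

```python
import math
from collections import defaultdict

# Toy user histories standing in for the log files above.
histories = {
    "alice":   {"apple", "puppy"},
    "bob":     {"apple"},
    "charles": {"bicycle"},
}

def entropy(*counts):
    total = float(sum(counts))
    return sum(-c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    # 2 * N * mutual information of the 2x2 contingency table
    # (the log-likelihood ratio test for anomalous co-occurrence).
    return 2 * (k11 + k12 + k21 + k22) * (
        entropy(k11 + k12, k21 + k22)
        + entropy(k11 + k21, k12 + k22)
        - entropy(k11, k12, k21, k22)
    )

# Count per-item occurrences and pairwise co-occurrences.
item_count = defaultdict(int)
cooc = defaultdict(int)
n_users = len(histories)
for items in histories.values():
    for a in items:
        item_count[a] += 1
        for b in items:
            if a != b:
                cooc[(a, b)] += 1

# Keep only the statistically interesting co-occurrences as indicators.
indicators = defaultdict(list)
for (a, b), k11 in cooc.items():
    k12 = item_count[a] - k11           # a without b
    k21 = item_count[b] - k11           # b without a
    k22 = n_users - k11 - k12 - k21     # neither
    if llr(k11, k12, k21, k22) > 1.0:   # illustrative threshold
        indicators[a].append(b)

# Each item document simply gains an extra field; searching the
# indicators field with a user's recent items returns recommendations.
doc = {"id": "puppy", "title": "puppy", "indicators": indicators["puppy"]}
print(doc)
```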