Machine Learning with Apache Hama • Tommaso Teofili • tommaso [at] apache [dot] org
About me • ASF member having fun with: • Lucene / Solr • Hama • UIMA • Stanbol • … some others • SW engineer @ Adobe R&D
Agenda • Apache Hama and BSP • Why machine learning on BSP • Some examples • Benchmarks
Apache Hama • Bulk Synchronous Parallel computing framework on top of HDFS for massive scientific computations • TLP since May 2012 • 0.6.0 release out soon • Growing community
BSP supersteps • A BSP algorithm is composed of a sequence of “supersteps”
BSP supersteps • Each task • Superstep 1 • Do some computation • Communicate with other tasks • Synchronize • Superstep 2 • Do some computation • Communicate with other tasks • Synchronize • … • … • … • Superstep N • Do some computation • Communicate with other tasks • Synchronize
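The superstep loop above can be mimicked with plain Java threads: each task computes on its local data, drops messages into every task's inbox, then blocks on a barrier before the next superstep begins. This is only a sketch of the model, not Hama code (a real job would use `BSPPeer.send()` / `sync()`); the class name and the two-superstep sum are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.*;

public class SuperstepSketch {

    // One inbox of messages per task, filled in one superstep,
    // drained in the next one.
    static List<ConcurrentLinkedQueue<Double>> inboxes;

    public static double[] run(double[][] localData) throws Exception {
        int tasks = localData.length;
        inboxes = new java.util.ArrayList<>();
        for (int i = 0; i < tasks; i++) inboxes.add(new ConcurrentLinkedQueue<>());
        CyclicBarrier barrier = new CyclicBarrier(tasks); // the "synchronize" step
        double[] totals = new double[tasks];
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        List<Future<?>> futures = new java.util.ArrayList<>();
        for (int t = 0; t < tasks; t++) {
            final int id = t;
            futures.add(pool.submit(() -> {
                // Superstep 1: compute a local sum, broadcast it to every task
                double local = 0;
                for (double v : localData[id]) local += v;
                for (ConcurrentLinkedQueue<Double> inbox : inboxes) inbox.add(local);
                barrier.await(); // all tasks reach the sync point
                // Superstep 2: aggregate the received partial sums
                double total = 0;
                for (Double part : inboxes.get(id)) total += part;
                totals[id] = total;
                return null;
            }));
        }
        for (Future<?> f : futures) f.get();
        pool.shutdown();
        return totals; // every task ends up with the same global sum
    }
}
```

The `CyclicBarrier` plays the role of the synchronization step: no task starts superstep 2 until every task has finished superstep 1.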
Why BSP • Simple programming model • Superstep semantics are easy • Preserves data locality • Improves performance • Well suited for iterative algorithms
Apache Hama architecture • BSP Program execution flow
Apache Hama • Features • BSP API • M/R like I/O API • Graph API • Job management / monitoring • Checkpoint recovery • Local & (Pseudo) Distributed run modes • Pluggable message transfer architecture • YARN supported • Running in Apache Whirr
Apache Hama BSP API • public abstract class BSP&lt;K1, V1, K2, V2, M extends Writable&gt; … • K1, V1 are the input key and value types • K2, V2 are the output key and value types • M is the type of the messages used for task communication
Apache Hama BSP API • public void bsp(BSPPeer&lt;K1, V1, K2, V2, M&gt; peer) throws .. • public void setup(BSPPeer&lt;K1, V1, K2, V2, M&gt; peer) throws .. • public void cleanup(BSPPeer&lt;K1, V1, K2, V2, M&gt; peer) throws ..
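To see how the five type parameters line up, here is a hand-rolled stand-in for the API shape above. These are NOT the real `org.apache.hama.bsp` classes, only simplified mimics with the method names from the slide; `ExampleBSP` and `DoubleMsg` are invented for the sketch.

```java
import java.io.IOException;

// Simplified stand-ins for the real Hama interfaces.
interface Writable {}

interface BSPPeer<K1, V1, K2, V2, M extends Writable> {
    void send(String peerName, M message) throws IOException;
    void sync() throws IOException, InterruptedException;
}

abstract class BSP<K1, V1, K2, V2, M extends Writable> {
    public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException {}
    public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer)
            throws IOException, InterruptedException;
    public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException {}
}

// A message type for task communication (the M parameter).
class DoubleMsg implements Writable {
    final double value;
    DoubleMsg(double value) { this.value = value; }
}

// Example binding: text keys/values in, text key with double value out,
// DoubleMsg messages between tasks.
class ExampleBSP extends BSP<String, String, String, Double, DoubleMsg> {
    boolean ran = false;
    @Override
    public void bsp(BSPPeer<String, String, String, Double, DoubleMsg> peer)
            throws IOException, InterruptedException {
        // computation, peer.send(...), peer.sync() would go here
        ran = true;
    }
}
```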
Machine learning on BSP • Lots (most?) of ML algorithms are inherently iterative • The Hama ML module currently includes • Collaborative filtering • Clustering • Gradient descent
Benchmarking architecture • Diagram: cluster nodes running Hama alongside Solr / Lucene, a DBMS and Mahout, on top of HDFS
Collaborative filtering • Given user preferences on movies • We want to find users “near” to some specific user • So that the user can “follow” them • And/or see what they like (which he/she could like too)
Collaborative filtering BSP • Given a specific user • Iteratively (for each task) • Superstep 1*i • Read a new user preference row • Find how near that user is to the current user • That is, find how near their preferences are • Since they are given as vectors we may use vector distance measures like Euclidean distance, cosine distance, etc. • Broadcast the measure output to other peers • Superstep 2*i • Aggregate measure outputs • Update most relevant users • Still to be committed (HAMA-612)
Collaborative filtering BSP • Given user ratings about movies • "john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8 • "paula" -> 7, 3, 8, 2, 8.5, 0, 0 • "jim" -> 4, 5, 0, 5, 8, 0, 1.5 • "tom" -> 9, 4, 9, 1, 5, 0, 8 • "timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0 • We ask for the 2 nearest users to “paula” and we get “timothy” and “tom” • User recommendation • We can extract movies that “timothy” and “tom” rated highly and “paula” didn’t see • Item recommendation
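Rerunning the slide's numbers off-line: with plain Euclidean distance (one of the measures the previous slide mentions) the two users nearest to “paula” do come out as “timothy” and “tom”. The class and method names below are made up for this sketch; the BSP version would score one preference row per superstep and broadcast the result instead of looping.

```java
import java.util.*;
import java.util.stream.Collectors;

public class NearestUsers {

    // Euclidean distance between two rating vectors.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // The k users whose rating vectors are nearest to the given user.
    public static List<String> nearest(Map<String, double[]> ratings,
                                       String user, int k) {
        double[] target = ratings.get(user);
        return ratings.entrySet().stream()
                .filter(e -> !e.getKey().equals(user))
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> euclidean(e.getValue(), target)))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```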
Benchmarks • Fairly simple algorithm • Highly iterative • Compared to Apache Mahout • Behaves better than ALS-WR • Behaves similarly to RecommenderJob and ItemSimilarityJob
K-Means clustering • We have a bunch of data (e.g. documents) • We want to group those docs in k homogeneous clusters • Iteratively • Assign each doc to its nearest cluster center • Recalculate each cluster center from its assigned docs
K-Means clustering BSP • Iteratively • Superstep 1*i • Assignment phase • Read vector splits • Sum up temporary centers with assigned vectors • Broadcast sum and ingested vectors count • Superstep 2*i • Update phase • Calculate the total sum over all received messages and average • Replace old centers with new centers and check for convergence
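Stripped of the BSP plumbing, the two supersteps reduce to the per-center sums and counts each task would broadcast (assignment), followed by averaging those sums into new centers (update). A sequential sketch under those assumptions, not Hama's actual ml code:

```java
public class KMeansStep {

    // Index of the squared-Euclidean-nearest center for a vector.
    static int nearestCenter(double[] v, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int i = 0; i < v.length; i++) {
                double d = v[i] - centers[c][i];
                dist += d * d;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // One assignment + update iteration; returns the new centers.
    public static double[][] iterate(double[][] vectors, double[][] centers) {
        int k = centers.length, dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        // "Superstep 1": accumulate sums and counts per center
        for (double[] v : vectors) {
            int c = nearestCenter(v, centers);
            for (int i = 0; i < dim; i++) sums[c][i] += v[i];
            counts[c]++;
        }
        // "Superstep 2": average the aggregated sums into new centers
        double[][] next = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int i = 0; i < dim; i++)
                next[c][i] = counts[c] == 0 ? centers[c][i] : sums[c][i] / counts[c];
        return next;
    }
}
```

Convergence checking (the last bullet above) amounts to comparing `next` against `centers` and stopping when they stop moving.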
Benchmarks • One rack (16 nodes, 256 cores) cluster • 10G network • On average faster than Mahout’s implementation
Gradient descent • Optimization algorithm • Find a (local) minimum of some function • Used for • solving linear systems • solving non-linear systems • machine learning tasks • linear regression • logistic regression • neural network backpropagation • …
Gradient descent • Minimize a given (cost) function • Give the function a starting point (set of parameters) • Iteratively change parameters in order to minimize the function • Stop at the (local) minimum • There’s some math but intuitively: • evaluate derivatives at a given point in order to choose where to “go” next
Gradient descent BSP • Iteratively • Superstep 1*i • each task calculates and broadcasts portions of the cost function with the current parameters • Superstep 2*i • aggregate and update cost function • check the aggregated cost and iteration count • cost should always decrease • Superstep 3*i • each task calculates and broadcasts portions of (partial) derivatives • Superstep 4*i • aggregate and update parameters
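The per-task "portions" are just partial sums of the cost and of the gradient. Run sequentially, one pass through the four supersteps collapses to the loop below: batch gradient descent on squared error for a linear model. The class name, learning rate and iteration count are illustrative choices, not Hama settings.

```java
public class GradientDescentSketch {

    // Fit t0 + t1 * x by batch gradient descent on mean squared error.
    public static double[] fit(double[] x, double[] y, double lr, int iters) {
        int m = x.length;
        double t0 = 0, t1 = 0; // starting point for the parameters
        for (int it = 0; it < iters; it++) {
            // partial derivative portions (what each task would broadcast)
            double g0 = 0, g1 = 0;
            for (int i = 0; i < m; i++) {
                double err = t0 + t1 * x[i] - y[i];
                g0 += err;
                g1 += err * x[i];
            }
            // aggregate and update parameters (move against the gradient)
            t0 -= lr * g0 / m;
            t1 -= lr * g1 / m;
        }
        return new double[]{t0, t1};
    }
}
```

The "cost should always decrease" check from superstep 2 would sit inside the loop, aborting if the aggregated squared error ever goes up (a sign the learning rate is too large).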
Gradient descent BSP • Simplistic example • Linear regression • Given a real estate market dataset • Estimate prices of new houses given known houses’ size, geographic region and price • Expected output: actual parameters for the (linear) prediction function
Gradient descent BSP • Generate a different model for each region • House item vectors • price -> size • 150k -> 80 • 2 dimensional space • ~1.3M vectors dataset
Gradient descent BSP • Dataset and model fit (plot)
Gradient descent BSP • Cost checking (plot)
Gradient descent BSP • Classification • Logistic regression with gradient descent • Real estate market dataset • We want to find which estate listings belong to agencies • To avoid buying from them • Same algorithm • With a different cost function and features • Existing items are tagged (or not) as “belonging to agency” • Create vectors from items’ text • Sample vector • 1 -> 1 3 0 0 5 3 4 1
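"Same algorithm, different cost function" can be made concrete: the gradient descent loop stays as before, but the prediction goes through a sigmoid and the labels are 0/1 (agency or not). For the logistic cost the gradient happens to have the same shape as for squared error, with the sigmoid applied to the prediction. A single-feature sketch with made-up names and toy settings:

```java
public class LogisticRegressionSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Single-feature model: P(agency | x) = sigmoid(t0 + t1 * x)
    public static double[] fit(double[] x, int[] label, double lr, int iters) {
        int m = x.length;
        double t0 = 0, t1 = 0;
        for (int it = 0; it < iters; it++) {
            double g0 = 0, g1 = 0;
            for (int i = 0; i < m; i++) {
                // Gradient of the logistic cost: (prediction - label) * feature
                double err = sigmoid(t0 + t1 * x[i]) - label[i];
                g0 += err;
                g1 += err * x[i];
            }
            t0 -= lr * g0 / m;
            t1 -= lr * g1 / m;
        }
        return new double[]{t0, t1};
    }

    public static double predict(double[] theta, double x) {
        return sigmoid(theta[0] + theta[1] * x);
    }
}
```

A prediction above 0.5 would classify a listing as "belonging to agency".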
Gradient descent BSP • Classification (plot)
Benchmarks • Not directly comparable to Mahout’s regression algorithms • Both SGD and CGD are inherently better than plain GD • But Hama GD had on average the same performance as Mahout’s SGD / CGD • Next step is implementing SGD / CGD on top of Hama
Wrap up • Even if • the ML module is still “young” / work in progress • and tools like Apache Mahout have better “coverage” • Apache Hama can be particularly useful in certain “highly iterative” use cases • Interesting benchmarks
Thanks!