
Machine Learning with Apache Hama



Presentation Transcript


  1. Machine Learning with Apache Hama • Tommaso Teofili • tommaso [at] apache [dot] org

  2. About me • ASF member having fun with: • Lucene / Solr • Hama • UIMA • Stanbol • … some others • SW engineer @ Adobe R&D

  3. Agenda • Apache Hama and BSP • Why machine learning on BSP • Some examples • Benchmarks

  4. Apache Hama • Bulk Synchronous Parallel computing framework on top of HDFS for massive scientific computations • TLP since May 2012 • 0.6.0 release out soon • Growing community

  5. BSP supersteps • A BSP algorithm is composed of a sequence of “supersteps”

  6. BSP supersteps • Each task • Superstep 1 • Do some computation • Communicate with other tasks • Synchronize • Superstep 2 • Do some computation • Communicate with other tasks • Synchronize • … • … • … • Superstep N • Do some computation • Communicate with other tasks • Synchronize
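The compute / communicate / synchronize cycle of a single superstep can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Hama API, the class name `SuperstepSketch` and the "square the local value" computation are hypothetical, threads stand in for BSP tasks, and a `CyclicBarrier` stands in for the synchronization barrier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Plain-Java illustration of one BSP superstep (hypothetical names, not the Hama API).
final class SuperstepSketch {

    // Each task squares its local value, broadcasts it, waits at the barrier,
    // then sums everything it received. Returns the per-task sums.
    static int[] runSuperstep(int[] localValues) {
        int n = localValues.length;
        List<Queue<Integer>> inboxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inboxes.add(new ConcurrentLinkedQueue<>());
        }
        CyclicBarrier barrier = new CyclicBarrier(n);
        int[] results = new int[n];
        Thread[] tasks = new Thread[n];
        for (int id = 0; id < n; id++) {
            final int me = id;
            tasks[id] = new Thread(() -> {
                try {
                    int computed = localValues[me] * localValues[me]; // 1. do some computation
                    for (Queue<Integer> inbox : inboxes) {
                        inbox.add(computed);                           // 2. communicate with other tasks
                    }
                    barrier.await();                                   // 3. synchronize
                    int sum = 0;
                    for (int v : inboxes.get(me)) {
                        sum += v;                                      // messages are visible only after the barrier
                    }
                    results[me] = sum;
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            tasks[id].start();
        }
        for (Thread t : tasks) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(runSuperstep(new int[] {1, 2, 3})));
    }
}
```

The point of the barrier is that no task reads its inbox until every task has finished sending, which is what makes the superstep semantics easy to reason about.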

  7. Why BSP • Simple programming model • Superstep semantics are easy • Preserve data locality • Improve performance • Well suited for iterative algorithms

  8. Apache Hama architecture • BSP Program execution flow

  9. Apache Hama architecture

  10. Apache Hama • Features • BSP API • M/R like I/O API • Graph API • Job management / monitoring • Checkpoint recovery • Local & (Pseudo) Distributed run modes • Pluggable message transfer architecture • YARN supported • Running in Apache Whirr

  11. Apache Hama BSP API • public abstract class BSP<K1, V1, K2, V2, M extends Writable> … • K1, V1 are the key and value types for inputs • K2, V2 are the key and value types for outputs • M is the type of the messages used for task communication

  12. Apache Hama BSP API • public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws .. • public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws .. • public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..

  13. Machine learning on BSP • Lots (most?) of ML algorithms are inherently iterative • Hama ML module currently includes • Collaborative filtering • Clustering • Gradient descent

  14. Benchmarking architecture • [Diagram: four nodes; components Hama, Solr, DBMS, Lucene, Mahout, HDFS]

  15. Collaborative filtering • Given user preferences on movies • We want to find users “near” to some specific user • So that that user can “follow” them • And/or see what they like (which he/she could like too)

  16. Collaborative filtering BSP • Given a specific user • Iteratively (for each task) • Superstep 1*i • Read a new user preference row • Find how near that user is to the current user • That is, find how near their preferences are • Since they are given as vectors we may use vector distance measures such as Euclidean or cosine distance • Broadcast the measure output to other peers • Superstep 2*i • Aggregate measure outputs • Update most relevant users • Still to be committed (HAMA-612)

  17. Collaborative filtering BSP • Given user ratings about movies • "john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8 • "paula" -> 7, 3, 8, 2, 8.5, 0, 0 • "jim" -> 4, 5, 0, 5, 8, 0, 1.5 • "tom" -> 9, 4, 9, 1, 5, 0, 8 • "timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0 • We ask for the 2 nearest users to “paula” and we get “timothy” and “tom” • User recommendation • We can extract movies highly rated by “timothy” and “tom” that “paula” hasn’t seen • Item recommendation
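The nearest-user lookup on these ratings can be sketched in plain Java. This is a single-process illustration, not the HAMA-612 implementation: the class and method names are hypothetical, and Euclidean distance is an assumption (the slides also mention cosine). With Euclidean distance the two users nearest to "paula" come out as "timothy" and "tom", matching the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-process sketch of the "nearest users" step (hypothetical names, Euclidean distance assumed).
final class NearestUsersSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the k users whose rating vectors are closest to the target user's vector.
    static List<String> nearest(Map<String, double[]> ratings, String target, int k) {
        double[] targetVector = ratings.get(target);
        List<String> others = new ArrayList<>();
        for (String user : ratings.keySet()) {
            if (!user.equals(target)) {
                others.add(user);
            }
        }
        others.sort((String u1, String u2) -> Double.compare(
                euclidean(ratings.get(u1), targetVector),
                euclidean(ratings.get(u2), targetVector)));
        return new ArrayList<>(others.subList(0, Math.min(k, others.size())));
    }

    // The rating rows from the slide.
    static Map<String, double[]> slideRatings() {
        Map<String, double[]> ratings = new LinkedHashMap<>();
        ratings.put("john", new double[] {0, 0, 0, 9.5, 4.5, 9.5, 8});
        ratings.put("paula", new double[] {7, 3, 8, 2, 8.5, 0, 0});
        ratings.put("jim", new double[] {4, 5, 0, 5, 8, 0, 1.5});
        ratings.put("tom", new double[] {9, 4, 9, 1, 5, 0, 8});
        ratings.put("timothy", new double[] {7, 3, 5.5, 0, 9.5, 6.5, 0});
        return ratings;
    }

    public static void main(String[] args) {
        System.out.println(nearest(slideRatings(), "paula", 2));
    }
}
```

In the BSP version described above, each task would compute these distances only for the rows in its own split and broadcast them, with the sort happening in the aggregation superstep.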

  18. Benchmarks • Fairly simple algorithm • Highly iterative • Compared to Apache Mahout • Behaves better than ALS-WR • Behaves similarly to RecommenderJob and ItemSimilarityJob

  19. K-Means clustering • We have a bunch of data (e.g. documents) • We want to group those docs in k homogeneous clusters • Iteratively for each cluster • Calculate new cluster center • Add doc nearest to new center to the cluster

  20. K-Means clustering

  21. K-Means clustering BSP • Iteratively • Superstep 1*i • Assignment phase • Read vector splits • Sum up temporary centers with assigned vectors • Broadcast sum and ingested vectors count • Superstep 2*i • Update phase • Calculate the total sum over all received messages and average • Replace old centers with new centers and check for convergence
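The two phases can be sketched in plain Java, with a list of arrays standing in for the per-task vector splits. This is a sequential simulation under stated assumptions, not Hama's K-Means code: all names are hypothetical, squared Euclidean distance is assumed for the assignment, and an empty cluster simply keeps its old center.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sequential simulation of the assignment/update supersteps (hypothetical names).
final class KMeansSuperstepSketch {

    // Assignment phase for one task: per-center partial sums over its split.
    // Each row is {sum of coordinates..., count of assigned vectors}.
    static double[][] assign(double[][] split, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] partial = new double[k][dim + 1];
        for (double[] v : split) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int j = 0; j < dim; j++) {
                    d += (v[j] - centers[c][j]) * (v[j] - centers[c][j]);
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            for (int j = 0; j < dim; j++) {
                partial[best][j] += v[j];
            }
            partial[best][dim] += 1;
        }
        return partial;
    }

    // Update phase: total the received partial sums and average them into new centers.
    static double[][] update(List<double[][]> partials, double[][] oldCenters) {
        int k = oldCenters.length;
        int dim = oldCenters[0].length;
        double[][] centers = new double[k][dim];
        for (int c = 0; c < k; c++) {
            double count = 0;
            for (double[][] p : partials) {
                for (int j = 0; j < dim; j++) {
                    centers[c][j] += p[c][j];
                }
                count += p[c][dim];
            }
            for (int j = 0; j < dim; j++) {
                centers[c][j] = count > 0 ? centers[c][j] / count : oldCenters[c][j];
            }
        }
        return centers;
    }

    // Repeats both phases until the centers stop changing or maxIterations is reached.
    static double[][] cluster(List<double[][]> splits, double[][] centers, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            List<double[][]> partials = new ArrayList<>();
            for (double[][] split : splits) {
                partials.add(assign(split, centers)); // what each task would broadcast
            }
            double[][] next = update(partials, centers);
            boolean converged = Arrays.deepEquals(next, centers);
            centers = next;
            if (converged) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[][]> splits = new ArrayList<>();
        splits.add(new double[][] {{0, 0}, {0, 2}, {10, 10}});
        splits.add(new double[][] {{10, 12}, {1, 1}});
        double[][] centers = cluster(splits, new double[][] {{0, 0}, {10, 10}}, 20);
        System.out.println(Arrays.deepToString(centers));
    }
}
```

Broadcasting only sums and counts, rather than the vectors themselves, keeps the message volume proportional to the number of centers instead of the size of the dataset.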

  22. Benchmarks • One-rack cluster (16 nodes, 256 cores) • 10G network • On average faster than Mahout’s implementation

  23. Gradient descent • Optimization algorithm • Find a (local) minimum of some function • Used for • solving linear systems • solving nonlinear systems • in machine learning tasks • linear regression • logistic regression • neural network backpropagation • …

  24. Gradient descent • Minimize a given (cost) function • Give the function a starting point (set of parameters) • Iteratively change parameters in order to minimize the function • Stop at the (local) minimum • There’s some math but intuitively: • evaluate derivatives at a given point in order to choose where to “go” next
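The intuition can be shown on a one-parameter function. The sketch below is a minimal illustration, not code from the talk: the function f(x) = (x - 3)^2, the step size and the iteration count are all assumptions chosen so the result is easy to check by hand.

```java
import java.util.function.DoubleUnaryOperator;

// Minimal one-parameter gradient descent (illustrative values only).
final class GradientDescentSketch {

    // Repeatedly steps against the derivative, starting from the given point.
    static double minimize(DoubleUnaryOperator derivative, double start, double stepSize, int iterations) {
        double x = start;
        for (int i = 0; i < iterations; i++) {
            x -= stepSize * derivative.applyAsDouble(x); // the derivative's sign tells us where to "go" next
        }
        return x;
    }

    public static void main(String[] args) {
        // f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
        double x = minimize(v -> 2 * (v - 3), 0.0, 0.1, 200);
        System.out.println(x);
    }
}
```

With this step size each iteration shrinks the distance to the minimum by a constant factor, so the result approaches 3; a step size that is too large would make the iterates diverge instead.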

  25. Gradient descent BSP • Iteratively • Superstep 1*i • each task calculates and broadcasts portions of the cost function with the current parameters • Superstep 2*i • aggregate and update cost function • check the aggregated cost and iterations count • cost should always decrease • Superstep 3*i • each task calculates and broadcasts portions of (partial) derivatives • Superstep 4*i • aggregate and update parameters
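The aggregation pattern can be sketched for a two-parameter linear model h(x) = θ0 + θ1·x. This is a sequential simulation under stated assumptions, not Hama's gradient descent code: the names are hypothetical, the cost is the usual halved mean squared error, and for brevity each simulated task returns its cost portion and its derivative portions in one pass, where the slide uses separate supersteps.

```java
import java.util.ArrayList;
import java.util.List;

// Sequential simulation of BSP-style gradient descent for h(x) = theta0 + theta1 * x
// (hypothetical names; cost and derivative portions are merged into one pass).
final class GradientDescentBspSketch {

    // One task's portion over its split of {x, y} rows:
    // {sum of squared errors, sum of errors, sum of errors * x, row count}.
    static double[] partial(double[][] split, double[] theta) {
        double squared = 0;
        double g0 = 0;
        double g1 = 0;
        for (double[] row : split) {
            double error = theta[0] + theta[1] * row[0] - row[1];
            squared += error * error;
            g0 += error;
            g1 += error * row[0];
        }
        return new double[] {squared, g0, g1, split.length};
    }

    static double[] fit(List<double[][]> splits, double stepSize, int maxIterations, double tolerance) {
        double[] theta = {0, 0};
        double previousCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < maxIterations; i++) {
            double squared = 0;
            double g0 = 0;
            double g1 = 0;
            double m = 0;
            for (double[][] split : splits) {
                double[] p = partial(split, theta); // what each task would broadcast
                squared += p[0];
                g0 += p[1];
                g1 += p[2];
                m += p[3];
            }
            double cost = squared / (2 * m);
            // Check the aggregated cost: stop once it no longer decreases meaningfully.
            if (previousCost - cost < tolerance) {
                break;
            }
            previousCost = cost;
            theta = new double[] {theta[0] - stepSize * g0 / m, theta[1] - stepSize * g1 / m};
        }
        return theta;
    }

    public static void main(String[] args) {
        // Toy data on the line y = 2x + 1, split across two simulated tasks.
        List<double[][]> splits = new ArrayList<>();
        splits.add(new double[][] {{0, 1}, {1, 3}, {2, 5}});
        splits.add(new double[][] {{3, 7}, {4, 9}, {5, 11}});
        double[] theta = fit(splits, 0.05, 10000, 1e-18);
        System.out.println(theta[0] + " " + theta[1]);
    }
}
```

Because both the cost and the derivatives are sums over rows, each task only needs to send a handful of numbers per iteration, regardless of how many vectors its split holds.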

  26. Gradient descent BSP • Simplistic example • Linear regression • Given a real estate market dataset • Estimate new houses’ prices given known houses’ sizes, geographic regions and prices • Expected output: actual parameters for the (linear) prediction function

  27. Gradient descent BSP • Generate a different model for each region • House item vectors • price -> size • 150k -> 80 • 2-dimensional space • Dataset of ~1.3M vectors

  28. Gradient descent BSP • Dataset and model fit

  29. Gradient descent BSP • Cost checking

  30. Gradient descent BSP • Classification • Logistic regression with gradient descent • Real estate market dataset • We want to find which estate listings belong to agencies • To avoid buying from them • Same algorithm • With different cost function and features • Existing items are tagged or not as “belonging to agency” • Create vectors from items’ text • Sample vector • 1 -> 1 3 0 0 5 3 4 1
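The "different cost function" is not spelled out on the slide; a common choice for logistic regression is the log-loss over a sigmoid prediction, sketched below. Everything here is an assumption for illustration: the names are hypothetical, there is no bias term, and the sample vector is read as label 1 followed by eight text-derived feature values.

```java
// Log-loss for a single labelled item (assumed cost function; hypothetical names).
final class LogisticCostSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Predicted probability that the item belongs to an agency.
    static double predict(double[] theta, double[] features) {
        double z = 0;
        for (int i = 0; i < features.length; i++) {
            z += theta[i] * features[i];
        }
        return sigmoid(z);
    }

    // Cost contribution of one item: -log(p) if tagged 1, -log(1 - p) if tagged 0.
    static double cost(double[] theta, double[] features, int label) {
        double p = predict(theta, features);
        return label == 1 ? -Math.log(p) : -Math.log(1.0 - p);
    }

    public static void main(String[] args) {
        double[] features = {1, 3, 0, 0, 5, 3, 4, 1}; // sample vector from the slide, tagged 1
        double[] theta = new double[features.length];  // all-zero starting parameters
        System.out.println(cost(theta, features, 1));
    }
}
```

With all-zero parameters every prediction is 0.5, so each item contributes ln 2 to the cost; gradient descent then adjusts the parameters to push that value down.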

  31. Gradient descent BSP • Classification

  32. Benchmarks • Not directly comparable to Mahout’s regression algorithms • Both SGD and CGD are inherently better than plain GD • But Hama GD had on average the same performance as Mahout’s SGD / CGD • Next step is implementing SGD / CGD on top of Hama

  33. Wrap up • Even if • ML module is still “young” / work in progress • and tools like Apache Mahout have better “coverage” • Apache Hama can be particularly useful in certain “highly iterative” use cases • Interesting benchmarks

  34. Thanks!
