
Identifying and Incorporating Latencies in Distributed Data Mining Algorithms


Presentation Transcript


  1. Identifying and Incorporating Latencies in Distributed Data Mining Algorithms Michael Sevilla

  2. X Identifying and Incorporating Latencies in Distributed Data Mining Algorithms Michael Sevilla

  3. Applicability of Mahout for Large Data Sets Michael Sevilla

  4. What is Mahout? • Distributed machine learning libraries • “scalable to reasonably large data sets” • Runs on Hadoop http://heureka.blogetery.com/

  5. The Data: Million Song Data Set • Large Data Set • 1,019,318 users • 384,546 MSD songs • 48,373,586 (user, song, count) • Kaggle Competition: offline evaluation • Predict songs a user will listen to using • Training: 1M user listening history • Validation: 110K users • “Martin L” blogged his methodology + results

  6. Motivations • Can Mahout easily be modified? • Can Mahout perform well for this workload? • Can Mahout produce accurate results? • Can Mahout work ‘out of box’? • Hypothesis: 22 machines + Mahout > 1 guy

  7. What kind of Recommender? • Format: <userID, songID, count> • Users interacting with items • Users express preferences towards items • We can use Collaborative Filtering

  8. Collaborative Filtering • Predicts preference of user towards an item • Constructs a Top-N-Recommendation • Parse input training data • Create user-item-matrix • Predict missing entries Mahout has item-based Collaborative Filtering jobs!
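The three steps on this slide (parse, create the user-item matrix, predict missing entries) can be sketched in a few lines of Python. This is a minimal illustration, not Mahout's implementation: the toy triples, the function names, and the choice of cosine similarity between songs are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data in the <userID, songID, count> format from the slides.
triples = [
    ("u1", "s1", 3), ("u1", "s2", 1),
    ("u2", "s1", 2), ("u2", "s2", 4), ("u2", "s3", 1),
    ("u3", "s2", 5), ("u3", "s3", 2),
]

def build_matrix(rows):
    """Create a sparse user-item matrix as {user: {song: count}}."""
    matrix = defaultdict(dict)
    for user, song, count in rows:
        matrix[user][song] = count
    return matrix

def item_vectors(matrix):
    """Transpose to {song: {user: count}} so songs can be compared."""
    items = defaultdict(dict)
    for user, songs in matrix.items():
        for song, count in songs.items():
            items[song][user] = count
    return items

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_n(user, matrix, n=2):
    """Score each song the user has not heard by similarity to their history."""
    items = item_vectors(matrix)
    heard = matrix[user]
    scores = {}
    for song, vec in items.items():
        if song in heard:
            continue
        # Weight each similarity by how often the user played the heard song.
        scores[song] = sum(cosine(vec, items[h]) * c for h, c in heard.items())
    return sorted(scores, key=scores.get, reverse=True)[:n]

matrix = build_matrix(triples)
print(top_n("u1", matrix))
```

Storing the matrix as nested dictionaries keeps only the observed (user, song) pairs, which matters when most of the 1M x 384K matrix is empty.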

  9. Can Mahout easily be modified?

  10. Martin’s Code • Methodology: similarity vector of history • Sparse matrix • COLISTEN(i, j) – listeners who listened to both i and j • Sum similarities for each song user x listens to • The code: all Python • Parse: 27 lines of code (l.o.c) • Create Matrix: 46 l.o.c • Predict: 45 l.o.c
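The COLISTEN idea can be sketched as follows. This is not Martin's actual code: the toy histories and function names are hypothetical, and the sketch assumes the raw colisten count is used directly as the similarity, which the slide does not confirm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical listening histories: {user: set of songs}.
history = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"a", "d"},
}

def colisten_counts(history):
    """COLISTEN(i, j): number of listeners who listened to both i and j."""
    counts = defaultdict(lambda: defaultdict(int))
    for songs in history.values():
        for i, j in combinations(sorted(songs), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts

def recommend(user_songs, counts, n=3):
    """Sum the colisten counts over every song the user has listened to."""
    scores = defaultdict(int)
    for i in user_songs:
        for j, c in counts[i].items():
            if j not in user_songs:
                scores[j] += c
    # Sort by score descending, then by song ID for a deterministic order.
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]

counts = colisten_counts(history)
print(counts["a"]["b"])
print(recommend({"a", "b"}, counts))
```

The counts are stored sparsely, so only song pairs that at least one listener shares take up memory.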

  11. Mahout’s Code • Methodology: • No Idea… • The code: all Java • Poorly commented • 14 *.java files • Many Directories • ~/mahout/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java • RecommenderJob.java: 284 lines of code (l.o.c) • SimilarityMatrixRowWrapperMapper.java: 47 l.o.c • UserVectorSplitterMapper.java: 138 l.o.c

  12. Mahout’s Code

  13. Can Mahout easily be modified? NO

  14. Can Mahout perform well for this workload?

  15. Martin’s Code • Performance on 86MB: • Parse data: 10 minutes • Make Matrix: 22 minutes • Predict songs for 11000 users: 1 hour, 18 minutes • Did not test scalability • Commands: $ python convertToNumbers.py; $ python colisten.py; $ python predict_colisten.py

  16. Mahout’s Code • Performance on 86MB: • Parse Time: 10 minutes • Total Time: 25 minutes • Tested scalability • 64MB, 128MB, 256MB, 1GB, 2GB, 3GB

  17. Mahout’s Code • Total Time • ~ 12m, 43m, 1hr, 2hr, 4hr, >5hr …. 10 Nodes Failed

  18. Mahout’s Code • Prepare Jobs (parse): seconds - minutes

  19. Mahout’s Code • Recommend Jobs (predict): seconds - minutes

  20. Mahout’s Code • Create Matrix Jobs: minutes - hours

  21. Can Mahout perform well for this workload? NO

  22. Can Mahout produce accurate results?

  23. Training Set • Kaggle Million Song Subset: 110K users • User 2: 16 entries – took out 8 • User 16: 32 entries – took out 8 • User 17: 25 entries – took out 8
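The hold-out step on this slide, hiding 8 of a user's entries and asking the recommender to recover them, might look like the sketch below. The slide does not say how the 8 entries were chosen, so random selection with a fixed seed is an assumption, as are the function name and the placeholder song IDs.

```python
import random

def hold_out(entries, k=8, seed=0):
    """Split one user's entries into (kept, hidden), hiding k at random."""
    if len(entries) <= k:
        raise ValueError("user has too few entries to hide k of them")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(entries)
    rng.shuffle(shuffled)
    return shuffled[k:], shuffled[:k]

# Hypothetical song IDs; only the entry counts (16, 32, 25) come from the slide.
users = {2: 16, 16: 32, 17: 25}
for user_id, n_entries in users.items():
    songs = [f"song{i}" for i in range(n_entries)]
    kept, hidden = hold_out(songs)
    print(user_id, len(kept), len(hidden))
```

The kept entries are fed to the recommender as the user's history, and the hidden entries serve as the ground truth for scoring its output.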

  24. Martin’s Code • [Evaluation formula not recovered], where Q is the number of queries • Results for User 2, User 16, User 17: [values not recovered]

  25. Mahout’s Code • [Evaluation formula not recovered], where Q is the number of queries • Results for User 2, User 16, User 17: [values not recovered]
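The formula on slides 24 and 25 was an image and is missing from the transcript. The phrase "where Q is the number of queries" matches the textbook definition of mean average precision (MAP), so the sketch below assumes that was the metric. The function names and the example lists are hypothetical, and the competition's exact variant (for instance, any truncation of the ranked list) may differ.

```python
def average_precision(recommended, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(queries):
    """Mean of average precision over Q queries, each a (ranking, relevant) pair."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Hypothetical example: two users, each with a ranked list and hidden songs.
queries = [
    (["a", "x", "b"], {"a", "b"}),
    (["y", "c"], {"c"}),
]
print(mean_average_precision(queries))
```

In the hold-out setup from slide 23, each user is one query: the ranked list is the recommender's output and the relevant set is the 8 hidden entries.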

  26. Can Mahout produce accurate results? YES

  27. Can Mahout work ‘out of box’? YES… but not well

  28. Conclusion • Mahout did not scale well • Mahout was not easy to learn • Mahout was not easily modifiable • For performance and efficiency, it is better to • Understand the data set • Understand data mining • Understand the methodology
