Ricardo: Integrating R and Hadoop
Team: #19
Presenters: Xiaozhe Wang, Yue Gu
Agenda
• Background
• Introduction to R
• Disadvantages of Current Strategies
• Introduction to Ricardo
• Overview of Ricardo’s Architecture
• Evaluation
• Reference
Background: Data Mining Examples
• Amazon recommends products personalized to each customer
• Netflix recommends movies that match each customer's taste
Introduction to R: R’s Functionality for Data Mining
• Principal and independent component analysis
• k-means clustering
• SVM classification
• Generalized linear models
• Latent-factor models
• Bayesian models
• Time-series models
Introduction to R: A Simplified Method for Data Mining
• The k-means algorithm
• k-means in R
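The code shown on this slide did not survive extraction. As a rough stand-in, here is a minimal Python sketch of the standard k-means procedure (Lloyd's algorithm), spelling out the steps a statistical environment hides. In R itself, the built-in `kmeans()` function performs the whole procedure in a single call. The function name, parameters, and sample points below are illustrative assumptions, not taken from the slides.

```python
import random


def assign(points, centers):
    """Group each point with its nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(
            range(len(centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
        )
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, assign(points, centers)


# Illustrative data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

Writing the loop out by hand shows why a single high-level call is attractive for analysts: the assignment and update steps are short but easy to get subtly wrong.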
Disadvantages of Current Strategies for Scalability in Data Mining
• Exploit vertical scalability: limited and expensive
• Sample the dataset: loses important features and accuracy
• Use a large-scale data management system (DMS): less analytical functionality
Introduction to Ricardo: R and Hadoop
Overview of Ricardo’s Architecture
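The architecture figure from this slide did not survive extraction. As a hedged illustration of the general idea behind pairing a statistical tool with a MapReduce system, the Python sketch below splits a simple linear regression into a large-data step (each partition is reduced to a few sums) and a small-data step (the model is fitted from the combined sums). This is not Ricardo's actual API: all function names and the sample data are hypothetical, and plain Python stands in for both Hadoop and R.

```python
from functools import reduce


def map_partition(rows):
    """Large-data step: reduce one partition of (x, y) rows to five sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Merge the partial sums of two partitions element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Small-data step: ordinary least squares from the aggregated sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data split across two partitions; every row satisfies y = 2x + 1.
partitions = [[(1, 3), (2, 5)], [(3, 7), (4, 9)]]
stats = reduce(combine, map(map_partition, partitions))
slope, intercept = fit_line(stats)
```

Only the five aggregated numbers cross from the data-parallel side to the modelling side, which is why this kind of split can scale to data that does not fit in one machine's memory.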
Evaluation: Performance and Scalability
• Objective: simulate a real recommender system
• Dataset: the original Netflix competition dataset
• Jaql requires about twice as much time as raw Hadoop
• In exchange, Jaql offers a higher level of abstraction
Conclusion
• Ricardo is a scalable platform that integrates R and Hadoop
Reference
S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, 2010. http://www.mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf