
Apache Mahout

Presentation Transcript


  1. Apache Mahout: Industrial Strength Machine Learning. Jeff Eastman

  2. Current Situation • Large volumes of data are now available • Platforms now exist to run computations over large datasets (Hadoop, HBase) • Sophisticated analytics are needed to turn data into information people can use • Active research community and proprietary implementations of “machine learning” algorithms • The world needs scalable implementations of ML under open license - ASF

  3. Where is ML Used Today • Internet search clustering • Knowledge management systems • Social network mapping • Taxonomy transformations • Marketing analytics • Recommendation systems • Log analysis & event filtering • Fraud detection

  4. History of Mahout • Summer 2007: developers needed scalable ML; mailing list formed • Community formed: Apache contributors, academia & industry, lots of initial interest • Project formed under Apache Lucene: January 25, 2008

  5. Who We Are (so far) • Dawid Weiss • Otis Gospodnetic • Karl Wettin • Grant Ingersoll • Jeff Eastman • Ted Dunning • Erik Hatcher • Isabel Drost

  6. Current Code Base • Matrix & Vector library (Hama collaboration for very large arrays) • Clustering (Canopy, K-Means, Mean Shift) • Utilities (Distance Measures, Parameters)

  7. Algorithms Under Development • Naïve Bayes • Perceptron • PLSI/EM • Taste Collaborative Filtering Integration • Genetic Programming • Dirichlet Process Clustering

  8. GSoC @ Mahout • Many interesting submissions • 4 projects approved for Mahout (http://code.google.com/soc/2008/asf/about.html) • “Mahout: Parallel implementation of machine learning algorithms”, Farid Bourennani • “Implementing Logistic Regression in Mahout”, Yun Jiang • “Codename Mahout.GA for mahout-machine-learning”, Abdel Hakim Deneche • “To implement Complementary Naïve Bayes and Expectation Maximization algorithm using Map Reduce for Multicore Systems”, Robin Anil

  9. Conclusion • This is just the beginning • High demand for scalable machine learning • Contributors needed who have: interest, enthusiasm & programming ability; test-driven development readiness; comfort with the scary math (or bravery); interest and/or proficiency with Hadoop; some large data sets you want to analyze
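
Slide 6 lists K-Means clustering and pluggable distance measures as part of the current code base. As a rough illustration of what those two pieces do together, here is a minimal single-machine sketch in Java. It is not Mahout's implementation and does not use Mahout's API: the class `KMeansSketch`, the `Distance` interface, and the choice to seed centroids from the first k points are all assumptions made for this example, and nothing here runs on Hadoop.

```java
import java.util.Arrays;

// Stand-alone sketch; none of these names are Mahout classes.
class KMeansSketch {

    /** Hypothetical pluggable distance measure (not Mahout's interface). */
    interface Distance {
        double between(double[] a, double[] b);
    }

    /** Euclidean distance between two equal-length vectors. */
    static final Distance EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /**
     * Returns a cluster index for every point. Centroids start at the
     * first k points (an assumption that keeps the run deterministic).
     */
    static int[] cluster(double[][] points, int k, int maxIterations, Distance measure) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (measure.between(points[p], centroids[c])
                            < measure.between(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (labels[p] != best) {
                    labels[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // no point moved, so the clustering has converged
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[labels[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep the previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }
}
```

Calling `KMeansSketch.cluster(points, 2, 10, KMeansSketch.EUCLIDEAN)` on two well-separated groups of points should place each group in its own cluster. Passing the distance as a parameter mirrors the idea of keeping distance measures as a separate utility, so a different measure can be swapped in without touching the clustering loop.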
