Enhancing Discovery with Solr and Mahout. Grant Ingersoll Chief Scientist Lucid Imagination. Evolution. Minding the Intersection. Topics. Background Apache Mahout Apache Solr and Lucene Recommendations with Mahout Collaborative Filtering Discovery with Solr and Mahout Discussion.
Enhancing Discovery with Solr and Mahout Grant Ingersoll Chief Scientist Lucid Imagination
Topics • Background • Apache Mahout • Apache Solr and Lucene • Recommendations with Mahout • Collaborative Filtering • Discovery with Solr and Mahout • Discussion
Apache Lucene in a Nutshell • http://lucene.apache.org/java • Java based Application Programming Interface (API) for adding search and indexing functionality to applications • Fast and efficient scoring and indexing algorithms • Lots of contributions to make common tasks easier: • Highlighting, spatial, Query Parsers, Benchmarking tools, etc. • Most widely deployed search library on the planet
Apache Solr in a Nutshell • http://lucene.apache.org/solr • Lucene-based Search Server + other features and functionality • Access Lucene over HTTP: • Java, XML, Ruby, Python, .NET, JSON, PHP, etc. • Most programming tasks in Lucene are taken care of in Solr • Faceting (guided navigation, filters, etc.) • Replication and distributed search support • Lucene Best Practices
Apache Mahout in a Nutshell http://dictionary.reference.com/browse/mahout • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org • The Three C’s: • Collaborative Filtering (recommenders) • Clustering • Classification • Others: • Frequent Item Mining • Primitive collections • Math stuff
Recommenders • Collaborative Filtering (CF) • Provide recommendations solely based on preferences expressed between users and items • “People who watched this also watched that” • Content-based Recommendations (CBR) • Provide recommendations based on the attributes of the items and user profile • ‘Modern Family’ is a sitcom, Bob likes sitcoms • => Suggest Modern Family to Bob • Mahout geared towards CF, can be extended to do CBR • Classification can also be used for CBR • Aside: search engines can also solve these problems
To Rate or Not? • In many instances, users don’t provide actual ratings • Clicks, views, etc. • Non-Boolean ratings can also often introduce unnecessary noise • Even a low rating often has a positive correlation with highly rated items in the real world • Example: Should we recommend Frankenstein to Bob?
Collaborative Filtering with Mahout • Extensive framework for collaborative filtering • Recommenders • User based • Item based • Slope One • Online and Offline support • Offline can utilize Hadoop Recommendations for User X
User Similarity • What should we recommend for User 1? • [Diagram: Users 1 to 4 linked to Items 1 to 4]
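A minimal sketch of the user-based idea in plain Python, not Mahout's API: find the users most similar to User 1, then suggest items they have that User 1 lacks. The preference data is hypothetical (the slide's diagram does not survive in this transcript), and Jaccard similarity over boolean preferences is an assumed choice of measure.

```python
# Hypothetical boolean preferences (user -> set of items); not taken from the slides.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 4"},
    "User 4": {"Item 2", "Item 3"},
}

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_user_based(user, prefs, neighbors=2):
    """Score unseen items by the summed similarity of the nearest users who have them."""
    mine = prefs[user]
    nearest = sorted(
        ((jaccard(mine, items), other) for other, items in prefs.items() if other != user),
        reverse=True,
    )[:neighbors]
    scores = {}
    for sim, other in nearest:
        for item in prefs[other] - mine:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken by item name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(recommend_user_based("User 1", prefs))
```

With this data, Users 2 and 4 are the nearest neighbors and both contribute Item 3, so it is the only suggestion.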
Item Similarity • What should we recommend for User 1? • [Diagram: Users 1 to 4 linked to Items 1 to 4]
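A minimal sketch of the item-based idea in plain Python, not Mahout's API: compare items by the users who prefer them, then score each unseen item by its similarity to the items User 1 already has. The data is hypothetical, and Jaccard similarity is an assumed choice of measure.

```python
# Hypothetical boolean preferences (user -> set of items); not taken from the slides.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 4"},
    "User 4": {"Item 2", "Item 3"},
}

def users_by_item(prefs):
    """Invert user -> items into item -> set of users who prefer it."""
    inverted = {}
    for user, items in prefs.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_item_based(user, prefs):
    """Score each unseen item by its summed similarity to the user's own items."""
    users_of = users_by_item(prefs)
    mine = prefs[user]
    scores = {}
    for candidate in users_of.keys() - mine:
        score = sum(jaccard(users_of[candidate], users_of[owned]) for owned in mine)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken by item name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(recommend_item_based("User 1", prefs))
```

Item similarities depend only on the preference data, not on the user being served, so they can be computed ahead of time and reused.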
Slope One • Intuition: There is a linear relationship between rated items • Y = mX + b where m = 1 • Solve for b upfront based on existing ratings: b = (Y-X) • Find the average difference in preference value for every pair of items • Online can be very fast, but requires upfront computation and memory • Example: User A: 3.5 - 2 = 1.5, so Item 1 (User B) = 3 + 1.5 = 4.5
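The arithmetic on this slide can be sketched in a few lines of plain Python (not Mahout's implementation). The ratings are reconstructed from the slide's numbers; which item each number belongs to is an assumption, as is the count-weighted averaging used when several item pairs contribute.

```python
from collections import defaultdict

# Reconstructed from the slide's arithmetic; the item assignment is an assumption.
ratings = {
    "User A": {"Item 1": 3.5, "Item 2": 2.0},
    "User B": {"Item 2": 3.0},
}

def average_diffs(ratings):
    """Precompute b = mean(Y - X) for every ordered item pair (Y, X) rated by the same user."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for y, rating_y in user_ratings.items():
            for x, rating_x in user_ratings.items():
                if x != y:
                    totals[(y, x)] += rating_y - rating_x
                    counts[(y, x)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, dict(counts)

def predict(user, target, ratings, diffs, counts):
    """Estimate the target rating as a count-weighted average of (known rating + b)."""
    numerator = 0.0
    denominator = 0
    for x, rating_x in ratings[user].items():
        pair = (target, x)
        if pair in diffs:
            numerator += (rating_x + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None

diffs, counts = average_diffs(ratings)
print(predict("User B", "Item 1", ratings, diffs, counts))  # 3 + 1.5 = 4.5
```

The `average_diffs` step is the upfront computation the slide mentions: it holds one entry per ordered item pair in memory, after which each prediction is a short loop over the user's known ratings.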
Online and Offline Recommendations • Online • Predates Hadoop • Designed to run on a single node • Matrix size of ~ 100M interactions • API for integrating with your application • Offline • Hadoop based • Designed to run on large cluster • Several approaches: • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
RecommenderJob • Essentially does matrix multiplication using distributed techniques • $MAHOUT_HOME/bin/examples/asf-email-examples.sh • [Figure: matrix multiplication]
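One way to picture the matrix multiplication, as a single-machine sketch in plain Python rather than the Hadoop job itself: build an item-by-item co-occurrence matrix from the preference data, then multiply it by one user's preference vector. The data is hypothetical, and the choice of a co-occurrence matrix as the left-hand operand is an assumption about what the missing figure showed.

```python
# Hypothetical boolean preferences; each vector has one slot per entry in `items`.
items = ["Item 1", "Item 2", "Item 3", "Item 4"]
user_vectors = {
    "User 1": [1, 1, 0, 0],
    "User 2": [1, 1, 1, 0],
    "User 3": [0, 0, 0, 1],
    "User 4": [0, 1, 1, 0],
}

def cooccurrence(user_vectors, n):
    """C[i][j] = number of users who prefer both item i and item j (diagonal left at 0)."""
    matrix = [[0] * n for _ in range(n)]
    for vector in user_vectors.values():
        for i in range(n):
            for j in range(n):
                if i != j and vector[i] and vector[j]:
                    matrix[i][j] += 1
    return matrix

def mat_vec(matrix, vector):
    """Plain matrix-by-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

matrix = cooccurrence(user_vectors, len(items))
user = user_vectors["User 1"]
scores = mat_vec(matrix, user)

# Keep only items the user has not already expressed a preference for.
recommendations = sorted(
    ((score, name) for score, name, seen in zip(scores, items, user) if not seen and score > 0),
    reverse=True,
)
print(recommendations)
```

Each row of the product is independent of the others, which is what makes the computation easy to spread across a cluster.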
Discovery with Solr • Goals: • Guide users to results without having to guess at keywords • Encourage serendipity • Never show empty results • Out of the Box: • Faceting • Spell Checking • More Like This • Clustering (Carrot2) • Extend • Clustering (with Mahout) • Frequent Item Mining (with Mahout)
Clustering • Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content • Solr has search result clustering • Pluggable • Default implementation uses Carrot2 • Mahout has Hadoop based large scale clustering • K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
Discovery In Action • Pre-reqs: • Apache Ant 1.7.x, Subversion (SVN) • Command Line 1: • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk • cd solr-trunk/solr/ • ant example • cd example • java -Dsolr.clustering.enabled=true -jar start.jar • Command Line 2: • cd exampledocs; java -jar post.jar *.xml • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Basics • Most Mahout tasks are offline • Solr provides many touch points for integration: • ClusteringEngine • Clustering results • SearchComponent • Suggestions – Related searches, clusters, MLT, spellchecking • UpdateProcessor • Classification of documents • FunctionQuery
Example: FrequentItemset Mining • Discover frequently co-occurring items • Use Case: Related Searches from Solr Logs • Hadoop and sequential versions • Parallel FP Growth • Input: • <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE • Comma, pipe also allowed as delimiters
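To make "frequently co-occurring items" concrete, here is a small sketch that reads lines in the input format above and counts item pairs that meet a minimum support. It is brute-force pair counting, not Parallel FP Growth, and the transactions and the `min_support` parameter name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: optional document id, TAB, then space-separated tokens.
lines = [
    "doc1\tsolr faceting",
    "doc2\tsolr mahout clustering",
    "solr faceting tutorial",
    "mahout clustering",
]

def frequent_pairs(lines, min_support=2):
    """Count unordered token pairs per line and keep those seen at least min_support times."""
    counts = Counter()
    for line in lines:
        tokens_part = line.split("\t")[-1]        # drop the optional document id
        tokens = sorted(set(tokens_part.split()))  # one vote per token per line
        counts.update(combinations(tokens, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(lines))
```

FP Growth reaches the same kind of answer for itemsets of any size without enumerating every combination, which matters once transactions get long.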
FIM on Solr Query Logs • Goal: • Extract user queries from Solr logs • Feed into FIM to generate Related Keyword Searches • Context: • Solr Query logs • bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg • bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce • bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
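To see what the `--regex` pattern picks out, the same expression can be tried in Python on a sample line. The log lines below are hypothetical, and the URL-decoding step is only an assumption about what the `url` transformer name implies.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the --regex argument: the text after "?q=" or "&q=",
# up to the next "&" or the end of the line.
Q_PARAM = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")

def extract_query(log_line):
    """Return the decoded q parameter from a log line, or None if there is none."""
    match = Q_PARAM.search(log_line)
    # group(0) is the whole match; group(1) would only be the "?" or "&" before "q=".
    return unquote_plus(match.group(0)) if match else None

# Hypothetical log lines for illustration.
print(extract_query("GET /solr/select?q=chris+hostetter&rows=10"))
print(extract_query("GET /solr/select?fl=id&q=faceted%20search"))
print(extract_query("GET /solr/admin/ping"))
```

Because the lookbehind requires `?` or `&` immediately before `q=`, parameters such as `fq=` are not picked up by mistake.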
Output • Key: Chris: Value: ([Chris, Hostetter],870), ([Chris],870), ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18), ([Search, Faceted, Chris, Hostetter, Webcast, Power],18), ([Search, Faceted, Chris, Hostetter],18), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12), ([Solr, new, Chris, Hostetter, webcast, along],12), ([Solr, new, Chris, Hostetter, webcast],12), ([Solr, new, Chris, Hostetter],12)
Resources • http://lucene.apache.org • http://mahout.apache.org • http://manning.com/owen • http://manning.com/ingersoll • http://www.lucidimagination.com • grant@lucidimagination.com • @gsingers
Mahout Overview • [Diagram labels: Applications, Examples, Genetic, Freq. Pattern Mining, Classification, Clustering, Recommenders, Utilities/Integration, Lucene/Vectorizer, Math, Vectors/Matrices/SVD, Collections (primitives), Apache Hadoop] • See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms