
Applying Semantic Analyses to Content-based Recommendation and Document Clustering


Presentation Transcript


  1. Applying Semantic Analyses to Content-based Recommendation and Document Clustering Eric Rozell, MRC Intern Rensselaer Polytechnic Institute

  2. Bio • Graduate Student @ Rensselaer Polytechnic Institute • Research Assistant @ Tetherless World Constellation • Student Fellow @ Federation of Earth Science Information Partners • Research Advisor: Peter Fox • Research Focus: Semantic eScience • Contact: rozele@rpi.edu

  3. Outline • Background • Semantic Analysis • Probase Conceptualization • Explicit Semantic Analysis • Latent Dirichlet Allocation • Recommendation Experiment • Recommendation Systems • Experiment Setup • Results • Clustering Experiment • Problem • K-Means • Results • Conclusions

  4. Background • Billions of documents on the Web • Semi-structured data from Web 2.0 (e.g., tags, microformats) • Most knowledge remains in unstructured text • Many natural language techniques for: • Ontology extraction • Topic extraction • Named entity recognition/disambiguation • Some techniques are better than others for various information retrieval tasks…

  5. Probase • Developed at Microsoft Research Asia • Probabilistic knowledge base built from the Bing index and query logs (and other sources) • Text mining patterns • Namely, Hearst patterns: “… artists such as Picasso” • Evidence for hypernym(artists, Picasso), as sketched below
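A Hearst pattern like “X such as Y” can be approximated with a regular expression. Below is a minimal Python sketch; the SUCH_AS pattern and the extract_hypernyms helper are hypothetical illustrations, not the actual Probase mining pipeline, which runs over the Bing index at far larger scale.

```python
import re

# Minimal "X such as Y1, Y2, ..." matcher. Group 1 captures the
# hypernym (concept); group 2 captures a comma-separated list of
# capitalized entity mentions.
SUCH_AS = re.compile(
    r"(\w+)\s+such\s+as\s+"
    r"((?:[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*)"
    r"(?:,\s*[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*)*)"
)

def extract_hypernyms(sentence):
    """Return (concept, entity) evidence pairs from one sentence."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        concept = match.group(1).lower()
        for entity in re.split(r",\s*", match.group(2)):
            pairs.append((concept, entity))
    return pairs

print(extract_hypernyms("She admired artists such as Picasso, Monet"))
# -> [('artists', 'Picasso'), ('artists', 'Monet')]
```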

  6. Probase (figure)

  7. Probase • Very capable at conceptualizing groups of entities: • “China; India; United States” yields “country” • “China; India; Brazil; Russia” yields “emerging market” • Differentiates attributes and entities • “birthday” -> “person” as attribute • “birthday” -> “occasion” as entity • Applications • Clustering Tweets from Concepts [Song et al., 2011] • Understanding Web Tables • Query Expansion (Topic Search)

  8. Research Questions • What’s the best way of extracting concepts from text? • Compare techniques for semantic analysis • How are extracted concepts useful? • Generate data about where semantic analysis techniques are applicable • Are user ratings affected by the concepts in media items such as movies? • Test semantic analysis techniques in recommender systems • How useful is Web-scale domain knowledge in narrower domains for information retrieval? • Identify need for domain-specific knowledge

  9. Semantic Analysis • Generating meaning (concepts) from text • Specifically, get prevalent hypernyms • E.g., “… Apple, IBM, and Microsoft …” • “technology companies” • Semantic analysis using external knowledge • Probase Conceptualization • Explicit Semantic Analysis • WordNet Synsets • Semantic analysis using latent features • Latent Dirichlet Allocation • Latent Semantic Analysis

  10. Probase Conceptualization (pipeline diagram: for each document in the corpus, terms t1, t2, t3, … are mapped through Probase to candidate concepts c1, c2, c3, …; the candidates are combined by Naïve Bayes or summation, then filtered by inverse document frequency to yield the document’s concepts)

  11. Probase Conceptualization • “Cowboy doll Woody (Tom Hanks) is coordinating a reconnaissance mission to find out what presents his owner Andy is getting for his birthday party, days before they move to a new house. Unfortunately for Woody, Andy receives a new spaceman toy, Buzz Lightyear (Tim Allen), who impresses the other toys and Andy, who starts to like Buzz more than Woody. Buzz thinks that he is an actual space ranger, not a toy, and thinks that Woody is interfering with his "mission" to return to his home planet…” Text Source: Internet Movie Database (IMDb)

  12. Sample Features for “Toy Story” (Probase) • dvd encryptions (0.050): “RC” • duty free item (0.044): “toys” • generic word (0.043): “they, travel, it, …” • satellite mission (0.032): “reconnaissance mission” • creator-owned work (0.020): “Woody” • amazing song (0.013): “fury” • doubtful word (0.013): “overcome” • ill-fated tool (0.013): “Buzz” • lovable “toy story” character (0.011): “Buzz Lightyear, Woody, …” • pleased star (0.010): “Woody” • trail builder (0.010): “Woody”

  13. Explicit Semantic Analysis (figure) Image Source: Gabrilovich et al., 2007
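As a rough illustration of the ESA idea, the sketch below builds a tiny concept index with scikit-learn and projects a text onto it. The three “articles” are hypothetical stand-ins for the millions of Wikipedia articles indexed in the real system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ESA sketch: every Wikipedia article is a "concept"; a document's ESA
# vector is its TF-IDF similarity to each article (illustrative data).
articles = {
    "Buzz Aldrin":  "astronaut who flew on the Apollo 11 space mission",
    "Toy":          "a toy is an object for children to use in play",
    "Space Ranger": "fictional space ranger characters appear in films",
}

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(articles.values())   # concept x term matrix

def esa_vector(text):
    """Project a text onto the Wikipedia concept space."""
    scores = index @ vectorizer.transform([text]).T   # one score per concept
    return dict(zip(articles, scores.toarray().ravel()))

print(esa_vector("Buzz thinks he is a space ranger, not a toy"))
```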

  14. Sample Features for “Toy Story” (ESA) • #REDIRECT [[Buzz!]] (0.034) • #REDIRECT [[The Buzz]] (0.028) • #REDIRECT [[Buzz (comics)]] (0.027) • #REDIRECT [[Buzz cut]] (0.027) • #REDIRECT [[Buzz (DC Thomson)]] (0.024) • #REDIRECT [[Buzz Out Loud]] (0.024) • #REDIRECT [[The Daily Buzz]] (0.023) • #REDIRECT [[Buzz Aldrin]] (0.022) • #REDIRECT [[Buzz cut]] (0.022) • #REDIRECT [[Buzzing Tree Frog]] (0.022)

  15. Latent Dirichlet Allocation • Blei et al., 2003 • Unsupervised learning method • “Generates” documents from Dirichlet distributions over words and topics • Topic distributions over documents can be inferred from the corpus Image Source: Wikipedia
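A minimal sketch of LDA topic inference using scikit-learn; the four-document corpus and the choice of two topics are illustrative only, not part of the original experiment.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Infer per-document topic mixtures on a toy corpus (illustrative data).
docs = [
    "the spacecraft entered orbit around the planet",
    "the orbit of the satellite decayed near the planet",
    "the actor starred in a film about toys",
    "the film features a toy that believes it is real",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents, cols: topics
print(doc_topics.round(2))
```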

  16. Recommendation Systems • Collaborative filtering • “Customers who purchased X also purchased Y.” • Content-based • “Because you enjoyed ‘GoldenEye’, you may want to watch ‘Mission: Impossible’.” • Hybrid • Most modern systems take a hybrid approach.

  17. Content-based Recommendation • In the GoldenEye/Mission: Impossible example… • Structured item content • Genre – Action/Adventure/Thriller • Tags – Action, Espionage, Adventure • Unstructured item content • Plot synopses – “helicopter, agent, infiltrate, CIA, …” • Concepts? – “aircraft, intelligence agency, …”
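To make the content-based idea concrete, here is a small sketch that scores items by cosine similarity in concept space. The feature dictionaries and weights are hypothetical, standing in for the output of any of the semantic analysis techniques above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical concept features per movie (weights are made up).
items = {
    "GoldenEye":           {"aircraft": 0.6, "intelligence agency": 0.8},
    "Mission: Impossible": {"aircraft": 0.5, "intelligence agency": 0.9},
    "Toy Story":           {"toy": 0.9, "space ranger": 0.4},
}

vec = DictVectorizer()
names = list(items)
X = vec.fit_transform(items[n] for n in names)

# Rank items by similarity in concept space to a liked item.
sims = cosine_similarity(X[names.index("GoldenEye")], X).ravel()
for name, s in sorted(zip(names, sims), key=lambda p: -p[1]):
    print(f"{name}: {s:.2f}")
```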

  18. Recommendation Systems (diagram relating structured content-based approaches, collaborative filtering approaches, and unstructured content-based approaches; the semantic analysis techniques are tested in the unstructured content-based setting)

  19. Experiment (pipeline diagram: movie ratings from MovieLens and movie synopses from IMDb feed feature generation; the features go to the Matchbox recommendation platform, and recommendations are evaluated by mean absolute error (MAE))

  20. Matchbox (figure) Source: Matchbox API Documentation

  21. Experimental Data • Data: MovieLens dataset [HetRec ’11] • 855,598 ratings • 10,197 movies • 2,113 users • Movie synopses from IMDb (http://www.imdb.com) • Collected synopses for 2,633 movies • With 435,043 ratings • From 2,113 users • Ratings data: • Scored in half-point increments from 0.5 to 5 • Choose different numbers of movies (200; 1,000; all) • Train on 90% of ratings, test on remaining 10% (a sketch of this protocol follows)
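A sketch of the 90/10 evaluation protocol described above. The ratings here are randomly generated on the half-point scale, and the global-mean predictor is only a stand-in for the trained Matchbox model.

```python
import numpy as np

# Split ratings 90/10 and report mean absolute error (MAE) on the
# held-out 10% (synthetic data; stand-in predictor).
rng = np.random.default_rng(0)
ratings = rng.choice(np.arange(0.5, 5.5, 0.5), size=1000)  # 0.5 .. 5.0

perm = rng.permutation(len(ratings))
train, test = ratings[perm[:900]], ratings[perm[900:]]

predictions = np.full(len(test), train.mean())  # stand-in for Matchbox
print(f"MAE: {np.abs(predictions - test).mean():.3f}")
```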

  22. Experimental Data • Controls • Baseline 1: Only features are user IDs and movie IDs • Baseline 2: User IDs, movie IDs, movie genre • Baseline 3: User IDs, movie IDs, movie tags • Feature Sets • Term Frequency – Inverse Document Frequency • Latent Dirichlet Allocation • Explicit Semantic Analysis • Probase Conceptualization

  23. Experimental Setup • 4 scenarios, varying whether test users and test movies appear in training (2×2 user/movie matrix diagrams; training cells in white, testing cells in black); a sketch of the splits follows
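The four scenarios can be constructed by holding out a block of users and/or movies, roughly as in the sketch below. The IDs here are synthetic; the actual experiment partitions the MovieLens ratings.

```python
import numpy as np

# Partition ratings into four (user seen?, movie seen?) cells by
# holding out a block of users and a block of movies (synthetic IDs).
rng = np.random.default_rng(0)
users = rng.integers(0, 100, size=5000)    # user ID per rating
movies = rng.integers(0, 200, size=5000)   # movie ID per rating

held_users = rng.choice(100, size=10, replace=False)
held_movies = rng.choice(200, size=20, replace=False)

u_new = np.isin(users, held_users)
m_new = np.isin(movies, held_movies)

cells = {
    "seen user, seen movie": ~u_new & ~m_new,  # training pool + warm test
    "seen user, new movie":  ~u_new & m_new,
    "new user, seen movie":  u_new & ~m_new,
    "new user, new movie":   u_new & m_new,
}
for name, mask in cells.items():
    print(f"{name}: {mask.sum()} ratings")
```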

  24. Results (chart)

  25. Results • testing set contains users and movies not seen in the training set • recommendations are based on item features alone • small amounts of structured data (e.g., genre) are the most influential in this scenario

  26. Results • testing set contains users not seen in the training set • lots of collaborative data available (which explains the comparable performance of all feature sets) • given extensive collaborative data, item features are marginally beneficial (in Matchbox)

  27. Results • testing set contains movies not seen in the training set • recommendations are based on item features and extensive information on each user’s “rating model” • small amounts of structured data (e.g., genre) are the most influential in this scenario (even for long-term users)

  28. Results • testing set contains users and movies seen in the training set • recommendations again are primarily collaborative • given a large corpus of rating data for users and items, item features are only marginally beneficial

  29. Results (chart)

  30. Document Clustering • Divide a corpus into a specified number of groups • Useful for information retrieval • Automatically generated topics for search results • Recommendations for similar items/pages • Visualization of search space

  31. K-Means • Start with initial clusters • Compute the mean of each cluster • Compute the cosine distance of each item to the means • Assign items to clusters based on minimum distance • Repeat from step 2 until convergence (see the sketch below)
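A compact sketch of the algorithm above: with L2-normalized vectors, minimizing cosine distance is equivalent to maximizing the dot product with the renormalized cluster means. Empty clusters are not handled in this sketch, and the toy feature matrix is illustrative only.

```python
import numpy as np

def kmeans_cosine(X, k, iters=100, seed=0):
    """K-Means under cosine distance. Rows of X are L2-normalized so
    cosine similarity reduces to a dot product; cluster means are
    renormalized each round."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = rng.integers(0, k, size=len(X))      # step 1: random clusters
    for _ in range(iters):
        means = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
        means /= np.linalg.norm(means, axis=1, keepdims=True)  # step 2
        new = np.argmax(X @ means.T, axis=1)      # steps 3-4: nearest mean
        if np.array_equal(new, labels):           # step 5: converged
            return labels
        labels = new
    return labels

docs = np.abs(np.random.default_rng(1).normal(size=(12, 5)))  # toy features
print(kmeans_cosine(docs, k=3))
```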

  32. Experimental Setup • Generate features for datasets • Randomly assign initial clusters • Run K-Means • Compute purity and ARI (both sketched below) • Repeat steps 2–4 20 times for mean and standard deviation
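Both evaluation metrics are standard. A sketch with scikit-learn: purity (each cluster votes for its majority true class) implemented by hand, and the adjusted Rand index (ARI) taken from the library. The label lists are illustrative.

```python
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity(true_labels, pred_labels):
    """Fraction of items assigned to their cluster's majority class."""
    table = contingency_matrix(true_labels, pred_labels)
    return table.max(axis=0).sum() / table.sum()

true = [0, 0, 1, 1, 2, 2]   # gold newsgroup labels (toy example)
pred = [0, 0, 1, 2, 2, 2]   # K-Means cluster assignments
print(f"purity = {purity(true, pred):.2f}, "
      f"ARI = {adjusted_rand_score(true, pred):.2f}")
```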

  33. Experimental Data From sci.electronics… “A couple of years ago I put together a Tesla circuit which was published in an electronics magazine and could have been the circuit which is referred to here. This one used a flyback transformer from a TV onto which you wound your own primary windings...” • 20 Newsgroups (mini) • 2,000 messages from Usenet newsgroups • 100 messages per topic • Filter messages for body text • Source: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

  34. Results (chart)

  35. Results Comparison • Song et al. Tweets Clustering • Experiment #2: Subtle Cluster Distinctions • Used Tweets about North America, Asia, Africa, and Europe • Comparable performance for ESA and Probase Conceptualization • Hotho et al. WordNet Clustering • Used the Reuters dataset and Bisecting K-Means • Found best results for combined TF-IDF and feature sets • Overall improvement from WordNet features was comparable to Probase features (on the order of +10%)

  36. Conclusions • Semantic analysis features are marginally beneficial in recommendation • Structured data from a limited vocabulary works best for recommending “new items” • Explicit and latent semantic analysis are comparable in recommendation • Knowledge bases generated at Web-scale may be too noisy for narrow-domain tasks • Confirmed the efficacy of semantic analysis in document clustering tasks

  37. Future Directions • Noise reduction • Tune the recommender platform for “concepts” • Further explore the parameter space for feature generators • Hybrid conceptualization / named entity disambiguation? • Domain-specific knowledge sources • Comparison of Web-scale and domain-specific resources as external knowledge (e.g., [Aljaber et al., 2010])

  38. Further Reading • Short Text Conceptualization Using a Probabilistic Knowledge Base [Song et al., 2011] • Exploiting Wikipedia as External Knowledge for Document Clustering [Hu et al., 2009] • Hybrid Recommender Using WordNet “Bag of Synsets” [Degemmis et al., 2007] • Hybrid Recommender Using LDA [Jin et al., 2005] • Feature Generation for Text Categorization Using World Knowledge [Gabrilovich and Markovitch, 2005] • WordNet Improves Text Document Clustering [Hotho et al., 2003]

  39. Acknowledgements • David Stern, Ulrich Paquet, Jurgen Van Gael • Haixun Wang, Yangqiu Song, Zhongyuan Wang • Special thanks to Evelyne Viegas! • Microsoft Research Connections

  40. References • [Gabrilovich et al., 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), Rajeev Sangal, Harish Mehta, and R. K. Bagga (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1606-1611. • [Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (March 2003), 993-1022. • [Song et al., 2011] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. 2011. Short Text Conceptualization using a Probabilistic Knowledgebase. In IJCAI, 2011. • [Stern et al., 2009] David H. Stern, Ralf Herbrich, and Thore Graepel. 2009. Matchbox: large scale online Bayesian recommendations. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 111-120. • [HetRec ’11] Ivan Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems. ACM, New York, NY, USA. • [Degemmis et al., 2007] Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2007. A content-collaborative recommender that exploits WordNet-based user profiles for neighborhood formation. User Modeling and User-Adapted Interaction, Vol. 17, Issue 3, 217-255.

  41. References • [Jin et al., 2005] Xin Jin, Yanzan Zhou, and Bamshad Mobasher. 2005. A maximum entropy web recommendation system: combining collaborative and content features. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05). ACM, New York, NY, USA, 612-617. • [Hu et al., 2009] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 389-396. • [Gabrilovich and Markovitch, 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05). • [Hotho et al., 2003] Andreas Hotho, Steffen Staab, and Gerd Stumme. 2003. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, 541-544. • [Aljaber et al., 2010] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei. 2010. Document clustering of scientific texts using citation contexts. Information Retrieval, Vol. 13, Issue 2, 101-131.

  42. Questions? • Thanks for attending

  43. Appendix • Matchbox Details • Implementation Details • Probase Conceptualization Details • Explicit Semantic Analysis Details • Learnings from Probase

  44. (Appendix A) Matchbox • [Stern et al., 2009] • MSR Cambridge recommendation platform • Implements a hybrid recommender using Infer.NET • Uses a combination of expectation propagation (EP) and variational message passing • Reduces user, item, and context features to a low-dimensional trait space
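A highly simplified sketch of the “trait space” idea behind Matchbox (Stern et al., 2009): user and item feature vectors are projected into a low-dimensional trait space, where the predicted affinity is roughly an inner product. Matchbox itself learns the projections and biases by Bayesian message passing, which this sketch omits entirely; all values below are random placeholders.

```python
import numpy as np

# Bilinear trait-space prediction (illustrative only; not Matchbox's
# actual learned model or API).
rng = np.random.default_rng(0)
n_user_feats, n_item_feats, k = 50, 80, 20    # k = trait dimensions

U = rng.normal(size=(k, n_user_feats))        # user projection (learned)
V = rng.normal(size=(k, n_item_feats))        # item projection (learned)

x_user = rng.random(n_user_feats)             # e.g., user ID one-hot
x_item = rng.random(n_item_feats)             # e.g., movie ID + concepts

score = (U @ x_user) @ (V @ x_item)           # affinity in trait space
print(f"predicted affinity: {score:.2f}")
```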

  45. (Appendix A) Matchbox Setup • Matchbox settings • Uses 20 trait dimensions (determined experimentally) • 10 iterations of the EP algorithm • Trained on approx. 90% of ratings • Updated model with 75% of ratings per user (in the remaining 10%) • MAE computed for the remaining 25% per user

  46. (Appendix B) Implementation • ESA: https://github.com/faraday/wikiprep-esa • LDA: Infer.NET • Probase: Probase Package v. 0.18 • TF-IDF: http://www.codeproject.com/KB/cs/tfidf.aspx • Matchbox: http://codebox/matchbox

  47. (Appendix C) Probase Conceptualization • Identify all Probase terms in the text • Use a noisy-or model to combine: • Concepts from t_l as attribute (z_l = 1) • Concepts from t_l as entity/concept (z_l = 0)

  48. (Appendix C) Probase Conceptualization • Weight terms based on occurrence • Naïve Bayes (similar to Song et al., 2011) • Compute P(c|t) for individual terms and use a Naïve Bayes model to derive concepts • Penalizes false positives, does not reward true positives • Generates very small probabilities for large numbers of terms • Weighted Sum (similar to Gabrilovich et al., 2007) • Compute P(c|t) for individual terms and compute the sum over the document for each concept • Rewards true positives, does not penalize false positives (accurate and inaccurate concepts, resp.)

  49. (Appendix C) Probase Conceptualization • Penalize frequent concepts • Stop words (concepts) are domain-independent • For films, many domain-specific stop concepts • E.g., “movie”, “character”, “actor”, etc. • Inverse document frequency on concepts penalizes those that are too frequent • But it also rewards those that are too infrequent (appearing in only one document) • Solution: filter for minimum and maximum occurrence

  50. (Appendix C) Probase Conceptualization • Using Summation (similar to Wikipedia ESA) • Using Naïve Bayes from the Song et al. approach: P(c|T) = P(T|c)·P(c)/P(T) ∝ ∏_l P(c|t_l) / P(c)^(L−1) • Inverse document frequency for concepts: IDF(c_k) = log(# of documents / document frequency of c_k) • Minimum occurrence = 2 • Maximum occurrence = 0.5 × # of documents
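The two combination schemes and the IDF filter can be sketched as follows. The P(c|t) table, the priors, and the smoothing constant are hypothetical stand-ins for real Probase scores; the formulas match the slide above.

```python
import math
from collections import defaultdict

# Hypothetical P(concept | term) table and concept priors.
p_c_t = {
    "Woody": {"character": 0.6, "toy": 0.3},
    "Buzz":  {"character": 0.5, "toy": 0.2, "astronaut": 0.2},
}
p_c = {"character": 0.1, "toy": 0.05, "astronaut": 0.02}  # prior P(c)

def naive_bayes(terms):
    """P(c|T) proportional to prod_l P(c|t_l) / P(c)^(L-1)."""
    scores = {}
    for c, prior in p_c.items():
        prod = 1.0
        for t in terms:
            prod *= p_c_t.get(t, {}).get(c, 1e-6)  # smooth missing pairs
        scores[c] = prod / prior ** (len(terms) - 1)
    return scores

def weighted_sum(terms):
    """Sum P(c|t_l) over all terms in the document."""
    scores = defaultdict(float)
    for t in terms:
        for c, p in p_c_t.get(t, {}).items():
            scores[c] += p
    return dict(scores)

def idf(concept, corpus_concepts):
    """IDF(c) = log(# documents / document frequency of c)."""
    df = sum(concept in doc for doc in corpus_concepts)
    return math.log(len(corpus_concepts) / df) if df else 0.0

print(naive_bayes(["Woody", "Buzz"]))
print(weighted_sum(["Woody", "Buzz"]))
```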
