Contextual Text Mining with Probabilistic Topic Models
ChengXiang Zhai
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
Joint work with Qiaozhu Mei
LLNL, Aug 15, 2007
Motivating Example: Comparing Product Reviews
IBM Laptop Reviews / APPLE Laptop Reviews / DELL Laptop Reviews
Unsupervised discovery of common topics and their variations
Motivating Example: Comparing News about Similar Topics
Vietnam War / Afghan War / Iraq War
Unsupervised discovery of common topics and their variations
Motivating Example: Discovering Topical Trends in Literature
[Figure: theme strength over time (axis ticks 1980, 1990, 1998, 2003); labels in the plot: TF-IDF Retrieval, Language Model, Text Categorization, IR Applications]
Unsupervised discovery of topics and their temporal variations
Motivating Example: Analyzing Spatial Topic Patterns
• How do blog writers in different states respond to topics such as “oil price increase during Hurricane Katrina”?
• Unsupervised discovery of topics and their variations in different locations
Motivating Example: Sentiment Summary
Unsupervised/semi-supervised discovery of topics and the different sentiments about those topics
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring humans into the loop?
Rest of Talk
• Contextual Text Mining
• The CPLSA Model
• Sample results of specific CPLSA models
• Discussion
Contextual Text Mining
• Given collections of text with contextual information (meta-data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
  • Summarizing search results
  • Federation of text information
  • Opinion analysis
  • Social network analysis
  • Business intelligence
  • …
Context Features of Text (Meta-data)
[Figure: a weblog article annotated with its context features: communities, author, source, location, time, author’s occupation]
Context = Partitioning of Text
[Figure: a collection of papers laid out by year (1998, 1999, …, 2005, 2006) and by venue (WWW, SIGIR, ACL, KDD, SIGMOD); example partitions: papers written in 1998, papers about the Web, papers written by authors in the US]
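A minimal sketch of the idea that each context feature induces its own partition of the same collection. The records, field order, and the `partition` helper are hypothetical and serve only as illustration; they are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical records: (title, year, venue, author_country)
papers = [
    ("paper A", 1998, "SIGIR", "US"),
    ("paper B", 1998, "WWW", "UK"),
    ("paper C", 2005, "KDD", "US"),
    ("paper D", 2006, "WWW", "US"),
]

def partition(docs, key):
    """Group document titles by the value of one context feature."""
    groups = defaultdict(list)
    for doc in docs:
        groups[key(doc)].append(doc[0])
    return dict(groups)

# Each context feature yields a different partition of the same papers.
by_year = partition(papers, key=lambda d: d[1])
by_venue = partition(papers, key=lambda d: d[2])
by_country = partition(papers, key=lambda d: d[3])
```

The same paper lands in one cell of every partition (for example, "paper A" is in the 1998, SIGIR, and US groups), so contexts overlap rather than exclude one another.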
Themes/Topics
• Uses of themes:
  • Summarize topics/subtopics
  • Navigate in a document space
  • Retrieve documents
  • Segment documents
  • …
[Figure: a text segmented by theme: [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. …]
Theme 1: government 0.3, response 0.2, ..
Theme 2: donate 0.1, relief 0.05, help 0.02, ..
…
Theme k: city 0.2, new 0.1, orleans 0.05, ..
Background B: is 0.05, the 0.04, a 0.03, ..
View of Themes: Context-Specific Version of Views
[Figure: two themes (Theme 1: Retrieval Model; Theme 2: Feedback), each shown under two contexts: Before 1998 (Traditional models) and After 1998 (Language models). Words appearing in the views: vector, space, TF-IDF, Okapi, LSI, Rocchio, retrieve, retrieval, weighting, judge, relevance, feedback, expansion, term, document, pseudo, query, language, mixture, model, estimate, smoothing, EM, generation]
Coverage of Themes: Distribution over Themes
• Theme coverage can depend on context
[Figure: example text with theme-highlighted spans: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”. Two coverage charts over the themes Oil Price, Government Response, Aid and donation, and Background: one for Context: Texas, one for Context: Louisiana]
General Tasks of Contextual Text Mining
• Theme Extraction: Extract the global salient themes
  • Common information shared over all contexts
• View Comparison: Compare a theme from different views
  • Analyze the content variation of themes over contexts
• Coverage Comparison: Compare the theme coverage of different contexts
  • Reveal how closely a theme is associated with a context
• Others:
  • Causal analysis
  • Correlation analysis
A General Solution: CPLSA
• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model ([Hofmann 99]) by
  • Introducing context variables
  • Modeling views of topics
  • Modeling coverage variations of topics
• Process of contextual text mining
  • Instantiation of CPLSA (context, views, coverage)
  • Fit the model to text data (EM algorithm)
  • Compute probabilistic topic patterns
“Generation” Process of CPLSA
[Figure: generating the words of a document whose context is Time = July 2005, Location = Texas, Author = xxx, Occup. = Sociologist, Age Group = 45+, …]
• Choose a view (View1, View2, View3; contexts shown: Texas, July 2005, sociologist)
• Choose a coverage (theme coverages shown for: Texas, July 2005, document, ……)
• Choose a theme (themes shown: government 0.3, response 0.2, ..; donate 0.1, relief 0.05, help 0.02, ..; city 0.2, new 0.1, orleans 0.05, ..)
• Draw a word from θ_i (words shown: government, response, donate, donation, help, aid, new, New Orleans, Orleans)
Example text: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”
Probabilistic Model
• To generate a document D with context feature set C:
  • Choose a view v_i according to the view distribution
  • Choose a coverage κ_j according to the coverage distribution
  • Choose a theme according to the coverage κ_j
  • Generate a word using the chosen theme's word distribution in view v_i
• The likelihood of the document collection is:
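The likelihood formula itself did not survive in the slide text. A reconstruction that follows the four generation steps above is sketched here. The symbols are assumptions: c(w, D) is the count of word w in document D, p(v_i | D, C) and p(κ_j | D, C) are the view and coverage distributions, p(l | κ_j) is the probability of theme l under coverage κ_j, θ_il is the word distribution of theme l in view v_i, and n, m, k are the numbers of views, coverages, and themes.

```latex
\log p(\mathcal{D}) =
  \sum_{(D,C)\in\mathcal{D}} \; \sum_{w\in V} c(w,D)\,
  \log \sum_{i=1}^{n} p(v_i \mid D,C)
       \sum_{j=1}^{m} p(\kappa_j \mid D,C)
       \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il})
```

Each of the three inner sums marginalizes one hidden choice (view, coverage, theme), which is why EM is the natural fitting procedure.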
Parameter Estimation: EM Algorithm
• Interesting patterns:
  • Theme content variation for each view
  • Theme strength variation for each context
• Prior from a user can be incorporated using MAP estimation
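The update formulas are not preserved in the slide text. As a rough illustration of the E-step/M-step pattern, here is a sketch of EM for plain PLSA, i.e. the special case with a single view and one coverage distribution per document. It is not the full CPLSA estimator, and the function name and toy counts are hypothetical.

```python
import numpy as np

def plsa_em(counts, k, iters=50, seed=0):
    """EM for plain PLSA on a (docs x words) term-count matrix.

    Returns p(theme | doc) with shape (docs, k) and
    p(word | theme) with shape (k, words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    coverage = rng.random((n_docs, k))
    coverage /= coverage.sum(axis=1, keepdims=True)
    themes = rng.random((k, n_words))
    themes /= themes.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(iters):
        # E-step: p(theme | doc, word) is proportional to
        # p(theme | doc) * p(word | theme); shape (docs, k, words).
        post = coverage[:, :, None] * themes[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + eps
        # Expected number of times each theme generated each word.
        weighted = counts[:, None, :] * post
        # M-step: renormalize the expected counts.
        coverage = weighted.sum(axis=2)
        coverage /= coverage.sum(axis=1, keepdims=True) + eps
        themes = weighted.sum(axis=0)
        themes /= themes.sum(axis=1, keepdims=True) + eps
    return coverage, themes

# Toy collection: two documents use words 0-1, two use words 2-3.
counts = np.array([[4, 3, 0, 0],
                   [3, 4, 0, 0],
                   [0, 0, 5, 2],
                   [0, 0, 2, 5]], dtype=float)
coverage, themes = plsa_em(counts, k=2)
```

In this simplified setting, the rows of `themes` play the role of theme content and the rows of `coverage` the role of theme strength per document. CPLSA adds view and coverage selection on top of the same loop.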
Regularization of the Model
• Why?
  • Generality → high complexity (inefficient, multiple local maxima)
  • Real applications have domain constraints/knowledge
• Two useful simplifications:
  • Fixed-Coverage: Only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)
  • Fixed-View: Only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)
• In general
  • Impose priors on model parameters
  • Support the whole spectrum from unsupervised to supervised learning
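One common way to impose a prior on a theme's word distribution is a conjugate Dirichlet prior, which turns the M-step into a pseudo-count update. The sketch below assumes that form; the slides do not spell out the prior, and the function name, the strength `mu`, and the numbers are hypothetical.

```python
import numpy as np

def map_word_distribution(expected_counts, prior, mu):
    """MAP re-estimate of p(word | theme).

    `prior` is a word distribution summing to 1 (e.g. built from
    user-supplied keywords); `mu` is how many pseudo-counts it is worth.
    mu = 0 gives the maximum-likelihood estimate.
    """
    expected_counts = np.asarray(expected_counts, dtype=float)
    prior = np.asarray(prior, dtype=float)
    return (expected_counts + mu * prior) / (expected_counts.sum() + mu)

# Expected counts from an E-step, and a prior that favors the third word.
ml_estimate = map_word_distribution([6, 3, 1], [0, 0, 1], mu=0)
map_estimate = map_word_distribution([6, 3, 1], [0, 0, 1], mu=10)
```

Raising `mu` moves the estimate from purely data-driven toward the user's prior, which is one way to cover the range from unsupervised to supervised use.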
Interpretation of Topics
[Figure: topic labeling pipeline. Statistical topic models yield multinomial topic models (e.g. term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …). From the collection (context), an NLP chunker and n-gram statistics build a candidate label pool (e.g. database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …). A relevance score, followed by re-ranking for coverage and discrimination, produces a ranked list of labels (e.g. clustering algorithm; distance measure; …)]
Relevance: the Zero-Order Score
• Intuition: prefer phrases covering high probability words
[Figure: a latent topic θ with word distribution p(w|θ) over clustering, dimensional, algorithm, …, birch, shape, …, body. Good label (l1): “clustering algorithm”. Bad label (l2): “body shape”]
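A minimal sketch of the intuition above. It assumes the score of a label is the sum, over its words, of log p(w|θ)/p(w), with p(w) a background distribution. The scoring form, the function name, and all probabilities are illustrative assumptions, not values from the talk.

```python
import math

def zero_order_score(label, topic, background, floor=1e-6):
    """Sum of log-ratios log(p(w|topic) / p(w)) over the label's words.

    Words missing from either distribution get a small floor probability.
    """
    return sum(
        math.log(topic.get(w, floor) / background.get(w, floor))
        for w in label.split()
    )

# Hypothetical topic and background word probabilities.
topic = {"clustering": 0.20, "algorithm": 0.10, "dimensional": 0.05,
         "birch": 0.02, "shape": 0.01, "body": 0.0001}
background = {"clustering": 0.01, "algorithm": 0.02, "dimensional": 0.005,
              "birch": 0.001, "shape": 0.005, "body": 0.005}

good = zero_order_score("clustering algorithm", topic, background)
bad = zero_order_score("body shape", topic, background)
```

With these numbers "clustering algorithm" scores well above "body shape", because both of its words are far more likely under the topic than in the background.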
Relevance: the First-Order Score
• Intuition: prefer phrases with similar context (distribution)
[Figure: word distributions estimated from a context collection C (SIGMOD Proceedings) over words such as clustering, dimension, partition, algorithm, join, hash, …: the topic P(w|θ), a good label l1 = “clustering algorithm” with P(w|l1), and a bad label l2 = “hash join” with P(w|l2). Score(l, θ): D(θ||l1) < D(θ||l2)]
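A minimal sketch of the comparison D(θ||l1) < D(θ||l2), taking the score of a label to be the negative KL divergence from the topic to the label's context distribution. The function names, the flooring of unseen words, and all probabilities are illustrative assumptions.

```python
import math

def kl_divergence(p, q, floor=1e-9):
    """D(p || q) over words with p(w) > 0; q is floored to avoid log(0)."""
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), floor))
        for w, pw in p.items()
        if pw > 0
    )

def first_order_score(topic, label_context):
    """Higher is better: labels whose context matches the topic win."""
    return -kl_divergence(topic, label_context)

# Hypothetical distributions: P(w|theta), P(w|l1), P(w|l2).
topic = {"clustering": 0.4, "dimension": 0.3, "algorithm": 0.2, "partition": 0.1}
ctx_good = {"clustering": 0.35, "dimension": 0.25, "algorithm": 0.3, "partition": 0.1}
ctx_bad = {"hash": 0.4, "join": 0.3, "algorithm": 0.2, "partition": 0.1}

score_good = first_order_score(topic, ctx_good)
score_bad = first_order_score(topic, ctx_bad)
```

The label whose context distribution puts mass on the same words as the topic has the smaller divergence and therefore the higher score.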
Sample Results
• Comparative text mining
• Spatiotemporal pattern mining
• Sentiment summary
• Event impact analysis
• Temporal author-topic analysis
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
Comparing Laptop Reviews
Top words serve as “labels” for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])
These word distributions can be used to segment text and add hyperlinks between documents
Spatiotemporal Patterns in Blog Articles
• Query = “Hurricane Katrina”
• Topics in the results:
• Spatiotemporal patterns
Theme Life Cycles for Hurricane Katrina
Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …
New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
Theme Snapshots for Hurricane Katrina
Week1: The theme is the strongest along the Gulf of Mexico
Week2: The discussion moves towards the north and west
Week3: The theme distributes more uniformly over the states
Week4: The theme is again strong along the east coast and the Gulf of Mexico
Week5: The theme fades out in most states
Theme Life Cycles: KDD
Global theme life cycles of KDD abstracts
• gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
• marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
• rules 0.0142, association 0.0064, support 0.0053, …
Theme Evolution Graph: KDD
[Figure: themes placed on a timeline T (1999, 2000, 2001, 2002, 2003, 2004) and linked by evolution edges. Themes shown:]
• web 0.009, classification 0.007, features 0.006, topic 0.005, …
• SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
• mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
• topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
• decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
• classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
• information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
• …
Blog Sentiment Summary (query = “Da Vinci Code”)
Results: Sentiment Dynamics
Facet: the book “the da vinci code” (bursts during the movie, Pos > Neg)
Facet: religious beliefs (bursts during the movie, Neg > Pos)
Event Impact Analysis: IR Research
[Figure: the theme “retrieval models” in SIGIR papers (term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …), with views of the theme around two events on a year axis: the start of the TREC conferences (1992) and the publication of the paper “A language modeling approach to information retrieval” (1998). Views shown:]
• xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …
• vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
• model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …
• probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
Temporal-Author-Topic Analysis
[Figure: the global theme “frequent patterns” viewed by author (Author A, Author B; authors named: Jiawei Han, Rakesh Agrawal) and by time (split at 2000). Views shown:]
• close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, …
• index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, …
• project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, …
• pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, …
• research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, …
Related Work
• Specific Contextual Text Mining Problems
  • Multi-collection Comparative Mining (e.g., [Zhai et al. 04])
  • Spatiotemporal theme analysis (e.g., [Mei et al. 06])
  • Author-topic analysis (e.g., [Steyvers et al. 04])
  • …
• Probabilistic topic models:
  • Probabilistic latent semantic analysis (PLSA) (e.g., [Hofmann 99])
  • Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])
Conclusions
• Defined a general text mining problem: contextual text mining
• Proposed a general solution
  • Contextual probabilistic latent semantic analysis
  • Probabilistic labeling of topics
• Many applications
• Future work
  • Evaluation
  • Similar extension to LDA
  • Applications
References
• CPLSA
  • Q. Mei, C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD '06.
• Labeling
  • Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD '07.
• Special cases:
  • C. Zhai, A. Velivelli, B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. In Proceedings of KDD '04.
  • Q. Mei, C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD '05.
  • Q. Mei, C. Liu, H. Su, C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW '06.
  • Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW '07.
Research of the IR group @ UIUC
[Figure: research map. Labels in the diagram: Text; Natural Language Content Analysis; Search, Filtering, Categorization, Clustering, Extraction, Mining, Summarization, Visualization; Information Access, Information Organization, Knowledge Acquisition; Search Applications, Mining Applications; Web, Email, and Bioinformatics; Entity/Relation Extraction. Current focus: Personalized, Retrieval models, Difficult queries; Comparative text mining]
The End
Thank You!