Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models. Michael Paul and Roxana Girju. Outline. Overview of topic models Cross-Collection LDA Cross-cultural analysis with ccLDA Other applications of ccLDA Model evaluation An alternative cross-collection model.
Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models Michael Paul and Roxana Girju
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Outline • Overview of topic models • PLSI and LDA • Some slides borrowed from CS410 – ChengXiang Zhai • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Probabilistic Topic Models • Idea: each document is some mix of topics • Each word in the document belongs to a topic
Document as a Sample of Mixed Topics • Example document: [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. … • Topic 1: government 0.3, response 0.2, ... • Topic 2: city 0.2, new 0.1, orleans 0.05, ... • Topic k: donate 0.1, relief 0.05, help 0.02, ... • Background B: is 0.05, the 0.04, a 0.03, ... • Applications of topic models: • Summarize themes/aspects • Facilitate navigation/browsing • Retrieve documents • Segment documents • Many others • How can we discover these topic word distributions?
Probabilistic Latent Semantic Indexing [Hofmann, 1999] • Each token in a document is associated with 2 variables: • a word w (observable) • a topic z (hidden) • P(w,z|d) = P(z|d) P(w|z)
PLSA as a Mixture Model • “Generating” word w in doc d in the collection • Topic 1 (θ1, document weight πd,1): warning 0.3, system 0.2, ... • Topic 2 (θ2, document weight πd,2): aid 0.1, donation 0.05, support 0.02, ... • Topic k (θk, document weight πd,k): statistics 0.2, loss 0.1, dead 0.05, ... • Background B (θB): is 0.05, the 0.04, a 0.03, ... • Background is chosen with probability λB, the topic mixture with probability 1 - λB • Parameters: λB = noise level (manually set) • θ’s and π’s are estimated with Maximum Likelihood
How to Estimate Multiple Topics? (Expectation Maximization) • Known background p(w | B): the 0.2, a 0.1, we 0.01, to 0.02, … • Unknown topic model p(w|θ1) = ? (“Text mining”): text = ?, mining = ?, association = ?, word = ?, … • Unknown topic model p(w|θ2) = ? (“information retrieval”): information = ?, retrieval = ?, query = ?, document = ?, … • Observed Doc(s) • E-Step: Predict topic labels using Bayes Rule • M-Step: Max. Likelihood Estimator based on “fractional counts”
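The E-step and M-step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: it assumes the mixture from the previous slide, p(w|d) = λB p(w|B) + (1 - λB) Σj πd,j p(w|θj), with a fixed background. The function name `plsa_em` and its arguments are hypothetical.

```python
import numpy as np

def plsa_em(counts, p_bg, k, lam_b=0.5, n_iter=50, seed=0):
    """EM for a PLSA-style mixture with a fixed background model (sketch).

    counts: (D, V) array of word counts c(w, d)
    p_bg:   (V,) known background distribution p(w | B)
    k:      number of topics
    lam_b:  manually set background weight (lambda_B)
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    pi = rng.dirichlet(np.ones(k), size=n_docs)       # pi[d, j]: topic weights per document
    theta = rng.dirichlet(np.ones(n_words), size=k)   # theta[j, w] = p(w | theta_j)
    tiny = 1e-300                                     # guards against division by zero
    for _ in range(n_iter):
        # E-step: posterior over the hidden label of each (d, w) pair via Bayes rule
        joint = pi[:, :, None] * theta[None, :, :]    # (D, k, V)
        mix = joint.sum(axis=1)                       # (D, V): sum_j pi_dj p(w | theta_j)
        p_topic = joint / np.maximum(mix[:, None, :], tiny)
        p_back = lam_b * p_bg / np.maximum(lam_b * p_bg + (1 - lam_b) * mix, tiny)
        # M-step: re-estimate from fractional counts not explained by the background
        frac = (counts * (1 - p_back))[:, None, :] * p_topic
        pi = frac.sum(axis=2)
        pi /= np.maximum(pi.sum(axis=1, keepdims=True), tiny)
        theta = frac.sum(axis=0)
        theta /= np.maximum(theta.sum(axis=1, keepdims=True), tiny)
    return pi, theta
```

For example, `plsa_em(counts, counts.sum(axis=0) / counts.sum(), k=2)` uses the corpus-wide word frequencies as the background, which is one common choice and an assumption here.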
PLSI - Problems • Each document is represented as a dummy variable d • Number of parameters grows linearly with corpus size • Overfitting • Not fully generative • Not clear how to model previously unseen documents
Latent Dirichlet Allocation [Blei et al., 2003] • Per-document topic mixtures and word multinomials come from Dirichlet priors • Exact solution is intractable • Inference is more complicated • Variational methods • Monte Carlo
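As an illustration of the Monte Carlo option, collapsed Gibbs sampling is one widely used approach for LDA. The sketch below is not taken from the slides: the function name `lda_gibbs`, the symmetric hyperparameters `alpha` and `beta`, and the input format (each document as a list of integer word ids) are all assumptions.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA (sketch); docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # tokens in doc d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                 # total tokens assigned to topic k
    z = []
    for d, doc in enumerate(docs):           # random initial assignments
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the token's current assignment
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # conditional P(z_i = k | all other assignments, words)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                  # add it back under the sampled topic
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # point estimates of the document-topic and topic-word distributions
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

The Dirichlet priors show up only as the pseudo-counts `alpha` and `beta` added to the count tables, which is what makes the collapsed sampler so compact.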
Dirichlet Distribution • Conjugate prior of multinomial distribution
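To make the conjugacy concrete, the standard result can be written out as follows. The symbols α (Dirichlet parameters), θ (multinomial parameters), and n_k (observed count of outcome k) are notation chosen for this illustration.

```latex
% Dirichlet density over a K-dimensional probability vector \theta
p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}

% Multinomial likelihood of counts n = (n_1, \dots, n_K), up to a constant in \theta
p(n \mid \theta) \propto \prod_{k=1}^{K} \theta_k^{n_k}

% Posterior: again a Dirichlet, with the counts added to the prior parameters
p(\theta \mid n, \alpha) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k + n_k - 1}
\quad\Longrightarrow\quad
\theta \mid n \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K)
```

Because the posterior stays in the Dirichlet family, the prior parameters act as pseudo-counts added to the observed counts.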
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Cross-Collection LDA (ccLDA) • LDA extension for modeling multiple text collections • Each topic has a word distribution that is shared among all collections, as well as word distributions that are unique to each collection • Automatically discovers differences between collections and organizes them by topic
Example • Topic of weather and the outdoors in travel forums
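ccLDA • Graphical representation: plate diagram (figure not preserved) with variables α, θ, z, x, w, c, φ, β, σ, δ, ψ, γ0, γ1 and plates D, N, T, C • The generative process: (figure not preserved) • Inference can be done with Gibbs sampling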
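Since the generative-process figure did not survive, the following is a hedged reconstruction from the surrounding slides: shared and collection-specific word distributions per topic, a switch x whose probability depends on topic and collection, and collection-dependent priors on the document-topic distributions. The pairing of symbols with roles (φ shared, σ collection-specific, ψ the switch probability with Beta(γ0, γ1) prior, x = 1 meaning "collection-specific") is an assumption, as are the function name and default values.

```python
import numpy as np

def generate_cclda_corpus(alpha, vocab_size, docs_per_collection, doc_len,
                          beta=0.5, delta=0.5, gamma0=1.0, gamma1=1.0, seed=0):
    """Sample a toy corpus from a ccLDA-style generative process (sketch).

    alpha: (C, T) array; row c is the Dirichlet prior for documents in collection c.
    Returns a list of (collection_id, word_ids) pairs.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    n_collections, n_topics = alpha.shape
    # phi[z]: topic word distribution shared by all collections
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # sigma[z, c]: topic word distribution specific to collection c
    sigma = rng.dirichlet(np.full(vocab_size, delta), size=(n_topics, n_collections))
    # psi[z, c]: assumed probability that x = 1 (use the collection-specific distribution)
    psi = rng.beta(gamma1, gamma0, size=(n_topics, n_collections))
    corpus = []
    for c in range(n_collections):
        for _ in range(docs_per_collection):
            theta = rng.dirichlet(alpha[c])          # collection-dependent prior
            words = []
            for _ in range(doc_len):
                z = rng.choice(n_topics, p=theta)    # topic of this token
                x = int(rng.random() < psi[z, c])    # shared vs. collection-specific switch
                dist = sigma[z, c] if x == 1 else phi[z]
                words.append(int(rng.choice(vocab_size, p=dist)))
            corpus.append((c, words))
    return corpus
```

A Gibbs sampler for this model would resample z and x for each token; that part is omitted here because the slides do not give the update equations.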
Previous Work • Comparative mixture model (ccMix) • ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004. • Improvements in ccLDA: • Does not rely on user-defined parameters • Distributions have Dirichlet/Beta priors • Document-topic distributions have collection-dependent priors • P(x) depends on the topic and collection
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Cross-Cultural Analysis • Documents from or about 3 countries: • United Kingdom • India • Singapore • 3,266 forum discussions • collected from lonelyplanet.com • represents the perspective of tourists • 7,388 English-language blogs • collected through blogcatalog.com • represents the perspective of locals
Cross-Cultural Analysis • Topic of religion from the blogs
Cross-Cultural Analysis • Topic of entertainment from the blogs • Compare against ccMix
Cross-Cultural Analysis • Topic of travel from the blogs • Compare against LDA (on each collection individually)
Cross-Cultural Analysis • Topic of food from both datasets • Compare the view of tourists and locals
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Scientific research/literature analysis • Media analysis and bias detection • Model evaluation • An alternative cross-collection model
Research Analysis • 16,186 abstracts from computational linguistics and linguistics journals • Interdisciplinary research topic discovery • Topic evolution over time
Research Analysis • Topic of communication
Research Analysis • Topic of parsing/grammars across two time intervals
Media Analysis • 623 news articles from msnbc.com and foxnews.com from August 2008 • Discover editorial differences within topics
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Model Evaluation • Greater likelihood of held-out data than alternative models
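The slide does not say how held-out likelihood was computed. One common way to report it is per-token perplexity, sketched below under the assumption that topic proportions for the held-out documents have already been estimated. The function name `heldout_perplexity` and its argument layout are hypothetical.

```python
import numpy as np

def heldout_perplexity(docs, theta, phi):
    """Per-token perplexity of held-out documents under a topic model (sketch).

    docs:  list of word-id lists (held-out documents)
    theta: (D, T) topic proportions, assumed already estimated for each document
    phi:   (T, V) topic-word distributions
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        if len(doc) == 0:
            continue
        # p(w | d) = sum_k theta[d, k] * phi[k, w] for every token in the document
        p_w = theta[d] @ phi[:, doc]
        log_lik += np.log(p_w).sum()
        n_tokens += len(doc)
    return float(np.exp(-log_lik / n_tokens))
```

Lower perplexity corresponds to greater held-out likelihood, so the two ways of reporting the comparison rank models identically.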
Model Evaluation • Document classification – new vs old • Compare to NB and SVM (linear kernel)
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model
Alternative Model • Similar to hierarchical Pachinko Allocation [Mimno et al, 2007] • Model as 2-level hierarchy
Alternative Model • Single, global set of “super-topics” • One set of “sub-topics” for each collection • Choose super-topic T from P(T|d) • Choose sub-topic t from P(t|T,c) • Choose hierarchy level l from P(l|t,T) • if l = 0, choose word from P(w|T) • else if l = 1, choose word from P(w|t)
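The token-level steps above can be sketched as follows. The array layouts, the function names `sample_token` and `cclda_constraint`, and the choice to store P(l = 1|t,T) per collection are assumptions made for illustration. The second helper builds the one-to-one sub-topic assignment under which each super-topic always selects its matching sub-topic.

```python
import numpy as np

def sample_token(rng, c, p_super, p_sub, p_level, p_word_super, p_word_sub):
    """Draw one token from the two-level model for a document in collection c (sketch).

    p_super:      (S,)       P(T | d) for this document
    p_sub:        (C, S, U)  P(t | T, c)
    p_level:      (C, S, U)  P(l = 1 | t, T), stored per collection (assumed layout)
    p_word_super: (S, V)     P(w | T)
    p_word_sub:   (C, U, V)  P(w | t), sub-topics being specific to each collection
    """
    T = rng.choice(p_super.shape[0], p=p_super)    # choose super-topic
    t = rng.choice(p_sub.shape[2], p=p_sub[c, T])  # choose sub-topic
    l = int(rng.random() < p_level[c, T, t])       # choose hierarchy level
    dist = p_word_sub[c, t] if l == 1 else p_word_super[T]
    w = rng.choice(dist.shape[0], p=dist)          # choose word from the selected level
    return int(T), int(t), l, int(w)

def cclda_constraint(n_collections, n_topics):
    """P(t | T, c) with P(t=j | T=j) = 1 and 0 otherwise, for every collection."""
    return np.tile(np.eye(n_topics), (n_collections, 1, 1))
```

Passing `cclda_constraint(C, S)` as `p_sub` ties every super-topic to exactly one sub-topic per collection, so the sampler only ever chooses between a shared and a collection-specific word distribution.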
Alternative Model • This is just a generalization of ccLDA! • ccLDA = special case, constrained such that for each super-topic T=j there is exactly one sub-topic such that P(t=j|T=j)=1 and P(t=i|T=j)=0 for all i ≠ j
Alternative Model • Topic of religion in the blogs 0.970483
Alternative Model • Topic of religion in the blogs 0.984414
Alternative Model • Topic of religion in the blogs 0.851749 0.102534
ccLDA • Topic of religion from the blogs
Alternative Model • Topic of politics in the blogs 0.29108 0.699227
Alternative Model • Topic of politics in the blogs 0.987059
Alternative Model • Topic of politics in the blogs 0.970675
ccLDA • Topic of politics from the blogs
Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model