Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models

Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models • Michael Paul and Roxana Girju

Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

Outline • Overview of topic models • PLSI and LDA • Some slides borrowed from CS410 – ChengXiang Zhai • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Model evaluation • An alternative cross-collection model

Probabilistic Topic Models • Idea: each document is some mix of topics • Each word in the document belongs to a topic

Document as a Sample of Mixed Topics • Example document: [ Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response ] to the [ flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated ] … [ Over seventy countries pledged monetary donations or other assistance ]. … • Topic 1: government 0.3, response 0.2, ... • Topic 2: city 0.2, new 0.1, orleans 0.05, ... • … • Topic k: donate 0.1, relief 0.05, help 0.02, ... • Background B: is 0.05, the 0.04, a 0.03, ... • Applications of topic models: summarize themes/aspects; facilitate navigation/browsing; retrieve documents; segment documents; many others • How can we discover these topic word distributions?

Probabilistic Latent Semantic Indexing [Hofmann, 1999] • Each token in a document is associated with 2 variables: • a word w (observable) • a topic z (hidden) • P(w,z|d) = P(z|d) P(w|z)

PLSA as a Mixture Model • “Generating” word w in doc d in the collection • Topic θ_1: warning 0.3, system 0.2, ... (document weight π_{d,1}) • Topic θ_2: aid 0.1, donation 0.05, support 0.02, ... (document weight π_{d,2}) • … • Topic θ_k: statistics 0.2, loss 0.1, dead 0.05, ... (document weight π_{d,k}) • Background θ_B: is 0.05, the 0.04, a 0.03, ... • The word comes from the background with probability λ_B and from one of the k topics with probability 1 - λ_B • Parameters: λ_B = noise level (manually set); the θ’s and π’s are estimated with Maximum Likelihood
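The mixture on this slide can be written out as a worked equation. This is a sketch under assumed notation: λ_B for the background weight, π_{d,j} for the weight of topic j in document d, θ_j for the topic word distributions, and c(w,d) for the count of word w in document d are names chosen here, since the slide's own symbols did not survive extraction.

```latex
% Probability of generating word w in document d:
% background with weight \lambda_B, otherwise a mixture of k topics.
p_d(w) = \lambda_B \, p(w \mid \theta_B)
       + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)

% Log-likelihood of a collection C, maximized over the \theta_j and \pi_{d,j}
% with \lambda_B held fixed:
\log p(C) = \sum_{d \in C} \sum_{w} c(w, d) \, \log p_d(w)
```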

How to Estimate Multiple Topics? (Expectation Maximization) • Known background p(w | B): the 0.2, a 0.1, we 0.01, to 0.02, … • Observed doc(s) • Unknown topic model p(w|θ_1) = ? (“text mining”): text = ?, mining = ?, association = ?, word = ?, … • Unknown topic model p(w|θ_2) = ? (“information retrieval”): information = ?, retrieval = ?, query = ?, document = ?, … • E-Step: predict topic labels using Bayes’ rule • M-Step: maximum likelihood estimator based on “fractional counts”
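The E-step and M-step named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function name `plsa_em`, the toy documents, the background distribution, and the default background weight `lam_b=0.5` are all assumptions made for the example.

```python
import random

def plsa_em(docs, vocab, k, background, lam_b=0.5, iters=50, seed=0):
    """EM for a PLSA-style mixture with a fixed background distribution.

    docs: list of {word: count}; background: {word: prob} over vocab.
    Returns (topics, mix): k word distributions and per-document topic weights.
    """
    rng = random.Random(seed)
    topics = []
    for _ in range(k):  # random, normalized initial topic word distributions
        raw = {w: rng.random() + 0.1 for w in vocab}
        total = sum(raw.values())
        topics.append({w: v / total for w, v in raw.items()})
    mix = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        # Tiny pseudo-counts keep every probability strictly positive.
        new_topics = [{w: 1e-12 for w in vocab} for _ in range(k)]
        new_mix = [[1e-12] * k for _ in docs]
        for d, counts in enumerate(docs):
            for w, c in counts.items():
                # E-step: Bayes' rule for "not background" and for each topic.
                parts = [mix[d][j] * topics[j][w] for j in range(k)]
                s = sum(parts)
                p_topic = (1 - lam_b) * s
                not_bg = p_topic / (p_topic + lam_b * background[w])
                for j in range(k):
                    frac = c * not_bg * parts[j] / s  # fractional count
                    new_topics[j][w] += frac
                    new_mix[d][j] += frac
        # M-step: normalize the fractional counts.
        for j in range(k):
            total = sum(new_topics[j].values())
            topics[j] = {w: v / total for w, v in new_topics[j].items()}
        for d in range(len(docs)):
            total = sum(new_mix[d])
            mix[d] = [v / total for v in new_mix[d]]
    return topics, mix

# Toy data (hypothetical): two short documents and a background favoring "the".
docs = [{"text": 3, "mining": 2, "the": 4}, {"query": 3, "retrieval": 2, "the": 4}]
vocab = ["mining", "query", "retrieval", "text", "the"]
background = {"the": 0.7, "mining": 0.075, "query": 0.075,
              "retrieval": 0.075, "text": 0.075}
topics, mix = plsa_em(docs, vocab, k=2, background=background)
```

Because the background absorbs much of the mass of frequent words such as "the", the learned topics are pushed toward the content words; the exact values depend on the random initialization.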

PLSI - Problems • Each document is represented as a dummy variable d • Number of parameters grows linearly with corpus size • Overfitting • Not fully generative • Not clear how to model previously unseen documents

Latent Dirichlet Allocation [Blei et al., 2003] • Per-document topic mixtures and word multinomials come from Dirichlet priors • Exact solution is intractable • Inference is more complicated • Variational methods • Monte Carlo
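To make the role of the Dirichlet priors concrete, the LDA generative story can be sketched as a sampler. This is an illustrative sketch using only the standard library; the function names, the symmetric hyperparameter values `alpha=0.5` and `beta=0.1`, and the toy vocabulary are assumptions, and the code generates data rather than performing inference.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_lda_corpus(n_docs, doc_len, n_topics, vocab,
                        alpha=0.5, beta=0.1, seed=0):
    rng = random.Random(seed)
    # One word multinomial per topic, drawn from a Dirichlet prior.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic mixture per document, drawn from a Dirichlet prior.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # hidden topic of this token
            w = sample_index(phi[z], rng)  # observed word of this token
            doc.append((vocab[w], z))
        corpus.append(doc)
    return phi, corpus

# Hypothetical toy vocabulary.
lda_vocab = ["city", "flood", "donate", "relief", "government", "response"]
lda_phi, lda_corpus = generate_lda_corpus(3, 10, 2, lda_vocab)
```

Inference runs this story in reverse: given only the words, it estimates the topic assignments and distributions, which is where the variational and Monte Carlo methods come in.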

Dirichlet Distribution • Conjugate prior of multinomial distribution
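Conjugacy means that a Dirichlet prior combined with multinomial counts gives a Dirichlet posterior whose parameters are the prior parameters plus the observed counts. The short sketch below illustrates that update; the function names and the numbers are made up for the example.

```python
def dirichlet_posterior(prior_alphas, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return [a + n for a, n in zip(prior_alphas, counts)]

def dirichlet_mean(alphas):
    """Expected multinomial probabilities under a Dirichlet."""
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0, 1.0]   # uniform prior over three outcomes
counts = [5, 2, 0]        # hypothetical observed counts
posterior = dirichlet_posterior(prior, counts)  # [6.0, 3.0, 1.0]
posterior_mean = dirichlet_mean(posterior)      # approximately [0.6, 0.3, 0.1]
```

The prior acts like pseudo-counts: the third outcome was never observed but still keeps probability 0.1 in the posterior mean.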

Latent Dirichlet Allocation

Cross-Collection LDA (ccLDA) • LDA extension for modeling multiple text collections • Each topic has a probability distribution that is shared among all collections as well as word distributions that are unique to each collection • Automatically discovers differences between collections and organizes them by topic

Example • Topic of weather and the outdoors in travel forums

ccLDA • Graphical representation: [Figure: plate diagram with nodes α, θ, z, w, c, x, φ, β, ψ, γ0, γ1, σ, δ and plate labels N, D, CT, TC] • The generative process • Inference can be done with Gibbs sampling
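The generative process behind the diagram can be sketched as a sampler, reading the variable names from the figure labels and the description on the surrounding slides. Several details are assumptions made for this sketch rather than statements of the authors' exact model: that x = 0 selects the shared word distribution φ and x = 1 the collection-specific distribution σ, that ψ is the probability of x = 1 drawn from a Beta prior with γ1 and γ0 as pseudo-counts for x = 1 and x = 0, and all hyperparameter values and function names.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_cclda_corpus(docs_per_collection, doc_len, n_topics, vocab, alphas,
                          beta=0.1, delta=0.1, gamma0=1.0, gamma1=1.0, seed=0):
    rng = random.Random(seed)
    n_coll, v = len(docs_per_collection), len(vocab)
    # phi[z]: word distribution of topic z shared by all collections.
    phi = [sample_dirichlet([beta] * v, rng) for _ in range(n_topics)]
    # sigma[z][c]: word distribution of topic z specific to collection c.
    sigma = [[sample_dirichlet([delta] * v, rng) for _ in range(n_coll)]
             for _ in range(n_topics)]
    # psi[z][c]: assumed probability that x = 1 for topic z in collection c.
    psi = [[rng.betavariate(gamma1, gamma0) for _ in range(n_coll)]
           for _ in range(n_topics)]
    corpus = []
    for c, n_docs in enumerate(docs_per_collection):
        for _ in range(n_docs):
            # The document-topic prior depends on the collection.
            theta = sample_dirichlet(alphas[c], rng)
            tokens = []
            for _ in range(doc_len):
                z = sample_index(theta, rng)
                x = 1 if rng.random() < psi[z][c] else 0
                dist = sigma[z][c] if x == 1 else phi[z]
                tokens.append((vocab[sample_index(dist, rng)], z, x))
            corpus.append((c, tokens))
    return corpus

# Hypothetical toy setup: 2 collections, 2 topics.
cc_vocab = ["temple", "church", "food", "curry", "rain", "monsoon"]
cc_corpus = generate_cclda_corpus([2, 3], 20, 2, cc_vocab,
                                  alphas=[[0.5, 0.5], [1.0, 0.2]])
```

A Gibbs sampler for this model would resample each token's pair (z, x) given all the other assignments; that inference step is not shown here.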

Previous Work • Comparative mixture model (ccMix) • ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative text mining. Proceedings of ACM KDD 2004. • Improvements in ccLDA: • Does not rely on user-defined parameters • Distributions have Dirichlet/Beta priors • Document-topic distributions have collection-dependent priors • P(x) depends on the topic and collection

Cross-Cultural Analysis • Documents from or about 3 countries: • United Kingdom • India • Singapore • 3,266 forum discussions • collected from lonelyplanet.com • represents the perspective of tourists • 7,388 English-language blogs • collected through blogcatalog.com • represents the perspective of locals

Cross-Cultural Analysis • Topic of religion from the blogs

Cross-Cultural Analysis • Topic of entertainment from the blogs • Compare against ccMix

Cross-Cultural Analysis • Topic of travel from the blogs • Compare against LDA (on each collection individually)

Cross-Cultural Analysis • Topic of food from both datasets • Compare the view of tourists and locals

Outline • Overview of topic models • Cross-Collection LDA • Cross-cultural analysis with ccLDA • Other applications of ccLDA • Scientific research/literature analysis • Media analysis and bias detection • Model evaluation • An alternative cross-collection model

Research Analysis • 16,186 abstracts from computational linguistics and linguistics journals • Interdisciplinary research topic discovery • Topic evolution over time

Research Analysis • Topic of communication

Research Analysis • Topic of parsing/grammars across two time intervals

Media Analysis • 623 news articles from msnbc.com and foxnews.com from August 2008 • Discover editorial differences within topics

Model Evaluation • Greater likelihood of held-out data than alternative models

Model Evaluation • Document classification – new vs old • Compare to Naive Bayes (NB) and SVM (linear kernel)

Alternative Model • Similar to hierarchical Pachinko Allocation [Mimno et al., 2007] • Model as 2-level hierarchy

Alternative Model • Single, global set of “super-topics” • One set of “sub-topics” for each collection • Choose super-topic T from P(T|d) • Choose sub-topic t from P(t|T,c) • Choose hierarchy level l from P(l|t,T) • If l = 0, choose word from P(w|T); else if l = 1, choose word from P(w|t)
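The per-token choices listed on this slide can be sketched as a small sampling function. The nested-list layout of the probability tables, the function names, and the toy numbers are assumptions for illustration; in particular, indexing P(l|t,T) and P(w|t) by collection reflects that each collection has its own set of sub-topics. The example fixes P(t|T,c) to an identity matrix so that every super-topic has exactly one sub-topic, which is the constraint under which the model behaves like ccLDA.

```python
import random

def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_token(c, p_super_given_doc, p_sub_given_super, p_level1,
                   p_word_super, p_word_sub, vocab, rng):
    """Generate one token for a document in collection c.

    p_sub_given_super[c][T] is P(t | T, c); p_level1[c][T][t] is P(l = 1 | t, T).
    Returns (word, T, t, l).
    """
    T = sample_index(p_super_given_doc, rng)        # super-topic from P(T|d)
    t = sample_index(p_sub_given_super[c][T], rng)  # sub-topic from P(t|T,c)
    l = 1 if rng.random() < p_level1[c][T][t] else 0
    # l = 0: word from the super-topic; l = 1: word from the sub-topic.
    dist = p_word_sub[c][t] if l == 1 else p_word_super[T]
    return vocab[sample_index(dist, rng)], T, t, l

# Hypothetical toy tables: 2 super-topics, 2 collections, 2 sub-topics each.
alt_vocab = ["god", "temple", "vote", "party"]
p_word_super = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
p_word_sub = [
    [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.9, 0.1]],  # collection 0
    [[0.1, 0.9, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9]],  # collection 1
]
identity = [[1.0, 0.0], [0.0, 1.0]]   # P(t=j | T=j) = 1: the ccLDA-like case
p_sub_given_super = [identity, identity]
p_level1 = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]

alt_rng = random.Random(0)
alt_tokens = [generate_token(1, [0.7, 0.3], p_sub_given_super, p_level1,
                             p_word_super, p_word_sub, alt_vocab, alt_rng)
              for _ in range(200)]
```

Replacing `identity` with rows that spread probability over several sub-topics gives the more general behavior, where one super-topic can split into several collection-specific sub-topics.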

Alternative Model • This is just a generalization of ccLDA! • ccLDA = special case, constrained such that for each super-topic T=j there is exactly one sub-topic such that P(t=j|T=j)=1 and P(t=i|T=j)=0 for all i ≠ j

Alternative Model • Topic of religion in the blogs • Value shown on slide: 0.970483

Alternative Model • Topic of religion in the blogs • Value shown on slide: 0.984414

Alternative Model • Topic of religion in the blogs • Values shown on slide: 0.851749, 0.102534

ccLDA • Topic of religion from the blogs

Alternative Model • Topic of politics in the blogs • Values shown on slide: 0.29108, 0.699227

Alternative Model • Topic of politics in the blogs • Value shown on slide: 0.987059

Alternative Model • Topic of politics in the blogs • Value shown on slide: 0.970675

ccLDA • Topic of politics from the blogs

Questions?
