Semantic History Embedding in Online Generative Topic Models Pu Wang (presenter) Authors: Loulwah AlSumait (lalsumai@gmu.edu) Daniel Barbará (dbarbara@gmu.edu) Carlotta Domeniconi (carlotta@cs.gmu.edu) Department of Computer Science George Mason University SDM 2009
Outline • Introduction and related work • Online LDA (OLDA) • Parameter Generation • Sliding history window • Contribution weights • Experiments • Conclusion and future work
Introduction • When a topic is observed at a certain time, it is more likely to appear in the future • Previously discovered topics hold important information about the underlying structure of the data • Incorporating this information into future knowledge discovery can enhance the inferred topics
Related Work • Q. Sun, R. Li et al., ACL 2008 • LDA-based Fisher kernel to measure the semantic similarity between blocks of text • X. Wang et al., ICDM 2007 • Topical N-Gram model that automatically identifies feasible N-grams based on the context that surrounds them • X. Phan et al., IW3C2 2008 • A classifier trained on a small set of labeled documents combined with an LDA topic model estimated from Wikipedia
Online LDA (OLDA) [Slide figure: the OLDA graphical model unrolled over time (time between t and t+1 = ε). At each stream time t, topic assignments z and words w are drawn over the Nd word positions of the Mt documents and K topics; the inferred topic structures St feed three components: priors construction, topic evolution tracking (tracking topics), and emerging topic detection (emerging topic list).]
Inference Process • Parameter generation: priors for the current stream are derived from historic observations • Given these priors, inference reduces to a simple (standard LDA) inference problem • Solved by Gibbs sampling over the current stream, with history entering only through the priors
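Given the history-derived priors, OLDA's per-stream inference is ordinary collapsed Gibbs sampling for LDA. A minimal sketch (not the authors' Matlab code; the function name, symmetric α, and iteration count are illustrative assumptions) in which history enters only through the topic-specific prior matrix:

```python
import numpy as np

def gibbs_lda(docs, K, beta_prior, alpha=0.5, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA with topic-specific word priors.

    docs: list of lists of word ids; beta_prior: K x V array of Dirichlet
    priors (in OLDA, built from the sliding history window)."""
    rng = np.random.default_rng(seed)
    V = beta_prior.shape[1]
    nkw = np.zeros((K, V))          # topic-word counts
    ndk = np.zeros((len(docs), K))  # document-topic counts
    nk = np.zeros(K)                # words assigned to each topic
    z = []
    for d, doc in enumerate(docs):  # random initial assignments
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            nkw[k, w] += 1; ndk[d, k] += 1; nk[k] += 1
    beta_sum = beta_prior.sum(axis=1)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove current assignment
                nkw[k, w] -= 1; ndk[d, k] -= 1; nk[k] -= 1
                # full conditional: doc-topic term times topic-word term
                p = (ndk[d] + alpha) * (nkw[:, w] + beta_prior[:, w]) / (nk + beta_sum)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                nkw[k, w] += 1; ndk[d, k] += 1; nk[k] += 1
    # posterior estimate of the topic-word distributions
    return (nkw + beta_prior) / (nk + beta_sum)[:, None]
```

With a flat `beta_prior` this reduces to standard LDA; in OLDA the prior rows would instead come from the weighted history window.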
Topic Evolution Tracking • Topic alignment over time • Handles changes in the lexicon and topic drift [Slide figure: topics at time t aligned with topics at time t+1; P(topic) and P(word|topic) tracked for each aligned topic over time.]
Sliding History Window • Consider all topic-word distributions within a “sliding history window” of size δ • Alternatives for keeping track of history at time t: • Full memory, δ = t • Short memory, δ = 1 • Intermediate memory, δ = c [Slide figure: the evolution matrix of a topic — each column is the topic's distribution over the dictionary at one time slice within the window.]
Contribution Control • Evolution tuning parameters ω • Individual weights of the models • Decaying history: ω1 < ω2 < … < ωδ • Equal contributions: ω1 = ω2 = … = ωδ • Total weight of history (vs. weight of new observations) • Balanced weights (sum = 1) • Biased toward the past (sum > 1) • Biased toward the future (sum < 1)
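The weighting schemes above can be sketched as a small helper that builds ω for a window of size δ. The function name, the geometric form of the decay, and the `decay` parameter are illustrative assumptions, not the paper's:

```python
import numpy as np

def history_weights(delta, scheme="equal", total=1.0, decay=0.9):
    """Construct the evolution tuning vector omega for a window of size delta.

    scheme: 'equal' gives uniform contributions; 'decay' down-weights older
    models (omega_1 < ... < omega_delta, newest last). total sets sum(omega):
    1 balances history against new observations, >1 biases toward the past,
    <1 toward the future."""
    if scheme == "equal":
        w = np.ones(delta)
    else:  # geometric decay toward the past (oldest gets the smallest weight)
        w = decay ** np.arange(delta - 1, -1, -1)
    return total * w / w.sum()
```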
Parameter Generation • Priors of the topic distribution over words at time t+1: for each topic k, the prior is the ω-weighted combination of the columns of its evolution matrix, βk(t+1) = Bk(t) ω • Generate the topic distribution from the prior: φk(t+1) ~ Dirichlet(βk(t+1))
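A sketch of the prior construction, assuming (as on the slides) that each topic's prior at t+1 is the ω-weighted combination of its own past distributions; the function name and array layout are illustrative:

```python
import numpy as np

def generate_priors(history, omega):
    """Build Dirichlet priors for time t+1 from the sliding history window.

    history: list of delta arrays, each K x V — the topic-word distributions
    inferred at the last delta time slices (the evolution matrices, stacked
    by topic). omega: length-delta weight vector. Returns a K x V prior
    matrix: each topic's prior is the omega-weighted sum of its own past
    distributions."""
    B = np.stack(history)  # delta x K x V
    return np.einsum('d,dkv->kv', np.asarray(omega), B)
```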
Experimental Design • “Matlab Topic Modeling Toolbox” by Mark Steyvers and Tom Griffiths • Datasets: • NIPS • Proceedings from 1988–2000 • 1,740 papers; 13,649 unique words; 2,301,375 word tokens • 13 streams, from 90 to 250 documents per stream • Reuters-21578 • News from 26-FEB-1987 to 19-OCT-1987 • 10,337 documents; 12,112 unique words; 793,936 word tokens • 30 streams (29 of 340 documents, 1 of 517) • Baselines: • OLDAfixed: no memory • OLDA(ω(1)): short memory • Performance evaluation • Measure: perplexity • Test set: the documents of the next year or next stream
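The evaluation measure above, per-word perplexity of held-out documents, can be sketched as follows (illustrative helper; it assumes the test documents' topic proportions θ have already been estimated):

```python
import numpy as np

def perplexity(docs, phi, theta):
    """Per-word perplexity of held-out documents under an LDA model.

    docs: list of word-id lists; phi: K x V topic-word matrix;
    theta: D x K document-topic proportions for the test documents.
    Lower perplexity means better prediction of unseen text."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        p_w = theta[d] @ phi  # mixture probability of each word type
        log_lik += np.sum(np.log(p_w[doc]))
        n_words += len(doc)
    return np.exp(-log_lik / n_words)
```

As a sanity check, a uniform model over V word types yields perplexity V.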
Reuters: OLDA with Different Window Sizes and Weights • Increasing the window size enhanced prediction • Incremental history information (δ > 1, sum > 1) did not improve topic estimation at all [Plot: perplexity over streams for short memory, equal contributions, increasing window sizes, and incremental history information.]
NIPS: OLDA with Different Window Sizes • Increasing the window size enhanced prediction w.r.t. short memory • Window sizes greater than 3 enhanced prediction • Effect of the total weight [Plot: perplexity per year for no memory, short memory, and larger window sizes.]
NIPS: OLDA with Different Total Weights • Models with a lower total weight resulted in better prediction [Plot: perplexity for no memory, sum of weights = 1, and decreasing sums of weights.]
NIPS & Reuters: OLDA with Different Total Weights • Variable sum(ω) • δ = 2 [Plots: perplexity on NIPS and Reuters as the total sum of weights is decreased or increased.]
Conclusions • Studied the effect of embedding semantic information in LDA topic modeling of text streams • Parameter generation based on topical structures inferred in the past • Semantic embedding enhances OLDA prediction • Studied the effect of: • Total influence of history • History window size • Equal vs. decaying contributions • Future work • Use of prior knowledge • Effect of embedded historic semantics on detecting emerging and/or periodic topics