
Semantic History Embedding in Online Generative Topic Models


Presentation Transcript


  1. Semantic History Embedding in Online Generative Topic Models
  Pu Wang (presenter)
  Authors: Loulwah AlSumait (lalsumai@gmu.edu), Daniel Barbará (dbarbara@gmu.edu), Carlotta Domeniconi (carlotta@cs.gmu.edu)
  Department of Computer Science, George Mason University
  SDM 2009

  2. Outline
  • Introduction and related work
  • Online LDA (OLDA)
  • Parameter generation
  • Sliding history window
  • Contribution weights
  • Experiments
  • Conclusion and future work

  3. Introduction
  • When a topic is observed at a certain time, it is more likely to appear in the future
  • Previously discovered topics hold important information about the underlying structure of the data
  • Incorporating such information in future knowledge discovery can enhance the inferred topics

  4. Related Work
  • Q. Sun, R. Li et al., ACL 2008: an LDA-based Fisher kernel to measure the semantic similarity between blocks of text modeled with LDA
  • X. Wang et al., ICDM 2007: a Topical N-Gram model that automatically identifies feasible N-grams based on the context that surrounds them
  • X. Phan et al., IW3C2 2008: a classifier trained on a small set of labeled documents together with an LDA topic model estimated from Wikipedia

  5. Online LDA (OLDA)
  [Figure: plate diagram of OLDA over consecutive streams t and t+1 (time between streams = ε). In each stream, topic assignments z and words w are drawn for the N_d words of each of the M documents, over K topics. The distributions inferred at time t feed three components: Topic Evolution Tracking (tracking topics over time), Priors Construction for stream t+1, and Emerging Topic Detection, which maintains an emerging topic list.]
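The diagram itself did not survive extraction. As a hedged reconstruction from the surviving labels (and from the OLDA model this talk builds on), the per-stream generative process is roughly:

```latex
\text{for each stream } t:\quad
\phi_k^{(t)} \sim \mathrm{Dirichlet}\big(\beta_k^{(t)}\big),\quad
\theta_d \sim \mathrm{Dirichlet}(\alpha),\quad
z_{di} \sim \mathrm{Mult}(\theta_d),\quad
w_{di} \sim \mathrm{Mult}\big(\phi_{z_{di}}^{(t)}\big)
```

where the prior β^(t) is constructed from the topic structures inferred on earlier streams (see slides 8-10).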

  6. Inference Process
  • Parameter Generation
  • Simple inference problem
  • Gibbs Sampling
  [The slide's equations were images and are lost; their terms were annotated as coming from the current stream vs. historic observations. See the reconstruction below.]
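A reconstruction of the annotated sampler (the standard collapsed Gibbs conditional for LDA, which OLDA runs with the history-informed prior β; this is inferred from the surviving labels, not copied from the slide):

```latex
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{-i}_{k,w_i} + \beta_{k,w_i}}{\sum_{v}\big(n^{-i}_{k,v} + \beta_{k,v}\big)}
\;\cdot\;
\frac{n^{-i}_{d,k} + \alpha}{\sum_{j}\big(n^{-i}_{d,j} + \alpha\big)}
```

Here the counts n come from the current stream, while β carries the historic observations.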

  7. Topic Evolution Tracking
  • Topic alignment over time
  • Handles changes in lexicon, topic drift
  [Figure: aligned topics over time from t to t+1, plotting P(topic) and P(word|topic).]
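Because topic k at time t+1 is seeded from topic k's own history, alignment is by index. A minimal Python sketch of one way to quantify the tracked drift; the cosine-similarity measure is an illustrative choice of ours, not necessarily the paper's:

```python
import numpy as np

def topic_drift(phi_t, phi_t1):
    """Per-topic cosine similarity between the K x V topic-word matrices
    of consecutive streams; low similarity for the same topic index
    signals topic drift."""
    a = phi_t / np.linalg.norm(phi_t, axis=1, keepdims=True)
    b = phi_t1 / np.linalg.norm(phi_t1, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)  # one score per aligned topic
```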

  8. Sliding History Window
  • Consider all topic-word distributions within a "sliding history window" (δ)
  • Alternatives for keeping track of history at time t:
  • Full memory, δ = t
  • Short memory, δ = 1
  • Intermediate memory, δ = c
  [Figure: the evolution matrix of a topic, i.e. its distribution over the dictionary tracked over time; the window keeps its most recent δ columns. A sketch follows below.]
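A minimal sketch of maintaining the window and reading off a topic's evolution matrix (the class and its interface are hypothetical; phi is assumed to be the K x V topic-word matrix inferred on one stream):

```python
from collections import deque
import numpy as np

class HistoryWindow:
    """Sliding history window over per-stream topic-word distributions."""

    def __init__(self, delta):
        self.window = deque(maxlen=delta)  # entries older than delta fall out

    def push(self, phi):
        self.window.append(phi)  # record the newly inferred K x V matrix

    def evolution_matrix(self, k):
        # Columns are topic k's distribution over the dictionary at each
        # retained time step: a V x delta matrix.
        return np.column_stack([phi[k] for phi in self.window])
```

Full memory corresponds to an unbounded window (δ = t), short memory to δ = 1.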

  9. Contribution Control
  • Evolution tuning parameters ω
  • Individual weights of the models:
  • Decaying history: ω1 < ω2 < … < ωδ
  • Equal contributions: ω1 = ω2 = … = ωδ
  • Total weight of history (vs. weight of new observations):
  • Balanced weights (sum = 1)
  • Biased toward the past (sum > 1)
  • Biased toward the future (sum < 1)
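A small sketch of both controls, the individual weight profile and the total weight; the linear decay profile is an assumption for illustration:

```python
import numpy as np

def history_weights(delta, total=1.0, decaying=True):
    """Weight vector omega for the last delta models; `total` sets the
    overall influence of history (the sum of the weights)."""
    w = np.arange(1.0, delta + 1) if decaying else np.ones(delta)
    return total * w / w.sum()  # rescale so the weights sum to `total`

print(history_weights(3, total=1.0))                  # decaying: [0.167 0.333 0.5]
print(history_weights(3, total=0.5, decaying=False))  # equal:    [0.167 0.167 0.167]
```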

  10. Parameter Generation
  • Priors of the topic distribution over words at time t+1 are built from the history window
  • Generate the topic distributions from these priors
  [Equations lost in extraction; a reconstruction follows below.]
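A hedged reconstruction of the lost formulas, following the OLDA construction: the prior for topic k at time t+1 mixes the columns of its evolution matrix B_k (its last δ word distributions) with the weight vector ω, and the new topic distribution is then drawn from a Dirichlet with that prior:

```latex
\beta_k^{(t+1)} = B_k^{(t)}\,\omega
\qquad\qquad
\phi_k^{(t+1)} \sim \mathrm{Dirichlet}\big(\beta_k^{(t+1)}\big)
```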

  11. Experimental Design
  • "Matlab Topic Modeling Toolbox", by Mark Steyvers and Tom Griffiths
  • Datasets:
  • NIPS: proceedings from 1988-2000; 1,740 papers, 13,649 unique words, 2,301,375 word tokens; 13 streams of 90 to 250 documents each
  • Reuters-21578: news from 26-FEB-1987 to 19-OCT-1987; 10,337 documents, 12,112 unique words, 793,936 word tokens; 30 streams (29 of 340 documents, 1 of 517)
  • Baselines:
  • OLDAfixed: no memory
  • OLDA (ω(1)): short memory
  • Performance evaluation:
  • Measure: perplexity (defined below)
  • Test set: documents of the next year or stream
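Perplexity is the standard held-out measure for topic models (the definition below is the standard one, not quoted from the slide); lower values indicate better prediction of unseen documents:

```latex
\mathrm{Perplexity}(D_{\mathrm{test}})
= \exp\!\left(-\,\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)
```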

  12. Reuters: OLDA with fixed β vs. OLDA with semantic β
  [Figure: perplexity comparison; the fixed-β model is labeled "no memory".]

  13. Reuters: OLDA with different window sizes and weights
  • Increasing the window size enhanced prediction
  • Incremental history information (δ > 1, sum > 1) did not improve topic estimation at all
  [Figure: perplexity curves labeled "incremental history information", "short memory", "equal contribution", and "increase window size".]

  14. NIPS: OLDA with different window sizes
  • Increasing the window size enhanced prediction w.r.t. short memory
  • Window sizes greater than 3 enhanced prediction
  • Effect of total weight
  [Figure: perplexity curves against "short memory" and "no memory" baselines.]

  15. NIPS: OLDA with different total weights
  • Models with a lower total weight resulted in better prediction
  [Figure: perplexity curves against "no memory" and "sum of weights = 1" baselines, improving as the sum of weights decreases.]

  16. NIPS & Reuters: OLDA with different total weights
  • Variable sum(ω)
  • δ = 2
  [Figure: perplexity as the total sum of weights is decreased or increased.]

  17. NIPS: OLDA with equal vs. decaying history contributions
  [Figure only: perplexity comparison of the two weighting schemes.]

  18. Conclusions
  • Studied the effect of embedding semantic information in LDA topic modeling of text streams
  • Parameter generation based on topical structures inferred in the past
  • Semantic embedding enhances OLDA prediction
  • Examined the effect of:
  • Total influence of history
  • History window size
  • Equal vs. decaying contributions
  • Future work:
  • Use of prior knowledge
  • Effect of embedded historic semantics on detecting emerging and/or periodic topics
