Modeling Documents

Modeling Documents Amruta Joshi Department of Computer Science Stanford University Research in Algorithms for the InterNet

Outline
• Topic Models
  • Topic Extraction
  • Author Information
  • Modeling Topics
  • Modeling Authors
  • Author-Topic Model
  • Inference
• Integrating topics and syntax
  • Probabilistic Models
  • Composite Model
  • Inference

Motivation
• Identifying content of a document
• Identifying its latent structure
• More specifically
  • Given a collection of documents, we want to create a model to collect information about
    • Authors
    • Topics
    • Syntactic constructs

Topics & Authors
• Why model topics?
  • Observe topic trends
  • How documents relate to one another
  • Tagging abstracts
• Why model authors’ interests?
  • Identifying what an author writes about
  • Identifying authors with similar interests
  • Authorship attribution
  • Creating reviewer lists
  • Finding unusual work by an author

Topic Extraction: Overview
[Figure: example sentence "In floods, the banks of a river overflow", labeled "rivers"]
• Supervised Learning Techniques
  • Learn from a labeled document collection
  • But: documents are often unlabeled, and fields change rapidly (Yang 1998)

Topic Extraction: Overview
• Dimensionality Reduction
  • Represent documents in a vector space of terms
  • Map to a low-dimensional space
  • Non-linear dimensionality reduction
    • WEBSOM (Lagus et al. 1999)
  • Linear projection
    • LSI (Berry, Dumais, O’Brien 1995)
  • Regions represent topics

Topic Extraction: Overview
• Cluster documents on semantic content
  • Typically, each cluster has just 1 topic
• Aspect Model
  • Topic modeled as distribution over words
  • Documents generated from multiple topics

Author Information: Overview
[Figure: sample text, "As doth the lion in the Capitol, A man no mightier than thyself or me …"]
• Analyzing text using
  • Stylometry
    • Statistical analysis using literary style, frequency of word usage, etc.
  • Semantics
    • Content of document

Author Information: Overview
[Figure: graph over documents D1, D2, D3, D4]
• Graph-based models
  • Build interactive ReferralWeb using citations
    • Kautz, Selman, Shah 1997
  • Build co-author graphs
    • White & Smith
  • PageRank for analysis

The Big Idea
• Topic Model
  • Model topics as distribution over words
• Author Model
  • Model author as distribution over words
• Author-Topic Model
  • Probabilistic model for both
  • Model topics as distribution over words
  • Model authors as distribution over topics

Bayesian Networks
[Figure: network over Pneumonia, Tuberculosis, Lung Infiltrates, XRay, Sputum Smear]
• Nodes = random variables
• Edges = direct probabilistic influence
• Topology captures independence: XRay is conditionally independent of Pneumonia given Infiltrates
Slide Credit: Lise Getoor, UMD College Park

Bayesian Networks
[Figure: the same network, with a conditional probability table P(I | P, T) attached to Lung Infiltrates; four rows, one per assignment to P and T: 0.7 0.3 | 0.6 0.4 | 0.2 0.8 | 0.01 0.99]
• Associated with each node Xi there is a conditional probability distribution P(Xi | Pai): a distribution over Xi for each assignment to its parents
• If variables are discrete, P is usually multinomial
• P can be linear Gaussian, mixture of Gaussians, …
Slide Credit: Lise Getoor, UMD College Park
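
A minimal sketch of how the conditional probability table on this slide can be used, assuming the four rows map to the (P, T) assignments in the order (true, true), (true, false), (false, true), (false, false) and that the first number in each row is P(I = true). The row labels did not survive in the transcript, so that mapping is an assumption. The priors `P_PNEUMONIA` and `P_TUBERCULOSIS` are hypothetical values chosen only for illustration.

```python
from itertools import product

# P(I = true | P, T). Values are from the slide; the row-to-assignment
# mapping is an ASSUMPTION because the table's row labels were lost.
P_I_GIVEN_PT = {
    (True, True): 0.7,
    (True, False): 0.6,
    (False, True): 0.2,
    (False, False): 0.01,
}

# HYPOTHETICAL priors for the two root nodes (not from the slides).
P_PNEUMONIA = 0.1
P_TUBERCULOSIS = 0.05


def bernoulli(p_true, value):
    """Probability of a binary variable taking `value` when P(true) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(p, t, i):
    """P(P=p, T=t, I=i): product of each node's distribution given its parents."""
    return (bernoulli(P_PNEUMONIA, p)
            * bernoulli(P_TUBERCULOSIS, t)
            * bernoulli(P_I_GIVEN_PT[(p, t)], i))


def marginal_infiltrates():
    """P(I = true), obtained by summing the joint over both parents."""
    return sum(joint(p, t, True) for p, t in product([True, False], repeat=2))
```

The joint factorizes into one term per node, which is the point of the network topology: each term only needs the node's parents.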

BN Learning
[Figure: Data → Inducer → network over P, T, I, X, S]
• BN models can be learned from empirical data
  • Parameter estimation via numerical optimization
  • Structure learning via combinatorial search
Slide Credit: Lise Getoor, UMD College Park

Generative Model
[Figure labels: Probabilistic Generative Process; Statistical Inference; Mixture weights; Mixture components]
• Bayesian approach: use priors
  • Mixture weights ~ Dirichlet(α)
  • Mixture components ~ Dirichlet(β)
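
The generative process with Dirichlet priors can be sketched as follows for the document case, where the mixture weights are a document's distribution over topics and the mixture components are per-topic distributions over words. This is an illustrative sketch, not code from the talk: the function names, the symmetric priors, and the fixed document length are all assumptions.

```python
import random


def sample_dirichlet(concentration, k, rng):
    """Draw from a symmetric k-dimensional Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Generate documents from a topic model with Dirichlet priors."""
    rng = random.Random(seed)
    # Mixture components: one distribution over words per topic, ~ Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Mixture weights: this document's distribution over topics, ~ Dirichlet(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # pick a topic
            words.append(rng.choices(vocab, weights=phi[z])[0])  # pick a word
        docs.append(words)
    return phi, docs
```

Statistical inference runs in the opposite direction: given only `docs`, recover plausible values for `phi` and each document's `theta`.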

Bayesian Network for modeling document generation
[Figure: nodes Doc, Z, W; topics T1, T2, …, TT; words w1, w2, …, wv]

Topic Model: Plate Notation
[Figure: plate diagram]
• θ: document-specific distribution over topics
• z: topic
• φ: topic distribution over words
• w: word
• Plates: Nd, T, D

Topic Model: Geometric Representation

Modeling Authors with words
[Figure: plate diagram]
• ad: authors of document d
• x: author, drawn from a uniform distribution over the authors of the document
• φ: each author's distribution over words
• w: word
• Plates: Nd, A, D

Author-Topic Model
[Figure: plate diagram]
• ad: authors of document d
• x: author, drawn from a uniform distribution over the authors of the document
• θ: each author's distribution over topics
• z: topic
• φ: topic distribution over words
• w: word
• Plates: Nd, T, A, D
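
The author-topic generative process on this slide can be sketched as below: for each word, pick one of the document's authors uniformly, pick a topic from that author's distribution over topics, then pick a word from that topic's distribution over words. The function name and the data layout (a dict of per-author topic weights, a list of per-topic word weights) are assumptions made for illustration.

```python
import random


def generate_author_topic_doc(doc_authors, author_topics, topic_words, vocab,
                              doc_len, seed=0):
    """
    doc_authors:   list of author names for this document.
    author_topics: dict mapping author -> list of topic probabilities.
    topic_words:   list (one entry per topic) of word probabilities over vocab.
    Returns a list of (author, topic, word) triples, one per token.
    """
    rng = random.Random(seed)
    n_topics = len(topic_words)
    tokens = []
    for _ in range(doc_len):
        x = rng.choice(doc_authors)                                   # author, uniform
        z = rng.choices(range(n_topics), weights=author_topics[x])[0]  # topic from author
        w = rng.choices(vocab, weights=topic_words[z])[0]             # word from topic
        tokens.append((x, z, w))
    return tokens
```

Setting each document to have exactly one author who appears nowhere else reduces this to the plain topic model, since the author's topic distribution then plays the role of the document's.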

Inference
• Expectation Maximization
  • But poor results (local maxima)
• Gibbs Sampling
  • Parameters: θ, φ
  • Start with an initial random assignment
  • Update each parameter using the other parameters
  • Converges after n iterations
  • Burn-in time

Inference and Learning for Documents
[Equation: Gibbs sampling update for a topic assignment, annotated as follows]
• Sampled quantity: probability that the ith word token is assigned to topic j, keeping the other topic assignments unchanged
• Count indexed by mj: number of times word m is assigned to topic j
• Count indexed by dj: number of times topic j has occurred in document d
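
For reference, the annotations on this slide match the collapsed Gibbs sampling update for the topic model as given in Griffiths & Steyvers (2004). Treating it as the equation shown on the slide is an assumption, as are the count-matrix names C^{WT} and C^{DT}. V is the vocabulary size, T the number of topics, and both counts exclude the current token i.

```latex
P(z_i = j \mid \mathbf{z}_{-i}, w_i = m, d_i = d)
  \;\propto\;
  \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
  \cdot
  \frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}
```

The first factor is the probability of word m under topic j; the second is the probability of topic j in document d.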

Matrix Factorization

Topic Model: Inference
[Figure: documents shown as word tokens over the vocabulary River, Stream, Bank, Money, Loan]
• Can we recover the original topics and topic mixtures from this data?
Slide Credit: Padhraic Smyth, UC Irvine

Example of Gibbs Sampling
[Figure: word tokens over River, Stream, Bank, Money, Loan; marker color indicates topic 1 or topic 2]
• Assign word tokens randomly to topics
Slide Credit: Padhraic Smyth, UC Irvine

After 1 iteration
[Figure: topic assignments over River, Stream, Bank, Money, Loan]
• Apply sampling equation to each word token
Slide Credit: Padhraic Smyth, UC Irvine

After 4 iterations
[Figure: topic assignments over River, Stream, Bank, Money, Loan]
Slide Credit: Padhraic Smyth, UC Irvine

After 32 iterations
[Figure: topic assignments over River, Stream, Bank, Money, Loan]
Slide Credit: Padhraic Smyth, UC Irvine
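
The procedure walked through in this example (random initial assignment, then repeatedly resampling each token's topic) can be sketched as a small collapsed Gibbs sampler. This is an illustrative implementation under assumptions, not the code behind the slides: the function name, the integer word ids, and the symmetric hyperparameters `alpha` and `beta` are choices made here.

```python
import random


def gibbs_topic_model(docs, n_topics, vocab_size, alpha, beta, n_iters, seed=0):
    """docs: list of documents, each a list of integer word ids.
    Returns (assignments, word_topic_counts, doc_topic_counts)."""
    rng = random.Random(seed)
    word_topic = [[0] * n_topics for _ in range(vocab_size)]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_total = [0] * n_topics
    z = []
    # Assign every word token to a random topic.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            word_topic[w][t] += 1
            doc_topic[d][t] += 1
            topic_total[t] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token so the counts exclude it.
                t = z[d][i]
                word_topic[w][t] -= 1
                doc_topic[d][t] -= 1
                topic_total[t] -= 1
                # Unnormalized probability of each topic for this token.
                # The document-side denominator is the same for every topic,
                # so it is dropped.
                weights = [
                    (word_topic[w][j] + beta) / (topic_total[j] + vocab_size * beta)
                    * (doc_topic[d][j] + alpha)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                word_topic[w][t] += 1
                doc_topic[d][t] += 1
                topic_total[t] += 1
    return z, word_topic, doc_topic
```

With the five-word vocabulary from the example encoded as ids 0 to 4, a call such as `gibbs_topic_model(docs, 2, 5, 0.5, 0.1, 32)` mirrors the 32-iteration run. Samples from early iterations are usually discarded as burn-in.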

Results
• Tested on scientific papers (V = vocabulary size, D = documents, K = authors)
• NIPS Dataset
  • V=13,649 D=1,740 K=2,037
  • #Topics = 100
  • #tokens = 2,301,375
• CiteSeer Dataset
  • V=30,799 D=162,489 K=85,465
  • #Topics = 300
  • #tokens = 11,685,514

Evaluating Predictive Power
• Perplexity
  • Indicates ability to predict words in new, unseen documents
  • Lower is better
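
As a reminder of what the metric computes, perplexity is the exponential of the negative average log-probability the model assigns to held-out word tokens. A minimal sketch, assuming the per-token probabilities have already been obtained from a trained model (the function name is illustrative):

```python
import math


def perplexity(token_probs):
    """Perplexity of held-out tokens given the model's probability for each one."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))
```

A model that assigns probability 1/4 to every held-out token has perplexity 4, the same as guessing uniformly among four words, which is why lower values indicate better prediction.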

Results: Perplexity

Recap
• First
  • Author Model
  • Topic Model
• Then
  • Author-Topic Model
• Next…
  • Integrating Topics & Syntax

Integrating topics & syntax
• Probabilistic Models
  • Short-range dependencies
    • Syntactic constraints
    • Represented as distinct syntactic classes
    • HMM, probabilistic CFGs
  • Long-range dependencies
    • Semantic constraints
    • Represented as probabilistic distribution
    • Bayes Model, Topic Model
• New idea: use both

How to integrate these?
• Mixture of Models
  • Each word exhibits either short- or long-range dependencies
• Product of Models
  • Each word exhibits both short- and long-range dependencies
• Composite Model
  • Asymmetric
  • All words exhibit short-range dependencies
  • A subset of words exhibits long-range dependencies

The Composite Model 1
• Capturing asymmetry
  • Replace probability distribution over words with semantic model
  • Syntactic model chooses when to emit a content word
  • Semantic model chooses which word to emit
• Methods
  • Syntactic component is an HMM
  • Semantic component is a topic model

Generating phrases
[Figure: composite model with three topics (network, neural, output, networks, …; image, images, object, objects, …; kernel, support, svm, vector, …) and two syntactic classes (used, trained, obtained, described, …; in, with, for, on, …), with probability labels 0.2, 0.5, 0.4, 0.1, 0.9, 0.7, 0.9]
• Generated phrases: "network used for images", "image obtained with kernel", "output described with objects", "neural network trained with svm images"

The Composite Model 2 (Graphical)
[Figure: graphical model with the document's distribution over topics, topics z1 to z4, classes c1 to c4, and words w1 to w4]

The Composite Model 3
• θ(d): document’s distribution over topics
• Transitions between classes ci-1 and ci follow distribution π(ci-1)
• A document is generated as:
  • For each word wi in document d
    • Draw zi from θ(d)
    • Draw ci from π(ci-1)
    • If ci = 1, then draw wi from φ(zi),
    • else draw wi from φ(ci)
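
The generation steps on this slide can be sketched as follows. This is an illustration under assumptions: the function name and data layout are invented here, class 1 is the semantic class as on the slide, and the chain is started from class 0, which the slide does not specify.

```python
import random


def generate_composite(theta, topic_words, class_words, transitions, length, seed=0):
    """
    theta:       the document's distribution over topics.
    topic_words: list of (words, weights) pairs, one per topic.
    class_words: dict mapping each syntactic class (every class except 1)
                 to a (words, weights) pair.
    transitions: transitions[c_prev] is the distribution over the next class.
    Returns a list of (class, word) pairs.
    """
    rng = random.Random(seed)
    n_classes = len(transitions)
    c_prev = 0  # ASSUMPTION: the chain starts from class 0
    out = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]               # topic
        c = rng.choices(range(n_classes), weights=transitions[c_prev])[0]  # class
        if c == 1:
            words, weights = topic_words[z]   # semantic class: emit from the topic
        else:
            words, weights = class_words[c]   # syntactic class: emit from the class
        out.append((c, rng.choices(words, weights=weights)[0]))
        c_prev = c
    return out
```

The class chain decides when a content word appears, and the topic decides which content word it is, which is the asymmetry the composite model is meant to capture.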

Results
• Tested on
  • Brown corpus (tagged with word types)
  • Concatenated Brown & TASA corpus
• HMM & Topic Model
  • 20 classes
    • Start/end markers class + 19 classes
  • T = 200

Results
• Identifying syntactic classes & semantic topics
  • Clean separation observed
• Identifying function words & content words
  • “control”: plain verb (syntax) or semantic word
• Part-of-Speech Tagging
  • Identifying syntactic class
• Document Classification
  • Brown corpus: 500 docs => 15 groups
  • Results similar to plain Topic Model

Extensions to Topic Model
• Integrating link information (Cohn, Hofmann 2001)
• Learning Topic Hierarchies
• Integrating Syntax & Topics
• Integrating authorship information with content (author-topic model)
• Grade-of-membership Models
• Random sentence generation

Conclusion
• Identifying a document's latent structure
• Document content is modeled for
  • Semantic associations: topic model
  • Authorship: author-topic model
  • Syntactic constructs: HMM

Acknowledgements
• Prof. Rajeev Motwani
  • Advice and guidance regarding topic selection
• T. K. Satish Kumar
  • Help on Probabilistic Models

Thank you!

References
• Primary
  • Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic Author-Topic Models for Information Discovery. The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Seattle, Washington.
  • Steyvers, M., & Griffiths, T. Probabilistic Topic Models. (http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf)
  • Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.
  • Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (in press). Integrating Topics and Syntax. In: Advances in Neural Information Processing Systems, 17.
  • Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.
