
Author-Topic Models for Large Text Corpora

This paper discusses the use of author-topic models to analyze large sets of documents, including probabilistic approaches and their applications in various domains. Results from CiteSeer, NIPS, and Enron data are presented. Future directions and a demo of an author-topic query tool are also provided.



Presentation Transcript


  1. Author-Topic Models for Large Text Corpora • Padhraic Smyth, Department of Computer Science, University of California, Irvine • In collaboration with: Mark Steyvers (UCI), Michal Rosen-Zvi (UCI), Tom Griffiths (Stanford)

  2. Outline • Problem motivation: • Modeling large sets of documents • Probabilistic approaches • topic models -> author-topic models • Results • Author-topic results from CiteSeer, NIPS, Enron data • Applications of the model • (Demo of author-topic query tool) • Future directions

  3. Data Sets of Interest • Data = set of documents • Large collection of documents: 10k, 100k, etc • Know authors of the documents • Know years/dates of the documents • …… • (will typically assume bag of words representation)

  4. Examples of Data Sets • CiteSeer: • 160k abstracts, 80k authors, 1986-2002 • NIPS papers • 2k papers, 1k authors, 1987-1999 • Reuters • 20k newspaper articles, 114 authors

  5. Pennsylvania Gazette 1728-1800 80,000 articles 25 million words www.accessible.com

  6. Enron email data 500,000 emails 5000 authors 1999-2002

  7. Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • Who is likely to write about topic Y? • Who wrote this specific document? • and so on…..

  8. A topic is represented as a (multinomial) distribution over words

  9. Cluster Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval

  10. Cluster Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval P(probabilistic | topic) = 0.25 P(learning | topic) = 0.50 P(Bayesian | topic) = 0.25 P(other words | topic) = 0.00 P(information | topic) = 0.5 P(retrieval | topic) = 0.5 P(other words | topic) = 0.0
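The probabilities on this slide are plain relative frequencies of the words in each example document. A minimal sketch, assuming each document forms its own cluster and that the word lists below are read correctly off the slide (the function name `word_distribution` is only illustrative):

```python
from collections import Counter

# Word lists read off the two example documents on the slide.
doc1 = ["probabilistic", "learning", "learning", "bayesian"]
doc2 = ["information", "retrieval", "information", "retrieval"]

def word_distribution(words):
    """Relative-frequency estimate of P(word | cluster) from raw counts."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

cluster1 = word_distribution(doc1)
cluster2 = word_distribution(doc2)
print(cluster1)
print(cluster2)
```

Words that never occur in a cluster are simply absent from the dictionary, which corresponds to P(other words | topic) = 0 on the slide.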

  11. Graphical Model z Cluster Variable w Word n words

  12. Graphical Model z Cluster Variable w Word n words D documents

  13. Graphical Model Cluster Weights α z Cluster Variable φ Cluster-Word distributions w Word n words D documents

  14. Cluster Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval DOCUMENT 3 Probabilistic Learning Information Retrieval

  15. Topic Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval

  16. Topic Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval DOCUMENT 3 Probabilistic Learning Information Retrieval

  17. History of topic models • Latent class models in statistics (late 1960s) • Hofmann (1999) • Original application to documents • Blei, Ng, and Jordan (2001, 2003) • Variational methods • Griffiths and Steyvers (2003, 2004) • Gibbs sampling approach (very efficient)

  18. Word/Document counts for 16 Artificial Documents • Can we recover the original topics and topic mixtures from this data?

  19. Example of Gibbs Sampling • Assign word tokens randomly to topics (each ● is one word token, colored by its topic: topic 1 or topic 2)

  20. After 1 iteration • Apply sampling equation to each word token

  21. After 4 iterations

  22. After 32 iterations

  23. Topic Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval DOCUMENT 3 Probabilistic Learning Information Retrieval

  24. Author-Topic Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval

  25. Author-Topic Models DOCUMENT 1 DOCUMENT 2 Probabilistic Information Learning Retrieval Learning Information Bayesian Retrieval DOCUMENT 3 Probabilistic Learning Information Retrieval

  26. Approach • The author-topic model • a probabilistic model linking authors and topics • authors -> topics -> words • learned from data • completely unsupervised, no labels • generative model • Different questions or queries can be answered by appropriate probability calculus • E.g., p(author | words in document) • E.g., p(topic | author)
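As an illustration of such a query, p(author | words in document) can be sketched with Bayes' rule once author-topic and topic-word distributions are available. The sketch below is hedged: it assumes a single author wrote every word, a uniform prior over authors, and small made-up matrices `theta` (authors by topics) and `phi` (topics by words); the function name `author_posterior` is hypothetical.

```python
import numpy as np

def author_posterior(word_ids, theta, phi, prior=None):
    """p(author | words), assuming one author generated every word."""
    # p(w | a) = sum_t theta[a, t] * phi[t, w]
    word_given_author = theta @ phi                      # shape (A, V)
    log_lik = np.log(word_given_author[:, word_ids]).sum(axis=1)
    if prior is None:
        # Assumed uniform prior over authors.
        prior = np.full(theta.shape[0], 1.0 / theta.shape[0])
    log_post = log_lik + np.log(prior)
    log_post -= log_post.max()                           # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Made-up parameters: 2 authors, 2 topics, 4 vocabulary words.
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
phi = np.array([[0.4, 0.4, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])

post = author_posterior([0, 1, 0], theta, phi)
print(post)
```

The other query on the slide, p(topic | author), is simply row `theta[a]` in this representation.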

  27. Graphical Model x Author z Topic

  28. Graphical Model x Author z Topic w Word

  29. Graphical Model x Author z Topic w Word n

  30. Graphical Model a x Author z Topic w Word n D

  31. Graphical Model a x Author Author-Topic distributions θ z Topic φ Topic-Word distributions w Word n D

  32. Generative Process • Let’s assume authors A1 and A2 collaborate and produce a paper • A1 has multinomial topic distribution θ1 • A2 has multinomial topic distribution θ2 • For each word in the paper: • Sample an author x (uniformly) from A1, A2 • Sample a topic z from θx • Sample a word w from the multinomial word distribution φz of topic z
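The generative steps above can be sketched in a few lines. This is only an illustration: the vocabulary, the topic-word matrix `phi`, and the author-topic matrix `theta` are made-up values, and `generate_paper` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["probabilistic", "learning", "bayesian", "information", "retrieval"]

# phi[t] is the word distribution of topic t (made-up values, rows sum to 1).
phi = np.array([
    [0.25, 0.50, 0.25, 0.00, 0.00],   # topic 0
    [0.00, 0.00, 0.00, 0.50, 0.50],   # topic 1
])

# theta[a] is the topic distribution of author a (made-up values).
theta = np.array([
    [0.9, 0.1],   # author A1 (index 0)
    [0.2, 0.8],   # author A2 (index 1)
])

def generate_paper(authors, n_words):
    """Generate (author, topic, word) triples for one co-authored paper."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                 # sample an author uniformly
        z = rng.choice(len(phi), p=theta[x])    # sample a topic from theta_x
        w = rng.choice(len(vocab), p=phi[z])    # sample a word from phi_z
        tokens.append((int(x), int(z), vocab[w]))
    return tokens

paper = generate_paper(authors=[0, 1], n_words=10)
for token in paper:
    print(token)
```

In a real corpus only the words and the author list are observed; the per-word author x and topic z are the hidden variables that learning has to infer.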

  33. Graphical Model a x Author Author-Topic distributions θ z Topic φ Topic-Word distributions w Word n D

  34. Learning • Observed • W = observed words, A = sets of known authors • Unknown • x, z: hidden variables • θ, φ: unknown parameters • Interested in: • p(x, z | W, A) • p(θ, φ | W, A) • But exact inference is not tractable

  35. Step 1: Gibbs sampling of x and z a x Author θ Marginalize over unknown parameters z Topic φ w Word n D

  36. Step 2: MAP estimates of θ and φ a x Author Condition on particular samples of x and z θ z Topic φ w Word n D

  37. Step 2: MAP estimates of θ and φ a x Author θ Point estimates of unknown parameters z Topic φ w Word n D

  38. More Details on Learning • Gibbs sampling for x and z • Typically run 2000 Gibbs iterations • 1 iteration = full pass through all documents • Estimating θ and φ • x and z sample -> point estimates • non-informative Dirichlet priors for θ and φ • Computational Efficiency • Learning is linear in the number of word tokens • Predictions on new documents • can average over θ and φ (from different samples, different runs)
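The step from one sample of x and z to point estimates can be sketched as smoothed count ratios. This is a hedged illustration, not the authors' code: it assumes symmetric Dirichlet priors with hypothetical hyperparameter values `alpha` and `beta`, and the function name `point_estimates` is made up.

```python
import numpy as np

def point_estimates(tokens, n_authors, n_topics, n_words, alpha=0.5, beta=0.01):
    """Estimate theta and phi from one sample of assignments.

    tokens: iterable of (word_id, author_id, topic_id), one per word token.
    alpha, beta: assumed symmetric Dirichlet hyperparameters.
    """
    author_topic = np.zeros((n_authors, n_topics))
    topic_word = np.zeros((n_topics, n_words))
    for w, x, z in tokens:
        author_topic[x, z] += 1
        topic_word[z, w] += 1
    # Smoothed, normalized counts: each row is a probability distribution.
    theta = (author_topic + alpha) / (
        author_topic.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (topic_word + beta) / (
        topic_word.sum(axis=1, keepdims=True) + n_words * beta)
    return theta, phi

# Toy sample: (word_id, author_id, topic_id) for five tokens.
sample = [(0, 0, 0), (1, 0, 0), (2, 1, 1), (2, 1, 1), (0, 0, 1)]
theta, phi = point_estimates(sample, n_authors=2, n_topics=2, n_words=3)
print(theta)
print(phi)
```

Averaging over samples, as the last bullet suggests, is best done at the level of predictions; averaging θ and φ directly is only meaningful across samples from one chain, because topic labels are not aligned between separate runs.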

  39. Gibbs Sampling • Need full conditional distributions for variables • The probability of assigning the current word i to topic j and author k given everything else depends on two counts: • the number of times word w is assigned to topic j • the number of times topic j is assigned to author k
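A minimal sketch of one sampling pass, assuming the usual collapsed form in which the probability of a (topic j, author k) pair is proportional to the product of two smoothed count ratios, one built from each count named above. The hyperparameters `alpha` and `beta`, the toy corpus, and the function name `gibbs_sweep` are all assumptions for illustration.

```python
import numpy as np

def gibbs_sweep(words, doc_ids, doc_authors, x, z, wt, at, alpha, beta, rng):
    """Resample the (author, topic) pair of every token once, in place.

    wt[w, j]: number of times word w is assigned to topic j
    at[k, j]: number of times topic j is assigned to author k
    """
    n_words, n_topics = wt.shape
    for i, (w, d) in enumerate(zip(words, doc_ids)):
        # Remove token i from the counts ("given everything else").
        wt[w, z[i]] -= 1
        at[x[i], z[i]] -= 1
        authors = np.asarray(doc_authors[d])
        # Totals are recomputed for clarity; a real sampler would cache them.
        word_term = (wt[w] + beta) / (wt.sum(axis=0) + n_words * beta)
        author_term = (at[authors] + alpha) / (
            at[authors].sum(axis=1, keepdims=True) + n_topics * alpha)
        p = (author_term * word_term).ravel()   # one entry per (author, topic)
        p /= p.sum()
        choice = rng.choice(p.size, p=p)
        x[i] = authors[choice // n_topics]
        z[i] = choice % n_topics
        # Add the token back under its new assignment.
        wt[w, z[i]] += 1
        at[x[i], z[i]] += 1

# Toy corpus: 2 documents, 5 vocabulary words, 2 topics, 2 authors.
rng = np.random.default_rng(0)
words = np.array([0, 1, 1, 2, 3, 4, 3, 4])
doc_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
doc_authors = [[0], [0, 1]]
V, T, A = 5, 2, 2

# Random initial assignments, restricted to each document's own authors.
x = np.array([rng.choice(doc_authors[d]) for d in doc_ids])
z = rng.integers(T, size=len(words))
wt = np.zeros((V, T))
at = np.zeros((A, T))
np.add.at(wt, (words, z), 1)
np.add.at(at, (x, z), 1)

for _ in range(50):
    gibbs_sweep(words, doc_ids, doc_authors, x, z, wt, at,
                alpha=0.5, beta=0.01, rng=rng)
print(x, z)
```

Each token can only be assigned to an author of its own document, which is how the known author lists constrain the hidden variable x.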

  40. Experiments on Real Data • Corpora • CiteSeer: 160K abstracts, 85K authors • NIPS: 1.7K papers, 2K authors • Enron: 115K emails, 5K authors (sender) • PubMed: 27K abstracts, 50K authors • Removed stop words; no stemming • Ignore word order, just use word counts • Processing time: • NIPS: 2000 Gibbs iterations, 8 hours • CiteSeer: 2000 Gibbs iterations, 4 days

  41. Four example topics from CiteSeer (T=300)

  42. More CiteSeer Topics

  43. Some topics relate to generic word usage

  44. What can the Model be used for? • We can analyze our document set through the “topic lens” • Applications • Queries • Who writes on this topic? • e.g., finding experts or reviewers in a particular area • What topics does this person do research on? • Discovering trends over time • Detecting unusual papers and authors • Interactive browsing of a digital library via topics • Parsing documents (and parts of documents) by topic • and more…..

  45. Some likely topics per author (CiteSeer) • Author = Andrew McCallum, U Mass: - Topic 1: classification, training, generalization, decision, data, … - Topic 2: learning, machine, examples, reinforcement, inductive, … - Topic 3: retrieval, text, document, information, content, … • Author = Hector Garcia-Molina, Stanford: - Topic 1: query, index, data, join, processing, aggregate, … - Topic 2: transaction, concurrency, copy, permission, distributed, … - Topic 3: source, separation, paper, heterogeneous, merging, … • Author = Paul Cohen, USC/ISI: - Topic 1: agent, multi, coordination, autonomous, intelligent, … - Topic 2: planning, action, goal, world, execution, situation, … - Topic 3: human, interaction, people, cognitive, social, natural, …

  46. Temporal patterns in topics: hot and cold topics • We have CiteSeer papers from 1986-2002 • For each year, calculate the fraction of words assigned to each topic • -> a time-series for topics • Hot topics become more prevalent • Cold topics become less prevalent
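The per-year calculation can be sketched as follows, assuming each word token carries the publication year of its document and a topic assignment from a Gibbs sample. The toy data and the function name `topic_fractions_by_year` are illustrative only.

```python
import numpy as np

def topic_fractions_by_year(years, topics, n_topics):
    """Map each year to the fraction of its word tokens assigned to each topic."""
    years = np.asarray(years)
    topics = np.asarray(topics)
    series = {}
    for year in np.unique(years):
        counts = np.bincount(topics[years == year], minlength=n_topics)
        series[int(year)] = counts / counts.sum()
    return series

# Toy data: one entry per word token.
years = [1986, 1986, 1986, 1986, 2002, 2002, 2002, 2002]
topics = [0, 0, 0, 1, 0, 1, 1, 1]

series = topic_fractions_by_year(years, topics, n_topics=2)
for year, fractions in series.items():
    print(year, fractions)
```

In this toy example topic 1 rises from a quarter of the tokens to three quarters, so it would count as a "hot" topic, and topic 0 as a "cold" one.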
