This paper discusses the use of author-topic models to analyze large sets of documents, including probabilistic approaches and their applications in various domains. Results from CiteSeer, NIPS, and Enron data are presented. Future directions and a demo of an author-topic query tool are also provided.
Author-Topic Models for Large Text Corpora
Padhraic Smyth, Department of Computer Science, University of California, Irvine
In collaboration with: Mark Steyvers (UCI), Michal Rosen-Zvi (UCI), Tom Griffiths (Stanford)
Outline • Problem motivation: • Modeling large sets of documents • Probabilistic approaches • topic models -> author-topic models • Results • Author-topic results from CiteSeer, NIPS, Enron data • Applications of the model • (Demo of author-topic query tool) • Future directions
Data Sets of Interest • Data = a set of documents • Large collections of documents: 10k, 100k, etc. • Authors of the documents are known • Years/dates of the documents are known • (will typically assume a bag-of-words representation)
Examples of Data Sets • CiteSeer: • 160k abstracts, 80k authors, 1986-2002 • NIPS papers • 2k papers, 1k authors, 1987-1999 • Reuters • 20k newspaper articles, 114 authors
Pennsylvania Gazette, 1728-1800: 80,000 articles, 25 million words (www.accessible.com)
Enron email data: 500,000 emails, 5,000 authors, 1999-2002
Problems of Interest • What topics do these documents “span”? • Which documents are about a particular topic? • How have topics changed over time? • What does author X write about? • Who is likely to write about topic Y? • Who wrote this specific document? • and so on…..
A topic is represented as a (multinomial) distribution over words
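In the notation used later in these slides (a sketch of the standard formulation; V denotes the vocabulary size, which is not stated on the original slide), topic j corresponds to a parameter vector φj whose entries are word probabilities:

```latex
\phi_j = \bigl(p(w = 1 \mid z = j), \ldots, p(w = V \mid z = j)\bigr),
\qquad \sum_{w=1}^{V} p(w \mid z = j) = 1 .
```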
Cluster Models
[Figure: two example documents. Document 1 contains the words Probabilistic, Learning, Learning, Bayesian; Document 2 contains Information, Retrieval, Information, Retrieval. In a cluster model each document is assigned to exactly one cluster.]
Cluster 1 word distribution: P(probabilistic | topic) = 0.25, P(learning | topic) = 0.50, P(Bayesian | topic) = 0.25, P(other words | topic) = 0.00. Cluster 2 word distribution: P(information | topic) = 0.5, P(retrieval | topic) = 0.5, P(other words | topic) = 0.0.
Graphical Model
[Plate diagram of the cluster model, built up over three slides: cluster weights α; one cluster variable z per document; cluster-word distributions φ; observed words w inside a plate of n words, nested inside a plate of D documents.]
Cluster Models
[Figure: the same two documents plus Document 3, containing Probabilistic, Learning, Information, Retrieval, i.e. a mixture of words from both clusters.]
Topic Models
[Figure: the same documents, with each word token now assigned to a topic rather than each document to a single cluster. Document 3 (Probabilistic, Learning, Information, Retrieval) can draw its words from both topics.]
History of topic models • Latent class models in statistics (late 1960s) • Hofmann (1999) • Original application to documents • Blei, Ng, and Jordan (2001, 2003) • Variational methods • Griffiths and Steyvers (2003, 2004) • Gibbs sampling approach (very efficient)
Word/Document counts for 16 Artificial Documents
[Figure: word-by-document count matrix for 16 artificial documents.] Can we recover the original topics and topic mixtures from this data?
Example of Gibbs Sampling • Assign word tokens randomly to topics (in the figure, tokens are colored by topic: one color for topic 1, another for topic 2)
After 1 iteration • Apply sampling equation to each word token
After 32 iterations
[Figure: topic assignments after 32 Gibbs iterations.]
Author-Topic Models
[Figure: the same example documents (Documents 1-3), now with each word token associated with both an author and a topic rather than a topic alone.]
Approach • The author-topic model • a probabilistic model linking authors and topics • authors -> topics -> words • learned from data • completely unsupervised, no labels • generative model • Different questions or queries can be answered by appropriate probability calculus • E.g., p(author | words in document) • E.g., p(topic | author)
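As a hedged illustration of the first query (not the exact calculation from the paper; it assumes a uniform prior over candidate authors and treats word tokens as independent given the author), the posterior over authors for a document with words w1, ..., wn can be sketched as:

```latex
p(a \mid w_1, \ldots, w_n) \;\propto\; p(a) \prod_{i=1}^{n} \sum_{t=1}^{T} \theta_{a t}\, \phi_{t w_i},
```

where θ_at = p(topic t | author a) and φ_tw = p(word w | topic t) are the learned author-topic and topic-word distributions.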
Graphical Model
[Plate diagram of the author-topic model, built up over several slides: the document's author set a; an author indicator x and a topic z for each word; author-topic distributions θ; topic-word distributions φ; the observed word w inside a plate of n words, nested inside a plate of D documents.]
Generative Process • Let's assume authors A1 and A2 collaborate and produce a paper • A1 has multinomial topic distribution θ1 • A2 has multinomial topic distribution θ2 • For each word in the paper: • Sample an author x (uniformly) from {A1, A2} • Sample a topic z from θx • Sample a word w from the multinomial word distribution φz of topic z
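A minimal Python sketch of this generative process (illustrative only; the array names theta and phi and the toy dimensions are assumptions, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(authors, theta, phi, n_words):
    """Generate one document under the author-topic model.

    authors : list of author indices who co-wrote the document
    theta   : (n_authors, n_topics) author-topic distributions
    phi     : (n_topics, vocab_size) topic-word distributions
    """
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                 # sample an author uniformly from the co-authors
        z = rng.choice(len(phi), p=theta[x])    # sample a topic from that author's distribution
        w = rng.choice(phi.shape[1], p=phi[z])  # sample a word from the topic's word distribution
        tokens.append((w, x, z))
    return tokens

# Toy example: authors A1 (index 0) and A2 (index 1), two topics, four vocabulary words.
theta = np.array([[0.9, 0.1],                  # A1 writes mostly about topic 0
                  [0.2, 0.8]])                 # A2 writes mostly about topic 1
phi = np.array([[0.4, 0.4, 0.1, 0.1],          # topic 0 favors words 0 and 1
                [0.1, 0.1, 0.4, 0.4]])         # topic 1 favors words 2 and 3
print(generate_document([0, 1], theta, phi, n_words=10))
```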
Learning • Observed • W = observed words, A = sets of known authors • Unknown • x, z: hidden variables • θ, φ: unknown parameters • Interested in: • p(x, z | W, A) • p(θ, φ | W, A) • But exact inference is not tractable
Step 1: Gibbs sampling of x and z • Marginalize over the unknown parameters θ and φ
Step 2: MAP estimates of θ and φ • Condition on particular samples of x and z • Obtain point estimates of the unknown parameters
More Details on Learning • Gibbs sampling for x and z • Typically run 2000 Gibbs iterations • 1 iteration = full pass through all documents • Estimating θ and φ • x and z samples -> point estimates • non-informative Dirichlet priors for θ and φ • Computational efficiency • Learning is linear in the number of word tokens • Predictions on new documents • can average over θ and φ (from different samples, different runs)
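For reference, a sketch of the resulting point estimates (smoothed relative counts, written using the count matrices and Dirichlet hyperparameters α and β that appear in the Gibbs update on the next slide; T = number of topics, V = vocabulary size):

```latex
\hat{\theta}_{kj} = \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha},
\qquad
\hat{\phi}_{mj} = \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta},
```

where C^AT_kj is the number of word tokens assigned to topic j and author k, and C^WT_mj is the number of times word m is assigned to topic j in the sample.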
Gibbs Sampling • Need the full conditional distributions of the hidden variables • The probability of assigning the current word token i (with word wi = m, in a document with author set ad) to topic j and author k, given all other assignments:
P(zi = j, xi = k | wi = m, z-i, x-i, w-i, ad) ∝ [(C^WT_mj + β) / (Σm' C^WT_m'j + Vβ)] × [(C^AT_kj + α) / (Σj' C^AT_kj' + Tα)]
where C^WT_mj = number of times word m is assigned to topic j, and C^AT_kj = number of times topic j is assigned to author k (both counts excluding the current token).
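A minimal sketch of this update for a single word token (illustrative; the array names and the collapsed-count bookkeeping are assumptions consistent with the equation above, not code from the talk):

```python
import numpy as np

def sample_token(w, doc_authors, CWT, CAT, alpha, beta, rng):
    """Jointly resample (author, topic) for one word token w.

    w           : vocabulary index of the current word
    doc_authors : list/array of author indices for this document
    CWT         : (V, T) word-by-topic counts, current token excluded
    CAT         : (A, T) author-by-topic counts, current token excluded
    """
    V, T = CWT.shape
    # word-given-topic factor, shape (T,)
    p_w_given_t = (CWT[w] + beta) / (CWT.sum(axis=0) + V * beta)
    # topic-given-author factor for each candidate author, shape (len(doc_authors), T)
    rows = CAT[np.asarray(doc_authors)]
    p_t_given_a = (rows + alpha) / (rows.sum(axis=1, keepdims=True) + T * alpha)
    probs = p_t_given_a * p_w_given_t      # unnormalized joint over (author, topic)
    probs /= probs.sum()
    idx = rng.choice(probs.size, p=probs.ravel())
    a_idx, z = divmod(idx, T)
    return doc_authors[a_idx], z
```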
Experiments on Real Data • Corpora • CiteSeer: 160K abstracts, 85K authors • NIPS: 1.7K papers, 2K authors • Enron: 115K emails, 5K authors (senders) • Pubmed: 27K abstracts, 50K authors • Removed stop words; no stemming • Ignore word order, just use word counts • Processing time: NIPS, 2000 Gibbs iterations in 8 hours; CiteSeer, 2000 Gibbs iterations in 4 days
What can the Model be used for? • We can analyze our document set through the “topic lens” • Applications • Queries • Who writes on this topic? • e.g., finding experts or reviewers in a particular area • What topics does this person do research on? • Discovering trends over time • Detecting unusual papers and authors • Interactive browsing of a digital library via topics • Parsing documents (and parts of documents) by topic • and more…..
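For instance, a hedged sketch of the "who writes on this topic?" query, ranking authors by p(topic | author) in the learned author-topic matrix (names and shapes assumed for illustration):

```python
import numpy as np

def top_authors_for_topic(theta, author_names, topic, k=5):
    """Return the k authors whose author-topic distribution puts the most weight on `topic`."""
    order = np.argsort(-theta[:, topic])[:k]
    return [(author_names[i], float(theta[i, topic])) for i in order]

# Toy example: three authors, two topics.
theta = np.array([[0.7, 0.3],
                  [0.1, 0.9],
                  [0.5, 0.5]])
print(top_authors_for_topic(theta, ["A", "B", "C"], topic=1, k=2))
```

Note that this ranks authors by how much of their writing falls on the topic; an expert-finding application might also weight by how much each author has written in total.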
Some likely topics per author (CiteSeer) • Author = Andrew McCallum, U Mass: • Topic 1: classification, training, generalization, decision, data,… • Topic 2: learning, machine, examples, reinforcement, inductive,….. • Topic 3: retrieval, text, document, information, content,… • Author = Hector Garcia-Molina, Stanford: - Topic 1: query, index, data, join, processing, aggregate…. - Topic 2: transaction, concurrency, copy, permission, distributed…. - Topic 3: source, separation, paper, heterogeneous, merging….. • Author = Paul Cohen, USC/ISI: - Topic 1: agent, multi, coordination, autonomous, intelligent…. - Topic 2: planning, action, goal, world, execution, situation… - Topic 3: human, interaction, people, cognitive, social, natural….
Temporal patterns in topics: hot and cold topics • We have CiteSeer papers from 1986-2002 • For each year, calculate the fraction of words assigned to each topic • -> a time series for each topic (see the sketch below) • Hot topics become more prevalent over time • Cold topics become less prevalent
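A hedged sketch of the per-year calculation (array names assumed for illustration: z holds the sampled topic of every word token, year holds the publication year of the document containing that token):

```python
import numpy as np

def topic_fractions_by_year(z, year, n_topics):
    """Fraction of word tokens assigned to each topic, computed separately for each year.

    z    : (N,) topic assignment of each word token
    year : (N,) publication year of the document containing each token
    Returns a dict mapping year -> array of length n_topics that sums to 1.
    """
    z, year = np.asarray(z), np.asarray(year)
    series = {}
    for y in np.unique(year):
        counts = np.bincount(z[year == y], minlength=n_topics)
        series[int(y)] = counts / counts.sum()
    return series
```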