Models for Authors and Text Documents
Mark Steyvers (UCI)
In collaboration with: Padhraic Smyth (UCI), Michal Rosen-Zvi (UCI), Thomas Griffiths (Stanford)
These viewgraphs were developed by Professor Mark Steyvers and are intended for review by ICS 278 students. If you wish to use them for any other purpose, please contact Professor Smyth (smyth@ics.uci.edu) or Professor Steyvers (msteyver@uci.edu).
Goal
• Automatically extract topical content of documents
• Learn association of topics to authors of documents
• Propose a new, efficient probabilistic topic model: the author-topic model
• Some queries the model should be able to answer:
  • What topics does author X work on?
  • Which authors work on topic X?
  • What are interesting temporal patterns in topics?
A topic is represented as a (multinomial) distribution over words
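In symbols (a sketch of the notation; Φ, the matrix of topic-word distributions, appears in the model below, and V denotes the vocabulary size):

```latex
% Topic j is a multinomial distribution over the V-word vocabulary:
P(w \mid z = j) = \phi^{(j)}_w, \qquad \sum_{w=1}^{V} \phi^{(j)}_w = 1
```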
Documents as Topic Mixtures: a Geometric Interpretation
[Figure: the simplex P(word1) + P(word2) + P(word3) = 1, with topic 1 and topic 2 as points on it; a document, being a mixture of topics, lies between the topics it mixes.]
Previous topic-based models
• Hofmann (1999): Probabilistic Latent Semantic Indexing (pLSI)
  • EM implementation
  • Problem of overfitting
• Blei, Ng, & Jordan (2003): Latent Dirichlet Allocation (LDA)
  • Clarified the pLSI model
  • Variational EM
• Griffiths & Steyvers (PNAS, 2004)
  • Same generative model as LDA
  • Gibbs sampling technique for inference
  • Computationally simple
  • Efficient (linear in the size of the data)
  • Can be applied to >100K documents
Approach with Author-Topic Models
• Combine author models with topic models
• Ignore style; focus on content of documents
• Learn the topics that authors write about
• Learn two matrices: an Authors × Topics matrix and a Topics × Words matrix
Assumptions of the Generative Model
• Each author is associated with a mixture of topics
• Each document contains a mixture of topics
• With multiple authors, the document expresses a mixture of the co-authors' topic mixtures
• Each word in a text is generated from one topic and one author (potentially different for each word)
Generative Process
• Assume authors A1 and A2 collaborate and produce a paper
  • A1 has multinomial topic distribution θ1
  • A2 has multinomial topic distribution θ2
• For each word in the paper:
  1. Sample an author x uniformly from {A1, A2}
  2. Sample a topic z from θx
  3. Sample a word w from the multinomial word distribution φz for topic z
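A minimal runnable sketch of this generative process (function and variable names are illustrative, not from the talk; theta and phi are assumed given as row-stochastic matrices):

```python
# Sketch of the author-topic generative process for one paper.
import numpy as np

rng = np.random.default_rng(0)

def generate_paper(coauthors, theta, phi, n_words):
    """Generate word indices for a paper by the given co-author list."""
    words = []
    for _ in range(n_words):
        x = rng.choice(coauthors)                  # 1. author, uniform over co-authors
        z = rng.choice(phi.shape[0], p=theta[x])   # 2. topic from that author's theta
        w = rng.choice(phi.shape[1], p=phi[z])     # 3. word from that topic's phi
        words.append(w)
    return words

# Toy example: 2 authors, 3 topics, 5-word vocabulary.
theta = rng.dirichlet(np.ones(3), size=2)   # each author's topic mixture
phi = rng.dirichlet(np.ones(5), size=3)     # each topic's word distribution
print(generate_paper([0, 1], theta, phi, n_words=10))
```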
Graphical Model
[Figure: plate diagram of the author-topic model. Θ is the matrix of author-topic distributions; Φ is the matrix of topic-word distributions. For each word: 1. choose an author from the set of co-authors; 2. choose a topic from that author's row of Θ; 3. choose a word from that topic's row of Φ.]
Model Estimation
• Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic)
• Integrate out Φ and Θ
• Estimation is efficient: linear in data size
• Infer:
  • Author-topic distributions (Θ)
  • Topic-word distributions (Φ)
Gibbs sampling in Author-Topics
• Need the full conditional distributions for the hidden variables
• The probability of assigning the current word i (of word type m, in a document with co-author set a_d) to topic j and author k, given all other assignments:

$$P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{a}_d) \;\propto\; \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}$$

where C^{WT}_{mj} is the number of times word m is assigned to topic j, C^{AT}_{kj} is the number of times topic j is assigned to author k (both counts excluding the current word), V is the vocabulary size, and T is the number of topics.
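A sketch of one collapsed Gibbs update implementing the conditional above (the count-matrix names C_wt and C_at, and the symmetric alpha/beta hyperparameters, are assumptions of this sketch):

```python
# One collapsed Gibbs step for word w: decrement its current counts, score every
# (author, topic) pair with the conditional above, sample, and re-increment.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_step(w, z, x, coauthors, C_wt, C_at, alpha, beta):
    """Resample (topic, author) for one word w with current assignment (z, x).

    C_wt: vocab x topics word-topic counts; C_at: authors x topics counts.
    """
    V, T = C_wt.shape
    C_wt[w, z] -= 1                      # remove current assignment from counts
    C_at[x, z] -= 1
    # Unnormalized conditional for every (author, topic) pair.
    p_wt = (C_wt[w] + beta) / (C_wt.sum(axis=0) + V * beta)          # shape (T,)
    p_at = (C_at[coauthors] + alpha) / (
        C_at[coauthors].sum(axis=1, keepdims=True) + T * alpha)      # (A_d, T)
    p = (p_at * p_wt).ravel()            # broadcast to (A_d, T), then flatten
    p /= p.sum()
    idx = rng.choice(p.size, p=p)
    x_new = coauthors[idx // T]
    z_new = idx % T
    C_wt[w, z_new] += 1                  # add the new assignment back
    C_at[x_new, z_new] += 1
    return z_new, x_new
```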
Data
• Corpora:
  • CiteSeer: 160K abstracts, 85K authors
  • NIPS: 1.7K papers, 2K authors
  • Enron: 115K emails, 5K authors (senders)
• Removed stop words; no stemming
• Word order is ignored; only word counts are used
• Processing time:
  • NIPS: 2000 Gibbs iterations, 12 hours on a PC workstation
  • CiteSeer: 700 Gibbs iterations, 111 hours
Some likely topics per author (CiteSeer)
• Author = Andrew McCallum, U Mass:
  • Topic 1: classification, training, generalization, decision, data, …
  • Topic 2: learning, machine, examples, reinforcement, inductive, …
  • Topic 3: retrieval, text, document, information, content, …
• Author = Hector Garcia-Molina, Stanford:
  • Topic 1: query, index, data, join, processing, aggregate, …
  • Topic 2: transaction, concurrency, copy, permission, distributed, …
  • Topic 3: source, separation, paper, heterogeneous, merging, …
• Author = Paul Cohen, USC/ISI:
  • Topic 1: agent, multi, coordination, autonomous, intelligent, …
  • Topic 2: planning, action, goal, world, execution, situation, …
  • Topic 3: human, interaction, people, cognitive, social, natural, …
Stability of Topics
• The indexing of topics is arbitrary across runs of the model (e.g., topic #1 is not the same topic across runs)
• However:
  • The majority of topics are stable over the course of sampling
  • The majority of topics can be aligned across runs
  • Topics represent genuine structure in the data
Comparing NIPS topics from the same Markov chain
[Figure: KL-distance matrix between topics at t1 = 1000 iterations (rows) and re-ordered topics at t2 = 2000 iterations (columns). Best-matching pair: KL = 0.54; worst: KL = 4.78.]
Comparing NIPS topics from two different Markov chains
[Figure: KL-distance matrix between topics from chain 1 (rows) and re-ordered topics from chain 2 (columns). Best-matching pair: KL = 1.03; worst: KL = 9.49.]
Detecting Papers on Unusual Topics for Authors
• We can calculate the perplexity (unusualness) of the words in a document given one of its authors
• [Table: papers ranked by perplexity for M. Jordan]
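The score behind this ranking can be written as follows (a sketch consistent with the standard definition of perplexity; θ and φ as above, with N_d the number of words in document d):

```latex
% Perplexity of document d's words under author a's topic mixture:
\mathrm{Perplexity}(\mathbf{w}_d \mid a)
  = \exp\Big( -\frac{1}{N_d} \sum_{i=1}^{N_d} \log P(w_i \mid a) \Big),
\qquad
P(w_i \mid a) = \sum_{j=1}^{T} \phi^{(j)}_{w_i}\, \theta^{(a)}_j
```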
Author Separation
• Can the model attribute the words within a document to the correct authors?
• Test of the model: 1) artificially concatenate abstracts from different authors; 2) check whether each word is assigned to its correct original author
• [Example: an abstract by (1) Scholkopf_B on kernel methods is concatenated with an abstract by (2) Darwiche_A on model-based diagnosis, and each word is tagged with the author the model assigns it to. The model attributes kernel/SVM/feature-space vocabulary almost entirely to author 1 and diagnosis/consequence/system-structure vocabulary almost entirely to author 2, with only scattered errors.]
Temporal patterns in topics: hot and cold topics
• We have CiteSeer papers from 1986-2001
• We can calculate a time-series for each topic
• Hot topics become more prevalent over time; cold topics become less prevalent
• Do the time-series correspond to known trends in computer science?
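One way to compute such a time-series (a sketch; theta_doc and years are hypothetical inputs, with per-document topic mixtures obtained from the fitted model):

```python
# Mean topic probability per year: a topic is "hot" if this curve rises over time.
import numpy as np

def topic_time_series(theta_doc, years):
    """theta_doc: docs x topics mixture matrix; years: numpy array of per-doc years."""
    out = {}
    for y in np.unique(years):
        out[int(y)] = theta_doc[years == y].mean(axis=0)  # avg mixture in year y
    return out
```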
Comparison to models that use less information
[Figure: the two baseline models, the Topics model (topics, no authors) and the Author model (authors, no topics).]
Matrix Factorization Interpretation
• Author-Topic model: the Words × Documents matrix factors as (Words × Topics) × (Topics × Authors) × (Authors × Documents)
• Topic model: Words × Documents factors as (Words × Topics) × (Topics × Documents)
• Author model: Words × Documents factors as (Words × Authors) × (Authors × Documents)
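A toy numpy check of the author-topic factorization (all sizes and variable names are illustrative):

```python
# Under the author-topic model, the words x documents probability matrix factors
# as (words x topics)(topics x authors)(authors x documents).
import numpy as np

rng = np.random.default_rng(0)
V, T, A, D = 50, 5, 8, 20                          # vocab, topics, authors, docs

phi = rng.dirichlet(np.ones(V), size=T).T          # words x topics, columns sum to 1
theta = rng.dirichlet(np.ones(T), size=A).T        # topics x authors
author_doc = rng.dirichlet(np.ones(A), size=D).T   # authors x docs (each column a
                                                   # distribution over a doc's authors)

p_word_given_doc = phi @ theta @ author_doc        # words x documents
assert np.allclose(p_word_given_doc.sum(axis=0), 1.0)  # each column is a distribution
```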
Comparison Results
• Train the models on part of a new document and predict the remaining words
• Without having seen any words of a new document, author-topic information helps predict words from that document
• The topics model is more flexible: it adapts better to a new document after observing a number of its words
Author prediction with CiteSeer
• Task: predict the (single) author of new CiteSeer abstracts
• Results:
  • For 33% of documents, the author is guessed correctly
  • Median rank of the true author = 26 (out of 85,000)
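A sketch of one way to score candidate authors for this task (assuming phi and theta estimated as above; not necessarily the exact scoring procedure used in the talk):

```python
# Rank candidate authors for a new document by the log-likelihood of its words
# under each author's topic mixture: log P(w_d | a) = sum_i log(phi @ theta)[w_i, a].
import numpy as np

def rank_authors(word_ids, phi, theta):
    """phi: words x topics; theta: topics x authors. Returns authors best-first."""
    p_word_given_author = phi @ theta                 # words x authors
    log_lik = np.log(p_word_given_author[word_ids]).sum(axis=0)  # (authors,)
    return np.argsort(-log_lik)                       # highest likelihood first
```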
Perplexities for true author and any random author
[Figure: perplexity distributions for A = true author vs. A = any author.]
The Author-Topic Browser
[Screenshots: (a) querying on author Pazzani_M; (b) querying on a topic relevant to that author; (c) querying on a document written by that author.]
http://www.ics.uci.edu/~michal/KDD/ATM.htm
New Applications / Future Work
• Finding relevant email:
  • "Find emails similar to this email, based on content"
  • "Find people who wrote emails similar in content to this one"
• Reviewer recommendation:
  • "Find reviewers for this set of NSF proposals who are active in the relevant topics and have no conflicts of interest"
• Change detection/monitoring:
  • Which authors are on the leading edge of new topics?
  • Characterize the "topic trajectory" of an author over time
• Author identification:
  • Who wrote this document?
  • Incorporation of stylistic information