Latent Association Analysis of Document Pairs
Gengxin Miao
University of California, Santa Barbara
Presented at the IBM T.J. Watson Research Center, Hawthorne, NY
December 2, 2011
Networked Texts
• Texts flow on expert networks
• Semantically associated texts
• Interconnected text streams
[Figure labels: DB2 logon; Symptoms; Diseases; Belong to the same search task; Users; Treatments; Queries; Web pages]
UC Santa Barbara
+ Semantically Associated Documents
Applications
• Software system maintenance
• Root cause finding
• Problem prediction
• Machine translation
• Question answering
• Healthcare assistance
Huge Datasets
• Beyond a human learner’s capability
Modeling Options
[Figure labels: Source Document Set; Target Document Set]
• Word-level mapping
• Topic-level mapping
• Document-level mapping
Word-Level Mapping (UAI’09)
• Learns a dictionary between the two document sets
• Applies to machine translation
• Word mappings are typically noisy
Topic-Level Mapping (EMNLP’09)
• Assumes the associated documents share the same topic proportions
• Works well for translations between languages
Document-Level Mapping (our work)
• One-to-many or many-to-one mappings are broken down into one-to-one document pairs
• Two documents are associated by their association factor
Latent Association Analysis – Framework
• Generative process
  • Draw an association factor for each document pair
  • Draw topic proportions for both the source and the target document
  • Draw the words in each document
Latent Association Analysis – An Instantiation
• Canonical Correlation Analysis (CCA)
  • Captures the semantic association in document pairs
• Correlated Topic Model (CTM)
  • Captures the document and word co-occurrence
The Generative Process
• A pair of documents arises from the following process
  • Draw an L-dimensional association factor
  • For the source/target document, draw the topic proportions
  • For each word in the documents, draw a topic and a word
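The three steps above can be sketched in NumPy. This is a minimal illustration, not the presenter's implementation: the slide's equations did not survive, so the distributional forms are assumptions (a standard-normal association factor, a linear-Gaussian map to topic logits as in probabilistic CCA, and a softmax link as in CTM). The names `W`, `mu`, `sigma`, `beta`, `random_params`, `sample_document`, and `sample_pair` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax mapping real-valued logits to proportions.
    e = np.exp(x - x.max())
    return e / e.sum()

def random_params(rng, K, L, V):
    # Hypothetical parameters for one side (source or target):
    # W: K x L loading matrix, mu: K-dim mean, sigma: noise scale,
    # beta: K x V topic-word distributions (each row sums to 1).
    return dict(W=rng.standard_normal((K, L)), mu=np.zeros(K), sigma=0.1,
                beta=rng.dirichlet(np.ones(V), size=K))

def sample_document(rng, z, W, mu, sigma, beta, n_words):
    # Assumed form: topic logits are a noisy linear function of z.
    eta = W @ z + mu + sigma * rng.standard_normal(mu.shape[0])
    theta = softmax(eta)                                     # topic proportions
    topics = rng.choice(len(theta), size=n_words, p=theta)   # one topic per word
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in topics])
    return theta, topics, words

def sample_pair(rng, L, params_src, params_tgt, n_words=50):
    z = rng.standard_normal(L)          # L-dimensional association factor
    src = sample_document(rng, z, n_words=n_words, **params_src)
    tgt = sample_document(rng, z, n_words=n_words, **params_tgt)
    return z, src, tgt
```

Both documents in a pair share the same draw of `z`, which is what ties their topic proportions together; the source and target sides may use different numbers of topics and different vocabularies.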
Problems
• Inference
  • Given a model M and a document pair
  • How to determine the association factor, topic proportions and topic assignments that best describe the document pair?
• Model fitting
  • Given a set of document pairs
  • How to calculate the parameters in M that best describe the entire document pair set?
Inference
• Objective function
  • Given a model and a document pair
  • Calculate the topic assignments and the topic proportions
• Posterior distribution is intractable to compute
  • The topic assignments and the topic proportions are correlated when conditioned on observations
Variational Inference
• Decouple the parameters using a variational distribution Q
• Fit the variational parameters to approximate the true posterior distribution
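For reference, the generic bound that variational inference of this kind optimizes can be written as follows. The notation is an assumption introduced here, since the slide's own formulas were lost: w denotes the observed words of both documents, h collects the latent variables (association factor, topic proportions, topic assignments), and M is the model.

```latex
\log p(\mathbf{w} \mid M)
  \;\ge\; \mathbb{E}_{Q}\!\left[\log p(\mathbf{w}, \mathbf{h} \mid M)\right]
        - \mathbb{E}_{Q}\!\left[\log Q(\mathbf{h})\right],
\qquad
\text{gap} \;=\; \mathrm{KL}\!\left(Q(\mathbf{h}) \,\middle\|\, p(\mathbf{h} \mid \mathbf{w}, M)\right)
```

Maximizing the right-hand side over the variational parameters is equivalent to minimizing the KL divergence between Q and the true posterior, which is the sense in which Q is fit to approximate it.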
Variational Parameters
Model Fitting
LAA Ranking Methods
• Direct Ranking
  • Ranking function for a candidate document pair
  • Word frequency can distort the probability
• Latent Ranking
Two-Step Ranking
• Separate Topic Models
  • Source document has topic proportion
  • Target document has topic proportion
• Topic-Level Mapping
  • Canonical Correlation Analysis captures the association between the topic proportions
• Rank Target Documents
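A rough NumPy sketch of the two-step idea follows: fit classical CCA on paired topic-proportion vectors, then rank candidate targets in the canonical space. The slide does not give the scoring rule, so the correlation-weighted dot product used here is an assumption, as are the function names `fit_cca` and `rank_targets` and the small ridge term `reg` (added because proportions sum to one, which makes their covariance singular).

```python
import numpy as np

def fit_cca(X, Y, n_components, reg=1e-6):
    """Classical CCA on row-aligned matrices X (n x p) and Y (n x q)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    # Whiten both sides with Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy)           # Lx^{-1} Cxy
    T = np.linalg.solve(Ly, T.T).T         # Lx^{-1} Cxy Ly^{-T}
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map singular vectors back to the original coordinates.
    A = np.linalg.solve(Lx.T, U[:, :n_components])
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return mx, my, A, B, s[:n_components]

def rank_targets(theta_src, thetas_tgt, model):
    """Return candidate target indices, best first (assumed scoring rule)."""
    mx, my, A, B, s = model
    u = (theta_src - mx) @ A               # source canonical variates
    V = (thetas_tgt - my) @ B              # one row per candidate target
    scores = V @ (s * u)                   # correlation-weighted dot product
    return np.argsort(-scores)
```

In this sketch `X` and `Y` would hold the topic proportions inferred by the two separate topic models for the training pairs; `s` contains the canonical correlations, so components that carry little association contribute little to the score.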
Experiments
• Datasets
  • IT-Change: changes made to an IT environment and the consequent problems
    • 24,317 document pairs
    • 20,000 used for training, the remaining 4,317 used for testing
  • IT-Solution: IT problems and their solutions
    • 19,696 document pairs
    • 15,000 used for training, the remaining 4,696 used for testing
• Evaluation
  • Randomly select 100 document pairs from the testing dataset
  • For each source document, rank the 100 target documents
  • Use the rank of the correct target document as the accuracy measurement
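The evaluation measure above can be computed from a matrix of ranking scores. The sketch below assumes a hypothetical `scores` array in which row i holds the scores of all candidate targets for source i and the correct target of source i sits at column i; ties are resolved in favor of the correct document, which is a choice made here and not stated in the slides.

```python
import numpy as np

def correct_target_ranks(scores):
    """1-based rank of the correct target for each source (1 = ranked first)."""
    scores = np.asarray(scores, dtype=float)
    correct = np.diag(scores)              # score of the true pair (i, i)
    # Rank = 1 + number of candidates scoring strictly higher.
    return 1 + (scores > correct[:, None]).sum(axis=1)

# Toy example with 3 pairs instead of 100.
toy = [[0.9, 0.1, 0.2],
       [0.8, 0.3, 0.5],
       [0.1, 0.2, 0.05]]
print(correct_target_ranks(toy))           # [1 3 3]
```

With 100 sampled test pairs, `scores` would be a 100 x 100 array, and the resulting ranks can be summarized by their mean or by the fraction falling within the top k.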
Accuracy Analysis
Example
Summary
• The LAA framework is capable of modeling two document sets associated by a bipartite graph
• One-to-many and many-to-one mappings of documents are taken into consideration
• We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications
• The LAA-latent ranking algorithm ranks the correct target document higher than other state-of-the-art algorithms
Acknowledgment
• Prof. Louise E. Moser
• Prof. Xifeng Yan
• Dr. Shu Tao
• Dr. Ziyu Guan
• Dr. Nikos Anerousis
Q & A? Thanks!
Unigram Model
Mixture of Unigrams
Probabilistic Latent Semantic Indexing
LDA and CTM
[Figure labels: topic 1, topic 2, topic 3 (appearing twice)]