Latent Association Analysis of Document Pairs

Gengxin Miao
University of California, Santa Barbara
Presented at the IBM T.J. Watson Research Center, Hawthorne, NY
December 2, 2011

Networked Texts
• Texts flow on expert networks
• Semantically associated texts
• Interconnected text streams
(Figure labels: DB2 logon; Symptoms, Diseases, Treatments; Users, Queries, Web pages; "Belong to the same search task")
UC Santa Barbara

Semantically Associated Documents

Applications
• Software system maintenance
• Root cause finding
• Problem prediction
• Machine translation
• Question answering
• Healthcare assistance

Huge Datasets
• Beyond a human learner’s capability

Modeling Options
(Figure: source document set and target document set)
• Word-level mapping
• Topic-level mapping
• Document-level mapping

Word-Level Mapping (UAI’09)
• Learns a dictionary between the two document sets
• Applies to machine translation
• Word mappings are typically noisy

Topic-Level Mapping (EMNLP’09)
• Assumes the associated documents share the same topic proportion
• Works well for translations between languages

Document-Level Mapping (our work)
• One-to-many or many-to-one mappings are broken down into one-to-one document pairs
• Two documents are associated by their association factor
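The first bullet above can be illustrated with a short sketch. It assumes the associations are stored as a mapping from a source document ID to a list of target document IDs; the function name `to_document_pairs` and the example IDs are hypothetical and do not come from the talk.

```python
def to_document_pairs(associations):
    """Flatten a source -> [targets] mapping into one-to-one (source, target) pairs.

    Each edge of the bipartite association graph becomes one pair, so a source
    with several targets (one-to-many) and a target shared by several sources
    (many-to-one) are both covered.
    """
    pairs = []
    for source_id, target_ids in associations.items():
        for target_id in target_ids:
            pairs.append((source_id, target_id))
    return pairs


# Hypothetical example: change-1 led to two problems, and problem-2 has two causes.
example = {
    "change-1": ["problem-1", "problem-2"],
    "change-2": ["problem-2"],
}
example_pairs = to_document_pairs(example)
```

Each resulting pair would then receive its own association factor in the model described on the following slides.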

Latent Association Analysis – Framework
• Generative process
  • Draw an association factor for each document pair
  • Draw topic proportions for both the source and the target document
  • Draw the words in each document

Latent Association Analysis – An Instantiation
• Canonical Correlation Analysis (CCA)
  • Captures the semantic association in document pairs
• Correlated Topic Model (CTM)
  • Captures the document and word co-occurrence

The Generative Process
• A pair of documents arises from the following process
  • Draw an L-dimensional association factor
  • For the source/target document, draw the topic proportions
  • For each word in the documents, draw a topic and a word
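The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the talk's implementation: the association factor is drawn from a standard normal; topic proportions come from a linear map of that factor plus isotropic Gaussian noise (a simplification of the full CCA noise covariance), pushed through a softmax as in a logistic-normal topic model; and all names (`T_s`, `mu_s`, `beta_s`, `sigma`, and so on) are hypothetical.

```python
import math
import random


def softmax(values):
    """Map real-valued scores to a probability vector."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def draw_topic_proportions(y, T, mu, sigma, rng):
    """Draw eta = T y + mu + noise, then return softmax(eta).

    T is a K x L matrix (list of rows), mu has length K. The isotropic noise
    with standard deviation sigma is an assumption made for brevity.
    """
    eta = []
    for k in range(len(mu)):
        mean_k = sum(T[k][l] * y[l] for l in range(len(y))) + mu[k]
        eta.append(mean_k + rng.gauss(0.0, sigma))
    return softmax(eta)


def draw_words(theta, beta, n_words, rng):
    """For each word position, draw a topic from theta, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(range(len(beta[topic])), weights=beta[topic])[0]
        words.append(word)
    return words


def generate_pair(L, T_s, mu_s, beta_s, T_t, mu_t, beta_t, n_s, n_t, sigma=0.1, seed=0):
    """Generate one (source, target) document pair that shares an association factor."""
    rng = random.Random(seed)
    # Step 1: L-dimensional association factor shared by both documents.
    y = [rng.gauss(0.0, 1.0) for _ in range(L)]
    # Step 2: topic proportions for the source and the target document.
    theta_s = draw_topic_proportions(y, T_s, mu_s, sigma, rng)
    theta_t = draw_topic_proportions(y, T_t, mu_t, sigma, rng)
    # Step 3: a topic and a word for every word position in each document.
    words_s = draw_words(theta_s, beta_s, n_s, rng)
    words_t = draw_words(theta_t, beta_t, n_t, rng)
    return y, theta_s, words_s, theta_t, words_t
```

Because both documents are generated from the same factor `y`, their topic proportions are coupled even though the source and target sides may use different numbers of topics and different vocabularies.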

Problems
• Inference
  • Given a model M and a document pair
  • How to determine the association factor, topic proportions and topic assignments that best describe the document pair?
• Model fitting
  • Given a set of document pairs
  • How to calculate the parameters in M that best describe the entire document pair set?

Inference
• Objective function
  • Given a model and a document pair
  • Calculate the topic assignments and the topic proportions
• Posterior distribution is intractable to compute
  • The topic assignments and the topic proportions are correlated when conditioned on observations

Variational Inference
• Decouple the parameters using a variational distribution Q
• Fit the variational parameters to approximate the true posterior distribution
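The two bullets above correspond to the standard variational lower bound, written out here as a worked formula. The symbols are assumptions introduced for illustration, since the slide's own notation did not survive: y is the association factor, θ_s and θ_t are the topic proportions, z_s and z_t are the topic assignments, and w_s and w_t are the observed words of the source and target documents. The fully factorized form of Q in the second line is likewise one common choice, not necessarily the one used in the talk.

```latex
\log p(w_s, w_t \mid M)
  \;\ge\;
  \mathbb{E}_{Q}\!\left[\log p(y, \theta_s, \theta_t, z_s, z_t, w_s, w_t \mid M)\right]
  - \mathbb{E}_{Q}\!\left[\log Q(y, \theta_s, \theta_t, z_s, z_t)\right]

Q(y, \theta_s, \theta_t, z_s, z_t)
  = q(y)\, q(\theta_s)\, q(\theta_t)
    \prod_{n} q(z_{s,n}) \prod_{m} q(z_{t,m})
```

The gap between the two sides of the inequality is the KL divergence from Q to the true posterior, so maximizing the bound over the variational parameters is the same as fitting Q to that posterior.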

Variational Parameters

Model Fitting

LAA Ranking Methods
• Direct Ranking
  • Ranking function for a candidate document pair
  • Word frequency can distort the probability
• Latent Ranking

Two-Step Ranking
• Separate Topic Models
  • Source document has topic proportion
  • Target document has topic proportion
• Topic-Level Mapping
  • Canonical Correlation Analysis captures the association between the topic proportions
• Rank Target Documents

Experiments
• Datasets
  • IT-Change: changes made to an IT environment and the consequent problems
    • 24,317 document pairs
    • 20,000 used for training, the remaining 4,317 used for testing
  • IT-Solution: IT problems and their solutions
    • 19,696 document pairs
    • 15,000 used for training, the remaining 4,696 used for testing
• Evaluation
  • Randomly select 100 document pairs from the testing dataset
  • For each source document, rank the 100 target documents
  • Use the rank of the correct target document as the accuracy measurement
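The evaluation protocol above can be sketched as follows. This is an illustrative outline only: `score_fn` stands in for whichever ranking function is being evaluated (a hypothetical callable where a higher score means a stronger association), and the tie-handling rule is an assumption, since the slides do not specify one.

```python
def rank_of_correct_target(scores, correct_index):
    """Return the 1-based rank of the correct target among all candidates.

    Higher scores rank first. Tied candidates share the best rank
    (an assumption; the talk does not say how ties are broken).
    """
    correct_score = scores[correct_index]
    return 1 + sum(1 for s in scores if s > correct_score)


def evaluate(score_fn, pairs):
    """For each source document, rank every target in the sample and
    record the rank of its own (correct) target."""
    targets = [target for _, target in pairs]
    ranks = []
    for i, (source, _) in enumerate(pairs):
        scores = [score_fn(source, target) for target in targets]
        ranks.append(rank_of_correct_target(scores, i))
    return ranks


def fraction_within_top_k(ranks, k):
    """Fraction of source documents whose correct target is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy stand-in for a ranking function: number of shared words.
def word_overlap(source, target):
    return len(set(source.split()) & set(target.split()))


toy_pairs = [("a b", "a b c"), ("d e", "d e f"), ("a b g", "g")]
toy_ranks = evaluate(word_overlap, toy_pairs)
```

With 100 sampled pairs, each rank falls between 1 and 100, and a lower rank means the method placed the correct target document closer to the top.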

Accuracy Analysis

Example

Summary
• The LAA framework is capable of modeling two document sets associated by a bipartite graph
• One-to-many mappings or many-to-one mappings of documents are taken into consideration
• We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications
• The LAA-latent ranking algorithm ranks the correct target document better than other state-of-the-art algorithms

Acknowledgment
• Prof. Louise E. Moser
• Prof. Xifeng Yan
• Dr. Shu Tao
• Dr. Ziyu Guan
• Dr. Nikos Anerousis

Q & A? Thanks!

Unigram Model

Mixture of Unigrams

Probabilistic Latent Semantic Indexing

LDA and CTM
(Figure: topic 1, topic 2 and topic 3 shown for each model)
