Latent Association Analysis of Document Pairs

Gengxin Miao, University of California, Santa Barbara. Presented at the IBM T.J. Watson Research Center, Hawthorne, NY, December 2, 2011.

Presentation Transcript


  1. Latent Association Analysis of Document Pairs. Gengxin Miao, University of California, Santa Barbara. Presented at the IBM T.J. Watson Research Center, Hawthorne, NY, December 2, 2011

  2. Networked Texts
     • Texts flow on expert networks
     • Semantically associated texts
     • Interconnected text streams
     [Figure labels: DB2 logon; Symptoms; Diseases; Treatments; Users; Queries; Web pages; Belong to the same search task]

  3. Semantically Associated Documents

  4. Applications
     • Software system maintenance
     • Root cause finding
     • Problem prediction
     • Machine translation
     • Question answering
     • Healthcare assistance

  5. Huge Datasets
     • Beyond a human learner’s capability

  6. Modeling Options
     • Word-level mapping
     • Topic-level mapping
     • Document-level mapping
     [Figure labels: Source Document Set; Target Document Set]

  7. Word-Level Mapping (UAI’09)
     • Learns a dictionary between the two document sets
     • Applies to machine translation
     • Word mappings are typically noisy

  8. Topic-Level Mapping (EMNLP’09)
     • Assumes the associated documents share the same topic proportion
     • Works well for translations between languages

  9. Document-Level Mapping (our work)
     • One-to-many or many-to-one mappings are broken down into one-to-one document pairs
     • Two documents are associated by their association factor

  10. Latent Association Analysis – Framework
     • Generative process
        • Draw an association factor for each document pair
        • Draw topic proportions for both the source and the target document
        • Draw the words in each document

  11. Latent Association Analysis – An Instantiation
     • Canonical Correlation Analysis (CCA)
        • Captures the semantic association in document pairs
     • Correlated Topic Model (CTM)
        • Captures the document and word co-occurrence

  12. The Generative Process
     • A pair of documents arises from the following process
        • Draw an L-dimensional association factor
        • For the source/target document, draw the topic proportions
        • For each word in the documents, draw a topic and a word
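The three draws on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' exact model: the transcript does not give the parameterization, so the sketch assumes a standard-normal association factor, a linear-Gaussian map from that factor to each document's topic logits (in the spirit of probabilistic CCA), and a softmax to obtain topic proportions (the logistic-normal form used by CTM). The names `generate_pair`, `toy_model`, `weights`, `mean`, `noise_std` and `topic_word` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Turn real-valued logits into proportions that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def draw_topic_proportions(z, weights, mean, noise_std, rng):
    """Assumed form: logits = W z + mean + Gaussian noise, then softmax."""
    eta = [
        sum(w * zi for w, zi in zip(row, z)) + mu + rng.gauss(0.0, noise_std)
        for row, mu in zip(weights, mean)
    ]
    return softmax(eta)


def draw_words(theta, topic_word, n_words, rng):
    """For each word position, draw a topic from theta, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        dist = topic_word[topic]
        words.append(rng.choices(range(len(dist)), weights=dist)[0])
    return words


def generate_pair(model, n_source, n_target, rng):
    """Generate one (source, target) document pair that shares a factor z."""
    # Step 1: draw the L-dimensional association factor (assumed standard normal).
    z = [rng.gauss(0.0, 1.0) for _ in range(model["L"])]
    pair = {}
    for side, n_words in (("source", n_source), ("target", n_target)):
        p = model[side]
        # Step 2: draw the topic proportions of this document from z.
        theta = draw_topic_proportions(z, p["weights"], p["mean"], p["noise_std"], rng)
        # Step 3: draw a topic and a word for every word position.
        pair[side] = {"theta": theta, "words": draw_words(theta, p["topic_word"], n_words, rng)}
    return z, pair


# Hypothetical toy parameters: L = 2, three topics per side, small vocabularies.
toy_model = {
    "L": 2,
    "source": {
        "weights": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
        "mean": [0.0, 0.0, 0.0],
        "noise_std": 0.1,
        "topic_word": [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]],
    },
    "target": {
        "weights": [[0.5, 0.5], [1.0, -1.0], [-0.5, 0.0]],
        "mean": [0.0, 0.0, 0.0],
        "noise_std": 0.1,
        "topic_word": [
            [0.6, 0.1, 0.1, 0.1, 0.1],
            [0.1, 0.6, 0.1, 0.1, 0.1],
            [0.1, 0.1, 0.2, 0.3, 0.3],
        ],
    },
}

rng = random.Random(0)
z, pair = generate_pair(toy_model, n_source=6, n_target=8, rng=rng)
```

Because both documents read their topic logits from the same `z`, their topic proportions are coupled even though the two sides use separate vocabularies and topics, which is the role the slides assign to the association factor.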

  13. Problems
     • Inference
        • Given a model M and a document pair
        • How to determine the association factor, topic proportions and topic assignments that best describe the document pair?
     • Model fitting
        • Given a set of document pairs
        • How to calculate the parameters in M that best describe the entire document pair set?

  14. Inference
     • Objective function
        • Given a model and a document pair
        • Calculate the topic assignments and the topic proportions
     • Posterior distribution is intractable to compute
        • The topic assignments and the topic proportions are correlated when conditioned on observations

  15. Variational Inference
     • Decouple the parameters using a variational distribution Q
     • Fit the variational parameters to approximate the true posterior distribution
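The bound optimized in the talk is not preserved in this transcript. As a reminder of the general form only, variational inference maximizes a lower bound on the log likelihood. The symbols here are assumptions and are not taken from the slides: $w$ stands for the observed words of a document pair, $h$ for all latent variables (association factor, topic proportions and topic assignments), and $\phi$ for the variational parameters.

```latex
\log p(w \mid M)
  \;\ge\;
  \mathbb{E}_{Q(h \mid \phi)}\!\left[\log p(w, h \mid M)\right]
  - \mathbb{E}_{Q(h \mid \phi)}\!\left[\log Q(h \mid \phi)\right]
```

The gap between the two sides equals the Kullback-Leibler divergence from $Q(h \mid \phi)$ to the true posterior $p(h \mid w, M)$, so raising the bound with respect to $\phi$ moves $Q$ closer to the posterior.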

  16. Variational Parameters

  17. Model Fitting

  18. LAA Ranking Methods
     • Direct Ranking
        • Ranking function for a candidate document pair
        • Word frequency can distort the probability
     • Latent Ranking

  19. Two-Step Ranking
     • Separate Topic Models
        • Source document has topic proportion
        • Target document has topic proportion
     • Topic-Level Mapping
        • Canonical Correlation Analysis captures the association between the topic proportions
     • Rank Target Documents

  20. Experiments
     • Datasets
        • IT-Change: Changes made to an IT environment and the consequent problems
           • 24,317 document pairs
           • 20,000 used for training, the remaining 4,317 used for testing
        • IT-Solution: IT problems and their solutions
           • 19,696 document pairs
           • 15,000 used for training, the remaining 4,696 used for testing
     • Evaluation
        • Randomly select 100 document pairs from the testing dataset
        • For each source document, rank the 100 target documents
        • Use the rank of the correct target document as the accuracy measure
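The evaluation protocol on this slide can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: `evaluate`, `rank_of_correct_target` and the word-overlap scorer `overlap_score` are hypothetical names, and the overlap scorer is only a stand-in for an actual ranking function such as LAA direct or latent ranking. Ties are resolved optimistically, which is an assumption the slide does not address.

```python
import random


def rank_of_correct_target(scores, correct_index):
    """1-based rank of the correct target among all candidates.

    Only candidates with a strictly higher score are counted as ranked
    above it, so ties are resolved in favor of the correct target.
    """
    correct = scores[correct_index]
    return 1 + sum(1 for s in scores if s > correct)


def evaluate(pairs, score_fn, sample_size=100, seed=0):
    """Sample test pairs, then rank every sampled target for each sampled source."""
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    targets = [target for _, target in sample]
    ranks = []
    for i, (source, _) in enumerate(sample):
        scores = [score_fn(source, target) for target in targets]
        ranks.append(rank_of_correct_target(scores, i))
    return ranks


def overlap_score(source, target):
    """Stand-in scorer (NOT LAA): number of distinct shared tokens."""
    return len(set(source) & set(target))


# Hypothetical toy test set of (source tokens, target tokens) pairs.
toy_pairs = [
    (["db2", "upgrade", "patch"], ["db2", "logon", "failure", "patch"]),
    (["firewall", "rule", "change"], ["network", "timeout", "firewall"]),
    (["disk", "quota", "reduce"], ["disk", "full", "error"]),
]

ranks = evaluate(toy_pairs, overlap_score)
mean_rank = sum(ranks) / len(ranks)
```

A lower rank is better: a rank of 1 means the correct target document was placed first among the sampled candidates.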

  21. Accuracy Analysis

  22. Example

  23. Summary
     • The LAA framework is capable of modeling two document sets associated by a bipartite graph
     • One-to-many and many-to-one mappings of documents are taken into consideration
     • We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications
     • The LAA-latent ranking algorithm ranks the correct target document higher than other state-of-the-art algorithms

  24. Acknowledgment
     • Prof. Louise E. Moser
     • Prof. Xifeng Yan
     • Dr. Shu Tao
     • Dr. Ziyu Guan
     • Dr. Nikos Anerousis

  25. Q & A? Thanks!

  26. Unigram Model

  27. Mixture of Unigrams

  28. Probabilistic Latent Semantic Indexing

  29. LDA and CTM
     [Figure labels, one set per model: topic 1; topic 2; topic 3]
