Finding Similar Questions in Large Question and Answer Archives

This paper discusses retrieval models for large question and answer archives, focusing on finding similar questions. It explores the use of word-to-word translation probabilities to solve the word mismatch problem in Q&A systems. Experimental results show the effectiveness of the proposed models in improving retrieval performance.

Presentation Transcript


  1. Finding Similar Questions in Large Question and Answer Archives (Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee) • Retrieval Models for Question and Answer Archives (Jiwoon Jeon, W. Bruce Croft and Xiaobing Xue) • Presenter: Sawood Alam <salam@cs.odu.edu>

  2. Finding Similar Questions in Large Question and Answer Archives • Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee • Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, MA 01003 • [jeon,croft,joonho]@cs.umass.edu • CIKM '05, Proceedings of the 14th ACM Conference on Information and Knowledge Management, 2005

  3. Introduction • Q&A systems quickly build large archives • Naver, a popular Korean search site, gets 25,000+ questions per day • Great linguistic resource • Answering questions from the archive before a human response appears

  4. Q&A Over Usual Search • Opinion or summary • Direct answers rather than relevant documents • Search in collection of questions associated with answers • Lexical similarity vs. semantic similarity • Is downloading movies illegal? • Can I share a copy of a DVD online?

  5. Solving Word Mismatch Problem • Knowledge database (machine readable dictionaries) – unreliable performance • Manual rules or templates – hard to scale • Statistical technique – most promising • Requires large training data set

  6. Question and Answer Archive • Average lengths (words) • Title: 5.8 • Body: 49 • Answer: 179

  7. Relevance Judgments • Eighteen different retrieval results (varying retrieval algorithms) • Query likelihood, Okapi BM25 and overlap coefficient • Top 20 Q&A pairs from each retrieval result • Manual judgment • Correctness of answer was ignored • Manual browsing for missing relevant Q&A pairs
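
Of the three retrieval algorithms named on slide 7, the overlap coefficient is the simplest to illustrate. The sketch below uses the standard set-based definition, |A ∩ B| / min(|A|, |B|); the transcript does not say which variant the authors used, so the function name, the whitespace tokenization, and the handling of empty input are assumptions.

```python
def overlap_coefficient(a_tokens, b_tokens):
    """Set-based overlap coefficient: |A & B| / min(|A|, |B|).

    Assumption: tokens are compared as exact strings, with no stemming
    or stop-word removal.
    """
    a, b = set(a_tokens), set(b_tokens)
    if not a or not b:
        # Assumption: an empty side is treated as having no overlap.
        return 0.0
    return len(a & b) / min(len(a), len(b))


# The two example questions from slide 4 share no words at all,
# which is exactly the word mismatch problem.
q1 = "is downloading movies illegal".split()
q2 = "can i share a copy of a dvd online".split()
mismatch_score = overlap_coefficient(q1, q2)
```

A purely lexical measure like this scores the two semantically similar questions from slide 4 at zero, which is the motivation for the translation-based models discussed later.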

  8. Field Importance

  9. Generation of Training Sample • LM-HRANK • Sim(A, B) = (1/r1 + 1/r2) / 2 • Where: • Answer A retrieves B at rank r1 • Answer B retrieves A at rank r2
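
The LM-HRANK similarity on slide 9 can be written out directly. This is a minimal sketch of the formula as stated, assuming ranks are 1-based integers; the function name and the input check are illustrative and not taken from the slides.

```python
def hrank_similarity(r1, r2):
    """Sim(A, B) = (1/r1 + 1/r2) / 2.

    r1: rank at which answer A retrieves answer B
    r2: rank at which answer B retrieves answer A
    Assumption: ranks are 1-based, so the best possible score is 1.0.
    """
    if r1 < 1 or r2 < 1:
        raise ValueError("ranks must be >= 1")
    return (1.0 / r1 + 1.0 / r2) / 2.0


# Two answers that retrieve each other at the top rank are maximally similar.
best = hrank_similarity(1, 1)
# A pair that is only weakly reciprocal scores much lower.
weak = hrank_similarity(2, 4)
```

Because both reciprocal ranks enter the average, a pair scores highly only when each answer retrieves the other near the top.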

  10. Word Translation Probabilities

  11. Experiments and Results

  12. Examples and Analysis

  13. Retrieval Models for Question and Answer Archives • Jiwoon Jeon, Google, Inc., Mountain View, CA 94043, USA, jjeon@google.com • W. Bruce Croft and Xiaobing Xue, Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, MA 01003, [croft,xuexb]@cs.umass.edu • SIGIR '08, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008

  14. Introduction • Word mismatch problem • Focus on translation-based approach • Explanation of the poor performance of the pure IBM model vs. the query-likelihood language model • Proposes a mixed model • Question part: translation-based language model • Answer part: query-likelihood language model

  15. LM vs. IBM model 1

  16. Question Part

  17. Answer Part • Gamma = 0: translation based (for question part) • Gamma = 1: query likelihood LM (for answer part) • Beta = 0: combination model
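
The formula behind slide 17 did not survive in the transcript. The sketch below shows one three-way mixture that is consistent with the listed special cases, assuming weights alpha, beta and gamma that sum to 1, maximum-likelihood estimates over raw token lists, and no collection smoothing. The names `ml_prob`, `mixture_prob` and the `trans` table are hypothetical, and the authors' actual parameterization may differ.

```python
from collections import Counter


def ml_prob(word, tokens):
    """Maximum-likelihood estimate of P(word | tokens)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def mixture_prob(word, question, answer, trans, alpha, beta, gamma):
    """Assumed mixture:
        alpha * P_ml(w | q)
      + beta  * sum_t T(w | t) * P_ml(t | q)
      + gamma * P_ml(w | a)

    trans maps (w, t) to a hypothetical translation probability T(w | t).
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha, beta and gamma must sum to 1")
    question_ml = ml_prob(word, question)
    # Translation term: mass that question words t pass on to w.
    translated = sum(
        trans.get((word, t), 0.0) * ml_prob(t, question) for t in set(question)
    )
    answer_ml = ml_prob(word, answer)
    return alpha * question_ml + beta * translated + gamma * answer_ml


question = ["is", "downloading", "movies", "illegal"]
answer = ["sharing", "a", "copy", "is", "illegal"]
trans = {("share", "downloading"): 0.2}  # made-up illustrative value

# gamma = 0: only the question part (translation based) contributes.
question_only = mixture_prob("share", question, answer, trans, 0.0, 1.0, 0.0)
# gamma = 1: only the answer part (query likelihood) contributes.
answer_only = mixture_prob("illegal", question, answer, trans, 0.0, 0.0, 1.0)
```

Under these assumptions, setting Gamma = 0 leaves only the question-side terms, Gamma = 1 leaves only the answer-side query likelihood, and Beta = 0 drops the translation term so that question and answer language models are simply combined.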

  18. Word-to-Word Translation Probability • Word “cheat” in question • “trust”, “forgive”, “dump”, “leave”, etc. in answer • Word “cheat” in answer • “husband”, “boyfriend”, etc. in question • All these words are useful for attacking the word mismatch problem • Combined probability used: P(Q|A) and P(A|Q)

  19. Examples

  20. Experimental Results

  21. Conclusions • Translation-based language model for question part and query-likelihood language model for answer part • Experiments done on a Q&A web service where people answer others' questions • Future work • Testing effect of proposed model on FAQ archives • Yahoo! Answers collection • Phrase-based machine translation rather than word-based translation