
Finding Similar Questions in Large Question and Answer Archives. Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee. ACM CIKM '05.

Question Answering from Frequently Asked Question Files. Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro and Scott Schoenberg. AI Magazine, Summer 1997.


Presentation Transcript


  1. Question Answering from Frequently Asked Question Files. Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro and Scott Schoenberg. AI Magazine, Summer 1997 • Finding Similar Questions in Large Question and Answer Archives. Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee. ACM CIKM ‘05 • Presented by Mat Kelly, CS895 – Web-based Information Retrieval, Old Dominion University, December 13, 2011

  2. What is FAQ Finder? • Answers a user’s question by matching it to questions already asked in a site’s FAQ file • 4 Assumptions • Information is in QA format • All information needed to determine the relevance of a QA pair can be found in the QA pair itself • The Q half of a QA pair is the most relevant part for matching to the user’s question • Broad, shallow knowledge of language is sufficient for question matching

  3. How Does It Work? • Uses the SMART IR system to narrow the search to relevant FAQ files • Iterates through the QA pairs in a FAQ file, comparing each against the user’s question and computing a score from 3 metrics • Statistical term-vector similarity score t • Semantic similarity score s • Coverage score c • T, S and C are constant weights that adjust the system’s reliance on each metric
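
The slide names three metric scores (t, s, c) and three constant weights (T, S, C), but the combining formula did not survive in this transcript. A minimal sketch, assuming the scores are combined as a weighted average normalized by T + S + C; the function name `match_score` and the default weights are illustrative, not taken from the slides:

```python
def match_score(t, s, c, T=1.0, S=1.0, C=1.0):
    """Combine the three metric scores into one match score.

    t, s, c: term-vector, semantic and coverage scores for one QA pair.
    T, S, C: constant weights (default values here are placeholders).
    Assumption: the combination is a weighted average.
    """
    total = T + S + C
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return (T * t + S * s + C * c) / total


# Equal weights give a plain average of the three scores.
print(match_score(0.8, 0.5, 0.2))
```

Setting a weight to zero removes that metric entirely, which is how an ablation of a single component could be expressed with this sketch.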

  4. Calculating Similarity • Each QA pair is represented as a term vector with a significance value for each term in the pair • Significance value = tfidf • n (term freq) = # times the term appears in the QA pair • m = # QA pairs in the file in which the term appears • M = # QA pairs in the file • tfidf = n × log(M/m) • log(M/m) evaluates the relative rarity of a term within the file • It is used as a factor to weight the frequency of the term in the QA pair
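
The tfidf definition on this slide can be sketched directly. This assumes each QA pair is already tokenized into a list of terms and that M is the number of QA pairs in the file; `tfidf_vectors` is an illustrative name:

```python
import math
from collections import Counter


def tfidf_vectors(qa_pairs):
    """Build one tfidf term vector per QA pair.

    qa_pairs: list of token lists, one list per QA pair in a FAQ file.
    Returns a list of dicts mapping term -> n * log(M / m).
    """
    M = len(qa_pairs)
    # m[term] = number of QA pairs in which the term appears at least once
    m = Counter(term for pair in qa_pairs for term in set(pair))
    vectors = []
    for pair in qa_pairs:
        n = Counter(pair)  # n[term] = times the term appears in this pair
        vectors.append({term: n[term] * math.log(M / m[term]) for term in n})
    return vectors


pairs = [["reboot", "system"], ["system", "crash", "crash"]]
for vector in tfidf_vectors(pairs):
    print(vector)
```

A term that occurs in every QA pair ("system" here) gets weight 0, because log(M/m) = log(1) = 0.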

  5. Nuances • There are many ways to express the same question • In large documents, synonymous terms are often used, so variations in wording have no effect • However, FAQ Finder matches on a small # of terms, so the system needs a means of matching synonyms • How do I reboot my system? • What do I do when my computer crashes? • Causal relationship resolved with WordNet

  6. WordNet • Semantic network of English words • Provides relations between words and synonym sets, and among synonym sets • FAQ Finder uses it through a marker-passing algorithm • Compares each word in the user’s question to each word in the FAQ file question

  7. WordNet (cont…) • Not a single semantic network; different sub-networks exist for nouns, verbs, etc. • Syntactically ambiguous words (e.g. run) appear in more than one network • Simply relying on the default word sense worked as well as any more sophisticated technique

  8. Evaluating Performance • Corpus drawn from the log file of the system’s use, May-Dec 1996 • 241 questions used • A manual scan found answers for 138 questions; 103 questions had no answer • Assumes there is a single correct QA pair • Because this task differs from the conventional IR problem, recall and precision have to be redefined

  9. Why Redefine Recall & Precision? • RECALL – typically the % of relevant docs that are retrieved for a query • PRECISION – typically the % of retrieved docs that are relevant • When there is only one correct doc, these are not independent • e.g. query returns 5 QA pairs • FAQ Finder returns either 100% recall and 20% precision OR • Returns 0% recall, 0% precision • If no answer exists, precision = 0%, recall = undefined
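
The 100%/20% versus 0%/0% example can be checked with the conventional definitions. A small sketch; the function name and the QA-pair ids are made up for illustration, and `None` stands in for "undefined":

```python
def precision_recall(returned, relevant):
    """Conventional precision and recall for one query.

    returned: list of distinct retrieved item ids.
    relevant: collection of relevant item ids (may be empty).
    Recall is None when no relevant item exists.
    """
    hits = len(set(returned) & set(relevant))
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else None
    return precision, recall


returned = ["qa1", "qa2", "qa3", "qa4", "qa5"]
print(precision_recall(returned, ["qa3"]))  # correct pair is among the 5
print(precision_recall(returned, ["qa9"]))  # correct pair was missed
print(precision_recall(returned, []))       # no answer exists in the file
```

With a single correct QA pair and a fixed number of returned pairs, precision is fully determined by recall, which is why the two measures stop being informative separately.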

  10. Redefining Recall & Precision • Recall(new) = % of questions for which FAQ Finder returns a correct answer when one exists • Does not penalize the system if there is >1 correct answer (unlike the original definition) • Instead of precision, calculate rejection • Rejection = % of questions that FAQ Finder correctly reports as being unanswered in the file • Adjusted by setting a cutoff point for the minimum allowable match score • There is still a tradeoff between rejection and recall • Rejection threshold too high: some correct answers are eliminated • Rejection threshold too low: incorrect answers are given to the user when no answer exists
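
The two redefined measures can be sketched over a log of evaluated questions. The `Trial` record and its field names are hypothetical, and the denominators (answerable questions for recall, unanswerable questions for rejection) are an assumption about how the percentages are taken:

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answer_exists: bool     # a correct QA pair exists in the FAQ file
    returned_correct: bool  # a correct pair was among those returned
    reported_none: bool     # the system reported that no answer was found


def recall_new(trials):
    """Fraction of answerable questions that got a correct answer."""
    answerable = [t for t in trials if t.answer_exists]
    if not answerable:
        return None
    return sum(t.returned_correct for t in answerable) / len(answerable)


def rejection(trials):
    """Fraction of unanswerable questions correctly reported as such."""
    unanswerable = [t for t in trials if not t.answer_exists]
    if not unanswerable:
        return None
    return sum(t.reported_none for t in unanswerable) / len(unanswerable)


log = [
    Trial(True, True, False),
    Trial(True, False, False),
    Trial(False, False, True),
    Trial(False, False, False),
]
print(recall_new(log), rejection(log))
```

Raising the match-score cutoff makes `reported_none` true more often, which helps rejection but can turn `returned_correct` false for borderline matches.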

  11. Results • Correct file appears within the top 5 files returned 88% of the time, and in first position 48% of the time • Equates to 88% recall, 23% precision • System confidently returns garbage when there is no correct answer in the file

  12. Ablation Study • Evaluates the different components of the matching scheme by disabling them • Conditions: • QA pairs selected randomly from the FAQ file • Coverage score used by itself • Semantic scores from WordNet used in eval • Term-vector comparison used in isolation

  13. Conditions’ Contributions • WordNet and the statistical technique both contribute strongly • Their combination yields better results than either individually

  14. Where FAQ Finder Fails • Biggest cause of missed answers is undue weight given to semantically useless words • Where can I find woodworking plans for a futon? • woodworking is weighted as strongly as futon • Inside the woodworking FAQ, futon should be much more important than woodworking, which applies to everything • Other problem: violation of the assumptions about FAQ files

  15. Conclusion • When there is an existing collection of Qs & As, question answering can be reduced to matching new questions against QA pairs • The power of the approach comes from FAQ Finder’s use of highly organized knowledge sources that are designed to answer commonly asked Qs

  16. Citing Paper’s Objectives • Find questions in archive semantically similar to user’s question. • Resolve: • Two questions that have the same meaning use very different wording • Similarity measures developed for document retrieval work poorly when there is little word overlap.

  17. Approaches Toward The Word Mismatch Problem • Use knowledge databases such as machine-readable dictionaries (required by the first paper’s approach) • Current quality and structure are insufficient • Employ manual rules and templates • Expensive and hard to scale for large collections • Use statistical techniques from IR and natural language processing • Most promising, given enough training data

  18. Problems with the Statistical Approach • Need: large # of semantically similar but lexically different sentence or Q pairs • No such collection exists on a large scale • Researchers artificially generate collections through methods like translation and subsequent reverse translation • The paper proposes an automatic way of building collections of semantically similar questions from existing Q&A collections

  19. Question & Answer Archives • Naver – leading portal site in S. Korea • Avg length of Q title = 5.8 words • Avg length of Q body = 49 words • Avg length of answer = 179 words • Made 2 test collections from the archive • A: 6.8M QA pairs across all categories • B: 68k QA pairs from the “Computer Novice” category

  20. Need: Sets of topics with relevance judgments • 2 sets of 50 QA pairs randomly selected • First set from Collection A, chosen across all categories • Second set from Collection B, chosen from the “Computer Novice” category • Each pair converted to a topic • Q title = short query • Q body = long query • Answer = supplemental query (used only in the relevance judgment procedure)

  21. Find Relevant QA Pairs • Given a topic, employ the TREC pooling technique • 18 different retrieval results generated by varying retrieval algorithm, query type & search field • Retrieval models such as Okapi BM25, query likelihood and overlap coefficient used • Pooled the top 20 QA pairs from each result, then did manual relevance judgments • A QA pair is considered relevant as long as it is semantically identical or very similar to the query • If no relevant QA pairs were found for a given topic, manually browsed the collection to find ≥1 QA pair • Result = 785 relevant QA pairs for A, 1557 for B
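
The pooling step amounts to taking the union of the top-ranked items from every retrieval run before judging. A minimal sketch, assuming each run is a ranked list of QA-pair ids; `pool` and the sample ids are illustrative:

```python
def pool(runs, depth=20):
    """Union of the top `depth` items from each ranked run.

    runs: list of ranked lists of QA-pair ids, best first.
    Returns the set of ids that would go to manual relevance judgment.
    """
    pooled = set()
    for ranked in runs:
        pooled.update(ranked[:depth])
    return pooled


runs = [
    ["q7", "q2", "q9"],  # e.g. one retrieval model on the title field
    ["q2", "q4", "q7"],  # e.g. another model on the body field
]
print(sorted(pool(runs, depth=2)))
```

Because runs overlap, the pool is usually much smaller than (number of runs × depth), which keeps manual judging tractable.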

  22. Verifying Field Importance • Prev. research: similarity between questions is more important than similarity between Qs & As in the FAQ retrieval task • Exp. 1: Search only the Q title field • Exp. 2: Only the Q body • Exp. 3: Only the answer • For all exps, use the query likelihood model with Dirichlet smoothing and Okapi BM25 • Regardless of retrieval model, the best performance comes from searching the question title field • Performance gaps to the other fields are significant
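
For reference, query likelihood with Dirichlet smoothing scores a field (title, body or answer) by the log-probability of the query under a smoothed language model of that field. A sketch using the standard form P(w|D) = (c(w, D) + μ·P(w|C)) / (|D| + μ); the value of μ and all names are assumptions, not settings reported in the slides:

```python
import math
from collections import Counter


def ql_dirichlet(query, doc, collection_counts, collection_len, mu=2000.0):
    """Log query likelihood of `query` given `doc` with Dirichlet smoothing.

    query, doc: lists of tokens.
    collection_counts: term -> count over the whole collection.
    collection_len: total number of tokens in the collection.
    mu: smoothing parameter (2000 is a common default, assumed here).
    """
    doc_counts = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = collection_counts.get(w, 0) / collection_len
        p = (doc_counts[w] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


d1 = ["install", "printer", "driver"]
d2 = ["reboot", "frozen", "computer"]
coll = Counter(d1 + d2)
print(ql_dirichlet(["printer"], d1, coll, 6))
print(ql_dirichlet(["printer"], d2, coll, 6))
```

Running the same scorer over only the title, only the body, or only the answer tokens is one way to reproduce the three experimental conditions.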

  23. Collecting Semantically Similar Questions • Many people don’t search to see if a Q has already been asked, so they ask a semantically similar Q • Assume: if two answers are similar, then the corresponding Qs are semantically similar but lexically different • Sample semantically similar questions with little word overlap

  24. Algorithm • Consider 4 popular document similarity measures: • Cosine similarity with vector space model • Negative KL divergence between language models • Output score of query likelihood model • Score of Okapi model

  25. Finding a Similarity Measure: The Cosine Similarity Model • Length of answers varies considerably • Some very short (factoids) • Others very long (copied and pasted from the web) • Any similarity measure affected by length is not appropriate

  26. Finding a Similarity Measure: Negative KL Divergence & Okapi • Values are not symmetric and are not probabilities • A pair of answers with a higher negative KL divergence than another pair does not necessarily have a stronger semantic connection • Hard to rank pairs • The Okapi model has similar problems
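
The asymmetry mentioned here is easy to see numerically. A small sketch with two made-up two-term distributions, assuming q is non-zero wherever p is non-zero:

```python
import math


def kl(p, q):
    """KL divergence D(p || q) for two discrete distributions.

    p, q: equal-length sequences of probabilities; q[i] must be
    non-zero wherever p[i] is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q), kl(q, p))  # the two directions give different values
```

Since D(p || q) and D(q || p) differ, the score of "answer A against answer B" does not equal the score of "B against A", and neither value is a probability that could be compared across pairs.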

  27. Finding a Similarity Measure: Query Likelihood Model • Score is a probability • Can be compared across different answer pairs • Scores are NOT symmetric

  28. Overcoming Problems • Using ranks instead of scores was more effective • If answer A retrieves answer B at rank r1 and answer B retrieves answer A at rank r2, then the similarity between the 2 answers = reverse harmonic mean of the two ranks: sim(A, B) = (1/r1 + 1/r2) / 2 • Use the query likelihood model to calculate the initial ranks
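
Reading "reverse harmonic mean" as the reciprocal of the harmonic mean of the two ranks gives (1/r1 + 1/r2) / 2. A sketch under that reading; `rank_similarity` is an illustrative name:

```python
def rank_similarity(r1, r2):
    """Symmetric similarity from two retrieval ranks (1 = top rank).

    r1: rank at which answer A retrieves answer B.
    r2: rank at which answer B retrieves answer A.
    Assumes the reciprocal of the harmonic mean: (1/r1 + 1/r2) / 2.
    """
    if r1 < 1 or r2 < 1:
        raise ValueError("ranks start at 1")
    return 0.5 * (1.0 / r1 + 1.0 / r2)


print(rank_similarity(1, 1))  # mutual top hits
print(rank_similarity(1, 2))
print(rank_similarity(2, 1))  # same value: the measure is symmetric
```

Unlike the raw query likelihood scores, this value is the same in both directions and lies in (0, 1], so pairs can be ranked against each other.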

  29. Experiments & Results • 68,000*67,999/2 answer pairs possible from 68,000 Q&A pairs in Collection B • All ranked using the established measure • Empirically set threshold of 0.005 • Used to judge whether a pair is related or not • Higher threshold = smaller but better quality collection • To acquire enough training samples, the threshold cannot be too high • 331,965 pairs have a score above the threshold
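
The pair count is "68,000 choose 2", and the selection step is a filter on the similarity score. A short sketch; the triples in `scored` are invented sample data, and the strict ">" follows the slide's wording "above threshold":

```python
from math import comb

# Number of unordered answer pairs among 68,000 Q&A pairs.
n_pairs = comb(68_000, 2)
print(f"{n_pairs:,}")


def select_pairs(scored_pairs, threshold=0.005):
    """Keep (a, b) for every (a, b, score) with score above the threshold."""
    return [(a, b) for a, b, score in scored_pairs if score > threshold]


scored = [("a1", "a2", 0.75), ("a1", "a3", 0.004), ("a2", "a3", 0.02)]
print(select_pairs(scored))
```

The 331,965 retained pairs are a tiny fraction of the roughly 2.3 billion candidates, which illustrates how selective even a low threshold is.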

  30. Word Translation Probabilities • The question pair collection is treated as a parallel corpus • IBM Model 1 • Does not require any linguistic knowledge of the src/target language; treats every word alignment equally • Translation probability from src word s to target word t: P(t|s) = λs^-1 × Σ(i=1..N) c(t|s; Ji) • c(t|s; Ji) = expected count of s translating to t in pair Ji • λs = normalization factor, so that the probs for s sum to 1 • N = # training samples • Ji = ith pair in the training set

  31. Word Translation Probabilities (cont) • Expected count for src word s and target word t in pair Ji: c(t|s; Ji) = P(t|s) / (P(t|s1) + … + P(t|sn)) × #(t,Ji) × #(s,Ji) • {s1,…,sn} = words in the src sentence in Ji • #(t,Ji) = number of times t occurs in Ji • Still need: old translation probs • We initialize the translation probs with random values, then estimate new translation probs • Repeat until the probs converge • Procedure always converges to the same final solution [1] • [1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263-311, 1993.
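
The iterate-until-convergence procedure can be sketched as the usual IBM Model 1 EM loop. This is a simplified illustration, not the authors' implementation: it starts from uniform rather than random probabilities (for reproducibility), uses no NULL source word, and runs a fixed number of iterations instead of testing convergence; `train_model1` is a hypothetical name.

```python
from collections import Counter, defaultdict


def train_model1(pairs, iterations=10):
    """Estimate P(t|s) from (source_tokens, target_tokens) pairs.

    Returns a nested mapping prob[s][t] = P(t|s).
    Assumptions: uniform initialization, no NULL word, fixed iterations.
    """
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        for src, tgt in pairs:
            src_n, tgt_n = Counter(src), Counter(tgt)
            for t, t_count in tgt_n.items():
                # Sum of P(t|s_j) over every source token in this pair
                denom = sum(n * prob[s][t] for s, n in src_n.items())
                for s, s_count in src_n.items():
                    # Expected count of s translating to t in this pair
                    counts[s][t] += prob[s][t] / denom * t_count * s_count
        new_prob = defaultdict(lambda: defaultdict(float))
        for s, row in counts.items():
            total = sum(row.values())  # normalization factor for s
            for t, c in row.items():
                new_prob[s][t] = c / total
        prob = new_prob
    return prob


pairs = [(["a", "b"], ["x", "y"]), (["a"], ["x"])]
model = train_model1(pairs)
print(dict(model["a"]), dict(model["b"]))
```

In the toy data, "a" co-occurs with "x" in both pairs, so probability mass shifts toward P(x|a), which in turn frees "y" to be explained by "b".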

  32. Experiments & Results (Word Translation) • Removed stop words • Collection of 331,965 Q pairs duplicated by switching the src and target parts, then used as input • Usually, the most similar word to a given word is the word itself • Found semantic relationships, e.g. “bmp” is found to be similar to “jpg” and “gif”

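  33. Question Retrieval • How are the word translation probs used to retrieve Q titles? • Similarity between query Q and document D: sim(Q, D) ≈ P(Q|D) = Π(w∈Q) P(w|D) • To avoid 0 probs and estimate more accurate lang. models, smooth with the collection: P(w|D) = (1 − λ) Pml(w|D) + λ Pml(w|C) • Pml(w|C), Pml(w|D) = prob that term w is generated from collection C / document D • In the translation model, convert to: P(w|D) = (1 − λ) Σ(t∈D) T(w|t) Pml(t|D) + λ Pml(w|C)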
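
A translation-based language model can be sketched as a mixture of a translated document model and a collection model. The form used below, P(w|D) = (1 − λ) Σ_t T(w|t)·Pml(t|D) + λ·Pml(w|C), is an assumption consistent with the slide's description; λ = 0.5, the table `trans` and all names are illustrative:

```python
import math
from collections import Counter


def translation_lm_score(query, doc, trans, collection_counts,
                         collection_len, lam=0.5):
    """Log P(Q|D) under a translation-based language model.

    trans[t][w] = T(w|t), the probability that document word t
    translates to query word w. lam is the smoothing weight (assumed).
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    log_p = 0.0
    for w in query:
        # Translated document model: sum over the words t in the document
        p_trans = 0.0
        if doc_len:
            p_trans = sum(trans.get(t, {}).get(w, 0.0) * n / doc_len
                          for t, n in doc_counts.items())
        p_coll = collection_counts.get(w, 0) / collection_len
        p = (1 - lam) * p_trans + lam * p_coll
        if p == 0.0:
            return float("-inf")
        log_p += math.log(p)
    return log_p


trans = {"jpg": {"jpg": 0.6, "bmp": 0.4}, "printer": {"printer": 1.0}}
d1 = ["convert", "jpg"]
d2 = ["install", "printer"]
coll = Counter(d1 + d2 + ["bmp"])
print(translation_lm_score(["bmp"], d1, trans, coll, 5))
print(translation_lm_score(["bmp"], d2, trans, coll, 5))
```

The query "bmp" shares no word with either title, yet the title containing "jpg" scores higher because T(bmp|jpg) is non-zero, which is the lexical-mismatch behaviour the slides describe.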

  34. Experiments & Results (Question Retrieval) • 50 short queries from Collection B, searching only the title field • Similarities between query Q and Q titles calculated • Compare performance of the translation model with the vector space model w/ cosine similarity, Okapi BM25 and the query likelihood language model

  35. Experiments & Results cont… (Question Retrieval) • Approach outperforms the baseline models at all recall levels • QL and Okapi show comparable performance • In all evaluations, the approach outperforms the other models

  36. Conclusions and Seminal Paper Relevance • A retrieval model based on translation probs learned from the archive significantly outperforms other approaches in finding semantically similar questions despite lexical mismatch • Using translation probabilities and determining similarity of answers is a much more robust approach for resolving similar QA pairs, with fewer prerequisites on the corpus

  37. References • Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N., & Schoenberg, S. (1997). Question answering from frequently asked question files: Experience with the FAQ Finder system (Tech. Rep.). Chicago, IL, USA. • Jeon, J., Croft, W. B., & Lee, J. H. (2005). Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05). ACM, New York, NY, USA, 84-90.
