
Question Answering: Framework, Types, and Evaluation

This talk provides an overview of question answering technology, including different question types, a generic question answering framework, evaluating question answering systems, and a detailed example. It also discusses desktop question answering and compares it to other online systems.


Presentation Transcript


  1. AnswerFinder: Question Answering from your Desktop. Mark A. Greenwood, Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK

  2. Outline of Talk • What is Question Answering? • Different Question Types • A Generic Question Answering Framework • Evaluating Question Answering Systems • System Description • Question Typing • Information Retrieval • Locating Possible Answers • A Detailed Example • Results and Evaluation • Desktop Question Answering • A Brief Comparison to Other On-Line Question Answering Systems • Conclusions and Future Work CLUK, 7th Annual Research Colloquium

  3. What is Question Answering? • The main aim of QA is to present the user with a short answer to a question rather than a list of possibly relevant documents. • As it becomes more and more difficult to find answers on the WWW using standard search engines, question answering technology will become increasingly important. • Answering questions using the web is already enough of a problem for it to appear in fiction (Marshall, 2002): “I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogotá… I’m the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd.” CLUK, 7th Annual Research Colloquium

  4. Different Question Types • Clearly there are many different types of questions: • When was Mozart born? • Question requires a single fact as an answer. • Answer may be found verbatim in text, e.g. “Mozart was born in 1756”. • How did Socrates die? • Finding an answer may require reasoning. • In this example ‘die’ has to be linked with drinking poisoned wine. • How do I assemble a bike? • The full answer may require fusing information from many different sources. • The complexity can range from simple lists to script-based answers. • Is the Earth flat? • Requires a simple yes/no answer. • The systems outlined in this presentation attempt to answer the first two types of question. CLUK, 7th Annual Research Colloquium

  5. A Generic QA Framework • A search engine is used to find the n most relevant documents in the document collection. • These documents are then processed with respect to the question to produce a set of answers which are passed back to the user. • Most of the differences between question answering systems are centred around the document processing stage. CLUK, 7th Annual Research Colloquium
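To make the framework concrete, the fragment below is a minimal sketch of such a pipeline, assuming the retrieval and document-processing stages are supplied as callables (retrieve, type_question and extract_answers are hypothetical stand-ins, not the actual AnswerFinder code).

```python
def answer_question(question, retrieve, type_question, extract_answers, n=20):
    """Generic QA pipeline: fetch the n most relevant documents, then process
    them with respect to the question to produce a set of candidate answers."""
    documents = retrieve(question, n)           # search engine stage
    expected_type = type_question(question)     # e.g. 'measurement:distance'
    return extract_answers(documents, expected_type, question)  # answers for the user
```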

  6. Evaluating QA Systems • The biggest independent evaluations of question answering systems have been carried out at TREC (the Text REtrieval Conference) over the past five years. • Five hundred factoid questions are provided and the groups taking part have a week in which to process the questions and return one answer per question. • No changes to systems are allowed between the time the questions are received and the time at which the answers are submitted. • Not only do these annual evaluations give groups a chance to see how their systems perform against those from other institutions but, more importantly, they are slowly building an invaluable collection of resources, including questions and their associated answers, which can be used for further development and testing. • Different metrics have been used over the years but the current metric is simply the percentage of questions correctly answered. CLUK, 7th Annual Research Colloquium
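Because the current metric is just the proportion of questions answered correctly, it can be expressed in a few lines. The sketch below assumes a simplified representation of the judgements (dictionaries keyed by question id) and is not the official TREC scoring software.

```python
def percent_correct(predicted, gold):
    """predicted: question id -> system answer; gold: question id -> set of
    acceptable answer strings. Returns the percentage answered correctly."""
    correct = sum(1 for qid, answer in predicted.items()
                  if answer in gold.get(qid, set()))
    return 100.0 * correct / len(gold)
```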

  7. Outline of Talk • What is Question Answering? • Different Question Types • A Generic Question Answering Framework • Evaluating Question Answering Systems • System Description • Question Typing • Information Retrieval • Locating Possible Answers • A Detailed Example • Results and Evaluation • Desktop Question Answering • A Brief Comparison to Other On-Line Question Answering Systems • Conclusions and Future Work CLUK, 7th Annual Research Colloquium

  8. System Description • Many of the systems which have proved successful in previous TREC evaluations have made use of a fine-grained set of answer types. • One system (Harabagiu et al., 2000) has an answer type DOG BREED. • The answer typology described in (Hovy et al., 2000) contains 94 different answer types. • The original idea behind building the QA system underlying AnswerFinder was to determine how well a system which relied only on a fine-grained set of answer types could perform. • The completed system consists of three distinct phases: • Question Typing • Information Retrieval • Locating Possible Answers CLUK, 7th Annual Research Colloquium

  9. Question Typing • The first stage of processing is to determine the semantic type of the expected answer. • The semantic type, S, is determined through rules which examine the question, Q: • If Q contains ‘congressman’ and does not start with ‘where’ or ‘when’ then S is person:male • If Q contains ‘measured in’ then S is measurement_unit • If Q contains ‘university’ and does not start with ‘who’, ‘where’ or ‘when’ then S is organization • If Q contains ‘volcano’ and does not start with ‘who’ or ‘when’ then S is location • The current system includes rules which can detect 46 different answer types. CLUK, 7th Annual Research Colloquium
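The example rules above translate almost directly into code. The toy function below re-implements just those four rules to show the flavour of the approach; the real system contains rules covering 46 answer types.

```python
def type_question(question):
    """Return the semantic type S of the expected answer, or None if unknown."""
    q = question.lower()

    def starts_with(*words):
        return any(q.startswith(w) for w in words)

    if 'congressman' in q and not starts_with('where', 'when'):
        return 'person:male'
    if 'measured in' in q:
        return 'measurement_unit'
    if 'university' in q and not starts_with('who', 'where', 'when'):
        return 'organization'
    if 'volcano' in q and not starts_with('who', 'when'):
        return 'location'
    return None  # unknown type: such questions can never be answered by the system
```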

  10. Information Retrieval • This is by far the simplest part of the question answering system, with the question being passed, as is, to an appropriate search engine: • Okapi is used to search the AQUAINT collection when answering the TREC questions. • XXXXXXXX is used to search the Internet when using AnswerFinder as a general purpose question answering system. • The top n relevant documents, as determined by the search engine, are then retrieved ready for the final processing stage. CLUK, 7th Annual Research Colloquium

  11. Locating Possible Answers • The only answers we attempt to locate are entities which the system can recognise. • Locating possible answers consists therefore of extracting all entities of the required type from the relevant documents. • Entities are currently extracted using modified versions of the gazetteer lists and named entity transducer supplied with the GATE 2 framework (Cunningham et al., 2002). • All entities of the correct type are retained as possible answers unless they fail one or both of the following tests: • The document the current entity appears in must contain all the entities in the question. • A possible answer entity must not contain any of the question words (ignoring stopwords). CLUK, 7th Annual Research Colloquium
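A simplified version of the two tests might look as follows; plain lower-cased string containment stands in for the real gazetteer/transducer-based entity matching, so this is a sketch of the logic rather than the actual implementation.

```python
def passes_filters(candidate, candidate_doc_text, question_entities,
                   question_words, stopwords):
    """Keep a candidate answer only if it passes both tests from the slide."""
    # Test 1: the document containing the candidate must contain all of the
    # entities mentioned in the question.
    if not all(e.lower() in candidate_doc_text.lower() for e in question_entities):
        return False
    # Test 2: the candidate must not contain any question word (ignoring stopwords).
    candidate_words = set(candidate.lower().split())
    question_content_words = {w.lower() for w in question_words} - stopwords
    return not (candidate_words & question_content_words)
```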

  12. Locating Possible Answers • All the remaining entities are then grouped together using the following equivalence test (Brill et al., 2001): Two answers are said to be equivalent if all of the non-stopwords in one are present in the other or vice versa. • The resulting answer groups are then ordered by: • the frequency of occurrence of all answers within the group • the highest ranked document in which an answer in the group appears. • This sorted list (or the top n answers) is then presented, along with a supporting snippet, to the user of the system. CLUK, 7th Annual Research Colloquium
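The equivalence test and the two-level ordering can be sketched as below. Representing each candidate as an (answer text, document rank) pair is an assumption made for the example rather than the system's internal data structure.

```python
def equivalent(a, b, stopwords):
    """Brill et al. (2001): two answers are equivalent if all non-stopwords
    of one are present in the other, or vice versa."""
    wa = set(a.lower().split()) - stopwords
    wb = set(b.lower().split()) - stopwords
    return wa <= wb or wb <= wa

def rank_answers(candidates, stopwords):
    """Group equivalent answers, then sort groups by frequency and by the
    best (i.e. lowest) rank of any document an answer was found in."""
    groups = []
    for text, doc_rank in candidates:
        for group in groups:
            if equivalent(text, group['answers'][0], stopwords):
                group['answers'].append(text)
                group['best_rank'] = min(group['best_rank'], doc_rank)
                break
        else:
            groups.append({'answers': [text], 'best_rank': doc_rank})
    groups.sort(key=lambda g: (-len(g['answers']), g['best_rank']))
    return groups
```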

  13. A Detailed Example • Q: How high is Everest? • Question typing: if Q contains ‘how’ and ‘high’ then the semantic class, S, is measurement:distance • D1: Everest’s 29,035 feet is 5.4 miles above sea level… • D2: At 29,035 feet the summit of Everest is the highest… • Known entities (count, entity): 2 × location(‘Everest’), 2 × measurement:distance(‘29,035 feet’), 1 × measurement:distance(‘5.4 miles’) • Answer: 29,035 feet CLUK, 7th Annual Research Colloquium
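Feeding toy data for the Everest example through the ranking sketch above reproduces the behaviour shown on the slide (the document ranks are invented for illustration):

```python
stopwords = {'is', 'the', 'of', 'at', 'a', 'how', 'high'}
candidates = [('29,035 feet', 1),   # from D1
              ('5.4 miles', 1),     # from D1
              ('29,035 feet', 2)]   # from D2
for group in rank_answers(candidates, stopwords):
    print(group['answers'][0], 'x', len(group['answers']))
# '29,035 feet' is ranked first with frequency 2, matching the answer on the slide.
```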

  14. Outline of Talk • What is Question Answering? • Different Question Types • A Generic Question Answering Framework • Evaluating Question Answering Systems • System Description • Question Typing • Information Retrieval • Locating Possible Answers • A Detailed Example • Results and Evaluation • Desktop Question Answering • A Brief Comparison to Other On-Line Question Answering Systems • Conclusions and Future Work CLUK, 7th Annual Research Colloquium

  15. Results and Evaluation • The underlying system was tested over the 500 factoid questions used in TREC 2002 (Voorhees, 2002): • Results for the question typing stage were as follows: • 16.8% (84/500) of the questions were of an unknown type and hence could never be answered correctly. • 1.44% (6/416) of those questions which were typed were given the wrong type and hence could never be answered correctly. • Therefore the maximum attainable score of the entire system, irrespective of any further processing, is 82% (410/500). • Results for the information retrieval stage were as follows: • At least one relevant document was found for 256 of the correctly typed questions. • Therefore the maximum attainable score of the entire system, irrespective of further processing, is 51.2% (256/500). CLUK, 7th Annual Research Colloquium
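The upper bounds quoted above follow directly from the stage-by-stage figures; the arithmetic is reproduced below purely as a sanity check.

```python
total = 500
untyped = 84                 # questions of unknown type
mistyped = 6                 # questions given the wrong type
typed_correctly = total - untyped - mistyped        # 410
print(100 * typed_correctly / total)                # 82.0 -> ceiling after typing
with_relevant_doc = 256                             # correctly typed, >= 1 relevant doc
print(100 * with_relevant_doc / total)              # 51.2 -> ceiling after retrieval
```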

  16. Results and Evaluation • Results for the question answering stage were as follows: • 25.6% (128/500) of the questions were correctly answered by the system using this approach. • These results are not overly impressive, especially when compared with the best performing systems, which can answer approximately 85% of the same five hundred questions (Moldovan et al., 2002). • Users of web search engines are, however, used to looking at a set of relevant documents and so would probably be happy looking at a handful of short answers. • If we examine the top five answers returned for each question then the system correctly answers 35.8% (179/500) of the questions, which is 69.9% (179/256) of the maximum attainable score. • If we examine all the answers returned for each question then 38.6% (193/500) of the questions are correctly answered, which is 75.4% (193/256) of the maximum attainable score, but this involves displaying over 20 answers per question. CLUK, 7th Annual Research Colloquium

  17. Outline of Talk • What is Question Answering? • Different Question Types • A Generic Question Answering Framework • Evaluating Question Answering Systems • System Description • Question Typing • Information Retrieval • Locating Possible Answers • A Detailed Example • Results and Evaluation • Desktop Question Answering • A Brief Comparison to Other On-Line Question Answering Systems • Conclusions and Future Work CLUK, 7th Annual Research Colloquium

  18. Desktop Question Answering • Question answering may be an interesting research topic but what is needed is an application that is as simple to use as a modern web search engine: • No training or special knowledge should be required to use it. • It must respond within a reasonable period of time. • Answers should be exact but should also be supported by a small snippet of text so that users don’t have to read the supporting document to verify the answer. • AnswerFinder attempts to meet all of these requirements… CLUK, 7th Annual Research Colloquium

  19. Desktop Question Answering • When was Gustav Holst born? … and get the answer! (screenshots of AnswerFinder answering the question) CLUK, 7th Annual Research Colloquium

  20. Brief Comparison - PowerAnswer • PowerAnswer is developed by the team responsible for the best performing TREC system. • At TREC 2002 their entry answered approx. 85% of the questions. • Unfortunately PowerAnswer acts more like a search engine than a question answering system: • Each answer is a sentence or long phrase. • No attempt is made to cluster/remove sentences which contain the same answers. • This is surprising, as the TREC results show that this system is very good at finding a single exact answer to a question. CLUK, 7th Annual Research Colloquium

  21. Brief Comparison - AnswerBus • Very similar to PowerAnswer in that: • The answers presented are full sentences. • No attempt is made to cluster/remove sentences containing the same answer. • The interesting thing to note about AnswerBus is that questions can be asked in more than one language; English, French, Spanish, German, Italian or Portuguese – although all answers are given in English. • The developer claims the system answers 70.5% of the TREC 8 questions, although • The TREC 8 question set is not a good reflection of real world questions, • Finding exact answers, as the TREC evaluations have shown, is a harder task than simply finding answer bearing sentences. CLUK, 7th Annual Research Colloquium

  22. Brief Comparison - NSIR • The NSIR system, from the University of Michigan, is much closer to AnswerFinder than PowerAnswer or AnswerBus: • It uses standard web search engines to find relevant documents. • It returns a list of ranked exact answers. • Unfortunately no context or confidence level is given for each answer, so users would still have to refer to the relevant documents to verify that a given answer is correct. • NSIR was entered in TREC 2002, correctly answering 24.2% of the questions. • This is very similar to the 25.6% obtained by AnswerFinder over the same question set. CLUK, 7th Annual Research Colloquium

  23. Brief Comparison - IONAUT • IONAUT is the system closest to AnswerFinder when viewed from the user’s perspective: • A ranked list of answers is presented. • Supporting snippets of context are also displayed. • Unfortunately the exact answers are not linked to specific snippets, so it is not immediately clear which snippet supports which answer. • This problem is compounded by the fact that multiple snippets may support a single answer, as no attempt has been made to cluster/remove snippets which support the same answer. CLUK, 7th Annual Research Colloquium

  24. Outline of Talk • What is Question Answering? • Different Question Types • A Generic Question Answering Framework • Evaluating Question Answering Systems • System Description • Question Typing • Information Retrieval • Locating Possible Answers • A Detailed Example • Results and Evaluation • Desktop Question Answering • A Brief Comparison to Other On-Line Question Answering Systems • Conclusions and Future Work CLUK, 7th Annual Research Colloquium

  25. Conclusions • The original aim in developing the underlying question answering system was to determine how well a system relying only on a fine-grained set of answer types would perform. • The system answers approximately 26% of the TREC 11 questions. • The average performance by participants in TREC 11 was 22%. • The best performing system at TREC 11 scored approximately 85%. • The aim of developing AnswerFinder was to provide access to question answering technology in a manner similar to current web search engines: • An interface similar to a web browser is used both to enter the question and to display the answers. • The answers are displayed in a similar fashion to standard web search results. • Very little extra time is required to locate possible answers over and above simply collecting the relevant documents. CLUK, 7th Annual Research Colloquium

  26. Future Work • The question typing stage could be improved either through the addition of more rules or by replacing the rules with an automatically acquired classifier (Li and Roth, 2002). • It should be clear that increasing the number of entity types we can recognise will increase the percentage of questions we can answer. Unfortunately this is a task that is both time-consuming and never-ending. • A possible extension to this approach is to include answer extraction patterns (Greenwood and Gaizauskas, 2003). • These patterns are enhanced regular expressions in which certain tags will match multi-word terms. • For example, questions such as “What does CPR stand for?” generate patterns such as “NounChunK ( X )”, where CPR is substituted for X to select a noun chunk that will be suggested as a possible answer. CLUK, 7th Annual Research Colloquium
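As a rough illustration of the pattern idea, the snippet below instantiates a “NounChunk ( X )” style pattern for “What does CPR stand for?”. A capitalised word sequence is used as a crude stand-in for a real noun-chunk tag, so this is only a sketch of the technique described in (Greenwood and Gaizauskas, 2003), not the actual pattern matcher.

```python
import re

def expand_acronym(acronym, text):
    """Instantiate 'NounChunk ( X )' with X = acronym and return the chunk
    immediately preceding the bracketed acronym, if any. A run of capitalised
    words approximates the NounChunk tag here."""
    pattern = r'((?:[A-Z][a-z]+\s+)+)\(\s*' + re.escape(acronym) + r'\s*\)'
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

print(expand_acronym('CPR', 'He was given Cardiopulmonary Resuscitation (CPR) at the scene.'))
# -> 'Cardiopulmonary Resuscitation'
```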

  27. Any Questions? Copies of these slides can be found at: http://www.dcs.shef.ac.uk/~mark/phd/work/ AnswerFinder can be downloaded from: http://www.dcs.shef.ac.uk/~mark/phd/software/

  28. Bibliography
  Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais and Andrew Ng. Data-Intensive Question Answering. In Proceedings of the 10th Text REtrieval Conference, 2001.
  Hamish Cunningham, Diana Maynard, Kalina Bontcheva and Valentin Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
  Mark A. Greenwood and Robert Gaizauskas. Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering. In Proceedings of the Workshop on Natural Language Processing for Question Answering (EACL03), pages 29–34, Budapest, Hungary, April 14, 2003.
  Sanda Harabagiu, Dan Moldovan, Marius Paşca, Rada Mihalcea, Mihai Surdeanu, Răzvan Bunescu, Roxana Gîrju, Vasile Rus and Paul Morărescu. FALCON: Boosting Knowledge for Answer Engines. In Proceedings of the 9th Text REtrieval Conference, 2000.
  Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk and Chin-Yew Lin. Question Answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference, 2000.
  Xin Li and Dan Roth. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02), 2002.
  Michael Marshall. The Straw Men. HarperCollins Publishers, 2002.
  Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Morarescu, Finley Lacatusu, Adrian Novischi, Adriana Badulescu and Orest Bolohan. LCC Tools for Question Answering. In Proceedings of the 11th Text REtrieval Conference, 2002.
  Ellen M. Voorhees. Overview of the TREC 2002 Question Answering Track. In Proceedings of the 11th Text REtrieval Conference, 2002.
  CLUK, 7th Annual Research Colloquium
