IR4QA: An Unhappy Marriage Mark A. Greenwood
Outline of Talk • Background • ‘Ancient’ History • Recent Past • An Uncertain Future • Possible New Directions
Background Although QA is not new, the language processing community has yet to develop a clearly articulated and commonly accepted guiding framework and research methodology, parallel to that of IR, MT, or text summarization. As a result, despite ten years of system evaluations in the TREC QA track for specific kinds of questions and answers, the community does not have a clear idea how much progress was made during that period for QA in general. OAQA09 Call for Papers
Background • We will focus here on the selection of promising documents which can be subjected to further processing in order to extract exact answers to questions. • The common approach to this problem has been to employ an IR engine to retrieve a small set of relevant documents, an area of work that has become known as IR4QA. • The rest of this talk will explain • How we got to this point • Why it is fundamentally flawed • Where we might go from here
Outline of Talk • Background • ‘Ancient’ History • Recent Past • An Uncertain Future • Possible New Directions
‘Ancient’ History • Traditionally IR and QA were separate research areas • They had different users and goals • The inputs and outputs to both systems were radically different • Both had their own strengths and weaknesses
‘Ancient’ History • Early QA systems were usually just interfaces to structured data • LUNAR (Woods, 1973) • BASEBALL (Green et al., 1961) • Those systems which worked over text were usually based around reading comprehension exercises and used scenario templates • SAM (Schank and Abelson, 1977) • Questions varied in length but asked for information that was not already known to the user • Systems were not open-domain, e.g. LUNAR only knew about moon rocks
‘Ancient’ History • In comparison to QA systems, early IR systems could be applied to any document collection • Performance varied from collection to collection, but in principle any collection could be indexed and searched • Queries were usually quite long and described the documents the user was looking for • The CACM collection is a good example • Systems returned full documents, not exact answers • As the user already knew what they were looking for this was OK • Full documents don't help when you don't know what you are looking for, as you then have to read all the returned documents
Outline of Talk • Background • ‘Ancient’ History • Recent Past • An Uncertain Future • Possible New Directions
Recent Past • Recent QA research has been guided by the TREC evaluations • The TREC QA track was originally conceived as a task that would interest both the IR and IE communities • Focused IR • Open-Domain IE • It was hoped that over time the two communities would work together to develop new combined approaches • Unfortunately it would seem that the IR community is not, on the whole, interested in the QA task
Recent Past • Most, if not all, modern QA systems have adopted a (roughly) three stage architecture: question analysis, document retrieval, and answer extraction.
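To make the three stages concrete, here is a minimal sketch of that pipeline. All function names and the simple term-overlap retrieval are illustrative assumptions for this talk, not a description of any particular system.

```python
# Minimal sketch of the common three-stage QA architecture.
# Function names and the keyword-overlap retrieval are illustrative
# assumptions, not a description of any specific system.

def analyse_question(question):
    """Derive a query and a (very rough) expected answer type from the question."""
    query_terms = [t.lower() for t in question.split() if len(t) > 3]
    answer_type = "PERSON" if question.lower().startswith("who") else "OTHER"
    return query_terms, answer_type

def retrieve_documents(query_terms, corpus, top_n=20):
    """Rank documents by simple term overlap (a stand-in for a real IR engine)."""
    scored = [(sum(doc.lower().count(t) for t in query_terms), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)[:top_n] if score > 0]

def extract_answers(documents, answer_type):
    """Placeholder answer extraction: return candidate sentences for further processing."""
    return [s.strip() for doc in documents for s in doc.split(".") if s.strip()]

def answer(question, corpus):
    query_terms, answer_type = analyse_question(question)
    documents = retrieve_documents(query_terms, corpus)
    return extract_answers(documents, answer_type)
```

The document retrieval step in the middle is the IR4QA component discussed in the rest of this talk.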
Recent Past • IR4QA has not been aggressively researched by the community, yet we know that... • IR performance places an upper bound on end-to-end performance – a commonly quoted figure is 60% (Tellex et al., 2003) • Even if we look at the top 1000 documents, no relevant documents are returned for 8% of the questions (Hovy et al., 2000) • Most systems use off-the-shelf IR components with little or no tuning to the task, e.g. Lucene, Okapi... • Complex multi-query strategies have been tried in an effort to solve the problem, but they only serve to highlight how poor performance at this step actually is.
Recent Past • IR4QA has focused on the development and evaluation of the document retrieval component in such systems. • The main problems are • QA researchers are not IR researchers • We don’t fully understand the intricate details of IR engines • QA and IR are fundamentally different tasks
Recent Past • A commonly accepted evaluation framework consists of two measures (Roberts and Gaizauskas, 2004), sketched in code below • Coverage – the proportion of questions for which at least one answer bearing document is retrieved • Redundancy – the average number of answer bearing documents retrieved per question
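The two measures are straightforward to compute once relevance judgements are available. In this sketch the input format (a mapping from question id to the number of answer bearing documents retrieved for it) is an assumed layout for illustration.

```python
# Sketch of the coverage and redundancy measures of Roberts and Gaizauskas (2004).
# The input layout (question id -> number of answer bearing documents retrieved)
# is an assumption made for this example.

def coverage(answer_bearing_counts):
    """Proportion of questions with at least one answer bearing document retrieved."""
    questions = len(answer_bearing_counts)
    answered = sum(1 for count in answer_bearing_counts.values() if count > 0)
    return answered / questions if questions else 0.0

def redundancy(answer_bearing_counts):
    """Average number of answer bearing documents retrieved per question."""
    questions = len(answer_bearing_counts)
    return sum(answer_bearing_counts.values()) / questions if questions else 0.0

run = {"q1": 3, "q2": 0, "q3": 1}
print(coverage(run))    # 0.666...
print(redundancy(run))  # 1.333...
```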
Recent Past • There have been two workshops focused on the problem of IR4QA • Sheffield, SIGIR 2004 • Manchester, Coling 2008 • The main conclusions of both were that • IR4QA is very hard • Approaches that lead to increased IR performance do not necessarily lead to appreciable increases in end-to-end performance • Selection of documents shouldn’t be performed in isolation from the rest of the system
Outline of Talk • Background • ‘Ancient’ History • Recent Past • An Uncertain Future • Possible New Directions
An Uncertain Future • It seems clear that, on the whole, the IR community are not interested in QA • Using off-the-shelf IR components has been shown to introduce unacceptable caps on performance • The IR4QA community need to consider radically different approaches to the problem of selecting relevant documents from large corpora
Outline of Talk • Background • ‘Ancient’ History • Recent Past • An Uncertain Future • Possible New Directions
Possible New Directions • Answer extraction requires complex text processing • Answer extraction techniques don't scale well • Some form of text selection component is required • There are two orthogonal directions we could take • Continue to use traditional IR techniques but discard the traditional view of what makes a document (and/or query) • Continue to work with traditional documents but use a radically different selection approach • We need approaches that scale – working on AQUAINT-sized collections is fine for self-contained experiments but shouldn't be the end goal!
What Is A Document? • Topic Indexing and Retrieval (Ahn and Webber, 2008) discards the common idea of a document while still using a standard IR engine, retrieving answers directly rather than text • Topics are entities that answer questions • People, companies, locations etc. • Topic documents are built by simply joining together all sentences from a corpus that contain the topic (or a variation of it, e.g. Bill Clinton and William Clinton) – see the sketch below • QA is then a matter of retrieving the most relevant topic document using an IR engine and returning the associated topic as the answer
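A rough sketch of topic document construction in this spirit is shown below. The topic and variant lists are illustrative assumptions; a real system would derive them from an entity recogniser, and the resulting topic documents would then be indexed by a standard IR engine.

```python
from collections import defaultdict

# Rough sketch of topic document construction in the spirit of Ahn and
# Webber (2008). The topic/variant lists here are illustrative assumptions.

topics = {
    "Bill Clinton": ["Bill Clinton", "William Clinton"],
    "Sheffield": ["Sheffield"],
}

def build_topic_documents(sentences):
    """Concatenate every sentence mentioning a topic (or a variant) into one document."""
    topic_docs = defaultdict(list)
    for sentence in sentences:
        for topic, variants in topics.items():
            if any(v in sentence for v in variants):
                topic_docs[topic].append(sentence)
    return {topic: " ".join(sents) for topic, sents in topic_docs.items()}

# At question time the query is run against these topic documents with a
# standard IR engine, and the topic of the best matching document is the answer.
```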
Let The Data Guide You • A decade of recent QA research has yielded a lot of useful data • We have lots of example questions (at least a few thousand just from TREC) each of which... • Has a known correct answer • Is associated with at least one answer bearing document • We should use this data to guide new selection approaches. • A simple approach would be to perform query expansion by looking for terms which are often associated with correct answers to certain question types (Derczynski et al., 2008) • Look for patterns in the answer bearing documents and index collections based on these patterns rather than words
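As a minimal sketch of the data-driven query expansion idea: count the terms that co-occur with correct answers in answer bearing documents for each question type, then append the strongest associations to new queries of that type. The training data layout and the raw-count scoring below are assumptions made for illustration, not the method of Derczynski et al. (2008).

```python
from collections import Counter

# Minimal sketch of data-driven query expansion: learn, per question type,
# which terms frequently appear in answer bearing documents, then append the
# strongest of them to future queries of the same type. The data layout and
# the raw-count scoring are illustrative assumptions only.

def learn_expansion_terms(training_data, top_k=5):
    """training_data: list of (question_type, answer_bearing_document_text) pairs."""
    counts_by_type = {}
    for q_type, document in training_data:
        counts_by_type.setdefault(q_type, Counter()).update(document.lower().split())
    return {q_type: [term for term, _ in counts.most_common(top_k)]
            for q_type, counts in counts_by_type.items()}

def expand_query(query_terms, q_type, expansion_terms):
    """Append the learned terms for this question type to the original query."""
    return query_terms + expansion_terms.get(q_type, [])
```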
Answer By Understanding • I’ve always been of the opinion that QA is intelligent IR • Where intelligence equates to some level of understanding • This suggests we should index meaning not just textual content. • Take into account co-reference when selecting text passages • Indexing relations should allow for more focused selection • ‘Hybrid’ search that uses annotations and text (Bhagdev et al., 2008)
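To give a flavour of what indexing meaning alongside text could look like, the toy sketch below indexes each passage both by its tokens and by semantic annotations, and requires a match on both at query time. The annotation scheme is an illustrative assumption; this is only a caricature of the hybrid idea, not the system of Bhagdev et al. (2008).

```python
from collections import defaultdict

# Toy sketch of 'hybrid' search: index both plain tokens and semantic
# annotations for each passage, and require a match on both at query time.
# The annotation labels used here are an illustrative assumption.

text_index = defaultdict(set)        # token -> passage ids
annotation_index = defaultdict(set)  # annotation label -> passage ids

def index_passage(passage_id, text, annotations):
    """Index a passage by its tokens and by its semantic annotations."""
    for token in text.lower().split():
        text_index[token].add(passage_id)
    for label in annotations:          # e.g. "PERSON", "born_in(PERSON, LOCATION)"
        annotation_index[label].add(passage_id)

def hybrid_search(query_tokens, required_annotations):
    """Return passages that match all query tokens and all required annotations."""
    candidate_sets = [text_index[t.lower()] for t in query_tokens]
    candidate_sets += [annotation_index[a] for a in required_annotations]
    return set.intersection(*candidate_sets) if candidate_sets else set()
```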
References • Kisuh Ahn and Bonnie Webber. 2008. Topic Indexing and Retrieval for Factoid QA. In Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA). • Ravish Bhagdev, Sam Chapman, Fabio Ciravegna, Vitaveska Lanfranchi and Daniela Petrelli. 2008. Hybrid Search: Effectively Combining Keywords and Semantic Searches. In Proceedings of the 5th European Semantic Web Conference (ESWC 2008), Tenerife. • Leon Derczynski, Jun Wang, Robert Gaizauskas and Mark A. Greenwood. 2008. A Data Driven Approach to Query Expansion in Question Answering. In Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA). • Bert F. Green, Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. BASEBALL: An Automatic Question Answerer. In Proceedings of the Western Joint Computer Conference, volume 19, pages 219--224. • Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk, and Chin-Yew Lin. 2000. Question Answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference. • Ian Roberts and Robert Gaizauskas. 2004. Evaluating Passage Retrieval Approaches for Question Answering. In Proceedings of the 26th European Conference on Information Retrieval (ECIR'04), pages 72--84, University of Sunderland, UK. • Roger C. Schank and Robert Abelson. 1977. Scripts, Plans, Goals and Understanding. Lawrence Erlbaum Associates, Hillsdale, NJ. • Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41--47, Toronto, Canada, July. • William Woods. 1973. Progress in Natural Language Understanding - An Application to Lunar Geology. In AFIPS Conference Proceedings, volume 42, pages 441--450.