The University of Sheffield TREC-9 Q & A System
Sam Scott and Rob Gaizauskas
Department of Computer Science, University of Sheffield

Summary
• The University of Sheffield QA system for TREC-9 (QA-LaSIE) represents a significant development over the version of the same system entered for TREC-8:
• Significantly better mean reciprocal rank (MRR) scores on the TREC-8 training data.
• Significantly better MRR scores on the TREC-9 data than the previous system obtained on the TREC-8 data.
• This better performance is achieved using many of the same lower-level components as the TREC-8 system:
• An information retrieval (IR) system narrows the test collection down to a set of candidate passages.
• A modified information extraction (IE) system performs syntactic and semantic analysis of the question and the candidate answer texts.

Quick Review of Results
Table 1. Comparison of the Sheffield QA system from last year (TREC-8) and this year (TREC-9) using the TREC-8 data.
Table 2. Comparison of this year’s Sheffield QA system to the average of all systems entered on the TREC-9 task.

Overall System Setup
Question + indexed TREC document collection → Okapi passage retrieval → top passages → text filter → QA-LaSIE → ranked answers.

System Details
Top Level Components
1. Information Retrieval. First, the indexed documents are passed to the Okapi Passage Retrieval System to narrow down the amount of data for the QA system.
2. Filtering. Top-ranked passages from the IR system are passed through a text filter to deal with idiosyncrasies of the TREC data.
3. Question Answering. Filtered passages are passed to the QA-LaSIE system, which parses and builds a semantic representation of both the query and the top-ranked answer passages, and then tries to match a query variable in the question with an entity in the answer text’s semantic representation.

Okapi (Information Retrieval)
The Okapi IR system uses a probabilistic retrieval model. The passage retrieval system returns a variable-length passage from min to max paragraphs long, where min and max can be set by the user. After a number of preliminary experiments, we found that the best performance on the TREC-8 QA task was obtained with min=1 and max=3, with the top 20 such passages fed to the QA-LaSIE system.
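The poster reports only the parameter settings for this step (min=1, max=3, top 20 passages). As a hedged illustration of how such a probabilistic passage-retrieval stage might look, the sketch below scores passages of one to three consecutive paragraphs with a BM25-style weighting and keeps the top 20. It is not the Okapi implementation: the names bm25_score and retrieve_passages, the tokenized-paragraph input format, and the precomputed document frequencies are all assumptions.

```python
# Minimal sketch of BM25-style probabilistic passage retrieval (not the
# actual Okapi code): passages are windows of min_par..max_par consecutive
# paragraphs, and the top_n highest-scoring passages go on to QA-LaSIE.
import math
from collections import Counter

def bm25_score(query_terms, passage_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one passage against the query with a BM25 weighting."""
    tf = Counter(passage_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf or term not in doc_freq:
            continue
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        norm = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * len(passage_terms) / avg_len))
        score += idf * norm
    return score

def retrieve_passages(query_terms, documents, doc_freq, avg_len,
                      min_par=1, max_par=3, top_n=20):
    """documents: list of (doc_id, [paragraph_token_list, ...]).
    Returns the top_n passages as (score, doc_id, start_paragraph, n_paragraphs)."""
    candidates = []
    for doc_id, paragraphs in documents:
        for start in range(len(paragraphs)):
            for size in range(min_par, max_par + 1):
                if start + size > len(paragraphs):
                    break
                window_terms = [t for par in paragraphs[start:start + size] for t in par]
                score = bm25_score(query_terms, window_terms, doc_freq,
                                   len(documents), avg_len)
                candidates.append((score, doc_id, start, size))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:top_n]
```

In the actual system, the retrieved passages are then filtered and handed to QA-LaSIE, as described under Top Level Components above.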
QA-LaSIE in Action
• QA-LaSIE is a modified version of the LaSIE IE system. Special adaptations for the QA task are described below.
• Question parsing: phrase structure rules are used to parse different question types and produce a quasi-logical form (QLF) representation which contains:
• a qvar predicate identifying the sought entity
• a qattr predicate identifying the property or relation whose value is sought for the qvar (this may not always be present).

Worked example
Q: Who released the internet worm?
A: Morris testified that he released the worm…
Question QLF: qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), worm(e3), det(e3,the), name(e4,’Internet’), qual(e3,e4), lobj(e2,e3)
Answer QLF: person(e1), name(e1,’Morris’), testify(e2), lsubj(e2,e1), release(e3), pronoun(e4,he), lsubj(e3,e4), worm(e5), lobj(e3,e5), proposition(e6), main_event(e6,e3), lobj(e2,e6)
Sentence Score: 2   Entity Score (e1): 0.91   Total (normalized): 0.97
Answers:
Shef50ea: “Morris”
Shef50: “Morris testified that he released the internet wor” (50-byte cutoff)
Shef250 / Shef250p: see the Answer Generation description below for these outputs.

QA-LaSIE Components
The question and each candidate answer document pass through all nine components of the QA-LaSIE system in order:
1. Tokenizer. Identifies token boundaries and text section boundaries.
2. Gazetteer Lookup. Matches tokens against domain-specific lists and labels them with appropriate name categories.
3. Sentence Splitter. Identifies sentence boundaries in the text body.
4. Brill Tagger. Assigns one of the 48 Penn TreeBank part-of-speech tags to each token in the text.
5. Tagged Morph. Identifies the root form and inflectional suffix for tokens tagged as nouns or verbs.
6. Parser. Performs two-pass bottom-up chart parsing, first with a special named-entity grammar, then with a general phrasal grammar. A “best parse” (possibly partial) is selected and a quasi-logical form (QLF) of each sentence is constructed. For the QA task, a special grammar module identifies the “sought entity” of a question and forms a special QLF representation for it (see the worked example above).
7. Name Matcher. Matches variants of named entities across the text.
8. Discourse Interpreter. Adds the QLF representation to a semantic net containing background world and domain knowledge. Additional information inferred from the input is added to the model, and coreference resolution is attempted between instances mentioned in the text. For the question answering task, special code was added to find and score a possible answer entity from each sentence in the answer texts (see Sentence Scoring and Entity Scoring below).
9. TREC-9 Question Answering Module. Examines the scores for each possible answer entity (see 8 above) and outputs the top 5 answers formatted for each of the four submitted runs (see Answer Generation below).

Sentence Scoring: The coreference system from the LaSIE discourse interpreter resolves coreferring entities both within the answer texts and between the answer and question texts. The main verb in the question is also matched to similar verbs in the answer text. Each entity in the question is a “constraint”, and candidate answer sentences get one point for each constraint they contain.

Entity Scoring: Each entity in a candidate answer sentence receives a normalized score based on: a) its semantic and property similarity to the qvar, b) its relation to any “constraints”, and c) whether it shares with the qvar the same relation to a matched verb.

Answer Generation: The 5 highest-scoring entities were used as the basis for the TREC-9 answer output. For the shef50ea run, the name of the entity was output if available; otherwise the longest realization of the entity in the text was output. For the shef50 run, the first occurrence of the answer entity in the text was output: if it was less than 50 bytes long, the output was the entire sentence or a 50-byte window around the answer, whichever was shorter. Shef250 was the same as shef50 but with a limit of 250 bytes. Finally, shef250p was the same as shef250 but with extra padding from the surrounding text allowed, up to a 250-byte maximum.
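To make the scoring and answer-windowing steps above concrete, here is a minimal sketch. It is illustrative only, not the QA-LaSIE code: the equal weighting of the three entity-score factors is an assumption (the poster does not state how they are combined), the constraint sets in the example are approximate, and answer_window slices characters rather than raw bytes.

```python
# Minimal sketch of the scoring and answer-window logic described above.
# Illustrative only: weights, helper names and the example data are assumptions.

def sentence_score(question_entities, sentence_entities):
    """One point per question 'constraint' (entity) found in the candidate
    sentence, after coreference has mapped variant mentions together."""
    return sum(1 for constraint in question_entities if constraint in sentence_entities)

def entity_score(semantic_similarity, constraint_overlap, shares_verb_relation,
                 weights=(1 / 3, 1 / 3, 1 / 3)):
    """Normalized score in [0, 1] combining the three factors listed above.
    The equal weights are an assumption; the poster does not give the weighting."""
    return (weights[0] * semantic_similarity
            + weights[1] * constraint_overlap
            + weights[2] * (1.0 if shares_verb_relation else 0.0))

def answer_window(sentence, answer_start, answer_end, limit=50):
    """shef50/shef250-style output: the whole sentence if it fits within
    `limit` bytes, otherwise a `limit`-sized window around the answer entity
    (simplified: slices characters rather than raw bytes)."""
    if len(sentence.encode("utf-8")) <= limit:
        return sentence
    centre = (answer_start + answer_end) // 2
    start = max(0, centre - limit // 2)
    return sentence[start:start + limit]

# Rough rendering of the worked example above (constraint sets are approximate).
question_constraints = {"release", "worm", "Internet"}
answer_sentence_entities = {"Morris", "release", "worm"}
print(sentence_score(question_constraints, answer_sentence_entities))  # -> 2
print(answer_window("Morris testified that he released the internet worm in 1988.", 0, 6))
# -> "Morris testified that he released the internet wor"
```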
Results and Conclusions

Evaluation on the TREC-8 Data
During development, we evaluated our new version of QA-LaSIE against two benchmark systems: the version of QA-LaSIE submitted to TREC-8 and a naïve passage retrieval strategy using OKAPI. The test data was the TREC-8 question set, scored with the automatic evaluation script provided by NIST. We achieved significant improvements in Mean Reciprocal Rank (MRR) over both benchmark systems on the TREC-8 question set.

Table 3. Full results for the four University of Sheffield runs compared to various baselines on the 198 TREC-8 questions.

Results on the TREC-9 Data
Mean reciprocal rank scores for the four Sheffield runs are shown in Table 4, for both lenient and strict scoring. We have also computed the percentage of questions for which a correct answer was present in the top 5 answers returned by the system. At the time of writing, full results for all systems had not been made available, but sufficient information was available to compute the mean score for systems in each category; these figures are also shown.

Table 4. Results for the four QA-LaSIE runs on the TREC-9 questions.

Discussion: These results represent a significant decline from the best results of this system on the TREC-8 questions. However, we are pleased with our improvements over the system entered last year. Detailed failure analysis remains to be performed on our TREC-9 system, but we suspect that much of the performance decline relative to the TREC-8 questions is due to the TREC-9 questions coming from real question logs rather than artificial back-formulations of sentences found in the answer texts.

Contact Details
Email – R.Gaizauskas@dcs.shef.ac.uk
WWW – http://www.dcs.shef.ac.uk/~robertg