A Trainable Multi-factored QA System
Radu Ion, Dan Ştefănescu, Alexandru Ceauşu, Dan Tufiş, Elena Irimia, Verginica Barbu-Mititelu
Research Institute for Artificial Intelligence, Romanian Academy
ResPubliQA
• We participated in the Romanian–Romanian ResPubliQA task
• 500 legal-domain questions to be answered from the Romanian part of the JRC-Acquis (10,714 documents)
• The questions were translated from other languages, which makes the task harder: the translated terms are not necessarily expressed the same way in the actual Romanian documents
Corpus processing and indexing
• POS tagging, lemmatization, chunking
• Only the 'body' part of each document was indexed (no annexes, no headers)
• Two Lucene indexes: a document index and a paragraph index
• What is indexed: lemmas, plus paragraph classes for the paragraph index
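A minimal sketch of how one preprocessed paragraph could be added to the Lucene paragraph index, assuming the Lucene 2.9/3.0-era API that was current at the time; the field names (docid, lemmas, pclass), the analyzer choice and the sample values are illustrative, not taken from the actual system.

```java
import java.io.File;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ParagraphIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("paragraph-index"));
        // Lemmas come out of the preprocessing already tokenized,
        // so a whitespace analyzer is enough at indexing time.
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

        // One Lucene document per paragraph of the 'body' part.
        Document para = new Document();
        para.add(new Field("docid", "jrc31979R2173-ro", Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        para.add(new Field("lemmas", "comisie adopta regulament privind piata comun",
                Field.Store.YES, Field.Index.ANALYZED));
        // Paragraph class assigned by the ME classifier (e.g. PROCEDURE).
        para.add(new Field("pclass", "PROCEDURE", Field.Store.YES,
                Field.Index.NOT_ANALYZED));

        writer.addDocument(para);
        writer.close();
    }
}
```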
QA flow
• A web-service-based pipeline:
• Question preprocessing using TTL (http://ws.racai.ro/ttlws.wsdl)
• Question classification using a maximum entropy (ME) classifier (http://shadow.racai.ro/JRCACQCWebService/Service.asmx)
• Query generation, in two variants: TF-IDF based and chunk based (http://shadow.racai.ro/QADWebService/Service.asmx)
• Search engine interrogation (http://www.racai.ro/webservices/search.asmx)
• Paragraph relevance score computation and paragraph reordering
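The flow can be read as a sequence of service calls. The sketch below mirrors the steps above with hypothetical local interfaces standing in for the SOAP services; the real system invokes the WSDL endpoints listed above, and the interface and method names here are illustrative only.

```java
import java.util.List;

// Hypothetical local stand-ins for the SOAP services in the flow above
// (TTL preprocessing, ME question classifier, query generator, search engine).
interface QaServices {
    List<String> preprocess(String question);             // tagging, lemmatization, chunking
    String classifyQuestion(List<String> lemmas);          // question class (ME classifier)
    String buildQuery(List<String> lemmas, String type);   // type: "tfidf" or "chunk"
    List<Paragraph> search(String query);                  // paragraph index interrogation
}

class Paragraph {
    String id;
    String text;
    double luceneScore;
}

class QaPipeline {
    private final QaServices services;

    QaPipeline(QaServices services) {
        this.services = services;
    }

    // Runs the whole flow for one question and one query type.
    List<Paragraph> answer(String question, String queryType) {
        List<String> lemmas = services.preprocess(question);
        String questionClass = services.classifyQuestion(lemmas);
        String query = services.buildQuery(lemmas, queryType);
        List<Paragraph> candidates = services.search(query);
        // Relevance scoring and reordering of 'candidates' (next slides)
        // would use 'lemmas' and 'questionClass'; returned unchanged here.
        return candidates;
    }
}
```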
The combined QA system
• To account for NOA ("no answer") strings, which, when given, increase the overall performance measure, we combined two results:
• The QA system using the TF-IDF query
• The QA system using the chunk-based query
• When the same paragraph was returned among the top K (= 3) paragraphs by the two systems, it was given as the answer
• Otherwise, the NOA string was returned
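A sketch of this combination rule, assuming that when several paragraphs are shared by the two top-3 lists, the one ranked highest by the TF-IDF run is returned; the class and method names are illustrative.

```java
import java.util.List;

public class RunCombiner {
    static final String NOA = "NOA";
    static final int K = 3;

    // Returns the id of a paragraph found in the top K of both runs
    // (taking the one ranked highest by the TF-IDF run), or NOA otherwise.
    static String combine(List<String> tfidfTopIds, List<String> chunkTopIds) {
        List<String> tfidfTopK = tfidfTopIds.subList(0, Math.min(K, tfidfTopIds.size()));
        List<String> chunkTopK = chunkTopIds.subList(0, Math.min(K, chunkTopIds.size()));
        for (String id : tfidfTopK) {
            if (chunkTopK.contains(id)) {
                return id;
            }
        }
        return NOA;
    }
}
```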
Paragraph relevance
• s1 to s5 are paragraph relevance scores, combined with weights λ1 to λ5
• The λi are trained by iteratively computing MRR scores on a 200-question development set over all sets of weights that sum to 1
• Retaining the weight values that yield the largest MRR results in a MERT-like training procedure
• The increment step was 0.01
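A sketch of the MERT-like weight search, assuming the paragraph relevance is the linear combination Σi λi·si with Σi λi = 1; the brute-force enumeration with step 0.01 follows the slide, while the data layout (a per-question score matrix and gold paragraph indices) is illustrative.

```java
public class WeightSearch {

    // scores[q][p][i] = value of score s_(i+1) for candidate paragraph p of question q;
    // goldIndex[q]    = index of the correct paragraph among question q's candidates.
    static double[] train(double[][][] scores, int[] goldIndex) {
        double step = 0.01, eps = 1e-9;
        double bestMrr = -1.0;
        double[] best = null;
        // Brute-force enumeration of all weight vectors with step 0.01 and sum 1.
        for (double l1 = 0; l1 <= 1 + eps; l1 += step)
         for (double l2 = 0; l1 + l2 <= 1 + eps; l2 += step)
          for (double l3 = 0; l1 + l2 + l3 <= 1 + eps; l3 += step)
           for (double l4 = 0; l1 + l2 + l3 + l4 <= 1 + eps; l4 += step) {
               double[] w = {l1, l2, l3, l4, 1 - l1 - l2 - l3 - l4};
               double mrr = mrr(scores, goldIndex, w);
               if (mrr > bestMrr) { bestMrr = mrr; best = w; }
           }
        return best;  // the weights giving the largest MRR on the dev set
    }

    // Mean reciprocal rank of the correct paragraph under the given weights.
    static double mrr(double[][][] scores, int[] goldIndex, double[] w) {
        double sum = 0;
        for (int q = 0; q < scores.length; q++) {
            double goldRel = relevance(scores[q][goldIndex[q]], w);
            int rank = 1;
            for (double[] candidate : scores[q])
                if (relevance(candidate, w) > goldRel) rank++;
            sum += 1.0 / rank;
        }
        return sum / scores.length;
    }

    // relevance = sum_i lambda_i * s_i
    static double relevance(double[] s, double[] w) {
        double rel = 0;
        for (int i = 0; i < s.length; i++) rel += w[i] * s[i];
        return rel;
    }
}
```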
Relevance scores
• Lucene scores for document retrieval and for paragraph retrieval
• A BLEU-like relevance score, which is high if a candidate paragraph contains the question keywords in much the same order as in the question
• An indicator variable that is 1 if the candidate paragraph has the same class as the question (0 otherwise)
• A lexical-chains-based score (a real number quantifying the semantic distance between the question and the candidate paragraph)
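The slide does not give the exact formula of the BLEU-like ordering score; the sketch below shows one plausible way to reward paragraphs that contain the question keywords in roughly the question's order, using a longest-common-subsequence ratio over lemmas. It is an assumption-based illustration, not the system's actual formula.

```java
import java.util.List;

public class OrderScore {

    // Longest-common-subsequence ratio between the question keywords and the
    // paragraph lemmas: close to 1 when the paragraph contains the keywords
    // in much the same order as the question, lower when they are missing
    // or heavily reordered.
    static double score(List<String> questionKeywords, List<String> paragraphLemmas) {
        int n = questionKeywords.size();
        int m = paragraphLemmas.size();
        if (n == 0 || m == 0) return 0.0;
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                lcs[i][j] = questionKeywords.get(i - 1).equals(paragraphLemmas.get(j - 1))
                        ? lcs[i - 1][j - 1] + 1
                        : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
        return (double) lcs[n][m] / n;  // normalized by the number of question keywords
    }
}
```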
Evaluations
• Official results
• Second run: the query contained the question class
Post CLEF2009 Evaluations
• Results with all 500 questions answered (no NOA strings)
• With parameters trained for every question class, we obtain an overall accuracy of 0.5774 (29 additional correctly answered questions)
Post CLEF2009 Evaluations (II)
• Some other informative measures:
• Answering precision: correct / answered
• Rejection precision: (1 – correct) / unanswered
• AP(icia092roro) = 75.58%
• RP(icia092roro) = 86.53%
• The system is thus able to refuse to give wrong answers at a high rate, which is a merit in itself (made visible by the c@1 measure), even if it could not have answered the rejected questions with the same precision
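A sketch of how these measures can be computed, with c@1 as defined for ResPubliQA 2009 (unanswered questions are credited in proportion to the accuracy on the answered ones); reading rejection precision as the fraction of withheld questions whose candidate answer would have been wrong is an assumption, and the method name is illustrative.

```java
public class Measures {

    // n               = total questions (answered + unanswered)
    // correctAnswered = correctly answered questions
    // wrongIfForced   = unanswered questions whose withheld candidate answer
    //                   would have been wrong (assumed reading of RP)
    static void report(int n, int answered, int correctAnswered, int wrongIfForced) {
        int unanswered = n - answered;

        double ap = (double) correctAnswered / answered;   // answering precision
        double rp = (double) wrongIfForced / unanswered;   // rejection precision (assumed)
        // c@1 (ResPubliQA 2009): unanswered questions are credited in
        // proportion to the accuracy achieved on the answered ones.
        double cAt1 = (correctAnswered + unanswered * ((double) correctAnswered / n)) / n;

        System.out.printf("AP = %.2f%%  RP = %.2f%%  c@1 = %.4f%n",
                100 * ap, 100 * rp, cAt1);
    }
}
```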
Conclusions
• A multi-factored QA system can easily be extended with new paragraph relevance scores
• It is also easily adaptable to new domains and/or languages
• Update: better correlation between the document and paragraph relevance scores
• Future plans: to develop the English QA system along the same lines and combine the En–Ro outputs
Conclusions (II)
• Competition drives innovation, but let's not forget that these tools are there to help users
• A useful requirement: QA systems should be available on the Web
• Ours is at http://www2.racai.ro/sir-resdec/