Priberam’s question answering system for Portuguese
Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
Summary • Introduction • A workbench for NLP • Lexical resources • Software tools • Question categorization • System description • Indexing process • Question analysis • Document retrieval • Sentence retrieval • Answer extraction • Evaluation & Results • Conclusions
Introduction • Goal: to build a question answering (QA) engine that finds a single exact answer to natural language (NL) questions. • Evaluation: QA@CLEF Portuguese monolingual task. • Previous work by Priberam on this subject: • LegiX – a juridical information system • SintaGest – a workbench for NLP • TRUST project (Text Retrieval Using Semantics Technology) – development of the Portuguese module in a cross-language QA environment.
Lexical resources • Lexicon: • Lemmas, inflections and POS; • Sense definitions (*); • Semantic features, subcategorization and selection restrictions; • Ontological and terminological domains; • English and French equivalents (*); • Lexical-semantic relations (e.g. derivations). (*) Not used in the QA system. • Thesaurus • Ontology: • Multilingual (**) (English, French, Portuguese) – enables translations; • Designed by Synapse Développement for TRUST (**) Only Portuguese information is used in the QA system.
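As an illustration of the lexicon above, a single entry could be modelled along these lines. The field names are assumptions mirroring the listed resources (lemma, inflections, POS, semantic features, domains, lexical-semantic relations), not Priberam's actual schema:

```python
# Hypothetical shape of one lexicon entry; all field names are
# illustrative, based only on the resource list on this slide.
entry = {
    "lemma": "presidente",
    "pos": "NOUN",
    "inflections": ["presidente", "presidentes"],
    "semantic_features": ["human"],
    "domains": ["politics"],
    "derivations": {"head": "presidir"},  # head of derivation
}

def inflections_of(lexicon, lemma):
    """Look up all inflected forms stored for a lemma."""
    return lexicon.get(lemma, {}).get("inflections", [])

lexicon = {entry["lemma"]: entry}
forms = inflections_of(lexicon, "presidente")
```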
Software tools • Priberam’s SintaGest – an NLP application that allows: • Building & testing a context-free grammar (CFG); • Building & testing contextual rules for: • Morphological disambiguation; • Named entity & fixed expression recognition; • Building & testing patterns for question categorization/answer extraction; • Compressing & compiling all data into binary files. • Statistical POS tagger: • Used together w/ contextual rules for morphological disambiguation; • HMM-based (2nd order), trained with the CETEMPúblico corpus; • Fast & efficient decoding via the Viterbi algorithm.
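To make the tagging step concrete, here is a minimal first-order Viterbi decoder (the actual tagger is second-order and trained on CETEMPúblico; the tiny state set and all probabilities below are invented for illustration):

```python
# Minimal first-order Viterbi decoding for an HMM POS tagger.
# States, transition and emission probabilities are toy values.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # Each cell holds (probability of best path, best path so far).
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), [s]) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s]
                 * emit_p[s].get(obs[t], 1e-9),
                 V[t - 1][prev][1] + [s])
                for prev in states)
            col[s] = (prob, path)
        V.append(col)
    return max(V[-1].values())[1]

states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {"DET": {"DET": 0.1, "NOUN": 0.8, "VERB": 0.1},
         "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
         "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit = {"DET": {"o": 0.9},
        "NOUN": {"presidente": 0.7, "cargo": 0.3},
        "VERB": {"desempenha": 0.8}}
tags = viterbi(["o", "presidente"], states, start, trans, emit)
```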
Question categorization (I) • 86 question categories, flat structure • <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, … • Categorization: performed through “rich” patterns (more powerful than regular expressions) • More than one category is allowed (avoiding hard decisions); • “Rich” patterns are conditional expressions w/ words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases; • Everything built & tested through SintaGest.
Question categorization (II) • There are 3 kinds of patterns: • Question patterns (QPs): for question categorization. • Answer patterns (APs): for sentence categorization (during indexation). • Question answering patterns (QAPs): for answer extraction. • The numeric value at the end of each rule is a heuristic score.

QPs:
Question (FUNCTION)
: Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15 // e.g. “Quem é Jorge Sampaio?” (“Who is Jorge Sampaio?”)
: Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15 // e.g. “Que cargo desempenha Jorge Sampaio?” (“What office does Jorge Sampaio hold?”)
;

QAPs:
Answer
: Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20 // e.g. “Jorge Sampaio é o {Presidente da República}...” (“Jorge Sampaio is the {President of the Republic}...”)
: {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15 // e.g. “O {presidente da República}, Jorge Sampaio...” (“The {President of the Republic}, Jorge Sampaio...”)
;

APs:
Answer (FUNCTION)
: QuestIdent(FUNCTION_N) = 10
: Ergonym = 10
;
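A minimal sketch of how evaluating one simplified “rich” pattern against an analysed question might work (the predicate and function names are hypothetical; the real SintaGest pattern language also supports Ont, QuestIdent and constant phrases). Tokens are dicts with word/lemma/pos attributes, mirroring the Word/Root/Cat predicates, and Distance(0,3) is modelled as a bounded gap between consecutive matches:

```python
# Hypothetical pattern matcher: predicates over analysed tokens,
# matched left-to-right with a bounded gap (cf. Distance(0,3)).
def word(w):
    return lambda t: t["word"] == w

def root(r):
    return lambda t: t["lemma"] == r

def any_cat(*cats):
    return lambda t: t["pos"] in cats

def match_pattern(tokens, predicates, max_gap=3):
    """True if predicates match in order, with at most max_gap
    intervening tokens between consecutive matches."""
    i = 0
    for k, pred in enumerate(predicates):
        window = tokens[i:] if k == 0 else tokens[i:i + max_gap + 1]
        for j, tok in enumerate(window):
            if pred(tok):
                i += j + 1
                break
        else:
            return False
    return True

# "Quem é Jorge Sampaio?" against Word(quem) Root(ser) AnyCat(Nprop, ENT)
question = [{"word": "quem", "lemma": "quem", "pos": "Pron"},
            {"word": "é", "lemma": "ser", "pos": "V"},
            {"word": "Jorge Sampaio", "lemma": "Jorge Sampaio", "pos": "Nprop"}]
qp = [word("quem"), root("ser"), any_cat("Nprop", "ENT")]
matched = match_pattern(question, qp)
```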
QA system overview • The system architecture comprises 5 major modules: indexing, question analysis, document retrieval, sentence retrieval, and answer extraction.
Indexing process • The collection of target documents is analysed (off-line) and the information is stored in an index database. • Each document first feeds the sentence analyser; • Sentence categorization: each sentence is classified with one or more question categories through the APs. • We build indices for: • Lemmas • Heads of derivation • NEs and fixed expressions • Question categories • Ontology domains (at document level)
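The indexing step above can be sketched as a simple inverted index from lemmas to document ids (the `lemmatise` callable and the toy lemma table are assumptions standing in for the full sentence analyser):

```python
# Minimal sketch of the off-line indexing step: an inverted index
# mapping lemmas to the set of documents containing them.
from collections import defaultdict

def build_index(docs, lemmatise):
    """docs: {doc_id: text}; lemmatise: callable token -> lemma."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[lemmatise(token)].add(doc_id)
    return index

# Toy lemma table standing in for the real morphological analysis.
lemmas = {"presidentes": "presidente"}
docs = {1: "presidentes da Albânia", 2: "cargo de presidente"}
idx = build_index(docs, lambda t: lemmas.get(t, t))
```

In the real system the same pass also records heads of derivation, named entities, question categories per sentence, and ontology domains per document.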
Question analysis • Input: • A NL question (e.g. “Quem é o presidente da Albânia?” – “Who is the president of Albania?”) • Procedure: • Sentence analysis • Question categorization & activation of QAPs (through the QPs) • Extraction of pivots (words, NEs, phrases, dates, abbreviations, …) • “Query expansion” (heads of derivation & synonyms) • Output: • Pivots’ lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado) • Question categories (e.g. <FUNCTION>, <DENOMINATION>) • Relevant ontological domains • Active QAPs
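The “query expansion” step can be sketched as follows (the data structures and lookup tables are hypothetical; the real expansion uses the lexicon's derivation links and the thesaurus):

```python
# Hypothetical sketch of pivot expansion: each pivot lemma is
# annotated with its head of derivation and its synonyms.
def expand_pivots(pivots, heads, synonyms):
    """pivots: list of lemmas; heads/synonyms: toy lookup tables."""
    return {p: {"head": heads.get(p),
                "synonyms": synonyms.get(p, [])}
            for p in pivots}

heads = {"presidente": "presidir"}
syns = {"presidente": ["chefe de estado"], "Albânia": ["albanês"]}
expanded = expand_pivots(["presidente", "Albânia"], heads, syns)
```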
Document retrieval • Input: • Pivots’ lemmas (wLi), heads (wHi) & synonyms (wSij) • Question categories (ck) & ontological domains (ol) • Procedure: • Word weighting ω(w) according to: • POS; • ilf (inverse lexical frequency); • idf (inverse document frequency). • Each document d is given a score σd:

σd := 0
For Each pivot i
  If d contains lemma wLi Then σd += KL·ω(wLi)
  Else If d contains head wHi Then σd += KH·ω(wHi)
  Else If d contains any synonym wSij Then σd += maxj(KS(wSij, wLi)·ω(wSij))
If d contains any question category ck Then σd += KC
If d contains any ontology domain ol Then σd += KO
σd := RewardPivotProximity(d, σd)

• Output: • The top 30 scored documents.
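The scoring loop on this slide can be transcribed into Python roughly as follows (the weight constants and data layout are illustrative placeholders, and the proximity reward is omitted for brevity):

```python
# Python transcription of the document-scoring loop: lemma matches
# score highest, then heads of derivation, then synonyms; category
# and ontology-domain overlap add fixed bonuses.
def score_document(doc, pivots, categories, domains,
                   KL=3.0, KH=2.0, KS=1.5, KC=1.0, KO=0.5):
    score = 0.0
    for p in pivots:
        if p["lemma"] in doc["lemmas"]:
            score += KL * p["weight"]
        elif p["head"] in doc["lemmas"]:
            score += KH * p["weight"]
        elif any(s in doc["lemmas"] for s in p["synonyms"]):
            score += KS * p["weight"]
    if categories & doc["categories"]:
        score += KC
    if domains & doc["domains"]:
        score += KO
    return score

doc = {"lemmas": {"presidir", "albânia"},
       "categories": {"FUNCTION"}, "domains": set()}
pivots = [
    {"lemma": "presidente", "head": "presidir", "synonyms": [], "weight": 1.0},
    {"lemma": "albânia", "head": None, "synonyms": [], "weight": 1.0},
]
score = score_document(doc, pivots, {"FUNCTION"}, {"politics"})
```

After scoring, only the top 30 documents would be kept for sentence retrieval.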
Sentence retrieval • Input: • Scored documents {(d, σd)} w/ relevant sentences marked. • Procedure: • Sentence analysis • Sentence scoring – each sentence s is given a score σs according to: • # pivots’ lemmas, heads & synonyms matching s; • # partial matches (e.g. Fidel ↔ Fidel Castro); • order & proximity of pivots in s; • existence of common question categories between q and s; • score σd of the document d containing s. • Output: • Scored sentences {(s, σs)} above a fixed threshold.
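A toy combination of three of the criteria above (pivot coverage, pivot proximity, and the containing document's score); the weighting scheme here is an assumption, not the system's actual formula:

```python
# Illustrative sentence score: pivot coverage + a proximity bonus
# for pivots occurring close together + a share of the doc score.
def score_sentence(sentence_tokens, pivot_lemmas, doc_score):
    positions = [i for i, t in enumerate(sentence_tokens)
                 if t in pivot_lemmas]
    coverage = len(set(sentence_tokens) & pivot_lemmas) / len(pivot_lemmas)
    proximity = (1.0 / (1 + positions[-1] - positions[0])
                 if len(positions) > 1 else 0.0)
    return coverage + proximity + 0.1 * doc_score

tokens = ["o", "presidente", "da", "albânia"]
s = score_sentence(tokens, {"presidente", "albânia"}, 6.0)
```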
Answer extraction • Input: • Scored sentences {(s, σs)} • Active QAPs (from the question analysis module) • Procedure: • Answer extraction & scoring – through the QAPs • Answer coherence – each answer a is rescored to σ′a taking into account its coherence with the whole collection of candidate answers (e.g. “Sali Berisha”, “Ramiz Alia”, “Berisha”) • Selection of the final answer, e.g. “O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque.” (“The President of Albania, Sali Berisha, tried to avoid the worst, stating that it has not been proven that the Greek government is involved in the attack.”) • Output: • The answer a with the highest σ′a, or ‘NIL’ if no answer was extracted.
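One plausible reading of the coherence step, sketched below: candidates that are contained in (or contain) other candidates reinforce each other, so “Berisha” and “Sali Berisha” boost one another while “Ramiz Alia” stands alone. The substring criterion and bonus value are assumptions for illustration:

```python
# Hypothetical coherence rescoring: each candidate answer gets a
# bonus per other candidate that contains it or that it contains.
def rescore_by_coherence(candidates, bonus=0.5):
    """candidates: {answer: score}; returns adjusted scores."""
    rescored = {}
    for a, s in candidates.items():
        support = sum(1 for b in candidates
                      if b != a and (a in b or b in a))
        rescored[a] = s + bonus * support
    return rescored

cands = {"Sali Berisha": 2.0, "Ramiz Alia": 1.5, "Berisha": 1.0}
new_scores = rescore_by_coherence(cands)
```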
Results & evaluation (I) • QA@CLEF evaluation: • Portuguese monolingual task • 210,734 target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995 • Test set of 200 questions (in Brazilian and European Portuguese). • Results: • 64.5% of right answers (R).
Results & evaluation (II) • Reasons for bad answers (W = wrong, X = inexact, U = unsupported) were analysed.
Conclusions • Priberam’s QA system exhibited encouraging results: • State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation • Possible advantages over other systems: • Adjustable & powerful patterns for categorization & extraction (SintaGest) • Query expansion through heads of derivation & synonyms • Use of an ontology to introduce semantic knowledge • Some future work: • Confidence measure for final answer validation • Handling of list, how, & temporally restricted questions • Semantic disambiguation & further exploitation of the ontology • Syntactic parsing & anaphora resolution • Refinement for Web & book searching
Ontology • Concept-based • Tree-structured, 4 levels • Nodes are concepts • Leaves are senses of words • Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech) • There are 3387 terminal nodes (the most specific concepts)
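The tree structure above might be navigated as in this miniature (the concepts and sense names are invented; the real ontology has four levels and 3387 terminal nodes):

```python
# Hypothetical miniature of the concept tree: nodes are concepts,
# leaves are word senses; a recursive walk returns the concept path
# from the root down to a given sense.
ontology = {
    "entity": {
        "person": {
            "role": {"presidente#1": {}, "ministro#1": {}},
        },
        "place": {"country": {"Albânia#1": {}}},
    },
}

def path_to(sense, tree, prefix=()):
    """Depth-first search for a sense; returns the path or None."""
    for node, subtree in tree.items():
        if node == sense:
            return prefix + (node,)
        found = path_to(sense, subtree, prefix + (node,))
        if found:
            return found
    return None

path = path_to("presidente#1", ontology)
```

A path like this is what lets the system relate two words through a shared ancestor concept.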