
Priberam’s question answering system for Portuguese




  1. Priberam’s question answering system for Portuguese
  Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto

  2. Summary
  • Introduction
  • A workbench for NLP
    • Lexical resources
    • Software tools
    • Question categorization
  • System description
    • Indexing process
    • Question analysis
    • Document retrieval
    • Sentence retrieval
    • Answer extraction
  • Evaluation & Results
  • Conclusions

  3. Introduction
  • Goal: to build a question answering (QA) engine that finds a single exact answer to natural language (NL) questions.
  • Evaluation: the QA@CLEF Portuguese monolingual task.
  • Previous work by Priberam on this subject:
    • LegiX – a juridical information system
    • SintaGest – a workbench for NLP
    • TRUST project (Text Retrieval Using Semantics Technology) – development of the Portuguese module in a cross-language environment.

  4. Lexical resources
  • Lexicon (a sketch of an entry follows this slide):
    • Lemmas, inflections and POS;
    • Sense definitions (*);
    • Semantic features, subcategorization and selection restrictions;
    • Ontological and terminological domains;
    • English and French equivalents (*);
    • Lexical-semantic relations (e.g. derivations).
    (*) Not used in the QA system.
  • Thesaurus
  • Ontology:
    • Multilingual (**) (English, French, Portuguese) – enables translations;
    • Designed by Synapse Développement for TRUST.
    (**) Only Portuguese information is used in the QA system.
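  To make the shape of these resources concrete, here is a minimal sketch (Python, purely illustrative; not Priberam’s actual data format) of a lexicon entry carrying the fields listed above:

    from dataclasses import dataclass, field
    from typing import List, Optional, Set

    @dataclass
    class LexiconEntry:
        """Hypothetical container for the lexical information listed above (illustrative only)."""
        lemma: str
        inflections: List[str]
        pos: str
        semantic_features: Set[str] = field(default_factory=set)
        domains: Set[str] = field(default_factory=set)      # ontological / terminological domains
        derivation_head: Optional[str] = None                # head of derivation (e.g. 'presidir' for 'presidente')
        synonyms: List[str] = field(default_factory=list)    # thesaurus links

    # A toy entry; the field values are made up for illustration.
    presidente = LexiconEntry(
        lemma="presidente",
        inflections=["presidente", "presidentes"],
        pos="Nc",
        domains={"politics"},
        derivation_head="presidir",
        synonyms=["chefe de estado"],
    )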

  5. Software tools
  • Priberam’s SintaGest – an NLP application that allows:
    • Building & testing a context-free grammar (CFG);
    • Building & testing contextual rules for:
      • Morphological disambiguation;
      • Named entity & fixed expression recognition;
    • Building & testing patterns for question categorization/answer extraction;
    • Compressing & compiling all data into binary files.
  • Statistical POS tagger:
    • Used together with contextual rules for morphological disambiguation;
    • HMM-based (2nd order), trained on the CETEMPúblico corpus;
    • Fast & efficient decoding via the Viterbi algorithm.
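  The slide only names the tagger’s ingredients; as a rough illustration of how Viterbi decoding works, here is a first-order sketch in Python (the actual tagger is second-order and trained on CETEMPúblico; the tag set, probabilities and smoothing floor below are toy assumptions):

    import math

    def viterbi(words, tags, start_p, trans_p, emit_p):
        """First-order Viterbi decoding: returns the most probable tag sequence.
        Probabilities are plain dictionaries; missing entries get a small floor value."""
        floor = 1e-12
        # best[t] = (log-probability of the best path ending in tag t, that path)
        best = {t: (math.log(start_p.get(t, floor)) +
                    math.log(emit_p.get((t, words[0]), floor)), [t]) for t in tags}
        for w in words[1:]:
            new_best = {}
            for t in tags:
                score, path = max(
                    (prev_score + math.log(trans_p.get((prev_t, t), floor)) +
                     math.log(emit_p.get((t, w), floor)), prev_path)
                    for prev_t, (prev_score, prev_path) in best.items())
                new_best[t] = (score, path + [t])
            best = new_best
        return max(best.values())[1]

    # Toy usage with hand-set probabilities (illustrative only).
    tags = ["DET", "N", "V"]
    print(viterbi(
        ["o", "presidente", "fala"], tags,
        start_p={"DET": 0.7, "N": 0.2, "V": 0.1},
        trans_p={("DET", "N"): 0.8, ("N", "V"): 0.6, ("N", "N"): 0.2, ("V", "N"): 0.5},
        emit_p={("DET", "o"): 0.6, ("N", "presidente"): 0.5, ("V", "fala"): 0.4},
    ))  # -> ['DET', 'N', 'V']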

  6. Question categorization (I)
  • 86 question categories, flat structure:
    • <DENOMINATION>, <DATE OF EVENT>, <TOWN NAME>, <BIRTH DATE>, <FUNCTION>, …
  • Categorization is performed through “rich” patterns (more powerful than regular expressions):
    • More than one category is allowed (avoiding hard decisions);
    • “Rich” patterns are conditional expressions over words (Word), lemmas (Root), POS (Cat), ontology entries (Ont), question identifiers (QuestIdent), and constant phrases;
    • Everything is built & tested through SintaGest.

  7. Question categorization (II)
  • There are 3 kinds of patterns:
    • Question patterns (QPs): for question categorization.
    • Answer patterns (APs): for sentence categorization (during indexation).
    • Question answering patterns (QAPs): for answer extraction.
  • Example patterns for the <FUNCTION> category (the numbers after “=” are heuristic scores):

    QPs:
    Question (FUNCTION)
      : Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
        // e.g. “Quem é Jorge Sampaio?”
      : Word(que) QuestIdent(FUNCTION_N) Distance(0,3) QuestIdent(FUNCTION_V) = 15
        // e.g. “Que cargo desempenha Jorge Sampaio?”

    QAPs:
    Answer
      : Pivot & AnyCat(Nprop, ENT) Root(ser) {Definition With Ergonym?} = 20
        // e.g. “Jorge Sampaio é o {Presidente da República}...”
      : {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15
        // e.g. “O {presidente da República}, Jorge Sampaio...”
    ;

    APs:
    Answer (FUNCTION)
      : QuestIdent(FUNCTION_N) = 10
      : Ergonym = 10
    ;
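  As an illustration of how such a “rich” pattern might be evaluated at run time, here is a simplified sketch of the first QP above (this is not SintaGest’s actual matching engine; the token representation and the Distance semantics are assumptions):

    # Evaluating the QP  Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
    # against an analysed question. Each token is a dict with word, root (lemma) and category.
    def match_function_qp(tokens):
        for i, tok in enumerate(tokens):
            if tok["word"].lower() != "quem":
                continue
            # Distance(0,3): 'ser' must appear within the next 0..3 tokens
            for j in range(i + 1, min(i + 5, len(tokens))):
                if tokens[j]["root"] == "ser":
                    # AnyCat(Nprop, ENT): a proper noun or named entity must follow
                    if any(t["cat"] in ("Nprop", "ENT") for t in tokens[j + 1:]):
                        return 15   # heuristic score attached to the pattern
        return 0

    question = [
        {"word": "Quem", "root": "quem", "cat": "Pron"},
        {"word": "é", "root": "ser", "cat": "V"},
        {"word": "Jorge Sampaio", "root": "jorge sampaio", "cat": "ENT"},
    ]
    assert match_function_qp(question) == 15   # question categorized as <FUNCTION>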

  8. QA system overview
  • The system architecture is composed of 5 major modules, detailed in the following slides: indexing process, question analysis, document retrieval, sentence retrieval and answer extraction.

  9. Indexing process
  • The collection of target documents is analysed (off-line) and the information is stored in an index database.
    • Each document first feeds the sentence analyser;
    • Sentence categorization: each sentence is classified with one or more question categories through the APs.
  • We build indices for:
    • Lemmas
    • Heads of derivation
    • NEs and fixed expressions
    • Question categories
    • Ontology domains (at document level)
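  A minimal sketch of the kind of inverted indices this step could produce (illustrative only; the analysed-sentence representation is an assumption and the document-level ontology-domain index is left out):

    from collections import defaultdict

    def build_indices(analysed_sentences):
        """Sentences are assumed to arrive already analysed, as dicts with the lemmas, heads,
        named entities and the question categories assigned by the APs."""
        index = {
            "lemma": defaultdict(set),
            "head": defaultdict(set),
            "entity": defaultdict(set),
            "category": defaultdict(set),
        }
        for sent_id, sent in enumerate(analysed_sentences):
            for lemma in sent["lemmas"]:
                index["lemma"][lemma].add(sent_id)
            for head in sent["heads"]:
                index["head"][head].add(sent_id)
            for ne in sent["entities"]:
                index["entity"][ne].add(sent_id)
            for cat in sent["categories"]:      # e.g. 'FUNCTION', assigned by an AP
                index["category"][cat].add(sent_id)
        return index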

  10. Question analysis
  • Input:
    • An NL question (e.g. “Quem é o presidente da Albânia?” – “Who is the president of Albania?”)
  • Procedure:
    • Sentence analysis
    • Question categorization & activation of QAPs (through the QPs)
    • Extraction of pivots (words, NEs, phrases, dates, abbreviations, …)
    • “Query expansion” (heads of derivation & synonyms)
  • Output:
    • Pivots’ lemmas, heads & synonyms (e.g. presidente, Albânia, presidir, albanês, chefe de estado)
    • Question categories (e.g. <FUNCTION>, <DENOMINATION>)
    • Relevant ontological domains
    • Active QAPs
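  The query-expansion step can be pictured with this small sketch (the lookup tables are hypothetical stand-ins for the lexicon’s derivation links and the thesaurus):

    # Hypothetical lookup tables standing in for the lexicon and thesaurus (illustrative only).
    derivation = {"presidente": ["presidir"], "albânia": ["albanês"]}   # derivational family
    synonyms = {"presidente": ["chefe de estado"]}                      # thesaurus entries

    def expand_pivots(pivot_lemmas):
        """Return, for each pivot lemma, the derivation family and synonyms used as extra search terms."""
        return {lemma: {"derivation": derivation.get(lemma, []),
                        "synonyms": synonyms.get(lemma, [])}
                for lemma in pivot_lemmas}

    # e.g. pivots extracted from "Quem é o presidente da Albânia?"
    print(expand_pivots(["presidente", "albânia"]))
    # {'presidente': {'derivation': ['presidir'], 'synonyms': ['chefe de estado']},
    #  'albânia': {'derivation': ['albanês'], 'synonyms': []}}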

  11. Document retrieval
  • Input:
    • Pivots’ lemmas (wLi), heads (wHi) & synonyms (wSij)
    • Question categories (ck) & ontological domains (ol)
  • Procedure:
    • Each pivot word w is given a weight μ(w) according to:
      • POS;
      • ilf (inverse lexical frequency);
      • idf (inverse document frequency).
    • Each document d is given a score σd:

      σd := 0
      For Each pivot i
        If d contains lemma wLi Then σd += KL · μ(wLi)
        Else If d contains head wHi Then σd += KH · μ(wHi)
        Else If d contains any synonym wSij Then σd += maxj(KS(wSij, wLi) · μ(wSij))
      If d contains any question category ck Then σd += KC
      If d contains any ontology domain ol Then σd += KO
      σd := RewardPivotProximity(d, σd)

  • Output:
    • The top 30 scored documents.
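  Rendered as Python, the scoring loop above looks roughly like this sketch (the constant values and the simplification of KS to a single factor are assumptions; μ is passed in as a weighting function):

    # Illustrative constants; the real KL, KH, KS, KC, KO values are tuning parameters of the system.
    KL, KH, KS, KC, KO = 3.0, 2.0, 1.5, 1.0, 0.5

    def reward_pivot_proximity(doc, sigma):
        # Placeholder: the proximity heuristic itself is not detailed on the slide.
        return sigma

    def score_document(doc, pivots, q_categories, q_domains, mu):
        """doc: dict with 'lemmas', 'heads', 'categories', 'domains' as sets.
           pivots: list of dicts with 'lemma', 'head', 'synonyms'.
           mu: weighting function combining POS, ilf and idf."""
        sigma = 0.0
        for p in pivots:
            if p["lemma"] in doc["lemmas"]:
                sigma += KL * mu(p["lemma"])
            elif p["head"] in doc["heads"]:
                sigma += KH * mu(p["head"])
            else:
                hits = [s for s in p["synonyms"] if s in doc["lemmas"]]
                if hits:
                    sigma += max(KS * mu(s) for s in hits)
        if q_categories & doc["categories"]:
            sigma += KC
        if q_domains & doc["domains"]:
            sigma += KO
        return reward_pivot_proximity(doc, sigma)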

  12. Sentence retrieval
  • Input:
    • Scored documents {(d, σd)} with relevant sentences marked.
  • Procedure:
    • Sentence analysis
    • Sentence scoring – each sentence s is given a score σs according to:
      • # of pivots’ lemmas, heads & synonyms matching s;
      • # of partial matches (e.g. Fidel ↔ Fidel Castro);
      • Order & proximity of the pivots in s;
      • Existence of common question categories between the question q and s;
      • Score σd of the document d containing s.
  • Output:
    • Scored sentences {(s, σs)} above a fixed threshold.
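  A sketch of how those criteria could be combined into a sentence score (the weights are illustrative and the order/proximity-of-pivots criterion is omitted):

    def count_partial_matches(q_entities, s_entities):
        # A named entity counts as a partial match when one string contains the other,
        # e.g. 'Fidel' vs. 'Fidel Castro'.
        return sum(1 for q in q_entities for s in s_entities
                   if q != s and (q in s or s in q))

    def score_sentence(sent, question, doc_score):
        """sent/question: dicts of sets; doc_score: score of the document containing sent."""
        sigma = 0.0
        sigma += 1.0 * len(question["pivot_lemmas"] & sent["lemmas"])      # exact pivot matches
        sigma += 0.5 * len(question["pivot_heads"] & sent["heads"])        # head-of-derivation matches
        sigma += 0.5 * len(question["pivot_synonyms"] & sent["lemmas"])    # synonym matches
        sigma += 0.3 * count_partial_matches(question["pivot_entities"], sent["entities"])
        sigma += 0.5 * (1 if question["categories"] & sent["categories"] else 0)
        sigma += 0.2 * doc_score                                           # inherit part of the document score
        return sigma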

  13. Answer extraction
  • Input:
    • Scored sentences {(s, σs)}
    • Active QAPs (from the Question Analysis module)
  • Procedure:
    • Answer extraction & scoring – through the QAPs
      e.g. from “O Presidente da Albânia, Sali Berisha, tentou evitar o pior, afirmando que não está provado que o Governo grego esteja envolvido no ataque.”
    • Answer coherence:
      • Each candidate answer is re-scored taking into account its coherence with the whole collection of candidate answers (e.g. “Sali Berisha”, “Ramiz Alia”, “Berisha”).
    • Selection of the final answer.
  • Output:
    • The highest-scoring answer, or ‘NIL’ if no answer was extracted.
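  The coherence re-scoring idea can be illustrated with a toy sketch: a candidate gains support from other candidates that contain it or are contained in it (the 0.5 factor and the containment test are assumptions, not the system’s actual heuristic):

    def rescore_by_coherence(candidates):
        """candidates: list of (answer, score) pairs. Returns re-scored pairs."""
        rescored = []
        for ans, score in candidates:
            support = sum(other_score for other, other_score in candidates
                          if other != ans and (other in ans or ans in other))
            rescored.append((ans, score + 0.5 * support))
        return rescored

    cands = [("Sali Berisha", 20.0), ("Ramiz Alia", 15.0), ("Berisha", 10.0)]
    best = max(rescore_by_coherence(cands), key=lambda x: x[1])
    print(best)   # ('Sali Berisha', 25.0): boosted by the shorter candidate 'Berisha'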

  14. Results & evaluation (I)
  • QA@CLEF evaluation:
    • Portuguese monolingual task
    • 210,734 target documents (~564 MB) from Portuguese & Brazilian newspaper corpora: Público 1994, Público 1995, Folha 1994, Folha 1995
    • Test set of 200 questions (in Brazilian and European Portuguese).
  • Results:
    • 64.5% of right answers (R).

  15. Results & evaluation (II)
  • Reasons for incorrect answers (W = wrong, X = inexact, U = unsupported).

  16. Conclusions
  • Priberam’s QA system exhibited encouraging results:
    • State-of-the-art accuracy (64.5%) in the QA@CLEF evaluation
  • Possible advantages over other systems:
    • Adjustable & powerful patterns for categorization & extraction (SintaGest)
    • Query expansion through heads of derivation & synonyms
    • Use of the ontology to introduce semantic knowledge
  • Some future work:
    • Confidence measure for final answer validation
    • Handling of list-, how- and temporally-restricted questions
    • Semantic disambiguation & further exploitation of the ontology
    • Syntactic parsing & anaphora resolution
    • Refinement for Web & book searching

  17. Priberam’s question answering system for Portuguese
  Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto

  18. Ontology
  • Concept-based
  • Tree-structured, 4 levels:
    • Nodes are concepts
    • Leaves are senses of words
  • Words are translated into several languages (English, French, Portuguese, Italian, Polish, and soon Spanish and Czech)
  • There are 3387 terminal nodes (the most specific concepts)
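  A minimal sketch of such a concept tree (the node names and sample entries are illustrative, not the TRUST ontology itself; the real tree has 4 levels and 3387 terminal nodes):

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class OntologyNode:
        """Internal nodes are concepts; leaves hold word senses with their translations."""
        concept: str
        children: List["OntologyNode"] = field(default_factory=list)
        senses: Dict[str, Dict[str, str]] = field(default_factory=dict)   # word -> {language: translation}

    leaf = OntologyNode(
        concept="head_of_state",
        senses={"presidente": {"en": "president", "fr": "président", "pt": "presidente"}},
    )
    root = OntologyNode(concept="society", children=[OntologyNode("politics", children=[leaf])])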
