
Summary

  1. Summary • Information Extraction Systems • Multilinguality • Introduction • Language guessers • Machine Translators • Translingual architectures • Information integration in MIE systems • Evaluation • Adaptability Adaptive Information Extraction

  2. Multilinguality Introduction • Multilingual IE (MIE) tasks: • The textual information in the output templates must be presented in a language different from that of the input documents • Typically: • input documents written in one language • output templates written in another

  3. Multilinguality Introduction • Relatively little research in MIE • LRE program in Europe • ECRAN, FACILE, AVENTINUS, SPARKLE, … • tools and components for IE in different languages • TIDES program in the USA • PROTEUS, RIPTIDES, CREST, … • fast machine translation and information access

  4. Multilinguality Introduction • Open research line • Up to now, multilingual IE has been evaluated only for NE tasks. Two recent scenarios: • CoNLL 2002-2003: • Language-independent NE recognition • ACE 2007: • Arabic input documents • English output NE mentions • Fei Huang (2005). Multilingual NE Extraction and Translation from Text and Speech. PhD thesis

  5. Multilinguality Introduction • Basic elements of MIE architectures: • language guessers • monolingual architectures • Classical approaches: • use of Machine Translation with monolingual IE architectures • extension of monolingual architectures to translingual architectures

  8. Multilinguality Language guessers • Goal: identify the language of a document • Linguistic approach: • based on a manually built vocabulary of keywords per language • idea: at least one word of a typical sentence written in a given language should appear in that language's vocabulary
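
The keyword-based idea can be sketched in a few lines of Python. This is only an illustration: the vocabularies in `KEYWORDS` are tiny and hypothetical, and the function name `guess_language_by_keywords` is not taken from any system mentioned in the slides.

```python
import re

# Hypothetical, hand-built keyword vocabularies (deliberately tiny).
KEYWORDS = {
    "english": {"the", "and", "of", "is", "in"},
    "spanish": {"el", "la", "y", "de", "es", "en"},
    "french": {"le", "la", "et", "de", "est", "dans"},
}


def guess_language_by_keywords(text):
    """Return the language whose vocabulary covers the most tokens, or None."""
    # Split into alphabetic tokens (Unicode letters only, no digits).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(token in vocab for token in tokens)
        for lang, vocab in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword seen at all: the guesser cannot decide.
    return best if scores[best] > 0 else None
```

Counting hits instead of stopping at the first keyword is a design choice of this sketch; it helps with words such as "la" or "de" that belong to more than one vocabulary.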

  9. Multilinguality Language guessers • Stochastic approach: • most widely used • based on: • generating a frequency table of elements per language • comparing the frequencies of elements in the document with those in the table • elements = special characters, word sequences or character sequences (depending on the approach)

  10. Multilinguality Language guessers • Stochastic approach: • Pros: good results (over 95% accuracy) • Cons: problems with short texts • [Zhdanova, 02] addresses this problem
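
A minimal sketch of the stochastic approach, assuming character trigrams as the elements and cosine similarity as the comparison. Both choices are assumptions of this example, since the slides leave them open, and the training sentences in `SAMPLES` are toy data invented for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of the whitespace-normalised, lower-cased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def frequency_table(text, n=3):
    """Relative frequency of each character n-gram in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency tables."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Toy training text per language (hypothetical; real tables need far more data).
SAMPLES = {
    "english": "the president of the company will leave his job tomorrow "
               "night and the board is looking for a new one",
    "spanish": "el presidente de la empresa dejará su puesto mañana por la "
               "noche y el consejo busca un sustituto nuevo",
}
TABLES = {lang: frequency_table(text) for lang, text in SAMPLES.items()}


def guess_language(document):
    """Pick the language whose frequency table is closest to the document's."""
    profile = frequency_table(document)
    return max(TABLES, key=lambda lang: cosine(profile, TABLES[lang]))
```

A very short document yields only a handful of trigrams, so its profile is a poor estimate of the underlying frequencies, which is the weakness with short texts noted on the slide.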

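  12. Multilinguality Machine translators • A set of monolingual IE systems • [Diagram: MIE system. A document in source language si passes through the language guesser to the matching monolingual system IE(s1), IE(s2), …, IE(sk); the resulting templates are translated into the target language t by mt(s1,t), mt(s2,t), …, mt(sk,t).]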

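  13. Multilinguality Machine translators • Just one monolingual IE system • [Diagram: MIE system. A document in source language si passes through the language guesser and is translated into language t’ by MT(s1,t’), MT(s2,t’), …, MT(sk,t’); the single system IE(t’) produces the templates, which mt(t’,t) translates into the target language t.]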
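
The two MT-based architectures can be contrasted with a short Python sketch. All components are passed in as plain functions; the names `mie_many_ie`, `mie_single_ie` and the `toy_*` stand-ins are hypothetical and exist only to make the data flow runnable.

```python
def mie_many_ie(document, target, guess, ie_systems, translate_value):
    """One monolingual IE system per source language; translate the template."""
    source = guess(document)
    template = ie_systems[source](document)
    return {
        slot: translate_value(value, source, target)
        for slot, value in template.items()
    }


def mie_single_ie(document, target, pivot, guess, translate_document,
                  ie_pivot, translate_value):
    """Translate the document into a pivot language, run the single IE system
    for that language, then translate the template into the target language."""
    source = guess(document)
    pivot_document = translate_document(document, source, pivot)
    template = ie_pivot(pivot_document)
    return {
        slot: translate_value(value, pivot, target)
        for slot, value in template.items()
    }


# Toy stand-ins (assumptions for illustration, not real components).
def toy_guess(document):
    return "es"


def toy_translate(text, source, target):
    # Marks the text instead of translating it, to make the data flow visible.
    return f"{text} <{source}->{target}>"


def toy_ie(document):
    # Pretends the first word of the document fills the "person" slot.
    return {"person": document.split()[0]}
```

The first variant needs k IE systems but translates only short template values; the second needs a single IE system but must translate every full input document before extraction.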

  15. Multilinguality Translingual architectures • Try to overcome the inefficiency of the MIE architectures based on MT • Merging of IE and interlingua MT • Idea: when dealing with a particular domain, it is possible to build a language-independent conceptual model of the particular extraction scenario [Gaizauskas et al. 97]

  16. Multilinguality Translingual architectures • Each source language requires: • its own lexical preprocessors • its own syntactico-semantic parser • its own set of IE patterns (if the MIE system is based on pattern matching) • Possible use of language-independent processors (e.g., NERC)

  17. Multilinguality Translingual architectures • Use of a language-independent ontology • The internal representation of the extracted information is language independent • Use of soft techniques for NL generation • The output templates are generated using the lexicon of the target language • lexical choice problem!

  18. Multilinguality Translingual architectures • M-LASIE system [Gaizauskas et al. 97] • Ad-hoc representation of the domain model • Lexicons mapped to concepts • Adding a new source language involves: • adding a new lexicon + mappings • adding a new tagger and parser • …

  19. Multilinguality Translingual architectures • M-TURBIO system [Turmo et al. 99] • EuroWordNet (EWN) • Sets of IE patterns for each source language • Mappings from IE patterns to ILIs in EWN • Adding a new source language involves: • adding new IE patterns • adding a new tagger and parser • …
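
The core of a translingual architecture, source-language words mapped to language-independent concepts and output generated from a target-language lexicon, can be sketched as follows. The concept identifiers (`C_PRESIDENT`, `C_COMPANY`) and both lexicons are hypothetical; they merely stand in for ontology nodes or EWN ILIs and do not reproduce M-LASIE or M-TURBIO.

```python
# Source-language lexicons mapped to language-independent concepts (toy data).
SOURCE_LEXICONS = {
    "en": {"president": "C_PRESIDENT", "company": "C_COMPANY"},
    "es": {"presidente": "C_PRESIDENT", "empresa": "C_COMPANY"},
}

# Target-language lexicons used for generation (toy data, one word per concept).
TARGET_LEXICONS = {
    "en": {"C_PRESIDENT": "president", "C_COMPANY": "company"},
    "es": {"C_PRESIDENT": "presidente", "C_COMPANY": "empresa"},
}


def to_concepts(tokens, source):
    """Map the known source-language tokens to concept identifiers."""
    lexicon = SOURCE_LEXICONS[source]
    return [lexicon[token] for token in tokens if token in lexicon]


def generate(concepts, target):
    """Render concept identifiers with the target-language lexicon."""
    lexicon = TARGET_LEXICONS[target]
    return [lexicon[concept] for concept in concepts]
```

Under this sketch, adding a source language means adding one entry to `SOURCE_LEXICONS` (plus, in a real system, its tagger, parser and patterns). The one-word-per-concept target lexicon sidesteps the lexical choice problem, which arises as soon as a concept has several candidate words in the target language.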

  21. Multilinguality Information Integration in MIEs • The most general architecture • Input documents in different source languages, not aligned • Output templates in different target languages • Possible approaches: • MIE system + II system • MIE/II system

  22. Multilinguality Information Integration in MIEs • Pros: • Versatile • An instance may occur in just one document, written in a specific language • An instance can be easier to extract when expressed in one language than in another • better processors or resources • Cons: • Problems inherent to II • inconsistent values, similar values, generalizations, …
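
A minimal sketch of the integration step and of the "inconsistent values" problem. It assumes, purely for illustration, that templates extracted from documents in different languages have already been normalised to a common representation and share an instance identifier under `key`; real II systems must first establish that correspondence.

```python
def integrate(templates, key):
    """Merge templates describing the same instance and report slot conflicts.

    templates: list of dicts, each containing the identifying slot `key`.
    Returns (merged, conflicts); conflicts holds tuples of
    (instance id, slot, value kept, conflicting value).
    """
    merged = {}
    conflicts = []
    for template in templates:
        instance = merged.setdefault(template[key], {})
        for slot, value in template.items():
            if slot not in instance:
                # New information: one language contributes a slot the other lacks.
                instance[slot] = value
            elif instance[slot] != value:
                # Inconsistent values: keep the first, record the clash.
                conflicts.append((template[key], slot, instance[slot], value))
    return merged, conflicts
```

Keeping the first value and logging the clash is just one possible policy; similar values and generalizations would need a comparison looser than `!=`.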

  23. Summary • Information Extraction Systems • Multilinguality • Evaluation • Introduction • Metrics • Data sets • Adaptability

  24. Evaluation Introduction • What does "correctly extracted" mean? • What are the right metrics? • What are the best data sets? • The evaluation of the performance of an IE system depends on different factors: • The IE task: domain, language, document style, … • The user needs: software use, human use, just some clues about the relevant facts, the context in which they occur, …

  25. Evaluation Introduction • Exact extraction? • [Figure: the sentence "The president of ALP in Spain will leave his job tomorrow night" shown with alternative NP bracketings, asking which of them counts as an exact extraction.]

  27. Evaluation Metrics • Different evaluation frameworks take different views of what counts as correctly extracted: • MUC: • correct = partial extraction (up to MUC-5) • correct = exact extraction (MUC-6, MUC-7) • Recall, Precision and F (cf. Historical Framework) • PASCAL: • correct = exact extraction • Same metrics as in MUC-6 • ACE: • correct = partial extraction (more sophisticated than MUC)
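
The difference between exact and partial matching can be made concrete with a small scorer over token spans. This is a sketch, not the official scorer of any of these frameworks: the greedy matching, the `partial_credit` parameter and its value are assumptions of this example, and system spans are assumed to be distinct.

```python
def overlaps(a, b):
    """True if two (start, end) spans, end exclusive, share a token."""
    return a[0] < b[1] and b[0] < a[1]


def score(system, reference, partial_credit=0.0):
    """Precision, recall and F1 over lists of (start, end) spans.

    partial_credit is the weight given to a system span that overlaps an
    otherwise unmatched reference span without matching it exactly.
    """
    unmatched = set(reference)
    exact = [span for span in system if span in unmatched]
    unmatched -= set(exact)

    partial = 0
    for span in system:
        if span in exact:
            continue
        hit = next((ref for ref in sorted(unmatched) if overlaps(span, ref)), None)
        if hit is not None:
            unmatched.remove(hit)
            partial += 1

    credit = len(exact) + partial_credit * partial
    precision = credit / len(system) if system else 0.0
    recall = credit / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Token spans in "The president of ALP in Spain will leave his job tomorrow night".
reference = [(0, 6), (8, 10)]  # "The president of ALP in Spain", "his job"
system = [(0, 4), (8, 10)]     # "The president of ALP", "his job"
```

With these spans, `score(system, reference)` counts only "his job" as correct, while `score(system, reference, partial_credit=0.5)` also gives half credit to the truncated noun phrase.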

  28. Evaluation Metrics • ACE metric • Idea: how well does the information extracted by a system match that of the reference model? • Given a system output, s, and a reference model, m, find the global optimum of the function Value(s,m) over the possible matchings between instances in s and instances in m

  29. Evaluation Metrics • ACE metric • $\mathrm{Value}(s,m) = \sum_i \mathrm{Value}(\mathit{sys\_token}_i) \,/\, \sum_j \mathrm{Value}(\mathit{ref\_token}_j)$ • $\mathrm{Value}(\mathit{token}) = \mathrm{Element\_value}(\mathit{token}) \cdot \mathrm{Argument\_value}(\mathit{token})$ • token = extracted instance = [attributes, args or mentions] • Penalties: unmapped attributes, unmapped arguments, wrong mappings • Parameters: weights for penalties
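
The ratio on this slide can be written out as a few lines of Python. This is only a sketch of the arithmetic: it assumes each token is a dict with hypothetical `element_value` and `argument_value` fields, and that the penalties for unmapped or wrongly mapped items have already been folded into the system tokens' values. It does not perform the search for the optimal mapping.

```python
def token_value(token):
    """Value(token) = Element_value(token) * Argument_value(token)."""
    return token["element_value"] * token["argument_value"]


def ace_value(system_tokens, reference_tokens):
    """Sum of system token values divided by sum of reference token values."""
    reference_total = sum(token_value(t) for t in reference_tokens)
    if reference_total == 0:
        raise ValueError("reference model carries no value")
    return sum(token_value(t) for t in system_tokens) / reference_total
```

For example, a reference model with token values 1.0 and 0.5 and a system that recovers only the first instance, with its argument value reduced to 0.75, would score 0.75 / 1.5 = 0.5.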

  30. Evaluation Metrics • ACE metric • Software for ACE evaluation and more information on ACE evaluation are available at • http://www.nist.gov/speech/tests/ace

  32. Evaluation Data sets • Ad-hoc • State of the art (e.g., from MUC, ACE, PASCAL) • Each one is appropriate for evaluating different IE tasks, depending on different factors • Availability? • Suitability?

  33. Evaluation Data sets: MUC • Sources: • Free text, written (newswire) • MUC-6 and MUC-7 data sets • Suitable tasks: • NE subtasks • Element Extraction tasks (template element, TE) • Event Extraction tasks (scenario template, ST) • Relation Extraction tasks are quite easy • Language: English • Available from LDC (Linguistic Data Consortium) • http://www.ldc.upenn.edu

  34. Evaluation Data sets: ACE • Sources: • Free text, written (newswire, weblogs, discussion forums) • Free text, oral transcripts (broadcast news, telephone conversations) • Suitable tasks (up to now): • NE subtasks (extended from MUC) • Relation Extraction tasks • Event Extraction tasks need more annotation effort • Language: English, Arabic, Chinese, Spanish, depending on the input source • Available from LDC (Linguistic Data Consortium) • http://www.ldc.upenn.edu

  35. Evaluation Data sets: PASCAL • Sources: • Semi-structured documents (seminar announcements, corporate acquisitions, legal sentences) • Suitable tasks (up to now): • Element Extraction tasks • Language: English, Italian • Available from • http://nlp.shef.ac.uk/dot.kom/resources.html • Similar sources in the RISE repository • http://www.isi.edu/info-agents/RISE/index.html
