
Information Extraction

Jordi Turmo, TALP Research Centre, Dep. Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya. turmo@lsi.upc.edu, http://www.lsi.upc.edu/~turmo


Presentation Transcript


  1. Information Extraction Jordi Turmo TALP Research Centre Dep. Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya turmo@lsi.upc.edu http://www.lsi.upc.edu/~turmo Adaptive Information Extraction

  2. Summary • Information Extraction Systems • Evaluation • Multilinguality • Adaptability

  3. Summary • Information Extraction Systems: Introduction, Historical framework, Architecture, Knowledge specific for IE, Examples • Evaluation • Multilinguality • Adaptability

  4. Introduction: Definition • Goal: locating and extracting, in a specified format, the relevant information contained in a collection of documents • Input requirements: an extraction scenario and a document collection • Output requirements: an output format

  5. Introduction: Typology • Different points of view: • conceptual coverage: restricted-domain IE vs. open-domain IE • language coverage: monolingual IE vs. multilingual IE • media coverage: written-text IE, speech IE, image IE, multimedia IE • document type: IE from free text, from semi-structured documents, from structured documents (including Web pages in HTML and XML)

  7. Introduction: Example 1 (structured documents) • Web pages • One list of members of an organization per document • English • Extraction scenario: name, degree, school and affiliation of each member

  8. Introduction: Example 1 (structured documents)
       Name       Degree  School   Affiliation
       WL Hsu     PhD     Cornell  IIS, Sinica
       CS Ho      PhD     NTU      EE,NTIT
       C.Chen     PhD     SUNY     EE,NTIT
       C.Wu       PhD     Utexas   Cedu,NNU
       Mark Liao  PhD     NWU      IIS, Sinica
       CJ Liau    PhD     NTU      IIS, Sinica
       WK Cheng   PhD     TKU      Tunghai
       WC Wang    MS      Syracus  FIT
       ...

  9. Introduction: Example 2 (semi-structured documents) • 485 seminar announcements • One seminar description per document • English • Extraction scenario: speaker, location, start time and end time of the seminar

  10. Introduction: Example 2 (semi-structured documents)

  11. Introduction: Example 3 (free text) • 318 Wall Street Journal articles • One incident description per document • English • Extraction scenario: type of incident, perpetrator, target, date, location, effects and instrument

  12. Introduction: Example 3 (free text) • Text: "A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb, allegedly detonated by urban guerrilla commandos, blew up a power tower in the northwestern part of San Salvador at 0650." • Output template:
       Incident type: bombing
       Date: March 19
       Location: El Salvador: San Salvador (city)
       Perpetrator: urban guerrilla commandos
       Physical target: power tower
       Human target: -
       Effect on physical target: destroyed
       Effect on human target: no injury or death
       Instrument: bomb
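
The output template on this slide can be viewed as a fixed record with optional slots. The following is a minimal sketch of that idea; the class name `IncidentTemplate` and the field names are illustrative assumptions, not the official MUC template format.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class IncidentTemplate:
    """One output template; slot names are hypothetical, modeled on the slide."""
    incident_type: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    physical_target: Optional[str] = None
    human_target: Optional[str] = None
    effect_on_physical_target: Optional[str] = None
    effect_on_human_target: Optional[str] = None
    instrument: Optional[str] = None

    def filled_slots(self) -> dict:
        """Return only the slots that the extraction actually filled."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# The values are copied from the slide; the empty "Human target" slot stays None.
example = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="El Salvador: San Salvador (city)",
    perpetrator="urban guerrilla commandos",
    physical_target="power tower",
    effect_on_physical_target="destroyed",
    effect_on_human_target="no injury or death",
    instrument="bomb",
)
```

Keeping unfilled slots as `None` makes it easy to distinguish "not extracted" from an extracted value when templates are later scored or merged.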

  13. Introduction: Example 4 (free text) • 78 documents • One mushroom description per document • Spanish • Extraction scenario: colors of the parts of mushrooms and the circumstances in which they occur

  14. Introduction: Example 4 (free text)

  15. Introduction: Example 4 (free text) • Text: "El color blanco de su sombrero pasa a amarillo crema al corte. El sombrero ennegrece si se corta." (English: "The white color of its cap turns cream yellow when cut. The cap blackens if it is cut.") • Extracted structures:
       color_1: base: blanco, tono: indef, luz: indef
       Sombrero_1: color: virar_1
       virar_1: inicio: color_1, final: color_2, causa: corte
       color_2: base: amarillo, tono: crema, luz: indef
       Sombrero_2: color: virar_2
       virar_2: inicio: indef, final: color_3, causa: corte
       color_3: base: indef, tono: negro, luz: indef

  16. Introduction: Example 5 (combination) • 78 documents • One mushroom description per document • Spanish • Extraction scenario: names of the mushroom in different languages, etymology, colors of the parts of mushrooms and the circumstances in which they occur

  17. Introduction: Example 5 (combination)

  18. Introduction: Applications • IE from the Web • Building news databases • Information Integration • Support for QA and Summarization • … • Limitation: when precision P < 80%

  19. Introduction: References • D.E. Appelt, D.J. Israel, 1999 • E. Hovy, 1999 • R.J. Mooney, C. Cardie, 1999 • Muslea, 1999 • J. Cowie, Y. Wilks, 2000 • M.T. Pazienza, 2003 • Turmo, 2003 • Turmo et al., 2005

  20. Introduction: Recent events • IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001) • ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003) • AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004) • EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006) • COLING-ACL 06 Workshop on Information Extraction Beyond the Document • ECAI 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)

  21. Summary • Information Extraction Systems: Introduction, Historical framework, Architecture, Knowledge specific for IE, Examples • Evaluation • Multilinguality • Adaptability

  22. Historical framework: Origin of IE • Acquisition of the relevant information involved in knowledge-based systems • Traditionally: a manual process in which domain experts supply the relevant information (high human cost)

  23. Historical framework: Origin of IE • Acquisition of the relevant information involved in knowledge-based systems • 1980s: text-based intelligent systems obtain the relevant information from text sources

  24. Historical framework: Origin of IE • Text-Based Intelligent Systems (TBIS) • Information Retrieval • Information Integration • Information Filtering • Information Routing • Information Extraction • Document Classification • Question Answering • Automatic Summarization • Topic Detection & Tracking • ...

  25. Historical framework: Relevant historical programs • Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86) • In the USA: • (1987-1991): MUC [US Navy] • TIPSTER (1991-1998): MUC [DARPA] • TIDES (1999-): ACE [NIST] • In Europe: • LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE • PASCAL network of excellence (2003-)

  26. Historical framework: MUC evolution • MUC-1 (1987): naval operations; participants define their own scenarios; self-evaluation • MUC-2 (1989): naval operations; output structure with 10 attributes (type of event, agent, place, ...); self-evaluation

  27. Historical framework: MUC evolution • MUC-3 (1991): Latin-American terrorism; output structure with 18 attributes (type of incident, date, place, ...); recall and precision measures • extracted = a + b + e + f • relevant = a + f + d • recall = (a + 0.5 f) / (a + f + d) • precision = (a + 0.5 f) / (a + f + b + e) • (Venn diagram of the extracted and relevant sets, with regions a to f; f is the partially extracted region)
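
As a worked illustration of the MUC-3 measures, the sketch below computes recall and precision from the region counts of the diagram. It assumes the intended reading of the formulas is that partial extractions (f) earn half credit in the numerator, i.e. (a + 0.5 f) divided by the relevant or extracted total. The function name `muc3_scores` and the sample counts are made up for the example.

```python
def muc3_scores(a, b, d, e, f):
    """Recall and precision with half credit for partial extractions.

    a: correctly extracted, f: partially extracted,
    b and e: extracted but not relevant, d: relevant but not extracted.
    """
    extracted = a + b + e + f
    relevant = a + f + d
    credit = a + 0.5 * f  # assumption: partial matches count one half
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    return recall, precision


# Hypothetical counts: 6 correct, 2 partial, 2 spurious, 2 missed.
recall, precision = muc3_scores(a=6, b=1, d=2, e=1, f=2)
```

With these counts both totals are 10 and the credited amount is 7, so recall and precision are both 0.7.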

  28. Historical framework: MUC evolution • MUC-4 (1992): Latin-American terrorism; 24 attributes; F-score (harmonic mean of recall and precision) • MUC-5 (1993): financial news, microelectronics; English, Japanese
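
The F-score mentioned for MUC-4 combines precision and recall through a harmonic mean. A minimal sketch follows; the `beta` weighting parameter is the standard generalization and is included as an assumption, since the slide only mentions the plain harmonic mean (`beta = 1`).

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives the plain one)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both measures are zero
    return (1 + beta ** 2) * precision * recall / denominator


balanced = f_score(0.5, 0.5)   # equal inputs give the same value back
skewed = f_score(1.0, 0.5)     # the harmonic mean is pulled toward the lower value
```

Unlike an arithmetic mean, the harmonic mean stays low when either measure is low, so a system cannot score well by maximizing only one of them.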

  29. Historical framework: MUC evolution • MUC-6 (1995): financial news; subtasks: NE, coreference; tasks: TE (template element), ST (scenario template) • MUC-7 (1998): air crashes; new task: TR (template relation)

  30. Historical framework: MUC evolution • MUC-6, MUC-7: partial extractions are discarded • extracted = a + b • relevant = a + d • recall = a / (a + d) • precision = a / (a + b) • (Venn diagram of the extracted and relevant sets, with regions a to d)
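
When partial extractions are discarded, the measures reduce to plain set overlap. The sketch below illustrates this with sets of (slot, value) pairs; representing answers as such pairs, the function name `strict_scores` and the sample data are assumptions made for the example.

```python
def strict_scores(extracted, relevant):
    """Recall and precision where only exact matches (region a) count."""
    a = len(extracted & relevant)  # exactly matching items
    recall = a / len(relevant) if relevant else 0.0
    precision = a / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical system output and answer key.
system = {
    ("instrument", "bomb"),
    ("location", "San Salvador"),
    ("date", "today"),
}
key = {
    ("instrument", "bomb"),
    ("location", "San Salvador"),
    ("perpetrator", "urban guerrilla commandos"),
    ("target", "power tower"),
}
recall, precision = strict_scores(system, key)
```

Here two of the four key items are found (recall 0.5) and two of the three extracted items are correct (precision 2/3).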

  31. Summary • Information Extraction Systems: Introduction, Historical framework, Architecture, Knowledge specific for IE, Examples • Evaluation • Multilinguality • Adaptability

  32. Architecture: General architecture • Hobbs (1993): a cascade of transducers (or modules) that add structure to the text and often discard irrelevant information by applying rules
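
The cascade idea can be sketched as a chain of functions, each consuming the previous stage's output, adding structure or dropping irrelevant material. The stage functions here are toy stand-ins invented for the example (including the keyword filter on "bomb"); they are not the modules of any particular system.

```python
def cascade(stages):
    """Compose stages so that each one transforms the previous stage's output."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run


def to_sentences(text):
    # Adds structure: raw text becomes a list of sentences.
    return [s.strip() for s in text.split(".") if s.strip()]


def keep_relevant(sentences):
    # Drops irrelevant information: a toy rule keeps sentences mentioning a bomb.
    return [s for s in sentences if "bomb" in s.lower()]


def tokenize(sentences):
    # Adds structure: each sentence becomes a list of tokens.
    return [s.split() for s in sentences]


pipeline = cascade([to_sentences, keep_relevant, tokenize])
result = pipeline("A bomb went off this morning. The weather was fine.")
```

Because every stage has the same shape (data in, data out), modules can be added, removed or replaced without changing the rest of the cascade.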

  33. Architecture: Traditional architecture • Input: Document • Modules: Preprocessing, Pattern Matching, Postprocess • Resources: Conceptual Hierarchy, Pattern Base

  34. Architecture: Traditional architecture • Modules: Text Control, Lexical Analysis, Syntactic Analysis, Pattern Matching, Postprocess • Resources: Conceptual Hierarchy, Pattern Base

  35. Architecture: Traditional architecture • Modules: Text Control, Lexical Analysis, Syntactic Analysis, Pattern Matching, Discourse Analysis, Output Template Generation • Resources: Conceptual Hierarchy, Pattern Base, Output Format

  36. Architecture: Text control • Filtering relevant documents • Guessing the language of the documents • Splitting documents into textual zones • Filtering relevant zones • Splitting text into appropriate units (e.g. sentences) • Filtering relevant units • Tokenizing units
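
Two of the text control steps, splitting text into units and tokenizing them, can be sketched with regular expressions. The splitting rule (break after `.`, `!` or `?` followed by whitespace) is a deliberately naive assumption, and the `<...>` output notation simply marks unit boundaries.

```python
import re


def split_units(text):
    """Split text into sentence-like units after ., ! or ? (naive rule)."""
    return [u for u in re.split(r"(?<=[.!?])\s+", text.strip()) if u]


def tokenize(unit):
    """Separate words from punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", unit)


def text_control(text):
    """Return each unit as '<tok tok ...>' with tokens separated by spaces."""
    return ["<" + " ".join(tokenize(u)) + ">" for u in split_units(text)]


units = text_control(
    "Con la edad generalmente palidece sus tonos. "
    "Puede confundirse con otras foliotas comestibles, pero alguna especie es amarga."
)
```

A real system would need extra care for abbreviations and decimal numbers, which this splitting rule would break incorrectly.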

  37. Architecture: Text control • Example

  38. Architecture: Text control • Example: <Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .> … <Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga .> <Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>

  39. Architecture: Lexical analysis • Identifying the morpho-syntactic and semantic categories of words (general lexicon) • Recognizing terminology (specific dictionaries) • Recognizing time expressions, quantities, abbreviations, … • Expanding abbreviations (lists of abbreviations + expansions)
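
A minimal sketch of dictionary-driven lexical analysis follows: tokens are looked up in an abbreviation list and a terminology dictionary, and numbers are recognized by a pattern. The dictionary contents and category labels are hypothetical and only serve the mushroom example.

```python
import re

# Hypothetical resources; a real system would load much larger lists.
ABBREVIATIONS = {"cm": "centímetros"}
TERMINOLOGY = {"sombrero": "mushroom_part", "pie": "mushroom_part"}
NUMBER = re.compile(r"^\d+([.,]\d+)?$")


def analyze(tokens):
    """Attach a category to each token, expanding known abbreviations."""
    analyzed = []
    for token in tokens:
        lowered = token.lower()
        if NUMBER.match(token):
            analyzed.append((token, "number"))
        elif lowered in ABBREVIATIONS:
            analyzed.append((ABBREVIATIONS[lowered], "abbreviation"))
        elif lowered in TERMINOLOGY:
            analyzed.append((token, TERMINOLOGY[lowered]))
        else:
            analyzed.append((token, "other"))
    return analyzed


tagged = analyze("Sombrero de 4 a 8 cm".split())
```

Which dictionaries and categories are needed depends on the extraction scenario, so these resources are the part that changes from one domain to another.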

  40. Architecture: Lexical analysis • Recognizing and classifying proper nouns (named entity recognition and classification, NERC): gazetteers, patterns • Dealing with unknown words • Dealing with lexical ambiguities: POS taggers, WSD (???)
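
Gazetteers and patterns can be combined in a simple tagger: the longest gazetteer entry starting at the current token wins, and a pattern acts as a fallback. The gazetteer entries, the entity labels and the time pattern (a four-digit 24-hour time preceded by "at") are assumptions chosen to fit the bombing example.

```python
import re

# Hypothetical gazetteer: lowercase token tuples mapped to entity classes.
GAZETTEER = {("san", "salvador"): "LOCATION", ("el", "salvador"): "LOCATION"}
MAX_LEN = max(len(entry) for entry in GAZETTEER)
TIME = re.compile(r"^([01]\d|2[0-3])[0-5]\d$")  # e.g. 0650


def tag_entities(tokens):
    """Return (text, class) pairs found by gazetteer lookup or the time pattern."""
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match = (" ".join(tokens[i:i + n]), GAZETTEER[key], n)
                break
        # Pattern fallback: "at" followed by a 24-hour time.
        if match is None and i > 0 and tokens[i - 1].lower() == "at" and TIME.match(tokens[i]):
            match = (tokens[i], "TIME", 1)
        if match:
            entities.append(match[:2])
            i += match[2]
        else:
            i += 1
    return entities


found = tag_entities("a power tower in the northwestern part of San Salvador at 0650 .".split())
```

Gazetteers give high precision on known names, while patterns cover open classes such as times that cannot be listed exhaustively.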

  41. Architecture: Lexical analysis • Example 1 • Highlighted categories (they depend on the scenario): time expressions, mushroom names, abbreviations, numbers, morphological parts • <Sombrero bastante carnoso de 4 a 8cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .> … <Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga .> <Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>

  42. Architecture: Lexical analysis • Example 2 • Highlighted categories: time expressions, locations, organizations, persons, … • <A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .> <According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650 .>

  43. Architecture: Syntactic analysis • Full parsing (Lolita, LaSIE, LaSIE-II) • inefficient, because of the size of the grammars • lacks robustness (out-of-vocabulary words) • treebank grammars • cascaded grammars • solves some problems related to tuning and incompleteness

  44. Architecture: Syntactic analysis • Partial parsing • the most commonly used approach • chunks or phrasal trees (noun phrases, verb phrases, prepositional phrases, adjective phrases, adverb phrases) • no global dependencies
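
Partial parsing can be illustrated with a tiny noun-phrase chunker over part-of-speech tags. This is a sketch under strong assumptions: the tags are supplied by hand in a Penn Treebank-like style, and a chunk is any run of determiner, adjective and noun tags that contains at least one noun.

```python
NP_TAGS = {"DT", "JJ", "NN", "NNP"}
NOUN_TAGS = {"NN", "NNP"}


def np_chunks(tagged):
    """Group (word, tag) pairs into flat noun-phrase chunks."""
    chunks, current = [], []

    def flush():
        # Emit the pending run only if it contains a noun.
        if any(tag in NOUN_TAGS for _, tag in current):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag not in NP_TAGS:
            flush()
        else:
            if tag == "DT":
                flush()  # a determiner always opens a new chunk
            current.append((word, tag))
    flush()
    return chunks


sentence = [
    ("A", "DT"), ("bomb", "NN"), ("went", "VBD"), ("off", "RP"),
    ("this", "DT"), ("morning", "NN"), ("near", "IN"),
    ("a", "DT"), ("power", "NN"), ("tower", "NN"),
    ("in", "IN"), ("San", "NNP"), ("Salvador", "NNP"),
]
chunks = np_chunks(sentence)
```

The chunks are found independently of each other, which is what makes this approach fast and robust, and also why it yields no global dependencies between phrases.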

  45. Architecture: Semantic interpretation • Compositional semantics • full parsing + λ-expressions • LaSIE, LaSIE-II • Entries with λ-expressions in the lexicons • partial parsing + grammatical relations [Vilain,99] • output = logical forms

  46. Architecture: Semantic interpretation • Compositional semantics (example 1) • Sentence: A bomb went off this morning near a power tower in San Salvador … • Parse tree node labels: s, vp, pp, np, np, np, pp • Lexicon entries: go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t)); power_tower → λ(x) (power_tower(x)) • Result: λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
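
The λ-expressions on this slide can be mimicked with curried Python functions: each application binds the outermost variable. In this sketch logical forms are built as plain strings, and the order in which the arguments are applied (place, then time, then instrument) is an assumption that follows the nesting of the λ-variables t, s, r.

```python
# Lexicon entries written as curried functions, mirroring the λ-notation.
go_off = (
    lambda t: lambda s: lambda r: lambda z: lambda y: lambda x:
    f"bombing({x},{y},{z},{r},{s},{t})"
)
power_tower = lambda x: f"power_tower({x})"

# Composition: bind t (place), then s (time), then r (instrument).
place = power_tower("San_Salvador")
partial = go_off(place)("today_morning")("bomb")

# 'partial' still expects z, y and x, like the λ(z) λ(y) λ(x) result on the slide.
# Applying it to placeholder names shows the body of the remaining expression.
body = partial("z")("y")("x")
```

The three unbound variables correspond to slots that this sentence alone does not fill and that later processing may resolve.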

  47. Architecture: Semantic interpretation • Compositional semantics (example 2) • Sentence: A bomb went off this morning near a power tower in San Salvador … • Grammatical relations: subj, time, place, location_of • Logical form: event(bombing, E), subj(bomb, E), time(today_morning, E), place(power_tower, E), location_of(power_tower, San_Salvador)
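
The step from grammatical relations to a logical form can be sketched as a simple mapping: relations that attach to the verb take the event variable as second argument, while relations between entities keep both arguments. The input encoding as (relation, first argument, second argument or `None`) triples is an assumption made for this illustration.

```python
def logical_form(event_type, relations, event_var="E"):
    """Turn grammatical relations into a list of predicate strings."""
    facts = [f"event({event_type}, {event_var})"]
    for name, first, second in relations:
        # A missing second argument means the relation attaches to the event.
        target = second if second is not None else event_var
        facts.append(f"{name}({first}, {target})")
    return facts


facts = logical_form(
    "bombing",
    [
        ("subj", "bomb", None),
        ("time", "today_morning", None),
        ("place", "power_tower", None),
        ("location_of", "power_tower", "San_Salvador"),
    ],
)
```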

  48. Architecture: Semantic interpretation • Pattern matching • applied after partial parsing + SVO dependencies • the most widespread approach • patterns can be implemented in different ways • scenario-driven approach (TE, TR, ST, …) • output = partial templates

  49. Architecture: Semantic interpretation • Pattern matching (example) • Sentence: A bomb went off this morning near a power tower in San Salvador … • Pattern: np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location) → INSTRUMENT := C-instrument, DATE := C-time, PHIS_TARGET := C-place, LOCATION := C-location
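
The pattern on this slide can be sketched as an ordered list of elements matched against a chunked sentence, filling a partial template as it goes. The encoding of chunks as (kind, text, label) triples is an assumption, and the matcher is simplified: it allows gaps between all pattern elements, whereas the slide's pattern only allows them where "…" appears.

```python
# Pattern elements: (kind, label, slot to fill or None).
PATTERN = [
    ("np", "C-instrument", "INSTRUMENT"),
    ("vp", "go_off", None),
    ("np", "C-time", "DATE"),
    ("word", "near", None),
    ("np", "C-place", "PHIS_TARGET"),
    ("word", "in", None),
    ("np", "C-location", "LOCATION"),
]


def match(pattern, chunks):
    """Return a partial template if the pattern occurs in order, else None."""
    template, p = {}, 0
    for kind, text, label in chunks:
        if p == len(pattern):
            break
        p_kind, p_label, slot = pattern[p]
        if (kind, label) == (p_kind, p_label):
            if slot:
                template[slot] = text
            p += 1
    return template if p == len(pattern) else None


# Hand-annotated chunks for the example sentence (labels are assumed).
CHUNKS = [
    ("np", "A bomb", "C-instrument"),
    ("vp", "went off", "go_off"),
    ("np", "this morning", "C-time"),
    ("word", "near", "near"),
    ("np", "a power tower", "C-place"),
    ("word", "in", "in"),
    ("np", "San Salvador", "C-location"),
]
partial_template = match(PATTERN, CHUNKS)
```

The result is a partial template: only the slots this one sentence supports are filled, and the remaining slots are left to later stages.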

  50. Architecture: Discourse analysis • Inter-sentence analysis • Co-reference resolution • Ellipsis resolution • Alias resolution • Traditional semantic interpretation procedures • Template merging procedures • Inference procedures • Open-domain and domain-specific knowledge for inferences
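
Template merging can be sketched as combining two partial templates when they do not contradict each other. The compatibility rule used here (shared slots must hold identical values) is a simplifying assumption; real systems also rely on co-reference and alias resolution to decide whether two values denote the same entity.

```python
def merge(t1, t2):
    """Merge two partial templates, or return None if any shared slot conflicts."""
    for slot in t1.keys() & t2.keys():
        if t1[slot] != t2[slot]:
            return None
    return {**t1, **t2}


first = {"INCIDENT": "bombing", "INSTRUMENT": "bomb"}
second = {"INSTRUMENT": "bomb", "PERPETRATOR": "urban guerrilla commandos"}
merged = merge(first, second)

# A template describing a different instrument is not merged.
conflict = merge(first, {"INSTRUMENT": "grenade"})
```

In the bombing example, this is how information from the first sentence (instrument, target) could be combined with the perpetrator mentioned in the second sentence.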
