Information Extraction Jordi Turmo TALP Research Centre Dep. Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya turmo@lsi.upc.edu http://www.lsi.upc.edu/~turmo Adaptive Information Extraction
Summary
• Information Extraction Systems
  • Introduction
  • Historical framework
  • Architecture
  • Knowledge specific for IE
  • Examples
• Evaluation
• Multilinguality
• Adaptability
Introduction: Definition
• Goal: locate and extract, in a specific format, the relevant information contained in a collection of documents
• Input requirements: extraction scenario and document collection
• Output requirements: output format
Introduction: Typology
• Different points of view:
  • conceptual coverage: restricted-domain IE vs. open-domain IE
  • language coverage: monolingual IE vs. multilingual IE
  • media coverage: written text IE, speech IE, image IE, multimedia IE
  • document type: IE from free text, from semi-structured documents, from structured documents (including Web pages in HTML and XML)
Introduction: Example 1: Structured documents
• Web pages
• A list of members of an organization per document
• English
• Scenario of extraction
  • Name, degree, school and affiliation of the member
Introduction: Example 1: Structured documents

Name        Degree   School    Affiliation
WL Hsu      PhD      Cornell   IIS, Sinica
CS Ho       PhD      NTU       EE,NTIT
C.Chen      PhD      SUNY      EE,NTIT
C.Wu        PhD      Utexas    Cedu,NNU
Mark Liao   PhD      NWU       IIS, Sinica
CJ Liau     PhD      NTU       IIS, Sinica
WK Cheng    PhD      TKU       Tunghai
WC Wang     MS       Syracus   FIT
...
Introduction: Example 2: Semi-structured documents
• 485 seminar announcements
• A description of one seminar per document
• English
• Scenario of extraction
  • Speaker, location, start time and end time of the seminar
Introduction: Example 3: Free text
• 318 Wall Street Journal articles
• A description of an incident per document
• English
• Scenario of extraction
  • Type of incident, perpetrator, target, date, location, effects and instrument
Introduction: Example 3: Free text

Source text:
"A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb (allegedly detonated by urban guerrilla commandos) blew up a power tower in the northwestern part of San Salvador at 0650."

Extracted template:
• Incident type: bombing
• Date: March 19
• Location: El Salvador: San Salvador (city)
• Perpetrator: urban guerrilla commandos
• Physical target: power tower
• Human target: -
• Effect on physical target: destroyed
• Effect on human target: no injury or death
• Instrument: bomb
Introduction: Example 4: Free text
• 78 documents
• A description of one mushroom per document
• Spanish
• Scenario of extraction
  • Colors of the parts of mushrooms and the circumstances in which they occur
Introduction: Example 4: Free text

Source text:
"El color blanco de su sombrero pasa a amarillo crema al corte. El sombrero ennegrece si se corta."
(English: "The white color of its cap turns cream yellow when cut. The cap blackens if it is cut.")

Extracted templates:
• color_1: base: blanco, tono: indef, luz: indef
• color_2: base: amarillo, tono: crema, luz: indef
• color_3: base: indef, tono: negro, luz: indef
• Sombrero_1: color: virar_1
  • virar_1: inicio: , final: , causa: corte
• Sombrero_2: color: virar_2
  • virar_2: inicio: indef, final: , causa: corte
Introduction: Example 5: Combination
• 78 documents
• A description of one mushroom per document
• Spanish
• Scenario of extraction
  • Names of the mushroom in different languages, etymology
  • Colors of the parts of mushrooms and the circumstances in which they occur
Introduction: Applications
• IE from the Web
• Building of news DBs
• Information Integration
• Support for QA and Summarization
• …
• Limitation when P < 80%
Introduction: References
• D.E. Appelt, D.J. Israel, 1999
• E. Hovy, 1999
• R.J. Mooney, C. Cardie, 1999
• Muslea, 1999
• J. Cowie, Y. Wilks, 2000
• M.T. Pazienza, 2003
• Turmo, 2003
• Turmo et al., 2005
Introduction: Recent events
• IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001)
• ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003)
• AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004)
• EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
• COLING-ACL 06 Workshop on Information Extraction Beyond the Document
• ECAI 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
Historical framework: Origin of IE
• Acquisition of the relevant information involved in knowledge-based systems
• Traditionally: domain experts → manual process → relevant information (high human cost)
• 1980s: text sources → Text-Based Intelligent Systems → relevant information
Historical framework: Origin of IE
• Text-Based Intelligent Systems (TBIS)
  • Information Retrieval
  • Information Integration
  • Information Filtering
  • Information Routing
  • Information Extraction
  • Document Classification
  • Question Answering
  • Automatic Summarization
  • Topic Detection & Tracking
  • ...
Historical framework: Relevant Historical Programs
• Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86)
• In the USA
  • (1987-1991): MUC [US Navy]
  • TIPSTER (1991-1998): MUC [DARPA]
  • TIDES (1999-): ACE [NIST]
• In Europe
  • LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE
  • PASCAL Network of Excellence (2003-)
Historical framework: MUC Evolution
• MUC-1 (1987)
  • naval operations
  • auto-definition of scenarios
  • auto-evaluation
• MUC-2 (1989)
  • naval operations
  • output structure with 10 attributes (type of event, agent, place, ...)
  • auto-evaluation
Historical framework: MUC Evolution
• MUC-3 (1991)
  • Latin-American terrorism
  • output structure with 18 attributes (type of incident, date, place, ...)
  • recall and precision measures

(Diagram: the extracted set and the relevant set overlap and are split into regions a, b, c, d, e and f; f is the partially extracted region.)

extracted = a + b + e + f
relevant = a + f + d
recall = (a + 0.5 f) / (a + f + d)
precision = (a + 0.5 f) / (a + f + b + e)
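The MUC-3 measures can be sketched in a few lines of Python. This is only an illustration of the formulas on the slide: the function name `muc3_scores` and the sample counts are hypothetical, and the region letters are used exactly as they appear in the formulas, with f counted at half weight.

```python
def muc3_scores(a, b, d, e, f):
    """Recall and precision with half credit for partial extractions.

    a, b, d, e, f are the region counts from the slide's formulas:
    extracted = a + b + e + f and relevant = a + f + d.
    """
    extracted = a + b + e + f
    relevant = a + f + d
    credit = a + 0.5 * f  # partially extracted items count as one half
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    return recall, precision


# Hypothetical counts, chosen only to exercise the formulas.
recall, precision = muc3_scores(a=6, b=1, d=2, e=1, f=2)
print(recall, precision)
```

With these made-up counts both sets have 10 items and the credited amount is 7, so both measures come out at 0.7.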
Historical framework: MUC Evolution
• MUC-4 (1992)
  • Latin-American terrorism
  • 24 attributes
  • F-score (harmonic average)
• MUC-5 (1993)
  • Financial news, microelectronics
  • English, Japanese
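As a rough sketch of the F-score introduced in MUC-4, the harmonic average of precision and recall can be computed as below. The `beta` weighting parameter is the usual generalization and is an addition to what the slide states; with the default `beta=1.0` the function is the plain harmonic average. The function name is hypothetical.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic average of precision and recall.

    beta is an assumption beyond the slide: beta=1 weights both
    measures equally, beta>1 favours recall.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(f_score(0.5, 0.5))
print(f_score(1.0, 0.5))
```

The harmonic average penalizes an imbalance between the two measures: a system with precision 1.0 and recall 0.5 scores about 0.67, below their arithmetic mean of 0.75.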
Historical framework: MUC Evolution
• MUC-6 (1995)
  • financial news
  • subtasks: NE, coreference
  • tasks: TE (template element), ST (scenario template)
• MUC-7 (1998)
  • air crashes
  • new task: TR (template relation)
Historical framework: MUC Evolution
• MUC-6, MUC-7
  • Partial extractions are discarded

(Diagram: the extracted set and the relevant set overlap and are split into regions a, b, c and d.)

extracted = a + b
relevant = a + d
recall = a / (a + d)
precision = a / (a + b)
Architecture: General Architecture
• Hobbs, 93:
  • Cascade of transducers (or modules) that add structure to text and, often, drop out irrelevant information by applying rules
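The cascade idea can be sketched as a chain of functions, each consuming the output of the previous one. This is a toy illustration, not Hobbs' system: the three stages and the keyword filter are hypothetical stand-ins for real modules.

```python
from functools import reduce


def run_cascade(text, stages):
    """Feed the text through each stage in order."""
    return reduce(lambda data, stage: stage(data), stages, text)


def split_sentences(text):
    # Adds structure: one string per sentence (naive split on full stops).
    return [s.strip() for s in text.split(".") if s.strip()]


def drop_irrelevant(sentences):
    # Drops information: keeps only sentences mentioning a trigger word.
    return [s for s in sentences if "bomb" in s.lower()]


def tokenize(sentences):
    # Adds structure: one token list per remaining sentence.
    return [s.split() for s in sentences]


stages = [split_sentences, drop_irrelevant, tokenize]
print(run_cascade("A bomb went off. The weather was mild.", stages))
```

Keeping every stage as a plain function makes it easy to insert, remove or reorder modules, which is the property the cascade design relies on.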
Architecture: Traditional Architecture
(Diagram: Document → Preprocessing → Pattern Matching → Postprocess → Output Format)
• Preprocessing
  • Text Control
  • Lexical Analysis
  • Syntactic Analysis
• Pattern Matching
• Postprocess
  • Discourse Analysis
  • Output Template Generation
• Knowledge resources: Conceptual Hierarchy, Pattern Base
Architecture: Text control
• Filtering relevant documents
• Guessing the language of the documents
• Splitting documents into textual zones
• Filtering relevant zones
• Splitting text into appropriate units (e.g. sentences)
• Filtering relevant units
• Tokenizing units
Architecture: Text control
• Example (each unit is delimited by angle brackets)

<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .>
<Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .>
<Con la edad generalmente palidece sus tonos .>
…
<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga .>
<Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas .>
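Three of the text control steps (splitting into units, filtering relevant units, tokenizing) might look like the following sketch. The regular expressions and the keyword-based relevance test are simplifying assumptions, and all function names are hypothetical; real systems use far more careful rules.

```python
import re


def split_units(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", text) if u.strip()]


def filter_units(units, keywords):
    """Keep units containing at least one keyword (case-insensitive)."""
    return [u for u in units if any(k in u.lower() for k in keywords)]


def tokenize(unit):
    """Separate words from punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", unit)


text = "Sombrero de 4 a 8 cm, convexo. Con la edad palidece sus tonos."
units = split_units(text)
relevant = filter_units(units, {"sombrero"})
print([tokenize(u) for u in relevant])
```

Because `\w` is Unicode-aware for Python strings, accented words such as "húmedo" stay in one token.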
Architecture: Lexical analysis
• Identifying morpho-syntactic categories and semantic categories of words
  • General lexicon
• Recognizing terminology words
  • Specific dictionaries
• Recognizing time expressions, quantities, abbreviations, …
• Expanding abbreviations
  • Lists of abbreviations + expansions
Architecture: Lexical analysis
• Recognizing and classifying proper nouns (Named Entities, NERC)
  • Gazetteers
  • Patterns
• Dealing with unknown words
• Dealing with lexical ambiguities
  • POS taggers
  • WSD (???)
Architecture: Lexical analysis
• Example 1
  • Categories marked in the text (they depend on the scenario): time expressions, mushroom names, abbreviations, numbers, morphological parts

<Sombrero bastante carnoso de 4 a 8cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .>
<Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .>
<Con la edad generalmente palidece sus tonos .>
…
<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga .>
<Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas .>
Architecture: Lexical analysis
• Example 2
  • Categories marked in the text: time expressions, locations, organizations, persons, …

<A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .>
<According to unofficial sources , the bomb (allegedly detonated by urban guerrilla commandos) blew up a power tower in the northwestern part of San Salvador at 0650 .>
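Two of the lexical analysis resources, an abbreviation list and a gazetteer, can be sketched as dictionary lookups. The table contents, the `LOCATION` label and the longest-match strategy are illustrative assumptions, not the resources of any particular system.

```python
# Hypothetical resources, for illustration only.
ABBREVIATIONS = {"cm": "centimetres"}
GAZETTEER = {("San", "Salvador"): "LOCATION"}


def expand_abbreviations(tokens, table):
    """Replace each token that appears in the abbreviation list."""
    return [table.get(t, t) for t in tokens]


def tag_entities(tokens, gazetteer):
    """Longest-match gazetteer lookup; returns (start, end, label) spans."""
    max_len = max((len(key) for key in gazetteer), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(tokens[i:i + n]))
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


tokens = "a power tower in San Salvador".split()
print(tag_entities(tokens, GAZETTEER))
print(expand_abbreviations(["8", "cm"], ABBREVIATIONS))
```

A gazetteer alone cannot handle unknown names or ambiguous ones, which is why the slides pair it with patterns and with POS tagging.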
Architecture: Syntactic analysis
• Full parsing (Lolita, LaSIE, LaSIE-II)
  • inefficient; large grammars
  • lacks robustness (out-of-vocabulary words)
  • treebank grammars
  • cascaded grammars
  • Solves some problems related to tuning and incompleteness
Architecture: Syntactic analysis
• Partial parsing
  • the most commonly used
  • chunks or phrasal trees (noun phrases, verbal phrases, prepositional phrases, adjectival phrases, adverbial phrases)
  • absence of global dependencies
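A very small chunker gives a feel for partial parsing: it groups adjacent tokens into flat noun phrases and builds no links between chunks. The tag names and the rule "a run of DET/ADJ/NOUN containing at least one NOUN" are assumptions made for this sketch; POS tags are taken as given input.

```python
NP_TAGS = {"DET", "ADJ", "NOUN"}


def chunk_noun_phrases(tagged):
    """Group maximal DET/ADJ/NOUN runs with a NOUN into flat NP chunks."""
    chunks, run = [], []

    def flush():
        if any(tag == "NOUN" for _, tag in run):
            chunks.append(("NP", [word for word, _ in run]))
        else:
            # A run without a noun is not a noun phrase: emit tokens as-is.
            chunks.extend((tag, [word]) for word, tag in run)
        run.clear()

    for word, tag in tagged:
        if tag in NP_TAGS:
            run.append((word, tag))
        else:
            flush()
            chunks.append((tag, [word]))
    flush()
    return chunks


tagged = [("a", "DET"), ("bomb", "NOUN"), ("went", "VERB"), ("off", "PART"),
          ("near", "ADP"), ("a", "DET"), ("power", "NOUN"), ("tower", "NOUN")]
print(chunk_noun_phrases(tagged))
```

The output is a flat sequence of chunks; nothing records that "near a power tower" modifies "went off", which is the absence of global dependencies noted above.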
Architecture: Semantic interpretation
• Compositional semantics
  • full parsing + λ-expressions
    • LaSIE, LaSIE-II
    • Entries with λ-expressions in the lexicons
  • partial parsing + grammatical relations [Vilain, 99]
  • output = logical forms
Architecture: Semantic interpretation
• Compositional semantics (example 1)

Sentence: A bomb went off this morning near a power tower in San Salvador …
(Parse tree over the sentence with nodes s, vp, pp and np.)

Lexicon entries:
  go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))
  power_tower → λ(x) (power_tower(x))

Result:
  λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
Architecture: Semantic interpretation
• Compositional semantics (example 2)

Sentence: A bomb went off this morning near a power tower in San Salvador …
Grammatical relations marked on the sentence: subj, time, place, location_of

Logical form:
  event(bombing, E)
  subj(bomb, E)
  time(today_morning, E)
  place(power_tower, E)
  location_of(power_tower, San_Salvador)
Architecture: Semantic interpretation
• Pattern matching
  • after partial parsing + SVO dependencies
  • the most widespread approach
  • patterns can be implemented in different ways
  • scenario-driven approach (TE, TR, ST, …)
  • output = partial templates
Architecture: Semantic interpretation
• Pattern matching (example)

Sentence: A bomb went off this morning near a power tower in San Salvador …

Pattern:
  np(C-instrument) … vp(go_off) … np(C-time) … "near" np(C-place) "in" np(C-location)
  →
  INSTRUMENT := C-instrument
  DATE := C-time
  PHIS_TARGET := C-place
  LOCATION := C-location
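The pattern in this example can be approximated by a matcher over a chunk sequence. Everything below is a simplified sketch: the chunk representation, the `gap_ok` flag standing in for "…", and the greedy left-to-right matching without backtracking are assumptions, and the concept labels are taken as already assigned by earlier stages.

```python
# Each chunk: (category, concept_or_lemma, text). Labels are assumed given.
CHUNKS = [
    ("np", "C-instrument", "A bomb"),
    ("vp", "go_off", "went off"),
    ("np", "C-time", "this morning"),
    ("word", "near", "near"),
    ("np", "C-place", "a power tower"),
    ("word", "in", "in"),
    ("np", "C-location", "San Salvador"),
]

# Each element: (category, concept_or_lemma, slot_or_None, gap_ok),
# where gap_ok mirrors a "…" before the element in the slide's pattern.
PATTERN = [
    ("np", "C-instrument", "INSTRUMENT", True),
    ("vp", "go_off", None, True),
    ("np", "C-time", "DATE", True),
    ("word", "near", None, True),
    ("np", "C-place", "PHIS_TARGET", False),
    ("word", "in", None, False),
    ("np", "C-location", "LOCATION", False),
]


def match_pattern(chunks, pattern):
    """Return a partial template (slot -> text), or None if no match."""
    template = {}
    pos = 0
    for category, value, slot, gap_ok in pattern:
        while pos < len(chunks) and chunks[pos][:2] != (category, value):
            if not gap_ok:
                return None  # element must be adjacent to the previous one
            pos += 1
        if pos == len(chunks):
            return None  # ran out of chunks before finishing the pattern
        if slot:
            template[slot] = chunks[pos][2]
        pos += 1
    return template


print(match_pattern(CHUNKS, PATTERN))
```

The result is a partial template in the sense of the previous slide: only the slots the pattern mentions are filled.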
Architecture: Discourse analysis
• Inter-sentence analysis
• Co-reference resolution
• Ellipsis resolution
• Alias resolution
• Traditional semantic interpretation procedures
• Template merging procedures
• Inference procedures
• Open-domain and domain-specific knowledge for inferences
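One simple way to picture a template merging procedure is to combine two partial templates whenever their filled slots do not conflict. The slides do not specify the merging criterion, so the compatibility rule below, the function name and the slot names are all assumptions made for illustration.

```python
def merge_templates(first, second):
    """Merge two partial templates, or return None if any slot conflicts.

    A slot set to None is treated as unfilled.
    """
    merged = dict(first)
    for slot, value in second.items():
        if value is None:
            continue
        if merged.get(slot) not in (None, value):
            return None  # both templates fill the slot with different values
        merged[slot] = value
    return merged


# Hypothetical partial templates from two sentences of the same document.
t1 = {"INCIDENT": "bombing", "PHIS_TARGET": "power tower", "PERPETRATOR": None}
t2 = {"INCIDENT": "bombing", "PERPETRATOR": "urban guerrilla commandos"}
print(merge_templates(t1, t2))
print(merge_templates({"INCIDENT": "bombing"}, {"INCIDENT": "kidnapping"}))
```

Real systems usually add further checks before merging, for example that co-reference resolution links the entities involved.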