information extraction

Information Extraction • Sunita Sarawagi

Management of Information Extraction System • Performance Optimization • Handling Change • Integration of Extracted Information • Imprecision of Extraction

Performance Optimization • Two modes of extraction system • The unstructured source is naturally available • The unstructured source is open-ended and large

Document Selection Strategies • When the source is really large… • Manually restricting the set • Focused crawling • Searching via keywords

Document Selection Strategies • Index Search Techniques • Standard IR-style keyword queries, e.g. “vaccine” and “cure” • Pattern queries: “Thomas \w+ Edison” is preferable to “Thomas NEAR Edison” • Index Design for Efficient Extraction • The index should support proximity queries, regular expression patterns, and storage of tags • Cafarella and Etzioni [38]
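The difference between a pattern query and a proximity query can be sketched with Python's `re` module. This is only an illustration on a made-up sentence: it assumes the pattern in the slide is the regex `Thomas \w+ Edison`, and the `near` helper is a hypothetical stand-in for an index's NEAR operator, not the interface of any real search engine.

```python
import re

# Made-up text for illustration only.
text = "Thomas Alva Edison patented the phonograph. Edison admired Thomas Paine."

# Pattern query: exactly one word between the two names, captured as the middle name.
pattern_hits = re.findall(r"Thomas (\w+) Edison", text)

def near(tokens, a, b, window=5):
    """Hypothetical NEAR: True if a and b occur within `window` tokens, in any order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

tokens = re.findall(r"\w+", text)
print(pattern_hits)
print(near(tokens, "Thomas", "Edison"))
```

The NEAR query only says that the document is relevant, while the pattern query pins down the span that the extractor actually needs.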

Efficient Querying of Entity Database for Extraction • Similarity search against the database • Can be an expensive operation • E.g. extracting book titles from blogs • Batch-Top-K search • Goal: to find each possible segment in x whose similarity to an entry in D (the database of entities) is greater than a threshold ε • Concentrates on the TF-IDF similarity score • Chandel et al. [49]
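To see why this search is expensive, here is a naive baseline that scores every segment of x against every entry of D with a TF-IDF cosine similarity. It is a brute-force sketch, not the batch algorithm of Chandel et al. [49]; the function names, the IDF smoothing, the `max_len` cap, and the sample data are all assumptions made for the example.

```python
import math
from collections import Counter

def weigh(tokens, idf, default_idf):
    """Unit-length TF-IDF vector; tokens unseen in D get the default (highest) IDF."""
    tf = Counter(tokens)
    vec = {t: c * idf.get(t, default_idf) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def matching_segments(x, entries, eps, max_len=6):
    """Return (segment, entry, score) for every segment of x scoring above eps."""
    n = len(entries)
    df = Counter(t for e in entries for t in set(e.lower().split()))
    idf = {t: math.log(1 + n / c) for t, c in df.items()}
    default_idf = math.log(1 + n)
    entry_vecs = [weigh(e.lower().split(), idf, default_idf) for e in entries]
    tokens = x.lower().split()
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            seg_vec = weigh(tokens[i:j], idf, default_idf)
            for entry, entry_vec in zip(entries, entry_vecs):
                score = cosine(seg_vec, entry_vec)
                if score > eps:
                    hits.append((" ".join(tokens[i:j]), entry, score))
    return hits

# Made-up database and blog sentence.
D = ["War and Peace", "Pride and Prejudice"]
x = "I just finished reading war and peace yesterday"
hits = matching_segments(x, D, eps=0.9)
print(hits)
```

The triple loop over segment starts, segment lengths, and database entries is what a batch top-k method tries to avoid.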

Handling Change • Incremental Extractions on Changing Sources • “An easy optimization, with clear scope for performance boost, is to run the extractor only on the changed portions of a page instead of the entire page.” • Detecting the unchanged regions of the page is the key.
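One simple way to find the changed portions is a line-level diff between the old and new versions of a page. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; it assumes the extractor works on one line at a time, which real extractors often do not, and the page contents and the digit extractor are made up.

```python
import difflib
import re

def changed_regions(old_lines, new_lines):
    """Return (start, end) index ranges in new_lines that are new or modified."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [(j1, j2) for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag in ("replace", "insert")]

def extract_incrementally(old_lines, new_lines, extractor):
    """Run the extractor only on lines that changed since the previous version."""
    results = []
    for start, end in changed_regions(old_lines, new_lines):
        for line in new_lines[start:end]:
            results.extend(extractor(line))
    return results

# Made-up page versions and a toy extractor that pulls out digit runs.
old = ["Price: 10", "Stock: 5", "Contact: a@x.org"]
new = ["Price: 12", "Stock: 5", "Contact: a@x.org", "Phone: 555-0100"]
regions = changed_regions(old, new)
fresh = extract_incrementally(old, new, lambda line: re.findall(r"\d+", line))
print(regions, fresh)
```

Extractions from the unchanged regions would be carried over from the previous run rather than recomputed.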

Handling Change • Detecting When Extractors Fail on Evolving Data • Defining Characteristic Patterns • DataProg • A pattern is selected when its frequency is statistically significant • Avoids choosing very specific patterns • E.g. “4676 Admiralty Way”, “10924 Pico Boulevard”, “512 Oak Street”, “2431 Main Street”, “5257 Adams Boulevard” • P1: Number UpperCaseWord Boulevard • P2: Number UpperCaseWord Street • P3: Number UpperCaseWord Way
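The address example can be reproduced with a small token-typing routine. This is a simplified illustration of generalizing tokens into types, not the DataProg algorithm itself: the two token types and the choice to keep the last token literal are assumptions picked so that the output matches P1 to P3.

```python
import re
from collections import Counter

def token_type(tok):
    """Map a token to a coarse type; anything unrecognized stays literal."""
    if tok.isdigit():
        return "Number"
    if re.fullmatch(r"[A-Z][a-z]+", tok):
        return "UpperCaseWord"
    return tok

def pattern(example):
    """Generalize every token except the last, which is kept as a literal."""
    toks = example.split()
    types = [token_type(t) for t in toks]
    types[-1] = toks[-1]
    return " ".join(types)

addresses = [
    "4676 Admiralty Way",
    "10924 Pico Boulevard",
    "512 Oak Street",
    "2431 Main Street",
    "5257 Adams Boulevard",
]
counts = Counter(pattern(a) for a in addresses)
for pat, freq in counts.most_common():
    print(freq, pat)
```

A significance test on these frequencies would then decide which patterns are kept as characteristic.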

Handling Change • Detecting When Extractors Fail on Evolving Data • Defining Significant Change • The distribution represented by Fi’ is said to be statistically different from Fi if the expected values ei’ of the counts in (D’, S’), obtained by extrapolating from Fi, differ substantially from Fi’ (measured with the χ² statistic)
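A possible reading of this test is sketched below. It assumes the standard Pearson form of the χ² statistic, the sum over patterns of (observed − expected)² / expected, and it assumes that the expected count of a pattern is extrapolated by scaling its old count by the ratio of new to old data sizes. Both are assumptions for illustration; the function name, parameters, and counts are hypothetical.

```python
def chi_square(old_counts, new_counts, n_old, n_new):
    """Pearson chi-square between observed new counts and counts extrapolated from old.

    old_counts / new_counts: pattern -> frequency in the old / new data.
    n_old / n_new: sizes of the old / new data (assumed scaling factor).
    Patterns that appear only in new_counts are ignored in this sketch.
    """
    stat = 0.0
    for pat, f_old in old_counts.items():
        expected = f_old * n_new / n_old      # assumed proportional extrapolation
        observed = new_counts.get(pat, 0)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up pattern counts.
old = {"P1": 50, "P2": 30, "P3": 20}
same_mix = {"P1": 100, "P2": 60, "P3": 40}
shifted = {"P1": 10, "P2": 20, "P3": 170}
print(chi_square(old, same_mix, 100, 200))
print(chi_square(old, shifted, 100, 200))
```

The resulting statistic is compared against a critical value of the χ² distribution for the chosen significance level; a large value suggests that the extractor's output no longer looks like the data it was built for.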

Integration of Extracted Information • Main Challenge in Integration of Extracted Information: • Deduplication, coreference resolution, record linkage • Solution • “Ideally, extraction of all repeated mentions should be done simultaneously and jointly with integration with existing sources.” • Decoupled Extractions and Integration • Decoupled Extraction and Collective Integration • Coupled Extraction and Integration

Integration of Extracted Information • Decoupled Extractions and Integration • Extraction and integration happen independently • The duplicate decision is made during integration • Binary classification • Input: a pair of records; output: a binary decision (duplicate or not) • Uses similarity functions (cosine similarity, edit distance, Jaccard similarity, and Soundex)
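Two of the listed similarity functions are easy to write out, and a pair of thresholds can stand in for the binary classifier. The thresholds below are hand-set assumptions for illustration; in practice the decision function would be learned from labeled duplicate and non-duplicate pairs.

```python
def jaccard(a, b):
    """Jaccard similarity over lower-cased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def is_duplicate(r1, r2, max_edits=2, min_jaccard=0.5):
    """Hypothetical rule-based stand-in for a trained pairwise classifier."""
    if edit_distance(r1.lower(), r2.lower()) <= max_edits:
        return True
    return jaccard(r1, r2) >= min_jaccard

print(is_duplicate("Alistair MacLean", "Alistair Mclean"))
print(is_duplicate("Alistair MacLean", "Bruce Lindsay"))
```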

Integration of Extracted Information • Decoupled Extractions and Integration • Example of a decision tree created on similarity function

Integration of Extracted Information • Decoupled Extractions and Integration • Sequential Process • The classifier is applied to the pair (r, e) for an extracted record r and each entry e in the existing database D, and returns “yes” or “no” • If the answer is “no” for all entries, r is added as a new entry; otherwise it is integrated with the best-matching entry e • The sequential process can be sped up considerably through index lookups that efficiently find likely matches
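The sequential process amounts to a loop over the database. In this sketch the classifier is replaced by a string-similarity score from `difflib` with a hand-picked threshold, both of which are assumptions; the linear scan is exactly the part that an index lookup would replace.

```python
import difflib

def match_score(r, e):
    """Stand-in for the classifier's confidence that r and e are duplicates."""
    return difflib.SequenceMatcher(None, r.lower(), e.lower()).ratio()

def integrate(r, database, threshold=0.85):
    """Merge r into its best-matching entry, or append it as a new entry.

    Returns the index of the entry that r ends up associated with.
    """
    best_idx, best_score = None, threshold
    for idx, e in enumerate(database):       # linear scan over D
        score = match_score(r, e)
        if score >= best_score:              # a "yes" that beats the best so far
            best_idx, best_score = idx, score
    if best_idx is None:                     # "no" for every entry: r is new
        database.append(r)
        return len(database) - 1
    return best_idx

# Made-up database of names.
db = ["Alistair MacLean", "Bruce Lindsay"]
print(integrate("Alistair Mclean", db))
print(integrate("Jim Gray", db), db)
```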

Integration of Extracted Information • Decoupled Extraction and Collective Integration • E.g. R1. Alistair MacLean, R2. A McLean, R3. Alistair Mclean • Transitivity: if A = B and B = C, then A = C • Cast the collective integration of multiple records as a graph partitioning problem • Nodes: records • An edge between ei and ej carries a weighted score wij • Sign: duplicate or non-duplicate • Magnitude: confidence in this outcome
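Transitivity alone can be enforced by taking the transitive closure of the pairwise "duplicate" decisions, for instance with a union-find structure. The sketch below does only that and ignores the edge weights, so it is a simpler baseline than the graph partitioning formulation; the pair list is a made-up outcome for R1 to R3.

```python
def transitive_closure(n, duplicate_pairs):
    """Group record indices 0..n-1 so that duplicate decisions are transitive."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)           # union the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Indices 0, 1, 2 stand for R1, R2, R3. Suppose the pairwise classifier links
# R1 with R3 and R3 with R2, but is unsure about R1 with R2.
print(transitive_closure(3, [(0, 2), (2, 1)]))
```

A single wrong "duplicate" edge merges two whole groups under this scheme, which is one reason to weigh the edges instead.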

Integration of Extracted Information • Decoupled Extraction and Collective Integration • Correlation Clustering (CC)
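One common formulation of correlation clustering looks for the partition that agrees most with the signed edge weights: positive edges should fall inside a cluster and negative edges across clusters. The sketch below assumes that a positive sign means "duplicate" and finds the best partition by exhaustive search, which is only feasible for a handful of records; practical systems use approximation algorithms. The weights are made up.

```python
def partitions(items):
    """Yield every partition of a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def agreement(partition, w):
    """Total |weight| of edges whose sign agrees with the partition."""
    label = {node: k for k, group in enumerate(partition) for node in group}
    total = 0.0
    for (i, j), wij in w.items():
        same = label[i] == label[j]
        if (wij > 0 and same) or (wij < 0 and not same):
            total += abs(wij)
    return total

def best_partition(nodes, w):
    """Exhaustive search; exponential in the number of nodes."""
    return max(partitions(list(nodes)), key=lambda p: agreement(p, w))

# Nodes 0, 1, 2 with one weak "non-duplicate" edge and two strong "duplicate" edges.
w = {(0, 1): -0.2, (0, 2): 0.9, (1, 2): 0.8}
best = best_partition([0, 1, 2], w)
print(sorted(sorted(g) for g in best))
```

Here the two strong positive edges outweigh the weak negative one, so all three records end up in one cluster.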

Integration of Extracted Information • Decoupled Extraction and Collective Integration • Collective Multi-attribute Integration • When the extracted information spans multiple columns, collective integration can have an even greater impact

Integration of Extracted Information • Coupled Extraction and Integration • Joint extraction and integration • Little to be gained: • when the database is not guaranteed to be complete • when we are extracting single entities at a time • Can boost accuracy: • when extracting records or multi-way relationships consisting of multiple entity subtypes

Integration of Extracted Information • Coupled Extraction and Integration • E.g. “In his foreword to Transaction Processing Concepts and Techniques, Bruce Lindsay” • Existing Books database: • Book names, where one of the entries is “Transaction Processing: Concepts and Techniques” • People names, consisting of entries like “A. Reuters”, “J. Gray”, “B. Lindsay”, “D Knuth”, and so on • Authors table linking the book titles with the people who wrote them

Imprecision of Extraction • Confidence Values for Single Extractions • Two ways of representing the imprecision of extraction • Associate each extracted item with a probability value • Output multiple possible extractions

Imprecision of Extraction • Confidence Values for Single Extractions • Reliability Plot • A useful visual tool to measure the soundness of the probabilities • X-axis: binned probabilities output by classifier • Y-axis: fraction of test instances in that probability bin whose predictions are correct
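The points of a reliability plot can be computed by binning the predicted probabilities and measuring the accuracy in each bin. The sketch below assumes equal-width bins and returns the points rather than drawing them; the function name and sample predictions are made up.

```python
def reliability_bins(probs, correct, n_bins=10):
    """Return (bin_midpoint, fraction_correct, count) for each non-empty bin.

    probs:   predicted probabilities in [0, 1]
    correct: booleans, True where the corresponding prediction was right
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(probs, correct):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes into the last bin
        totals[b] += 1
        hits[b] += int(ok)
    return [((b + 0.5) / n_bins, hits[b] / totals[b], totals[b])
            for b in range(n_bins) if totals[b]]

# Made-up predictions.
probs = [0.95, 0.92, 0.91, 0.15, 0.12]
correct = [True, True, False, False, False]
bins = reliability_bins(probs, correct)
for midpoint, accuracy, count in bins:
    print(midpoint, accuracy, count)
```

For well-calibrated probabilities the points lie close to the diagonal, that is, the fraction correct in a bin is close to the bin's midpoint.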

Imprecision of Extraction • Multi-attribute Extractions • Extracting multiple attributes of an entity from a single source string E.g. “52-A Goregaon West Mumbai 400 076” • Representing uncertainty through a probability distribution attached to each column
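A column-level representation can be pictured as one independent distribution per attribute. In the sketch below the column names and all probabilities are invented for illustration and are not output from any real extractor; the probability of a row is taken as the product of its column probabilities, which is the independence assumption built into this representation.

```python
# Hypothetical per-column distributions for "52-A Goregaon West Mumbai 400 076".
columns = {
    "house_no": {"52-A": 0.7, "52": 0.3},
    "area": {"Goregaon West": 0.6, "Goregaon": 0.4},
    "city": {"Mumbai": 0.6, "West Mumbai": 0.4},
    "pincode": {"400 076": 1.0},
}

def row_probability(row, columns):
    """Probability of a (possibly partial) row, treating columns as independent."""
    p = 1.0
    for col, value in row.items():
        p *= columns[col].get(value, 0.0)
    return p

def most_likely_row(columns):
    """Pick the most probable value in each column separately."""
    return {col: max(dist, key=dist.get) for col, dist in columns.items()}

best_row = most_likely_row(columns)
print(best_row, row_probability(best_row, columns))

# Independence also gives weight to a combination that uses "West" twice.
overlap = {"area": "Goregaon West", "city": "West Mumbai"}
print(row_probability(overlap, columns))
```

The last line shows the weakness of purely column-level distributions: they cannot express that the choices in different columns constrain each other.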

Imprecision of Extraction • Multi-attribute Extractions • Extracting multiple attributes of an entity from a single source string E.g. “52-A Goregaon West Mumbai 400 076” • Hybrid method: row and column level distributions

Imprecision of Extraction • Multiple Redundant Extractions

Imprecision of Extraction • Multiple Redundant Extractions • Assume only extraction uncertainty, ignoring co-reference uncertainty by assuming that an exact method exists for deciding whether two strings are the same • Assume only co-reference uncertainty, with no uncertainty attached to each string that refers to an entity

Imprecision of Extraction • Multiple Redundant Extractions • The Noisy-OR Model: converts the confidences p1,…,pn into a single probability value p of x • It assumes that different extractions are independent, which is not practical • E.g. 100 extractions with confidence 0.1 each => p close to 1 • The soft-OR function
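The noisy-OR combination is p = 1 − (1 − p1)(1 − p2)…(1 − pn), the probability that at least one of n independent extractions is correct. A short sketch makes the over-confidence problem concrete; the function name is an assumption, and `math.prod` requires Python 3.8 or later.

```python
import math

def noisy_or(probs):
    """Probability that at least one extraction is correct, assuming independence."""
    return 1.0 - math.prod(1.0 - p for p in probs)

# One hundred weak extractions, each with confidence 0.1.
combined = noisy_or([0.1] * 100)
print(combined)
```

If the hundred extractions come from copies of the same page, they are far from independent, and the combined value overstates the evidence.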

Imprecision of Extraction • Multiple Redundant Extractions • Conditional Probability Models from Labeled Data • Based on labeled data • No assumption about independence • Generative Models for Unlabeled Data • A single pattern • Multiple patterns: Pr(y|n1j…nkj) =
