information extraction SNITA SARAWAGI
Management of Information Extraction System • Performance Optimization • Handling Change • Integration of Extracted Information • Imprecision of Extraction
Performance Optimization • Two modes of extraction system • The unstructured source is naturally available • The unstructured source is open-ended and large
Document Selection Strategies • When the source is really large… • Manually restricting the set • Focused crawling • Searching via keywords
Document Selection Strategies • Index Search Techniques • Standard IR-style keyword queries: “vaccine” and “cure” • Pattern queries: “Thomas w+ Edison” > “Thomas NEAR Edison” • Index Design for Efficient Extraction • The index should support proximity queries, regular expression patterns, and storage of tags • Cafarella and Etzioni [38]
Efficient Querying of Entity Database for Extraction • Similarity • Can be an expensive operation • E.g. extracting book titles from blogs • Batch-Top-K search • Goal: find each possible segment in x whose similarity to an entry in D (the database of entities) is greater than a threshold ε • Concentrates on the TF-IDF similarity score • Chandel et al. [49]
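The goal stated above can be sketched as a naive baseline: enumerate every segment of the text and compare it against every database entry with TF-IDF cosine similarity. This is not the Batch-Top-K algorithm of Chandel et al. [49]; it is only the brute-force search such an algorithm is meant to speed up. The function names, the whitespace tokenizer, the IDF formula, and the `max_len` cap are all assumptions made for illustration.

```python
import math
from collections import Counter

def build_idf(entries):
    """IDF weights from the entity database D (assumed formula: log(1 + N/df))."""
    docs = [e.lower().split() for e in entries]
    df = Counter(tok for d in docs for tok in set(d))
    n = len(docs)
    idf = {t: math.log(1 + n / c) for t, c in df.items()}
    default = math.log(1 + n)  # weight for tokens never seen in D
    return idf, default

def unit_vector(tokens, idf, default):
    """Unit-length TF-IDF vector; unseen tokens still add to the norm."""
    vec = {t: c * idf.get(t, default) for t, c in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def matching_segments(text, entries, threshold, max_len=6):
    """Return (segment, entry, score) for every segment scoring above threshold."""
    idf, default = build_idf(entries)
    entry_vecs = [unit_vector(e.lower().split(), idf, default) for e in entries]
    tokens = text.lower().split()
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            seg_vec = unit_vector(tokens[i:j], idf, default)
            for entry, vec in zip(entries, entry_vecs):
                score = cosine(seg_vec, vec)
                if score > threshold:
                    hits.append((" ".join(tokens[i:j]), entry, score))
    return hits

hits = matching_segments("yesterday I read war and peace again",
                         ["war and peace", "the great gatsby"], threshold=0.95)
```

The cost grows with (number of segments) × (size of D), which is why the similarity step is described as expensive.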
Handling Change • Incremental Extractions on Changing Sources • “An easy optimization, with clear scope for performance boost, is to run the extractor only on the changed portions of a page instead of the entire page.” • Detecting the unchanged regions of the page is the key.
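A minimal sketch of the idea, assuming pages are compared line by line with Python's standard `difflib` and that the extractor can be run on individual lines. `changed_regions`, `incremental_extract`, and the toy number extractor are hypothetical names, not part of any system described in the source.

```python
import difflib
import re

def changed_regions(old_lines, new_lines):
    """Index ranges (start, end) in new_lines that were replaced or inserted."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [(j1, j2) for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag in ("replace", "insert")]

def incremental_extract(old_lines, new_lines, extractor):
    """Run the extractor only on lines that changed between two page versions."""
    results = []
    for start, end in changed_regions(old_lines, new_lines):
        for line in new_lines[start:end]:
            results.extend(extractor(line))
    return results

old_page = ["Title", "Price: 10", "Footer"]
new_page = ["Title", "Price: 12", "Footer"]
numbers = incremental_extract(old_page, new_page,
                              lambda line: re.findall(r"\d+", line))
```

A real system would also have to retract extractions that came from deleted regions and handle extractions that span a region boundary; this sketch ignores both.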
Handling Change • Detecting When Extractors Fail on Evolving Data • Defining Characteristic Patterns • DataProg • Keeps a pattern when its frequency is statistically significant • Avoids choosing very specific patterns • E.g. addresses: 4676 Admiralty Way; 10924 Pico Boulevard; 512 Oak Street; 2431 Main Street; 5257 Adams Boulevard • Patterns: P1: Number UpperCaseWord Boulevard; P2: Number UpperCaseWord Street; P3: Number UpperCaseWord Way
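The address example can be reproduced with a small sketch that generalizes each token to a class and counts the resulting patterns. This is not DataProg itself, which selects patterns by statistical significance; the fixed rule used here (generalize every token except the last) and the names `token_class` and `pattern_counts` are assumptions for illustration.

```python
from collections import Counter

def token_class(token):
    """Map a token to a coarse syntactic class."""
    if token.isdigit():
        return "Number"
    if token[:1].isupper():
        return "UpperCaseWord"
    return "Word"

def pattern_counts(values):
    """Count patterns, keeping the last token literal and generalizing the rest."""
    counts = Counter()
    for value in values:
        tokens = value.split()
        if not tokens:
            continue
        classes = [token_class(t) for t in tokens[:-1]] + [tokens[-1]]
        counts[" ".join(classes)] += 1
    return counts

addresses = ["4676 Admiralty Way", "10924 Pico Boulevard", "512 Oak Street",
             "2431 Main Street", "5257 Adams Boulevard"]
counts = pattern_counts(addresses)
# counts["Number UpperCaseWord Boulevard"] == 2
# counts["Number UpperCaseWord Street"] == 2
# counts["Number UpperCaseWord Way"] == 1
```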
Handling Change • Detecting When Extractors Fail on Evolving Data • Defining Significant Change • The distribution represented by $F_i'$ is said to be statistically different from $F_i$ if the counts $F_i'$ observed in $(D', S')$ differ a lot from the expected values $e_i'$ obtained by extrapolating from $F_i$, as measured by the $\chi^2$ statistic: $\chi^2 = \sum_i (F_i' - e_i')^2 / e_i'$ • The expected value:
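A sketch of this test using the standard Pearson χ² form. The slide's formula for the expected value did not survive, so the extrapolation below (scaling each old count by the ratio of total counts) is an assumption, as are the function name and the dictionary layout.

```python
def chi_square_change(old_counts, new_counts):
    """Pearson chi-square between new pattern counts and counts expected from old data.

    Assumption: the expected count of a pattern is its old count scaled by
    (total new count / total old count). Every pattern in old_counts must have
    a positive count; patterns seen only in new_counts are ignored.
    """
    n_old = sum(old_counts.values())
    n_new = sum(new_counts.values())
    stat = 0.0
    for pattern, f_old in old_counts.items():
        expected = f_old * n_new / n_old
        observed = new_counts.get(pattern, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

old = {"Number UpperCaseWord Street": 20, "Number UpperCaseWord Way": 20}
same_shape = {"Number UpperCaseWord Street": 10, "Number UpperCaseWord Way": 10}
drifted = {"Number UpperCaseWord Street": 20, "Number UpperCaseWord Way": 0}
# chi_square_change(old, same_shape) → 0.0
# chi_square_change(old, drifted)    → 20.0
```

A large statistic, compared against a χ² critical value for the chosen significance level, would flag that the extractor's output no longer looks like the data it was built on.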
Integration of Extracted Information • Main Challenge in Integration of Extracted Information: • Deduplication, coreference resolution, record linkage • Solution • “Ideally, extraction of all repeated mentions should be done simultaneously and jointly with integration with existing sources.” • Decoupled Extractions and Integration • Decoupled Extraction and Collective Integration • Coupled Extraction and Integration
Integration of Extracted Information • Decoupled Extractions and Integration • Extraction and integration happen independently • The decision about redundancy is made during integration • Binary classification • Input: a pair of records • Output: a binary decision (duplicate or not) • Uses similarity functions (cosine similarity, edit distance, Jaccard similarity, and Soundex)
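A minimal sketch of the pairwise setup, using two of the similarity functions listed above. The hand-set thresholds stand in for a trained classifier and are purely illustrative; `is_duplicate` and its parameters are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercase whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def is_duplicate(r1, r2, max_edit_ratio=0.25, min_jaccard=0.5):
    """Toy decision rule over two similarity features (thresholds are assumed)."""
    ratio = edit_distance(r1.lower(), r2.lower()) / max(len(r1), len(r2), 1)
    return ratio <= max_edit_ratio or jaccard(r1, r2) >= min_jaccard

same = is_duplicate("Alistair MacLean", "Alistair Mclean")   # True
other = is_duplicate("Alistair MacLean", "Bruce Lindsay")    # False
```

In practice the thresholds, or a decision tree over these features, would be learned from labeled duplicate and non-duplicate pairs.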
Integration of Extracted Information • Decoupled Extractions and Integration • Example of a decision tree created on similarity function
Integration of Extracted Information • Decoupled Extractions and Integration • Sequential Process • For an extracted record r, the classifier is applied to the pair (r, e) for each entry e in the existing database D and returns “yes/no” • If the answer is “no” for all entries, r is a new entry; otherwise it is integrated with the best matching entry e • The sequential process can be sped up considerably through index lookups that efficiently find likely matches
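The sequential process can be sketched as a linear scan. The sketch assumes the classifier exposes a score so that a "best" match can be chosen, with scores at or above `threshold` counting as "yes"; the token-level Jaccard scorer and all names are illustrative assumptions.

```python
def token_jaccard(a, b):
    """Illustrative pair scorer: Jaccard similarity over lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def integrate(record, database, score=token_jaccard, threshold=0.5):
    """Match record against every entry; append it if no entry says 'yes'.

    Returns (entry, is_new): the best matching entry, or the record itself
    when it was added as a new entry.
    """
    best, best_score = None, threshold
    for entry in database:
        s = score(record, entry)
        if s >= best_score:
            best, best_score = entry, s
    if best is None:
        database.append(record)
        return record, True
    return best, False

db = ["Transaction Processing: Concepts and Techniques"]
match, is_new = integrate("Transaction Processing Concepts and Techniques", db)
added, was_new = integrate("The Art of Computer Programming", db)
```

The loop touches every entry of D for every extracted record; an index that returns only likely matches would replace the `for entry in database` scan.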
Integration of Extracted Information • Decoupled Extraction and Collective Integration • E.g. R1. Alistair MacLean; R2. A McLean; R3. Alistair Mclean • Transitivity • If A = B and B = C, then A = C • Cast the collective integration of multiple records as a graph partitioning problem • Nodes: records • An edge between ei and ej is drawn with weighted score wij • Sign: duplicate or nonduplicate • Magnitude: confidence in this outcome
Integration of Extracted Information • Decoupled Extraction and Collective Integration • Correlation Clustering (CC)
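A small sketch of the partitioning view: signed edge weights say how strongly a pair looks like a duplicate (positive) or a non-duplicate (negative), and a partition is scored by how well it agrees with them. The weights below are invented for illustration, and the exhaustive search over partitions only works for a handful of records; exact optimization is intractable for large graphs, so practical systems rely on approximations.

```python
def partitions(items):
    """Yield every set partition of a list (exponential; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def agreement(partition, weights):
    """Add w for pairs placed together and -w for pairs placed apart."""
    cluster_of = {r: k for k, cluster in enumerate(partition) for r in cluster}
    return sum(w if cluster_of[a] == cluster_of[b] else -w
               for (a, b), w in weights.items())

def best_partition(records, weights):
    return max(partitions(records), key=lambda p: agreement(p, weights))

# Hypothetical pairwise scores: R1-R2 alone looks like a weak non-duplicate.
weights = {("R1", "R3"): 0.9, ("R2", "R3"): 0.6, ("R1", "R2"): -0.2}
best = best_partition(["R1", "R2", "R3"], weights)
```

With these weights the best partition puts all three records in one cluster: the two confident "duplicate" edges outweigh the weak "non-duplicate" edge, which is the transitivity effect that independent pairwise decisions miss.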
Integration of Extracted Information • Decoupled Extraction and Collective Integration • Collective Multi-attribute Integration • When the information extracted spans multiple columns, it can have a greater impact
Integration of Extracted Information • Coupled Extraction and Integration • Joint extraction and integration • Little to be gained • When the database is not guaranteed to be complete • When we are extracting single entities at a time • Boosts accuracy • When extracting records or multi-way relationships consisting of multiple entity subtypes
Integration of Extracted Information • Coupled Extraction and Integration • E.g. “In his foreword to Transaction Processing Concepts and Techniques, Bruce Lindsay” • Existing Books database • Book names where one of the entries is “Transaction Processing: Concepts and Techniques.” • People names consisting of entries like “A. Reuters”, “J. Gray”, “B. Lindsay”, “D Knuth”, and so on. • Authors table linking the book titles with the people who wrote them.
Imprecision of Extraction • Confidence Values for Single Extractions • Two ways of representing the imprecision of extraction • Associate each extracted item with a probability value • Output multiple possible extractions
Imprecision of Extraction • Confidence Values for Single Extractions • Reliability Plot • A useful visual tool to measure the soundness of the probabilities • X-axis: binned probabilities output by classifier • Y-axis: fraction of test instances in that probability bin whose predictions are correct
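The two axes of the reliability plot can be computed directly from a list of predicted probabilities and a list of correctness flags. The sketch below assumes equal-width bins and reports the bin midpoint as the x value; `reliability_bins` is a hypothetical helper name.

```python
def reliability_bins(probs, correct, n_bins=10):
    """Return (bin midpoint, fraction correct) for each non-empty probability bin."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(probs, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        totals[b] += 1
        hits[b] += int(ok)
    return [((b + 0.5) / n_bins, hits[b] / totals[b])
            for b in range(n_bins) if totals[b]]

points = reliability_bins([0.95, 0.92, 0.15, 0.12],
                          [True, True, False, True])
# Two points: the low-probability bin is half correct, the high bin fully correct.
```

For a well-calibrated extractor the points lie close to the diagonal: among the extractions given probability around 0.9, roughly 90% should be correct.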
Imprecision of Extraction • Multi-attribute Extractions • Extracting multiple attributes of an entity from a single source string • E.g. “52-A Goregaon West Mumbai 400 076” • Representing uncertainty through a probability distribution attached to each column
Imprecision of Extraction • Multi-attribute Extractions • Extracting multiple attributes of an entity from a single source string E.g. “52-A Goregaon West Mumbai 400 076” • Hybrid method: row and column level distributions
Imprecision of Extraction • Multiple Redundant Extractions
Imprecision of Extraction • Multiple Redundant Extractions • Assume only extraction uncertainty and ignore co-reference uncertainty by assuming that an exact method exists for resolving if two strings are the same. • Assume there is only co-reference uncertainty and each string has no uncertainty attached to it referring to an entity.
Imprecision of Extraction • Multiple Redundant Extractions • The Noisy-OR Model: converts the confidences p1,…,pn of the repeated extractions of x into a single probability value p of x • It assumes that different extractions are independent, which is not practical • E.g. 100 extractions with confidence 0.1 each give p close to 1 • The soft-OR function
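The noisy-OR combination in its standard form is p = 1 − ∏(1 − p_i): x is treated as wrong only if every one of its independent extractions is wrong. A short sketch (the function name is an assumption, and `math.prod` needs Python 3.8 or later) shows why the independence assumption inflates confidence:

```python
import math

def noisy_or(probs):
    """Combine per-extraction confidences, assuming the extractions are independent."""
    return 1.0 - math.prod(1.0 - p for p in probs)

two = noisy_or([0.5, 0.5])     # 0.75
many = noisy_or([0.1] * 100)   # above 0.9999
```

One hundred weak extractions at 0.1 each combine to near certainty, even though repeated mentions of the same string on similar pages are rarely independent evidence.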
Imprecision of Extraction • Multiple Redundant Extractions • Conditional Probability Models from Labeled Data • Based on labeled data • No independence assumption • Generative Models for Unlabeled Data • A single pattern • Multiple patterns: Pr(y|n1j…nkj) =