Querying Probabilistic Information Extraction Daisy Zhe Wang*, Michael Franklin*, Joseph Hellerstein*, Minos Garofalakis (Technical University of Crete, Greece) (*University of California Berkeley, United States of America) VLDB ’10, Sept. 13-17, 2010, Singapore
Motivation • To improve the quality of answers produced by Information Extraction (IE) over unstructured data • To improve the efficiency of query processing over IE output by building on an in-database implementation of the Conditional Random Field (CRF) model using the Viterbi inference algorithm
Problem Definition • Extend relational query processing to include data obtained from unstructured sources. The common approach has two parts: • Identify and label entities within blocks of text using a stand-alone IE technique • Extract the data and import it into a database, where it can then be processed using relational queries
Problem Definition • Traditional query processing does not properly handle probabilistic data, leading to reduced answer quality • Efficiency suffers because IE methods and query processing are performed separately
Problem Definition Example: Michelle L. Simpkins Winstead Sechrest & Minick P.C. 100 Congress Avenue, Suite 800 • CRF-based IE is used to extract contact information from signature blocks in the Enron email corpus • The maximum-likelihood (ML) extraction assigns NULL to companyname
Problem Definition
Query 1: SELECT * FROM Contacts WHERE companyname LIKE ‘%Winstead%’
Query 2: SELECT * FROM Contacts C1, Contacts C2 WHERE C1.companyname = C2.companyname
• Query 1 produces an empty result due to the erroneous NULL assignment • Query 2 fails to include Michelle Simpkins in the result, again because of the NULL companyname assignment
Problem Definition • Previous work • Conditional Random Fields (CRFs) – a probabilistic model for IE • Viterbi algorithm – computes the maximum-likelihood segmentation of a token sequence under a CRF • Limitations • Restricting queries to ML extractions can result in incorrect answers • Optimization opportunities are lost because IE inference is computed separately from the queries over the extracted data
Solution • Two approaches • Deterministic Select-Project-Join (SPJ) queries over the ML results of Viterbi-based IE • Probabilistic SPJ queries computed over the set of possible worlds represented by the CRF
Probabilistic Database • Two key components • A collection of incomplete relations with missing or uncertain data • A probability distribution on all possible database instances • Each possible database instance is a possible completion of the missing and uncertain data
Conditional Random Fields • A CRF model represents the probability distribution over all possible segmentations of a text string • A segmentation y = {y1, …, yT} is one possible way to tag each token in x with one of the labels in Y • The distribution of a linear-chain CRF is:
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\}
where Z(x) is a normalization function and {fk} is a set of real-valued feature functions with weights {λk}
Inference Queries • Top-k Inference • Uses the Viterbi dynamic programming algorithm to compute a two-dimensional V matrix, where each cell V(i, y) stores a ranked list of entries e = {score, prev(label, idx)} ordered by score • Each entry contains: • The score of a top-k (partial) segmentation ending at position i with label y • A pointer to the previous entry prev on the path that led to the top-k scores in V(i, y) (the recurrence is sketched below)
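For reference, the cell scores follow the standard log-space Viterbi recurrence for linear-chain CRFs (shown here in its textbook formulation for k = 1, not quoted from the paper):
V(i, y) = \max_{y'} \big( V(i-1, y') + \sum_k \lambda_k f_k(y, y', x_i) \big) \quad \text{for } i > 0
V(0, y) = \sum_k \lambda_k f_k(y, \mathrm{null}, x_0)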
Inference Queries • Constrained Top-k Inference • A special case of top-k inference used when a subset of the token labels has been provided (e.g., through a user interface) • Let s be the evidence vector {s1, …, sT}, where si is either NULL or the evidence label for yi • Computed by a variant of the Viterbi algorithm that restricts the chosen labels to conform to the given evidence (one way to express this is shown below)
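One standard way to impose such evidence in the recurrence above (an illustration, not necessarily the paper’s exact formulation) is to rule out every label that contradicts it:
V(i, y) = -\infty \quad \text{whenever } s_i \neq \mathrm{NULL} \text{ and } y \neq s_i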
Inference Queries • (a) The ML segmentation y*, backtracked from the maximum entry in V(T, yT), where T is the length of the token sequence x, is shown in bold arrows • (b) Using constrained top-k inference
Inference Queries • Marginal Inference • Computes a marginal probability p(yt, yt+1, …, yt+k | x) over a single label or a sub-sequence of labels • Uses a variation of the Viterbi algorithm called forward-backward (the standard form of the single-label marginal is given below)
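In its textbook form (again, not quoted from the paper), the single-label marginal combines the forward scores α and backward scores β computed by forward-backward:
p(y_t \mid \mathbf{x}) = \frac{\alpha_t(y_t)\, \beta_t(y_t)}{Z(\mathbf{x})}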
Viterbi Algorithm • Given a Hidden Markov Model (HMM) with states Y, initial probabilities πi of being in state i, and transition probabilities ai,j of transitioning from state i to state j, say we observe outputs x0, x1, …, xT. The state sequence most likely to have produced the observations is given by the recurrence relations:
V_{0,k} = \mathrm{P}(x_0 \mid k) \cdot \pi_k
V_{t,k} = \mathrm{P}(x_t \mid k) \cdot \max_{i \in Y} \big( a_{i,k} \cdot V_{t-1,i} \big)
• Here Vt,k is the probability of the most probable state sequence responsible for the first t + 1 observations (we add one because indexing starts at 0) that has k as its final state. The Viterbi path can be retrieved by saving back pointers that remember which state i achieved the maximum in the second equation.
Setup • Token Table – an incomplete relation that stores text strings in the database (similar to an inverted index); contains one probabilistic attribute that needs to be inferred – labelp: TokenTbl(strID, pos, token, labelp) • Factor Table – a materialization of the factor tables in the CRF model for all the tokens in the corpus; can be used to compute the probability distribution over all possible “worlds” of TokenTbl. FactorTbl uses an array to store scores ordered by {prevLabel, label}: FactorTbl(token, score ARRAY[ ]) • A sketch of this schema appears below
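A minimal SQL sketch of the two relations (the column types are assumptions for illustration, not the paper’s exact declarations):

-- TokenTbl: one row per token occurrence; labelp is the probabilistic attribute
CREATE TABLE TokenTbl (
  strID  integer,  -- ID of the source text string
  pos    integer,  -- position of the token within the string
  token  text,     -- the token itself
  labelp integer   -- label to be inferred (probabilistic)
);
-- FactorTbl: materialized CRF factors, one row per distinct token;
-- score is an array of factor scores ordered by {prevLabel, label}
CREATE TABLE FactorTbl (
  token text,
  score real[]
);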
Setup • Entity Table – contains a set of probabilistic attributes, one for each label in Y. EntityTbl is defined and generated over the possible labelings in the TokenTbl
Querying ML • Two families of SPJ queries: • Deterministic SPJ over ML views of entity tables • Probabilistic SPJ over entity tables • ML view of an entity table: CREATE VIEW entityTbl1-ML AS SELECT *, rank() OVER (ORDER BY prob(*) DESC) r FROM entityTbl1 WHERE r = 1;
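For instance, Query 1 from the problem definition could then be posed directly against such a view (the view name Contacts-ML is illustrative, following the naming above):
SELECT * FROM Contacts-ML WHERE companyname LIKE ‘%Winstead%’;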
Deterministic SPJ • Optimized selection over ML: SELECT * FROM Address-ML WHERE streetname LIKE ‘%Sacramento%’ • Rewritten into two selection conditions: 1) Test whether the text string d contains the token sequence in the selection condition xcond 2) Test whether the position(s) where xcond appears in d are assigned the label in the selection condition ycond (this may span several tokens) • A sketch of the rewrite follows
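A sketch of the rewritten selection over the token table (MLLabelTbl, holding the ML label per position, is a hypothetical relation introduced here for illustration):

SELECT DISTINCT t.strID
FROM TokenTbl t
JOIN MLLabelTbl l ON l.strID = t.strID AND l.pos = t.pos
WHERE t.token = 'Sacramento'    -- (1) the text contains the token
  AND l.label = 'streetname';   -- (2) that position carries the streetname label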
Deterministic SPJ • Optimized join over ML • Compute Viterbi inference over the smaller of the two input sets • Build a hash table of the join-attribute values • Perform Viterbi inference only on the documents in the outer set that contain at least one of the hashed values • Perform the pivot operation to compute the ML views of the entity tables
Experiment • Dataset for accuracy: Contact [19] – 182 signature blocks from the Enron email dataset, annotated with contact-record tags (city, firstname, etc.) • CRF model developed at the University of Massachusetts • Both false positives and false negatives are used as measures of accuracy • Ground truth: manual tagging performed in [12]
Dataset for efficiency and scalability: DBLP [20] – more than 700k papers with attributes (conference, year, etc.) • CRF model similar to [12] (University of Massachusetts)
Probabilistic Declarative Information Extraction Daisy Zhe Wang*, Eirinaios Michelakis*, Michael J. Franklin*, Joseph Hellerstein*, Minos Garofalakis (Technical University of Crete, Greece) (*University of California Berkeley, United States of America) ICDE ’10
Motivation • Information Extraction – Parsing text to extract structured objects which can be integrated into a traditional database
Motivation • Two major themes of IE: • Design of declarative languages and systems to perform IE • Probabilistic Database Systems (PDBS), which can model the uncertainty inherent in IE outputs and enable users to write declarative queries that reason about that uncertainty
Motivation • Produce a unified database system that enables declarative Information Extraction tasks and provides a probabilistic framework for querying the extracted information
Problem Definition • IE outputs were initially stored in a PDBS that supported tuple-level uncertainty, as well as attribute-level uncertainty with restrictive dependency structures • This loses the complex dependency structures, allowing only a rough approximation of the CRF distribution model
Solution • Relational Representation of Inputs • Conditional Random Fields can be naturally modeled in a relational database • Text data is captured relationally via an inverted-file representation (a table that stores a mapping from each word to its locations) • Declarative Viterbi Inference • Given tabular CRF model parameters and input text, Viterbi inference can be expressed as a standard recursive SQL query for dynamic programming (a sketch follows)
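A minimal recursive-SQL sketch of the idea, not the paper’s actual query: it assumes the TokenTbl sketch above holds a single document, plus a hypothetical Factors(token, prevLabel, label, score) relation with log-space scores in unrolled (non-array) form, including rows with prevLabel NULL for the first position. Standard SQL disallows aggregation inside the recursive branch, so this sketch enumerates all label paths and picks the best one at the end; the paper’s implementation instead keeps only the best partial score per (pos, label):

WITH RECURSIVE paths(pos, label, path, score) AS (
  -- base case: every labeling of the first token (no previous label)
  SELECT t.pos, f.label, ARRAY[f.label], f.score
  FROM TokenTbl t JOIN Factors f ON f.token = t.token
  WHERE t.pos = 0 AND f.prevLabel IS NULL
  UNION ALL
  -- recursive step: extend each partial labeling by the next token
  SELECT t.pos, f.label, p.path || f.label, p.score + f.score
  FROM paths p
  JOIN TokenTbl t ON t.pos = p.pos + 1
  JOIN Factors f ON f.token = t.token AND f.prevLabel = p.label
)
SELECT path, score   -- the ML segmentation is the highest-scoring complete path
FROM paths
WHERE pos = (SELECT MAX(pos) FROM TokenTbl)
ORDER BY score DESC
LIMIT 1;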