Integration and representation of unstructured text in relational databases Sunita Sarawagi IIT Bombay
Unstructured data → Database
• Publications from homepages → Citeseer/Google Scholar: structured records from publishers
• Product reviews on the web → Company database: products with features
• Customer emails; text resume in an email → HR database; resumes: skills, experience, references
• Extract bibtex entries when I download a paper; enter missing contacts via web search → Personal databases: bibtex, address book
R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see …
[Figure: a citation database with 3 top-level entity tables (Articles, Journals, Authors) and a Writes relationship; a probabilistic variant links extracted mentions to canonical entries. Database: imprecise.]
R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see …
Extraction → Author: R. Fagin; Author: J. Helpern; Title: Belief, awareness, reasoning; Journal: AI; Year: 1988
Integration → match with existing linked entities (Articles, Journals, Authors, Writes) while respecting all constraints
Outline • Statistical models for integration • Extraction while fully exploiting existing database • Entity match, Entity pattern, link/relationship constraints, • Integrate extracted entities, resolve if entity already in database • Performance challenges • Efficient graphical model inference algorithms • Indexing support • Representing uncertainty of integration in DB • Imprecise databases and queries
Extraction using chain CRFs
R. Fagin and J. Helpern, Belief, awareness, reasoning → tokens x1…x8, labels y1…y8
Flexible overlapping features:
• identity of word
• ends in "-ski"
• is capitalized
• is part of a noun phrase?
• is under node X in WordNet
• is in bold font
• is indented
• next two words are "and Associates"
• previous label is "Other"
Difficult to effectively combine features from labeled unstructured data and the structured DB
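A minimal sketch of how such overlapping token-level features could be computed (feature names and the helper below are illustrative, not the talk's actual feature set):

```python
def token_features(tokens, i, prev_label):
    """Overlapping binary features for token i of a chain CRF.
    Illustrative sketch only: covers a few of the feature types listed above."""
    w = tokens[i]
    feats = {
        "word=" + w.lower(): 1.0,                      # identity of word
        "ends_ski": 1.0 if w.endswith("ski") else 0.0, # ends in "-ski"
        "capitalized": 1.0 if w[:1].isupper() else 0.0,
        "prev_label=" + prev_label: 1.0,               # previous label
    }
    # next two words are "and Associates"
    if i + 2 < len(tokens) and tokens[i + 1 : i + 3] == ["and", "Associates"]:
        feats["next=and_Associates"] = 1.0
    return feats
```

Each feature fires independently of the others, which is exactly what makes CRFs convenient for combining heterogeneous clues.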
CRFs for segmentation
• Chain CRF: features describe a single word, e.g. the token "Fagin" at position t
• Semi-CRF: features describe a whole segment from position l to u, e.g. similarity of the segment to the authors column in the database
Features from database
• Similarity to a dictionary entry
• Jaro-Winkler, TF-IDF
• Similarity to a pattern-level dictionary
• Regex-based pattern index for database entities
• Entity classifier
• A multi-class regression model which gives the likelihood of a segment being a particular entity type
• Features for the classifier: all standard entity-level extraction features
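One cheap dictionary-similarity feature can be sketched with character n-gram Jaccard overlap (a stand-in chosen for brevity; the talk's features use Jaro-Winkler and TF-IDF):

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of s, with '#' padding so short strings still match."""
    s = "#" + s.lower() + "#"
    return {s[i : i + n] for i in range(max(1, len(s) - n + 1))}

def jaccard_sim(a, b, n=3):
    """Jaccard overlap of character n-gram sets, in [0, 1]; 1.0 for identical strings."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    return len(A & B) / len(A | B)
```

The score of a candidate segment against the closest dictionary entry can then be plugged in as a real-valued CRF feature.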
Segmentation models
• Input: sequence x = x1, x2, …, xn; label set Y
• Output: segmentation S = s1, s2, …, sp
• sj = (start position, end position, label) = (tj, uj, yj)
• Score: F(x, S) = Σj f(yj−1, yj, x, tj, uj), built from
  • Transition potentials: segment starting at i has label y and previous label is y′
  • Segment potentials: segment starting at i′, ending at i, with label y; all positions from i′ to i get the same label
• Probability of a segmentation: P(S | x) = exp(F(x, S)) / Z(x)
• Inference O(nL²): most likely segmentation, marginals around segments
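The most-likely-segmentation inference can be sketched as a Viterbi-style dynamic program over segment boundaries (a minimal sketch: `seg_score` and `trans_score` stand in for the learned potentials, `max_len` bounds segment length, and the previous label `None` marks the start):

```python
def semicrf_viterbi(n, labels, max_len, seg_score, trans_score):
    """Most likely segmentation of a length-n sequence under a semi-CRF.
    seg_score(j, i, y): potential of segment x[j:i] (half-open) with label y.
    trans_score(yp, y): transition potential; yp is None at the start."""
    NEG = float("-inf")
    # best[i][y] = score of best segmentation of x[:i] ending in a segment labeled y
    best = [dict() for _ in range(n + 1)]
    back = [dict() for _ in range(n + 1)]
    best[0][None] = 0.0
    for i in range(1, n + 1):
        for y in labels:
            best[i][y], back[i][y] = NEG, None
            for d in range(1, min(max_len, i) + 1):  # candidate segment lengths
                j = i - d
                for yp, prev in best[j].items():
                    s = prev + trans_score(yp, y) + seg_score(j, i, y)
                    if s > best[i][y]:
                        best[i][y], back[i][y] = s, (j, yp)
    # backtrack to recover the segmentation
    y = max(best[n], key=best[n].get)
    segs, i = [], n
    while i > 0:
        j, yp = back[i][y]
        segs.append((j, i, y))  # segment covers tokens x[j:i]
        i, y = j, yp
    return segs[::-1]
```

The triple loop over positions, labels, and segment lengths is what makes semi-CRF inference a factor of the maximum segment length slower than chain-CRF Viterbi.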
R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see …
Extraction → Author: R. Fagin; Author: J. Helpern; Title: Belief, awareness, reasoning; Journal: AI; Year: 1988
Integration → match with existing linked entities (Articles, Journals, Authors, Writes) while respecting all constraints
CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI
Only extraction → Author: R. Fagin; Author: J. Helpern; Title: Belief, awareness, reasoning in AI; Journal: CACM; Year: 2000
Combined extraction + integration → Author: R. Fagin; Author: J. Helpern; Title: Belief, awareness, reasoning; Journal: AI; Year: 2000
Year mismatch!
Combined extraction + matching
• Convert the predicted label to a pair y = (a, r): a is the entity label, r the id of the matching entity in the database
• r = 0 means none-of-the-above, i.e. a new entry
• Constraints exist on the ids that can be assigned to two segments
Constrained models • Two kinds of constraints between arbitrary segments • Foreign key constraint across their canonical-ids • Cardinality constraint • Training • Ignore constraints or use max-margin methods that require only MAP estimates • Application: • Formulate as a constrained integer programming problem (expensive) • Use general A-star search to find most likely constrained assignment
Effect of database on extraction performance
• L: only labeled training data
• L + DB: similarity to database entities and other DB features
(Mansuri and Sarawagi, ICDE 2006)
Full integration performance • L = conventional extraction + matching • L + DB = technology presented here • Much higher accuracies possible with more training data (Mansuri and Sarawagi ICDE 2006)
Outline • Statistical models for integration • Extraction while fully exploiting existing database • Entity match, Entity pattern, link/relationship constraints, • Integrate extracted entities, resolve if entity already in database • Performance challenges • Efficient graphical model inference algorithms • Indexing support • Representing uncertainty of integration in DB • Imprecise databases and queries
Inference in segmentation models
R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998
• Surface features (cheap) vs. database lookup features (expensive!)
• Efficient search for top-k most similar entities, via an inverted index over many large tables (e.g. an Authors table: M Y Vardi, J. Ullman, Ron Fagin, Jeffrey Ullman, …)
• Batch up to do better than individual top-k?
• Find the top segmentation without top-k matches for all segments?
Top-k similarity search
• Q: query segment; E: an entry in the database D; similarity score sim(Q, E)
• Goal: get the k highest-scoring Es in D
• Maintain upper and lower bounds on dictionary match scores per candidate tuple, derived from cached bounds on normalized idf values
• Tidlists: pointers to DB tuples (on disk)
• Access methods: 1. fetch/merge tidlist subsets, 2. point queries
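As a baseline, an exact top-k search over an inverted index can be sketched as follows (a simplified stand-in: it merges full tidlists with idf-weighted overlap scoring, without the bound-based pruning described above):

```python
import heapq
import math
from collections import defaultdict

def build_index(entries):
    """entries: list of entity strings. Returns (inverted index, idf table)."""
    index = defaultdict(list)  # token -> tidlist (ids of entries containing it)
    for tid, e in enumerate(entries):
        for tok in set(e.lower().split()):
            index[tok].append(tid)
    n = len(entries)
    idf = {t: math.log(n / len(tids)) + 1.0 for t, tids in index.items()}
    return index, idf

def topk(query, index, idf, k):
    """Merge tidlists of the query tokens, accumulate idf-weighted overlap, keep top k."""
    scores = defaultdict(float)
    for tok in set(query.lower().split()):
        for tid in index.get(tok, []):
            scores[tid] += idf[tok]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```

The bound-based variant would fetch only subsets of each tidlist and stop once the k-th lower bound dominates every remaining upper bound.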
Best segmentation with inexact, bounded features
• Normal Viterbi: forward pass over data positions; at each position maintain the best segmentation ending at that position
• Modify to: best-first search over segment states (s(0,0), s(1,1), s(1,2), s(3,5), …, end state) with selective feature refinement
• Suffix upper/lower bounds: from a backward Viterbi pass with bounded features
(Chandel, Nagesh and Sarawagi, ICDE 2006)
Performance results DBLP authors and titles 100 citations (Chandel, Nagesh and Sarawagi, ICDE 2006)
Inference in segmentation models with surface features (cheap)?
R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998
Not quite! Semi-CRFs are 3-8 times slower than chain CRFs
Key insight • Applications have a mix of token-level and segment-level features • Many features applicable to several overlapping segments • Compactly represent the overlap through new forms of potentials • Redesign inference algorithms to work on compact features • Cost is independent of number of segments a feature applies to (Sarawagi, ICML 2006)
Compact potentials • Four kinds of potentials
Outline • Statistical models for integration • Extraction while fully exploiting existing database • Entity match, Entity pattern, link/relationship constraints, • Integrate extracted entities, resolve if entity already in database • Performance challenges • Efficient graphical model inference algorithms • Indexing support • Representing uncertainty of integration in DB • Imprecise databases and queries
Probabilistic querying systems
• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible
→ Create uncertainty-aware storage and querying engines
• Two enablers:
  • Probabilistic database querying engines over generic uncertainty models
  • Conditional graphical models that produce well-calibrated probabilities
Probabilities in CRFs are well-calibrated
[Plots: Cora citations and Cora headers; probability of segmentation vs. fraction correct, each close to the ideal diagonal]
E.g.: segmentations assigned probability 0.5 are correct 50% of the time
Uncertainty in integration systems
Unstructured text → model → entities with probabilities p1, p2, …, pk → probabilistic database system
• Very uncertain? Gather additional training data. Other more compact models?
• Example uncertain values: conference name "IEEE Intl. Conf. On data mining" (0.8) vs. "Conf. On data mining" (0.2); citation counts D Johnson 16000 (0.6), J Ullman 13000 (0.4)
• Queries: select conference name of article RJ03? Find the most cited author?
Segmentation-per-row model (Rows: Uncertain; Cols: Exact) Exact but impractical. We can have too many segmentations!
One-row model (row: exact; columns: independent, uncertain)
Each column is a multinomial distribution
e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 × 0.6 × 0.6 × 1.0 = 0.252
Simple model, closed-form solution, poor approximation.
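The one-row computation is just a product of per-column marginals; a tiny sketch with the slide's example values hard-coded (the column names here are illustrative):

```python
# Column marginals from the slide's address example (illustrative values).
COLUMNS = {
    "House_no": {"52-A": 0.7},
    "Area":     {"Bandra West": 0.6},
    "City":     {"Bombay": 0.6},
    "Pincode":  {"400 062": 1.0},
}

def one_row_prob(assignment, columns=COLUMNS):
    """Probability of a full row under column-independent multinomials."""
    p = 1.0
    for col, val in assignment.items():
        p *= columns[col].get(val, 0.0)  # unseen values get probability 0
    return p
```

The independence assumption is what makes the model compact but also a poor approximation: correlations between columns (e.g. area and pincode) are lost.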
Multi-row Model Segmentation generated by a ‘mixture’ of rows (Rows: Uncertain; Columns: Independent, Uncertain) Excellent storage/accuracy tradeoff Populating probabilities challenging (Gupta and Sarawagi, VLDB 2006)
Populating a multi-row model • Challenge • Learning parameters of a mixture model to approximate the SemiCRF but without enumerating the instances from the model • Solution • Find disjoint partitions of string • Direct operation on marginal probability vectors (efficiently computable for SemiCRFs) • Each partition a row
Experiments: Need for multi-row • KL very high at m=1. One-row model clearly inadequate. • Even a two-row model is sufficient in many cases.
What next in data integration? • Lots to be done in building large-scale, viable data integration systems • Online collective inference • Cannot freeze database • Cannot batch too many inferences • Need theoretically sound, practical alternatives to exact, batch inference • Queries and Mining over imprecise databases • Models of imprecision for results of deduplication
Summary • Data integration with statistical models: an exciting research direction + a useful problem • Four take-home messages • Segmentation models (semi-CRFs) provide a more elegant way to exploit entity features and build integrated models (NIPS 2004, ICDE 2006a) • A* search adequate for link and cardinality constraints (ICDE 2006a) • Recipe for combining two top-k searches so that expensive DB lookup features are refined gradually (ICDE 2006b) • An efficient segmentation model with succinct representation of overlapping features + message passing over partial potentials (NIPS 2005 workshop) Software: http://crf.sourceforge.net
Outline • Problem statement and goals • Models for data integration • Information Extraction • State-of-the-art • Overview: Conditional Random Fields • Our extensions to incorporate database of entity names • Entity matching • Combined model for extraction and matching • Extending to multi-relational data
Entity resolution
[Figure: author records linked to variants, e.g. Jeffrey Ullman ↔ J. Ullmann, Jefry Ulman, Prof. J. Ullman; Jeffrey Smith ↔ J Smith; Michael Stonebraker ↔ Mike Stonebraker, M, Stonebraker; Pedro Domingos ↔ Domingos, P.]
Labeled data: record pairs with labels 0 (red edges) and 1 (black edges)
Input features: various kinds of similarity functions between attributes
• Edit distance, Soundex, n-grams on text attributes
• Jaccard, Jaro-Winkler
• Subset match
Classifier: any binary classifier; a CRF for extensibility
CRFs for predicting matches
• Given a record pair (x1, x2), predict y = 1 or 0 as Pr(y | x1, x2) = exp(Σk wk fk(y, x1, x2)) / Z(x1, x2)
• Efficiency:
• Training: filter and only include pairs which satisfy conditions like at least one common n-gram
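For the binary case this log-linear form reduces to a logistic score over pairwise similarity features; a minimal sketch (feature names and weights below are made up for illustration):

```python
import math

def match_prob(pair_feats, weights):
    """P(match = 1 | features) under a log-linear (logistic) model.
    pair_feats: {feature_name: value} computed on a record pair.
    weights: learned feature weights; missing features contribute 0."""
    z = sum(weights.get(name, 0.0) * v for name, v in pair_feats.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Training would fit `weights` on the labeled record pairs, after the n-gram filter has pruned the candidate pair set.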
Link constraints in multi-relational data
• Any pair of segments in the previous output needs to satisfy two conditions
  • Foreign key constraint across their canonical ids
  • Cardinality constraint
• Our solution: constrained Viterbi (branch-and-bound search)
  • Modified search that retains, with each best path, the labels assigned along it
  • Backtracks when constraints are violated
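A minimal depth-first sketch of the backtracking idea (no bound-based pruning, so it is exponential in the worst case; the talk's branch-and-bound variant additionally prunes with score bounds):

```python
def constrained_decode(n, labels, emit, ok_prefix):
    """Best label sequence of length n maximizing sum of emit(i, y),
    subject to ok_prefix(partial_seq) holding at every step.
    ok_prefix encodes hard constraints, e.g. cardinality limits."""
    best = {"seq": None, "score": float("-inf")}

    def dfs(seq, score):
        if len(seq) == n:
            if score > best["score"]:
                best["seq"], best["score"] = list(seq), score
            return
        i = len(seq)
        for y in labels:
            seq.append(y)
            if ok_prefix(seq):  # backtrack as soon as a constraint is violated
                dfs(seq, score + emit(i, y))
            seq.pop()

    dfs([], 0.0)
    return best["seq"], best["score"]
```

Checking constraints on every prefix is what lets the search discard whole subtrees the moment, say, a cardinality limit is exceeded.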
The final picture: available clues and the model components that exploit them
• Entity column names in the database
• Surface patterns, regular expressions (example pattern: X. [X.] Xx* → author name)
• Commonly occurring words (Journal, IEEE → journal name)
• Ordering of words (the part after "In" is the journal name)
• Similarity-based features
• Labeled data: order of attributes (title before journal name)
• Canonical links
• Schema-level: cardinality of attributes
• Links between entities: what entity is allowed to go with what
Model components: normal CRF, semi-CRF, compound labels, constrained Viterbi
Summary • Exploiting existing large databases to bridge with unstructured data, an exciting research problem with many applications • Conditional graphical models to combine all possible clues for extraction/matching in a simple framework • Probabilistic: robust to noise, soft predictions • Ongoing work: • Probabilistic output for imprecise query processing
Available clues…
• Entity column names in the database
• Surface patterns, regular expressions (example pattern: X. [X.] Xx* → author name)
• Commonly occurring words (Journal, IEEE → journal name)
• Ordering of words (the part after "In" is the journal name)
• TF-IDF similarity with stored entities
• Labeled data: order of attributes (title before journal name)
• Schema-level: cardinality of attributes
• Links between entities: what entity is allowed to go with what
Adding structure to unstructured data • Extensive research in web, NLP, machine learning, data mining and database communities. • Most current research ignores existing structured databases • Database just a store at the last step of data integration. • Our goal • Extend statistical models to exploit database of entities and relationships • Models: persistent, part of database, stored, indexed, evolving and improving along with data.