A Grammar-based Entity Representation Framework for Data Cleaning
Authors: Arvind Arasu, Raghav Kaushik
Presented by Rashmi Havaldar
Problems
• Poor data quality is often due to the lack of unique representations for real-world entities
• E.g., California can be represented as "California", "Calif", "CA", etc.
• In the paper's running example, five textually different author records correspond to just two authors
Problem Definition
• The main problem in data cleaning is to determine whether or not two representations are duplicates, i.e., correspond to the same real-world entity
• Cosine similarity and edit distance rely on textual similarity, which can be misleading (see the sketch below)
• Two representations of the same entity can be highly dissimilar
• Conversely, two representations that are textually very similar can correspond to different entities
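To make the "textual similarity can mislead" point concrete, here is a small Python sketch using the standard library's difflib as a stand-in for a similarity measure; the example strings are hypothetical and not from the paper:

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Crude textual similarity in [0, 1], standing in for cosine/edit-distance measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same real-world entity, yet textually dissimilar (score around 0.33):
print(sim("California", "CA"))

# Textually very similar strings (score around 0.88) that may well name different people:
print(sim("J. Smith", "J. Smyth"))
```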
Basic Definitions
• The program is a collection of triples of the form <R, P, A>, where R is a grammar rule, P is a predicate, and A is an action (sketched in code below)
• A grammar rule has a head and a body: the head is a single non-terminal and the body is a sequence of non-terminals, terminals, and variables
• Terminals are words and punctuation
• Non-terminals are written in angle brackets, terminals as single-quoted strings (e.g., 'Jeff'), and variables as uppercase letters
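As a rough sketch (not the paper's implementation), the <R, P, A> triples could be held in Python as follows; all class and field names here are my own:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Symbols in a rule body: non-terminals like "<name>", quoted terminals like "'Jeff'",
# and single uppercase letters for variables (e.g. "X").
Symbol = str
Bindings = Dict[str, str]   # variable -> constant assignment

@dataclass
class Rule:
    head: str                # a single non-terminal, e.g. "<first>"
    body: Sequence[Symbol]   # sequence of non-terminals, terminals, and variables

@dataclass
class AugmentedRule:
    rule: Rule
    predicate: Callable[[Bindings], bool] = lambda b: True   # P, evaluated over variable bindings
    action: Callable[[Bindings], dict] = lambda b: dict(b)   # A, producing output attributes

# One augmented rule: <first> -> X, provided X appears in a first-name dictionary.
FIRST_NAMES = {"jeff", "jeffrey", "andy", "andrew"}
first_rule = AugmentedRule(
    rule=Rule(head="<first>", body=["X"]),
    predicate=lambda b: b["X"].lower() in FIRST_NAMES,
    action=lambda b: {"first_name": b["X"]},
)
```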
Expanded program G' for program G
• The expanded program G', like G, is a collection of augmented rules
• To construct G', consider each augmented rule <R, P, A> and enumerate all possible assignments of constant values to the variables in R such that the predicate P evaluates to true, yielding ground rules of the form <R', true, A'> (see the sketch below)
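A minimal sketch of the expansion step, reusing the Rule/AugmentedRule shapes from the previous snippet; the brute-force enumeration over a finite constant pool is an assumption made for illustration:

```python
from itertools import product

def expand(aug_rule: AugmentedRule, constants):
    """Enumerate assignments of constants to the variables in the rule body and keep
    only those for which the predicate P evaluates to true.  Each surviving
    assignment yields a fully ground rule <R', true, A'>."""
    variables = [s for s in aug_rule.rule.body if len(s) == 1 and s.isupper()]
    ground_rules = []
    for values in product(constants, repeat=len(variables)):
        bindings = dict(zip(variables, values))
        if aug_rule.predicate(bindings):
            ground_body = [bindings.get(s, s) for s in aug_rule.rule.body]
            ground_rules.append((Rule(aug_rule.rule.head, ground_body), bindings))
    return ground_rules

# Expanding first_rule against a small constant pool produces one ground rule per
# first name in the pool: <first> -> 'Jeff' and <first> -> 'Andy'.
print(expand(first_rule, ["Jeff", "Andy", "Main"]))
```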
Parse Tree
• The program handles variations in the order in which the first name and last name appear
• It also handles variations resulting from the use of nicknames (illustrated below)
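Continuing the same sketch, a pair of rules with the same head can cover both name orders, and a nickname table can be consulted through the predicate; the NICKNAMES table and the rules below are illustrative, not taken from the paper:

```python
NICKNAMES = {"andy": "andrew", "jeff": "jeffrey", "bill": "william"}

name_rules = [
    # The same head covers both "First Last" and "Last , First" orderings.
    AugmentedRule(Rule("<author>", ["<first>", "<last>"])),
    AugmentedRule(Rule("<author>", ["<last>", "','", "<first>"])),
    # A nickname is accepted as a <first>, and the action normalizes it to the
    # formal name, so "Andy Smith" and "Andrew Smith" parse to the same entity.
    AugmentedRule(
        Rule("<first>", ["N"]),
        predicate=lambda b: b["N"].lower() in NICKNAMES,
        action=lambda b: {"first_name": NICKNAMES[b["N"].lower()].title()},
    ),
]
```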
Weights
• A non-negative real number is assigned to each augmented rule in G'
• The weight of an output record is the sum of the weights of the augmented rules involved in its parse (sketched below)
• Lower weights indicate higher confidence
• The programmer can use "loose" rules, i.e., rules the programmer is not very confident about
• Higher weights are assigned to "loose" rules
• By default, if R' is an augmented rule in the expanded program G' derived from a rule R in G, the weight of R' is the log of the number of rules in G' derived from R
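The weighting could be sketched as follows; assigning log(#derived rules) per originating rule is my reading of the default described above, and the dictionary-based bookkeeping is an assumption:

```python
import math

def default_weights(expansion_map):
    """expansion_map: original augmented rule -> list of (ground rule, bindings) pairs
    derived from it (as produced by expand()).  Every derived rule gets weight
    log(#derived), so a 'loose' rule that expands into many ground rules receives
    a higher (less confident) weight."""
    weights = {}
    for derived in expansion_map.values():
        for ground_rule, _bindings in derived:
            weights[id(ground_rule)] = math.log(len(derived))
    return weights

def record_weight(rules_used, weights):
    """Weight of an output record: the sum of the weights of the augmented rules
    involved in its parse.  Lower totals indicate higher confidence."""
    return sum(weights[id(r)] for r in rules_used)

# Usage sketch: weights = default_weights({first_rule: expand(first_rule, ["Jeff", "Andy"])})
```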
Implementation
• Given a program G, we can construct the expanded program G'; given an input record r, traditional parsing techniques can then be used to parse r
• The main problem with this approach is that the expanded program G' can be very large
• Instead, construct Gr', a partially expanded program, at query time
• To construct Gr', consider each augmented rule <R, P, A> and enumerate all possible assignments of constants to the variables in R such that P evaluates to true
• Enforce an additional constraint: if a variable X occurs in R, then the constant c assigned to X must be a substring of the record r, i.e., the dictionary of candidate values for X, Dictionary(X) given by the predicate P(X, ...), is restricted to substrings of r (sketched below)
• E.g., for the record "Smith Andy, J", Dictionary(N) is drawn from the Nicknames(I, N, F, G) relation restricted to values occurring in the record
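A sketch of the query-time restriction: variable values are drawn only from constants that actually occur as substrings of the input record r. It reuses expand() and first_rule from the earlier snippets; the helper name partial_expand is mine:

```python
def partial_expand(aug_rule: AugmentedRule, dictionary, record: str):
    """Build the part of Gr' relevant to one input record r: a constant c may be
    assigned to a variable only if c occurs as a substring of r (and P still holds)."""
    in_record = [c for c in dictionary if c.lower() in record.lower()]
    return expand(aug_rule, in_record)   # reuse the expand() sketch above

# For the record "Smith Andy, J", only "Andy" survives the substring filter, so the
# partially expanded program stays small even if the first-name dictionary is huge.
print(partial_expand(first_rule, ["Jeff", "Andy", "Andrew", "Maria"], "Smith Andy, J"))
```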
Discussion
Record matching:
• Previous work on record matching has focused on the design of similarity functions
• This framework indicates that, with the right preprocessing, the need for approximate equality when performing record matching is minimized and often eliminated
• However, string similarity joins are still needed to capture variations such as typos and misspellings
• This framework does not intend to replace that body of work
Pay as you go:
• The goal of this framework is not to clean the entire dataset, because doing so is difficult
• Instead, the framework takes a "pay as you go" approach, using example reference tables that cover only part of the data to clean a subset of the data

Lineage:
• Parse trees constitute a natural notion of lineage that can be used to program on top of the module
• For example, a data cleaning developer using this framework can choose not to use the rule weighting options and instead use if-then-else logic to capture parse tree preferences
Uncertainty:
• The framework provides a tool to manage uncertainty in the data
• It incorporates "possible worlds", allowing multiple possible variations of the same entity
• It also returns multiple parse trees for the same input record, each with an accompanying score (see the sketch below)
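As an illustration of how a developer might consume the multiple scored parses, combining the lineage-based if-then-else preferences mentioned under "Lineage" with a weight-based fallback; the tuple shape and names below are assumptions, not the framework's API:

```python
def best_interpretation(candidates):
    """candidates: list of (parse_tree, rules_used, weight) triples for one input
    record -- the framework's 'possible worlds' for that record."""
    # Example preference logic over lineage: prefer any parse that used the
    # "Last , First" rule (which only fires when a comma is present), regardless of weight.
    for tree, rules_used, weight in candidates:
        if any(r.rule.head == "<author>" and "','" in r.rule.body for r in rules_used):
            return tree
    # Otherwise fall back to the lowest-weight (most confident) parse.
    return min(candidates, key=lambda c: c[2])[0]
```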