
A Grammar-based Entity Representation Framework for Data Cleaning

Presentation Transcript


  1. A Grammar-based Entity Representation Framework for Data Cleaning. Authors: Arvind Arasu, Raghav Kaushik. Presented by Rashmi Havaldar.

  2. Problems • Poor data quality is often due to the lack of unique representations for real-world entities. • E.g., California can be represented as California, Calif, CA, etc. • In the slide's author-records example, five textually different records correspond to just two authors.

  3. Problem Definition • The main problem in data cleaning is to determine whether or not two representations are duplicates, i.e., correspond to the same real-world entity. • Cosine similarity and edit distance rely on textual similarity, but textual similarity can be misleading (see the sketch below). • Two representations of the same entity can be highly dissimilar. • Conversely, two representations that are textually very similar can correspond to different entities.
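
To make the bullets above concrete, here is a minimal sketch using Python's standard difflib; the example strings are illustrative, not taken from the paper.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Crude textual similarity in [0, 1], via Python's stdlib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same entity, textually dissimilar: low score.
print(round(sim("CA", "California"), 2))    # 0.33

# Different entities, textually similar: high score.
print(round(sim("Arnold", "Arnoldo"), 2))   # 0.92
```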

  4. Solution: Programmable Framework

  5. Basic Definitions • A program is a collection of triples of the form <R, P, A>, where R is a grammar rule, P is a predicate, and A is an action. • A grammar rule has a head and a body: the head is a single non-terminal, and the body is a sequence of non-terminals, terminals, and variables. • Terminals are words and punctuation. • Non-terminals are written in angle brackets, terminals as single-quoted strings (e.g., 'Jeff'), and variables as uppercase letters.

  6. Example: Framework program (shown as a figure on the slide; an illustrative sketch follows)
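
The program itself appears only as a figure on the slide. Below is a minimal Python sketch of what such a program could look like, assuming the paper's running name-parsing example; the rule set, the Nicknames table, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Binding = Dict[str, str]

@dataclass
class AugmentedRule:
    head: str                    # a single non-terminal, e.g. "<name>"
    body: Sequence[str]          # non-terminals, 'terminals', variables
    predicate: Callable[[Binding], bool] = lambda b: True  # P: constrains variables
    action: Callable[[Binding], str] = lambda b: ""        # A: builds the output

# Hypothetical reference table: Nicknames(nickname, full first name).
NICKNAMES = {("Andy", "Andrew"), ("Jeff", "Jeffrey")}

program: List[AugmentedRule] = [
    # <name> -> <first> <last>        handles "Andy Smith"
    AugmentedRule("<name>", ["<first>", "<last>"]),
    # <name> -> <last> ',' <first>    handles "Smith, Andy"
    AugmentedRule("<name>", ["<last>", "','", "<first>"]),
    # <first> -> N  with predicate Nicknames(N, F); the action emits the
    # full first name F (both N and F get bound during expansion).
    AugmentedRule("<first>", ["N"],
                  predicate=lambda b: (b["N"], b["F"]) in NICKNAMES,
                  action=lambda b: b["F"]),
]

print(program[2].predicate({"N": "Andy", "F": "Andrew"}))  # True
print(program[2].action({"N": "Andy", "F": "Andrew"}))     # Andrew
```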

  7. Expanded program G' for program G • The expanded program G', like G, is a collection of augmented rules. • To construct G', we take each augmented rule <R, P, A> and enumerate all possible assignments of constant values to the variables in R for which the predicate P evaluates to true, producing ground rules of the form <R', true, A'> (see the sketch below).
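
A sketch of this enumeration, assuming a small finite domain of constants; the Nicknames predicate follows the slides' illustrative example.

```python
from itertools import product
from typing import Callable, Dict, Iterator, Sequence

Binding = Dict[str, str]

def expand(variables: Sequence[str],
           domain: Sequence[str],
           predicate: Callable[[Binding], bool]) -> Iterator[Binding]:
    """Enumerate constant assignments for which P evaluates to true.

    Each satisfying binding corresponds to one ground rule <R', true, A'>.
    """
    for values in product(domain, repeat=len(variables)):
        binding = dict(zip(variables, values))
        if predicate(binding):
            yield binding

NICKNAMES = {("Andy", "Andrew"), ("Jeff", "Jeffrey")}
hits = expand(["N", "F"],
              ["Andy", "Andrew", "Jeff", "Jeffrey"],
              lambda b: (b["N"], b["F"]) in NICKNAMES)
print(list(hits))  # [{'N': 'Andy', 'F': 'Andrew'}, {'N': 'Jeff', 'F': 'Jeffrey'}]
```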

  8. Parse Tree: • The program handles variations in the order in which the first name and last name appear. • It also handles variations resulting from the use of nicknames.

  9. Weights: • A non-negative real number is assigned to each augmented rule in G'. • The weight of an output record is the sum of the weights of the augmented rules involved in parsing it. • Lower weights indicate higher confidence. • The programmer can use "loose" rules, i.e., rules the programmer is not very confident about. • Higher weights are assigned to "loose" rules. • If R' is an augmented rule in the expanded program G', its default weight is the log of the number of rules in G' (see the sketch below).
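
A minimal sketch of the scoring just described; the rule names and weight values are hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical per-rule weights; a "loose" rule gets a higher weight.
RULE_WEIGHTS: Dict[str, float] = {
    "<name> -> <first> <last>": 0.5,
    "<first> -> nickname":      2.0,   # loose rule, so penalized
}

def record_weight(rules_used: List[str]) -> float:
    """Sum of the weights of the augmented rules used in a parse."""
    return sum(RULE_WEIGHTS[r] for r in rules_used)

def default_weight(num_rules_in_expanded_program: int) -> float:
    """Default from the slide: log of the number of rules in G'."""
    return math.log(num_rules_in_expanded_program)

# A parse that avoids the loose rule scores lower (= higher confidence).
print(record_weight(["<name> -> <first> <last>"]))                         # 0.5
print(record_weight(["<name> -> <first> <last>", "<first> -> nickname"]))  # 2.5
```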

  10. Implementation • Given a program G, we can construct the expanded program G'; given an input record r, we can then use traditional parsing techniques to parse r. • The main problem with this approach is that the expanded program G' can be very large. • Instead, construct Gr', a partially expanded program, at query time. • To construct Gr', consider each augmented rule <R, P, A> and enumerate all possible assignments of constants to the variables in R such that P evaluates to true. • Enforce an additional constraint: if a variable X occurs in R, then the constant c assigned to X must be a substring of the record r, i.e., Dictionary(X): P(X, ...). • E.g., for the record "Smith Andy, J": Dictionary(N): Nicknames(I, N, F, G). (A sketch follows.)
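
A minimal sketch of this query-time constraint, assuming token-level matching against the record; the nickname table is illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical nickname table: nickname -> full first name.
NICKNAMES: Dict[str, str] = {"Andy": "Andrew", "Jeff": "Jeffrey"}

def partial_expand(record: str, table: Dict[str, str]) -> List[Tuple[str, str]]:
    """Bind variable N only to constants that occur in the input record.

    Instead of enumerating the whole table, keep only the entries whose
    key is a substring (here, a token) of the record r.
    """
    tokens = record.replace(",", " ").split()
    return [(tok, table[tok]) for tok in tokens if tok in table]

print(partial_expand("Smith Andy, J", NICKNAMES))   # [('Andy', 'Andrew')]
```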

  11. Case studies: 1. UCD people data

  12. Quality of record matching (figures on the slide)

  13. 2. Author Affiliation Dataset

  14. Program: (shown as a figure on the slide)

  15. Discussion. Record matching: • Previous work on record matching focused on designing similarity functions. • This framework indicates that, with the right preprocessing, the need for approximate equality when performing record matching is minimized and often eliminated. • However, string-similarity joins are still needed to capture variations such as typos and misspellings. • This framework does not intend to replace that body of work.

  16. Pay as you go: • The goal of this framework is not to clean the entire dataset, because doing so is difficult. • Instead, the framework takes a "pay as you go" approach, using example reference tables that cover only part of the data to clean a subset of the data. Lineage: • Parse trees constitute a natural notion of lineage that can be used to program on top of the module. • E.g., a data-cleaning developer using this framework can choose not to use the rule-weighting options and instead use if-then-else logic to capture parse-tree preferences (see the sketch below).
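
A minimal sketch of such if-then-else logic over parse-tree lineage; the rule names and the preference encoded are illustrative.

```python
from typing import List, Optional, Tuple

Parse = Tuple[str, List[str]]   # (output record, rules used = lineage)

def pick_parse(parses: List[Parse]) -> Optional[str]:
    """Prefer any parse whose lineage avoids the loose nickname rule;
    otherwise fall back to the first parse returned."""
    for output, rules in parses:
        if "<first> -> nickname" not in rules:
            return output
    return parses[0][0] if parses else None

parses = [
    ("Andrew Smith", ["<name> -> <first> <last>", "<first> -> nickname"]),
    ("Andy Smith",   ["<name> -> <first> <last>"]),
]
print(pick_parse(parses))   # Andy Smith
```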

  17. Uncertainty: • The framework provides a tool to manage uncertainty in the data. • The framework incorporates "possible worlds": it allows multiple possible variations of the same entity. • The framework also returns multiple parse trees for the same input record, each with an accompanying score.

  18. Questions???

  19. Thank you!
