360 likes | 524 Views
Efficient Discovery of XML Data Redundancies. Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 12 th , 2006. Talk Outline. Motivating Example A Comprehensive Notion of XML FD XML Redundancy Discovery Algorithms Experimental Evaluation
E N D
Efficient Discovery of XML Data Redundancies Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 12th, 2006
Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion
An Example XML Document warehouse state state state store store store … … name name name “Borders” “Amazon” book book book “Borders” au au price title ISBN price au au title ISBN “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
Constraints on XML Data • An example constraint: For any two books, if they have the same ISBN, then they have the same title. • Similar to Equality Generating Dependencies (EGDs) [BV84] and Nested EGDs [YP04] Condition Element(s) Implication Element(s) Target
Data Redundancies • E.g., title is redundantly stored • Result of “non-optimal” design of the database schema in the presence of constraints • Lead to: • Update anomalies • Increased cost for data transfer and manipulation • Constraints are the properties of data • May not be known at the design phase
Goal Efficiently Discover Redundancies From the XML Database By Discovering Satisfied Constraints
Main Contributions • A comprehensive notion of XML FD • Capturing a semantically richer set of XML constraints • Definition of XML data redundancy in terms of XML FDs and XML Keys • Efficient algorithms for discovering FDs and data redundancies from an XML database • Experimental Evaluation
Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion
Example XML Constraints • Hierarchical: condition and/or implication elements can come from multiple hierarchies … … state state store store store name name name “Borders” “Amazon” book book book “Borders” au au price ISBN title price au au ISBN title “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
Example XML Constraints, Cont’d • Set elements: condition and/or implication elements can involve set elements … … state state store store store name name name “Borders” “Amazon” book book book “Borders” au au price ISBN title price au au ISBN title “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
Functional Dependencies (FDs) • FDs are used to describe constraints in relational databases • A similar notion of FD is needed for XML • Challenges: • Target is difficult to specify due to the hierarchical structure • Set elements introduce new semantics XML FD needs richer semantics !
Previous Notions • Path Based Notion [LLL02,VLL04] • Example: {/warehouse/state/store/book/ISBN} /warehouse/state/store/book/title • Format: LHS RHS • Semantics: for any two RHS nodes, same (associated) LHS indicates same RHS • Tree Tuple Based Notion [AL04] • A tree tuple is a data tree, with exactly one data node for each schema element • Format: LHS RHS • Semantics: for any two tree tuples, same LHS indicates same RHS
Previous Notions, cont’d • Both capture hierarchical constraints • Neither can capture set constraints • {/store/book/ISBN} /store/book/au • Violated in previous • Satisfied if the two au nodes are a single set • {/store/book/title, /store/book/au} /store/book/ISBN • Undefined in previous • Intuitive if au nodes are a single set store name book “Borders” au au price title ISBN “… 269” “R.R.” “DB” “$59.9” “J.G.”
A New Comprehensive Notion • Generalized Tree Tuple • A data tree constructed around a pivot data node (np) • Entire subtree rooted at np is kept • All ancestors of np and their “attributes” are kept • Tuple Class CP • The set of all generalized tree tuples, whose pivot nodes share the same path P (called pivot path)
Example Generalized Tree Tuple warehouse Pivot state state state store store store … … name name name “Borders” “Amazon” book book book “Borders” au au price title ISBN price au au title ISBN “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
Example Generalized Tree Tuple Pivot warehouse state state state store store store … … name name name “Borders” “Amazon” book book book “Borders” au au price title ISBN price au au title ISBN “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
XML FD • <CP, LHS, RHS>: LHS RHS w.r.t. CP • Semantics: for any two generalized tree tuple t1, t2 in CP, if they share the same LHS, they have the same RHS. • E.g., {./title, ./au} ./ISBN, w.r.t. C/warehouse/state/store/book
Repeatable Elements Are Special warehouse state state state store store store … … name name name “Borders” “Amazon” book book book “Borders” au au price title ISBN price au au title ISBN “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
Essential Tuple Classes • Definition: Tuple classes with pivot paths that correspond to repeatable schema elements • C/warehouse/state/store/book is essential • C/warehouse/state/store/name is not • Express XML FDs that are expressible with non-essential tuple classes • See paper for detailed proof
XML Key and Data Redundancy • Let attribute @key uniquely identify each node in the entire data tree • <CP, LHS> is an XML Key, when the database satisfies XML FD: LHS ./@key w.r.t. CP • Similar to the relative key notion proposed in [BDF+01] • Data redundancy exists if the database: • Satisfies the XML FD <CP, LHS, RHS>, • But <CP, LHS> is not an XML key RHS is redundantly stored.
Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion
Strategy • Discover satisfied XML FDs and Keys • Data redundancies can then be discovered based on the definition • First, we need an efficient representation of the XML data
Hierarchical Representation of XML Data • Each essential tuple class a relation • Similar to nested relations [OY87,MNE96] • All relations together form a hierarchy • Tree tuples can be reconstructed by joining @key with parent R_state @key parent 2 root 3 root 18 root . . . . . R_book @key parent ISBN title price 6 4 …269 DB $59.9 13 12 …269 DB $51.1 20 19 …269 DB $59.9 R_au @key parent @text 10 6 R.R. 11 6 J.G. 24 20 R.R. 25 20 J.G. R_store @key parent name 4 3 Borders 12 3 Amazon 19 18 Borders
Intra-Relation FDs • {./ISBN} ./title, w.r.t. C/warehouse/state/store/book … … state state store store store name name name “Borders” “Amazon” book book book “Borders” au au price title ISBN price au au ISBN title “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269”
Inter-Relation FDs • {../name, ./ISBN} ./price, w.r.t. C/warehouse/state/store/book … … Present in R_store state state store store store name name name “Borders” “Amazon” book book book “Borders” au au price title ISBN price au au ISBN title “R.R.” “$59.9” “DB” “J.G.” “… 269” price “… 269” “R.R.” “DB” “$59.9” “J.G.” title ISBN “DB” “$51.1” “… 269” Present in R_book
Overview of the Discovery Process • Only interested in minimal FDs • Bottom-Up • At each relation • Discover intra-relation FDs and Keys • Discover inter-relation FDs and Keys involving descendant relations • Generate candidate inter-relation FDs and Keys for examination at the parent level • Attribute Partition as the basic data structure
Attribute Partition • Groups tuples according to the attribute value • ∏{price} for Cbook = { {t6,t20}, {t13} } ∏{@key} for Cbook = { {t6}, {t20}, {t13} } ∏{price, @key} for Cbook = { {t6}, {t20}, {t13} } • FD: LHS RHS w.r.t. CP is satisfied iff: ∏LHS∪RHS = ∏LHS R_book @key parent ISBN title price 6 4 …269 DB $59.9 13 12 …269 DB $51.1 20 19 …269 DB $59.9
Set Attribute Partition • Generated through refinement Initialize ∏{au} for R_book to be { {t6, t13, t20} } ∏{@text} for R_au = { {t10, t24}, {t11, t25} } { {t6, t20}, {t6, t20} } ∏au for R_book = { {t6, t20}, {t13} } • ∏au can then be used as a normal partition R_au @key parent @text 10 6 R.R. 11 6 J.G. 24 20 R.R. 25 20 J.G. R_book @key parent ISBN title price 6 4 …269 DB $59.9 13 12 …269 DB $51.1 20 19 …269 DB $59.9 Convert to parent Refine ∏{au}using partitions in ∏{@text}
Discovery Algorithms • DiscoverFD: • Discover intra-relation FDs and Keys • Similar to existing relational algorithms • DiscoverXFD: • Discover inter-relation FDs and Keys • Key component: • Candidate inter-relation XML FD generation
Generating Candidate Inter-Relation FDs • Let P' be a parent relation of P • Parent satisfaction property • For LHS∪X RHS w.r.t. CP to hold for any attribute set X in relation P', LHS∪{./parent} RHS w.r.t. CP must hold • Child implication property • For LHS∪X RHS w.r.t. CP to be a non-trivial FD for any attribute set X in relation P', LHS RHS w.r.t. CP must not hold • An FD is a candidate inter-relation FD if it satisfies both properties
Talk Outline • Motivating Example • A Comprehensive Notion of XML FD • XML Redundancy Discovery Algorithms • Experimental Evaluation • Conclusion
DBLP contains a fair amount of redundancy, as noted earlier in [AL04] as well ~ 10% redundancies in PIR (measured as # of redundant elements over total # of elements), schema modification reported to PIR Real Datasets
Scalability on XMark • Linear in terms of scale factor (# of elements) – even though exponential in theory • Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm • The latter takes over 3 hours to run on XMark scale factor 1
Related Work • XML Integrity Constraints (FDs and Keys) • [BDF+01], [LLL02], [FS03] • XML Normal Form • [AL04], [VLL04] • Nested Relation Normal Form • [OY87], [MNE96] • Relational FD discovery • FUN, Dep-Miner, TANE, fdep, FastFDs
Conclusion • A comprehensive notion of XML FDs and Keys, capturing set semantics • A system for for detecting XML data redundancies through the discovery of FDs and Keys • The system is practical for real datasets and out-performs direct application of the best available relational algorithm by orders of magnitude.