Schema Summarization. Cong Yu and H. V. Jagadish, University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea, September 13th, 2006. Many Databases Are Complex. *Number of elements = #tables + #columns (relational) = #elements + #attributes (XML). Reactome Schema.
Schema Summarization Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 13th, 2006
Many Databases Are Complex *Number of elements = #tables + #columns (relational) = #elements + #attributes (XML)
What’s the Problem? • Why are complex schemas difficult to deal with? • For data integration administrators (DIAs): difficult to grasp the major topics of a complex schema • For ordinary users: difficult to identify the small subset of relevant schema elements • Can we avoid them? • Probably not: scientific databases are in fact getting more and more complex; MiMI is an example
Existing Approaches • Ignore the schema • Keyword-based search over relational and XML databases • Guess the schema • Schema-Free XQuery, FleXPath, etc. • Limitations: • Provide imprecise (and sometimes incorrect) answers • No help in understanding the schema (and the database) itself
Our Approach • Summarize the schema • Represent the original complex schema with a simpler schema, i.e., a summary of the original schema • Help users explore the schema via the summary • Illustrates the main topics of the database • Filters away irrelevant parts of the schema • Challenge: how to create a good summary?
Talk Outline • Motivation • Background Definitions • Desiderata of Schema Summary • Efficient Schema Summarization • Evaluation • Conclusion and Related Work
Schema • A labeled, directed graph • Nodes: • Relational: table and column • Hierarchical: element and attribute • Links: • Structural links: parent/child constraints • Value links: inclusion constraints (key / foreign key) [Figure: example schema graph rooted at warehouse, with elements state*, authors, store*, author*, contact, book*, isbn, price, title and attributes @name, @id, @address]
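The schema definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `SchemaGraph`, its methods, and the parent/child nesting used in the example are assumptions, since the figure's tree layout did not survive in the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    """Labeled, directed schema graph (hypothetical names, for illustration)."""
    nodes: set = field(default_factory=set)
    # Each link is (source, target, kind); kind is "structural" or "value".
    links: list = field(default_factory=list)

    def add_link(self, source, target, kind="structural"):
        if kind not in ("structural", "value"):
            raise ValueError(f"unknown link kind: {kind}")
        self.nodes.update((source, target))
        self.links.append((source, target, kind))

    def children(self, node):
        """Targets of structural (parent/child) links leaving `node`."""
        return [t for s, t, k in self.links if s == node and k == "structural"]


# Assumed nesting for the warehouse example; the real figure may differ.
g = SchemaGraph()
for parent, child in [
    ("warehouse", "state"), ("warehouse", "authors"),
    ("state", "store"), ("store", "book"),
    ("book", "isbn"), ("authors", "author"),
]:
    g.add_link(parent, child)

# A value link models an inclusion (key / foreign key) constraint.
g.add_link("book/author", "author/@id", kind="value")
```

Keeping the link kind on each edge lets later steps treat structural and value links uniformly or separately, as needed.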
Schema Summary • A schema itself, but: • Fewer elements • Simpler • Contains abstract elements and links • Abstract element: represents a group of original elements • Abstract link: connects at least one abstract element [Figure: the example warehouse schema next to a summary that keeps warehouse, author* and book*]
What Makes a Good Schema Summary? • Which one should be the summary? [Figure: three candidate summaries of the example warehouse schema]
What Information Do We Need? • A schema summary is not only a summary of the “schema” but in fact also a summary of the “database”! • Inputs: schema structure and data distribution [Figure: example warehouse schema]
Desired Properties of Schema Summary • Small enough (in terms of number of elements) to comprehend – Summary Complexity • Show elements in which users are more likely to be interested – Summary Importance • Show elements that represent the entire database well – Summary Coverage • Importance and Coverage calculation will need to consider both schema structure and data distribution
Intuition Behind Importance • Not all schema elements are created equal! • First observation: more links, more important (schema) • Second observation: more popular, more important (data) [Figure: example warehouse schema]
Compute Summary Importance • Schema Element Importance • W: Neighbor Weight, the percentage of ej’s information that flows into e, estimated using relative cardinalities • Summary Importance
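The formulas on this slide did not survive in the transcript, so the following is only a sketch of one plausible PageRank-style iteration consistent with the bullets: each element keeps a fraction `p` of its importance and passes the rest to its neighbors in proportion to neighbor weights derived from relative cardinalities. The function names, the parameter `p`, the `rc` input format, and the definition of summary importance as a share of total importance are all assumptions.

```python
def element_importance(rc, initial, p=0.5, max_iter=1000, tol=1e-9):
    """Iterate importance values until they stop changing.

    rc:      dict mapping (src, dst) -> relative cardinality of the link
             (assumed input format; list both directions of each link).
    initial: dict mapping element -> starting importance, e.g. its
             data cardinality (assumption).
    """
    # Neighbor weight: the share of src's importance that flows to dst.
    out_total = {}
    for (src, _dst), value in rc.items():
        out_total[src] = out_total.get(src, 0.0) + value
    weight = {(s, d): v / out_total[s] for (s, d), v in rc.items()}

    importance = dict(initial)
    for _ in range(max_iter):
        # Each element keeps a fraction p of its own importance...
        updated = {e: p * value for e, value in importance.items()}
        # ...and receives (1 - p) of each neighbor's, scaled by the weight.
        for (s, d), w in weight.items():
            updated[d] += (1 - p) * w * importance[s]
        delta = max(abs(updated[e] - importance[e]) for e in importance)
        importance = updated
        if delta < tol:
            break
    return importance


def summary_importance(importance, summary):
    """Share of total importance held by the summary elements (assumed form)."""
    return sum(importance[e] for e in summary) / sum(importance.values())


# Toy chain warehouse - store - book with made-up statistics.
rc = {
    ("warehouse", "store"): 10, ("store", "warehouse"): 1,
    ("store", "book"): 50, ("book", "store"): 1,
}
imp = element_importance(rc, {"warehouse": 1, "store": 10, "book": 500})
```

In this toy input every element has at least one outgoing link, so the total importance is conserved across iterations and only its distribution changes.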
Intuition Behind Coverage • Important ≠ included in the summary • Elements can be too “close” to each other • Two basic notions • Element Affinity • Element Coverage [Figure: example warehouse schema]
Intuition Behind Coverage, cont’d • Element Affinity: • fewer hops, higher affinity • higher relative cardinality, lower affinity • Element Coverage: • Element Affinity • Neighbor Weight [Figure: example warehouse schema]
Compute Summary Coverage • Schema element affinity from ea to eb • Schema element coverage of eb by ea • Summary Coverage
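The affinity and coverage formulas are missing from the transcript. The sketch below only illustrates the two stated intuitions (fewer hops give higher affinity, higher relative cardinality gives lower affinity) and should not be read as the paper's definition. The per-hop factor `1 / (1 + rc)`, the best-path rule, and the function name `affinity_from` are assumptions chosen for illustration.

```python
import heapq


def affinity_from(source, rc):
    """Best-path affinity from `source` to every reachable element.

    rc maps (src, dst) -> relative cardinality (assumed input format).
    Each hop multiplies affinity by 1 / (1 + rc), an assumed factor that
    is below 1 (so more hops lower affinity) and shrinks as rc grows.
    """
    adjacency = {}
    for (src, dst), value in rc.items():
        adjacency.setdefault(src, []).append((dst, 1.0 / (1.0 + value)))

    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap via negated affinity
    while heap:
        negated, node = heapq.heappop(heap)
        current = -negated
        if current < best.get(node, 0.0):
            continue  # stale heap entry
        for nxt, factor in adjacency.get(node, []):
            candidate = current * factor
            if candidate > best.get(nxt, 0.0):
                best[nxt] = candidate
                heapq.heappush(heap, (-candidate, nxt))
    return best


# Made-up relative cardinalities for part of the warehouse example.
rc = {("store", "book"): 1.0, ("book", "isbn"): 1.0, ("store", "contact"): 3.0}
aff = affinity_from("store", rc)
```

Because every hop factor lies strictly between 0 and 1, a Dijkstra-style search finds the highest-affinity path to each element.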
What Makes a Good Schema Summary? [Diagram relating schema structure and data distribution to summary importance and summary coverage]
Overview • Input: database schema and summary size K • (1) Annotating Schema Graph (computing statistics) • (2.1) Calculating Importance: list L of elements sorted by importance • (2.2) Calculating Coverage: set of K elements with high coverage; set S of coverage domination pairs • (Steps 2.1 and 2.2: Algorithms MaxImportance and MaxCoverage) • (3) Determine K summary elements (Algorithm BalanceSummary) • (4) Cluster original schema elements • Output: balanced summary of size K
Algorithm MaxImportance • MaxImportance generates a summary of a given size k, maximizing summary importance • Compute steady-state element importance values • Sort and pick the top-k important elements • Compute assignments of the remaining elements • Complexity: O(N^2 + N log N) * Convergence is proved in [MGR02].
Algorithm MaxCoverage • MaxCoverage generates a summary of a given size k, maximizing summary coverage in a heuristic way • Eliminate elements that are dominated • Compute summary coverage for all element sets of size k • Compute coverage dominance (bottom up with A/D pairs) • Pick the set with the highest coverage • Complexity: O(kN2nk) * See paper for details on coverage dominance
Generate Balanced Summary • No single optimal criterion to balance the two desired properties • A heuristic approach: • Pick elements in the order of their importance • Ignore elements that are dominated by elements already in the summary • Works well in practice
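The heuristic above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's BalanceSummary algorithm: the function name `balance_summary` and the `dominated_by` dictionary (a hypothetical encoding of the coverage domination pairs) are made up, as are the importance values in the example.

```python
def balance_summary(importance, dominated_by, k):
    """Pick up to k elements in importance order, skipping dominated ones.

    importance:   dict element -> importance value.
    dominated_by: dict element -> set of elements that dominate it
                  (hypothetical encoding of coverage domination pairs).
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    summary = []
    for element in ranked:
        if len(summary) == k:
            break
        # Skip an element already dominated by a chosen summary element.
        if dominated_by.get(element, set()) & set(summary):
            continue
        summary.append(element)
    return summary


# Made-up values: isbn is important but dominated by book.
importance = {"book": 10, "isbn": 9, "store": 8, "author": 7}
dominated_by = {"isbn": {"book"}}
chosen = balance_summary(importance, dominated_by, k=3)
```

If fewer than k non-dominated elements exist, this sketch returns a shorter summary rather than falling back to dominated elements.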
Evaluation Strategies • Observation • Comparing automatic summaries with summaries generated by human experts • In general, automatic summaries agree well with those of human experts (~80%) • An objective evaluation framework • Models query behavior based on schema exploration • Query discovery cost: the number of extra elements visited in order to construct a correct query from a query intention
Query Discovery Cost Example • Query Intention: Retrieve ISBN of all books • Query: for $b in doc()/state/store/book return $b/isbn [Figure: two versions of the example warehouse schema, labeled Cost = 3 and Cost = 5]
Impact of Balancing Importance and Coverage * Percentages in parentheses show the reduction in savings
Related Work • First study on summarizing schemas • Related to ER model abstraction • Limitations of ER model abstraction • Does not reflect the data distribution • ER models may not be available and may be out of date • For most database schemas, structural and value links are semantics-free, so ER model abstraction methods are ineffective (tagging those links involves a significant amount of manual effort)
Related Work, cont’d • Summary element importance calculation is partially inspired by PageRank • Summary element affinity calculation (used in summary coverage) is partially inspired by similar measurements in social network analysis
Conclusions and Contributions • Introduced concept of schema summary • Defined summary importance and summary coverage as desiderata of schema summary • Emphasized both schema structure and data distribution as essential features for importance and coverage calculation • Designed and implemented efficient schema summarization algorithms • An objective evaluation framework