Schema Summarization

Schema Summarization
Cong Yu and H. V. Jagadish, University of Michigan, Ann Arbor
VLDB 2006, Seoul, Korea, September 13th, 2006

Many Databases Are Complex
* Number of elements = #tables + #columns (relational), or #elements + #attributes (XML)

Reactome Schema

What’s the Problem?
• Why are complex schemas difficult to deal with?
  • For data integration administrators (DIAs): difficult to grasp the major topics of a complex schema
  • For ordinary users: difficult to identify the small subset of relevant schema elements
• Can we avoid them?
  • Probably not: scientific databases are getting more and more complex (MiMI is an example)

Existing Approaches
• Ignore the schema
  • Keyword-based search over relational and XML databases
• Guess the schema
  • Schema-Free XQuery, FleXPath, etc.
• Limitations:
  • Provide imprecise (and sometimes incorrect) answers
  • No help in understanding the schema (and the database) itself

Our Approach
• Summarize the schema
  • Represent the original complex schema with a simpler schema, i.e., a summary of the original schema
• Help users explore the schema via the summary
  • Illustrates the main topics of the database
  • Filters away irrelevant parts of the schema
• Challenge: how to create a good summary?

Talk Outline
• Motivation
• Background Definitions
• Desiderata of Schema Summary
• Efficient Schema Summarization
• Evaluation
• Conclusion and Related Work

Schema
• A labeled, directed graph
• Nodes:
  • Relational: table and column
  • Hierarchical: element and attribute
• Links:
  • Structural links: parent/child constraints
  • Value links: inclusion constraints (key / foreign key)
[Figure: example schema rooted at warehouse, with elements state*, authors, store*, author*, contact, book* and leaves @name, @id, isbn, price, title, @address]
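The graph definition above can be written down as a small data structure. This is only a minimal sketch: the class name `SchemaGraph`, the link-kind strings, and the exact parent/child layout of the warehouse example (including the hypothetical `book.author` node and its value link) are assumptions made for illustration, since the slide figure is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaGraph:
    """Labeled, directed graph: nodes are schema elements, links are typed."""
    nodes: set = field(default_factory=set)
    # Each link is (source, target, kind), kind in {"structural", "value"}.
    links: list = field(default_factory=list)

    def add_link(self, source, target, kind):
        if kind not in ("structural", "value"):
            raise ValueError(f"unknown link kind: {kind}")
        self.nodes.update((source, target))
        self.links.append((source, target, kind))

    def neighbors(self, node):
        """Elements connected to `node` by a link in either direction."""
        outgoing = {t for s, t, _ in self.links if s == node}
        incoming = {s for s, t, _ in self.links if t == node}
        return outgoing | incoming


# Assumed layout of part of the warehouse example.
g = SchemaGraph()
g.add_link("warehouse", "state", "structural")
g.add_link("state", "store", "structural")
g.add_link("store", "book", "structural")
g.add_link("book", "isbn", "structural")
g.add_link("warehouse", "authors", "structural")
g.add_link("authors", "author", "structural")
g.add_link("book", "book.author", "structural")
# Hypothetical value link: the author reference inside a book points at an author.
g.add_link("book.author", "author", "value")
```

Keeping the link kind on every edge lets later steps treat structural and value links differently without changing the graph representation.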

Schema Summary
• A schema itself, but:
  • Fewer elements, hence simpler
  • Contains abstract elements and links
• Abstract element: represents a group of original elements
• Abstract link: connects at least one abstract element
[Figure: the example warehouse schema next to a summary containing warehouse, author* and book*]

What Makes a Good Schema Summary?
• Which one should be the summary?
[Figure: candidate summaries of the example warehouse schema]

What Information Do We Need?
• A schema summary is not only a summary of the “schema” but in fact also a summary of the “database”!
• It needs both schema structure and data distribution
[Figure: example warehouse schema]

Desired Properties of Schema Summary
• Small enough (in terms of number of elements) to comprehend: Summary Complexity
• Show elements in which users are more likely to be interested: Summary Importance
• Show elements that represent the entire database well: Summary Coverage
• Importance and coverage calculations need to consider both schema structure and data distribution

Intuition Behind Importance
• Not all schema elements are created equal!
• First observation (schema): more links, more important
• Second observation (data): more popular, more important
[Figure: example warehouse schema]

Compute Summary Importance
• Schema Element Importance
  • W: Neighbor Weight, the percentage of e_j’s information that flows into e, estimated using relative cardinalities
• Summary Importance
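The slide names the ingredients (element importance, neighbor weight W, summary importance) without spelling out an update rule here, so the following is a minimal sketch of one PageRank-style iteration rather than the authors' exact formula. Everything specific in it is an assumption: importance is seeded with element cardinality, each element keeps a fraction `p` of its importance and passes the rest along its links in proportion to W, and summary importance is taken as the share of total importance held by the chosen elements. The function names and the toy weights are hypothetical.

```python
def element_importance(cardinality, weights, p=0.8, max_iters=200, tol=1e-9):
    """Iterate importance values to an (approximate) steady state.

    cardinality: {element: number of data instances}, used as the seed.
    weights: {(src, dst): W}, the share of src's importance flowing to dst.
             Assumed normalized: each element's outgoing weights sum to 1.
    p: assumed fraction of importance an element keeps in each round.
    """
    importance = {e: float(c) for e, c in cardinality.items()}
    for _ in range(max_iters):
        updated = {}
        for e in importance:
            inflow = sum(w * importance[src]
                         for (src, dst), w in weights.items() if dst == e)
            updated[e] = p * importance[e] + (1 - p) * inflow
        change = max(abs(updated[e] - importance[e]) for e in importance)
        importance = updated
        if change < tol:
            break
    return importance


def summary_importance(importance, summary):
    """Assumed definition: share of total importance held by the summary."""
    total = sum(importance.values())
    return sum(importance[e] for e in summary) / total


# Toy chain store - book - isbn with hand-picked (not measured) weights.
cardinality = {"store": 1, "book": 10, "isbn": 100}
weights = {
    ("store", "book"): 1.0,
    ("book", "store"): 0.5,
    ("book", "isbn"): 0.5,
    ("isbn", "book"): 1.0,
}
scores = element_importance(cardinality, weights)
```

In this toy example `isbn` has the most instances, yet `book` ends up with the highest score because it is linked on both sides, which mirrors the two observations about links and popularity.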

Intuition Behind Coverage
• Importance ≠ inclusion in the summary
  • Elements can be too “close” to each other
• Two basic notions
  • Element Affinity
  • Element Coverage
[Figure: example warehouse schema]

Intuition Behind Coverage, cont’d
• Element Affinity:
  • Fewer hops, higher affinity
  • Higher relative cardinality, lower affinity
• Element Coverage:
  • Element Affinity
  • Neighbor Weight
[Figure: example warehouse schema]

Compute Summary Coverage
• Schema element affinity from e_a to e_b
• Schema element coverage of e_b by e_a
• Summary Coverage

What Makes a Good Schema Summary?
[Figure: data distribution and schema structure both feed into summary importance and summary coverage]

Overview
• Input: database schema and summary size K
• (1) Annotating Schema Graph (computing statistics)
• (2.1) Calculating Importance (Algorithm MaxImportance): list L of elements sorted by importance
• (2.2) Calculating Coverage (Algorithm MaxCoverage): set of K elements with high coverage; set S of coverage domination pairs
• (3) Determine K summary elements (Algorithm BalanceSummary)
• (4) Cluster original schema elements
• Output: balanced summary of size K

Algorithm MaxImportance
• MaxImportance generates a summary of a given size k, maximizing summary importance
  • Compute steady-state element importance values
  • Sort and pick the top-k important elements
  • Compute assignments of remaining elements
• Complexity: O(N^2 + N log N)
* Convergence is proved in [MGR02].
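The last two steps (pick the top-k elements, then assign the remaining ones) can be sketched as follows, taking the importance values as already computed. The slide does not say how remaining elements are assigned, so this sketch assumes each one goes to the chosen element that is fewest hops away, found with a multi-source breadth-first search; the function name and the toy inputs are hypothetical.

```python
from collections import deque


def max_importance_summary(importance, adjacency, k):
    """Pick the k most important elements and assign every other element.

    importance: {element: importance value}, assumed precomputed.
    adjacency:  {element: iterable of neighbors}, treated as undirected.
    Returns (chosen, owner) where owner maps each reachable element to the
    chosen element assumed to represent it (the nearest one in hops).
    """
    chosen = sorted(importance, key=importance.get, reverse=True)[:k]
    owner = {e: e for e in chosen}
    queue = deque(chosen)  # multi-source BFS from all chosen elements
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in owner:
                owner[neighbor] = owner[current]
                queue.append(neighbor)
    return chosen, owner


# Toy values, not taken from the talk.
importance = {"warehouse": 5, "store": 30, "book": 50, "isbn": 10, "author": 40}
adjacency = {
    "warehouse": ["store", "author"],
    "store": ["warehouse", "book"],
    "book": ["store", "isbn"],
    "isbn": ["book"],
    "author": ["warehouse"],
}
chosen, owner = max_importance_summary(importance, adjacency, k=2)
```

Each group of elements sharing an owner would then be represented by one abstract element in the summary.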

Algorithm MaxCoverage
• MaxCoverage generates a summary of a given size k, maximizing summary coverage in a heuristic way
  • Compute coverage dominance (bottom up with A/D pairs)
  • Eliminate elements that are dominated
  • Compute summary coverage for all element sets of size k
  • Pick the set with the highest coverage
• Complexity: O(kN^2 n^k)
* See paper for details on coverage dominance
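A minimal sketch of the enumerate-and-pick part of this procedure is shown below. It does not implement coverage dominance or the coverage formula itself: the domination pairs and the `summary_coverage` scoring function are passed in, and the toy scoring function used in the example (size of the union of hand-written "covered" sets) is purely an illustrative assumption.

```python
from itertools import combinations


def max_coverage_summary(elements, domination_pairs, summary_coverage, k):
    """Return the size-k set of non-dominated elements with highest coverage.

    domination_pairs: set of (a, b) meaning "a dominates b" (assumed encoding).
    summary_coverage: callable scoring a tuple of elements (assumed given).
    Returns None if fewer than k non-dominated elements remain.
    """
    dominated = {b for (_, b) in domination_pairs}
    candidates = [e for e in elements if e not in dominated]
    return max(combinations(candidates, k), key=summary_coverage, default=None)


# Toy coverage model: each element "covers" a hand-written set of elements.
covers = {
    "store": {"store", "book"},
    "book": {"book", "isbn", "title"},
    "isbn": {"isbn"},
    "author": {"author", "contact"},
}


def toy_coverage(summary):
    return len(set().union(*(covers[e] for e in summary)))


best = max_coverage_summary(
    ["store", "book", "isbn", "author"], {("book", "isbn")}, toy_coverage, k=2
)
```

Removing dominated elements first shrinks the candidate pool, which matters because the number of size-k sets grows quickly with the number of candidates.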

Generate Balanced Summary
• No single optimal criterion balances the two desired properties
• A heuristic approach:
  • Pick elements in the order of their importance
  • Ignore elements that are dominated by elements already in the summary
• Works well in practice
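The heuristic above can be sketched in a few lines, assuming the importance-sorted list and the set of coverage domination pairs have already been produced by the earlier steps. The function name and the encoding of a pair `(a, b)` as "a dominates b" are assumptions for illustration.

```python
def balance_summary(sorted_by_importance, domination_pairs, k):
    """Greedy selection: take elements by importance, skip dominated ones.

    sorted_by_importance: elements in decreasing order of importance.
    domination_pairs: set of (a, b) meaning "a dominates b" (assumed encoding).
    May return fewer than k elements if too many candidates are dominated.
    """
    summary = []
    for element in sorted_by_importance:
        if len(summary) == k:
            break
        # Skip elements dominated by something already in the summary.
        if any((chosen, element) in domination_pairs for chosen in summary):
            continue
        summary.append(element)
    return summary


# Toy inputs, not taken from the talk.
ranked = ["book", "isbn", "author", "store"]
pairs = {("book", "isbn")}
summary = balance_summary(ranked, pairs, k=2)
```

Here `isbn` ranks second by importance but is skipped because `book` already covers it, so the second slot goes to `author`.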

Evaluation Strategies
• Observation
  • Comparing automatic summaries with summaries generated by human experts
  • In general, automatic summaries agree well with the human-generated ones (~80%)
• An objective evaluation framework
  • Models query behavior based on schema exploration
  • Query discovery cost: the number of extra elements visited in order to construct a correct query from a query intention

Query Discovery Cost Example
• Query Intention: Retrieve ISBN of all books
• Query: for $b in doc()/state/store/book return $b/isbn
[Figure: two versions of the example warehouse schema annotated with query discovery costs, Cost = 3 and Cost = 5]

Data Sets

Summary Benefits

Contributions of Schema Structure and Data Distribution

Impact of Balancing Importance and Coverage
* Percentage in parentheses shows the reduction in savings

Related Work
• First study on summarizing schemas
• Related to ER model abstraction
• Limitations of ER model abstraction
  • Does not reflect the data distribution
  • ER models may not be available and may be out of date
  • For most database schemas, structural or value links are semantics-free, so ER model abstraction methods are ineffective (tagging those links involves a significant amount of manual effort)

Related Work, cont’d
• Summary element importance calculation is partially inspired by PageRank
• Summary element affinity calculation (used in summary coverage) is partially inspired by similar measurements in social network analysis

Conclusions and Contributions
• Introduced the concept of schema summary
• Defined summary importance and summary coverage as desiderata of schema summary
• Emphasized both schema structure and data distribution as essential features for importance and coverage calculation
• Designed and implemented efficient schema summarization algorithms
• Proposed an objective evaluation framework

Questions ?
