Summarizing Relational Databases

Summarizing Relational Databases Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research Labs-Research

Motivation • Complex databases are challenging to explore and query • Consisting of hundreds of inter-linked tables • Users unfamiliar with the schema • Insufficient or unavailable schema information • Propose a principled approach to summarize the contents of a relational database • Cluster similar tables • Label each cluster by its most important table Summarizing Relational Databases

TPCE Summarizing Relational Databases

Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases

Table Importance • Depends on • Internal information content • External connectivity • Join behavior • Taxrate: 1 join • Customer: 5 joins 3 columns 2 columns 24 columns Summarizing Relational Databases

Table Importance (cont’d) • Entropy of Attribute A in table R is defined as • R.A = {a1, …, ak} • pi is the fraction of tuples in R that have value ai on attribute A • The Information Content of a table R is defined as • Create a primary key R.Key to table R • Add a self-loop R.Key – R.Key Summarizing Relational Databases

Table Importance (cont’d) • Entropy transfer matrix associated with schema graph G is defined as: • For a join edge e = R.A – S.B • For a pair of tables R and S, define • qA': number of join edges involving R.A' (including self-join) • VE – Variable entropy transfer Model , Summarizing Relational Databases

Table Importance (cont’d) • [S,T] = Pr(S.S_SymbT.T_S_Symb) • = α/ (IC(S)+2α) • The importance of table R is defined as the stable-state value of a random walk on G, using probability matrix • Vector , s.t. • Importance , R G • Example Security (S) Trade (T) • H(S.S_Symb)=α; qS_Symb=3 G(TPCE) Trade_Request (TR) = • Pr(S.S_Symb T.T_S_Symb) • H(S.S_Symb) α • IC(S)+(qS_Symb-1) ·H(S.S_Symb) IC(S)+2α = Summarizing Relational Databases

Table Similarity • Distance = 1 - similarity • Goal: define metric distance • Enables meaningful clustering over relational databases • Table similarity depends on how join edgesand join paths are instantiated A A B B 1 1 1 1 2 2 2 2 3 3 4 3 R R S S A B 1 1 2 1 3 2 R 2 2 3 3 • R.A = S.B S Summarizing Relational Databases

Table Similarity (cont’d) • Consider a join edge e = R.A – S.B • Tuplest1, t2 instantiate e • fanoute(ti) is the fanout of ti along e • fanoute(t1) = 3 • Let q be the number of tuples in Rs.t. fanoute(ti) > 0, define the matching fraction of Rw.r.t. e as fe(R) = q/n, |R|= n • fe(R) = 2/3 ≤ 1; fe(S) = 5/5 = 1 ≤ 1 • Define the matched average fanoutof Rw.r.t. e as • mafe(R) = (3+2)/2 = 2.5 ≥ 1; mafe(S) = 5/5 = 1 ≥ 1 A B t1 t2 1 1 2 1 3 1 R 3 3 • R.A = S.B S Summarizing Relational Databases

Table Similarity (cont’d) • The similarity of tables R and S (w.r.t. e(R,S)) must satisfy: • Property 1: Proportional to the matching fractions fe(R) and fe(S) • Property 2: Inverse proportional to the matched average fanoutsmafe(R) and mafe(S) • Define the strength of tables R and S (w.r.t. e(R,S)) as • e: R.A = S.B A A B B 1 1 1 1 2 2 2 2 3 3 4 3 R R S S • fe(R) = fe(S) =1 • fe(R) = fe(S) =2/3 • e: R.A = S.B A A B B 1 1 1 1 2 2 2 1 3 3 3 2 R R S 2 • Property 2: Inverse proportional to the matched average • fanoutsmafe(R) and mafe(S) 2 • mafe(R) = mafe(S) =1 3 • mafe(R) = 7/3 • mafe(S) =1 3 S Summarizing Relational Databases

Table Similarity (cont’d) • Let : R = R0 - R1 - … - Rα= S be a path in G, define • Table similarity (R, S): • Distance (R, S) • dists(R,S) = 1 – strength(R,S) • (R, dists) is a metric space R R1 R0 Rα • ei Ri-1 S Ri • : R = R0 - R1 - … - Rα= S Summarizing Relational Databases

Clustering: Weighted k-Center • Clustering Criteria: • Minimize the maximum distance between a cluster center and a table in that cluster • Take table importance into consideration, avoid grouping top important tables into one cluster • Weighted k-Center clustering • Weights: table importance • Given k clusters C = {C1, C2, …, Ck}, minimize • NP-Hard Summarizing Relational Databases

Weighted k-Center: Greedy Approximation Algorithm • Start with one cluster, whose center is the top-1 important table. • Iteratively chooses the table Ri whose weighted distance • from its cluster center is largest, and creates a new cluster with Ri as its center. • All tables that are closer to Ri than to their current • cluster center are reassigned to cluster Ci. Summarizing Relational Databases

Experimental Results • Validate the proposed three components in our approach • Model for table importance IE • Distance function dists • Clustering: Weighted k-Center • Other methods Entropy-based Strength-based Ic: Cardinality-initialized distc = 1 – coverage [1] [1] [1] distp = 1 – proximity [2] [1] C.Yu and H.V.Jagadish. Schema summarization. VLDB 2006. [2] H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD 2007. Summarizing Relational Databases

Experimental Results (cont’d) • Data Sets: TPCE schema • Benchmark database simulating OLTP workload • 33 tables pre-classified into 4 categories • Two database instances: TPCE-1 / TPCE-2 • Affect the size of the majority of tables • Affect Pr(R.AS.B), strength(R,S) for most pairs and mafe for 1/3 of edges Summarizing Relational Databases

TPCE Summarizing Relational Databases

Table Importance • Comparison of IE and IC models • Top-5 Important Tables in IE and their ranks in IC IE more accurate than IC • Top-5 Important Tables in IC and their ranks in IE Summarizing Relational Databases

Table Importance (cont’d) • Consistency of IE and IC models • Top-7 Important Tables in IE and IC for TPCE-1 and TPCE-2 IE more consistent than IC Summarizing Relational Databases

Distance Between Tables • Accuracy of distance functions • Observation: for each table R, its distances to tables within the same category (pre-defined) should be smaller than its distances to tables in different categories • n(R): # top-qnbrs (NNR) of R under dist. d (distS, distC, distP) • m(R): # tables (NNR) in the same category as R under dist. d • Calculate: q=5 distS most accurate Summarizing Relational Databases

Clustering Algorithms Based on IC and distC(coverage) • Accuracy of a summary • TPCE is pre-classified into 4 categories: Broker, Customer,Market and Dimension • m(Ci): # tables in Ci with the same category as center(Ci) • Given a summary C = {C1, C2, …, Ck}, calculate • Balanced-Summary (BS) [1] • Weighted k-Center (WKC) WKC is more accurate Summarizing Relational Databases

Summarization Algorithms • Weighted k-Center over three distance functions distS: most balanced and accurate • Summary Accuracy Summarizing Relational Databases

Related Work • C. Yu and H. V. Jagadish. Schema summarization. VLDB’06 • H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD’07 • W. Wu, B. Reinwald, Y. Sismanis and R. Manjrekar. Discovering topical structures of databases. SIGMOD’08 Summarizing Relational Databases

Conclusions • We proposed a novel approach for summarizing relational schemas • A new model for table importance • A metric distance over schema tables • A summarization algorithm Q&A Summarizing Relational Databases

Summarizing Relational Databases

Summarizing Relational Databases

Presentation Transcript

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

RELATIONAL DATABASES

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases:

Relational Databases

Relational Databases

Relational Databases