270 likes | 414 Views
Summarizing Relational Databases. Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research. Labs-Research. Motivation. Complex databases are challenging to explore and query Consisting of hundreds of inter-linked tables
E N D
Summarizing Relational Databases Xiaoyan Yang National University of Singapore Cecilia Magda Procopiuc Divesh Srivastava AT&T Labs-Research Labs-Research
Motivation • Complex databases are challenging to explore and query • Consisting of hundreds of inter-linked tables • Users unfamiliar with the schema • Insufficient or unavailable schema information • Propose a principled approach to summarize the contents of a relational database • Cluster similar tables • Label each cluster by its most important table Summarizing Relational Databases
TPCE Summarizing Relational Databases
Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases
Table Importance • Depends on • Internal information content • External connectivity • Join behavior • Taxrate: 1 join • Customer: 5 joins 3 columns 2 columns 24 columns Summarizing Relational Databases
Table Importance (cont’d) • Entropy of Attribute A in table R is defined as • R.A = {a1, …, ak} • pi is the fraction of tuples in R that have value ai on attribute A • The Information Content of a table R is defined as • Create a primary key R.Key to table R • Add a self-loop R.Key – R.Key Summarizing Relational Databases
Table Importance (cont’d) • Entropy transfer matrix associated with schema graph G is defined as: • For a join edge e = R.A – S.B • For a pair of tables R and S, define • qA': number of join edges involving R.A' (including self-join) • VE – Variable entropy transfer Model , Summarizing Relational Databases
Table Importance (cont’d) • [S,T] = Pr(S.S_SymbT.T_S_Symb) • = α/ (IC(S)+2α) • The importance of table R is defined as the stable-state value of a random walk on G, using probability matrix • Vector , s.t. • Importance , R G • Example Security (S) Trade (T) • H(S.S_Symb)=α; qS_Symb=3 G(TPCE) Trade_Request (TR) = • Pr(S.S_Symb T.T_S_Symb) • H(S.S_Symb) α • IC(S)+(qS_Symb-1) ·H(S.S_Symb) IC(S)+2α = Summarizing Relational Databases
Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases
Table Similarity • Distance = 1 - similarity • Goal: define metric distance • Enables meaningful clustering over relational databases • Table similarity depends on how join edgesand join paths are instantiated A A B B 1 1 1 1 2 2 2 2 3 3 4 3 R R S S A B 1 1 2 1 3 2 R 2 2 3 3 • R.A = S.B S Summarizing Relational Databases
Table Similarity (cont’d) • Consider a join edge e = R.A – S.B • Tuplest1, t2 instantiate e • fanoute(ti) is the fanout of ti along e • fanoute(t1) = 3 • Let q be the number of tuples in Rs.t. fanoute(ti) > 0, define the matching fraction of Rw.r.t. e as fe(R) = q/n, |R|= n • fe(R) = 2/3 ≤ 1; fe(S) = 5/5 = 1 ≤ 1 • Define the matched average fanoutof Rw.r.t. e as • mafe(R) = (3+2)/2 = 2.5 ≥ 1; mafe(S) = 5/5 = 1 ≥ 1 A B t1 t2 1 1 2 1 3 1 R 3 3 • R.A = S.B S Summarizing Relational Databases
Table Similarity (cont’d) • The similarity of tables R and S (w.r.t. e(R,S)) must satisfy: • Property 1: Proportional to the matching fractions fe(R) and fe(S) • Property 2: Inverse proportional to the matched average fanoutsmafe(R) and mafe(S) • Define the strength of tables R and S (w.r.t. e(R,S)) as • e: R.A = S.B A A B B 1 1 1 1 2 2 2 2 3 3 4 3 R R S S • fe(R) = fe(S) =1 • fe(R) = fe(S) =2/3 • e: R.A = S.B A A B B 1 1 1 1 2 2 2 1 3 3 3 2 R R S 2 • Property 2: Inverse proportional to the matched average • fanoutsmafe(R) and mafe(S) 2 • mafe(R) = mafe(S) =1 3 • mafe(R) = 7/3 • mafe(S) =1 3 S Summarizing Relational Databases
Table Similarity (cont’d) • Let : R = R0 - R1 - … - Rα= S be a path in G, define • Table similarity (R, S): • Distance (R, S) • dists(R,S) = 1 – strength(R,S) • (R, dists) is a metric space R R1 R0 Rα • ei Ri-1 S Ri • : R = R0 - R1 - … - Rα= S Summarizing Relational Databases
Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases
Clustering: Weighted k-Center • Clustering Criteria: • Minimize the maximum distance between a cluster center and a table in that cluster • Take table importance into consideration, avoid grouping top important tables into one cluster • Weighted k-Center clustering • Weights: table importance • Given k clusters C = {C1, C2, …, Ck}, minimize • NP-Hard Summarizing Relational Databases
Weighted k-Center: Greedy Approximation Algorithm • Start with one cluster, whose center is the top-1 important table. • Iteratively chooses the table Ri whose weighted distance • from its cluster center is largest, and creates a new cluster with Ri as its center. • All tables that are closer to Ri than to their current • cluster center are reassigned to cluster Ci. Summarizing Relational Databases
Outline • Motivation • Our Approach • Define table importance • Define a metric space over the tables • Clustering: Weighted k-Center • Experimental Results • Conclusions Summarizing Relational Databases
Experimental Results • Validate the proposed three components in our approach • Model for table importance IE • Distance function dists • Clustering: Weighted k-Center • Other methods Entropy-based Strength-based Ic: Cardinality-initialized distc = 1 – coverage [1] [1] [1] distp = 1 – proximity [2] [1] C.Yu and H.V.Jagadish. Schema summarization. VLDB 2006. [2] H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD 2007. Summarizing Relational Databases
Experimental Results (cont’d) • Data Sets: TPCE schema • Benchmark database simulating OLTP workload • 33 tables pre-classified into 4 categories • Two database instances: TPCE-1 / TPCE-2 • Affect the size of the majority of tables • Affect Pr(R.AS.B), strength(R,S) for most pairs and mafe for 1/3 of edges Summarizing Relational Databases
TPCE Summarizing Relational Databases
Table Importance • Comparison of IE and IC models • Top-5 Important Tables in IE and their ranks in IC IE more accurate than IC • Top-5 Important Tables in IC and their ranks in IE Summarizing Relational Databases
Table Importance (cont’d) • Consistency of IE and IC models • Top-7 Important Tables in IE and IC for TPCE-1 and TPCE-2 IE more consistent than IC Summarizing Relational Databases
Distance Between Tables • Accuracy of distance functions • Observation: for each table R, its distances to tables within the same category (pre-defined) should be smaller than its distances to tables in different categories • n(R): # top-qnbrs (NNR) of R under dist. d (distS, distC, distP) • m(R): # tables (NNR) in the same category as R under dist. d • Calculate: q=5 distS most accurate Summarizing Relational Databases
Clustering Algorithms Based on IC and distC(coverage) • Accuracy of a summary • TPCE is pre-classified into 4 categories: Broker, Customer,Market and Dimension • m(Ci): # tables in Ci with the same category as center(Ci) • Given a summary C = {C1, C2, …, Ck}, calculate • Balanced-Summary (BS) [1] • Weighted k-Center (WKC) WKC is more accurate Summarizing Relational Databases
Summarization Algorithms • Weighted k-Center over three distance functions distS: most balanced and accurate • Summary Accuracy Summarizing Relational Databases
Related Work • C. Yu and H. V. Jagadish. Schema summarization. VLDB’06 • H.Tong, C.Faloutsos and Y.Koren. Fast direction-aware proximity for graph mining. KDD’07 • W. Wu, B. Reinwald, Y. Sismanis and R. Manjrekar. Discovering topical structures of databases. SIGMOD’08 Summarizing Relational Databases
Conclusions • We proposed a novel approach for summarizing relational schemas • A new model for table importance • A metric distance over schema tables • A summarization algorithm Q&A Summarizing Relational Databases