250 likes | 407 Views
Summary Graphs for Relational Database Schemas. Xiaoyan Yang (NUS) Cecilia M. Procopiuc , Divesh Srivastava (AT&T). Motivation. Complex database schemas in large enterprise systems 1000+ tables, 10000+ columns, 100000+ PK/FK edges Prior work to help users understand complex schemas
E N D
Summary Graphs for Relational Database Schemas Xiaoyan Yang (NUS) Cecilia M. Procopiuc, DiveshSrivastava (AT&T)
Motivation • Complex database schemas in large enterprise systems • 1000+ tables, 10000+ columns, 100000+ PK/FK edges • Prior work to help users understand complex schemas • Customized views (forms) to hide database schema • Present informative tables to simplify schema understanding • Goal: schema graph summary connecting user’s query tables • Needs to be succinct • Needs to preserve informative join paths
Complex Schema Graph Example • Complex database schema in a large real enterprise system • Too complex for illustrative purposes
Useless TPC-E Schema Summary Graph trade status_type customer security Graph weight = 4.5572034455 • Not very informative: all query tables have a status_type field • Succinct graph does not mean informative graph!
Informative TPC-E Schema Summary Graph trade customer_account holding_summary customer security Graph weight = 1.6917276155 • Very informative: securities held by, trades made by customer • Larger graph, smaller graph weight, union of shortest paths
Useless TPC-E Schema Summary Graph • Union of pairwise shortest paths is not the answer • Small graph weight, but verbosity hinders understandability
Succinct TPC-E Schema Summary Graph broker 0.7298749340 customer_account exchange 1.9574738210 0.2947410428 1.0944249463 customer_taxrate commission_rate industry 0.7470561327 1.4236398511 1.2674994678 • Informative & succinct: customer_account, exchange are hubs • Slightly larger graph weight, but informative and succinct Graph weight = 7.5147101957
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Desiderata • Schema graph summary must be informative and succinct • Need a formal definition of “informative” • Use Information Theory • Need a formal definition of “succinct” • Use Graph Summarization
Problem Statement 1: Informative Edges • Given schema graph G = (R, E) and database instance D • Problem 1: define schema edge weights, wt: E R+ • More informative join edges have smaller weights (≥ 0) • Extend wt(R1, R2) = weight of shortest path between R1 and R2
Problem Statement 2: Succinct Graph • Given schema G = (R, E), weight wt, user-specified tables Q • Problem 2: compute summary graph (Rs, Es) • Q Rs R, |Rs| ≤ |Q|+B, for a given small budget B • Meta-edges Es {(R1, R2) | exists path between R1 and R2 in G} • (Rs, Es) must preserve shortest paths between Q tables in G • Optimize: (Rs, Es) has the minimum sum of meta-edge weights
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Informative Edges: Column Graph D E • Build an edge weighted column graph GC = (NC, EC) where • NC consists of all primary and foreign key columns in all tables • Intra-table edges in EC = {(R.P, R.F) | R.P is a PK column of R} • Inter-table edges in EC = {(R.P, R1.F) | R1.F is a foreign key to R.P} • Edge weights based on mutual information between columns B 0.1 0.05 0.28 S 0.21 R A 0.5 C T 0.6 F
Informative Edges: Table Graph S D E • Induce an edge weighted table graph GT = (NT, ET) where • NT consists of all tables • ET = {(R, R1) | R1.F is a foreign key to R.P} • Edge weight = min sum of weights on path between PK columns B 0.1 0.05 0.38 0.28 0.26 S R 0.21 R A 0.5 1.1 C T 0.6 F T
Edge Weight: Using Mutual Information H(X,Y) • Mutual information I(X;Y) = SxSy p(x,y) log2(p(x,y)/p(x)p(y)) • Mutual information captures strength of linkage between X, Y • D(X,Y) = 1 – H(X,Y)/I(X;Y) is a distance function, H() is entropy • D(X,Y) = 0 iff X, Y are identical; D(X,Y) = 1 iff X, Y are independent H(X|Y) I(X;Y) H(X) H(Y) H(Y|X) I(X;Y) = 1.5 H(X,Y) = 2.0, D(X,Y) = 0.25 i(x;y)
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Summary Graph • Given schema graph G = (R, E), edge weight wt: E R+, and user-specified tables Q, compute summary graph (Rs, Es) • Q Rs R, |Rs| ≤ |Q|+B, for a given small budget B • Meta-edges Es {(R1, R2) | exists path between R1 and R2 in G} • (Rs, Es) must preserve shortest paths between Q tables in G • Optimize: (Rs, Es) has the minimum sum of meta-edge weights 0.5 R 0.1 0.2 0.3 R R 0.1 0.3 0.3 0.1 0.5 A B T 0.2 S 0.3 0.1 B T A T S S Total weight = 1.2 Total weight = 1.1 Total weight = 0.7
Properties of Summary Graphs • Theorem: Computing the optimal summary graph is NP-hard Proof uses reduction from Clique in (n – 4)-regular graphs • Proposition (towards an elegant solution formulation): • It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes • Endpoints of meta-edges in optimal summary graph have to appear together on at least one shortest path between Q nodes
Efficient Computation of Summary Graphs • It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes • Elegant solution: formulate an integer program; use CPLEX
Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results
Experimental Setup • Data: use 2 instances of TPC-E benchmark database schema • Simulates an OLTP workload of a brokerage firm • Well-specified schema, including PK/FK constraints • Quality: use measures based on the TPC-E transaction logs • Table coverage: relative frequency of summary graph tables in log • Join coverage: relative frequency of summary graph joins in log • Summary graph density: reflects complexity of summary graph
Comparing Weight Functions • Compare MI-based and MAF-based [YPS09] edge weights • Fixed B, varying |Q|; fixed |Q|, varying B • MI-based weight is superior: higher coverage, lower density
Choosing Budget Tables • Effect of our strategy for choosing budget tables • Use coordinated summary graphs for fixed |Q|+B • Budget nodes reduce complexity, improve quality
Summary • Complex database schemas in large enterprise systems • 1000+ tables, 10000+ columns, 100000+ PK/FK edges • Novel schema graph summary is informative and succinct • Define schema graph edge weights using mutual information • Compute succinct summary graph that preserves query table shortest paths and minimizes graph weight, for a given budget • Experimental study validates weight definition, summary model • Future work: approximations for schema graph summaries