1 / 25

Summary Graphs for Relational Database Schemas

Summary Graphs for Relational Database Schemas. Xiaoyan Yang (NUS) Cecilia M. Procopiuc , Divesh Srivastava (AT&T). Motivation. Complex database schemas in large enterprise systems 1000+ tables, 10000+ columns, 100000+ PK/FK edges Prior work to help users understand complex schemas

cerise
Download Presentation

Summary Graphs for Relational Database Schemas

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Summary Graphs for Relational Database Schemas Xiaoyan Yang (NUS) Cecilia M. Procopiuc, DiveshSrivastava (AT&T)

  2. Motivation • Complex database schemas in large enterprise systems • 1000+ tables, 10000+ columns, 100000+ PK/FK edges • Prior work to help users understand complex schemas • Customized views (forms) to hide database schema • Present informative tables to simplify schema understanding • Goal: schema graph summary connecting user’s query tables • Needs to be succinct • Needs to preserve informative join paths

  3. Complex Schema Graph Example • Complex database schema in a large real enterprise system • Too complex for illustrative purposes

  4. TPC-E Benchmark Schema Graph

  5. Useless TPC-E Schema Summary Graph trade status_type customer security Graph weight = 4.5572034455 • Not very informative: all query tables have a status_type field • Succinct graph does not mean informative graph!

  6. Informative TPC-E Schema Summary Graph trade customer_account holding_summary customer security Graph weight = 1.6917276155 • Very informative: securities held by, trades made by customer • Larger graph, smaller graph weight, union of shortest paths

  7. Useless TPC-E Schema Summary Graph • Union of pairwise shortest paths is not the answer • Small graph weight, but verbosity hinders understandability

  8. Succinct TPC-E Schema Summary Graph broker 0.7298749340 customer_account exchange 1.9574738210 0.2947410428 1.0944249463 customer_taxrate commission_rate industry 0.7470561327 1.4236398511 1.2674994678 • Informative & succinct: customer_account, exchange are hubs • Slightly larger graph weight, but informative and succinct Graph weight = 7.5147101957

  9. Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results

  10. Desiderata • Schema graph summary must be informative and succinct • Need a formal definition of “informative” • Use Information Theory • Need a formal definition of “succinct” • Use Graph Summarization

  11. Problem Statement 1: Informative Edges • Given schema graph G = (R, E) and database instance D • Problem 1: define schema edge weights, wt: E R+ • More informative join edges have smaller weights (≥ 0) • Extend wt(R1, R2) = weight of shortest path between R1 and R2

  12. Problem Statement 2: Succinct Graph • Given schema G = (R, E), weight wt, user-specified tables Q • Problem 2: compute summary graph (Rs, Es) • Q  Rs  R, |Rs| ≤ |Q|+B, for a given small budget B • Meta-edges Es  {(R1, R2) | exists path between R1 and R2 in G} • (Rs, Es) must preserve shortest paths between Q tables in G • Optimize: (Rs, Es) has the minimum sum of meta-edge weights

  13. Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results

  14. Informative Edges: Column Graph D E • Build an edge weighted column graph GC = (NC, EC) where • NC consists of all primary and foreign key columns in all tables • Intra-table edges in EC = {(R.P, R.F) | R.P is a PK column of R} • Inter-table edges in EC = {(R.P, R1.F) | R1.F is a foreign key to R.P} • Edge weights based on mutual information between columns B 0.1 0.05 0.28 S 0.21 R A 0.5 C T 0.6 F

  15. Informative Edges: Table Graph S D E • Induce an edge weighted table graph GT = (NT, ET) where • NT consists of all tables • ET = {(R, R1) | R1.F is a foreign key to R.P} • Edge weight = min sum of weights on path between PK columns B 0.1 0.05 0.38 0.28 0.26 S R 0.21 R A 0.5 1.1 C T 0.6 F T

  16. Edge Weight: Using Mutual Information H(X,Y) • Mutual information I(X;Y) = SxSy p(x,y) log2(p(x,y)/p(x)p(y)) • Mutual information captures strength of linkage between X, Y • D(X,Y) = 1 – H(X,Y)/I(X;Y) is a distance function, H() is entropy • D(X,Y) = 0 iff X, Y are identical; D(X,Y) = 1 iff X, Y are independent H(X|Y) I(X;Y) H(X) H(Y) H(Y|X) I(X;Y) = 1.5 H(X,Y) = 2.0, D(X,Y) = 0.25 i(x;y)

  17. Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results

  18. Summary Graph • Given schema graph G = (R, E), edge weight wt: E R+, and user-specified tables Q, compute summary graph (Rs, Es) • Q  Rs  R, |Rs| ≤ |Q|+B, for a given small budget B • Meta-edges Es  {(R1, R2) | exists path between R1 and R2 in G} • (Rs, Es) must preserve shortest paths between Q tables in G • Optimize: (Rs, Es) has the minimum sum of meta-edge weights 0.5 R 0.1 0.2 0.3 R R 0.1 0.3 0.3 0.1 0.5 A B T 0.2 S 0.3 0.1 B T A T S S Total weight = 1.2 Total weight = 1.1 Total weight = 0.7

  19. Properties of Summary Graphs • Theorem: Computing the optimal summary graph is NP-hard Proof uses reduction from Clique in (n – 4)-regular graphs • Proposition (towards an elegant solution formulation): • It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes • Endpoints of meta-edges in optimal summary graph have to appear together on at least one shortest path between Q nodes

  20. Efficient Computation of Summary Graphs • It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes • Elegant solution: formulate an integer program; use CPLEX

  21. Outline • Motivation • Problem statement • Our solution • Defining schema edge weights • Computing summary graphs • Experimental results

  22. Experimental Setup • Data: use 2 instances of TPC-E benchmark database schema • Simulates an OLTP workload of a brokerage firm • Well-specified schema, including PK/FK constraints • Quality: use measures based on the TPC-E transaction logs • Table coverage: relative frequency of summary graph tables in log • Join coverage: relative frequency of summary graph joins in log • Summary graph density: reflects complexity of summary graph

  23. Comparing Weight Functions • Compare MI-based and MAF-based [YPS09] edge weights • Fixed B, varying |Q|; fixed |Q|, varying B • MI-based weight is superior: higher coverage, lower density

  24. Choosing Budget Tables • Effect of our strategy for choosing budget tables • Use coordinated summary graphs for fixed |Q|+B • Budget nodes reduce complexity, improve quality

  25. Summary • Complex database schemas in large enterprise systems • 1000+ tables, 10000+ columns, 100000+ PK/FK edges • Novel schema graph summary is informative and succinct • Define schema graph edge weights using mutual information • Compute succinct summary graph that preserves query table shortest paths and minimizes graph weight, for a given budget • Experimental study validates weight definition, summary model • Future work: approximations for schema graph summaries

More Related