1 / 17

Flexible Database Generators

Evaluating new DBMS components with Data Generation Language (DGL) for flexible and extensible database population. Explore DGL data types, iterators, annotations, and examples for efficient database population.

marysmall
Download Presentation

Flexible Database Generators

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Flexible Database Generators Nicolas Bruno Surajit ChaudhuriDMX GroupMicrosoft ResearchVLDB’05

  2. DBMS Components • DBMS are complex pieces of software • Components still evolving/being added

  3. Evaluating New Components • Functional vs. quality evaluation • Black-box vs. gray-box evaluation • Steps: • Generate data • Generate workload • Evaluate improvement Manual task: Time Consuming, sometimes difficult, not reusable

  4. DGL (Data Generation Language) • Special purpose specification language • Based on iterators • Functional flavor (can compose iterators) • Interface with DBMS for scalability • Flexible and extensible • SQL extensions using DGL annotations • Extend CREATE TABLE statements • Specify how a database is populated • Inter-table dependencies are possible

  5. Populating a Database with DGL

  6. DGL: Data Types • Base: Integers, Strings, Dates, etc. • Rows: heterogeneous sequence of scalars • Inherit operations from scalar types( [1, 4, 5.0] + [2, 3, 4.5] ) ++ [‘John’] = [3, 7, 9.5, ‘John’] • Iterators: key data type • Step(1,5) returns <[1],[2],[3],[4],[5]> • Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … > • Iterators inherit operations from rows: Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]> • Associative tables: main memory, random access

  7. Primitive Iterators • Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc.Uniform( Constant([0,0]), Constant ([10,10]) )= Uniform ( [0,0], [10,10] ) (implicit casts). • Others: Duplicate elimination, union, etc. • SQL: Bridge DGL and DBMSPersist (expression, [table name])Query (parameterized query, iterator1, …)

  8. Expressions and Functions • Expressions (acyclic reference graph) • Functions

  9. Annotated Schemas with DGL • Annotations: specify how to populate a database • From annotations to DGL: • Create single DGL fragment for all annotations • Vertical partitions for inter-table dependencies • Query rewriting • Proxy introduction

  10. Populating a Database with DGL

  11. Example: Multidimensional Distributions

  12. Example: Inter-table dependencies • Employees’ ages normally distributed around 40 • Employees’ departments follows Zipfian distribution • Employees’ bonus distributed around department 'category‘, which depends on budget • Dept budget is normally distributed around 10000 * size (number of employees) • Dept building follows Zipfian distribution LETemployeeempID = Top ( Step ( 0, 10000 ), employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ), deptbuilding = Zipfian ( 1.00, 20 ), employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ), employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ), deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>> GROUP BY <<0>>.v0", employeedeptIDProxy ), deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ), tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ), tmp2Proxy = Persist ( employeedeptID & employeeempID ), employeeempBaseBonus = Top ( Query ( "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>> ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1", tmp1Proxy, tmp2Proxy ), 10000 ), ... IN Union ( Persist ( employeeempID & employeeage & employeedeptID & employeebonus, "employee" ), Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" ) ) CREATE TABLE dept ( deptID int, budget float, building int ) POPULATE ( (deptID, size) = Query(" SELECT employee.deptID, count(*) FROM employee GROUP BY employee.deptID") budget = Normal(10000*size, 5000) building = Zipfian(1.0, 20) ) CREATE TABLE employee ( empID int, age int, deptID int, bonus int ) POPULATE 10000 ( empID = Step(0,10000), age = Normal(40.0,5.0), deptID = Zipfian(0.75, 50), empBaseBonus = Query(" SELECT D.budget / 1000 FROM employee JOIN dept ON employee.deptID = dept.deptID ORDER BY employee.empID"), bonus = empBaseBonus * Uniform(0.5,1.5) )

  13. Evaluation Model • Iterator model (open/getNext/close) • Program is DAG • Depending on consumers, buffering is required • In-memory circular queue that spills to disk

  14. Examples • Multi-gaussian • Wisconsin Benchmark • Skewed primary/foreign key joins

  15. Orders arrivals follow a Poisson distribution starting in ‘1992/01/01’ Number of items per order follows a Zipfian distribution.Ship date occurs k days after order date, where k follows Zipfian. Commit and receipt dates follow a bi-gaussian distribution after ship date. Item discounts are correlated to the global number of parts sold Top 100 customers’ debt is normally distributed around 3*balances.Remaining customers, around balances/2. All parts in an order are sold by suppliers that live in the samecountry as the customer. Complex TPC-H Examples

  16. Initial Performance Results Populate 1GB databases with various generators

  17. Conclusion • Creating datasets for quality evaluation of new database components is time-consuming • DGL is expressive and easy to use • SQL annotations reduce time needed to create and populate databases with non-trivial correlations

More Related