Flexible Database Generators

Flexible Database Generators Nicolas Bruno Surajit ChaudhuriDMX GroupMicrosoft ResearchVLDB’05

DBMS Components • DBMS are complex pieces of software • Components still evolving/being added

Evaluating New Components • Functional vs. quality evaluation • Black-box vs. gray-box evaluation • Steps: • Generate data • Generate workload • Evaluate improvement Manual task: Time Consuming, sometimes difficult, not reusable

DGL (Data Generation Language) • Special purpose specification language • Based on iterators • Functional flavor (can compose iterators) • Interface with DBMS for scalability • Flexible and extensible • SQL extensions using DGL annotations • Extend CREATE TABLE statements • Specify how a database is populated • Inter-table dependencies are possible

Populating a Database with DGL

DGL: Data Types • Base: Integers, Strings, Dates, etc. • Rows: heterogeneous sequence of scalars • Inherit operations from scalar types( [1, 4, 5.0] + [2, 3, 4.5] ) ++ [‘John’] = [3, 7, 9.5, ‘John’] • Iterators: key data type • Step(1,5) returns <[1],[2],[3],[4],[5]> • Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … > • Iterators inherit operations from rows: Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]> • Associative tables: main memory, random access

Primitive Iterators • Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc.Uniform( Constant([0,0]), Constant ([10,10]) )= Uniform ( [0,0], [10,10] ) (implicit casts). • Others: Duplicate elimination, union, etc. • SQL: Bridge DGL and DBMSPersist (expression, [table name])Query (parameterized query, iterator1, …)

Expressions and Functions • Expressions (acyclic reference graph) • Functions

Annotated Schemas with DGL • Annotations: specify how to populate a database • From annotations to DGL: • Create single DGL fragment for all annotations • Vertical partitions for inter-table dependencies • Query rewriting • Proxy introduction

Populating a Database with DGL

Example: Multidimensional Distributions

Example: Inter-table dependencies • Employees’ ages normally distributed around 40 • Employees’ departments follows Zipfian distribution • Employees’ bonus distributed around department 'category‘, which depends on budget • Dept budget is normally distributed around 10000 * size (number of employees) • Dept building follows Zipfian distribution LETemployeeempID = Top ( Step ( 0, 10000 ), employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ), deptbuilding = Zipfian ( 1.00, 20 ), employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ), employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ), deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>> GROUP BY <<0>>.v0", employeedeptIDProxy ), deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ), tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ), tmp2Proxy = Persist ( employeedeptID & employeeempID ), employeeempBaseBonus = Top ( Query ( "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>> ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1", tmp1Proxy, tmp2Proxy ), 10000 ), ... IN Union ( Persist ( employeeempID & employeeage & employeedeptID & employeebonus, "employee" ), Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" ) ) CREATE TABLE dept ( deptID int, budget float, building int ) POPULATE ( (deptID, size) = Query(" SELECT employee.deptID, count(*) FROM employee GROUP BY employee.deptID") budget = Normal(10000*size, 5000) building = Zipfian(1.0, 20) ) CREATE TABLE employee ( empID int, age int, deptID int, bonus int ) POPULATE 10000 ( empID = Step(0,10000), age = Normal(40.0,5.0), deptID = Zipfian(0.75, 50), empBaseBonus = Query(" SELECT D.budget / 1000 FROM employee JOIN dept ON employee.deptID = dept.deptID ORDER BY employee.empID"), bonus = empBaseBonus * Uniform(0.5,1.5) )

Evaluation Model • Iterator model (open/getNext/close) • Program is DAG • Depending on consumers, buffering is required • In-memory circular queue that spills to disk

Examples • Multi-gaussian • Wisconsin Benchmark • Skewed primary/foreign key joins

Orders arrivals follow a Poisson distribution starting in ‘1992/01/01’ Number of items per order follows a Zipfian distribution.Ship date occurs k days after order date, where k follows Zipfian. Commit and receipt dates follow a bi-gaussian distribution after ship date. Item discounts are correlated to the global number of parts sold Top 100 customers’ debt is normally distributed around 3*balances.Remaining customers, around balances/2. All parts in an order are sold by suppliers that live in the samecountry as the customer. Complex TPC-H Examples

Initial Performance Results Populate 1GB databases with various generators

Conclusion • Creating datasets for quality evaluation of new database components is time-consuming • DGL is expressive and easy to use • SQL annotations reduce time needed to create and populate databases with non-trivial correlations

Flexible Database Generators