170 likes | 176 Views
This paper discusses the evaluation of new components in complex DBMS software, focusing on functional vs quality evaluation and the use of a flexible database generator. It introduces the Data Generation Language (DGL) and its features such as data types, iterators, statistical distributions, and SQL extensions. Examples of DGL usage and performance results are also presented.
E N D
Flexible Database Generators Nicolas Bruno Surajit ChaudhuriDMX GroupMicrosoft ResearchVLDB’05
DBMS Components • DBMS are complex pieces of software • Components still evolving/being added
Evaluating New Components • Functional vs. quality evaluation • Black-box vs. gray-box evaluation • Steps: • Generate data • Generate workload • Evaluate improvement Manual task: Time Consuming, sometimes difficult, not reusable
DGL (Data Generation Language) • Special purpose specification language • Based on iterators • Functional flavor (can compose iterators) • Interface with DBMS for scalability • Flexible and extensible • SQL extensions using DGL annotations • Extend CREATE TABLE statements • Specify how a database is populated • Inter-table dependencies are possible
DGL: Data Types • Base: Integers, Strings, Dates, etc. • Rows: heterogeneous sequence of scalars • Inherit operations from scalar types( [1, 4, 5.0] + [2, 3, 4.5] ) ++ [‘John’] = [3, 7, 9.5, ‘John’] • Iterators: key data type • Step(1,5) returns <[1],[2],[3],[4],[5]> • Constant( [1,2] ) returns <[1,2], [1,2], [1,2], … > • Iterators inherit operations from rows: Step(1,5) ++ Step(6,10) = <[1,6], [2,7], [3,8], [4,9], [5,10]> • Associative tables: main memory, random access
Primitive Iterators • Statistical distributions: Uniform, Gaussian, Zipfian, Poisson, etc.Uniform( Constant([0,0]), Constant ([10,10]) )= Uniform ( [0,0], [10,10] ) (implicit casts). • Others: Duplicate elimination, union, etc. • SQL: Bridge DGL and DBMSPersist (expression, [table name])Query (parameterized query, iterator1, …)
Expressions and Functions • Expressions (acyclic reference graph) • Functions
Annotated Schemas with DGL • Annotations: specify how to populate a database • From annotations to DGL: • Create single DGL fragment for all annotations • Vertical partitions for inter-table dependencies • Query rewriting • Proxy introduction
Example: Inter-table dependencies • Employees’ ages normally distributed around 40 • Employees’ departments follows Zipfian distribution • Employees’ bonus distributed around department 'category‘, which depends on budget • Dept budget is normally distributed around 10000 * size (number of employees) • Dept building follows Zipfian distribution LETemployeeempID = Top ( Step ( 0, 10000 ), employeeage = Top ( Normal ( 40.00, 5.00 ), 10000 ), deptbuilding = Zipfian ( 1.00, 20 ), employeedeptIDProxy = Persist ( Top ( Zipfian ( 0.75, 50 ), 10000 ) ), employeedeptID = Query ( "SELECT * FROM <<0>>", employeedeptIDProxy ), deptdeptIDsize = Query ( "SELECT <<0>>.v0, count(*) FROM <<0>> GROUP BY <<0>>.v0", employeedeptIDProxy ), deptbudget = Normal ( 10000 * deptdeptIDsize_1, 5000 ), tmp1Proxy = Persist ( deptbudget & deptdeptIDsize ), tmp2Proxy = Persist ( employeedeptID & employeeempID ), employeeempBaseBonus = Top ( Query ( "SELECT <<0>>.v0 / 1000 FROM <<1>> JOIN <<0>> ON <<1>>.v0 = <<0>>.v1 ORDER BY <<1>>.v1", tmp1Proxy, tmp2Proxy ), 10000 ), ... IN Union ( Persist ( employeeempID & employeeage & employeedeptID & employeebonus, "employee" ), Persist ( deptdeptIDsize_0 & deptbudget & deptbuilding, "dept" ) ) CREATE TABLE dept ( deptID int, budget float, building int ) POPULATE ( (deptID, size) = Query(" SELECT employee.deptID, count(*) FROM employee GROUP BY employee.deptID") budget = Normal(10000*size, 5000) building = Zipfian(1.0, 20) ) CREATE TABLE employee ( empID int, age int, deptID int, bonus int ) POPULATE 10000 ( empID = Step(0,10000), age = Normal(40.0,5.0), deptID = Zipfian(0.75, 50), empBaseBonus = Query(" SELECT D.budget / 1000 FROM employee JOIN dept ON employee.deptID = dept.deptID ORDER BY employee.empID"), bonus = empBaseBonus * Uniform(0.5,1.5) )
Evaluation Model • Iterator model (open/getNext/close) • Program is DAG • Depending on consumers, buffering is required • In-memory circular queue that spills to disk
Examples • Multi-gaussian • Wisconsin Benchmark • Skewed primary/foreign key joins
Orders arrivals follow a Poisson distribution starting in ‘1992/01/01’ Number of items per order follows a Zipfian distribution.Ship date occurs k days after order date, where k follows Zipfian. Commit and receipt dates follow a bi-gaussian distribution after ship date. Item discounts are correlated to the global number of parts sold Top 100 customers’ debt is normally distributed around 3*balances.Remaining customers, around balances/2. All parts in an order are sold by suppliers that live in the samecountry as the customer. Complex TPC-H Examples
Initial Performance Results Populate 1GB databases with various generators
Conclusion • Creating datasets for quality evaluation of new database components is time-consuming • DGL is expressive and easy to use • SQL annotations reduce time needed to create and populate databases with non-trivial correlations