MIDDLEWARE SYSTEMS RESEARCH GROUP Rapid Development of Data Generators Using Meta Generators in PDGF MSRG.ORG Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen DBTest 2013, June 24, New York City
DBMS Benchmarking is Increasingly Complex • Data volumes are skyrocketing • Enterprise data warehouses double in size every three years • Many enterprise data warehouses are petabytes in size • Systems are becoming increasingly complex • Large number of processor cores • Single systems (SMP) with high core counts (80 on commodity hardware, 2048 on specialized hardware) • Multi-node systems (the sky is the limit) • Large memory • Dell released a TPC-H benchmark result with 15 TB of main memory on 64 systems • How to challenge these systems?
Benchmarks are increasingly complex • More tables, columns • More relationships, dependencies, data types, … • How to build these benchmarks? • Parallel Data Generation Framework to the rescue!
Parallel Data Generation Framework • Generic data generation framework • Relational model • Schema specified in configuration file • Post-processing stage for alternative representations • Repeatable computation • Based on XORSHIFT random number generators • Hierarchical seeding strategy
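The repeatability idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not PDGF's actual code: the xorshift step is Marsaglia's 64-bit variant with the shift triple (13, 7, 17), while `child_seed`, `field_value_seed`, `random_long`, and the mixing constant are hypothetical names and choices made up for this example. The point is that a value's seed depends only on its position (table, column, row), so any row can be regenerated without generating the rows before it.

```python
MASK64 = (1 << 64) - 1  # keep all arithmetic within 64 bits


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def child_seed(parent_seed: int, index: int) -> int:
    """Derive a child seed from a parent seed and a position (hypothetical scheme)."""
    mixed = (parent_seed + (index + 1) * 0x9E3779B97F4A7C15) & MASK64
    # xorshift maps 0 to 0, so avoid the all-zero state.
    return xorshift64(mixed or 1)


def field_value_seed(project_seed: int, table_id: int, column_id: int, row_id: int) -> int:
    """Walk the hierarchy project -> table -> column -> row."""
    table_seed = child_seed(project_seed, table_id)
    column_seed = child_seed(table_seed, column_id)
    return child_seed(column_seed, row_id)


def random_long(seed: int, lo: int, hi: int) -> int:
    """Map a seed to a number in [lo, hi] (simple modulo, slightly biased)."""
    return lo + xorshift64(seed) % (hi - lo + 1)


# The age of row 7 is always the same, no matter which worker computes it.
age = random_long(field_value_seed(42, table_id=0, column_id=1, row_id=7), 0, 120)
```

Because no state is shared between rows, the same function can be called from any thread or node, which is what makes parallel generation repeatable.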
Repeatable Data Generation • Data generation based on random numbers • More specifically parallel random number generation • Generation of numbers within range (e.g., age) • What if we want NULL values? • Repeat that logic in every generator?
PDGF Architecture • Controller: initialization • Meta Scheduler: inter-node scheduling • Scheduler: inter-thread scheduling • Worker: blockwise data generation • Update Black Box: coordination of data updates • Seeding System: random sequence adaptation • Generators: value generation • Output System: data formatting • To generate data for a schema, the user defines: • Schema XML file • Defines the relational schema • Generation XML file • Defines the output format (CSV, XML, merging tables)
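The scheduling layers listed above (inter-node, inter-thread, blockwise workers) can be pictured as nested range splitting. The following Python sketch only illustrates that idea; `split_range`, `plan_work`, and the even-split policy are assumptions for this example and are not taken from PDGF.

```python
def split_range(start: int, stop: int, parts: int) -> list[tuple[int, int]]:
    """Split [start, stop) into `parts` contiguous, near-equal sub-ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def plan_work(table_size: int, nodes: int, threads_per_node: int, block_size: int):
    """Return {(node, thread): [(block_start, block_stop), ...]} for one table."""
    plan = {}
    # Meta-scheduler level: rows are divided among nodes.
    for node, (n_lo, n_hi) in enumerate(split_range(0, table_size, nodes)):
        # Scheduler level: each node divides its rows among threads.
        for thread, (t_lo, t_hi) in enumerate(split_range(n_lo, n_hi, threads_per_node)):
            # Worker level: each thread generates its rows block by block.
            plan[(node, thread)] = [
                (b, min(b + block_size, t_hi)) for b in range(t_lo, t_hi, block_size)
            ]
    return plan


plan = plan_work(table_size=10000, nodes=2, threads_per_node=4, block_size=1000)
```

With position-based seeding, each worker can generate its blocks independently, since no random number state needs to be passed between workers.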
Configuring PDGF • Schema configuration • Data model • Relational model • Tables, fields • Properties • Table size, characters, … • Generators • Base generators • Meta generators • Update definition • Insert, update, delete • Generated as change data capture
    <table name="SUPPLIER">
      <size>${S}</size>
      <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
        <gen_IdGenerator />
      </field>
      <field name="S_NAME" size="25" type="VARCHAR">
        <gen_PrePostfixGenerator>
          <gen_PaddingGenerator>
            <gen_OtherFieldValueGenerator>
              <reference field="S_SUPPKEY" />
            </gen_OtherFieldValueGenerator>
            <character>0</character>
            <padToLeft>true</padToLeft>
            <size>9</size>
          </gen_PaddingGenerator>
          <prefix>Supplier </prefix>
        </gen_PrePostfixGenerator>
      </field>
      [..]
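To make the nesting in the S_NAME field concrete, here is a rough Python equivalent of the three generators wrapped around each other. The function names are hypothetical stand-ins, and the reading of padToLeft as "add padding characters on the left" is an assumption; the parameter values (pad character 0, size 9, prefix "Supplier ") come from the configuration above.

```python
def other_field_value(row: dict, field: str):
    """Stand-in for OtherFieldValueGenerator: read another field of the same row."""
    return row[field]


def pad(value, size: int, character: str = "0", pad_to_left: bool = True) -> str:
    """Stand-in for PaddingGenerator: pad the text form of `value` to `size`."""
    text = str(value)
    return text.rjust(size, character) if pad_to_left else text.ljust(size, character)


def pre_postfix(value: str, prefix: str = "", postfix: str = "") -> str:
    """Stand-in for PrePostfixGenerator: wrap `value` in fixed strings."""
    return f"{prefix}{value}{postfix}"


def supplier_name(row: dict) -> str:
    """Innermost generator first, mirroring the XML nesting."""
    key = other_field_value(row, "S_SUPPKEY")
    return pre_postfix(pad(key, 9, "0", True), prefix="Supplier ")


name = supplier_name({"S_SUPPKEY": 42})  # "Supplier 000000042"
```

The result is 18 characters long, which fits the declared VARCHAR size of 25.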
Base Generators in PDGF • DictList generator • Random line from file • Long generator • Random long in interval • Others • StaticValue • Double • Date • String • Text • …
    <table name="users">
      <size>10000</size>
      <fields>
        <field name="name">
          <type>java.sql.types.VARCHAR</type>
          <size>100</size>
          <gen_DictList>
            <file>dicts/names.dict</file>
          </gen_DictList>
        </field>
        <field name="age">
          <type>java.sql.types.NUMERIC</type>
          <gen_LongGenerator>
            <min>0</min>
            <max>120</max>
          </gen_LongGenerator>
        </field>
      </fields>
    </table>
Null Generator • Add NULL logic to every generator? • Could easily be implemented in higher class • Adds to the configuration file • Reduces performance (every time) • Higher order generator NullGenerator • Only used if added to the schema • Can be added to any generator
    <field name="age">
      <type>java.sql.types.NUMERIC</type>
      <gen_NullGenerator>
        <probability>0.05</probability>
        <gen_LongGenerator>
          <min>0</min>
          <max>120</max>
        </gen_LongGenerator>
      </gen_NullGenerator>
    </field>
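The higher-order idea can be sketched in Python as a wrapper that takes any generator and returns a new one. This is an illustration under assumptions, not PDGF's implementation: `long_generator` and `null_generator` are hypothetical names, and Python's built-in `random.Random` stands in for PDGF's seeding system.

```python
import random
from typing import Callable, Optional

# A generator is modelled as a function from a random source to a value.
Generator = Callable[[random.Random], Optional[int]]


def long_generator(lo: int, hi: int) -> Generator:
    """Base generator: uniform integer in [lo, hi]."""
    return lambda rng: rng.randint(lo, hi)


def null_generator(probability: float, inner: Generator) -> Generator:
    """Meta generator: return None with `probability`, else delegate to `inner`."""
    def generate(rng: random.Random) -> Optional[int]:
        if rng.random() < probability:
            return None  # the inner generator is never invoked for NULL rows
        return inner(rng)
    return generate


# Mirrors the "age" field: 5% NULLs, otherwise a value between 0 and 120.
age = null_generator(0.05, long_generator(0, 120))
rng = random.Random(1)
ages = [age(rng) for _ in range(10000)]
```

The base generator stays free of NULL handling, so fields without NULLs pay nothing for the feature.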
Meta Generators • Control flow and post-processing generators • Null generator controls flow • Post-processing • FormattedNumberGenerator • PaddingGenerator • UpperLowerCaseGenerator • PrePostfixGenerator • FormulaGenerator • Flow control • ProbabilityGenerator • SequentialGenerator • IfGenerator • SwitchGenerator • ReferenceGenerator
Post-Processing Example • Phone number for users • 10s of representations • PhoneNumberGenerator was too inflexible • Formatted long number • Long numbers between 10010001 and 9999999999 • Number formatting: (%d%d%d) %d%d%d-%d%d%d%d
    <field name="phonenumber">
      <type>java.sql.types.VARCHAR</type>
      <size>30</size>
      <generator name="FormattedNumberGenerator">
        <generator name="LongGenerator">
          <min>10010001</min>
          <max>9999999999</max>
        </generator>
        <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
      </generator>
    </field>
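One plausible reading of the format string is that each %d consumes one digit of the generated number. The Python sketch below implements that reading; `format_digits` is a hypothetical helper, and both the one-digit-per-%d rule and the zero padding of short numbers are assumptions, since the slide does not say how FormattedNumberGenerator treats numbers with fewer digits than slots.

```python
def format_digits(number: int, pattern: str) -> str:
    """Fill each '%d' in `pattern` with one digit of `number`, left to right."""
    slots = pattern.count("%d")
    digits = str(number).zfill(slots)  # assumption: short numbers get leading zeros
    if len(digits) > slots:
        raise ValueError("number has more digits than the pattern has slots")
    parts = pattern.split("%d")  # literal text before, between, and after the slots
    return parts[0] + "".join(d + literal for d, literal in zip(digits, parts[1:]))


PHONE = "(%d%d%d) %d%d%d-%d%d%d%d"
print(format_digits(4165551234, PHONE))  # (416) 555-1234
print(format_digits(10010001, PHONE))    # (001) 001-0001
```

Changing the representation then only means changing the pattern string, which is the flexibility the slide contrasts with a dedicated PhoneNumberGenerator.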
Flow Control Example • More elaborate name field • Name male or female • 50% chance • All upper case • Padded to 100 characters • Sequential generator • Probability generator • DictList generator • UpperLowerCase generator • Padding generator
    <field name="name">
      <type>java.sql.types.VARCHAR</type>
      <size>100</size>
      <generator name="SequentialGenerator">
        <generator name="ProbabilityGenerator">
          <probability value="0.5">
            <generator name="DictList">
              <file>dicts/female.dict</file>
            </generator>
          </probability>
          <probability value="0.5">
            <generator name="DictList">
              <file>dicts/male.dict</file>
            </generator>
          </probability>
        </generator>
        <generator name="UpperLowerCaseGenerator">
          <mode>uppercase</mode>
        </generator>
        <generator name="PaddingGenerator">
          <character> </character>
          <padToLeft>true</padToLeft>
        </generator>
      </generator>
    </field>
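The composition in this example can be mimicked with small Python closures. This is a sketch under assumptions: it reads the slide's SequentialGenerator as applying its children in order to a shared value, uses in-memory word lists in place of the .dict files, and takes the padding size of 100 from the field size. All function names and the sample names are hypothetical.

```python
import random


def dict_list(entries):
    """Pick a random entry (stands in for a random line from a dictionary file)."""
    return lambda rng, value=None: rng.choice(entries)


def probability(branches):
    """Choose one sub-generator according to (chance, generator) pairs."""
    def generate(rng, value=None):
        roll = rng.random()
        cumulative = 0.0
        for chance, generator in branches:
            cumulative += chance
            if roll < cumulative:
                return generator(rng, value)
        return branches[-1][1](rng, value)  # guard against rounding error
    return generate


def upper_case():
    return lambda rng, value: value.upper()


def padding(size, character=" ", pad_to_left=True):
    def generate(rng, value):
        return value.rjust(size, character) if pad_to_left else value.ljust(size, character)
    return generate


def sequential(steps):
    """Run the steps in order, feeding each step the previous step's value."""
    def generate(rng, value=None):
        for step in steps:
            value = step(rng, value)
        return value
    return generate


FEMALE = ["Alice", "Maria"]  # sample data, not from the original dictionaries
MALE = ["Bob", "Carlos"]

name_field = sequential([
    probability([(0.5, dict_list(FEMALE)), (0.5, dict_list(MALE))]),
    upper_case(),
    padding(100),
])
value = name_field(random.Random(7))
```

Each piece is reusable on its own, so a new field variant is a new combination rather than a new generator class.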
Core Performance • Test environment: single-core laptop, no I/O • Base time for framework ~ 55 ns (Base Time) • Seeding, method invocation, setting a value • Computation time for generator 50+ ns (Gen Time) • Cache update if referenced ~ 50 ns (Cache Update) • Cache lookup if intra-row reference ~ 50 ns (Cache Lookup) • Sub-generator invocation ~ 50 ns
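The figures above suggest a simple back-of-the-envelope cost model. The Python sketch below adds them up per generated value; treating the costs as purely additive is an assumption made for illustration, and `estimate_ns` is a hypothetical helper. Since generator time is given as "50+ ns", the results are lower bounds, not measurements.

```python
BASE_NS = 55          # framework base time: seeding, method invocation, setting a value
SUB_GEN_NS = 50       # invoking a nested sub-generator
CACHE_UPDATE_NS = 50  # paid if the field is referenced by another field
CACHE_LOOKUP_NS = 50  # paid for an intra-row reference


def estimate_ns(generator_ns, referenced=False, intra_row_lookup=False):
    """Estimate the time per value for a chain of nested generators.

    `generator_ns` lists the computation time of each generator in the chain,
    outermost first; every generator after the first costs one sub-generator call.
    """
    total = BASE_NS + sum(generator_ns)
    total += SUB_GEN_NS * (len(generator_ns) - 1)
    if referenced:
        total += CACHE_UPDATE_NS
    if intra_row_lookup:
        total += CACHE_LOOKUP_NS
    return total


plain = estimate_ns([50])        # 105 ns: a single minimal base generator
wrapped = estimate_ns([50, 50])  # 205 ns: e.g. a meta generator around a base generator
values_per_second = 1e9 / wrapped
```

Under this model, wrapping a base generator in one meta generator adds on the order of 100 ns per value.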
Performance Basic Generators • Basic generators without formatting • 120ns – 510ns
Performance Formatted Values • Basic Generators with formatting • Usually > 1000ns
Performance Meta Generators • Meta generator overhead: • Base overhead ~ 50 ns • Generator overhead starts from 50 ns • Sub generator invocation ~ 50ns • Often negligible due to lazy formatting
Use Cases • TPC-H / SSB • 8 tables, 61 columns (first non-trivial example) • Without meta FVGs (field value generators): 26 custom FVGs • 2h editing: 10 custom FVGs • 1 day reimplementation: 0 custom FVGs, i.e. no coding • SSB variations • Skews on dimension attributes, fact measures, references • TPC-DI (in progress) • 20 tables, 200 columns • 19 custom FVGs (mainly for performance in corner cases) • 56x NullGenerator • 32x ProbabilityGenerator • 3000 lines of config (XML import for multiple files)
Conclusion & Future Work • Meta generators • Improve usability and expressiveness • Speed up schema definition • Remove necessity for coding • Enlarged configuration files • Used in TPC benchmark(s) • Performance overhead is small, often negligible • Future work • GUI and SQL export • SQL import and data extraction
Thanks • Questions? • Contact: tilmann.rabl@utoronto.ca • Download and try PDGF: • http://www.paralleldatageneration.org • Some big data info in our BigBench presentation • Tuesday, 4pm, Industry 3