Data Generation for Application-Specific Benchmarking
Y.C. Tay
National University of Singapore
Background
Benchmarks help research and development; the dominant database benchmark is TPC.
SIGMOD Conference 2011
• research track: 87 papers, 17 use TPC (20%)
• industry track: 14 papers, 6 use TPC (43%)
Problem: a few TPC benchmarks but many, many applications. Is TPC becoming irrelevant?
Vision
A paradigm shift in database benchmark development
from: top-down, committee consensus, domain-specific package (data generator + queries)
to: bottom-up, community collaboration, application-specific tools (dataset scaling)
• synthetically scale up/down application data
• application already has queries
Challenge
Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
E.g. What would DBLP look like in 2020?
s > 1
    why: scalability testing
    difficulty: copying doesn’t work (e.g. social network data)
s < 1
    why: application testing
    difficulty: sampling not straightforward (similar to web crawling)
s = 1
    why: privacy/proprietary reasons
    difficulty: encryption is risky
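The slides do not spell out why copying fails for social network data. One plausible reading, sketched below under that assumption, is that s relabelled copies of a friendship table give s disjoint communities rather than one larger network. The `friends` table and both helper functions are hypothetical illustrations, not part of UpSizeR.

```python
from collections import defaultdict

def count_components(edges):
    """Count connected components of an undirected graph given as edge pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first walk over one component
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components

def scale_by_copying(edges, s):
    """Naive scale-up: s relabelled copies of every edge (s a positive integer)."""
    return [((copy, u), (copy, v)) for copy in range(s) for u, v in edges]

# Hypothetical friendship table: one connected community of four users.
friends = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "ann")]
scaled = scale_by_copying(friends, 3)

print(len(friends), count_components(friends))  # 4 edges, 1 component
print(len(scaled), count_components(scaled))    # 12 edges, 3 components
```

The row count scales by s as required, but the scaled graph has three islands where a real network grown threefold would more likely stay connected, so queries that follow friendships would behave differently on it.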
Challenge
Dataset Scaling Problem: Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
similar: by query results
difficulty: data correlation
E.g. database = {photos, owners, comments, tags}
• inter-column correlation
    • foreign keys
    • age and gender
    • user likely to comment on own photos
    • gardener likely to tag photos of flowers
• inter-row correlation
    • photo dimensions (same camera)
    • tags used by gardener (“rose”, “bee”, “beetle”)
• inter-column + inter-row
    • 2 users comment on each other’s photos (social network)
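To make "similar by query results" concrete, here is a minimal sketch that runs the same join query on a tiny dataset and on a candidate replacement that keeps every column's value frequencies but ignores the "user likely to comment on own photos" correlation. The two-table schema, the sample rows, and the `self_comment_rate` query are assumptions made for illustration only.

```python
import sqlite3

def build(photos, comments):
    """Load hypothetical photos(photo_id, owner) and comments(photo_id, commenter) tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner TEXT)")
    db.execute("CREATE TABLE comments (photo_id INTEGER, commenter TEXT)")
    db.executemany("INSERT INTO photos VALUES (?, ?)", photos)
    db.executemany("INSERT INTO comments VALUES (?, ?)", comments)
    return db

# Fraction of comments written by the owner of the commented photo.
SELF_COMMENT_RATE = """
    SELECT AVG(CASE WHEN c.commenter = p.owner THEN 1.0 ELSE 0.0 END)
    FROM comments c JOIN photos p ON c.photo_id = p.photo_id
"""

def self_comment_rate(db):
    return db.execute(SELF_COMMENT_RATE).fetchone()[0]

photos = [(1, "ann"), (2, "bob")]

# Empirical data: users mostly comment on their own photos.
comments = [(1, "ann"), (1, "bob"), (2, "bob"), (2, "bob")]

# Candidate D': same row count and the same values in each column,
# but commenters were assigned without regard to photo ownership.
decorrelated = [(1, "bob"), (1, "bob"), (2, "ann"), (2, "bob")]

rate_original = self_comment_rate(build(photos, comments))
rate_candidate = self_comment_rate(build(photos, decorrelated))
print(rate_original, rate_candidate)  # 0.75 0.25
```

Both datasets look identical column by column, yet the query result differs by a factor of three, which is the sense in which data correlation makes similarity hard to preserve.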
Challenge
Scaling a social network:
empirical dataset D → (extract: use join query) → empirical social graph G → (scale by s: use graph theory; #edges? #triangles? path lengths?) → synthetic social graph G~ → (inject: any database theory?) → synthetic dataset D~
E.g. how to inject into D~
* correlation from G~ indicating X and Y comment on each other’s photos
* correlation between Alice’s birthday and wall posts by her classmates
* correlation among tags used by bird watchers
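The extract step and the graph statistics it feeds can be sketched as follows. This assumes a hypothetical schema photos(photo_id, owner) and comments(photo_id, commenter), and assumes an edge means two users have commented on each other's photos; the slides name the join query and the statistics but give neither a schema nor a definition of an edge.

```python
import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner TEXT)")
db.execute("CREATE TABLE comments (photo_id INTEGER, commenter TEXT)")
db.executemany("INSERT INTO photos VALUES (?, ?)",
               [(1, "ann"), (2, "bob"), (3, "cat"), (4, "dan")])
db.executemany("INSERT INTO comments VALUES (?, ?)",
               [(2, "ann"), (1, "bob"),   # ann and bob comment on each other
                (3, "bob"), (2, "cat"),   # bob and cat comment on each other
                (1, "cat"), (3, "ann"),   # cat and ann comment on each other
                (1, "dan")])              # one-way only: no edge for dan

# Join query: x commented on a photo owned by y, and y on a photo owned by x.
# The x < y filter keeps one row per undirected pair and drops self-comments.
MUTUAL_COMMENTERS = """
    SELECT DISTINCT c1.commenter AS x, p1.owner AS y
    FROM comments c1
    JOIN photos p1 ON c1.photo_id = p1.photo_id
    JOIN comments c2 ON c2.commenter = p1.owner
    JOIN photos p2 ON c2.photo_id = p2.photo_id AND p2.owner = c1.commenter
    WHERE c1.commenter < p1.owner
    ORDER BY x, y
"""
edges = db.execute(MUTUAL_COMMENTERS).fetchall()

def count_triangles(edges):
    """Count triangles in an undirected graph with distinct edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Each triangle is found once from each of its three edges.
    return sum(len(adjacency[u] & adjacency[v]) for u, v in edges) // 3

print(edges)                   # [('ann', 'bob'), ('ann', 'cat'), ('bob', 'cat')]
print(len(edges))              # 3
print(count_triangles(edges))  # 1
```

Statistics like these are what the scale step would need to preserve or grow in the synthetic graph; how to write the result back into relational tables is the inject step, which the slides leave as an open question.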
Challenge
* online social networks are here to stay
* their datasets can be huge
* their datasets have commercial value
Where is the database theory?
Attribute Value Correlation Problem for Social Networks: Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?
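The slides pose this as an open problem and offer no method. As one simple way to probe it empirically, the sketch below compares tag overlap (Jaccard similarity) between pairs of users who interact and pairs who do not. The users, tags, interaction pairs, and the choice of Jaccard similarity are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical attribute values: the set of tags each user has applied.
tags = {
    "ann": {"rose", "bee"},
    "bob": {"rose", "beetle"},
    "cat": {"heron", "owl"},
    "dan": {"owl", "finch"},
}

# Hypothetical social interactions: pairs who comment on each other's photos.
interacting = {frozenset(("ann", "bob")), frozenset(("cat", "dan"))}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean(values):
    return sum(values) / len(values) if values else 0.0

linked, unlinked = [], []
for u, v in combinations(sorted(tags), 2):
    similarity = jaccard(tags[u], tags[v])
    if frozenset((u, v)) in interacting:
        linked.append(similarity)
    else:
        unlinked.append(similarity)

mean_linked = mean(linked)
mean_unlinked = mean(unlinked)
print(mean_linked > mean_unlinked)  # True for this toy data
```

A gap between the two averages is the kind of interaction-driven correlation that a scaled dataset would have to reproduce; a theory that predicts its size from the social graph is what the problem statement asks for.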
Vision (for the next 25 years):
a paradigm shift from a top-down design of domain-specific benchmarks by committee consensus to a bottom-up collaborative development of tools for application-specific dataset scaling
Challenges:
• Dataset Scaling Problem
• Attribute Value Correlation Problem for Social Networks
Payoff:
• commercial value in dataset scaling tools
• new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
Start: UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
• single-server version
• Hadoop version