Data Generation for Application-Specific Benchmarking

Data Generation for Application-Specific Benchmarking. Y.C. Tay, National University of Singapore. Background: benchmarks help research and development; the dominant database benchmark is TPC. SIGMOD Conference 2011 research track: 87 papers, 17 use TPC (20%).


Presentation Transcript


  1. Data Generation for Application-Specific Benchmarking. Y.C. Tay, National University of Singapore.

  2. Background: benchmarks help research and development, and the dominant database benchmark is TPC. SIGMOD Conference 2011 research track: 87 papers, 17 use TPC (20%); industry track: 14 papers, 6 use TPC (43%). Problem: there are only a few TPC benchmarks but many, many applications. Is TPC becoming irrelevant?

  3. Vision: a paradigm shift in database benchmark development, from top-down committee consensus on a domain-specific package (data generator + queries) to bottom-up community collaboration on application-specific tools (dataset scaling). The tools synthetically scale application data up or down; the application already has its queries.

  4. Challenge: Dataset Scaling Problem. Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size. E.g. what would DBLP look like in 2020?
     • s > 1. Why: scalability testing. Difficulty: copying doesn’t work (e.g. social network data).
     • s < 1. Why: application testing. Difficulty: sampling is not straightforward (similar to web crawling).
     • s = 1. Why: privacy/proprietary reasons. Difficulty: encryption is risky.
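One way to see why sampling is "not straightforward" for s < 1 is to sample each table independently and then check the foreign keys. The sketch below is only an illustration on hypothetical toy data: the `photos` and `comments` tables, their column names, and the helper functions are all assumptions made for this example and are not taken from the talk or from UpSizeR.

```python
import random

def sample_tables(photos, comments, s, seed=0):
    """Naively scale down by sampling each table independently at rate s."""
    rng = random.Random(seed)
    sampled_photos = rng.sample(photos, round(len(photos) * s))
    sampled_comments = rng.sample(comments, round(len(comments) * s))
    return sampled_photos, sampled_comments

def dangling_comments(photos, comments):
    """Return comments whose photo_id matches no row in photos."""
    photo_ids = {p["photo_id"] for p in photos}
    return [c for c in comments if c["photo_id"] not in photo_ids]

# Hypothetical toy data: 100 photos, 3 comments per photo.
photos = [{"photo_id": i, "owner": i % 10} for i in range(100)]
comments = [
    {"comment_id": 3 * i + j, "photo_id": i, "author": (i + j) % 10}
    for i in range(100)
    for j in range(3)
]

small_photos, small_comments = sample_tables(photos, comments, s=0.1)
print("photos:", len(small_photos), "comments:", len(small_comments))
print("dangling comments:", len(dangling_comments(small_photos, small_comments)))
```

Each sampled table has the right size, but most sampled comments point at photos that were not sampled, so the result is not a valid database state. Any workable sampler has to follow the foreign keys, which is where the resemblance to web crawling comes from.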

  5. Challenge: Dataset Scaling Problem. Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D (as measured by query results) but s times its size. Difficulty: data correlation. E.g. database = {photos, owners, comments, tags}:
     • inter-column correlation: foreign keys; age and gender; a user is likely to comment on their own photos; a gardener is likely to tag photos of flowers
     • inter-row correlation: photo dimensions (same camera); tags used by a gardener (“rose”, “bee”, “beetle”)
     • inter-column + inter-row: 2 users comment on each other’s photos (social network)
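To make "similar by query results" concrete, the sketch below runs one query, the fraction of comments that users leave on their own photos, against a toy dataset and against a synthetic copy in which the author column keeps exactly the same values but is re-paired with the rows. The data, the query, and the re-pairing step (a simple reversal, chosen so the example is deterministic) are assumptions for illustration only.

```python
def own_photo_comment_rate(photos, comments):
    """Query: fraction of comments written by the owner of the commented photo."""
    owner_of = {p["photo_id"]: p["owner"] for p in photos}
    own = sum(1 for c in comments if owner_of[c["photo_id"]] == c["author"])
    return own / len(comments)

def repair_authors(comments):
    """Keep the multiset of author values but attach them to rows in reverse
    order, i.e. generate the column without regard to the other columns."""
    authors = [c["author"] for c in comments][::-1]
    return [dict(c, author=a) for c, a in zip(comments, authors)]

# Hypothetical toy data: 20 photos owned by 5 users; each photo gets one
# comment from its owner and one from another user.
photos = [{"photo_id": i, "owner": i % 5} for i in range(20)]
comments = []
for p in photos:
    comments.append({"photo_id": p["photo_id"], "author": p["owner"]})
    comments.append({"photo_id": p["photo_id"], "author": (p["owner"] + 1) % 5})

synthetic_comments = repair_authors(comments)
original_rate = own_photo_comment_rate(photos, comments)
synthetic_rate = own_photo_comment_rate(photos, synthetic_comments)
print("original:", original_rate, "synthetic:", synthetic_rate)
```

The synthetic table has the same size and the same per-column value distribution, yet the query answer changes because the inter-column correlation (a user is likely to comment on their own photos) was not preserved.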

  6. Challenge: scaling a social network. Pipeline: empirical dataset D → (extract) → empirical social graph G → (scale by s) → synthetic social graph G~ → (inject) → synthetic dataset D~.
     • extract: use a join query
     • scale by s: use graph theory (#edges? #triangles? path lengths?)
     • inject: any database theory? E.g. how to inject into D~ the correlation from G~ indicating X and Y comment on each other’s photos; the correlation between Alice’s birthday and wall posts by her classmates; the correlation among tags used by bird watchers
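The "extract" step and the graph statistics named on this slide can be sketched on toy data. Below, an edge between X and Y is taken to mean that X and Y have commented on each other's photos, obtained by joining comments with photos on the photo id. The edge definition, the table layout, and the function names are assumptions made for this example; the talk does not specify them.

```python
from itertools import combinations

def extract_graph(photos, comments):
    """Join comments with photos and keep reciprocated commenter/owner pairs."""
    owner_of = {p["photo_id"]: p["owner"] for p in photos}
    # Directed pairs (commenter, photo owner), ignoring self-comments.
    commented_on = {
        (c["author"], owner_of[c["photo_id"]])
        for c in comments
        if c["author"] != owner_of[c["photo_id"]]
    }
    # Undirected edge only when both directions are present.
    return {frozenset((x, y)) for (x, y) in commented_on if (y, x) in commented_on}

def count_triangles(edges):
    """Count unordered triples of nodes that are pairwise connected."""
    nodes = sorted({u for e in edges for u in e})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if frozenset((a, b)) in edges
        and frozenset((b, c)) in edges
        and frozenset((a, c)) in edges
    )

# Hypothetical toy data: four users, one photo each.
photos = [
    {"photo_id": 1, "owner": "alice"},
    {"photo_id": 2, "owner": "bob"},
    {"photo_id": 3, "owner": "carol"},
    {"photo_id": 4, "owner": "dave"},
]
comments = [
    {"photo_id": 1, "author": "bob"},
    {"photo_id": 2, "author": "alice"},
    {"photo_id": 1, "author": "carol"},
    {"photo_id": 3, "author": "alice"},
    {"photo_id": 3, "author": "bob"},
    {"photo_id": 2, "author": "carol"},
    {"photo_id": 1, "author": "dave"},  # not reciprocated, so no edge
]

edges = extract_graph(photos, comments)
print("#edges:", len(edges), "#triangles:", count_triangles(edges))
```

Extraction and measurement are the easy half. The open questions on the slide are how to scale G while keeping such statistics plausible, and how to inject the scaled graph back into relational tables.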

  7. Challenge: online social networks are here to stay; their datasets can be huge; their datasets have commercial value. Where is the database theory? Attribute Value Correlation Problem for Social Networks: Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?

  8. Vision (for the next 25 years): a paradigm shift from a top-down design of domain-specific benchmarks by committee consensus to a bottom-up collaborative development of tools for application-specific dataset scaling.
     Challenges:
     • Dataset Scaling Problem
     • Attribute Value Correlation Problem for Social Networks
     Payoff:
     • commercial value in dataset scaling tools
     • new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
     Start: UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
     • single-server version
     • Hadoop version
