SwissBox
G. Alonso, D. Kossmann, T. Roscoe
Systems Group, ETH Zurich
http://systems.ethz.ch
Agenda
• What are we building?
• Why are we building it?
What is SwissBox?
[Forrest Gump, Hollywood 1994]
Inside SwissBox (Hardware)
• N CPU cores (N = 100, 1000)
• X GB of main memory (X = 10 × N)
  • NUMA
  • dedicated main memory for each core
• Network
  • heterogeneous (complex)
• FPGAs
• Some persistent storage
  • disks or flash (maybe obsolete in the future with PCM)
• Think of a (commodity) rack or a multi-core machine
Shared-Disk Architecture
Client → (HTTP: XML, JSON, HTML) → Web Server → (FCGI, ...: XML, JSON, HTML) → App Server → (SQL: records) → DB Server → (get/put: block) → Storage
Shared-Disk Architecture [Brantner et al. 2008]
• Classic stack: Client → (HTTP: XML, JSON, HTML) → Web Server → (FCGI, ...: XML, JSON, HTML) → App Server → (SQL: records) → DB Server → (get/put: block) → Storage
• Shared-disk stack: Clients → Workload Splitter → (XML, JSON, HTML) → DB+App nodes → (predicates, light aggregation) → Distributed Storage, i.e. stores (e.g., S3)
ClockScan [Unterbrunner et al. 2009]
• Input: queries + updates
• ClockScan scans the records of a data partition
• Output: results as {record, {query-ids}}
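The data flow on this slide can be sketched roughly as follows: a batch of queries is indexed by predicate, and a single pass over the partition probes every record against that query index, emitting {record, {query-ids}} pairs. This is only an illustration under strong assumptions: queries are single equality predicates, updates are left out entirely, and the names `build_query_index` and `shared_scan` are hypothetical, not taken from the ClockScan implementation.

```python
from collections import defaultdict

def build_query_index(queries):
    """Index equality predicates: (attribute, value) -> set of query ids."""
    index = defaultdict(set)
    for query_id, (attribute, value) in queries.items():
        index[(attribute, value)].add(query_id)
    return index

def shared_scan(partition, queries):
    """One pass over the partition; yields (record, matching query ids)."""
    index = build_query_index(queries)
    attributes = {attribute for attribute, _ in index}
    for record in partition:
        matching = set()
        for attribute in attributes:
            # Probe the query index with the record's value (queries are the indexed side).
            matching |= index.get((attribute, record.get(attribute)), set())
        if matching:
            yield record, matching

# Hypothetical sample data, loosely modelled on a booking table.
partition = [
    {"flight_no": "LX318", "name": "Meier"},
    {"flight_no": "LH400", "name": "Keller"},
]
queries = {1: ("flight_no", "LX318"), 2: ("name", "Keller"), 3: ("flight_no", "LX318")}
results = list(shared_scan(partition, queries))
```

The cost of the scan is paid once per batch rather than once per query, which is why the per-record work is a probe into the query index instead of a probe into a data index.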
SharedDB: Joins [Giannikis et al. 2011]
• Massively shared joins
  • same join predicate
  • different table predicates
  • (reassemble BO)
• Same idea as ClockScan
  • "shared join scan"
  • additional join predicate on "query"
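One way to read the "additional join predicate on query" idea is sketched below: each row carries the set of query ids whose table predicate it satisfies, and the join emits a pair only if the join keys match and the two query-id sets intersect. The helper names `annotate` and `shared_join`, the hash-join strategy, and the sample tables are assumptions for illustration, not the SharedDB operators themselves.

```python
from collections import defaultdict

def annotate(rows, predicates):
    """Tag each row with the ids of the queries whose table predicate it passes."""
    tagged = []
    for row in rows:
        query_ids = {qid for qid, pred in predicates.items() if pred(row)}
        if query_ids:
            tagged.append((row, query_ids))
    return tagged

def shared_join(left, right, key):
    """Hash join on `key`; the query-id sets act as an extra join predicate."""
    buckets = defaultdict(list)
    for row, query_ids in left:
        buckets[row[key]].append((row, query_ids))
    for r_row, r_ids in right:
        for l_row, l_ids in buckets.get(r_row[key], []):
            common = l_ids & r_ids
            if common:  # at least one query wants this pair
                yield l_row, r_row, common

# Two hypothetical queries share the join predicate (cust) but filter differently.
orders = [{"cust": 1, "total": 50}, {"cust": 2, "total": 500}]
customers = [{"cust": 1, "country": "CH"}, {"cust": 2, "country": "DE"}]
order_preds = {"q1": lambda r: r["total"] > 100, "q2": lambda r: True}
customer_preds = {"q1": lambda r: True, "q2": lambda r: r["country"] == "CH"}

joined = list(shared_join(annotate(orders, order_preds),
                          annotate(customers, customer_preds), "cust"))
```

Each output triple names the queries it belongs to, so a final routing step can hand every joined row to exactly the queries that asked for it.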
SwissBox Building Blocks
• Barrelfish multi-kernel operating system
  • CPU driver for each core (Barrelfish)
  • Message passing (no shared memory!)
  • Designed for heterogeneous HW (e.g., NUMA)
• ClockScan
  • Storage layer serves simple predicates + aggregates
  • Snapshot isolation within one partition
• E-Cast protocol
  • Paxos + consistent hashing
  • elasticity (online repartitioning), SI across partitions
• SharedDB operators
  • massively shared joins, sorts, group-bys, ...
  • custom processing (if sharing is not worth it)
• FPGAs
  • some special algorithms for in-network filtering / processing
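The slide names consistent hashing as one ingredient of E-Cast without giving details. The sketch below shows only the generic technique, to illustrate why it suits online repartitioning: when a node joins, only the keys that now map to the new node move. The class `HashRing`, the use of MD5, and the node names are assumptions; the Paxos half of the protocol and E-Cast's actual design are not represented here.

```python
import bisect
import hashlib

def _position(key):
    """Map a string to a point on the ring (hypothetical choice: MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        point = _position(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        if not self._points:
            raise LookupError("ring is empty")
        i = bisect.bisect_right(self._points, _position(key))
        return self._owner[self._points[i % len(self._points)]]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"record-{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.add("node-d")  # elastic scale-out
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-d`; all other keys stay where they were, which keeps the amount of data shipped during repartitioning proportional to the new node's share.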
Summary: Design Ideas
• SwissBox is an appliance
  • enables optimization across layers
• Exploit data / query duality
  • index queries rather than data
  • optimize with knowledge of queries and data
• Radically simplified data flow architecture
  • No indexes, one query plan for a particular workload
• Merge DB and application server layers
  • Save cost and improve predictability
• Shape the workload
  • Force (almost) all operations into simple access patterns (scan)
• Shared-disk architecture
  • Great for elasticity, fault tolerance (previous work on cloud)
  • Make use of capabilities of the "storage layer"
  • Great for "inter-query" parallelism (not good for "intra-query" parallelism)
Agenda
• What are we building?
• Why are we building it?
Why are we doing this?
• Because we can ...
  • ... the proof is in the pudding
• Interesting research artefact
  • re-address OS/DB co-design
  • study "battle of the bottlenecks"
• Hardware trends
  • Hardware changes faster than systems software
  • NUMA, main memory, heterogeneity
• Challenging workloads and requirements
  • Predictable performance, data freshness guarantees
Amadeus Workload
• Passenger-booking database
  • ~ 600 GB of raw data (two years of bookings)
  • single table, denormalized
  • ~ 50 attributes: flight-no, name, date, ..., many flags
• Query workload
  • up to 4000 queries / second
  • latency guarantees: 2 seconds
  • today: only pre-canned queries allowed
• Update workload
  • avg. 600 updates per second (1 update per GB per sec)
  • peak of 12000 updates per second
  • data freshness guarantee: 2 seconds
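A back-of-envelope reading of these numbers is sketched below. The inputs come from the slide; the last figure applies Little's law (in-flight = arrival rate × time in system) under the assumption that every query uses the full 2-second latency budget, so it is an upper bound rather than a measured value. Variable names are illustrative.

```python
raw_data_gb = 600
avg_updates_per_s = 600
peak_updates_per_s = 12_000
queries_per_s = 4_000
latency_bound_s = 2

# Average update density, matching the slide's "1 update per GB per sec".
updates_per_gb_per_s = avg_updates_per_s / raw_data_gb

# How far the peak update rate exceeds the average.
peak_to_avg = peak_updates_per_s / avg_updates_per_s

# Little's law upper bound on concurrently active queries (assumes each takes 2 s).
max_queries_in_flight = queries_per_s * latency_bound_s
```

Under these assumptions the system must absorb a 20x update burst while keeping up to 8000 queries active at once, which is the kind of load a shared scan over a batch of queries is meant to amortize.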
Other Workloads
• Logging service (Amadeus, Credit Suisse)
  • Log entries from multiple apps and middleware
  • Maintenance of coarse-grained indexes (sessionId, ...)
  • Distributed debugging, support, auditing
  • Index look-ups + large scans
• Twitter Times (http://www.twittertim.es)
  • Streams of events / microblog posts (700 / sec)
  • Maintain simple statistics incrementally (word counts)
  • Compile a personalized newspaper of posts
• TPC-W style (Credit Suisse, SAP)
  • Complex queries + updates
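The "maintain simple statistics incrementally" item for the microblog stream can be illustrated with a running word count that is updated once per arriving post instead of being recomputed over the whole stream. The class `WordStats` and its tokenization rule are hypothetical and are not taken from the Twitter Times system.

```python
from collections import Counter
import re

class WordStats:
    """Running word counts over a stream of posts."""

    def __init__(self):
        self.counts = Counter()
        self.posts_seen = 0

    def add_post(self, text):
        # Fold one post into the running statistics; earlier posts are never revisited.
        self.counts.update(re.findall(r"[a-z0-9']+", text.lower()))
        self.posts_seen += 1

    def top(self, n):
        return self.counts.most_common(n)

stats = WordStats()
stats.add_post("ETH Zurich builds SwissBox")
stats.add_post("SwissBox scans, SwissBox joins")
```

Each post costs work proportional to its own length, which is what makes a rate of several hundred posts per second manageable.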
Related Work
• Appliances
  • SAP Trex, Netezza, Oracle Exadata, ...
• New data processing architectures
  • All the previous papers of this session
  • IBM Blink, MonetDB X100, AsterData, ...
  • Eddies, data/query dualism, StageDB, QPipes, ...
• Nothing we do is really new
Conclusion
• Consensus on starting point
  • Great workloads, new app requirements (predictability, elasticity, ...)
  • Technology moving faster than ever (MM, multi-core, heterogeneity, cloud, ...)
  • Building blocks that feel right (ClockScan, multi-kernel, ...)
• No consensus (yet) on putting it together
  • How to compose predictability and elasticity?
• "The journey is the destination"