Resiliency-Aware Data Management
Matthias Boehm (1), Wolfgang Lehner (1), Christof Fetzer (2)
TU Dresden, (1) Database Technology Group, (2) Systems Engineering Group
August 30, 2011
Motivation: Increasing Error Rates
• Increasing component error rates
  • Decreasing feature sizes (new tech generations)
  • Reduced voltage supply
  • Static (hard) vs. dynamic (soft) errors
  • 8% increase in error rate per tech generation [Borkar05]
  • 25,000 – 70,000 FIT / Mbit [Schroeder09]
• Increasing system error rates
  • Increasing scale
  • # of components (core, transistor)
  • Memory capacities
• Example: fixed error rate per component
[Figure: cosmic radiation (95% neutrons) hitting memory and CPU components; each component fails with P = 0.01, and at least one component fails with P = 0.039]
Errors and error-prone behavior will become the normal case.
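The numbers in the example can be reproduced with a few lines of Python. This is a minimal sketch that assumes component failures are independent; the function name `p_any_failure` is illustrative, and the count of four components is an assumption chosen because it matches the 0.039 on the slide.

```python
def p_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_component) ** n_components


if __name__ == "__main__":
    # Fixed per-component error rate of 0.01, as in the slide's example.
    # Four components (an assumption) give 1 - 0.99**4, which rounds to 0.039.
    print(round(p_any_failure(0.01, 4), 3))  # prints 0.039

    # With the per-component rate held fixed, the system-level rate
    # grows with the number of components.
    for n in (1, 4, 16, 64, 256):
        print(n, round(p_any_failure(0.01, n), 3))
```

Even with a constant per-component error rate, the system error rate approaches 1 as the component count grows, which is the scaling argument the slide makes.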
Motivation: Resiliency Costs
• Implicit (silent) vs. explicit (detected/corrected) errors
• State of the art: error detection and correction at HW/OS level
• State of the art: resilient memory
  • ECC / parity bits / memory scrubbing / full data redundancy
• State of the art: resilient computing
  • Computation redundancy
[Figure: ECC with extended Hamming codes (7+1,4), (16,11), (32,26), (64,57); Double Modular Redundancy (DMR): Task A and Task A' are compared (=?); Triple Modular Redundancy (TMR): Task A, Task A', and Task A'' feed a voting step]
Such resiliency mechanisms cause "resiliency costs".
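To make the cost of computation redundancy concrete, DMR and TMR can be sketched in software as follows. This is an illustrative sketch rather than the mechanism used by any particular HW/OS stack; `run_dmr`, `run_tmr`, `DetectedError`, and the fault injector `make_flaky` are hypothetical names, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter
from typing import Callable, Hashable


class DetectedError(Exception):
    """Raised when redundant executions disagree and cannot be reconciled."""


def run_dmr(task: Callable[[], Hashable]) -> Hashable:
    """DMR: execute twice and compare; detects an error but cannot correct it."""
    first, second = task(), task()
    if first != second:
        raise DetectedError("DMR: results differ")
    return first


def run_tmr(task: Callable[[], Hashable]) -> Hashable:
    """TMR: execute three times and vote; masks a single faulty result."""
    results = [task(), task(), task()]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise DetectedError("TMR: no majority")
    return value


def make_flaky(correct: int, wrong: int, faulty_call: int) -> Callable[[], int]:
    """Hypothetical fault injector: the n-th call returns a wrong value."""
    calls = {"n": 0}

    def task() -> int:
        calls["n"] += 1
        return wrong if calls["n"] == faulty_call else correct

    return task


if __name__ == "__main__":
    print(run_tmr(make_flaky(42, 41, faulty_call=2)))  # prints 42
    try:
        run_dmr(make_flaky(42, 41, faulty_call=2))
    except DetectedError as err:
        print("detected:", err)
```

The sketch shows the trade-off directly: DMR doubles the work and only detects a mismatch, while TMR triples the work and can also correct a single wrong result.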
Motivation: Resiliency Costs (2)
• Resiliency cost categories
  • Performance overhead (throughput, latency)
  • Memory overhead
  • Energy consumption
  • Monetary HW costs
• Resiliency costs @ OS level
  • Memory overhead (capacity, bandwidth)
  • Computation overhead
  • Energy consumption (increased time)
• Resiliency costs @ HW level
  • Monetary HW costs (chipset, ECC RAM)
  • Energy consumption (time, chip space)
  • Computation overhead
[Figure: system stack of Data Management, OS / Middleware, and HW Infrastructure; the HW layer shows CPU 0 1 2 3, L3, ECC memory control, and memory with ECC RAM]
Increasing error rates ~ increasing resiliency costs!
Vision of Resiliency-Aware Data Management
Vision Overview
• Problem of state of the art
  • Resiliency-awareness at HW/OS level (general-purpose)
  • Increasing error rates
  • Increasing resiliency costs
• Key observation
  • Different resiliency requirements
  • Data management context knowledge
• Resiliency-aware data management
  • Exploit context knowledge of query processing and data storage
  • Efficiency (reduced resiliency costs)
  • Effectiveness (detection/correction)
[Figure: workloads from nice-to-have analytics to mission-critical queries (Qi, Ui, input streams) on top of Data Management (Data System, Access System, Storage System), OS / Middleware, and HW Infrastructure; labels: HW/OS primitives, configuration]
Resilient Database Challenges
• C1: Resilient Query Processing
• C2: Resilient Data Storage
• C3: Resiliency-Aware Optimization
C1: Resilient Query Processing
• Challenge
  • Problem: missing/invalid tuples (explicit/implicit)
  • Goal: reliable query results by error correction / error-tolerant algorithms
• Example (advanced analytics)
  • Q: Ψ_{k=365}( γ( σ_{a<107} R ⋈ S ⋈ T ⋈ U ) )
  • Computation redundancy
[Figure: two copies of the plan for Q with a "Check" step; labels: Guard Plan, Plan Scheduling, Operator Semantics, Intermediate Results]
C1: Resilient Query Processing (2)
• Example (advanced analytics, cont.)
  • AR(2), MSE, L-BFGS-B, C40 Energy Demand
  • Error probability P = 0.01
  • val ∈ [0, max]
  • N = 100
[Figure labels: Approximate Query Results, Error-Tolerant Algorithms, Error-Proportional Overhead]
C2: Resilient Data Storage
• Challenge
  • Problem: data loss/corruption (explicit/implicit)
  • Goal: data stability by data redundancy and error correction
• Example (data partitioning)
  • Table R (a, b, c)
  • Data redundancy (synopsis and replicas)
• Optimization
  • Exploit the multiple replicas with (complementary) layouts
  • E.g., different sorting orders, partitioning schemes, compression schemes, etc.
[Figure: Table R with synopsis SR and replica Table R' with synopsis SR'; labels: Test Scheduling, Multiple Replicas, Workload Characteristics]
Time-based / on-the-fly error detection and correction
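The combination of a synopsis and a replica with a different layout can be sketched as follows. The slide does not specify what the synopsis contains; this sketch assumes, purely for illustration, a per-row CRC32 checksum, and the names `checksum`, `build_synopsis`, and `scrub` are hypothetical.

```python
import zlib
from typing import List, Tuple

Row = Tuple[int, int, int]  # one tuple of Table R (a, b, c)


def checksum(row: Row) -> int:
    """CRC32 over the row's textual form (assumed synopsis content)."""
    return zlib.crc32(repr(row).encode("utf-8"))


def build_synopsis(table: List[Row]) -> List[int]:
    """Synopsis S_R: one checksum per row position of the table."""
    return [checksum(row) for row in table]


def scrub(table: List[Row], synopsis: List[int], replica: List[Row]) -> int:
    """Detect rows that no longer match the synopsis and repair them
    in place from the replica. Returns the number of repaired rows."""
    by_checksum = {checksum(row): row for row in replica}
    repaired = 0
    for i, row in enumerate(table):
        if checksum(row) != synopsis[i]:
            good = by_checksum.get(synopsis[i])
            if good is None:
                raise LookupError(f"row {i} is corrupted in both copies")
            table[i] = good
            repaired += 1
    return repaired


if __name__ == "__main__":
    r = [(1, 10, 100), (2, 5, 200), (3, 7, 300)]
    s_r = build_synopsis(r)
    # Replica R' uses a complementary layout: sorted by column b.
    r_prime = sorted(r, key=lambda row: row[1])
    r[1] = (2, 5, 201)  # simulate a silent corruption in R
    print(scrub(r, s_r, r_prime), r[1])
```

Because the replica is located by checksum rather than by position, it is free to use a different sort order, which is what lets the redundant copy also serve queries that prefer that layout. Calling `scrub` periodically corresponds to time-based checking; calling it when a row is read corresponds to on-the-fly checking.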
C3: Resiliency-Aware Optimization
• Challenge
  • Problem: search space of QP/DS, HW heterogeneity
  • Goal: multi-objective optimization (performance, accuracy, energy, resiliency)
• Example (frequency/voltage scaling (DFS, DVS))
  • 1) Choose frequency level
  • 2) Select voltage scheme
  • 3) Optimize voltage
  • E.g., decreased frequency/voltage
[Figure: plan for Q annotated with DFS/DVS and its positive/negative effects on performance, errors, energy (convex), and accuracy]
Multi-Objective, Global, Architecture-Aware Optimization
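One common way to handle several competing objectives is to keep only the configurations that no other configuration beats on every objective (a Pareto front). The slides do not name a specific optimization technique, so the sketch below is a stand-in, and the candidate frequency/voltage settings and their objective values are made-up numbers for illustration only.

```python
from typing import Dict, List, Tuple

# (runtime, energy, error_rate); lower is better for every objective.
Objectives = Tuple[float, float, float]


def dominates(x: Objectives, y: Objectives) -> bool:
    """x dominates y if it is no worse everywhere and better somewhere."""
    no_worse = all(a <= b for a, b in zip(x, y))
    better = any(a < b for a, b in zip(x, y))
    return no_worse and better


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Names of all candidates that no other candidate dominates."""
    return [
        name
        for name, obj in candidates.items()
        if not any(
            dominates(other, obj)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


if __name__ == "__main__":
    # Hypothetical settings; values are illustrative, not measurements.
    settings = {
        "high_freq_high_volt": (1.0, 3.0, 0.001),
        "mid_freq_mid_volt": (1.5, 2.0, 0.002),
        # Too low a frequency lengthens runtime so much that energy
        # rises again (the convex energy curve on the slide).
        "low_freq_low_volt": (2.5, 2.2, 0.010),
    }
    print(pareto_front(settings))
```

With these illustrative values, the lowest setting is dominated by the middle one, so only the first two remain as candidates for the optimizer to choose between.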
Conclusion
• Problem of state of the art
  • General-purpose resiliency mechanisms at HW/OS level
  • Increasing error rates ~ increasing resiliency costs
• Summary
  • Vision of "Resiliency-Aware Data Management"
  • Challenge: Resilient Query Processing
  • Challenge: Resilient Data Storage
  • Challenge: Resiliency-Aware Optimization
  • Research directions and more in the paper!
• Conclusion / new opportunities
  • Resiliency-aware data management can reduce resiliency costs
  • Research opportunity: reconsideration of many DB aspects w.r.t. resiliency
  • Collaboration opportunity: interdisciplinary research field (HW, OS, Systems, DB)
Choose your Resiliency Level!
Background and Related Work
Background and Related Work
• Taxonomy
  • Faults (tech defects), errors (system-internal), failures (system-external)
  • Static vs. dynamic errors (memory / computation)
    • Static (hard / permanent): static variability, aging
    • Dynamic (soft / transient): cosmic radiation, dynamic variability, aging
  • Implicit vs. explicit errors
    • Implicit: silent errors → general-purpose techniques (ECC, etc.)
    • Explicit: detected or corrected errors
• Related work @ DB level
  • Error-aware frameworks (e.g., MapReduce/Hadoop) → general-purpose techniques
  • Recovery processing / replication [Upadhyaya11] → reacting to explicit errors
  • Implicit: [Graefe09], [Borisov11], [Simitsis10] → specific DM aspects
→ Holistic resilient data management
Choose your Resiliency Level!
TX Level vs. Resiliency Level
• Similarities
  • Different application requirements on integrity
    • TX: physical and operational integrity
    • Resiliency: physical integrity
  • Ensuring integrity incurs cost overheads
  • Context knowledge can be exploited for reducing costs
    • TX: TX scheduling (logical serialization)
    • Resiliency: challenges and use cases
• Differences
  • Configuration granularity
    • TX: we could handle different TX levels concurrently
    • Resiliency: configuring HW parameters can have global influence on multiple queries on that HW component
  • Scope
    • TX: integrity for the running query or TX (assumption: the DB is transformed from one consistent state to another by TX only)
    • Resiliency: computation and data integrity