Discover a new vision for networked systems, leveraging Statistical Learning Theory to enhance dependability and security. Learn about adaptive techniques and experimental prototypes.
A Research Program in Reliable Adaptive Distributed Systems (RADS) • Armando Fox, Stanford University • Michael Jordan, Randy Katz, George Necula, David Patterson, Doug Tygar, University of California, Berkeley
Presentation Outline • A New Vision for Networked Systems • Enabling Technology: Statistical Learning Theory • Approaches for Dependability • Approaches for Security • Elements of an Experimental Prototype • Summary and Conclusions
Networked Systems: Current State-of-the-Art • Today's systems are fragile and easily broken, yielding poor reliability and security • Configuration by humans is overwhelmingly complex and infrequently correct, undermining dependability and introducing vulnerabilities • >50% of outages, >90% of security break-ins attributed to configuration • Attackers exploit known problems faster than system managers apply known fixes • Overly focused on performance, performance, and cost-performance • Systems based on fundamentally incorrect assumptions • Humans are perfect • Software will eventually be bug-free • Hardware MTBF is already very large, and will continue to increase • Maintenance costs irrelevant vs. purchase price
Networked Systems: Cost of Failure and its Inevitability • Outage Costs • Amazon: Revenue $3.1B, 7,744 employees • Revenue (24x7): $350k per hour • Employee productivity costs: $250k per hour • Total downtime costs: $600k per hour • Employee cost/hour comparable to revenue, even for an Internet company • People/HW/SW failures are facts, not problems: "If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time" — Shimon Peres ("Peres's Law") • Recovery/repair is how we cope with them
Principles for Reliable Adaptive Distributed Systems • Given that errors occur, design to recover rapidly • Partial restart • Crash-only software (one way to start, one way to stop) • Given that humans make (most of the) errors, build tools to help operators find and repair problems • Pinpoint the error • Undo of human error • Note: errors are often associated with configuration • Recovery benchmarks to measure progress • What you can't measure, you can't improve • Collect real failure data to drive benchmarks
Networked SystemsComponents of New Approach • Statistical learning algorithms that observe and predict future behaviors • Verification techniques that check for correct behavior, reveal vulnerabilities, harness techniques for the rapid generation of behaviors with desirable properties • Programmable network elements allowing active code to be inserted into the network, to provide observation and enforcement points without the need for access to user end systems
Interdisciplinary Expertise • SLT (Jordan), Network Services/Protocols (Fox, Katz, Patterson, Stoica), and Verification Methods applied to network and security behaviors (Stoica, Tygar) • Comprehensive distributed architecture embedding SLT as building block for critical components for system observation, coordination, inference, correction, and evolution of behaviors • Components suitable for embedding in distributed systems • Network behaviors that reveal correct or incorrect operation of higher-level network applications • Embedding observational and inference means at strategic points in the network, obviating need to modify end hosts or applications • System level heterogeneity and ability to generate new behaviors on demand in response to a dynamic system threat environment to achieve enhanced dependability and resilience to attack • Enabling applications for investigation will include web services, intrusion detection, storage access
Presentation Outline • A New Vision for Networked Systems • Enabling Technology: Statistical Learning Theory • Approaches for Dependability • Approaches for Security • Elements of an Experimental Prototype • Summary and Conclusions
Statistical Learning Theory • Toolbox for design/analysis of adaptive systems • Algorithms for classification, diagnosis, prediction, novelty detection, outlier detection, quantile estimation, density estimation, feature selection, variable selection, response surface optimization, sequential decision-making • Classification algorithms • Recent scaling breakthroughs: 10K+ features, millions of data points • Kernel machines; functional analysis and convex optimization • Generalized inner product—similarities among data point pairs • Defined for many data types • Classical linear statistical algorithms “kernelized” for state-of-the-art nonlinear SLT algorithms in many areas
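As a concrete illustration of the "generalized inner product" idea, here is a minimal sketch (in Python with NumPy, an implementation choice of ours; the slide names none) that builds the Gram matrix of pairwise RBF kernel similarities:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2): a
    generalized inner product defined for every pair of points."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))  # clip guards round-off

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = rbf_gram(X)  # 3x3, symmetric, ones on the diagonal
```

Any classical linear algorithm that touches the data only through inner products can be "kernelized" by substituting K for those inner products, which is the substitution the slide refers to.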
Statistical Learning Theory • Novelty Detection Problem • Unlimited observations reflecting normal activity, yet few (or no) instances that reflect an attack or a bug • Second-order cone program: a convex optimization problem with an efficient solution method • Given a cloud of data in a high-dimensional feature space, place a boundary around it to guarantee that only a small fraction falls outside • Needed: "on-line" variants of SLT algorithms that update the learning system's state based on small sets of data • Available for some "kernelized" problems • On-line versions of the best algorithms have yet to be developed!
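To illustrate what an "on-line" variant means operationally, here is a toy sketch (entirely hypothetical, and not one of the kernelized algorithms the slide calls for) that updates a scalar novelty detector's state one observation at a time:

```python
import math

class OnlineNoveltyScorer:
    """Toy on-line novelty detector: keeps an exponentially weighted
    mean/variance of a scalar metric and flags observations far
    outside the learned band."""

    def __init__(self, alpha=0.05, k=4.0):
        self.alpha, self.k = alpha, k
        self.mean, self.var, self.n = 0.0, 1.0, 0

    def update(self, x):
        """Fold one observation into the state; return True if it
        looked novel *before* the update (after a short burn-in)."""
        novel = (self.n > 10 and
                 abs(x - self.mean) > self.k * math.sqrt(self.var))
        d = x - self.mean
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        self.n += 1
        return novel
```

The point of the sketch is the shape of the interface: constant state, constant work per observation, which is what the kernelized algorithms still lack.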
Statistical Learning Theory • “Super kernels”: combine heterogeneous data via multiple kernels • Semidefinite programs, convex optimization problems with efficient solutions involving efficient decomposition techniques • Useful in fusing evidence at distributed nodes • Problems of interest require combined parameter estimation and optimization • Response surface methodology: building local mappings from configurations to performance, and suggesting gradient directions in configuration space leading to performance improvements • Policy-gradient methods: SLT algorithms that make sequences of decisions, yielding a “behavior” or “policy”; successfully developed policies for nonlinear control problems involving high degrees of freedom
Statistical Machine Learning • Kernel methods • neural network heritage • convex optimization algorithms • kernels available for strings, trees, graphs, vectors, etc. • state-of-the-art performance in many problem domains • frequentist theoretical foundations • Graphical models • marriage of graph theory and probability theory • recursive algorithms on graphs • modular design • state-of-the-art performance in many problem domains • Bayesian theoretical foundations
Vision • First 50 years of computer science • manually-engineered systems • lack of adaptability, robustness, and security • no concern with closing the loop with the environment • Next 50 years of computer science • statistical learning systems throughout the infrastructure • self-configuring, adaptive, sentient systems • perception, reasoning, decision-making cycle
Example I: Statistical Bug-finding • Programs are buggy, yet people use them • Exploit this: use user trials to debug programs • Outline of system: • Instrument programs to take samples at runtime of program state • Collect information over the Internet • Learn a statistical classifier based on successful and failed runs, using feature selection methods to pinpoint the bugs
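The outline above can be sketched end-to-end on synthetic data. This is a hedged illustration: the run matrix, the failure model (predicate 17 standing in for the buggy comparison), and the simple scoring rule are invented stand-ins for the project's instrumented sampling and feature-selection machinery:

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_preds = 400, 50
# X[i, j] = 1 if instrumentation saw predicate j true during run i
X = rng.random((n_runs, n_preds)) < 0.3
# Invented bug model: when predicate 17 (say, "indx > limit") fires,
# the run crashes 90% of the time; otherwise runs fail only 5%.
failed = np.where(X[:, 17],
                  rng.random(n_runs) < 0.9,
                  rng.random(n_runs) < 0.05)

def predicate_scores(X, failed):
    """Score each predicate by how much more often it is true in
    failing runs than in successful runs: a crude stand-in for the
    classifier + feature selection the slide describes."""
    return X[failed].mean(axis=0) - X[~failed].mean(axis=0)

top = int(np.argmax(predicate_scores(X, failed)))  # pinpointed predicate
```

With enough user runs collected over the Internet, the top-scoring predicates localize the bug, as the bc case study on the next slide shows.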
Case Study: BC • Array overrun bug in re-allocation routine morearrays() leads to memory corruption and sometimes an eventual crash; 2,908 features • All top features indicate indx being unusually large: • storage.c:176: morearrays(): indx > optopt • storage.c:176: morearrays(): indx > opterr • storage.c:176: morearrays(): indx > usemath • storage.c:176: morearrays(): indx > quiet • storage.c:176: morearrays(): indx > fcount • And this indeed pinpoints the bug
Example II: Novelty Detection • The goal is binary classification • but all of the training data come from one class • Many practical applications • intrusion detection • machine diagnostics • Basic problem---find a boundary that encloses a desired fraction of the data, and is as tight as possible • can be done using the generalized Chebyshev inequality • using kernels, this is a convex problem
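A minimal sketch of the boundary idea, assuming a plain (non-kernelized) ball centered at the sample mean rather than the kernelized convex program the slide describes:

```python
import numpy as np

def fit_enclosing_ball(X, frac=0.95):
    """Smallest ball centered at the sample mean whose radius covers
    `frac` of the training data: a crude, non-kernelized stand-in
    for the tight one-class boundary described on the slide."""
    center = X.mean(axis=0)
    radius = np.quantile(np.linalg.norm(X - center, axis=1), frac)
    return center, radius

def is_novel(x, center, radius):
    return np.linalg.norm(x - center) > radius

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(1000, 2))   # "normal activity" only
center, radius = fit_enclosing_ball(X)
```

Replacing the Euclidean distance with a kernel-induced distance recovers the convex one-class formulation the slide refers to.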
Case Study: Analog Circuit Design • Case study: a Low Noise Amplifier (LNA) for wireless applications • 7 design parameters (transistor size, bias currents, etc.) • 50,000 positive samples • Visualize the projection of feasible solutions in a plane representing second-order (HD_2) and third-order (HD_3) harmonic distortion
Example III: Diagnosis • A probabilistic graphical model with 600 disease nodes, 4000 finding nodes • Node probabilities p(f_i | d) were assessed from an expert (Shwe, et al., 1991) • Want to compute posteriors: p(d_j | f) • Is this tractable?
Case Study: Medical Diagnosis • Symbolic complexity: • symbolic expressions fill dozens of pages • would take years to compute • Numerical simplicity: • Jaakkola and Jordan (1999) describe a variational method based on convexity that computes approximate posteriors in less than a second
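On a toy two-disease, two-finding noisy-OR network (all numbers invented for illustration), the posterior the slides discuss can be computed exactly by brute-force enumeration; it is only at QMR-DT scale (600 diseases, 4,000 findings) that the variational approximation becomes necessary:

```python
from itertools import product

# Toy QMR-style network (numbers invented for illustration).
prior = {"flu": 0.10, "cold": 0.20}        # p(disease present)
leak = 0.01                                # background cause of findings
# q[f][d]: probability that disease d *fails* to trigger finding f
q = {"fever": {"flu": 0.2, "cold": 0.9},
     "cough": {"flu": 0.6, "cold": 0.3}}

def p_finding(f, present):
    """Noisy-OR: p(f = 1 | the set of diseases that are present)."""
    fail = 1.0 - leak
    for d in present:
        fail *= q[f][d]
    return 1.0 - fail

def posterior(disease, findings):
    """Exact p(disease = 1 | findings) by enumerating disease states."""
    num = den = 0.0
    for states in product([0, 1], repeat=len(prior)):
        present = [d for d, s in zip(prior, states) if s]
        p = 1.0
        for d, s in zip(prior, states):
            p *= prior[d] if s else 1.0 - prior[d]
        for f, v in findings.items():
            pf = p_finding(f, present)
            p *= pf if v else 1.0 - pf
        den += p
        if disease in present:
            num += p
    return num / den

post = posterior("flu", {"fever": 1, "cough": 1})
```

Enumeration is exponential in the number of diseases, which is exactly why the 600-node network needs the convexity-based variational method of Jaakkola and Jordan.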
Presentation Outline • A New Vision for Networked Systems • Enabling Technology: Statistical Learning Theory • Approaches for Dependability • Approaches for Security • Elements of an Experimental Prototype • Summary and Conclusions
Crash-Only Software: Dramatically Simplifying Recovery • Robust systems must be crash-safe • Restart-curable bugs cause outages • Rebooting eliminates many corruptions • Why support any other kind of shutdown/restart? • Crash-only software • Shutdown <=> crash; recover <=> restart • Software components provide an external power switch independent of component behavior • Recovery is inexpensive/safe to try • Crash-only power-switch "infrastructure" is simpler than apps, common to all of them • Higher confidence that it will work • Like transaction invariants, yet focused on recovery • Can machine learning and statistical monitoring approaches be applied during online operations?
Crash-Only Software: Simplified Recovery Management • Failure detection and recovery management are hard • How to detect that something’s wrong? • How do you know when recovery is really necessary? • Will a particular recovery technique work? • What is the effect on online performance? • What if you needlessly “over-recover”? • Predictable, fast recovery simplifies failure detection and recovery management • Something doesn’t look the way it used to => anomaly • Not all anomalies are failures... but “over-recovering” is OK • If rebooting suspected-bad component doesn’t work: reboot its larger containing group, recursively • Leverage for applying statistical monitoring & machine learning
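The recursive "reboot the larger containing group" policy above can be sketched as follows (the component names and the anomaly model are hypothetical; in the real system the detector is the statistical monitoring the slide discusses):

```python
class Component:
    """Crash-only component: the only recovery verb is restart."""
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.faulty = False
        if parent:
            parent.children.append(self)

    def restart(self):
        # Crash + restart wipes corrupted soft state, recursively.
        self.faulty = False
        for c in self.children:
            c.restart()

    def anomalous(self):
        # Stand-in detector: a fault anywhere in the containing chain
        # shows up as anomalous behavior at this component.
        return self.faulty or (self.parent is not None
                               and self.parent.anomalous())

def recover(suspect):
    """If rebooting the suspected-bad component doesn't clear the
    anomaly, reboot its larger containing group, recursively."""
    target = suspect
    while target is not None:
        target.restart()
        if not suspect.anomalous():
            return target.name     # level that had to be rebooted
        target = target.parent
    return None                    # escalate to a human operator
```

Because each restart is fast and safe to try, "over-recovering" at a too-large granularity costs little, which is what makes this escalation loop reasonable.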
Crash-Only Software: Practical to Build • Case studies: two crash-only state-storage subsystems (for session state and durable state) • OK to crash any node at any time for any reason • Recovery is highly predictable, doesn't impact online performance • Replication provides probabilistic durability & capacity during recovery • Access pattern exploited for consistency guarantees • Nine "activity" & "state" statistics monitored per storage brick • Metrics compared against those of "peer" bricks • Basic idea: Changes in workload tend to affect all bricks equally • Underlying (weak) assumption: "Most bricks are doing mostly the right thing most of the time" • Anomaly in 6 or more (out of 9) metrics => reboot brick • Simple thresholding and substring-frequency used to determine "anomalous"
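A rough sketch of the peer-comparison rule above, assuming a median/MAD test in place of the project's actual thresholding and substring-frequency statistics:

```python
import statistics

def anomalous_metrics(brick_metrics, peer_metrics, tolerance=3.0):
    """Flag each metric whose value strays from the peer median by
    more than `tolerance` times the peers' median absolute deviation.
    Peer comparison in the spirit of the slide; the exact statistics
    used in the project differ."""
    flags = []
    for name, value in brick_metrics.items():
        peers = [p[name] for p in peer_metrics]
        med = statistics.median(peers)
        mad = statistics.median(abs(v - med) for v in peers) or 1e-9
        flags.append(abs(value - med) > tolerance * mad)
    return flags

def should_reboot(brick_metrics, peer_metrics, quorum=6):
    """Slide's policy: anomaly in `quorum` or more of the 9 monitored
    metrics triggers a reboot of the brick."""
    return sum(anomalous_metrics(brick_metrics, peer_metrics)) >= quorum
```

Comparing against peers rather than a fixed baseline is what lets workload shifts (which move all bricks together) pass without false alarms.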
Crash-Only + Statistical Monitoring = Resilience to Real-World Transients • Simple fault model: observed anomalies "coerced" into crash faults • Surprise! Statistical monitoring catches many real-world faults, without a pre-established baseline • Memory bitflips in code, data, checksums (=> crash) • Hang/timeout/freeze • Network loss (drop up to 70% of packets randomly) • Hiccup (e.g., from garbage collection) • Persistent slowdown (one node lags the others) • Overload (TCP-like mechanism used to generate backpressure)
Generalizing Crash-Only: Micro-reboots • Add micro-reboot (uRB) support to middleware • Enhance open-source JBoss J2EE application server with fault injection, code path tracing, micro-reboots • Use automated fault injection + observation to infer propagation of exceptions • During operation, micro-reboot components or component groups suspected of being correlated to an observed failure • uRB’s improve performability • 2-3 orders of magnitude faster than “full” reboots or application reload • Minimizes disruption to users of other (non-faulty) parts of system • Goal is fast recovery, not causal analysis Fast, cheap uRB’s + statistical monitoring provide a degree of application-generic failure detection & recovery
Crash-Oriented Software: Systematic Approach • Some design lessons already learned • OK to say no, OK to make mistakes, interchangeable parts • Systematic approach for generic componentized apps ... • Compiler and languages technology to understand what makes an app amenable to micro-reboots or crash-only design generally • E.g., tracking state management across app components • E.g., establishing "observational equivalence" between executions with and without micro-recovery • Goal: • Static and dynamic analysis of when it is "safe" to use generic recovery • Aggressive application of machine learning & statistical monitoring to trigger generic recovery mechanisms • High confidence in mechanisms due to simplicity and orthogonality
Presentation Outline • A New Vision for Networked Systems • Enabling Technology: Statistical Learning Theory • Approaches for Dependability • Approaches for Security • Elements of an Experimental Prototype • Summary and Conclusions
Self-Verifiable Protocols: Statement of the Problem [Diagram: sender and receiver connected by a data/control-flow verification protocol] • Problem: Detect and contain network effects of misconfigurations and faulty/malicious components • Approach: Design network protocols so each component verifies correct behavior of the protocol • Examples: • End-to-end protocols • Routing (BGP) protocols
Self-Verifiable Protocols, Case Study: BGP • Propagating invalid BGP routes can bring the Internet down • Multiple causes • Router misconfigurations: happen daily, yielding outages lasting hours • Malicious routers: huge potential threat • Routers with default passwords • Possible to "buy" routers' passwords on darknets • Existing solutions • Hard to deploy (e.g., Secure-BGP) or provide insufficient security • Our solution: • Whisper: verify the correctness of router advertisements • Listen: verify reachability on the data plane
Self-Verifiable Protocols: BGP Whisper • Use redundancy to check consistency of peers' information • Whisper game: • Group sits in a circle; a person whispers a secret phrase to both neighbors • Person at the other end concludes: • Phrase is correct if the same phrase arrives from both neighbors • Otherwise, at least one phrase is incorrect
Self-Verifiable Protocols: BGP Whisper • AS1 advertises its address prefix • Chooses a secret key "x" and sends y = h(x) • h(): well-known one-way hash function • Every router applies h once more before forwarding (y ← h(y)) • AS4 performs a consistency check on the two received values y1 and y2: hash the value from the shorter path forward until both hash chains have the same length, then compare • If equal, assume both routes are correct • If not, at least one route is incorrect (but we don't know which): raise a flag [Diagram: AS1 chooses secret key "x" and advertises (AS1, y=h(x)) along two paths; via AS2 and AS3 the advertisement reaches AS4 as (AS1, AS2, AS3, y1=h3(x)); via the shorter path it reaches AS4 as (AS1, AS3, y2=h2(x))]
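The hash-chain consistency check can be sketched as follows (using SHA-256 as the well-known one-way function h; the slide does not specify one):

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def h_n(x: bytes, n: int) -> bytes:
    """Apply the one-way hash n times (once per AS hop)."""
    for _ in range(n):
        x = h(x)
    return x

def whisper_consistent(y1, hops1, y2, hops2):
    """Hash the value from the shorter path forward until both hash
    chains are the same length, then compare. A mismatch means at
    least one route is bogus (without revealing which)."""
    n = max(hops1, hops2)
    return h_n(y1, n - hops1) == h_n(y2, n - hops2)
```

Because h is one-way, a malicious router that never saw x cannot manufacture a value that hashes forward into agreement with the honest chain.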
Self-Verifiable Protocols: BGP Listen • Monitor progress of TCP flows • If a TCP flow doesn't make progress, the route may be incorrect • Use heuristics to reduce the number of false positives and negatives • Still difficult to handle traffic patterns like port scanners • Use SLT techniques to improve detection accuracy?
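A toy version of the Listen idea, with an invented event format; the real Listen works from observed TCP packets and uses more careful heuristics to limit false positives such as port scanners:

```python
from collections import defaultdict

def listen_verdicts(flow_events, min_flows=3):
    """Flag a prefix whose TCP flows repeatedly open (SYN) but never
    make progress (no DATA): evidence that the advertised route does
    not actually deliver packets."""
    opened = defaultdict(int)
    progressed = defaultdict(int)
    for prefix, event in flow_events:
        if event == "SYN":
            opened[prefix] += 1
        elif event == "DATA":
            progressed[prefix] += 1
    return {p: opened[p] >= min_flows and progressed[p] == 0
            for p in opened}
```

The `min_flows` threshold is the crudest possible false-positive control; this is where the slide suggests SLT techniques could improve accuracy.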
Self-Verifiable Protocols: Status and Future Plans • Two examples: • BGP verification (Listen & Whisper) • Can trigger alarms and contain malicious routers • Minimal changes to BGP; incrementally deployable (Listen) • Self-verifiable CSFQ • Per-flow isolation without maintaining per-flow state • Detect and contain malicious flows • Ultimate goal: develop distributed systems able to self-diagnose and self-repair • Eliminate faulty components • At minimum, raise a flag in the presence of misconfigurations and attackers • Develop a set of principles and techniques for robust protocols
Enabling Technology: Edge Services by Network Appliances • In-the-Network Processing: the Computer IS THE Network • F5 Networks BIG-IP Load Balancer: web server load balancer • Network Appliance NetCache: localized content delivery platform • Packeteer PacketShaper: traffic monitor and shaper • Cisco SN 5420: IP-SAN storage gateway • Ingrian i225: SSL offload appliance • Nortel Alteon Switched Firewall: CheckPoint firewall and L7 switch • Cisco IDS 4250-XL: intrusion detection system • NetScreen 500: firewall and VPN • Extreme Networks SummitPx1: L2-L7 application switch
Generic PNE Architecture [Diagram: input ports and output ports with buffers, classification processors (CP) and an action processor (AP) joined by an interconnection fabric, with tag memory and rules & programs stores]
Enabling Technology:Programmable Networks • Problem • Common programming/control environment for diverse network elements to realize full power of “inside the network” services and applications • Approach • Software toolkit and VM architecture for PNEs, with retargetable optimized backend for diverse appliance-specific architectures • Current Focus • Network health monitoring, protocol interworking and packet translation services, iSCSI processing and performance enhancement, intrusion and worm detection and quarantining • Potential Impact • Open framework for multi-platform appliances, enabling third party service development • Provable application properties and invariants; avoidance of configuration and “latest patch not installed” errors
Enabling Technology:Programmable Networks • Generalized PNE programming and control model • Generalized “virtual machine” model for this class of devices • Retargetable for different underlying implementations • Edge services of interest • Network measurement and monitoring supporting model formation and statistical anomaly detection • Framework for inside-the-network “protocol listening” • Selective blocking/filtering/quarantining of traffic • Application-specific routing • Faster detection and recovery from routing failures than is possible from existing Internet protocols • Implementation of self-verifiable protocols
Security of Networked Systems: Learning Systems Opportunity • New focus on network-wide attacks • E.g., worms, denial of service • Arise suddenly, spread quickly • No time to deploy patches or filters to protect machines • SLT offers promise for improvements • Distributed, so information is shared across machines • Handles changes in user behavior, preventing false positives • Truly distributed SLT systems are possible that can detect and protect against very large-scale security attacks
Security of Networked Systems: Technical Approach • Mechanisms to learn, share, and repair against potential threats to dependability • Strengthen assurance of shared information via lightweight authentication and encryption • TESLA authentication system: replaces public-key crypto with lightweight symmetric cryptography; uses time asymmetry to provide assurance • Messages initially carry MACs whose verification keys are revealed later, preventing an attacker from using a received key to forge messages • Variations provide instant authentication • Athena system: generates random instances of secure protocols • Ultra-fast checking software: model-checking & proof-theoretic techniques to verify protocols against stated requirements • Intelligently generate the most efficient secure protocol satisfying requirements, or a random instance of a secure protocol satisfying a given set of requirements • Apply to let SLT systems exchange information more quickly
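The delayed-key-disclosure idea behind TESLA can be sketched as follows (a simplification: real TESLA derives its per-interval keys from a one-way key chain and enforces strict timing checks, both omitted here):

```python
import hashlib
import hmac

def mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

class TeslaSender:
    """Each interval's message is MACed with a key disclosed only in
    a later interval: a receiver buffers messages and verifies them
    once the key arrives, while an attacker who sees the disclosed
    key is too late to forge messages for that interval."""
    def __init__(self, interval_keys):
        self.keys = interval_keys      # one symmetric key per interval

    def send(self, i, msg):
        disclosed = self.keys[i - 1] if i > 0 else None
        return msg, mac(self.keys[i], msg), disclosed

def verify(msg, tag, disclosed_key):
    # Run only after the key for msg's interval has been disclosed
    # in a later packet (the timing checks are omitted here).
    return hmac.compare_digest(tag, mac(disclosed_key, msg))
```

Symmetric MACs make each verification orders of magnitude cheaper than a public-key signature, which is what makes frequent information exchange among SLT nodes affordable.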
Presentation Outline • A New Vision for Networked Systems • Enabling Technology: Statistical Learning Theory • Approaches for Dependability • Approaches for Security • Elements of an Experimental Prototype • Summary and Conclusions
System Prototype • Comprehensive system architecture • Reduction of SLT to practical software components embedded within a distributed-systems context • Exhibition of an architecture for dramatically improving the reliability and security of important systems through observation-coordination-adaptation mechanisms
RADS Prototype Applications • E-mail Systems/Messaging • Scale • Distribution, heterogeneity • "Non-stop" • Reactive Systems • E.g., distributed worm detection • Network/Web Services • Financial Applications • Collective Decision Making/Electronic Voting • Security, privacy • "Non-stop"
RADS Conceptual Architecture [Diagram: clients and servers in edge networks connect across the commodity Internet through routers and programmable network elements (PNEs); distributed middleware on each end hosts SLT services, crash-oriented services, programming abstractions for roll-back, and an observation infrastructure for system SLT; the network layer provides verifiable protocols, fast detection & route recovery, an application-specific overlay network, and an observation infrastructure for network SLT; operator and user interact with prototype applications such as e-voting, messaging, and e-mail]
Presentation Outline • A New Vision for Networked Systems • Enabling Technology: Statistical Learning Theory • Approaches for Dependability • Approaches for Security • Elements of an Experimental Prototype • Summary and Conclusions