
Berkeley RAD Lab Technical Vision

Berkeley RAD Lab Technical Vision. Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica RADS Retreat, June 2005. Outline. Overall Vision Internet Services Vision (ServRADS) Network Vision (NetRADS) Internet Services Network architecture Principles and Summary.


Presentation Transcript


  1. Berkeley RAD Lab Technical Vision • Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica • RADS Retreat, June 2005

  2. Outline • Overall Vision • Internet Services Vision (ServRADS) • Network Vision (NetRADS) • Internet Services Network architecture • Principles and Summary

  3. Overarching Mantra • Enable a faster pace of network service innovation through new distributed system architectures that reduce operations cost by 2-3 orders of magnitude • The Challenge: • Software systems: Too much information => make sense of it through statistical learning & control theory • Network systems: Too little information => exploit better observation and monitoring in the network infrastructure to drive management processes

  4. In practice this means … • Single person can write, deploy, operate the next-generation IT business (“the Fortune 1 million”) • Do for Internet apps what Web did for individual publishing • Gray’s challenge: planetary-scale distributed system operated by a single part-time operator • Goal: programmers focus on functionality; put the *ility in the platform • Could be built on utility computing, giving access to distributed physical resources • Integrated approach to network and server/service management • Requires 100x-1000x reduction in TCO from today’s levels

  5. What things are like today • World-scale services created and operated by expert teams • “Google-sized organization” to create a Google • Amazon’s book browsing, designed by programmers, is cumbersome • Browsing for housewares, designed by domain experts on mature infrastructure, more usable • We don’t know what the next “killer app” will be! • NOW project didn’t predict Internet search as a “killer app” for NOWs • If we succeed, the next killer Internet app will be written, deployed, operated, at Google-like scales, by a single programmer

  6. Focusing on lowering cost of ownership • Standard way to account for “where the money goes” in operating a deployed distributed application • Definition independent of who is operating the app • Operators per byte of storage or per CPU? No, doesn’t scale with technology changes • Operators per end-user served? (This is the figure of merit for e-tailers) • Operators per geographic region served? • Operators per $ spent on capital cost? • Operators per $ of revenue?

  7. Outline • Overall Vision • Internet Services Vision (ServRADS) • Network Vision (NetRADS) • Internet Services Network architecture • Principles and Summary

  8. Enabling Technologies for Reducing TCO in ServRADS • Past successes • microrebooting: Fast recovery makes false positives tolerable • Pinpoint: using SLT to detect and localize fine-grain failures • visualization+SLT to help operators & earn their trust • Elements of technical vision • SLT and machine learning • Operator-centric visualization • Control theory • “Open source” failures database (sanitized, open failures & forensics repository)

  9. Example scenarios • Helping operators make sense of instrumentation • Using ML techniques to localize failures (P. Bodik, E. Kiciman) • Using automatically-induced statistical models to identify likely causes of performance problems (S. Zhang, I. Cohen et al.) • Combining SLT with visualization for cross-checking problem reports and rapidly spotting potential problems visually • Automating problem identification based on stored signatures (S. Zhang, M. Goldszmidt, I. Cohen et al.) • Facilitating self-tuning/configuration • Using control theory to improve performance of a distributed streaming database (W. Xu) • Service placement in wide-area distributed system (D. Oppenheimer) • Microreboots (G. Candea) and microreplacement (S. Kawamoto) as low-cost prevention/repair strategies If false positive cost can be kept low, automate. Otherwise, help operator do her job.
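
The closing rule of slide 9 ("if false positive cost can be kept low, automate") can be read as a simple cost comparison. The following is a minimal sketch under stated assumptions, not something taken from the talk: the function name `should_automate`, its parameters, and the cost figures are all hypothetical. It assumes an automated action is taken on every alarm, so its wasted cost is the false-positive rate times the cost of the action, while involving an operator costs a fixed triage effort per alarm.

```python
def should_automate(p_false_positive, action_cost, triage_cost):
    """Decide whether to act automatically on an alarm.

    Hypothetical model: automation wastes `action_cost` on every false
    alarm, whereas asking an operator costs `triage_cost` on every alarm.
    Automate when the expected waste is smaller than the triage effort.
    """
    expected_waste = p_false_positive * action_cost
    return expected_waste < triage_cost


# Illustrative numbers only (arbitrary cost units, not measurements).
# A cheap recovery action such as a microreboot tolerates many false alarms.
cheap_action = should_automate(p_false_positive=0.3, action_cost=1, triage_cost=60)
# An expensive action with the same false-alarm rate is better left to the operator.
costly_action = should_automate(p_false_positive=0.3, action_cost=600, triage_cost=60)
```

Under this model, lowering the cost of the recovery action has the same effect as lowering the false-positive rate, which is one way to read the slide 8 remark that fast recovery makes false positives tolerable.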

  10. Services example: combining viz + SLT

  11. Reduce TCO via Planetary-scale Abstractions • Inspiration: narrowly-focused planetary-scale abstractions whose design & implementation... • scale well: understand distributed scheduling, locality, symptoms of wide-area failures • monitorable and controllable (using SLT & linear CT) • retain precisely-quantifiable and “acceptable” semantics under partial-failure conditions • Examples of existing “narrow but powerful” services • MapReduce in Google understands data locality • Can easily imagine a “lossy” MapReduce, like online aggregation • queues/messaging in Yahoo, Amazon, others • User information database in Yahoo • Instrumentation collection & analysis services using Telegraph-CQ

  12. Outline • Overall Vision • Internet Services Vision (ServRADS) • Network Vision (NetRADS) • Internet Services Network architecture • Principles and Summary

  13. RADS Network Problem • Internet routing has proven to be robust • But … • Poor visibility: hard to determine health of the network • Routing policy interactions defeat propagation of useful diagnostic info: difficult to identify root cause problems • Slow reaction times to connectivity failures; operator intervention (across admin domains) increases cost of ownership • Key observation: network service failures attributed to unexpected traffic patterns • Approach: identify and protect “good” traffic • Mechanism deployed in network edge: • It’s where the servers and clients are located • Greatest need for lowering management costs • Administrative scope and responsibility is well-defined

  14. iBoxes: New network element for Observe, Analyze, Act • Enterprise Network Architecture • Inspection-and-Action Boxes: • Deep multiprotocol packet inspection • No routing; observation & marking • Policing points: drop, fence, block

  15. Network-Level Observe-Analyze-Act • Observe • Packet, path, protocol, service invocation statistical collection and sampling: frequencies, latencies, completion rates • Construct the collection infrastructure • Analyze • Determine correlations among observations • “Normal” model discovery + anomaly detection • Exploit SLT • Act • Experiment to test correlations • Prioritize and throttle • Mark and annotate • Control theory? Distributed analyses and actions
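
The Observe-Analyze-Act loop on slide 15 can be sketched in a few lines. This is only an illustration under assumptions: a mean and standard-deviation baseline with a z-score threshold stands in for the unspecified SLT techniques, and the function names, the threshold of 3.0, and the sample request rates are hypothetical.

```python
from statistics import mean, stdev


def build_normal_model(samples):
    """Analyze: summarize baseline observations as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, model, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def choose_action(value, model):
    """Act: throttle anomalous traffic, otherwise only mark it as observed."""
    return "throttle" if is_anomalous(value, model) else "mark"


# Observe: hypothetical DNS requests per second collected under normal load.
baseline_dns_rates = [100, 104, 98, 101, 97, 103, 99, 102]
model = build_normal_model(baseline_dns_rates)

print(choose_action(101, model))  # within the normal range
print(choose_action(450, model))  # far outside the normal range
```

A real deployment would need a richer model than a single univariate baseline, since the slide also calls for correlations among observations and for experiments that test them.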

  16. Network Layer Mechanism: Annotations [Figure: protocol stack, top to bottom: Application, Presentation, Session, Transport, Annotation, Network, Link, Phy] • Enhance network visibility: disseminate observations, communicate actions, provide in-band network management actions, iBox-to-iBox communications • iBoxes label packets at annotation layer but do not rewrite packet contents • Annotations stack, must be removed from packets before delivery to A-layer unaware end nodes
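
The annotation rules on slide 16 (labels stack, packet contents are not rewritten, and annotations are stripped before delivery to unaware end nodes) can be illustrated with a small data-structure sketch. The `Packet` class, the `annotate` and `deliver` helpers, and the label strings are hypothetical and are not the wire format or API of any real iBox.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    # Annotation stack; the most recently pushed label is the last element.
    annotations: list = field(default_factory=list)


def annotate(packet, ibox_id, note):
    """Push a label onto the annotation stack without touching the payload."""
    packet.annotations.append((ibox_id, note))
    return packet


def deliver(packet, annotation_aware):
    """Strip all annotations before handing the packet to an unaware end node."""
    if not annotation_aware:
        packet.annotations.clear()
    return packet


pkt = Packet(payload=b"dns-query")
annotate(pkt, "I-S", "high DNS request rate observed")
annotate(pkt, "I-I", "throttle candidate")
stack_depth = len(pkt.annotations)          # two labels are stacked
delivered = deliver(pkt, annotation_aware=False)
```

Keeping the annotations in a separate field from the payload makes the "label but do not rewrite" rule easy to check: the payload bytes are identical before and after annotation and delivery.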

  17. Scenario: Traffic Surge Inhibiting Network Services • DNS Server swamped by excessive request traffic • Observe: DNS timeouts, Web access traffic slowed, but also higher than normal mail delivery latency, implying busy server edge (correlation between Mail Server and DNS Server utilization?) • Root Cause: High DNS request rates generated by Spam Appliance triggered by mail surge [Figure: enterprise network diagram showing Internet Edge, Distribution Tier, Server Edge, and Access Edge; Primary & Secondary DNS Servers, Mail Server, and Spam Appliance; nodes labeled II, IS, IA, R, S, E]

  18. Scenario • How Diagnosed? • I-S detects high link utilization but abnormally high DNS traffic • Stats from I-I: high mail traffic, low outgoing web traffic, in traffic high but link utilization not high • Stats from I-A: lower web traffic, no unusual mail origination • Problem localized to Server edge, but visibility limited: RADS can help [Figure: enterprise network diagram showing Internet Edge, Distribution Tier, Server Edge, and Access Edge; Primary & Secondary DNS Servers, Mail Server, and Spam Appliance; nodes labeled II, IS, IA, R, S, E]

  19. Scenario • Possible Action Responses • Experiment: Redirect local DNS requests to Secondary DNS server: if these complete, can infer the server is the problem, not the network • Throttle: Due to MS-DNS correlation, block/slow email traffic at Server Edge: should expect reduced DNS server utilization [Figure: enterprise network diagram showing Internet Edge, Distribution Tier, Server Edge, and Access Edge; Primary & Secondary DNS Servers, Mail Server, and Spam Appliance; nodes labeled II, IS, IA, R, S, E]
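
The two responses on slide 19 are each a small inference step. The sketch below restates them as code, assuming hypothetical helper names, a hypothetical minimum utilization drop of 0.1, and made-up utilization values; it does not show how an iBox would actually redirect or throttle traffic.

```python
def interpret_redirect_experiment(secondary_completes):
    """Experiment: redirect local DNS requests to the secondary server.

    If the redirected requests complete, the network path works, so the
    primary server is the likely problem.
    """
    if secondary_completes:
        return "primary DNS server is the likely problem, not the network"
    return "inconclusive: the network or both servers may be affected"


def throttle_confirms_correlation(dns_util_before, dns_util_after, min_drop=0.1):
    """Throttle: after slowing mail traffic, check that DNS utilization fell.

    Utilizations are fractions in [0, 1]; `min_drop` is an assumed threshold.
    """
    return (dns_util_before - dns_util_after) >= min_drop


verdict = interpret_redirect_experiment(secondary_completes=True)
confirmed = throttle_confirms_correlation(dns_util_before=0.95, dns_util_after=0.60)
not_confirmed = throttle_confirms_correlation(dns_util_before=0.95, dns_util_after=0.93)
```

Both checks are experiments in the sense of slide 15: each action is chosen so that its outcome either supports or weakens the suspected correlation between mail server and DNS server load.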

  20. Outline • Overall Vision • Internet Services Vision (ServRADS) • Network Vision (NetRADS) • Internet Services Network architecture • Principles and Summary

  21. Embodying principles in a prototype • Platform architecture and prototype to enable rapid innovation in network services by non-experts • automatically accommodates scaling, provisioning, failure management • multi-datacenter (geoplexed) • observable networks connecting datacenters • potentially planetary scale • runs with minimal operator oversight • Prototype keeps various research projects focused on common goal and allows ongoing testing • Participation in standards processes to promote “best practices” in platform as open standards

  22. Reliable Adaptive Distributed Systems [Figure: architecture diagram. Actors: Operator, User. Endpoints: Client and Server, each with Distributed Middleware and a Router, connected over the Internet IP Network. Layers: Prototype Applications; Programming Abstractions for Roll-back and wide-area distributed computations; SLT Services; Crash-only services + Observation Infrastructure for System SLT; Application-Specific Overlay Network; Checkable Protocols; Fast Detection & Route Recovery; Observation Infrastructure for network SLT; iBoxes at each Edge Network; Commodity Internet]

  23. Generic iBox Architecture [Figure: Input Ports and Output Ports with Buffers; Classification Processors (CP) and an Action Processor (AP) joined by an Interconnection Fabric; “Tag” Mem; Rules & Programs]

  24. Possible architecture [Figure: inside the datacenter boundary, a rack of app. server & application (e.g. J2EE) with microrecovery actions; control loops of SLT algo. blocks linking high-level sensor data to high-level effectors; inputs from other datacenters and externally-induced failures, workload changes, etc.; a T-CQ engine producing preprocessed data for SLT algo. blocks, visualization, and syndrome identification; sanitized data and outputs to other datacenters]

  25. Outline • Overall Vision • Internet Services Vision (ServRADS) • Network Vision (NetRADS) • Internet Services Network architecture • Principles and Summary

  26. ServRADS: Observations & Summary • SLT algorithms make sense of large amounts of data • Classification, outlier/anomaly detection, clustering, etc. • Viz helps operator use “visual pattern recognition” to quickly spot problems and cross-check SLT models • Enables operator expertise to be quickly brought to bear • Builds operators’ trust in statistical/machine learning models • Challenge • Fundamental challenges associated with applying SLT to problem determination (coming up next session) • Unifying many techniques into a coherent approach - prototype platform as unifying artifact • Idea: capture best practices in TCO-optimized, planetary-scale abstractions

  27. NetRADS: Observations & Summary • COPS: Paradigm for (more) automatically protecting critical resources when network is under stress • Checkable protocols: visible semantics • Observe network behavior: good (easy), bad (hard), suspicious • Protect services: throttle, redirect • Network management major contributor to TCO • NetRADS built on: • iBoxes: pervasive infrastructure for observation and action at the network level • Annotation Layer: for marking, control, inter-iBox communications • Integration with Internet service approach for service/server-level visibility and integrated management
