Berkeley RAD Lab: Research in Internet-scale Computing Systems

Berkeley RAD Lab: Research in Internet-scale Computing Systems Randy H. Katz randy@cs.berkeley.edu 28 March 2007

Five Year Mission • Observation: Internet systems complex, fragile, manually managed, evolving rapidly • To scale Ebay, must build Ebay-sized company • To scale YouTube, get acquired by a Google-sized company • Mission: Enable a single person to create, evolve, and operate the next-generation IT service • “The Fortune 1 Million” by enabling rapid innovation • Approach: Create core technology spanning systems, networking, and machine learning • Focus: Making datacenter easier to manage to enable one person to Analyze, Deploy, Operate a scalable IT service

Jan 07 Announcements by Microsoft and Google • Microsoft and Google race to build next-gen DCs • Microsoft announces a $550 million DC in TX • Google confirms plans for a $600 million site in NC • Google plans two more DCs in SC; may cost another $950 million -- about 150,000 computers each • Internet DCs are the next computing platform • Power availability drives deployment decisions

Datacenter is the Computer • Google program == Web search, Gmail,… • Google computer == Warehouse-sized facilities and workloads likely more common (Luiz Barroso’s talk at RAD Lab, 12/11/06)

Sun Project Blackbox (10/17/06) • Compose datacenter from 20 ft. containers! • Power/cooling for 200 kW • External taps for electricity, network, cold water • 250 Servers, 7 TB DRAM, or 1.5 PB disk in 2006 • 20% energy savings • 1/10th? cost of a building

Datacenter Programming System See http://www.theserverside.com/news/thread.tss?thread_id=33120 See web2.wsj2.com/ruby_on_rails_11_web_20_on_rocket_fuel.htm • Ruby on Rails: open source Web framework optimized for programmer happiness and sustainable productivity: • Convention over configuration • Scaffolding: automatic, Web-based, UI to stored data • Program the client: write browser-side code in Ruby, compile to Javascript • “Duck Typing/Mix-Ins” • Proven Expressiveness • Lines of code Java vs. RoR: 3:1 • Lines of configuration Java vs. RoR: 10:1 • More than a fad • Java on Rails, Python on Rails, …

Datacenter Synthesis + OS • Synthesis: change DC via written specification • DC Spec Language compiled to logical configuration • OS: allocate, monitor, adjust during operation • Director using machine learning, Drivers send commands

“System” Statistical Machine Learning • S2ML Strengths • Handle SW churn: Train vs. write the logic • Beyond queuing models: Learns how to handle/make policy between steady states • Beyond control theory: Coping with complex cost functions • Discovery: Finding trends, needles in data haystack • Exploit cheap processing advances: fast enough to run online • S2ML as an integral component of DC OS

Datacenter Monitoring • S2ML needs data to analyze • DC components come with sensors already • CPUs (performance counters) • Disks (SMART interface) • Add sensors to software • Log files • DTrace for Solaris, Mac OS • Trace 10K++ nodes within and between DCs • *Trace: App-oriented path recording framework • X-Trace: Cross-layer/-domain including network layer

Middleboxes in Today’s DC • Middle boxes inserted on physical path • Policy via plumbing • Weakest link: 1 point of failure, bottleneck • Expensive to upgrade and introduce new functionality • Identity-based Routing Layer: policy not plumbing to route classified packets to appropriate middlebox services [Figure: high-speed network with intrusion detector, load balancer, firewall]

First Milestone: DC Energy Conservation • DCs limited by power • For each dollar spent on servers, add $0.48 (2005)/$0.71 (2010) for power/cooling • $26B spent to power and cool servers in 2005 grows to $45B in 2010 • Attractive application of S2ML • Bringing processor resources on/off-line: Dynamic environment, complex cost function, measurement-driven decisions • Preserve 100% Service Level Agreements • Don’t hurt hardware reliability • Then conserve energy • Conserve energy and improve reliability • MTTF: stress of on/off cycle vs. benefits of off-hours
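The spending figures on this slide imply a growth rate that is left implicit. A minimal Python sketch of the arithmetic, assuming simple compound annual growth between the two quoted years; the function name and the example server budget are illustrative, not from the talk.

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted on the slide: billions of dollars spent on power and cooling.
spend_2005 = 26.0
spend_2010 = 45.0
rate = annual_growth_rate(spend_2005, spend_2010, years=5)
print(f"Implied growth in power/cooling spend: {rate:.1%} per year")

# Power/cooling overhead per dollar spent on servers, also from the slide.
overhead_2005 = 0.48
overhead_2010 = 0.71
server_budget = 1_000_000  # hypothetical server purchase, in dollars
extra_2005 = server_budget * overhead_2005
extra_2010 = server_budget * overhead_2010
print(f"Power/cooling on a ${server_budget:,} purchase: "
      f"${extra_2005:,.0f} (2005) vs ${extra_2010:,.0f} (2010)")
```

Under these assumptions the overhead per server dollar grows by roughly half over the five years, which is the sense in which DCs are "limited by power".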

DC Networking and Power • Within DC racks, network equipment often the “hottest” components in the hot spot • Network opportunities for power reduction • Transition to higher speed interconnects (10 Gb/s) at DC scales and densities • High function/high power assists embedded in network element (e.g., TCAMs)

Thermal Image of Typical Cluster Rack [Figure labels: Rack, Switch] Source: M. K. Patterson, A. Pratt, P. Kumar, “From UPS to Silicon: an end-to-end evaluation of datacenter efficiency”, Intel Corporation

DC Networking and Power • Selectively power down ports/portions of net elements • Enhanced power-awareness in the network stack • Power-aware routing and support for system virtualization • Support for datacenter “slice” power down and restart • Application and power-aware media access/control • Dynamic selection of full/half duplex • Directional asymmetry to save power, e.g., 10Gb/s send, 100Mb/s receive • Power-awareness in applications and protocols • Hard state (proxying), soft state (caching), protocol/data “streamlining” for power as well as b/w reduction • Power implications for topology design • Tradeoffs in redundancy/high-availability vs. power consumption • VLANs support for power-aware system virtualization

Why University Research? • Imperative that future technical leaders learn to deal with scale in modern computing systems • Draw on talented but inexperienced people • Pick from worldwide talent pool for students & faculty • Don’t know what they can’t do • Inexpensive -- allows focus on speculative ideas • Mostly grad student salaries • Faculty part time • Tech Transfer engine • Success = Train students to go forth and replicate • Promiscuous publication, including source code • Ideal launching point for startups

Why a New Funding Model? • DARPA has exited long-term research in experimental computing systems • NSF swamped with proposals, yielding even more conservative decisions • Community emphasis on theoretical vs. experimental-oriented systems-building research • Alternative: turn to Industry for funding • Opportunity to shape research agenda

New Funding Model • 30 grad students + 5 undergrads + 6 faculty + 4 staff • Foundation Companies: $500K/yr for 5 years • Google, Microsoft, Sun Microsystems • Prefer founding partner technology in prototypes • Many from company attend retreats, advise on directions, get head start on research results • Putting IP in Public Domain so partners can use it without being sued • Large Affiliates $100K/yr: Fujitsu, HP, IBM, Siemens • Small Affiliates $50K/yr: Nortel, Oracle • State matching programs add $1M/year: MICRO, Discovery

Summary • “DC is the Computer” • OS: ML+VM, Net: Identity-based Routing, FS: Web Storage • Prog Sys: RoR, Libraries: Web Services • Development Environment: RAMP (simulator), AWE (tester), Web 2.0 apps (benchmarks) • Debugging Environment: *Trace + X-Trace • Milestones • DC Energy Conservation + Reliability Enhancement • Web 2.0 Apps in RoR

Conclusions • Develop-Analyze-Deploy-Operate modern systems at Internet scale • Ruby-on-Rails for rapid applications development • Declarative datacenter for correct-by-construction system configuration and operation • Resource management by System Statistical Machine Learning • Virtual Machines and Network Storage for flexible resource allocation • Power reduction and reliability enhancement by fast power-down/restart for processing nodes • Pervasive monitoring, tracing, simulation, workload generation for runtime analysis/operation

Discussion Points • Jointly designed datacenter testbed • Mini-DC consisting of clusters, middleboxes, and network equipment • Representative network topology • Power-aware networking • Evaluation of existing network elements • Platform for investigating power reduction schemes in network elements • Mutual information exchange • Network storage architecture • System Statistical Machine Learning

Ruby on Rails = DC PL • Reasons to love Ruby on Rails • Convention over Configuration • Rails framework feature enabled by Ruby language feature (Meta Object Programming) • Scaffolding: automatic, Web based, (pedestrian) User Interface to stored data • Program the client: v 1.1 write browser-side code in Ruby then compile to Javascript • “Duck Typing/Mix-Ins” • Looks like string, responds like string, it’s a string! • Mix-in improvement over multiple inheritance
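The duck typing and mix-in bullets can be illustrated with a short sketch. It is written in Python rather than Ruby, purely as an analogy, and every class and function name in it is hypothetical rather than taken from Rails or from the talk.

```python
class ShoutMixin:
    """Mix-in: adds shout() to any class that provides text()."""

    def shout(self):
        return self.text().upper() + "!"


class Greeting(ShoutMixin):
    def __init__(self, name):
        self.name = name

    def text(self):
        return f"hello {self.name}"


class LogLine(ShoutMixin):
    def __init__(self, message):
        self.message = message

    def text(self):
        return self.message


def render(item):
    # Duck typing: no type check here, only the expectation that
    # item responds to text() and shout().
    return f"{item.text()} -> {item.shout()}"


for item in (Greeting("rad lab"), LogLine("disk 7 offline")):
    print(render(item))
```

Greeting and LogLine share no base class other than the mix-in, yet render() treats them alike because both respond to the same calls. The mix-in contributes behavior without the ambiguity that full multiple inheritance of state can bring.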

DC Monitoring • Imagine a world where path information always passed along so that can always track user requests throughout system • Across apps, OS, network components and layers, different computers on LAN, … • Unique request ID • Components touched • Time of day • Parent of this request

*Trace: The 1% Solution • *Trace Goal: Make Path Based Analysis have low overhead so it can be always on inside datacenter • “Baseline” path info collection with ≤ 1% overhead • Selectively add more local detail for specific requests • *Trace: an end-to-end path recording framework • Capture & timestamp a unique requestID across all system components • “Top level” log contains path traces • Local logs contain additional detail, correlated to path ID • Built on X-trace
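The path recording idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the *Trace or X-Trace API: the record fields follow the slides (request ID, component touched, time, parent), while the function names, component names, and in-memory logs are hypothetical.

```python
import itertools
import time
import uuid

_event_ids = itertools.count(1)

top_level_log = []  # baseline path records, one per component touched
local_logs = {}     # component name -> list of (request_id, detail)


def record(request_id, component, parent_event=None, detail=None):
    """Append one path record; optionally store extra local detail."""
    event_id = next(_event_ids)
    top_level_log.append({
        "request_id": request_id,
        "event_id": event_id,
        "parent": parent_event,
        "component": component,
        "timestamp": time.time(),
    })
    if detail is not None:
        # Local detail stays out of the top-level log and is
        # correlated back to the path by request ID.
        local_logs.setdefault(component, []).append((request_id, detail))
    return event_id


def handle_request(verbose=False):
    """Simulate one request crossing three components."""
    request_id = uuid.uuid4().hex
    web = record(request_id, "web-frontend")
    app = record(request_id, "app-server", parent_event=web)
    record(request_id, "database", parent_event=app,
           detail="query took 3 ms" if verbose else None)
    return request_id


def path_of(request_id):
    """Reconstruct the ordered list of components a request touched."""
    return [r["component"] for r in top_level_log
            if r["request_id"] == request_id]


rid = handle_request(verbose=True)
print(path_of(rid))
```

The verbose flag stands in for "selectively add more local detail for specific requests": the baseline record is always written, and the costlier detail is opt-in.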

X-Trace: comprehensive tracing through Layers, Networks, Apps • Trace connectivity of distributed components • Capture causal connections between requests/responses • Cross-layer • Include network and middleware services such as IP and LDAP • Cross-domain • Multiple datacenters, composed services, overlays, mash-ups • Control to individual administrative domains • “Network path” sensor • Put individual requests/responses, at different network layers, in the context of an end-to-end request

Actuator: Policy-based Routing Layer • Assign ID to incoming packets (hash + table lookup) • Route based on IDs, not locations (i.e., not IP addr) • Sets up logical paths without changing network topology • Set of common middle boxes get single ID • No single weakest link: robust, scalable throughput • So simple can be done in FPGA? • More general than MPLS [Figure: Identity-based Routing Layer connecting Load-Balancer (IDLB), Intrusion-Detection (IDID), Service (IDS), and Firewall (IDF); packets tagged pkt (IDID, IDS) and pkt (IDF, IDLB)]
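A rough sketch of ID-based routing, assuming one plausible reading of "hash + table lookup": a table lookup classifies the packet into a chain of service IDs, and a flow hash picks one instance among the middleboxes sharing each ID. The policy table, instance names, and packet fields are all hypothetical.

```python
import hashlib

# Classification table: (protocol, destination port) -> ordered chain of IDs.
POLICY = {
    ("tcp", 80): ["ID_F", "ID_LB", "ID_S"],
    ("tcp", 22): ["ID_F", "ID_ID", "ID_S"],
}
DEFAULT_CHAIN = ["ID_F", "ID_S"]

# Each ID names a set of equivalent middlebox instances.
INSTANCES = {
    "ID_F": ["fw-1", "fw-2"],
    "ID_ID": ["ids-1"],
    "ID_LB": ["lb-1", "lb-2", "lb-3"],
    "ID_S": ["svc-1", "svc-2"],
}


def flow_hash(packet):
    """Stable hash of the flow tuple, so one flow always maps the same way."""
    key = (f"{packet['src']}:{packet['sport']}->"
           f"{packet['dst']}:{packet['dport']}/{packet['proto']}")
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def classify(packet):
    """Table lookup: which chain of service IDs applies to this packet."""
    return POLICY.get((packet["proto"], packet["dport"]), DEFAULT_CHAIN)


def route(packet):
    """Return the logical path as (ID, chosen instance) pairs."""
    h = flow_hash(packet)
    return [(sid, INSTANCES[sid][h % len(INSTANCES[sid])])
            for sid in classify(packet)]


pkt = {"src": "10.0.0.5", "sport": 40000,
       "dst": "10.1.0.9", "dport": 80, "proto": "tcp"}
print(route(pkt))
```

Changing policy means editing the tables, not rewiring: adding "fw-3" to the ID_F list adds firewall capacity without touching the topology, which is the "policy not plumbing" point.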

Other RAD Lab Projects • Research Accelerator for MP (RAMP)= DC simulator • Automatic Workload Evaluator (AWE)= DC tester • Web Storage (GFS, Bigtable, Amazon S3) = DC File System • Web Services (MapReduce, Chubby) = DC Libraries

1st Milestone: DC Energy Conservation • Good match to Machine Learning • An optimization, so imperfection not catastrophic • Lots of data to measure, dynamically changing workload, complex cost function • Not steady state, so not queuing theory • PG&E trying to change behavior of datacenters • Properly state problem: • Preserve 100% Service Level Agreements • Don’t hurt hardware reliability • Then conserve energy • Radical idea: can conserving energy improve hardware reliability?
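The ordering under "Properly state problem" reads as a lexicographic objective: SLA first, hardware reliability second, energy last. The sketch below shows that ordering with a hand-written rule, purely as an illustration; the talk proposes learning such policies with S2ML, and the capacity model, headroom, and per-day cycle budget here are hypothetical.

```python
import math


def servers_to_keep_on(predicted_load, capacity_per_server, total_servers,
                       currently_on, cycles_left_today, headroom_pct=20):
    """Pick how many servers stay powered on for the next interval."""
    # 1. SLA first: enough servers for the predicted load plus headroom.
    needed = math.ceil(predicted_load * (100 + headroom_pct)
                       / (100 * capacity_per_server))
    needed = min(max(needed, 1), total_servers)
    if needed >= currently_on:
        # Scale up (or hold) regardless of energy cost.
        return needed

    # 2. Reliability second: cap how many machines get power-cycled.
    allowed_off = min(currently_on - needed, cycles_left_today)

    # 3. Energy last: turn off as many as the first two rules allow.
    return currently_on - allowed_off


# Load of 1000 req/s, 100 req/s per server, 30 of 50 servers on,
# and a budget of 5 power-downs left for today.
print(servers_to_keep_on(1000, 100, 50, 30, 5))
```

Energy savings are taken only from the slack the first two constraints leave, so a misprediction costs some efficiency rather than an SLA violation, which matches the point that imperfection is not catastrophic.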

1st Milestone: Conserve Energy & Improve Reliability • Improve component reliability? • Disks: Lifetimes measured in Powered On Hours, but limited to 50,000 start/stop cycles • Idea: if disks are turned off 50% of the time, then ≈ 50% of the annual failure rate, as long as they don’t exceed 50,000 start/stop cycles (≈ once per hour) • Integrated Circuits: lifetimes affected by Thermal Cycling (fast change bad), Electromigration (turn off helps), Dielectric Breakdown (turn off helps) • Idea: If limited number of times cycled thermally, could cut IC failure rate due to EM, DB by ≈ 30%? See “A Case For Adaptive Datacenters To Conserve Energy and Improve Reliability,” Peter Bodik, Michael Armbrust, Kevin Canini, Armando Fox, Michael Jordan and David Patterson, 2007.
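The disk numbers can be checked with a few lines of arithmetic. This sketch takes the slide's simplification at face value, namely that failures accrue only during powered-on hours; the 4% baseline failure rate is a hypothetical input, not a figure from the talk.

```python
HOURS_PER_YEAR = 24 * 365
MAX_START_STOP_CYCLES = 50_000  # cycle limit quoted on the slide


def years_until_cycle_limit(cycles_per_hour):
    """Years of operation before the start/stop budget is used up."""
    return MAX_START_STOP_CYCLES / (cycles_per_hour * HOURS_PER_YEAR)


def scaled_failure_rate(baseline_afr, fraction_powered_on):
    """Annual failure rate if failures accrue only while powered on."""
    return baseline_afr * fraction_powered_on


print(f"{years_until_cycle_limit(1.0):.1f} years at one cycle per hour")
print(f"{scaled_failure_rate(0.04, 0.5):.1%} annual failure rate "
      f"at 50% powered-on time (hypothetical 4% baseline)")
```

Cycling once per hour spends the 50,000-cycle budget over several years, which is why the slide treats "≈ once per hour" as the ceiling on how often a disk can be powered down.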

RAD Lab 2.0 2nd Milestone: Killer Web 2.0 Apps • Demonstrate RAD Lab vision of 1 person creating next great service and scale up • Where get example great apps, given grad students creating the technology? • Use “Undergraduate Computing Clubs” to create exciting apps in RoR using RAD Lab equipment, technology • Armando Fox is RoR club leader • Recruited Real World RoR programmer to develop code and advise RoR computing club • ≈30 students joined club Jan 2007 • Hire best ugrads to build RoR apps in RAD Lab

Miracle of University Research • Talented (inexperienced) people • Pick from worldwide talent pool for students & faculty • Don’t know what they can’t do • Inexpensive • Mostly grad student salaries ($50k-$75k/yr overhead) • Faculty part time ($75k-$100k/yr including overhead) • Berkeley & Stanford Swing for Fences (R, not r or D) • Even if hit a single, train next generation of leaders • Technology Transfer engine • Success = Train students to go forth & multiply • Publish everything, including source code • Ideal launching point for startups

Chance to Partner with a Great University • Chance to Work on the “Next Great Thing” • US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford • Berkeley & Stanford some of the top suppliers of systems students to industry (and academia) • National Academy study mentions Berkeley in 7 of 19 $1B+ industries from IT research, Stanford 4 times • Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations (SUN), GUI, VLSI Design (Spice), RISC (ARM, MIPS, SPARC), Relational DB (Ingres/Postgres), Parallel DB, Data Mining, Parallel Computing, RAID, Portable Communication (BWRC), WWW, Speech Recognition, Broadband last mile (DSL)

Years to > $1B IT industry from Research Start [Figure source: National Research Council, Computer Science & Telecommunications Board, 2003]

Physical RAD Lab: Radical Collocation • Innovation from spontaneous meetings of people with different areas of expertise • Communication inversely proportional to distance • Almost never if > 100 feet or on different floor • Everyone (including faculty) in open offices • Great Meeting Rooms, Ubiquitous Whiteboards • Technology to concentrate: Cell phone, iPod, laptop • Google “Physical RAD Lab” to learn more

Example of Next Great Thing • Berkeley Reliable Adaptive Distributed systems Laboratory (“RAD Lab”) • Founded 12/2005 with Google, Microsoft, Sun as founding partners • Armando Fox, Randy Katz, Mike Jordan, Anthony Joseph, Dave Patterson, Scott Shenker, Ion Stoica • Google “RAD Lab” to learn more

RAD Lab Goal: Enable Next “Ebay” • Create technology to enable next great Internet Service to grow rapidly without growing the organization rapidly • Machine Learning + Systems is secret sauce • Position: “The datacenter is the computer” • Leverage point is simplifying datacenter management • What is the programming language of the datacenter? • What is CAD for the datacenter? • What is the OS for the datacenter?
