What is System Research? Why does it Matter? Zheng Zhang, Research Manager, System Research Group, Microsoft Research Asia
Outline • A perspective on computer systems research • By Roy Levin (Managing Director of MSR-SVC) • Overview of MSRA/SRG activities • Example projects
What is Systems Research? • What makes it research? • Something no one has done before. • Something that might not work. • Universities are known for doing Systems Research, but Industry does it too: • VAX Clusters (DEC) • Cedar (Xerox) • System R (IBM) • NonStop (Tandem) • And, more recently, Microsoft
What is Systems Research? • What specialties does it encompass? Computer Architecture, Networks, Operating Systems, Protocols, Programming Languages, Databases, Distributed Applications, Measurement, Security, Simulation (and others...) • Design + implementation + validation • (implementation includes simulation)
What’s Different About Systems Research Now? • Scale! • Geographic extent • “Machine room” systems are a research niche. • Administrative extent (multiple domains) • Validation at scale • A single organization often can’t do it. • Consortia, collaborations • “Test beds for hire” (e.g., PlanetLab) • Industrial systems as data sources • And perhaps as test beds?
What’s hot today, and will continue to be? • Distributed systems! • Web 2.0 is all about distributed systems: • Protocol/presentation revolution: HTML to XML, DHTML to AJAX, HTTP to RSS/Atom… • Service mash-ups are really happening • Infrastructure: • Huge clusters inside (e.g. MSN, Google, Yahoo) • Even bigger networks outside (e.g. P2P, social networks) • Very complex to understand • Fertile ground for advancing the theoretical/algorithmic aspects • Very challenging to build and test • That’s what research is about, isn’t it?
SRG research focus: the theory & practice of distributed systems research
• (Diagram) The brain (basic research: problems, “practical” theory) and the hand (systems/tools: solutions, exploratory systems/experiments, low-hanging fruits, “Inspector Morse” tool problems, “beauty contest” system problems) improve each other
• Large-scale wide-area P2P file sharing: Maze
• “Practical” theory work: failure model, membership protocol, distributed data structure, DHT spec, …
• Distributed system building package: WiDS
• Machine-room storage system and its applications: BitVault (HPC & end-user use)
Some projects in SRG • Building large-scale systems • BitVault, WiDS and BSR involvement • Large-scale P2P systems • A collaboration project with Beijing University • My view on Grid and P2P computing • Which one would you like to hear?
BitVault and WiDS (plus contributions from BSR)
BitVault: brick-based reliable storage for a huge amount of reference data
• Design points: low TCO, highly reliable, simple architecture ($400 a piece), adequate performance
• Entirely developed/maintained with WiDS
• (Architecture diagram) Catalog with check-in/check-out and delete; soft-state distributed index; object replication & placement; repair protocol; load balance; SOMO monitor; Membership and Routing Layer (MRL) with scalable broadcast, anti-entropy, and leafset protocols
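To make the check-in/check-out terminology concrete, here is a hypothetical sketch of what such an interface for immutable, content-addressed reference data might look like; all names and types are illustrative assumptions, not BitVault's actual API.

```cpp
// Hypothetical sketch (not BitVault's actual interface) of a check-in /
// check-out API for immutable, content-addressed reference data.
#include <array>
#include <cstdint>
#include <vector>

// A 160-bit content hash used as the object's permanent identifier (assumed).
using ObjectId = std::array<std::uint8_t, 20>;

class ReferenceStore {
public:
    virtual ~ReferenceStore() = default;

    // Check-in: hash the bytes, record the id in the catalog, and hand the
    // object to the replication & placement layer (e.g. k replicas on bricks).
    virtual ObjectId CheckIn(const std::vector<std::uint8_t>& bytes) = 0;

    // Check-out: resolve the id through the soft-state distributed index and
    // fetch the object from any live replica.
    virtual std::vector<std::uint8_t> CheckOut(const ObjectId& id) = 0;

    // Delete: drop the catalog entry; the repair protocol reclaims replicas lazily.
    virtual void Delete(const ObjectId& id) = 0;
};
```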
BitVault: repair speed
• Google File System: 440 MB/s in a 227-server cluster
• BitVault is comparable or better
• (Figures: performance under failure; repair rate vs. number of servers)
The “black art” of building a distributed system
• Typical pipeline: pseudo-code/protocol spec (TLA+, Spec#, SPIN) → small-scale simulation → implementation (v.001) → debug → implementation (v.01) → debug → implementation (v.1) → performance debug
• Unscalable “distributed” human “log mining”
• Non-deterministic bugs
• Code divergence
• Simulation at 1/1000 of the real deployment scale (esp. P2P)
Goal: a generic toolkit for integrated distributed system/protocol development • Reduce debugging pain • Spend as much energy in a single address space as possible • Isolate non-deterministic bugs and reproduce them in simulation • Take the human out of the log-mining business • … • Eliminate code divergence • Use the same code across development stages (e.g. simulation/executable) • Scale the performance study • Implement an efficient ultra-large-scale simulation platform • Interface with formal methods • Ultimately: TLA+ spec → implementation
WiDS: an API set for the programmer
• Application logic as a protocol instance: state machine, event-driven, object-oriented
• Periodic and one-time timers; PostMessage and async callbacks; message handlers
• Isolate the implementation from the runtime
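A minimal sketch of this event-driven model, assuming illustrative names and signatures (this is not the real WiDS interface): the application only writes handlers, and the runtime underneath, simulated or real, drives them.

```cpp
// Illustrative sketch of a WiDS-style protocol instance; names and signatures
// are assumptions made for this example.
#include <cstdint>
#include <iostream>
#include <string>

struct Message { std::uint64_t from; std::string payload; };

class ProtocolInstance {
public:
    virtual ~ProtocolInstance() = default;
    virtual void OnInit() = 0;                       // called once at start
    virtual void OnTimer(int timer_id) = 0;          // periodic or one-time timer fired
    virtual void OnMessage(const Message& msg) = 0;  // async message delivery

protected:
    // Services supplied by the runtime; stubbed here so the sketch compiles.
    void SetTimer(int timer_id, std::uint64_t delay_ms, bool periodic) {
        std::cout << "timer " << timer_id << " in " << delay_ms
                  << " ms, periodic=" << periodic << "\n";
    }
    void PostMessage(std::uint64_t dest_node, const std::string& payload) {
        std::cout << "send to node " << dest_node << ": " << payload << "\n";
    }
};

// Example: a heartbeat protocol written purely against the event API, so the
// same code can run under simulation (WiDS-Dev) or deployment (WiDS-Comm).
class Heartbeat : public ProtocolInstance {
    void OnInit() override { SetTimer(1, 1000, /*periodic=*/true); }
    void OnTimer(int) override { PostMessage(0, "ping"); }
    void OnMessage(const Message& msg) override { (void)msg; /* record peer liveness */ }
};
```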
WiDS as a development environment
• (Diagram) Many protocol instances, each with its timers and message handlers, run on WiDS-Dev inside one address space, driven by an event wheel and a network model
• Single-address-space debugging of multiple instances
• Small-scale simulation (~10K instances)
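A back-of-envelope sketch of the event wheel idea: a discrete-event loop in one process that advances simulated time and dispatches timer and message events to the instances. The delay model and all names are my assumptions, not WiDS internals.

```cpp
// Sketch of a single-address-space event wheel: all instances' timers and
// messages become events ordered by simulated time; a toy network model
// assigns each message a delivery delay. Illustrative only.
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Event {
    std::uint64_t fire_at_ms;           // simulated time at which the event fires
    std::function<void()> action;       // deliver a message or fire a timer
    bool operator>(const Event& other) const { return fire_at_ms > other.fire_at_ms; }
};

class EventWheel {
public:
    void Schedule(std::uint64_t delay_ms, std::function<void()> action) {
        queue_.push(Event{now_ms_ + delay_ms, std::move(action)});
    }
    // A message between simulated nodes is just a future event whose delay
    // comes from the network model (here a constant 5 ms).
    void SendMessage(std::function<void()> deliver) { Schedule(5, std::move(deliver)); }

    void Run() {
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            now_ms_ = e.fire_at_ms;     // advance simulated time, then dispatch
            e.action();
        }
    }

private:
    std::uint64_t now_ms_ = 0;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue_;
};
```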
WiDS as a deployment environment
• (Diagram) The same protocol instances, timers, and message handlers now run on WiDS-Comm over the real network
• Ready to run!
The WiDS-enabled process
• (Diagram) WiDS starts at the protocol spec; the same implementation flows through debugging (WiDS-Dev), performance evaluation at large scale (WiDS-Par), and deployment/optimization (WiDS-Comm)
• No code divergence
• Large-scale study
• Virtualizes the distributed-system debugging process
• WiDS-Par has been used to test 2 million real instances using 250+ PCs
What makes a storage system reliable? • MTTDL: Mean Time To Data Loss — “After the system is loaded with data objects, how long, on average, can it run before it permanently loses the first data object?” • Two factors: • Data repair speed • Sensitivity to concurrent failures
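A rough way to see why both factors matter (a back-of-envelope sketch, not the paper's actual model): with r replicas per object and roughly independent brick failures of mean lifetime MTTF, an object is lost only if the other r − 1 replicas also die inside the repair window, so

```latex
P(\text{loss after one brick failure}) \;\approx\; \left(\frac{T_{\text{repair}}}{\text{MTTF}}\right)^{r-1}
```

Shrinking the repair time shrinks this window, while correlated (concurrent) failures make the independence assumption optimistic; those are exactly the two knobs the placement comparison below trades off.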
Sequential Placement • Pro: Low likelihood of data loss when concurrent failures occur
Repair in Sequential Placement • Con: Low degree of repair parallelism, so repair is slow and the window of exposure to concurrent failures is relatively long
Random Placement • Con: sensitive to concurrent failures
Repair in Random Placement • Pro: High degree of repair parallelism, so repair is fast and the window of exposure to concurrent failures is short
Comparison (MTTF = 1000 days, B = 3 GB/s, b = 20 MB/s, c = 500 GB, 1 PB of user data): random placement is better with large object sizes, but worse with small object sizes
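For intuition on the numbers, here is my own back-of-envelope illustration with the slide's parameters (not the ICDCS'05 model), assuming random placement repairs a failed brick at the aggregate bandwidth B while sequential placement is limited to roughly one brick's bandwidth b.

```cpp
// Back-of-envelope repair-time comparison using the slide's parameters.
// The bandwidth assumptions above are simplifications for illustration.
#include <cstdio>

int main() {
    const double c_gb   = 500.0;    // data per brick to re-replicate (GB)
    const double B_gbps = 3.0;      // aggregate repair bandwidth, random placement (GB/s)
    const double b_gbps = 0.020;    // per-brick bandwidth, sequential placement (GB/s)

    const double t_random     = c_gb / B_gbps;   // ~167 s
    const double t_sequential = c_gb / b_gbps;   // ~25,000 s (~7 hours)

    std::printf("random placement repair:     %.0f s\n", t_random);
    std::printf("sequential placement repair: %.0f s (%.1f hours)\n",
                t_sequential, t_sequential / 3600.0);
    // The shorter window of vulnerability is what improves MTTDL -- up to the
    // point where repair traffic saturates the available bandwidth.
    return 0;
}
```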
ICDCS’05 result summary • Established the first framework for analyzing object placement’s impact on reliability • Upshot: spread your replicas as widely as you can, up to the point where bandwidth is fully utilized for repair; spreading beyond that hurts reliability • Core algorithm adopted by many MSN large-scale storage projects/products
Ongoing work: • More on object placement: • We are looking at systems with extreme longevity • Heterogeneous capacity and other dynamics have not been factored in • Improving WiDS further: • It is still so hard to debug!! • Ideas: • A replay facility that takes logs from deployment • Time-travel inside simulation • Use model and invariant checkers to identify fault locations and paths • See the SOSP’05 poster
Maze (with the Beijing Univ. Maze team)
Maze File Sharing System • The largest in China • On CERNET, popular with college students • Population: 1.4 million registered accounts; 30,000+ online users • More than 200 million files • More than 13 TB (!) transferred every day • Completely developed, operated and deployed by an academic team • Logs added since the collaboration with MSRA last year • Enables detailed study from all angles
A Rare System for Academic Studies • WORLD’04: system architecture • IPTPS’05: the “free-rider” problem • AEPP’05: statistics of shared objects and traffic patterns • Incentives to promote sharing; collusion and cheating • Trust and fairness • Can we defeat collusion?
Maze Architecture: the Server Side • Just like Napster… • Historical reason: a P2P sharing add-on for the T-net FTP search engine • Not a DHT!
Maze: Incentive Policies • New users: points == 4096 • Point changes: • Uploads: +1.5 points per MB • Downloads: at most −1.0 point per MB • Gives users more motivation to contribute • Service differentiation: • Order download requests by T = Now − 3·log(Points) • First-come-first-served plus more-points-served-first • Users with P < 512 are capped at a download bandwidth of 200 Kb/s • Available since Maze 5.0.3; extensively discussed in the Maze forum before being implemented
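A small sketch of these rules exactly as stated on the slide; the struct and function names are illustrative, the log base in 3·log(Points) is assumed to be natural, and the download penalty is applied at exactly −1.0 point/MB for simplicity.

```cpp
// Sketch of the Maze-style incentive rules from the slide; names are illustrative.
#include <algorithm>
#include <cmath>
#include <vector>

struct Request {
    double arrival_time_s;   // "Now" when the request was queued
    double points;           // requester's current point balance
};

// Accounting: start at 4096, +1.5 points per MB uploaded, -1.0 per MB downloaded
// (the slide says "at most -1.0"; applied exactly here for simplicity).
double UpdatePoints(double points, double uploaded_mb, double downloaded_mb) {
    return points + 1.5 * uploaded_mb - 1.0 * downloaded_mb;
}

// Service differentiation: order pending requests by T = Now - 3*log(points),
// i.e. first-come-first-served biased toward users with more points.
void OrderRequests(std::vector<Request>& queue) {
    std::sort(queue.begin(), queue.end(), [](const Request& a, const Request& b) {
        const double ta = a.arrival_time_s - 3.0 * std::log(a.points);
        const double tb = b.arrival_time_s - 3.0 * std::log(b.points);
        return ta < tb;      // smaller T is served first
    });
}

// Bandwidth cap for low-point users (P < 512): 200 Kb/s; others effectively uncapped.
double DownloadCapKbps(double points) { return points < 512.0 ? 200.0 : 1e9; }
```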
Collusion Behavior in Maze (Partial Results) • (Figures) 221,000 pairs with duplication degree > 1; the top 100 links carry the most redundant traffic • The first-ever study of its kind • Modeling and simulation done • Deployment and measurement within two months
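As a purely illustrative sketch of how such a "duplication degree" might be computed from transfer logs (the log format and the exact definition here are my assumptions, not the actual Maze analysis):

```cpp
// Flag candidate collusion: for each (uploader, downloader) pair, compare total
// traffic against the traffic of distinct files; degree > 1 means the pair
// re-transferred the same content. Definitions are assumptions for illustration.
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

struct Transfer { std::uint64_t uploader, downloader, file_id; double mb; };

std::map<std::pair<std::uint64_t, std::uint64_t>, double>
DuplicationDegree(const std::vector<Transfer>& log) {
    std::map<std::tuple<std::uint64_t, std::uint64_t, std::uint64_t>, double> per_file;
    std::map<std::pair<std::uint64_t, std::uint64_t>, double> total, distinct;
    for (const auto& t : log) {
        const auto key = std::make_tuple(t.uploader, t.downloader, t.file_id);
        // Count a file's size toward "distinct" only the first time the pair moves it.
        if (per_file.find(key) == per_file.end()) distinct[{t.uploader, t.downloader}] += t.mb;
        per_file[key] += t.mb;
        total[{t.uploader, t.downloader}] += t.mb;
    }
    std::map<std::pair<std::uint64_t, std::uint64_t>, double> degree;
    for (const auto& [pair, mb] : total) degree[pair] = mb / distinct[pair];
    return degree;
}
```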
The Ecosystem of Maze Research
• (Diagram) A loop: the problem/application and deployed system (System0) produce logs; logs drive simulation/development of models (Model0, Model1); models feed back into the next deployment
• Common to all systems research
• More difficult in a live system: you can’t go back!
Knowing the Gap is Often More Important (or you risk falling off the cliff!) • The gap is often a manifestation of physical law (the speed of light) • The gap between wide-area (Grid) and cluster/HPC can be just as wide as between HPC and sensor networks • Many impossibility results exist • Negative results are not a bad thing • The bad thing is that many are unaware of them! • Examples: • The impossibility of consensus in an asynchronous network (FLP) • The impossibility of achieving consistency, availability, and partition tolerance simultaneously (CAP)
What is Grid? Or, the Problem that I See • Historically associated with HPC • Dressed up when running short on gas • Problematically borrows concepts from an environment governed by different laws • The Internet as a grand JVM is unlikely • We need to extract common system infrastructure after gaining enough application experience • Sharing and collaboration are labels applied without careful investigation • Where/what is the 80-20 sweet spot? • Likewise, adding the P2P spin should be done carefully
What is Grid? (cont.) • Grid <= HPC + Web Services • HPC isn’t done yet; look at Google • Why? • You need HPC to run the apps, or store the data • Services have clear boundaries • Interoperable protocols bind the services together
P2P computing: inspiration from Cellular Automata [A New Kind of Science, Wolfram, 2002] • Program/computation similar to traditional parallel computing logic (see the sketch below):
read input data
for a while {
  compute
  output data region
  input edge regions
}
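A minimal sketch of that loop for a 2D cellular automaton split into per-process blocks; the exchange function is a stub standing in for whatever transport (LAN, WAN, P2P) carries the edge regions, and the update rule is a toy example.

```cpp
// Compute/exchange loop for one process's n x n block of a 2D CA, with a
// 1-cell halo border holding neighbors' edge regions. Illustration only.
#include <cstddef>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<int>>;   // (n+2) x (n+2): interior plus halo

// Placeholder: send our boundary rows/columns to neighbors and receive theirs
// into the halo. Only the O(n) edge cells cross the network each step.
void ExchangeEdgeRegions(Grid& block) { (void)block; /* transport-specific */ }

int NextState(const Grid& g, std::size_t r, std::size_t c) {
    // Toy rule: a cell becomes the majority of itself and its 4 neighbors.
    const int sum = g[r][c] + g[r-1][c] + g[r+1][c] + g[r][c-1] + g[r][c+1];
    return sum >= 3 ? 1 : 0;
}

void Run(Grid& block, int steps) {
    const std::size_t n = block.size() - 2;       // interior size (halo excluded)
    for (int s = 0; s < steps; ++s) {
        ExchangeEdgeRegions(block);               // "input edge regions"
        Grid next = block;
        for (std::size_t r = 1; r <= n; ++r)      // "compute": O(n^2) interior work
            for (std::size_t c = 1; c <= n; ++c)
                next[r][c] = NextState(block, r, c);
        block = std::move(next);                  // "output data region"
    }
}
```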
Many Applications Follow the Same Model • Enterprise computing • MapReduce and similar tasks in data processing • Sorting and querying • Coarse-Grain Scientific Computing • Engineering and product design • Meteorological simulation • Molecular biology simulation • Bioinformatics computation • Hunting for the next low-hanging fruits after seti@home, folding@home
WAN/LAN does not matter when C/B is large; more processes matter! • Is it feasible? Consider a 2D CA of size N×N partitioned into n×n blocks (N >> n), run over a LAN vs. a WAN • (Diagram axis: computing density / traffic (instr/byte), C0/B0)
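A quick back-of-envelope reading of the 2D CA example (my own sketch of the slide's argument): for an n × n block, each step does interior work proportional to the area but only ships boundary cells,

```latex
\frac{C}{B} \;\propto\; \frac{\text{compute per step}}{\text{boundary traffic per step}}
\;=\; \frac{O(n^2)}{O(n)} \;=\; O(n)
```

so large enough blocks push the compute-to-traffic ratio C/B past the machine's own compute-to-bandwidth ratio C0/B0, at which point even slow wide-area links stay off the critical path and adding more processes is what helps.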
What’s the point of Grid, or not-Grid? • Copying ready-made concepts across contexts is easy, but it often does not work: each context is governed by the laws of physics • We need to start building and testing applications • Only then can we define what a “Grid OS” is truly about