Recent Progress on Scaleable Servers Jim Gray, Microsoft Research.
Recent Progress on Scaleable Servers
Jim Gray, Microsoft Research
Substantial progress has been made toward the goal of building supercomputers by composing arrays of commodity processors, disks, and networks into a cluster that provides a single system image. True, vector supers are still 10x faster than commodity processors on certain floating-point computations, but they cost disproportionately more. Indeed, the highest-performance computations are now performed by processor arrays. In the broader context of business and internet computing, processor arrays long ago surpassed mainframe performance, and at a tiny fraction of the cost. This talk first reviews this history and describes the current landscape of scaleable servers in the commercial, internet, and scientific segments. The talk then discusses the Achilles' heels of scaleable systems: programming tools and system management. There has been relatively little progress in either area, which suggests some important directions for computer systems research.
Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, mail, fileserver,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects
Scaleability: Scale Up and Scale Out • Grow up with SMP: a 4xP6 SMP is now standard (personal system → departmental server → SMP super server) • Grow out with a cluster: a cluster of PCs is built from inexpensive parts
Key Technologies • Hardware: commodity processors, nUMA, smart storage, SAN/VIA • Software: directory services, security domains, process/data migration, load balancing, fault tolerance, RPC/objects, streams/rivers
MAPS - The Problems • Manageability: N machines are N times harder to manage • Availability: N machines fail N times more often • Programmability: N machines are 2N times harder to program • Scaleability: N machines cost N times more but do little more work.
Manageability • Goal: Systems self managing • N systems as easy to manage as one system • Some progress: • Distributed name servers (gives transparent naming) • Distributed security • Auto cooling of disks • Auto scheduling and load balancing • Global event log (reporting) • Automate most routine tasks • Still very hard and app-specific
Availability • Redundancy allows failover/migration (processes, disks, links) • Good progress on technology (theory and practice) • Migration also good for load balancing • Transaction concept helps exception handling
Programmability & Scaleability • That’s what the rest of this talk is about • Success on embarrassingly parallel jobs • file server, mail, transactions, web, crypto • Limited success on “batch” • relational DBMs, PVM,..
Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, mail, fileserver,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects
Scaleup Has Limits (chart courtesy of Catharine Van Ingen) • Vector supers ~ 10x supers: ~3 GFlops, bus/memory ~20 GBps, IO ~1 GBps • Supers ~ 10x PCs: ~300 MFlops, bus/memory ~2 GBps, IO ~1 GBps • PCs are slow: ~30 MFlops, bus/memory ~200 MBps, IO ~100 MBps
Loki: Pentium Clusters for Science (http://loki-www.lanl.gov/) • 16 Pentium Pro processors x 5 Fast Ethernet interfaces + 2 GB RAM + 50 GB disk + 2 Fast Ethernet switches + Linux = 1.2 real GFlops for $63,000 (but that is the 1996 price) • The Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html • Scientists want cheap MIPS.
Your Tax Dollars At Work: ASCI for Stockpile Stewardship • Intel/Sandia: 9000 x 1-node PPro • LLNL/IBM: 512 x 8 PowerPC (SP2) • LANL/Cray: ? • Maui Supercomputer Center: 512 x 1 SP2
TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA) • [Chart: number of TOP500 systems per vendor (CRI, SGI, IBM, HP, Convex, Sun, TMC, Intel, DEC, Japanese vector machines, other) from Jun-93 through Jun-98] • TOP500 Reports: http://www.netlib.org/benchmark/top500.html
NCSA Super Cluster • National Center for Supercomputing Applications, University of Illinois @ Urbana • 512 Pentium II cpus, 2,096 disks, SAN • Compaq + HP + Myricom + WindowsNT • A supercomputer for $3M • Classic Fortran/MPI programming • DCOM programming model • http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
A Variety of Discipline Codes: Single-Processor Performance, Origin vs. T3E (nUMA vs. UMA) (courtesy of Larry Smarr, NCSA)
Basket of Applications: Average Performance as Percentage of Linpack Performance (courtesy of Larry Smarr, NCSA) • Application codes: CFD, Biomolecular, Chemistry, Materials, QCD • [Chart: the codes achieve roughly 14%-33% of Linpack performance (values shown: 14%, 19%, 22%, 25%, 26%, 33%)]
Observations • Uniprocessor RAP << PAP: real application performance << peak advertised performance • Growth has slowed. Bell Prize: 1987: 0.5 GFLOPS; 1988: 1.0 GFLOPS (1 year later); 1990: 14 GFLOPS (2 years); 1994: 140 GFLOPS (4 years); 1998: 604 GFLOPS; 1 TFLOPS in another 5 years? • Time gap ≈ 2^(N-1) years, where N = (log(performance) - 9)
“Commercial” Clusters • 16-node cluster: 64 cpus, 2 TB of disk, decision support • 45-node cluster: 140 cpus, 14 GB DRAM, 4 TB RAID disk, OLTP (Debit Credit), 1 billion transactions per day (14 ktps)
Oracle/NT • 27,383 tpmC • $71.50/tpmC • 4 x 6 cpus • 384 disks = 2.7 TB
Microsoft.com: ~150x4 nodes (site-architecture diagram) • Data centers: Internal WWW (Building 11), MOSWest, European Data Center, Japan Data Center, plus IDC staging servers • Server roles: www.microsoft.com, home.microsoft.com, search.microsoft.com, premium.microsoft.com, register.microsoft.com, support.microsoft.com, msid.msn.com, activex.microsoft.com, cdm.microsoft.com, FTP.microsoft.com, FTP/HTTP download servers, SQL Servers, SQL consolidators and SQL reporting, staging and feeder servers • Typical configuration: 4xP5 or 4xP6, 256-512 MB RAM, 12-160 GB disk; average cost $25K-$83K per server • Network: switched Ethernet and FDDI rings behind routers and Gigaswitches; Internet connectivity via 2 primary OC3 links (100 Mb/s each) and 13 secondary DS3 links (45 Mb/s each)
The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8 x 400 MHz Alpha cpus • 10 GB DRAM • 324 9.2 GB StorageWorks disks: 3 TB raw, 2.4 TB of RAID5 • STK 9710 tape robot (4 TB) • WindowsNT 4 EE, SQL Server 7.0
TerraServer Example: Lots of Web Hits • 1 TB, the largest SQL DB on the Web • 99.95% uptime since 1 July 1998 • No downtime in August • No NT failures (ever); most downtime is for SQL software upgrades
Traffic (Total / Average / Peak): Hits 913 M / 10.3 M / 29 M; Queries 735 M / 8.0 M / 18 M; Images 359 M / 3.0 M / 9 M; Page Views 405 M / 5.0 M / 9 M
Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, mail, fileserver,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects
Two Generic Kinds of computing • Many little • embarrassingly parallel • Fit RPC model • Fit partitioned data and computation model • Random works OK • OLTP, File Server, Email, Web,….. • Few big • sometimes not obviously parallel • Do not fit RPC model (BIG rpcs) • Scientific, simulation, data mining, ...
Many Little Programming Model • many small requests • route requests to data • encapsulate data with procedures (objects) • three-tier computing • RPC is a convenient/appropriate model • Transactions are a big help in error handling • Auto partition (e.g. hash data and computation) • Works fine. • Software CyberBricks
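To make the "many little" model concrete, here is a minimal C++ sketch (my illustration, not code from the talk) of hash-partitioned request routing: each small request's key determines the node that owns the data, and the request is shipped there as an RPC. The Request struct, ownerOf(), and sendRpc() stand-in are hypothetical names.

    // Minimal sketch of the "many little" model: hash each request's key to
    // pick the partition/node that owns the data, then ship the small request
    // there as an RPC. sendRpc() is a local stand-in for a real remote call.
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    struct Request { std::string key; std::string op; };

    static int ownerOf(const std::string& key, int nodeCount) {
        // Hash partitioning: the same key always routes to the same node.
        return static_cast<int>(std::hash<std::string>{}(key) % nodeCount);
    }

    static void sendRpc(int node, const Request& r) {
        // Stand-in for a real RPC (DCOM/CORBA/HTTP call to the owning node).
        std::printf("node %d executes %s(%s)\n", node, r.op.c_str(), r.key.c_str());
    }

    int main() {
        const int nodes = 4;                      // cluster size (illustrative)
        std::vector<Request> work = {
            {"cust:1001", "debit"}, {"cust:2002", "credit"}, {"mail:alice", "fetch"}};
        for (const auto& r : work)                // each little job is independent,
            sendRpc(ownerOf(r.key, nodes), r);    // so N nodes give ~N-fold throughput
    }

Because every request is small and independent, adding nodes just adds more owners of more partitions; this is the sense in which transaction, mail, file, and web serving scale out naturally.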
Object Oriented Programming: Parallelism From Many Little Jobs • Gives location transparency • The ORB / web server / TP monitor multiplexes clients to servers • Enables distribution • Exploits embarrassingly parallel apps (transactions) • HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis
Few Big Programming Model • Finding parallelism is hard • Pipelines are short (3x …6x speedup) • Spreading objects/data is easy, but getting locality is HARD • Mapping big job onto cluster is hard • Scheduling is hard • coarse grained (job) and fine grain (co-schedule) • Fault tolerance is hard
Kinds of Parallel Execution • Pipeline: one sequential program feeds its output to the next sequential program • Partition: inputs are split N ways and outputs are merged M ways, with many copies of an ordinary sequential program each working on one partition
Why Parallel Access To Data? • At 10 MB/s it takes 1.2 days to scan a terabyte; with 1,000-way parallelism it becomes a 100-second scan • Bandwidth parallelism: divide a big problem into many smaller ones to be solved in parallel.
Why are Relational Operators Successful for Parallelism? • The relational data model gives uniform operators on uniform data streams • Closed under composition • Each operator consumes 1 or 2 input streams • Each stream is a uniform collection of data • Sequential data in and out: pure dataflow • Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation • AUTOMATIC PARALLELISM
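As an illustration of why closure under composition enables automatic parallelism, here is a small C++ sketch (assumed, not from the talk) of the iterator style of operator: every operator pulls records from an input stream and yields a record stream, so the same composed plan can run unchanged on each data partition. Operator, Scan, and Filter are illustrative names.

    // Every operator consumes record streams and produces a record stream,
    // so a plan is just operators plugged together.
    #include <cstdio>
    #include <functional>
    #include <memory>
    #include <optional>
    #include <vector>

    using Row = int;                                   // toy record type

    struct Operator {                                  // uniform stream interface
        virtual std::optional<Row> next() = 0;         // pull one record, or end
        virtual ~Operator() = default;
    };

    struct Scan : Operator {                           // leaf: scan one partition
        std::vector<Row> rows; size_t i = 0;
        explicit Scan(std::vector<Row> r) : rows(std::move(r)) {}
        std::optional<Row> next() override {
            return i < rows.size() ? std::optional<Row>(rows[i++]) : std::nullopt;
        }
    };

    struct Filter : Operator {                         // interior: wraps any input
        std::unique_ptr<Operator> in; std::function<bool(Row)> pred;
        Filter(std::unique_ptr<Operator> i, std::function<bool(Row)> p)
            : in(std::move(i)), pred(std::move(p)) {}
        std::optional<Row> next() override {
            while (auto r = in->next()) if (pred(*r)) return r;
            return std::nullopt;
        }
    };

    int main() {
        // Compose Filter(Scan(partition)). The identical plan can be started
        // on every partition of the data, which is where the parallelism comes from.
        auto plan = Filter(std::make_unique<Scan>(std::vector<Row>{1, 5, 8, 12}),
                           [](Row r) { return r > 4; });
        while (auto r = plan.next()) std::printf("%d\n", *r);
    }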
Database Systems “Hide” Parallelism • Automate system management via tools • data placement • data organization (indexing) • periodic tasks (dump / recover / reorganize) • Automatic fault tolerance • duplex & failover • transactions • Automatic parallelism • among transactions (locking) • within a transaction (parallel execution)
SQL: a Non-Procedural Programming Language • SQL is a functional programming language: it describes the answer set • The optimizer picks the best execution plan: the data-flow web (pipeline), the degree of parallelism (partitioning), and other execution parameters (process placement, memory, ...) • (diagram: GUI and schema feed the optimizer, which produces a plan; executors and rivers carry out the plan under execution planning and monitoring)
Partitioned Execution • Spreads computation and IO among processors • Partitioned data gives NATURAL parallelism
Partitioned and Pipelined Data Flows • N x M-way parallelism: N inputs, M outputs, no bottlenecks • Partitioned data
Automatic Parallel Object Relational DB • Select image from landsat where date between 1970 and 1990 and overlaps(location, :Rockies) and snow_cover(image) > .7; • The query mixes temporal, spatial, and image predicates • Assign one process per processor/disk: find images with the right date & location; analyze the image and, if it is 70% snow, return it • (diagram: the Landsat table, with date, location, and image columns, is partitioned across disks; each partition runs the date, location, & image tests and contributes to the answer)
Data Rivers: Split + Merge Streams • N x M data streams between N producers and M consumers • Producers add records to the river; consumers consume records from the river • Purely sequential programming • The river does flow control and buffering, and does partition and merge of data records • River = Split/Merge in Gamma = Exchange operator in Volcano / SQL Server
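A toy C++ sketch of the river idea (my own illustration, not Gamma or SQL Server code): producers put records, the river partitions them into per-consumer buffers, and consumers drain their own stream, so both sides stay purely sequential. A real river would add threading, flow control, and network transport between nodes.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <string>
    #include <vector>

    struct Record { std::string key; int value; };

    class River {
        std::vector<std::deque<Record>> queues;        // one buffer per consumer
    public:
        explicit River(int consumers) : queues(consumers) {}
        void put(const Record& r) {                    // producer side: split
            size_t dest = std::hash<std::string>{}(r.key) % queues.size();
            queues[dest].push_back(r);
        }
        bool get(int consumer, Record& out) {          // consumer side: drain
            auto& q = queues[consumer];
            if (q.empty()) return false;
            out = q.front(); q.pop_front();
            return true;
        }
    };

    int main() {
        River river(2);                                // M = 2 consumers
        for (int p = 0; p < 3; ++p)                    // N = 3 sequential producers
            river.put({"key" + std::to_string(p), p * 10});
        for (int c = 0; c < 2; ++c) {                  // each consumer reads its stream
            Record r;
            while (river.get(c, r))
                std::printf("consumer %d got %s=%d\n", c, r.key.c_str(), r.value);
        }
    }

The point of the abstraction is that neither producer nor consumer code mentions partitioning or the other side's degree of parallelism; that is decided when the river is instantiated.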
Generalization: Object-oriented Rivers • Rivers transport sub-class of record-set (= stream of objects) • record type and partitioning are part of subclass • Node transformers are data pumps • an object with river inputs and outputs • do late-binding to record-type • Programming becomes data flow programming • specify the pipelines • Compiler/Scheduler does data partitioning and “transformer” placement
NT Cluster Sort as a Prototype • Using • data generation and • sort as a prototypical app • “Hello world” of distributed processing • goal: easy install & execute
PennySort • Hardware • 266 Mhz Intel PPro • 64 MB SDRAM (10ns) • Dual Fujitsu DMA 3.2GB EIDE • Software • NT workstation 4.3 • NT 5 sort • Performance • sort 15 M 100-byte records (~1.5 GB) • Disk to disk • elapsed time 820 sec • cpu time = 404 sec
Remote Install • Add a Registry entry to each remote node using RegConnectRegistry() and RegCreateKeyEx() (see the sketch below)
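A hedged Win32 sketch of that step, assuming a made-up machine name (\\node01) and registry path (SOFTWARE\ClusterSort); the real installer's key names are not given in the talk.

    #include <windows.h>
    #include <stdio.h>

    int main() {
        HKEY remoteRoot = NULL, appKey = NULL;
        DWORD disposition = 0;

        // Attach to HKEY_LOCAL_MACHINE on the remote node (requires admin
        // rights and the remote registry service running there).
        LONG rc = RegConnectRegistryW(L"\\\\node01", HKEY_LOCAL_MACHINE, &remoteRoot);
        if (rc != ERROR_SUCCESS) { printf("connect failed: %ld\n", rc); return 1; }

        // Create (or open) the application's configuration key on that node.
        // "SOFTWARE\\ClusterSort" is a placeholder path for illustration.
        rc = RegCreateKeyExW(remoteRoot, L"SOFTWARE\\ClusterSort", 0, NULL,
                             REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                             &appKey, &disposition);
        if (rc == ERROR_SUCCESS) {
            printf("key %s\n",
                   disposition == REG_CREATED_NEW_KEY ? "created" : "opened");
            RegCloseKey(appKey);
        }
        RegCloseKey(remoteRoot);
        return rc == ERROR_SUCCESS ? 0 : 1;
    }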
Cluster Startup / Execution • Setup: a MULTI_QI struct and a COSERVERINFO struct per node • CoCreateInstanceEx() activates the remote object • Retrieve the remote object handle from the MULTI_QI struct • Invoke methods (e.g. Sort()) as usual (see the sketch below)
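A hedged C++ sketch of the activation sequence; the CLSID, the ISort interface, and the node name are placeholders, since the talk does not give the component's actual identifiers.

    #include <windows.h>
    #include <objbase.h>
    #include <stdio.h>

    // Placeholder identifiers for illustration only.
    static const CLSID CLSID_ClusterSort =
        {0x11111111,0x2222,0x3333,{0x44,0x44,0x55,0x55,0x55,0x55,0x55,0x55}};
    static const IID IID_ISort =
        {0x66666666,0x7777,0x8888,{0x99,0x99,0xaa,0xaa,0xaa,0xaa,0xaa,0xaa}};

    int main() {
        CoInitializeEx(NULL, COINIT_MULTITHREADED);

        wchar_t node[] = L"node01";               // hypothetical remote node
        COSERVERINFO server = {};                 // which node to activate on
        server.pwszName = node;

        MULTI_QI qi = {};                         // which interface we want back
        qi.pIID = &IID_ISort;

        HRESULT hr = CoCreateInstanceEx(CLSID_ClusterSort, NULL,
                                        CLSCTX_REMOTE_SERVER, &server, 1, &qi);
        if (SUCCEEDED(hr) && SUCCEEDED(qi.hr)) {
            IUnknown* sorter = qi.pItf;           // proxy to the remote object
            // ... invoke Sort() through the (placeholder) ISort vtable as usual ...
            sorter->Release();
        } else {
            printf("activation failed: 0x%08lx\n", (unsigned long)hr);
        }
        CoUninitialize();
        return 0;
    }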
Cluster Sort Conceptual Model • Multiple data sources, multiple data destinations, multiple nodes • Disks -> sockets -> disk -> disk • (diagram: each node reads a mixed stream of A, B, and C records from its disks and splits them over sockets so that all the A records land on one node, all the B records on another, and all the C records on a third, where they are sorted and written back to disk)
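A small C++ sketch (illustrative only, not the actual NT cluster sort code) of the routing decision implied by that picture: range-partition each record by its key so all "A" records go to node 0, "B" to node 1, and "C" to node 2.

    #include <cstdio>
    #include <string>
    #include <vector>

    static int destinationNode(const std::string& key, int nodes) {
        // Range partition on the leading byte; a real sort would pick splitter
        // keys from a sample so each node receives roughly equal data.
        return (key[0] - 'A') % nodes;
    }

    int main() {
        const int nodes = 3;
        std::vector<std::string> localRecords = {"A17", "C02", "B31", "A08", "C99"};
        for (const auto& rec : localRecords)
            std::printf("send %s over socket to node %d\n",
                        rec.c_str(), destinationNode(rec, nodes));
    }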
Summary • Clusters of hardware CyberBricks • all nodes are very intelligent • Processing migrates to where the power is • Disk, network, and display controllers have full-blown OSs • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA) • The computer is a federated distributed system • Software CyberBricks • a standard way to interconnect intelligent nodes • needs an execution model (partition & pipeline, RPC and rivers) • needs parallelism
What I’m Doing • TerraServer: Photo of the planet on the web • a database (not a file system) • 1TB now, 15 PB in 10 years • http://www.TerraServer.microsoft.com/ • Sloan Digital Sky Survey: picture of the universe • just getting started, cyberbricks for astronomers • http://www.sdss.org/ • Sorting: • one node pennysort (http://research.microsoft.com/barc/SortBenchmark/) • multinode: NT Cluster sort (shows off SAN and DCOM)
What I’m Doing • NT Clusters: • failover: fault tolerance within a cluster • NT Cluster Sort: a balanced IO, cpu, and network benchmark • AlwaysUp: geographical fault tolerance • RAGS: random testing of SQL systems • a bug finder • Telepresence • working with Gordon Bell on “the killer app” • FileCast and PowerCast • Cyberversity (an international, on-demand, free university)
Outline • Scaleability: MAPS • Scaleup has limits, scaleout for really big jobs • Two generic kinds of computing: • many little & few big • Many little has credible programming model • tp, web, fileserver, mail,… all based on RPC • Few big has marginal success (best is DSS) • Rivers and objects