Gordon Bell, 450 Old Oak Court, Los Altos, CA 94022, GBell@Microsoft.com
Jim Gray, 310 Filbert, SF CA 94133, Gray@Microsoft.com
MetaMessage: Technology Ratios Are Important • If everything gets faster & cheaper at the same rate THEN nothing really changes. • Things getting MUCH BETTER (10^4x in 25 years): • communication speed & cost • processor speed & cost (PAP) • storage size & cost • Things getting a little better (10x in 25 years): • storage latency & bandwidth • real application performance (RAP) • Things staying about the same: • speed of light (more or less constant) • people (10x more expensive)
Consequent Message • Processing and storage are WONDERFULLY cheaper • Storage latencies not much improved • Must get performance (RAP) via • Pipeline parallelism (mask latency) • Partition parallelism (bandwidth and mask latency) • Scaleable hardware/software architecture • Scaleable & commodity network / interconnect • Commodity hardware (processors, disks, memory) • Commodity software (OS, PL, Apps) • Scaleability thru automatic parallel programming • Manage & program as a single system • Mask faults
Outline • Storage trends force pipeline & partition parallelism • Lots of bytes & bandwidth per dollar • Lots of latency • Processor trends force pipeline & partition • Lots of MIPS per dollar • Lots of processors • Putting it together
Moore's Law: Exponential Change means continual rejuvenation • XXX doubles every 18 months (60% increase per year) • Micro Processor speeds • CMOS chip density (memory chips) • Magnetic disk density • Communications bandwidth • WAN bandwidth approaching LANs • Exponential Growth: • The past does not matter • 10x here, 10x there, soon you're talking REAL change
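A back-of-the-envelope check of the doubling-rate arithmetic on this slide, using only the figures stated above:

$$2^{12/18} \approx 1.59 \;\;(\text{about }60\%\text{ per year}), \qquad 2^{60/18} \approx 10 \;\;(\text{about }10\times\text{ every 5 years, }100\times\text{ per decade}).$$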
[Chart: Moore's Law for Memory — DRAM chip density from 1 Kbit (1970) toward 256 Mbit (2000); 1-chip memory sizes growing from 8 KB through 640K, 1 MB, 8 MB, 128 MB toward 1 GB.] Will Moore's Law continue to hold?
[Chart: Moore's Law for memory capacity with 64 Mb DRAMs, 1970–2000 — memory size from 8 KB to 4 GB, number of chips from 1/8th to 32K, and memory price at $50/chip from $6 to $1.6M; the 640K DOS limit is marked.]
[Charts, 1970–2000: bytes/$ for tape, disk, and RAM; accesses per second (RAM), per minute (disk), and per hour (tape).] Trends: Storage Got Cheaper • Latency down 10x • Bandwidth up 10x • $/byte got 10^4 better • $/access got 10^3 better • capacity grew 10^3
Partition Parallelism Gives Bandwidth • parallelism: use many little devices in parallel • Solves the bandwidth problem • Beware of the media myth • Beware of the access time myth • At 10 MB/s: 1.2 days to scan a terabyte; 1,000x parallel: about 2 minutes to scan
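A minimal sketch of the scan-time arithmetic behind this slide; the ~1 TB dataset size is inferred from 10 MB/s x 1.2 days, and the function name is illustrative:

```python
def scan_time_seconds(bytes_total, mb_per_sec, n_disks=1):
    """Time to scan bytes_total at mb_per_sec per disk, striped over n_disks disks."""
    return bytes_total / (mb_per_sec * 1e6 * n_disks)

TB = 1e12
print(scan_time_seconds(TB, 10) / 86400)              # ~1.2 days on one 10 MB/s disk
print(scan_time_seconds(TB, 10, n_disks=1000) / 60)   # ~1.7 minutes on 1,000 disks in parallel
```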
Partitioned Data has Natural Parallelism Split a SQL Table across many disks, memories, processors. Partition and/or Replicate data to get parallel disk access
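A toy illustration (not from the talk) of the two common ways to split a table's rows across disks; the rows, boundaries, and partition count are made up for the example:

```python
# Assign each row to a "disk"; all partitions can then be scanned in parallel.
rows = [("David", "NY"), ("Mike", "Berk"), ("Won", "Austin")]

def hash_partition(row, n_parts):
    return hash(row[0]) % n_parts              # spread rows evenly by key

def range_partition(row, boundaries=("H", "P")):
    key = row[0]
    for i, upper in enumerate(boundaries):     # keys <= "H" -> 0, <= "P" -> 1, else 2
        if key <= upper:
            return i
    return len(boundaries)

n_disks = 4
disks = [[] for _ in range(n_disks)]
for r in rows:
    disks[hash_partition(r, n_disks)].append(r)
print(disks)                                   # rows spread across the 4 "disks"
print([range_partition(r) for r in rows])      # [0, 1, 2] by first letter of the key
```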
[Charts: Today's Storage Hierarchy — speed & capacity vs. cost tradeoffs. Left panel: price ($/MB) vs. access time; right panel: size (bytes, up to 10^15) vs. access time (10^-9 to 10^3 seconds); levels are cache, main memory, secondary (disc), online tape, nearline tape, and offline tape.]
Trends: Application Storage Demand Grew • The Old World: millions of objects, 100-byte objects (e.g. a People table with just Name and Address) • The New World: billions of objects, big objects (1 MB) (e.g. a People table with Name, Voice, Picture, Address, Papers) • Drivers: paperless office, Library of Congress online, all information online (entertainment, publishing, business), Information Network, Knowledge Navigator, information at your fingertips
Good News: Electronic Storage Ratios Beat Paper • File cabinet: cabinet (4 drawer) 250$, paper (24,000 sheets) 250$, space (2x3 @ 10$/ft2) 180$, total 700$ = 3¢/sheet • Disk: disk (8 GB) 4,000$; ASCII: 4M pages at 0.1¢/sheet (30x cheaper) • Image: 200K pages at 2¢/sheet (similar to paper) • Store everything on disk
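Checking the per-sheet arithmetic (all dollar figures are from the slide; the implied ~2 KB per ASCII page and ~40 KB per scanned image are my inference from 8 GB / 4M pages and 8 GB / 200K pages):

$$\frac{\$700}{24{,}000\ \text{sheets}} \approx 2.9\text{¢/sheet},\qquad \frac{\$4{,}000}{4\times10^{6}\ \text{pages}} = 0.1\text{¢/page},\qquad \frac{\$4{,}000}{2\times10^{5}\ \text{pages}} = 2\text{¢/page}.$$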
Summary (of storage) • Capacity and cost are improving fast (100x per decade) • Accesses are getting larger (MOX, GOX, SCANS) • BUT latencies and bandwidth are not improving much (3x per decade) • How to deal with this? • Bandwidth: use partitioned parallel access (disk & tape farms) • Latency: pipeline data up the storage hierarchy (next section)
Interesting Storage Ratios • Disk is back to 100x cheaper than RAM • Nearline tape is only 10x cheaper than disk, and the gap is closing! • Disk & DRAM look good • ??? Why bother with tape? [Chart, 1960–2000: RAM $/MB : Disk $/MB and Disk $/MB : Nearline Tape $/MB ratios, ranging from 100:1 and 30:1 down toward 10:1 and 1:1.]
Outline • Storage trends force pipeline & partition parallelism • Lots of bytes & bandwidth per dollar • Lots of latency • Processor trends force pipeline & partition • Lots of MIPS per dollar • Lots of processors • Putting it together
[Chart: Intel microprocessor speeds (MIPS), 0.1 to 1000, for the 8088, 286, 386, 486, Pentium, and P6, 1980–2000; source: Intel.] MicroProcessor Speeds Went Up Fast • Clock rates went from 10 KHz to 300 MHz • Processors now 4x issue • SPECInt92 fits in cache, so it tracks cpu speed • Peak Advertised Performance (PAP) is 1.2 BIPS • Real Application Performance (RAP) is 60 MIPS • Similar curves for DEC VAX & Alpha, HP/PA, IBM RS6000/PowerPC, MIPS & SGI, SUN
[Chart: System SPECint (0–700) vs. price for PCs (486@66, Pentium), Compaq (to 16 proc.), Tricord ES 5K, NCR 3525/3555/3600 AP, SUN 1000/2000, HP 9000, SGI L, and SGI XL.]
[Charts: Cray GFLOPS vs. time — MP GFLOPS up 600x in 20 years (38% CAG), uniprocessor GFLOPS up 20x in 20 years (16% CAG); Workstation SPECint vs. time, 1985–1995 — MicroVax, Sun, Intel — with SPECint CAG of 45% to 70% and Intel clock 1979–1995 at 42% CAG.] Micros Live Under the Super Curve • Super GFLOPS went up • uni-processor 20x in 20 years • SMP 600x in 20 years • Microprocessor SPECint went up • CAG between 40% and 70% • Microprocessors meet Supers • same clock speeds soon • FUTURE: modest UniProcessor speedups • Must use multiple processors • (or maybe 1 chip is different?)
PAP vs RAP: Max Memory Performance 10x Better • PAP: Peak Advertised Performance: 300 MHz x 4x issue = 1.2 BIPS • RAP: Real Application Performance on Memory Intensive Applications (MIA = commercial): 2%-4% L2 cache miss, 40 MIPS to 80 MIPS • MIA UP RAP improved 50x in 30 years: • Cray 6600 @ 1.4 MIPS in 1964 • Alpha @ 70 MIPS in 1994 • Microprocessors have been growing up under the memory barrier • Mainframes have been at the memory barrier
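A rough model, not from the talk, of why a few percent of cache misses collapses PAP to the RAP numbers above; the ~150-cycle miss penalty is an assumed figure:

```python
# Effective MIPS = clock / (ideal CPI + miss_rate * miss_penalty).
CLOCK_MHZ = 300
IDEAL_CPI = 1 / 4          # 4-way issue => PAP of 1.2 BIPS
MISS_PENALTY = 150         # cycles per L2 miss to DRAM (assumption, not from the slide)

def effective_mips(miss_rate):
    cpi = IDEAL_CPI + miss_rate * MISS_PENALTY
    return CLOCK_MHZ / cpi

print(effective_mips(0.00))   # ~1200 "MIPS" = PAP
print(effective_mips(0.02))   # ~92 MIPS
print(effective_mips(0.04))   # ~48 MIPS  (roughly the slide's 40-80 MIPS RAP range)
```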
Growing Up Under the Super Curve • Cray & IBM & Amdahl are the fastest possible (at that time, for N megabucks) • Have GREAT memory and IO • Commodity systems growing up under the super memory cloud • Near the limit • Interesting times ahead: use parallelism to get speedup [Chart: Datamation Sort, cpu time only.]
Thesis: Performance = Storage Accesses, not Instructions Executed • In the "old days" we counted instructions and IOs • Now we count memory references • Processors wait most of the time [Chart: where the time goes — clock ticks used by AlphaSort components: sort, OS, disc wait, memory wait, I-cache miss, B-cache data miss, D-cache miss.] • 70 MIPS; "real" apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not
The Pico Processor • 1 M SPECmarks • 1 TFLOP • 10^6 clocks to bulk ram • Event-horizon on chip • VM reincarnated • Multi-program cache • On-Chip SMP • Terror Bytes!
[Charts: (1) pipeline/scientific — UK weather-forecasting FLOPS vs. time, 1950–2000 (Leo, Mercury, KDF9, 195, 205, YMP): 10^10x in 50 years = 1.58^50; (2) SMP/commercial — tps vs. cpus, 0 to 8 cpus, up to ~1200 tps.] Masking Memory Latency • MicroProcessors got 10,000x faster & cheaper • Main memories got 10x faster • So... how to get more work from memory? • cache memory to hide latency (reuse data) • wide memory for bandwidth • pipeline memory access to hide latency • SMP & threads for partitioned memory access
DataFlow Programming: Prefetch & Postwrite Hide Latency • Can't wait for the data to arrive (2,000 years!) • Need a memory that gets the data in advance (100 MB/s) • Solution: Pipeline data to/from the processor • Pipe data from source (tape, disc, ram...) to cpu cache
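A minimal sketch (not from the talk) of the prefetch idea: a background thread streams blocks from the source into a small queue so the consumer never waits out the full source latency; the names and block source are illustrative:

```python
import threading, queue

def prefetch(read_blocks, depth=4):
    """Run read_blocks() in a background thread, keeping `depth` blocks ahead of the consumer."""
    buf = queue.Queue(maxsize=depth)
    def producer():
        for block in read_blocks():        # e.g. reads from tape, disc, or ram
            buf.put(block)
        buf.put(None)                      # end-of-stream marker
    threading.Thread(target=producer, daemon=True).start()
    while True:
        block = buf.get()
        if block is None:
            break
        yield block                        # consumer computes while producer reads ahead

# Example (hypothetical file): process each block while the next ones stream in.
# for block in prefetch(lambda: open("data.bin", "rb")):
#     process(block)
```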
[Diagram: one sequential program piped into another (pipeline); many sequential programs running side by side (partition).] Parallel Execution Masks Latency • Processors are pushing on the memory barrier • MIA RAP << PAP, so learn from the FLOPS • Pipeline: mask latency • Partition: increase bandwidth • Overlap computation with latency
Outline • Storage trends force pipeline & partition parallelism • Lots of bytes & bandwidth per dollar • Lots of latency • Processor trends force pipeline & partition • Lots of MIPS per dollar • Lots of processors • Putting it together
[Diagram: price pyramid from mainframe (1 M$) to mini (100 K$) to micro (10 K$) to nano, alongside disk form factors shrinking from 14" to 9", 5.25", 3.5", 2.5", 1.8".] Thesis: Many Little Beat Few Big • How to connect the many little parts? • How to program the many little parts? • Fault tolerance?
Clusters: Connecting Many Little [Diagram: a node = CPU + 5 GB RAM + 50 GB disc, replicated and interconnected.] • Future servers are CLUSTERS of processors and discs • Distributed database techniques make clusters work
Success Stories: OLTP • Transaction processing, client/server, and file serving have natural parallelism: lots of clients, lots of small independent requests • Near-linear scaleup • Support > 10K clients • Examples: Oracle/Rdb scales to 3.7K tpsA on a 5x4 Alpha cluster; Tandem scales to 21K tpmC on a 1x110 Tandem cluster • Shared-nothing scales best [Chart: throughput vs. CPUs, 2 to 110 cpus, up to 21K tpmC.]
Success Stories: Decision Support • Relational databases are uniform streams of data • allows pipelining (much like vector processing) • allows partitioning (by range or hash) • Relational operators are closed under composition • output of one operator can be streamed to the next operator • Get linear scaleup on SMP and SN • (Teradata, Tandem, Oracle, Informix,...)
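An illustrative sketch (not from the talk) of relational operators as streams that compose into a pipeline; the operator names, rows, and predicate are made up for the example:

```python
# Each operator consumes a stream of rows and yields a stream of rows,
# so the output of one operator pipes straight into the next (closure under composition).
def scan(table):
    for row in table:
        yield row

def select(rows, pred):
    for row in rows:
        if pred(row):
            yield row

def project(rows, cols):
    for row in rows:
        yield {c: row[c] for c in cols}

emp = [{"name": "David", "city": "NY", "sal": 10},
       {"name": "Won",   "city": "Austin", "sal": 20}]

# The whole query plan is one pipeline; rows flow through without materializing.
plan = project(select(scan(emp), lambda r: r["sal"] > 15), ["name"])
print(list(plan))   # [{'name': 'Won'}]
```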
Scaleables: Uneconomic So Far • A slice is a processor, memory, and a few disks • Slice price of scaleables so far is a 5x to 10x markup • Teradata: 70K$ for an Intel 486 + 32MB + 4 disks • Tandem: 100K$ for a MipsCo R4000 + 64MB + 4 disks • Intel: 75K$ for an i860 + 32MB + 2 disks • TMC: 75K$ for a SPARC 3 + 32MB + 2 disks • IBM/SP2: 100K$ for an RS6000 + 64MB + 8 disks • Compaq slice price is less than 10K$ • What is the problem? • Proprietary interconnect • Proprietary packaging • Proprietary software (vendorIX)
[Chart: network bandwidth (1 Kb/s to 10 Gb/s) vs. year, 1965–2000, for POTS, WAN, LAN, CAN, and PC bus.] Network Trends & Challenge • Bandwidth UP 10^4, price went DOWN • Speed of light and distance unchanged • Software got worse • Standard fast nets: ATM, PCI, Myrinet, Tnet • HOPE: commodity net + good software, then clusters become a SNAP! (commodity: 10K$/slice)
Great Debate: Shared What? • Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup (Sequent, SGI, Sun) • Shared Disk: (VMScluster, Sysplex) • Shared Nothing (network): hard to program, easy to build, easy to scaleup (Tandem, Teradata, SP2) • Winner will be a synthesis of these ideas • Distributed shared memory (DASH, Encore) blurs the distinctions
Architectural Issues • Hardware will be parallel • What is the programming model? • Can you hide locality? No, locality is critical • If you build an SMP, you must program it as shared-nothing • Will users learn to program in parallel? • No, successful products give automatic parallelism • With 100s of computers, what about management? • Administration costs 2.5K$/year/PC (lowest estimate) • Cluster must be • as easy to manage as a single system (it is a single system) • faults diagnosed & masked automatically • Message-based computation model • Transactions • Checkpoint / restart
SNAP Business Issues • Use commodity components (software & hardware) • Intel won - compatibility is important • ATM will probably win LAN & WAN, not CAN • NT will probably win (UNIX too fragmented) • SQL is winning parallel data access • What else? • Automatic parallel programming • Key to scaleability • Desktop to glass house • Automatic management • Key to economics • Palmtops and mobile may be differentiated
SNAP Systems circa 2000 [Diagram: a space, time (bandwidth), & generation scalable environment — mobile nets and portables; legacy mainframe & minicomputer servers & terminals; local & global data comm world; wide-area global ATM network; ATM & Ethernet connecting PCs, workstations, & servers; person servers (PCs); scalable computers built from PCs + CAN; centralized & departmental servers built from PCs; TC = TV + PC at home (via CATV or ATM or satellite).]
The SNAP Software Challenge • Cluster & network OS • Automatic administration • Automatic data placement • Automatic parallel programming • Parallel query optimization • Parallel concepts, algorithms, tools • Execution techniques: load balance, checkpoint/restart, ...
Outline • Storage trends force pipeline & partition parallelism • Lots of bytes & bandwidth per dollar • Lots of latency • Processor trends force pipeline & partition • Lots of MIPS per dollar • Lots of processors • Putting it together (Scaleable Networks and Platforms) • Build clusters of commodity processors & storage • Commodity interconnect is key (S of PMS) • Traditional interconnects give 100k$/slice. • Commodity Cluster Operating System is key • Fault isolation and tolerance is key • Automatic Parallel Programming is key