
Parallel & Distributed Computing



  1. Parallel & Distributed Computing Jim Gray Microsoft http://research.microsoft.com/~gray/talks

  2. Outline • All God’s Children Got Clusters! • Technology trends imply processors migrated to transducers • Components (Software Cyberbricks) • Programming & Managing Clusters • Database experience • Parallelism via transaction processing • Parallelism via dataflow • AutoEverything, AlwaysUp

  3. It’s so natural, even mainframes cluster! • Looking closer at usage patterns, a few models emerge • Looking closer at sites, hierarchies, bunches, and functional specialization emerge • Which are the roses? Which are the briars? • A cluster is a cluster is a cluster

  4. “Commercial” NT Clusters • 16-node Tandem Cluster • 64 cpus • 2 TB of disk • Decision support • 45-node Compaq Cluster • 140 cpus • 14 GB DRAM • 4 TB RAID disk • OLTP (Debit Credit) • 1 B tpd (14 k tps)

  5. Microsoft.com: ~150x4 nodes

  6. HotMail: ~400 Computers

  7. Inktomi (HotBot), WebTV: > 200 nodes • Inktomi: ~250 UltraSparcs • Web crawl • Index the crawled web and save the index • Return search results on demand • Track ads and click-thrus • ACID vs BASE (Basically Available, Soft state, Eventually consistent) • WebTV: ~200 UltraSparcs • Render pages, provide email • ~4 Network Appliance NFS file servers • A large Oracle app tracking customers

  8. Loki: Pentium Clusters for Science http://loki-www.lanl.gov/ • 16 Pentium Pro processors x 5 Fast Ethernet interfaces + 2 GBytes RAM + 50 GBytes disk + 2 Fast Ethernet switches + Linux = 1.2 real Gflops for $63,000 (but that is the 1996 price) • The Beowulf project is similar: http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html • Scientists want cheap mips.

  9. Your Tax Dollars At Work: ASCI for Stockpile Stewardship • Intel/Sandia: 9000x1 node PPro • LBL/IBM: 512x8 PowerPC (SP2) • LNL/Cray: ? • Maui Supercomputer Center: 512x1 SP2

  10. Berkeley NOW (Network Of Workstations) project http://now.cs.berkeley.edu/ • 105 nodes • Sun UltraSparc 170, 128 MB, 2x2 GB disk • Myrinet interconnect (2x160 MBps per node), SBus (30 MBps) limited • GLUNIX layer above Solaris • Inktomi (HotBot search) • NAS Parallel Benchmarks • Crypto cracker • Sort 9 GB per second

  11. Wisconsin COW • 40 UltraSparcs, 64 MB + 2x2 GB disk + Myrinet • SunOS • Used as a compute engine

  12. Andrew Chien’s JBOB http://www-csag.cs.uiuc.edu/individual/achien.html • 48 nodes • 36 HP Kayak boxes: 2 PII x 128 MB, 1 disk • 10 Compaq Workstation 6000 boxes: 2 PII x 128 MB, 1 disk • 32 Myrinet-connected & 16 ServerNet-connected • Operational 1/23/98 (we hope!) • All running NT

  13. NCSA Cluster • The National Center for Supercomputing Applications, University of Illinois @ Urbana • Larry Smarr driving it • 500 Pentium CPUs, 2,000 disks, SAN • 1/2 Compaq? 1/2 HP? • Still in design; build it in 1998 • A supercomputer for 3 M$

  14. The Bricks of Cyberspace: 4 B PCs (1 Bips, 0.1 GB DRAM, 10 GB disk, 1 Gbps net, B = G) • Cost 1,000 $ • Come with: • NT • DBMS • High speed net • System management • GUI / OOUI • Tools • Compatible with everyone else • CyberBricks

  15. Super Server: 4T Machine • Array of 1,000 4B machines: • 1 Bips processors • 1 B B DRAM • 10 B B disks • 1 Bbps comm lines • 1 TB tape robot • A few megabucks • Challenge: • Manageability • Programmability • Security • Availability • Scaleability • As easy as a single system • Cyber Brick: a 4B machine (CPU, 5 GB RAM, 50 GB disc) • Future servers are CLUSTERS of processors and discs • Distributed database techniques make clusters work

  16. Cluster Vision: Buying Computers by the Slice • Rack & stack • Mail-order components • Plug them into the cluster • Modular growth without limits • Grow by adding small modules • Fault tolerance: • Spare modules mask failures • Parallel execution & data search • Use multiple processors and disks • Clients and servers made from the same stuff • Inexpensive: built with commodity CyberBricks

  17. Nostalgia: Behemoth in the Basement • Today’s PC is yesterday’s supercomputer • Can use LOTS of them • Main apps changed: scientific → commercial → web • Web & transaction servers • Data mining, web farming

  18. SMP -> nUMA: BIG FAT SERVERS • Directory-based caching lets you build large SMPs • Every vendor is building a HUGE SMP • 256-way • 3x slower remote memory • 8-level memory hierarchy: L1, L2 cache, DRAM, remote DRAM (3, 6, 9, …), disk cache, disk, tape cache, tape • Needs 64-bit addressing • nUMA-sensitive OS (not clear who will do it) • Or a hypervisor like IBM LSF, Stanford Disco www-flash.stanford.edu/Hive/papers.html • You get an expensive cluster-in-a-box with a very fast network

  19. Outline • All God’s Children Got Clusters! • Technology trends imply processors migrated to transducers • Components (Software Cyberbricks) • Programming & Managing Clusters • Database experience • Parallelism via transaction processing • Parallelism via dataflow • AutoEverything, AlwaysUp

  20. Gilder’s Telecosm Law: 3x bandwidth/year for 25 more years • Today: • 10 Gbps per channel • 4 channels per fiber: 40 Gbps • 32 fibers/bundle = 1.2 Tbps/bundle • In the lab: 3 Tbps/fiber (400x WDM) • In theory: 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth • 1 fiber = 25 Tbps
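A small back-of-the-envelope sketch in Python (not from the original slides) that checks the per-fiber and per-bundle figures quoted above and projects the 3x/year growth; the 10-year horizon is my own choice of example.

```python
# Arithmetic sketch: bandwidth figures from the slide, projection assumes 3x/year.
GBPS_PER_CHANNEL = 10          # 10 Gbps per channel
CHANNELS_PER_FIBER = 4         # 4 WDM channels per fiber
FIBERS_PER_BUNDLE = 32         # 32 fibers per bundle

fiber_gbps = GBPS_PER_CHANNEL * CHANNELS_PER_FIBER        # 40 Gbps per fiber
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000       # ~1.28 Tbps per bundle
print(f"per fiber:  {fiber_gbps} Gbps")
print(f"per bundle: {bundle_tbps:.2f} Tbps")

years = 10                                                # illustrative horizon
print(f"in {years} years: {bundle_tbps * 3**years:,.0f} Tbps/bundle if 3x/yr holds")
```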

  21. Networking: BIG!! changes coming! • CHALLENGE: reduce the software tax on messages • Today: 30 K ins + 10 ins/byte • Goal: 1 K ins + .01 ins/byte • Best bet: SAN/VIA • Smart NICs • Special protocol • User-level net IO (like disk) • Technology: • 10 GBps bus “now” • 1 Gbps links “now” • 1 Tbps links in 10 years • Fast & cheap switches • Standard interconnects: processor-processor, processor-device (= processor) • Deregulation WILL work someday
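A minimal sketch of the "software tax" arithmetic above: cost per message is a fixed per-message instruction count plus a per-byte charge. The instruction counts are from the slide; the function and the 8 KB example message are my own illustration.

```python
# Message-cost model: fixed instructions per message + instructions per byte.
def message_cost(n_bytes, fixed_ins, per_byte_ins):
    """Instructions spent to send/receive one message of n_bytes."""
    return fixed_ins + per_byte_ins * n_bytes

for label, fixed, per_byte in [("today", 30_000, 10), ("goal", 1_000, 0.01)]:
    cost_8k = message_cost(8 * 1024, fixed, per_byte)
    print(f"{label}: an 8 KB message costs ~{cost_8k:,.0f} instructions")
# today: ~112,000 instructions; goal: ~1,082 -- roughly a 100x cut.
```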

  22. What if Networking Was as Cheap as Disk IO? • TCP/IP on Unix/NT: 100% cpu @ 40 MBps • Disk on Unix/NT: 8% cpu @ 40 MBps • Why the difference? • Host does TCP/IP packetizing, checksum, … flow control, small buffers • Host bus adapter does SCSI packetizing, checksum, … flow control, DMA

  23. The Promise of SAN/VIA: 10x better in 2 years • Today: • wires are 10 MBps (100 Mbps Ethernet) • ~20 MBps tcp/ip saturates 2 cpus • round-trip latency is ~300 µs • In two years: • wires are 100 MBps (1 Gbps Ethernet, ServerNet, …) • tcp/ip ~100 MBps at 10% of each processor • round-trip latency is 20 µs • works in the lab today; assumes the app uses the zero-copy Winsock2 API. See http://www.viarch.org/

  24. Technology (hardware) • NOW: • CPU: nearing 1 BIPS, but CPI rising fast (2-10), so less than 100 mips; 1 $/mips to 10 $/mips • DRAM: 3 $/MB • Disk: 30 $/GB • Tape: 20 GB/tape, 6 MBps; lags disk; 2 $/GB offline, 15 $/GB nearline • 2003 forecast (10x better): • CPU: 1 BIPS real (SMP), 0.1 $ - 1 $/mips • DRAM: 1 Gb chip, 0.1 $/MB • Disk: 10 GB smart cards, 500 GB RAID5 packs (NT inside), 3 $/GB • Tape: ?

  25. Thesis: Many Little Beat Few Big • [Figure: a 1 M$ mainframe, 100 K$ mini, 10 K$ micro, and nano/pico processors; access times from 10 ps RAM through 10 ns RAM, 10 µs RAM, 10 ms disc, to 10 s tape archive; capacities from 1 MB to 100 TB; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"] • 1 M SPECmarks, 1 TFLOP • 10^6 clocks to bulk RAM • Event-horizon on chip • VM reincarnated • Multi-program cache, on-chip SMP • Smoking, hairy golf ball • How to connect the many little parts? • How to program the many little parts? • Fault tolerance?

  26. Storage Latency: How Far Away is the Data? • [Figure: access time in clock ticks, scaled so 1 clock ~ 1 minute] • Registers: 1 (my head, 1 min) • On-chip cache: 2 (this room) • On-board cache: 10 (this campus, 10 min) • Memory: 100 (Sacramento, 1.5 hr) • Disk: 10^6 (Pluto, 2 years) • Tape/optical robot: 10^9 (Andromeda, 2,000 years)
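An illustrative calculation (assumptions mine: one clock tick is scaled to one minute of human time, and the latencies in clock ticks are the rough figures from the slide) that reproduces the analogy:

```python
# Rescale access latencies so 1 clock tick ~ 1 minute of "human time".
CLOCK_MINUTES = 1
MIN_PER_YEAR = 60 * 24 * 365

levels = {                         # latency in clock ticks (rough, from the slide)
    "registers":      1,
    "on-chip cache":  2,
    "on-board cache": 10,
    "memory":         100,
    "disk":           10**6,
    "tape robot":     10**9,
}

for name, ticks in levels.items():
    minutes = ticks * CLOCK_MINUTES
    if minutes < 60:
        human = f"{minutes} min"
    elif minutes < MIN_PER_YEAR:
        human = f"{minutes / 60:.1f} hr"
    else:
        human = f"{minutes / MIN_PER_YEAR:,.0f} years"
    print(f"{name:15s} {ticks:>12,} clocks ~ {human}")
# disk comes out near 2 years, the tape robot near 2,000 years, as on the slide.
```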

  27. System On A Chip • Integrate Processing with memory on one chip • chip is 75% memory now • 1MB cache >> 1960 supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM,… projects abound • Integrate Networking with processing on one chip • system bus is a kind of network • ATM, FiberChannel, Ethernet,.. Logic on chip. • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.

  28. All Device Controllers will be Cray 1’s • TODAY: • Disk controller is a 10 mips risc engine with 2 MB DRAM • NIC is similar power • SOON: • Will become 100 mips systems with 100 MB DRAM • They are nodes in a federation (can run Oracle on NT in the disk controller) • Advantages: • Uniform programming model • Great tools • Security • Economics (CyberBricks) • Move computation to data (minimize traffic) • [Figure: central processor & memory on a terabyte backplane]

  29. It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get: • several network interfaces • a PostScript engine • cpu • memory • software • a spooler (soon) • and… a print engine.

  30. Functionally Specialized Cards (storage, network, display) • Each card: P mips processor + M MB DRAM + ASIC • Today: P = 50 mips, M = 2 MB • In a few years: P = 200 mips, M = 64 MB

  31. Conventional vs Radical (central processor & memory + terabyte interconnect + supercomputer adapters) • Conventional: offload device handling to NIC/HBA; higher-level protocols: I2O, NASD, VIA…; SMP and cluster parallelism is important. • Radical: move OS + app to the NIC/device controller; higher-higher level protocols: CORBA / DCOM; cluster parallelism is VERY important.

  32. When Every Device is a Node • It’s a distributed system (cluster) • It’s a very parallel system • Programming model is interesting • Some progress in Database/transaction processing • lots of little independent requests (transaction processing) • Dataflow for big requests (data mining) • Some progress in Web servers • like transaction processing (lots of little requests) • Some spectacular failures • MPPs

  33. Software CyberBricks: Objects! • It’s a zoo • Objects and 3-tier computing (transactions) • Give natural distribution & parallelism • Give remote management! • TP & Web: Dispatch RPCs to pool of object servers • ActiveX controls: 1B$ business today! • JavaBeans: ? B$ business today!

  34. The COMponent Promise • Objects are Software CyberBricks: • productivity breakthrough (plug-ins) • manageability breakthrough (modules) • Microsoft promise: DCOM + ActiveX + … • IBM/Sun/Oracle/Netscape promise: CORBA + OpenDoc + JavaBeans + … • Both promise: • parallel distributed execution • centralized management of distributed systems • Both camps share key goals: • Encapsulation: hide implementation • Polymorphism: generic ops, key to GUI and reuse • Uniform naming • Discovery: finding a service • Fault handling: transactions • Versioning: allow upgrades • Transparency: local/remote • Security: who has authority • Shrink-wrap: minimal inheritance • Automation: easy

  35. COM / CORBA History and Alphabet Soup • Microsoft DCOM is based on OSF-DCE technology; DCOM and ActiveX extend it • [Timeline figure, 1985-1995: UNIX International, Open Software Foundation (OSF), X/Open, Open Group, Object Management Group (OMG); OSF DCE, DCE RPC, GUIDs, IDL, DNS, Kerberos; ODBC, XA / TX; NT, Solaris; COM, CORBA]

  36. Object Oriented Programming: Parallelism From Many Little Jobs • Gives location transparency • TP monitor / ORB / web server multiplexes clients to servers • Enables distribution • Exploits embarrassingly parallel apps (transactions) • HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis
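A minimal sketch (my own, not from the talk) of the point above: a TP monitor / ORB / web server multiplexes many small, independent client requests onto a pool of servers, so parallelism falls out of the workload. The worker-pool size and `handle_request` stand-in are illustrative assumptions.

```python
# Many little independent jobs dispatched to a server pool.
from concurrent.futures import ThreadPoolExecutor

def handle_request(req_id):
    """Stand-in for one small transaction (debit/credit, page fetch, ...)."""
    return f"request {req_id} done"

# The pool plays the role of the TP monitor's server class / ORB object servers.
with ThreadPoolExecutor(max_workers=8) as server_pool:
    results = list(server_pool.map(handle_request, range(100)))

print(results[:3])   # each request ran independently of the others
```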

  37. Outline • All God’s Children Got Clusters! • Technology trends imply processors migrated to transducers • Components (Software Cyberbricks) • Programming & Managing Clusters • Database experience • Parallelism via transaction processing • Parallelism via dataflow • AutoEverything, AlwaysUp

  38. We Have Modest Servers Today • Commodity servers are • memory & bus limited • improvements are coming • TPC Picture Shows • Software does well given the hardware at hand • UNIX boxes are 3x faster and 10x more expensive (cost/transaction is 3x more) • MVS boxes are off scale expensive

  39. Bottleneck Analysis (drawn to linear scale) • Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits • Memory read/write: ~150 MBps • MemCopy: ~50 MBps • Disk R/W: ~9 MBps

  40. Parallel Access To Data • At 10 MB/s, scanning 1 terabyte takes 1.2 days • With 1,000-way parallelism (10 GB/s aggregate bandwidth), it is a 100-second scan • Parallelism: divide a big problem into many smaller ones to be solved in parallel
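A quick arithmetic sketch (values from the slide, code mine) confirming the scan times above:

```python
# 1 TB scanned serially at 10 MB/s vs. 1,000 disks scanned in parallel.
TB = 10**12                      # 1 terabyte, in bytes
serial_rate = 10 * 10**6         # 10 MB/s from one disk
parallel_ways = 1_000            # 1,000-way parallel scan

serial_days = TB / serial_rate / 86_400
parallel_seconds = TB / (serial_rate * parallel_ways)

print(f"serial scan:   {serial_days:.1f} days")    # ~1.2 days
print(f"parallel scan: {parallel_seconds:.0f} s")  # ~100 s at 10 GB/s aggregate
```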

  41. Bottleneck Analysis • NTFS read/write, 12 disks, 4 SCSI, 2 PCI: • ~120 MBps unbuffered read • ~80 MBps unbuffered write • ~40 MBps buffered read • ~35 MBps buffered write • [Figure: adapters ~30 MBps each, PCI ~70 MBps, memory read/write ~150 MBps]

  42. Kinds of Parallel Execution • Pipeline: any sequential program feeds its output stream to another sequential program • Partition: outputs split N ways, inputs merge M ways; many copies of the same sequential program run on partitions of the data
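A sketch of the two shapes of parallelism just described (the generator stages, the even-number filter, and the hash-partitioning choice are my own illustrations, not from the talk):

```python
# Pipeline vs. partition parallelism, shown as dataflow over Python generators.

def scan(records):                 # stage 1: produce records
    for r in records:
        yield r

def filter_stage(stream):          # stage 2: consumes while stage 1 produces
    for r in stream:
        if r % 2 == 0:
            yield r

# Pipeline: in a real system the stages run concurrently; the generator chain
# shows the dataflow shape (each record flows through both stages).
pipeline = filter_stage(scan(range(10)))

# Partition: split the input N ways, run the SAME sequential program on each
# partition, then merge the outputs.
def partition(records, n):
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r) % n].append(r)      # hash partitioning
    return parts

parts = partition(range(10), n=3)
merged = [r for p in parts for r in filter_stage(scan(p))]

print(list(pipeline), sorted(merged))     # both produce the even numbers
```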

  43. Data Flow Programming: Prefetch & Postwrite Hide Latency • Can't wait for the data to arrive (2,000 years!) • Need memory that gets the data in advance (~100 MB/s) • Solution: pipeline from storage (tape, disc, …) to cpu cache; pipeline results to the destination
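A minimal prefetch sketch (my own illustration of the idea, assuming a reader thread and a bounded queue): the next block is fetched while the consumer works on the current one, so transfer latency hides behind computation.

```python
# Prefetch pipeline: producer thread reads ahead, bounded queue gives flow control.
import queue, threading

def prefetcher(blocks, out_q):
    for b in blocks:                      # stand-in for reading tape/disc blocks
        out_q.put(b)                      # blocks ahead of the consumer are buffered
    out_q.put(None)                       # end-of-stream marker

def consume(in_q):
    total = 0
    while (block := in_q.get()) is not None:
        total += sum(block)               # "process" a block while the next is fetched
    return total

q = queue.Queue(maxsize=4)                # bounded queue = flow control
t = threading.Thread(target=prefetcher, args=([range(1000)] * 10, q))
t.start()
print(consume(q))
t.join()
```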

  44. Why are Relational Operators So Successful for Parallelism? • Relational data model: uniform operators on uniform data streams, closed under composition • Each operator consumes 1 or 2 input streams • Each stream is a uniform collection of data • Sequential data in and out: pure dataflow • Partitioning some operators (e.g. aggregates, non-equi-join, sort, …) requires innovation • AUTOMATIC PARALLELISM
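A sketch of the composition property described above (the toy table and the `scan`/`select`/`project` names are my assumptions): every operator takes record streams in and produces a record stream out, so plans compose, and the same composition can run unchanged on every partition.

```python
# Relational operators as stream-in / stream-out functions, closed under composition.

def scan(table):                       # table: list of dict records
    yield from table

def select(pred, stream):              # one stream in, one stream out
    return (r for r in stream if pred(r))

def project(cols, stream):
    return ({c: r[c] for c in cols} for r in stream)

emp = [{"name": "ann", "dept": 1, "sal": 90},
       {"name": "bob", "dept": 2, "sal": 70}]

# The output of one operator is the input of the next: pure dataflow.
plan = project(["name"], select(lambda r: r["sal"] > 80, scan(emp)))
print(list(plan))                      # [{'name': 'ann'}]
```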

  45. Database Systems “Hide” Parallelism • Automate system management via tools • data placement • data organization (indexing) • periodic tasks (dump / recover / reorganize) • Automatic fault tolerance • duplex & failover • transactions • Automatic parallelism • among transactions (locking) • within a transaction (parallel execution)

  46. Automatic Parallel Object Relational DB • Select image from landsat where date between 1970 and 1990 and overlaps(location, :Rockies) and snow_cover(image) > .7; • The predicate mixes temporal, spatial, and image tests • Assign one process per processor/disk: • find images with the right date & location • analyze the image; if 70% snow, return it • [Figure: the Landsat table (date, location, image columns) partitioned across disks, with the date, location, & image tests applied in parallel to build the answer]
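An illustrative sketch of the execution strategy above, one worker process per partition/disk. The table layout and the `overlaps` and `snow_cover` helpers are hypothetical stand-ins for the spatial and image functions named in the query, not real library calls.

```python
# One process per partition: cheap date/location tests first, image analysis last.
from multiprocessing import Pool

def overlaps(location, region):        # hypothetical spatial test
    return location in region

def snow_cover(image):                 # hypothetical image-analysis stand-in
    return image.count("snow") / max(len(image.split()), 1)

def scan_partition(args):
    rows, region = args
    return [r["image"] for r in rows
            if 1970 <= r["date"] <= 1990
            and overlaps(r["loc"], region)
            and snow_cover(r["image"]) > 0.7]

if __name__ == "__main__":
    partitions = [[{"date": 1972, "loc": "33N 120W", "image": "snow snow snow rock"}],
                  [{"date": 1995, "loc": "34N 120W", "image": "forest"}]]
    rockies = {"33N 120W"}
    with Pool(len(partitions)) as p:       # one worker per partition/disk
        answers = p.map(scan_partition, [(rows, rockies) for rows in partitions])
    print([img for part in answers for img in part])
```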

  47. Automatic Data Partitioning • Split a SQL table across a subset of nodes & disks • Partition within the set by: • Range: good for equi-joins, range queries, group-by • Hash: good for equi-joins • Round robin: good to spread load • Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning
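A toy sketch of the three placement schemes listed above (split points and node count are my example values):

```python
# Range, hash, and round-robin placement of rows across N nodes/disks.
import bisect

def range_partition(key, split_points):          # good for range queries, equi-joins, group-by
    return bisect.bisect_right(split_points, key)

def hash_partition(key, n_nodes):                # good for equi-joins
    return hash(key) % n_nodes

def round_robin_partition(row_number, n_nodes):  # good to spread load evenly
    return row_number % n_nodes

split_points = [100, 200, 300]                   # 4 range partitions: <100, 100-199, 200-299, >=300
for i, key in enumerate([42, 150, 250, 999]):
    print(key,
          range_partition(key, split_points),
          hash_partition(key, 4),
          round_robin_partition(i, 4))
```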

  48. Data Rivers: Split + Merge Streams (N producers, M consumers, N x M data streams) • Producers add records to the river; consumers consume records from the river • Purely sequential programming • The river does flow control and buffering • It does the partition and merge of data records • River = Split/Merge in Gamma = Exchange operator in Volcano
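A toy river sketch following the description above (thread counts, queue sizes, and the timeout-based shutdown are my own simplifications): producers and consumers stay purely sequential, while the river does the split, merge, buffering, and flow control.

```python
# N producers, M consumers, with the "river" as hash-partitioned bounded queues.
import queue, threading

N_PRODUCERS, M_CONSUMERS = 3, 2
river = [queue.Queue(maxsize=16) for _ in range(M_CONSUMERS)]   # one lane per consumer

def producer(pid):
    for i in range(5):
        rec = (pid, i)
        river[hash(rec) % M_CONSUMERS].put(rec)   # the river does the split (partition)

def consumer(cid, out):
    while True:
        try:
            out.append(river[cid].get(timeout=0.5))  # the river buffers and merges
        except queue.Empty:
            return                                   # crude end-of-stream for the sketch

outs = [[] for _ in range(M_CONSUMERS)]
producers = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
consumers = [threading.Thread(target=consumer, args=(c, outs[c])) for c in range(M_CONSUMERS)]
for t in producers + consumers: t.start()
for t in producers + consumers: t.join()
print([len(o) for o in outs])        # 15 records total, spread over the consumers
```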

  49. Partitioned Execution • Spreads computation and IO among processors • Partitioned data gives NATURAL parallelism

  50. Partitioned and Pipelined Data Flows • N x M way parallelism: N inputs, M outputs, no bottlenecks • Partitioned data
