
What Happens When Processing, Storage, and Bandwidth are Free and Infinite?

Jim Gray, Microsoft Research


Presentation Transcript


  1. What Happens When Processing, Storage, and Bandwidth are Free and Infinite? Jim Gray, Microsoft Research

  2. Outline • Clusters of Hardware CyberBricks • all nodes are very intelligent • Processing migrates to where the power is • Disk, network, display controllers have full-blown OS • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA) • Computer is a federated distributed system. • Software CyberBricks • standard way to interconnect intelligent nodes • needs execution model • needs parallelism

  3. When Computers & Communication are Free • Traditional computer industry is 0 B$/year • All the costs are in • Content (good) • System Management (bad) • A vendor claims it costs 8$/MB/year to manage disk storage. • => WebTV (1 GB drive) costs 8,000$/year to manage! • => 10 PB DB costs 80 Billion $/year to manage! • Automatic management is ESSENTIAL • In the meantime…
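
The management-cost figures on this slide follow from simple multiplication. A minimal sketch, assuming decimal units (1 GB = 1,000 MB, 1 PB = 10^9 MB), which is what reproduces the slide's round numbers; the function and variable names are illustrative only:

```python
# Assumption: decimal units, chosen because they reproduce the slide's figures.
COST_PER_MB_YEAR = 8          # dollars per MB per year, the vendor claim quoted on the slide
MB_PER_GB = 1_000
MB_PER_PB = 1_000_000_000


def yearly_management_cost(megabytes):
    """Dollars per year to manage `megabytes` of disk at the quoted rate."""
    return COST_PER_MB_YEAR * megabytes


webtv_cost = yearly_management_cost(1 * MB_PER_GB)     # 1 GB WebTV drive
big_db_cost = yearly_management_cost(10 * MB_PER_PB)   # 10 PB database

print(f"WebTV drive: ${webtv_cost:,} per year")
print(f"10 PB DB:    ${big_db_cost:,} per year")
```

At the quoted rate the cost of managing a disk dwarfs the cost of buying it, which is the slide's argument for automatic management.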

  4. 1980 Rule of Thumb • You need a systems programmer per MIPS • You need a Data Administrator per 10 GB

  5. One Person per MegaBuck • 1 Breadbox ~ 5x 1987 machine room • 48 GB is hand-held • One person does all the work • Cost/tps is 1,000x less: 25 micro dollars per transaction • A megabuck buys 40 of these!!! [Figure labels: 4 x 200 MHz cpu, 1/2 GB DRAM, 12 x 4 GB disk, 3 x 7 x 4 GB disk arrays; Hardware expert, OS expert, Net expert, DB expert, App expert]

  6. All God’s Children Have Clusters! Buying Computing By the Slice • People are buying computers by the dozens • Computers only cost 1k$/slice! • Clustering them together

  7. A cluster is a cluster is a cluster • It’s so natural, even mainframes cluster! • Looking closer at usage patterns, a few models emerge • Looking closer at sites, you see • hierarchies • bunches • functional specialization

  8. “Commercial” NT Clusters • 16-node Tandem Cluster • 64 cpus • 2 TB of disk • Decision support • 45-node Compaq Cluster • 140 cpus • 14 GB DRAM • 4 TB RAID disk • OLTP (Debit Credit) • 1 B tpd (14 k tps)

  9. Tandem Oracle/NT • 27,383 tpmC • 71.50 $/tpmC • 4 x 6 cpus • 384 disks=2.7 TB
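
TPC-C price/performance is the total priced system cost divided by throughput, so the two figures on this slide imply a system price. A rough sketch of that arithmetic follows; the derived per-cpu and per-disk averages are back-of-the-envelope values, not numbers from the slide, and decimal units (1 TB = 1,000 GB) are an assumption:

```python
tpmc = 27_383              # throughput from the slide
dollars_per_tpmc = 71.50   # price/performance from the slide
cpus = 4 * 6               # "4 x 6 cpus"
disks = 384
disk_capacity_gb = 2_700   # 2.7 TB, assuming 1 TB = 1,000 GB

implied_system_price = tpmc * dollars_per_tpmc   # $/tpmC * tpmC = total dollars
tpmc_per_cpu = tpmc / cpus
avg_gb_per_disk = disk_capacity_gb / disks

print(f"Implied system price: ${implied_system_price:,.2f}")
print(f"Throughput per cpu:   {tpmc_per_cpu:,.0f} tpmC")
print(f"Average disk size:    {avg_gb_per_disk:.1f} GB")
```

The implied price is just under two million dollars.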

  10. Microsoft.com: ~150x4 nodes [Figure: The Microsoft.Com Site network diagram (\\Tweeks\Statistics\LAN and Server Name Info\Cluster Process Flow\MidYear98a.vsd, 12/15/97). Server groups shown: www.microsoft.com, home.microsoft.com, premium.microsoft.com, search.microsoft.com, register.microsoft.com, support.microsoft.com, activex.microsoft.com, cdm.microsoft.com, msid.msn.com, register.msn.com, FTP.microsoft.com, FTP and HTTP download servers, live SQL servers, SQL consolidators, SQL reporting, log processing, staging servers, internal WWW, MOSWest, European Data Center, Japan Data Center. Typical node: 4xP5 or 4xP6, 256 MB to 1 GB RAM, 12 to 180 GB HD, average cost $24K to $128K, each group with an FY98 forecast count. Network: FDDI rings MIS1 to MIS4, switched Ethernet, Gigaswitches, routers, OC3 links (100 Mb/sec each) and DS3 links (45 Mb/sec each) to the Internet. All servers in Building 11 are accessible from corpnet.]

  11. HotMail: ~400 Computers

  12. Inktomi (hotbot), WebTV: > 200 nodes • Inktomi: ~250 UltraSparcs • web crawl • index crawled web and save index • Return search results on demand • Track Ads and click-thrus • ACID vs BASE (basic Availability, Serialized Eventually) • Web TV • ~200 UltraSparcs • Render pages, Provide Email • ~ 4 Network Appliance NFS file servers • A large Oracle app tracking customers

  13. Loki: Pentium Clusters for Science http://loki-www.lanl.gov/ • 16 Pentium Pro Processors x 5 Fast Ethernet interfaces + 2 Gbytes RAM + 50 Gbytes Disk + 2 Fast Ethernet switches + Linux = 1.2 real Gflops for $63,000 (but that is the 1996 price) • Beowulf project is similar http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html • Scientists want cheap mips.
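
The "cheap mips" point can be made concrete by dividing price by delivered performance. A small sketch using only the two Loki figures quoted on the slide; the variable names are illustrative:

```python
price_dollars = 63_000      # 1996 price from the slide
sustained_mflops = 1_200    # "1.2 real Gflops" expressed in Mflops
processors = 16

dollars_per_mflops = price_dollars / sustained_mflops
mflops_per_processor = sustained_mflops / processors

print(f"${dollars_per_mflops:.2f} per sustained Mflops")
print(f"{mflops_per_processor:.0f} Mflops per processor")
```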

  14. Your Tax Dollars At Work: ASCI for Stockpile Stewardship • Intel/Sandia: 9000x1 node Ppro • LLNL/IBM: 512x8 PowerPC (SP2) • LNL/Cray: ? • Maui Supercomputer Center: 512x1 SP2

  15. Berkeley NOW (network of workstations) Project http://now.cs.berkeley.edu/ • 105 nodes • Sun UltraSparc 170, 128 MB, 2x2GB disk • Myrinet interconnect (2x160MBps per node) • SBus (30MBps) limited • GLUNIX layer above Solaris • Inktomi (HotBot search) • NAS Parallel Benchmarks • Crypto cracker • Sort 9 GB per second

  16. Wisconsin COW • 40 UltraSparcs 64MB + 2x2GB disk+ Myrinet • SUN OS • Used as a compute engine

  17. Andrew Chien’s JBOB http://www-csag.cs.uiuc.edu/individual/achien.html • 48 nodes • 36 HP 2PIIx128 1 disk Kayak boxes • 10 Compaq 2PIIx128 1 disk, Wkstation 6000 • 32-Myrinet & 16-ServerNet connected • Operational • All running NT

  18. NCSA Cluster • The National Center for Supercomputing Applications, University of Illinois @ Urbana • 500 Pentium cpus, 2k disks, SAN • Compaq + HP + Myricom • A Super Computer for 3M$ • Classic Fortran/MPI programming • NT + DCOM programming model

  19. 4 B PC’s: The Bricks of Cyberspace (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G) • Cost 1,000 $ • Come with • NT • DBMS • High speed Net • System management • GUI / OOUI • Tools • Compatible with everyone else • CyberBricks

  20. Super Server: 4T Machine • Array of 1,000 4B machines • 1 Bips processors • 1 BB DRAM • 10 BB disks • 1 Bbps comm lines • 1 TB tape robot • A few megabucks • Challenge: • Manageability • Programmability • Security • Availability • Scaleability • Affordability • As easy as a single system • Future servers are CLUSTERS of processors, discs • Distributed database techniques make clusters work [Figure labels: Cyber Brick, a 4B machine; CPU, 50 GB Disc, 5 GB RAM]

  21. Cluster Vision: Buying Computers by the Slice • Rack & Stack • Mail-order components • Plug them into the cluster • Modular growth without limits • Grow by adding small modules • Fault tolerance: • Spare modules mask failures • Parallel execution & data search • Use multiple processors and disks • Clients and servers made from the same stuff • Inexpensive: built with commodity CyberBricks

  22. Nostalgia: Behemoth in the Basement • Today’s PC is yesterday’s supercomputer • Can use LOTS of them • Main Apps changed: scientific -> commercial -> web • Web & Transaction servers • Data Mining, Web Farming

  23. SMP -> nUMA: BIG FAT SERVERS • Directory based caching lets you build large SMPs • Every vendor building a HUGE SMP • 256 way • 3x slower remote memory • 8-level memory hierarchy • L1, L2 cache • DRAM • remote DRAM (3, 6, 9,…) • Disk cache • Disk • Tape cache • Tape • Needs • 64 bit addressing • nUMA sensitive OS (not clear who will do it) • Or Hypervisor • like IBM LSF, Stanford Disco www-flash.stanford.edu/Hive/papers.html • You get an expensive cluster-in-a-box with very fast network

  24. Great Debate: Shared What? • Shared Memory (SMP): easy to program, difficult to build, difficult to scale (SGI, Sun, Sequent) • Shared Disk (VMScluster, Sysplex) • Shared Nothing (network): hard to program, easy to build, easy to scale (Tandem, Teradata, SP2, NT) • NUMA blurs distinction, but has its own problems [Figure label: CLIENTS]

  25. Thesis: Many little beat few big • Smoking, hairy golf ball • How to connect the many little parts? • How to program the many little parts? • Fault tolerance? [Figure labels: price classes $1 million, $100 K, $10 K; machine classes Mainframe, Mini, Micro, Nano, Pico Processor; storage levels 10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; capacities 1 MB, 100 MB, 10 GB, 1 TB, 100 TB; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; 1 mm^3; 1 M SPEC marks, 1 TFLOP; 10^6 clocks to bulk ram; Event-horizon on chip; VM reincarnated; Multi-program cache, On-Chip SMP]

  26. A Hypothetical Question: Taking things to the limit • Moore’s law 100x per decade: • Exa-instructions per second in 30 years • Exa-bit memory chips • Exa-byte disks • Gilder’s Law of the Telecosm: 3x/year more bandwidth, 60,000x per decade! • 40 Gbps per fiber today
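
Both laws on this slide are compound-growth statements, so the headline multipliers can be checked directly. A minimal sketch; `compound_growth` is an illustrative helper, not something from the talk:

```python
def compound_growth(factor_per_period, periods):
    """Total growth after `periods` periods at `factor_per_period` each."""
    return factor_per_period ** periods


moore_30_years = compound_growth(100, 3)    # 100x per decade, three decades
gilder_per_decade = compound_growth(3, 10)  # 3x per year, ten years

print(f"Moore, 30 years:   {moore_30_years:,}x")
print(f"Gilder, 10 years:  {gilder_per_decade:,}x")
```

Three decades at 100x per decade is a factor of one million, the gap between tera-scale (10^12) and exa-scale (10^18); taking tera-scale as the starting point is an assumption here, not a figure from the slide. Ten years at 3x per year is 59,049x, which the slide rounds to 60,000x.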

  27. Gilder’s Telecosm Law: 3x bandwidth/year for 25 more years • Today: • 10 Gbps per channel • 4 channels per fiber: 40 Gbps • 32 fibers/bundle = 1.2 Tbps/bundle • In lab 3 Tbps/fiber (400 x WDM) • In theory 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth • 1 fiber = 25 Tbps
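
The "today" figures multiply out as follows. A short sketch using only the channel, fiber, and bundle counts quoted on the slide:

```python
gbps_per_channel = 10
channels_per_fiber = 4
fibers_per_bundle = 32

gbps_per_fiber = gbps_per_channel * channels_per_fiber   # per-fiber bandwidth
gbps_per_bundle = gbps_per_fiber * fibers_per_bundle     # per-bundle bandwidth

print(f"{gbps_per_fiber} Gbps per fiber")
print(f"{gbps_per_bundle / 1000:.2f} Tbps per bundle")
```

The product is 1.28 Tbps per bundle, which the slide states as 1.2 Tbps.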

  28. Networking: BIG!! Changes coming! • CHALLENGE: reduce software tax on messages • Today: 30 K ins + 10 ins/byte • Goal: 1 K ins + .01 ins/byte • Best bet: • SAN/VIA • Smart NICs • Special protocol • User-Level Net IO (like disk) • Technology • 10 GBps bus “now” • 1 Gbps links “now” • 1 Tbps links in 10 years • Fast & cheap switches • Standard interconnects • processor-processor • processor-device (=processor) • Deregulation WILL work someday
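
The "software tax" on this slide is a linear cost model: a fixed instruction count per message plus a per-byte count. A sketch of that model with the slide's two sets of coefficients; the 8 KB message size is a hypothetical value chosen for illustration:

```python
def instructions_per_message(fixed_ins, ins_per_byte, message_bytes):
    """Linear cost model: fixed overhead plus per-byte work."""
    return fixed_ins + ins_per_byte * message_bytes


MESSAGE_BYTES = 8 * 1024   # hypothetical message size, not from the slide

today = instructions_per_message(30_000, 10, MESSAGE_BYTES)
goal = instructions_per_message(1_000, 0.01, MESSAGE_BYTES)

print(f"Today: {today:,.0f} instructions")
print(f"Goal:  {goal:,.0f} instructions")
print(f"Reduction: about {today / goal:.0f}x")
```

For small messages the fixed term dominates and for large ones the per-byte term does, so the goal has to cut both.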

  29. What if Networking Was as Cheap As Disk IO? • TCP/IP: Unix/NT 100% cpu @ 40MBps • Disk: Unix/NT 8% cpu @ 40MBps • Why the Difference? • Host does TCP/IP packetizing, checksum,… flow control, small buffers • Host Bus Adapter does SCSI packetizing, checksum,… flow control, DMA
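
The two utilization figures can be turned into CPU time per megabyte moved, which makes the gap easier to compare. A minimal sketch, assuming both percentages refer to one processor at the same 40 MBps transfer rate:

```python
RATE_MB_PER_S = 40   # transfer rate quoted for both cases


def cpu_ms_per_mb(cpu_fraction, rate_mb_per_s=RATE_MB_PER_S):
    """CPU milliseconds consumed per megabyte transferred."""
    return cpu_fraction * 1000 / rate_mb_per_s


tcp_ms = cpu_ms_per_mb(1.00)    # TCP/IP: 100% cpu
disk_ms = cpu_ms_per_mb(0.08)   # Disk IO: 8% cpu

print(f"TCP/IP: {tcp_ms:.1f} ms of cpu per MB")
print(f"Disk:   {disk_ms:.1f} ms of cpu per MB")
print(f"Ratio:  {tcp_ms / disk_ms:.1f}x")
```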

  30. The Promise of SAN/VIA: 10x better in 2 years • Today: • wires are 10 MBps (100 Mbps Ethernet) • ~20 MBps tcp/ip saturates 2 cpus • round-trip latency is ~300 us • In two years • wires are 100 MBps (1 Gbps Ethernet, ServerNet,…) • tcp/ip ~ 100 MBps 10% of each processor • round-trip latency is 20 us • works in lab today • assumes app uses zero-copy Winsock2 api. • See http://www.viarch.org/

  31. Functionally Specialized Cards • Storage • Network • Display • Each card: ASIC + P mips processor + M MB DRAM • Today: P = 20 mips, M = 2 MB • In a few years: P = 200 mips, M = 64 MB

  32. It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get • several network interfaces • a Postscript engine • cpu, • memory, • software, • a spooler (soon) • and… a print engine.

  33. System On A Chip • Integrate Processing with memory on one chip • chip is 75% memory now • 1MB cache >> 1960 supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM,… projects abound • Integrate Networking with processing on one chip • system bus is a kind of network • ATM, FiberChannel, Ethernet,.. Logic on chip. • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.

  34. All Device Controllers will be Cray 1’s • TODAY • Disk controller is 10 mips risc engine with 2MB DRAM • NIC is similar power • SOON • Will become 100 mips systems with 100 MB DRAM. • They are nodes in a federation (can run Oracle on NT in disk controller). • Advantages • Uniform programming model • Great tools • Security • economics (cyberbricks) • Move computation to data (minimize traffic) [Figure labels: Central Processor & Memory, Tera Byte Backplane]

  35. With Tera Byte Interconnect and Super Computer Adapters • Processing is incidental to • Networking • Storage • UI • Disk Controller/NIC is • faster than device • close to device • Can borrow device package & power • So use idle capacity for computation. • Run app in device. [Figure label: Tera Byte Backplane]

  36. Implications • Conventional • Offload device handling to NIC/HBA • higher level protocols: I2O, NASD, VIA… • SMP and Cluster parallelism is important. • Radical • Move app to NIC/device controller • higher-higher level protocols: CORBA / DCOM. • Cluster parallelism is VERY important. [Figure labels: Central Processor & Memory, Tera Byte Backplane]

  37. How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: A federation. • Each node does not completely trust the others. • Nodes use RPC to talk to each other • CORBA? DCOM? IIOP? RMI? HTTP? • One or all of the above. • Huge leverage in high-level interfaces. • Same old distributed system story. [Figure labels, one stack per node: Applications; datagrams, streams, RPC, ?; VIAL/VIPL; Wire(s)]

  38. Restatement • The huge clusters we saw are prototypes for this: • A Federation of Functionally specialized nodes • Each node shrinks to a “point” device • With embedded processing. • Each node / device is autonomous • Each talks a high-level protocol

  39. Outline • Clusters of Hardware CyberBricks • all nodes are very intelligent • Processing migrates to where the power is • Disk, network, display controllers have full-blown OS • Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA) • Computer is a federated distributed system. • Software CyberBricks • standard way to interconnect intelligent nodes • needs execution model • needs parallelism

  40. Software CyberBricks: Objects! • It’s a zoo • Objects and 3-tier computing (transactions) • Give natural distribution & parallelism • Give remote management! • TP & Web: Dispatch RPCs to pool of object servers • Components are a 1B$ business today! • Need a Parallel & distributed computing model

  41. The COMponent Promise • Objects are Software CyberBricks • productivity breakthrough (plug ins) • manageability breakthrough (modules) • Microsoft: DCOM + ActiveX • IBM/Sun/Oracle/Netscape: CORBA + Java Beans • Both promise • parallel distributed execution • centralized management of distributed system • Both camps share key goals: • Encapsulation: hide implementation • Polymorphism: generic ops, key to GUI and reuse • Uniform Naming • Discovery: finding a service • Fault handling: transactions • Versioning: allow upgrades • Transparency: local/remote • Security: who has authority • Shrink-wrap: minimal inheritance • Automation: easy

  42. History and Alphabet Soup • Microsoft DCOM based on OSF-DCE Technology • DCOM and ActiveX extend it [Timeline figure, 1985 / 1990 / 1995; labels: UNIX International, Open Software Foundation (OSF), X/Open, Open Group, Object Management Group (OMG), ODBC, XA / TX, OSF DCE, DCE RPC, GUIDs, IDL, DNS, Kerberos, NT, Solaris, COM, CORBA]

  43. The OLE-COM Experience • Macintosh had Publish & Subscribe • PowerPoint needed graphs: • plugged MS Graph in as a component. • Office adopted OLE • one graph program for all of Office • Internet arrived • URLs are object references, • Office is Web Enabled right away! • Office97 smaller than Office95 because of shared components • It works!!

  44. Linking And Embedding: Objects are data modules; transactions are execution modules • Link: pointer to object somewhere else • Think URL in Internet • Embed: bytes are here • Objects may be active; can callback to subscribers
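
The link/embed distinction and the callback idea can be illustrated with a toy data model. This is only a sketch: the class names, the `STORE` dictionary, and the `store://` reference scheme are all hypothetical and have nothing to do with the actual OLE or COM interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Embedded:
    data: bytes        # embed: the bytes live inside the containing document


@dataclass
class Link:
    reference: str     # link: a pointer to an object stored somewhere else


# Hypothetical object store standing in for "somewhere else".
STORE: Dict[str, bytes] = {"store://charts/q1": b"<chart bytes>"}


def read_part(part) -> bytes:
    """Return a part's bytes, following the reference if it is a link."""
    if isinstance(part, Embedded):
        return part.data
    return STORE[part.reference]


class ActiveObject:
    """An object that calls back to its subscribers when it changes."""

    def __init__(self, value):
        self.value = value
        self._subscribers: List[Callable[[object], None]] = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def update(self, value):
        self.value = value
        for callback in self._subscribers:
            callback(value)


document = [Embedded(b"hello"), Link("store://charts/q1")]
contents = [read_part(part) for part in document]

seen = []
chart = ActiveObject(1)
chart.subscribe(seen.append)   # the subscriber just records each new value
chart.update(2)
```

A linked part reflects whatever the store currently holds, while an embedded part travels with the document; that is the trade-off the slide compares to a URL.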

  45. The BIG Picture: Components and transactions • Software modules are objects • Object Request Broker (a.k.a., Transaction Processing Monitor) connects objects (clients to servers) • Standard interfaces allow software plug-ins • Transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated [Figure label: Object Request Broker]

  46. Object Request Broker (ORB) Orchestrates RPC Transaction • Registers Servers • Manages pools of servers • Connects clients to servers • Does Naming, request-level authorization, • Provides transaction coordination • Direct and queued invocation • Old names: • Transaction Processing Monitor, • Web server, • NetWare Object-Request Broker
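
Several of the broker duties listed on this slide (registering servers, managing a pool, request-level authorization, direct and queued invocation) fit in a few lines. The sketch below is a toy under stated assumptions: `ToyBroker` and all of its methods are hypothetical names, round-robin is just one possible pool policy, and transaction coordination is left out entirely. It is not the API of CORBA, DCOM, or any real TP monitor.

```python
from collections import deque
from itertools import cycle


class ToyBroker:
    """Hypothetical, in-process stand-in for an object request broker."""

    def __init__(self):
        self._pools = {}       # service name -> list of server callables
        self._next = {}        # service name -> round-robin iterator over the pool
        self._allowed = {}     # service name -> set of authorized client ids
        self._queue = deque()  # pending queued invocations

    def register(self, name, server, allowed_clients):
        """Add a server to the pool for `name` and record who may call it."""
        self._pools.setdefault(name, []).append(server)
        self._next[name] = cycle(self._pools[name])
        self._allowed.setdefault(name, set()).update(allowed_clients)

    def invoke(self, client, name, *args):
        """Direct invocation: authorize the request, pick a server, call it."""
        if client not in self._allowed.get(name, set()):
            raise PermissionError(f"{client} may not call {name}")
        server = next(self._next[name])
        return server(*args)

    def enqueue(self, client, name, *args):
        """Queued invocation: remember the request for later."""
        self._queue.append((client, name, args))

    def drain(self):
        """Run all queued requests in arrival order and return their results."""
        results = []
        while self._queue:
            client, name, args = self._queue.popleft()
            results.append(self.invoke(client, name, *args))
        return results


broker = ToyBroker()
broker.register("add", lambda a, b: ("server-1", a + b), {"alice"})
broker.register("add", lambda a, b: ("server-2", a + b), {"alice"})

first = broker.invoke("alice", "add", 1, 2)    # handled by the first pooled server
second = broker.invoke("alice", "add", 1, 2)   # handled by the second pooled server
broker.enqueue("alice", "add", 10, 20)
queued = broker.drain()
```

The client names only the service, never a particular server, so the broker is free to grow or shrink the pool behind it.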

  47. The OO Points So Far • Objects are software Cyber Bricks • Object interconnect standards are emerging • Cyber Bricks become Federated Systems. • Next points: • put processing close to data • do parallel processing.

  48. Three Tier Computing • Clients do presentation, gather input • Clients do some workflow (Xscript) • Clients send high-level requests to ORB • ORB dispatches work-flows and business objects -- proxies for client, orchestrate flows & queues • Server-side workflow scripts call on distributed business objects to execute task [Figure labels: Presentation, workflow, Application Objects, Database]

  49. Transaction Processing Evolution to Three Tier: Intelligence migrated to clients • Mainframe Batch processing (centralized) • Dumb terminals & Remote Job Entry • Intelligent terminals, database backends • Workflow Systems, Object Request Brokers, Application Generators [Figure labels: Mainframe, cards, green screen 3270, Server, TP Monitor, ORB, Active]

  50. Web Evolution to Three Tier: Intelligence migrated to clients (like TP) • Character-mode clients, smart servers • GUI Browsers - Web file servers • GUI Plugins - Web dispatchers - CGI • Smart clients - Web dispatcher (ORB), pools of app servers (ISAPI, Viper), workflow scripts at client & server [Figure labels: archie, gopher, green screen, WAIS, Web Server, Mosaic, NS & IE, Active]
