Scaleable Computing
Jim Gray, Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray/talks/
Outline
• The bandwidth revolution
• ScaleUp, ScaleOut
• TerraServer (Barclay, Slutz, Gray)
Gray @ Nortel 20 April 1999

Gilder’s Law: 3x bandwidth/year for 25 more years
• Today:
  • 10 Gbps per channel
  • 4 channels per fiber: 40 Gbps
  • 32 fibers/bundle = 1.2 Tbps/bundle
• In lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth
[Figure label: 1 fiber = 25 Tbps]

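The bandwidth arithmetic on this slide can be checked with a short sketch. The constants are the figures quoted above; the function names are illustrative only, and the 3x/year factor is Gilder's Law as the slide states it. The product for a bundle comes out at 1.28 Tbps, which the slide gives as 1.2 Tbps.

```python
# Figures quoted on the slide; all names here are illustrative.
GBPS_PER_CHANNEL = 10
CHANNELS_PER_FIBER = 4
FIBERS_PER_BUNDLE = 32


def fiber_gbps():
    """Bandwidth of one fiber in Gbps (channels x per-channel rate)."""
    return GBPS_PER_CHANNEL * CHANNELS_PER_FIBER


def bundle_tbps():
    """Bandwidth of one 32-fiber bundle in Tbps."""
    return fiber_gbps() * FIBERS_PER_BUNDLE / 1000


def years_to_reach(start_gbps, target_gbps, factor_per_year=3):
    """Whole years of 3x/year growth needed to go from start to target."""
    years = 0
    bandwidth = start_gbps
    while bandwidth < target_gbps:
        bandwidth *= factor_per_year
        years += 1
    return years


if __name__ == "__main__":
    print("Per fiber today (Gbps):", fiber_gbps())
    print("Per bundle today (Tbps):", bundle_tbps())
    # From 40 Gbps per fiber to the theoretical 25 Tbps per fiber.
    print("Years at 3x/year:", years_to_reach(fiber_gbps(), 25_000))
```

Under these assumptions, today's 40 Gbps fiber would reach the theoretical 25 Tbps limit in about six years of 3x/year growth.
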
Networking: BIG!! Changes coming!
• Technology
  • 1 GBps bus “now”
  • 1 Gbps links “now”
  • 1 Tbps links in 10 years
  • Fast & cheap switches
• Standard wires for interconnect
  • processor-processor
  • processor-device (=processor)
• Deregulation WILL work someday
• Software improving: user-level Net-IO
• Software challenge: reduce software tax on messages
  • Today: 30 K ins + 10 ins/byte
  • Goal: 1 K ins + .01 ins/byte

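A minimal sketch of the "software tax" cost model above: a fixed per-message instruction count plus a per-byte cost. The two parameter sets are the slide's "Today" and "Goal" figures; the message sizes are arbitrary examples chosen for illustration.

```python
def instructions_per_message(size_bytes, fixed, per_byte):
    """Instructions spent by software to send one message of size_bytes."""
    return fixed + per_byte * size_bytes


# Parameters from the slide.
TODAY = dict(fixed=30_000, per_byte=10)
GOAL = dict(fixed=1_000, per_byte=0.01)

if __name__ == "__main__":
    # Example message sizes (assumed, not from the slide).
    for size in (100, 8_192, 1_000_000):
        today = instructions_per_message(size, **TODAY)
        goal = instructions_per_message(size, **GOAL)
        print(f"{size:>9} bytes: today {today:>12,.0f} ins, "
              f"goal {goal:>10,.0f} ins, ratio {today / goal:,.0f}x")
```

For small messages the fixed cost dominates (roughly 30x saving); for large messages the per-byte cost dominates and the saving approaches 1,000x.
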
Technology (hardware)
NOW
• CPU: nearing 1 BIPS, but CPI rising fast (2-10), so less than 100 mips; 1 $/mips to 10 $/mips
• DRAM: 3 $/MB
• DISK: 20 $/GB
• TAPE: 20 GB/tape, 6 MBps; lags disk; 2 $/GB offline, 15 $/GB nearline
• BUS/SAN: 10/1 GBps
• WAN: 0.1 Mbps
2003 Forecast (10x better)
• CPU: 1 bips real (smp); 0.1 $ - 1 $/mips
• DRAM: 1 Gb chip; 0.1 $/MB
• Disk: 10 GB smart cards; 500 GB RAID5 packs (NT inside); 3 $/GB
• BUS/SAN: 100/10 GBps
• WAN: 1 Gbps

Microsoft SAN Infrastructure: WinSock Direct Path
• 110 MBps (that’s B not b)
• 10% cpu (not 200%)
• Network faster than most IO attachments
[Diagram: two stacks joined by a Switch; labels in each stack: App, Winsock, MsAfd, HwSPI, U, K, VIA, AFD, TCP, IP, NDIS, MiniPort, HW]

SAN: Standard Interconnect
• Gbps SAN: 110 MBps
• LAN faster than memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer
• Winsock: 110 MBps (10% cpu utilization at each end)
• PCI: 70 MBps
• UW Scsi: 40 MBps
• FW scsi: 20 MBps
• scsi: 5 MBps
• RIP FDDI, RIP ATM, RIP FC, RIP SCI, RIP ?, RIP SCSI

Outline
• The bandwidth revolution
• ScaleUp, ScaleOut
• TerraServer (Barclay, Slutz, Gray)

Latency: How Far Away is the Data?
• Registers: 1 (My Head, 1 min)
• On Chip Cache: 2 (This Room)
• On Board Cache: 10 (This Campus, 10 min)
• Memory: 100 (Sacramento, 1.5 hr)
• Disk: 10^6 (Pluto, 2 Years)
• Tape / Optical Robot: 10^9 (Andromeda, 2,000 Years)

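The analogy can be reproduced with a small sketch, assuming that one unit of relative latency maps to one minute of human time (which is what the "Registers: 1, 1 min" entry suggests). The level names and latency values are from the slide; the formatting helper is hypothetical.

```python
# (level, relative latency) pairs from the slide.
LEVELS = [
    ("Registers", 1),
    ("On Chip Cache", 2),
    ("On Board Cache", 10),
    ("Memory", 100),
    ("Disk", 10**6),
    ("Tape / Optical Robot", 10**9),
]


def human_time(minutes):
    """Render a duration in minutes with a human-scale unit."""
    if minutes < 60:
        return f"{minutes:g} min"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hr"
    days = hours / 24
    if days < 365:
        return f"{days:.0f} days"
    return f"{days / 365:,.0f} years"


if __name__ == "__main__":
    # Assumption: 1 unit of latency == 1 minute of human time.
    for name, latency in LEVELS:
        print(f"{name:<22} {latency:>13,}  ~ {human_time(latency)}")
```

With that scaling, 100 units is about 1.7 hours, 10^6 is about 2 years, and 10^9 is about 1,900 years, in line with the Sacramento, Pluto, and Andromeda entries.
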
System On A Chip
• Integrate Processing with memory on one chip
  • chip is 75% memory now
  • 1 MB cache >> 1960 supercomputers
  • 256 Mb memory chip is 32 MB!
  • IRAM, CRAM, PIM,… projects abound
• Integrate Networking with processing on one chip
  • system bus is a kind of network
  • ATM, FiberChannel, Ethernet,.. Logic on chip.
  • Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.

Scaleability: Scale Up and Scale Out
• Grow Up with SMP: 4xP6 is now standard
• Grow Out with Cluster: Cluster has inexpensive parts
[Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]

There'll be Billions (Trillions) Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous

Billions (Trillions) Of Clients Need Millions (Billions) Of Servers
• All clients networked to servers
  • May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
  • Shared Data
  • Control
  • Coordination
  • Communication
[Diagram labels: Clients (Mobile clients, Fixed clients); Servers (Server, Super server)]

Thin Client Support (FAT SERVERS)
• TSO comes to NT
• lower per-client costs
[Diagram labels: Windows NT Server, Terminal Server; Dedicated Windows terminal; Net PC; Existing, Desktop PC; MS-DOS, UNIX, Mac clients]

Windows 2000 IntelliMirror™
• Extends CMU Coda File System ideas
• Files and settings mirrored on client and server
• Great for disconnected users
• Facilitates roaming
• Easy to replace PCs
• Optimizes network performance
FAT STORAGE SERVERS

SMP -> nUMA: BIG FAT SERVERS
• Directory based caching lets you build large SMPs
• Every vendor building a HUGE SMP
  • 256 way
  • 3x slower remote memory
  • 8-level memory hierarchy
    • L1, L2 cache
    • DRAM
    • remote DRAM (3, 6, 9,…)
    • Disk cache
    • Disk
    • Tape cache
    • Tape
• Needs
  • 64 bit addressing
  • nUMA sensitive OS (not clear who will do it)
• Or Hypervisor
  • like IBM LSF, Stanford Disco
  • www-flash.stanford.edu/Hive/papers.html
• Not certain what happens next

Thesis: Many little beat few big
[Figure labels]
• Price: $1 million, $100 K, $10 K
• Class: Mainframe, Mini, Micro, Nano, Pico Processor
• Disk form factor: 14", 9", 5.25", 3.5", 2.5", 1.8"
• Storage hierarchy: 10 pico-second ram (1 MB), 10 nano-second ram (100 MB), 10 microsecond ram (10 GB), 10 millisecond disc (1 TB), 10 second tape archive (100 TB)
• 1 MM^3; 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk ram; Event-horizon on chip; VM reincarnated; Multi-program cache, On-Chip SMP
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance & Management?

4 B PC’s (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G): The Bricks of Cyberspace
• Cost 1,000 $
• Come with
  • NT
  • DBMS
  • High speed Net
  • System management
  • GUI / OOUI
  • Tools
• Compatible with everyone else
• CyberBricks

Super Server: 4T Machine
• Array of 1,000 4B machines
  • 1 Bips processors
  • 1 BB DRAM
  • 10 BB disks
  • 1 Bbps comm lines
  • 1 TB tape robot
• A few megabucks
• Challenge:
  • Manageability
  • Programmability
  • Security
  • Availability
  • Scaleability
  • Affordability
• As easy as a single system
• Future servers are CLUSTERS of processors, discs
• Distributed database techniques make clusters work
[Diagram labels: Cyber Brick, a 4B machine; CPU; 5 GB RAM; 50 GB Disc]

Scale OUT: Clusters Have Advantages
• Fault tolerance:
  • Spare modules mask failures
• Modular growth without limits
  • Grow by adding small modules
• Parallel data search
  • Use multiple processors and disks
• Clients and servers made from the same stuff
  • Inexpensive: built with commodity CyberBricks

1988: IBM DB2 + CICS Mainframe: 65 tps
• IBM 4391
• Simulated network of 800 clients
• 2 M$ computer
• Staff of 6 to do benchmark
[Diagram labels: 2 x 3725 network controllers; Refrigerator-sized CPU; 16 GB disk farm (4 x 8 x .5 GB)]

1987: Tandem Mini @ 256 tps
• 14 M$ computer (Tandem)
• A dozen people (1.8 M$/y)
• False floor, 2 rooms of machines
[Diagram labels: 32 node processor array; 40 GB disk array (80 drives); Simulate 25,600 clients. Staff: Admin expert, Performance expert, Hardware experts, Network expert, DB expert, OS expert, Auditor, Manager]

1997: 9 years later, 1 Person and 1 box = 1250 tps
• 1 Breadbox ~ 5x 1987 machine room
• 23 GB is hand-held
• One person does all the work
• Cost/tps is 100,000x less: 5 micro dollars per transaction
[Diagram labels: 4 x 200 MHz cpu; 1/2 GB DRAM; 12 x 4 GB disk; 3 x 7 x 4 GB disk arrays. One person is Hardware expert, OS expert, Net expert, DB expert, App expert]

What Happened? Where did the 100,000x come from?
• Moore’s law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity Pricing: 100X (at least)
• Total 100,000X
• 100x from commodity
  • DBMS was 100 K$ to start: now 1 K$ to start
  • IBM 390 MIPS is 7.5 K$ today
  • Intel MIPS is 10 $ today
  • Commodity disk is 50 $/GB vs 1,500 $/GB
  • ...
[Chart: price vs. time for mainframe, mini, micro]

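The decomposition is a product of independent factors, which a few lines make explicit. This is only a sketch of the slide's arithmetic; the dictionary keys are descriptive labels chosen here, and the price pairs are the ones listed under "100x from commodity".

```python
from math import prod

# Improvement factors as stated on the slide.
FACTORS = {
    "Moore's law": 100,
    "Software improvements": 10,
    "Commodity pricing": 100,
}

# (proprietary price, commodity price) pairs quoted on the slide, in dollars.
PRICE_PAIRS = {
    "DBMS entry price": (100_000, 1_000),
    "Price per MIPS (IBM 390 vs Intel)": (7_500, 10),
    "Disk price per GB": (1_500, 50),
}


def total_improvement(factors):
    """Overall improvement is the product of the individual factors."""
    return prod(factors.values())


def commodity_ratios(pairs):
    """Ratio of proprietary to commodity price for each item."""
    return {name: old / new for name, (old, new) in pairs.items()}


if __name__ == "__main__":
    print(f"Total: {total_improvement(FACTORS):,}x")
    for name, ratio in commodity_ratios(PRICE_PAIRS).items():
        print(f"{name}: {ratio:,.0f}x")
```

The individual commodity ratios range from 30x (disk) to 750x (MIPS), so "100X (at least)" reads as a rough central figure rather than a bound on every component.
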
Computers shrink to a point
• Disks 100x in 10 years: 2 TB 3.5” drive
• Shrink to 1” is 200 GB
• Disk is super computer!
• This is already true of printers and “terminals”
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

All Device Controllers will be Cray 1’s
• TODAY
  • Disk controller is 10 mips risc engine with 2 MB DRAM
  • NIC is similar power
• SOON
  • Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation (can run Oracle on NT in disk controller).
• Advantages
  • Uniform programming model
  • Great tools
  • Security
  • economics (cyberbricks)
  • Move computation to data (minimize traffic)
[Diagram labels: Central Processor & Memory; Tera Byte Backplane]

It’s Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get
  • several network interfaces
  • a Postscript engine
  • cpu,
  • memory,
  • software,
  • a spooler (soon)
  • and… a print engine.

Functionally Specialized Cards
• Storage
• Network
• Display
Each card: ASIC, P mips processor, M MB DRAM
• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB

Implications
Conventional
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA…
• SMP and Cluster parallelism is important.
Radical
• Move app to NIC/device controller
• higher-higher level protocols: DCOM.
• Cluster parallelism is VERY important.
[Diagram label: Central Processor & Memory]

How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  • DCOM? IIOP? RMI?
  • One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Diagram: two stacks joined by Wire(s); labels in each stack: Applications, datagrams, streams, RPC, ?, VIAL/VIPL]

Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Diagram, software stack labels: Applications; Services; DBMS; RPC, ...; File System; SAN driver; Disk driver; OS Kernel]

Scaleability: Scale Up and Scale Out
• Grow Up with SMP: 4xP6 is now standard
• Grow Out with Cluster: Cluster has inexpensive parts
[Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]

HotMail: ~300 Computers
• FreeBSD and Solaris

Microsoft.com: ~150 nodes

Other Clusters
• 16-node Cluster
  • 64 cpus
  • 2 TB of disk
  • Decision support
• 45-node Compaq Cluster
  • 140 cpus
  • 14 GB DRAM
  • 4 TB RAID disk
  • OLTP (Debit Credit)
  • 1 B tpd (14 k tps)

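The relation between transactions per day (tpd) and transactions per second (tps) is a unit conversion. The sketch below assumes a uniform load over a 24-hour day, which the slide does not state; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tps_from_tpd(tpd):
    """Average transactions per second for a given daily total."""
    return tpd / SECONDS_PER_DAY


def tpd_from_tps(tps):
    """Daily total if the given rate is sustained for 24 hours."""
    return tps * SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"1 B tpd  -> {tps_from_tpd(1e9):,.0f} tps on average")
    print(f"14 k tps -> {tpd_from_tps(14_000):,} tpd if sustained")
```

A sustained 14 k tps corresponds to about 1.2 billion transactions per day, so the slide's "1 B tpd" is a round figure for the same measurement.
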
Berkeley NOW (network of workstations) Project
http://now.cs.berkeley.edu/
• 105 nodes
  • Sun UltraSparc 170, 128 MB, 2 x 2 GB disk
• Myrinet interconnect (2 x 160 MBps per node)
• SBus (30 MBps) limited
• GLUNIX layer above Solaris
• Inktomi (HotBot search)
• NAS Parallel Benchmarks
• Crypto cracker
• Sort 9 GB per second

NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
• National Center for Supercomputing Applications, University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP + Myricom + WindowsNT
• A Super Computer for 3 M$
• Classic Fortran/MPI programming
• DCOM programming model

Outline
• The bandwidth revolution
• ScaleUp, ScaleOut
• TerraServer (Barclay, Slutz, Gray): A scaleup example

Some Tera-Byte Databases
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week)
  • 15 PB by 2007
• Federal Clearing house: images of checks
  • 15 PB by 2006 (7 year history)
• Nuclear Stockpile Stewardship Program
  • 10 Exabytes (???!!)
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

Info Capture
• You can record everything you see or hear or read.
• What would you do with it?
• How would you organize & analyze it?
• Video: 8 PB per lifetime (10 GBph)
• Audio: 30 TB (10 KBps)
• Read or write: 8 GB (words)
See: http://www.lesk.com/mlesk/ksg97/ksg.html
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta; markers: A letter, A novel, A Movie, Library of Congress (text), LoC (image), All Disks, All Tapes]

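The lifetime totals can be sanity-checked against the stated data rates. This sketch assumes decimal units (1 PB = 10^15 bytes) and continuous, round-the-clock recording; neither assumption is stated on the slide, and the names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365

# Assumed decimal units.
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15


def years_to_fill(total_bytes, bytes_per_hour):
    """Years of continuous recording needed to accumulate total_bytes."""
    return total_bytes / bytes_per_hour / HOURS_PER_YEAR


VIDEO_YEARS = years_to_fill(8 * PB, 10 * GB)         # 10 GB per hour
AUDIO_YEARS = years_to_fill(30 * TB, 10 * KB * 3600)  # 10 KB per second

if __name__ == "__main__":
    print(f"Video: 8 PB at 10 GBph  -> {VIDEO_YEARS:.0f} years")
    print(f"Audio: 30 TB at 10 KBps -> {AUDIO_YEARS:.0f} years")
```

Both rates give roughly 90 to 95 years, so the totals correspond to recording continuously over one long lifetime.
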
Michael Lesk’s Points
www.lesk.com/mlesk/ksg97/ksg.html
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious Resource: Human attention
• Auto-Summarization, Auto-Search will be a key enabling technology.

The TerraServer
http://www.terraserver.microsoft.com/

Database & application UI
• Concept: User navigates an ‘almost seamless’ image of earth
• Coverage: Range from 70ºN to 70ºS
  • today: 35% U.S., 1% outside U.S.
• Source Imagery:
  • 4 TB 1 sq meter/pixel Aerial (USGS - 60,000 46 Mb B&W - 151 Mb Color IR files)
  • 1 TB 1.56 meter/pixel Satellite (Spin-2 - 2400 300 Mb B&W)
• Display Imagery: 200x200 pixel images, subsample to build image pyramid
• Nav Tools:
  • 1.5 M place names
  • “Click-on” Coverage map
  • Expedia & Virtual Globe map
  • Pick of the week
[Figure labels: 200 x 200 m tile; .4 x .4 km browse; .8 x .8 km; 8 m thumbnail; 1.6 x 1.6 km “city view”]

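A rough sketch of the image-pyramid geometry: every display image is 200 x 200 pixels, and each coarser level covers more ground per pixel. The 2x subsampling factor per level is an assumption made here; it is consistent with the 200 m, .4 km, .8 km, and 1.6 km figure labels but is not stated explicitly on the slide.

```python
TILE_PIXELS = 200  # display images are 200 x 200 pixels (from the slide)


def pyramid(base_m_per_pixel=1.0, levels=4, factor=2):
    """Return (metres per pixel, ground width of one tile in metres) per level.

    factor=2 is an assumed subsampling ratio between adjacent levels.
    """
    result = []
    resolution = base_m_per_pixel
    for _ in range(levels):
        result.append((resolution, resolution * TILE_PIXELS))
        resolution *= factor
    return result


if __name__ == "__main__":
    for m_per_pixel, width_m in pyramid():
        print(f"{m_per_pixel:>4.0f} m/pixel -> tile covers "
              f"{width_m / 1000:.1f} x {width_m / 1000:.1f} km")
```

Starting from 1 m/pixel, three rounds of subsampling reach 8 m/pixel, where one 200-pixel tile spans 1.6 km.
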
New Data Coming
• USGS “DOQ”: 4 TB image data, 6 TB coming
• Spin-2: 1 TB, WorldWide
• DRG: 50,000 TopoMaps, adding now

Software Architecture
[Diagram labels]
• Web Client: HTML, Java Viewer; IE 3…5, Netscape 3…4
• The Internet
• Terra-Server Web Site: Internet Information Server 4.0; Terra-Server Active Server Pages (24); Active Data Object; ODBC; Terra-Server Stored Procedures (19); SQL Server 7.0; Terra-Server DB 39 (14 Img) (8 Place)
• Image Commerce Site(s): Microsoft Site Server EE 3.0; SPIN-2/USGS Store Active Server Pages (13); Image Delivery Application; SQL Server

How Images are Found
[Pie chart; labels: Expedia Map, Name Search, Famous Places, Geo Coverage Map, Coordinate; values in source order: 22%, 40%, 18%, 1%, 19%]

TerraServer: Lots of Web Hits
As of Feb 28, 1999
• Today:
  • 1.7 billion web hits
  • 1 TB, largest SQL DB on the Web
  • 100 qps average, 1,000 qps peak
  • 1.5 B SQL queries so far
Summary        Total    Max     Average
Unique Users   17 M     150 k   69 k
Sessions       24 M     172 k   94 k
Hits           1.7 B    29 M    6.8 M
Page Views     274 M    1.1 M   6.6 M
DB Queries     1.5 B    18 M    5.8 M
Image Xfers    1.3 B    15 M    5.0 M

Image Data & Meta Data
• Gazetteer: index on
  • image, place, type
  • image, state, type
  • image, state, country, type
  • image, place, state, type
  • image, place, country, type
  • all lookups are fast
• Lookup by UGrid or ZGrid ID plus resolution
  • Lookups are fast.
• Indices are in DRAM (auto-magically by SQL)
• SQL manages all the tiles and indices
• Images are brought in on demand
[Diagram: Logical Schema; labels: Country Name, State Name, Place Name, PlaceType, Feature Type, Where Am I, Spin Frame Meta, Theme Meta Information, Tile Meta, Img Meta, Browse Img, Tile Img, Jump Img, Thumb Img]

Image Load and Update
[Diagram labels: DLT Tape (“tar”); Staging Disk; Metadata; Load DB; Active Server Pages Cut & Load Scheduling System; Image Cutter; Merge; Dither; Image Pyramid from base; TerraLoader; ODBC Tx; JPEG tiles; TerraServer SQL DBMS]

TerraServer Administrator Web Site
• Accessible by Microsoft, SPIN-2, and USGS
• Web browser forms to:
  • Edit Famous Places list
  • Modify Image Status fields
  • Define new TerraServer Administrators

Load & Backup & Recovery
• Backup and Recovery
  • Using Legato Networker integrated with SQL Backup/Restore Utility
  • Fast, incremental, differential, online
• Restore
  • Fast, incremental (file oriented), not online.
• SQL Server Enterprise Manager
  • DBA Maintenance
  • SQL Performance Monitor