480 likes | 491 Views
Scaleable WindowsNT?. Jim Gray Microsoft Research Gray@Microsoft.com http://research.Microsoft.com/~Gray. Outline. What is Scalability? Why does Microsoft care about ScaleUp Current ScaleUp Status? NT5 & SQL7 & Exchange. SMP. Super Server. Departmental. Server. Personal. System.
E N D
Scaleable WindowsNT? • Jim GrayMicrosoft Research Gray@Microsoft.comhttp://research.Microsoft.com/~Gray
Outline • What is Scalability? • Why does Microsoft care about ScaleUp • Current ScaleUp Status? • NT5 & SQL7 & Exchange
SMP Super Server Departmental Server Personal System Scale Up and Scale Out Grow Up with SMP 4xP6 is now standard Grow Out with Cluster Cluster has inexpensive parts Cluster of PCs
Billions Of Clients • Every device will be “intelligent” • Doors, rooms, cars… • Computing will be ubiquitous
Billions Of ClientsNeed Millions Of Servers • All clients networked to servers • May be nomadicor on-demand • Fast clients wantfaster servers • Servers provide • Shared Data • Control • Coordination • Communication Clients Mobileclients Fixedclients Servers Server Super server
3 1 MM 10 nano-second ram 10 microsecond ram 10 millisecond disc 10 second tape archive ThesisMany little beat few big $1 million $10 K $100 K Pico Processor Micro Nano 10 pico-second ram 1 MB Mini Mainframe 10 0 MB 1 0 GB 1 TB 1 00 TB 1.8" 2.5" 3.5" 5.25" 1 M SPECmarks, 1TFLOP 106 clocks to bulk ram Event-horizon on chip VM reincarnated Multiprogram cache, On-Chip SMP 9" 14" • Smoking, hairy golf ball • How to connect the many little parts? • How to program the many little parts? • Fault tolerance?
Outline • What is Scalability • Why does Microsoft care about ScaleUp • Current ScaleUp Status? • NT5 & SQL7 & Exchange
Scalability 100 millionweb hits 1 billion transactions • Scale up: to large SMP nodes • Scale out: to clusters of SMP nodes 1.8 million mail messages 4 terabytes of data
“Commercial” NT Clusters • 16-node Tandem Cluster • 64 cpus • 2 TB of disk • Decision support • 45-node Compaq Cluster • 140 cpus • 14 GB DRAM • 4 TB RAID disk • OLTP (Debit Credit) • 1 B tpd (14 k tps)
Tandem Oracle/NT • 27,383 tpmC • 71.50 $/tpmC • 4 x 6 cpus • 384 disks=2.7 TB
Billion Transactions per Day Project • Built a 45-node Windows NT Cluster (with help from Intel & Compaq) > 900 disks • All off-the-shelf parts • Using SQL Server & DTC distributed transactions • DebitCredit Transaction • Each node has 1/20 th of the DB • Each node does 1/20 th of the work • 15% of the transactions are “distributed”
Type nodes CPUs DRAM ctlrs disks RAID space 20 20x 20x 20x 20x 20x Workflow Compaq MTS Proliant 2 128 1 1 2 GB 2500 20 20x 20x 20x 20x 20x Compaq 36x4.2GB SQL Server Proliant 4 512 4 7x9.1GB 130 GB 5000 Distributed 5 5x 5x 5x 5x 5x Transaction Compaq Coordinator Proliant 4 256 1 3 8 GB 5000 TOTAL 45 140 13 GB 105 895 3 TB Billion Transactions Per Day Hardware • 45 nodes (Compaq Proliant) • Clustered with 100 Mbps Switched Ethernet • 140 cpu, 13 GB, 3 TB.
How Much Is 1 Billion Tpd? • 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute) • ATT • 185 million calls per peak day (worldwide) • Visa ~20 million tpd • 400 million customers • 250K ATMs worldwide • 7 billion transactions (card+cheque) in 1994 • New York Stock Exchange • 600,000 tpd • Bank of America • 20 million tpd checks cleared (more than any other bank) • 1.4 million tpd ATM transactions • Worldwide Airlines Reservations: 250 Mtpd
SQL SQL SQL SQL SQL SQL Infinite, Ubiquitous ScalingRedefining the rules Per Sec Per Min Per Day 10K TPC 166 10,000 14,400,000 1 BTPD 11,574 694,444 1,000,000,000 1.4 BTPD 16,204 972,222 1,400,000,000 IIS MTS All ShippingProducts! COM / ActiveX
Building 11 Staging Servers (7) Ave CFG: 4xP6, Internal WWW Ave CFG: 4xP5, European Data Center premium.microsoft.com IDC Staging Servers 512 RAM, www.microsoft.com 30 GB HD (1) MOSWest (3) Ave CFG: 4xP6, Ave CFG: 4xP6, 512 RAM, FTP Servers 512 RAM, SQLNet 30 GB HD Ave CFG: 4xP5, SQL SERVERS 50 GB HD Feeder LAN 512 RAM, SQL Consolidators (2) Router Download 30 GB HD DMZ Staging Servers Ave CFG: Replication 4xP6, Ave CFG: 4xP6, 512 RAM, FTP Router 1 GB RAM, Live SQL Servers 160 GB HD Download Server 160 GB HD SQL Reporting Ave Cost: $83K Ave CFG: 4xP6, (1) MOSWest Switched Ave CFG: FY98 Fcst: 4xP6, 2 512 RAM, Live SQL Server Ave CFG: Admin LAN 4xP6, Ethernet 512 RAM, 160 GB HD 512 RAM, 160 GB HD Ave Cost: $83K 50 GB HD FY98 Fcst: 12 search.microsoft.com msid.msn.com (1) msid.msn.com register.microsoft.com www.microsoft.com (1) (1) www.microsoft.com (2) (4) Ave CFG: 4xP6, Router (4) 512 RAM, search.microsoft.com Ave CFG: 4xP6, 30 GB HD Japan Data Center (3) 512 RAM, SQL SERVERS www.microsoft.com 50 GB HD Ave CFG: premium.microsoft.com 4xP6, (2) (3) 512 RAM, Ave CFG: 4xP6, (1) 30 GB HD home.microsoft.com 512 RAM, Ave CFG: 4xP6, home.microsoft.com Ave CFG: 4xP6, Ave Cost: $28K 160 GB HD FDDI Ring 512 RAM, (3) 512 RAM, FY98 Fcst: (4) 7 (MIS2) 50 GB HD premium.microsoft.com 30 GB HD Ave CFG: 4xP6 (2) msid.msn.com 512 RAM Ave CFG: 4xP6, activex.microsoft.com 28 GB HD 512 RAM, (1) (2) FDDI Ring Ave CFG: 4xP6, 30 GB HD Switched (MIS1) 512 RAM, Ave CFG: 4xP6, Ethernet 30 GB HD 256 RAM, 30 GB HD FTP Ave Cost: $25K cdm.microsoft.com Download Server Ave CFG: FY98 Fcst: 4xP5, 2 (1) 256 RAM, Router (1) HTTP search.microsoft.com 12 GB HD Download Servers (2) (2) Router Router Internet msid.msn.com Router (1) 2 Primary 2 Router Gigaswitch OC3 Ethernet premium.microsoft.com (100Mb/Sec Each) Internet (100 Mb/Sec Each) Router (1) www.microsoft.com Router (3) Secondary Gigaswitch 13 Router DS3 Router FTP.microsoft.com (45 Mb/Sec Each) (3) FDDI Ring Ave CFG: 4xP5, home.microsoft.com (MIS3) www.microsoft.com msid.msn.com 512 RAM, (2) 30 GB HD (5) (1) Internet register.microsoft.com Ave CFG: 4xP5, FDDI Ring (2) 256 RAM, (MIS4) 20 GB HD register.microsoft.com home.microsoft.com support.microsoft.com (1) (5) register.msn.com (2) (2) Ave CFG: 4xP6, support.microsoft.com 512 RAM, search.microsoft.com (1) 30 GB HD Microsoft.com: ~150x4 nodes (3)
NCSA Super Cluster • National Center for Supercomputing ApplicationsUniversity of Illinois @ Urbana • 512 Pentium II cpus, 2,096 disks, SAN • Compaq + HP +Myricom + WindowsNT • A Super Computer for 3M$ • Classic Fortran/MPI programming • DCOM programming model http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
TPC C Improved Fast(250%/year!) 40% hardware, 100% software, 100% PC Technology
Microsoft TerraServer: Scaleup to Big Databases • Build a 1 TB SQL Server database • Data must be • 1 TB • Unencumbered • Interesting to everyone everywhere • And not offensive to anyone anywhere • Loaded • 1.5 M place names from Encarta World Atlas • 3 M Sq Km from USGS (1 meter resolution) • 1 M Sq Km from Russian Space agency (2 m) • On the web (world’s largest atlas) • Sell images with commerce server.
Earth is 500 Tera-meters square USA is 10 tm2 100 TM2 land in 70ºN to 70ºS We have pictures of 6% of it 3 tsm from USGS 2 tsm from Russian Space Agency Compress 5:1 (JPEG) to 1.5 TB. Slice into 10 KB chunks Store chunks in DB Navigate with Encarta™ Atlas globe gazetteer StreetsPlus™ in the USA Someday multi-spectral image of everywhere once a day / hour 1.8x1.2 km2 tile 10x15 km2thumbnail 20x30 km2 browse image 40x60 km2 jump image Microsoft TerraServer Background
navigate by coverage map to White House Download image buy imagery from USGS navigate by name to Venice buy SPIN2 image & Kodak photo Pop out to Expedia street map of Venice Mention that DB will double in next 18 months (2x USGS, 2X SPIN2) Demo
Compaq AlphaServer 8400 8x400Mhz Alpha cpus 10 GB DRAM 324 9.2 GB StorageWorks Disks 3 TB raw, 2.4 TB of RAID5 STK 9710 tape robot (4 TB) WindowsNT 4 EE, SQL Server 7.0 The Microsoft TerraServer Hardware
Software TerraServer Web Site Web Client ImageServer Active Server Pages Internet InformationServer 4.0 HTML JavaViewer The Internet browser MTS Terra-ServerStored Procedures Internet InfoServer 4.0 Internet InformationServer 4.0 SQL Server 7 MicrosoftSite Server EE Microsoft AutomapActiveX Server Automap Server Image DeliveryApplication SQL Server7 TerraServer DB Image Provider Site(s)
ESA LoadMgr AlphaServer4100 AlphaServer4100 60 4.3 GB Drives Image Delivery and LoadIncremental load of 4 more TB in next 18 months DLTTape “tar” \Drop’N’ LoadMgrDB DoJob Wait 4 Load DLTTape NTBackup ... Cutting Machines LoadMgr 10: ImgCutter 20: Partition 30: ThumbImg40: BrowseImg 45: JumpImg 50: TileImg 55: Meta Data 60: Tile Meta 70: Img Meta 80: Update Place ImgCutter 100mbitEtherSwitch \Drop’N’ \Images TerraServer Enterprise Storage Array STKDLTTape Library AlphaServer8400 108 9.1 GB Drives 108 9.1 GB Drives 108 9.1 GB Drives
TerraServer: A Real “World” Example • Largest DB on the Web • 1.3TB • 99.95% uptime since July 1 • No downtime, period, in August • 70% of downtime for SQL software upgrades
NT Clusters (Wolfpack) • Scale DOWN to PDA: WindowsCE • Scale UP an SMP: TerraServer • Scale OUT with a cluster of machines • Single-system image • Naming • Protection/security • Management/load balance • Fault tolerance • “Wolfpack” • Hot pluggable hardware & software
Browser Server 1 Web site Database Symmetric Virtual Server Failover Example Server 1 Server 2 Web site Web site Database Database Web site files Web site files Database files Database files
Windows NT 5 (scalability features) • Better SMP support • Clusters: • 16x packs (fault tolerant clusters) • 100x mobs: arrays for manageability • SAN/VIA support • 64 bit addressing for data • Apps like SQL, Oracle, will use it for data • 64 bit API to NT comes later (in lab now). • Remote management (scripting and DCOM) • Active Directory • Veritas volume manager • Many 3rd party HSMs • Batch support
Microsoft SQL Server 7.0 • Fixes the famous performance bugs • dynamic record locking • online backup, quick recovery…. • 64 bit addressing buffer pool • SMP parallelism and better SMP support • Built in OLAP (cubes and MOLAP) • Scale down to Win9x • Improved management interfaces • Data transform services (for warehouses)
Outline • What is Scalability • Why does Microsoft care about ScaleUp • Current ScaleUp Status? • NT5 & SQL7
end Other slides would be interesting, but...
Interesting “other slides”No time for them but... • How much information is there? • IO bandwidth in the Intel world • Intelligent disks • SAN/VIA • NT Cluster Sort
Kilo Mega Giga Tera Peta Exa Zetta Yotta Some Tera-Byte Databases • The Web: 1 TB of HTML • TerraServer 1 TB of images • Several other 1 TB (file) servers • Hotmail: 7 TB of email • Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked • EOS/DIS (picture of planet each week) • 15 PB by 2007 • Federal Clearing house: images of checks • 15 PB by 2006 (7 year history) • Nuclear Stockpile Stewardship Program • 10 Exabytes (???!!)
A letter Kilo Mega Giga Tera Peta Exa Zetta Yotta A novel A Movie Library of Congress (text) LoC (image) All Disks All Tapes Info Capture • You can record everything you see or hear or read. • What would you do with it? • How would you organize & analyze it? Video 8 PB per lifetime (10GBph) Audio 30 TB (10KBps) Read or write: 8 GB (words) See: http://www.lesk.com/mlesk/ksg97/ksg.html
Michael Lesk’s Pointswww.lesk.com/mlesk/ksg97/ksg.html • Soon everything can be recorded and kept • Most data will never be seen by humans • Precious Resource: Human attention Auto-Summarization Auto-Searchwill be a key enabling technology.
PAP (peak advertised Performance) vsRAP (real application performance) • Goal: RAP = PAP / 2 (the half-power point) System Bus 422 MBps 40 MBps 7.2 MB/s 7.2 MB/s Application 10-15 MBps Data 7.2 MB/s File System SCSI Buffers Disk 133 MBps PCI 7.2 MB/s
Reads are easy, writes are hard Async write can match WCE. PAP vs RAP 422 MBps 142 MBps SCSI Disks Application Data 40 MBps 10-15 MBps 31 MBps File System 9 MBps 133 MBps SCSI 72 MBps PCI
Adapter ~30 MBps PCI ~70 MBps Adapter Memory Read/Write ~150 MBps Adapter PCI Adapter Bottleneck Analysis • NTFS Read/Write 12 disk, 4 SCSI, 2 PCI (not measured, we had only one PCI bus available, 2nd one was “internal”) ~ 120 MBps Unbuffered read ~ 80 MBps Unbuffered write ~ 40 MBps Buffered read ~ 35 MBps Buffered write 120 MBps
Year 2002 Disks • Big disk (10 $/GB) • 3” • 100 GB • 150 kaps (k accesses per second) • 20 MBps sequential • Small disk (20 $/GB) • 3” • 4 GB • 100 kaps • 10 MBps sequential • Both running Windows NT™ 7.0?(see below for why)
Each node has an OS Each node has local resources: A federation. Each node does not completely trust the others. Nodes use RPC to talk to each other CORBA? DCOM? IIOP? RMI? One or all of the above. Huge leverage in high-level interfaces. Same old distributed system story. How Do They Talk to Each Other? Applications Applications datagrams datagrams streams RPC ? ? RPC streams VIAL/VIPL h Wire(s)
RIP FDDI RIP ATM RIP FC RIP SCI RIP ? RIP SCSI SAN: Standard Interconnect Gbps Ethernet: 110 MBps • LAN faster than memory bus? • 1 GBps links in lab. • 300$ port cost soon • Port is computer PCI 32: 70 MBps UW Scsi: 40 MBps FW scsi: 20 MBps scsi: 5 MBps
PennySort • Hardware • 266 Mhz Intel PPro • 64 MB SDRAM (10ns) • Dual Fujitsu DMA 3.2GB EIDE • Software • NT workstation 4.3 • NT 5 sort • Performance • sort 15 M 100-byte records (~1.5 GB) • Disk to disk • elapsed time 820 sec • cpu time = 404 sec
AAA AAA AAA AAA AAA AAA BBB BBB BBB BBB BBB BBB CCC CCC CCC CCC CCC CCC Cluster Sort Conceptual Model • Multiple Data Sources • Multiple Data Destinations • Multiple nodes • Disks -> Sockets -> Disk -> Disk A AAA BBB CCC B C AAA BBB CCC AAA BBB CCC
Cluster Install & Execute • If this is to be used by others, • it must be: • Easy to install • Easy to execute • Installations of distributed systems take • time and can be tedious. (AM2, GluGuard) • Parallel Remote execution is • non-trivial. (GLUnix, LSF) • How do we keep this “simple” and “built-in” to NTClusterSort ?
Remote Install • Add Registry entry to each remote node. RegConnectRegistry() RegCreateKeyEx()
MULT_QI COSERVERINFO HANDLE HANDLE HANDLE Sort() Sort() Sort() Cluster Execution • Setup : • MULTI_QI struct • COSERVERINFO struct • CoCreateInstanceEx() • Retrieve remote object handle • from MULTI_QI struct • Invoke methods as usual