Scaleable Servers. Jim Gray Microsoft Gray@Microsoft.com http://www.research.Microsoft.com/~Gray. Thesis: Scaleable Servers. Scaleable Servers Commodity hardware allows new applications New applications need huge servers Clients and servers are built of the same “stuff”
Scaleable Servers Jim Gray Microsoft Gray@Microsoft.com http://www.research.Microsoft.com/~Gray
Thesis: Scaleable Servers • Scaleable Servers • Commodity hardware allows new applications • New applications need huge servers • Clients and servers are built of the same “stuff” • Commodity software and • Commodity hardware • Servers should be able to • Scale up (grow node by adding CPUs, disks, networks) • Scale out (grow by adding nodes) • Scale down (can start small) • Key software technologies • Objects, Transactions, Clusters, Parallelism
1987: 256 tps Benchmark • 14 M$ computer (Tandem) • A dozen people: manager, auditor, admin expert, hardware experts, network expert, performance expert, OS expert, DB expert • False floor, 2 rooms of machines • A 32-node processor array • Simulate 25,600 clients • A 40 GB disk array (80 drives)
1988: DB2 + CICS Mainframe: 65 tps • IBM 4391 • Simulated network of 800 clients • 2 M$ computer • Staff of 6 to do benchmark • 2 x 3725 network controllers • Refrigerator-sized CPU • 16 GB disk farm (4 x 8 x 0.5 GB)
1997: 10 Years Later, 1 Person and 1 Box = 1250 tps • 1 breadbox ~ 5x 1987 machine room • 23 GB is hand-held • One person does all the work (hardware, OS, net, DB, and app expert) • Cost/tps is 1,000x less: 25 micro-dollars per transaction • 4 x 200 MHz CPU, 1/2 GB DRAM, 12 x 4 GB disk • 3 x 7 x 4 GB disk arrays
What Happened?
• Moore’s law: things get 4x better every 3 years (applies to computers, storage, and networks)
• New economics: commodity
    class           price/mips ($/mips)   software (k$/year)
    mainframe       10,000                100
    minicomputer    100                   10
    microcomputer   10                    1
• GUI: human-computer tradeoff; optimize for people, not computers
• [Figure: price versus time for mainframe, mini, and micro]
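The "4x every 3 years" rule compounds quickly. The short sketch below works out what it implies over the ten years between the 1987 and 1997 benchmarks; the function name and its defaults are illustrative, not from the talk.

```python
# Compound growth implied by "things get 4x better every 3 years".
# `improvement` and its parameter names are illustrative assumptions.

def improvement(years: float, factor: float = 4.0, period: float = 3.0) -> float:
    """Cumulative improvement after `years`, at `factor` per `period` years."""
    return factor ** (years / period)

ten_year = improvement(10)  # 1987 -> 1997
annual = improvement(1)     # equivalent yearly rate

print(f"10-year improvement: {ten_year:.0f}x")
print(f"annual improvement:  {annual:.2f}x")
```

Under this rule the decade gives roughly a 100x improvement, so the 1,000x drop in cost/tps presumably also reflects the commodity pricing in the table, not Moore's law alone.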
Billions Of Clients Need Millions Of Servers • All clients networked to servers • May be nomadic or on-demand • Fast clients want faster servers • Servers provide • Shared data • Control • Coordination • Communication • [Figure: mobile clients and fixed clients connected to servers (server, super server)]
Thesis: Many Little Beat Few Big • Smoking, hairy golf ball • How to connect the many little parts? • How to program the many little parts? • Fault tolerance? • [Figure: processor classes (mainframe, mini, micro, nano, pico processor at 1 MM³) against price points ($1 million, $100 K, $10 K); storage hierarchy latencies (10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive) and capacities (1 MB, 100 MB, 10 GB, 1 TB, 100 TB); disc form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; annotations: 1 M SPECmarks, 1 TFLOP; 10^6 clocks to bulk ram; event-horizon on chip; VM reincarnated; multiprogram cache, on-chip SMP]
Future Super Server: 4T Machine • Array of 1,000 4B machines • 1 Bips processors • 1 BB DRAM • 10 BB disks • 1 Bbps comm lines • 1 TB tape robot • A few megabucks • Challenge: • Manageability • Programmability • Security • Availability • Scaleability • Affordability • As easy as a single system • Future servers are CLUSTERS of processors, discs • Distributed database techniques make clusters work • [Figure: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB disc]
The Hardware Is In Place… And then a miracle occurs? • SNAP: scaleable network and platforms • Commodity distributed OS built on: • Commodity platforms • Commodity network interconnect • Enables parallel applications
Thesis: Scaleable Servers • Scaleable Servers • Commodity hardware allows new applications • New applications need huge servers • Clients and servers are built of the same “stuff” • Commodity software and • Commodity hardware • Servers should be able to • Scale up (grow node by adding CPUs, disks, networks) • Scale out (grow by adding nodes) • Scale down (can start small) • Key software technologies • Objects, Transactions, Clusters, Parallelism
Scaleable Servers: BOTH SMP And Cluster • Grow up with SMP; 4x P6 is now standard • Grow out with cluster • Cluster has inexpensive parts • [Figure: personal system, departmental server, SMP superserver; cluster of PCs]
SMPs Have Advantages • Single system image: easier to manage, easier to program (threads in shared memory, disk, net) • 4x SMP is commodity • Software capable of 16x • Problems: • >4 not commodity • Scale-down problem (starter systems expensive) • There is a BIGGEST one • [Figure: personal system, departmental server, SMP superserver]
TPC-C Web-Based Benchmarks • Client is a Web browser (9,200 of them!) • Submits Order, Invoice, and Query requests to server via Web page interface • Web server translates to DB • SQL does DB work • Net: • easy to implement • performance is GREAT! • [Figure: HTTP, IIS = Web, ODBC, SQL]
TPC-C Shows How Far SMPs have come • Performance is amazing: • 2,000 users is the min! • 30,000 users on a 4x12 alpha cluster (Oracle) • Peak Performance: 30,390 tpmC @ $305/tpmC (Oracle/DEC) • Best Price/Perf: 7,693 tpmC @ $43/tpmC (MS SQL/Dell) • graphs show UNIX high price & diseconomy of scaleup
TPC-C SMP Performance • SMPs do offer speedup • but 4x P6 is better than some 18x MIPSco
What Happens To Prices? • No expensive UNIX front end (20$/tpmC) • No expensive TP monitor software (10$/tpmC) • => 65$/tpmC
Building the Largest NT Node • Build a 1 TB SQL Server database • Show off NT and SQL Server Scaleability • Stress test the product • Demo it on the Internet • WWW accessible by anyone • So data must be • 1 TB • Unencumbered • Interesting to everyone everywhere • AND not offensive to anyone anywhere
What’s a TeraByte? (Terror Byte!)
• 1 Terabyte:
    1,000,000,000 business letters     150 miles of book shelf
    100,000,000 book pages             15 miles of book shelf
    50,000,000 FAX images              7 miles of book shelf
    10,000,000 TV pictures (mpeg)      10 days of video
    4,000 LandSat images               16 earth images (100m)
    100,000,000 web pages              10 copies of the web HTML
• Library of Congress (in ASCII) is 25 TB
• 1980: $200 million of disc (10,000 discs); $5 million of tape silo (10,000 tapes)
• 1997: 200 k$ of magnetic disc (48 discs); 30 k$ nearline tape (20 tapes)
The Plan • DEC Alpha + • 324 StorageWorks Drives (1.4 TB) • 30K BTU, 8 KW, 1.5 metric tons • SQL 7.0 • USGS data (1 meter) • Russian Space data (2 meter) • DEC 4100: 4 x 400 MHz Alpha processors, 4 GB DRAM • [Logos: Microsoft BackOffice, SPIN-2]
Image Data Sources • DOQ: 300 GB • Src: USGS & UCSB • UCSB missing some DOQs • SPIN-2: 500 GB • Worldwide • New data coming • LoB App
DOQ coverage of the US • 1 Meter images of many places • Problems: • most of data not yet published • interesting places missing (LA, Portland, SD, Anchorage,…) • Loaded published 130 GB. • CRDA for unpublished 3 TB
SPIN-2 Coverage • The rest of the world • The US Government can’t help, but.... • The Russian Space Agency is eager to cooperate. • 2 meter geo-rectified imagery of anywhere • More data coming, Earth has ~500 tera square meters (TeraMeters²) • => ~30 TeraBytes of land at 2x2 meter • => we need 3% of the land (Urban World = the red stuff)
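The sizing on this slide can be roughly checked with back-of-the-envelope arithmetic. The sketch below assumes a land fraction of about 29% of the Earth's surface and one byte per 2 x 2 meter pixel; neither figure is stated on the slide, so treat both as assumptions.

```python
# Rough check of "~30 TB of land at 2x2 meter".
EARTH_SURFACE_M2 = 500e12   # from the slide: ~500 tera square meters
LAND_FRACTION = 0.29        # assumption: share of the surface that is land
PIXEL_SIDE_M = 2            # from the slide: 2 meter imagery
BYTES_PER_PIXEL = 1         # assumption: uncompressed 8-bit grayscale

land_pixels = EARTH_SURFACE_M2 * LAND_FRACTION / PIXEL_SIDE_M**2
land_tb = land_pixels * BYTES_PER_PIXEL / 1e12

# The slide's own numbers: 3% of ~30 TB of land imagery.
urban_tb = 0.03 * 30

print(f"land at 2 m resolution: {land_tb:.1f} TB")
print(f"3% of 30 TB:            {urban_tb:.1f} TB")
```

With these assumptions the land estimate lands in the same ballpark as the slide's ~30 TB, and 3% of it is close to the 1 TB database the project set out to build.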
Grow UP and OUT • 1 Terabyte DB • 1 billion transactions per day • Cluster: • a collection of nodes • as easy to program and manage as a single node • [Figure: personal system, departmental server, SMP superserver]
Clusters Have Advantages • Clients and servers made from the same stuff • Inexpensive: • Built with commodity components • Fault tolerance: • Spare modules mask failures • Modular growth • Grow by adding small modules • Unlimited growth: no biggest one
Billion Transactions per Day Project • Built a 45-node Windows NT Cluster (with help from Intel & Compaq), > 900 disks • All off-the-shelf parts • Using SQL Server & DTC distributed transactions • DebitCredit Transaction • Each node has 1/20th of the DB • Each node does 1/20th of the work • 15% of the transactions are “distributed”
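To make "each node has 1/20th of the DB" and "15% of the transactions are distributed" concrete, here is a minimal simulation. The slide does not say how the data was placed, so the modulo placement rule, the branch count, and every name below are hypothetical.

```python
import random

NODES = 20        # from the slide: each node holds 1/20th of the database
BRANCHES = 1000   # hypothetical number of bank branches (a multiple of NODES)

def node_for(branch_id: int) -> int:
    """Hypothetical placement rule: branch b lives on node b mod NODES."""
    return branch_id % NODES

def make_txn(rng: random.Random, remote_fraction: float = 0.15):
    """Return (teller_branch, account_branch) for one DebitCredit transaction."""
    teller_branch = rng.randrange(BRANCHES)
    if rng.random() < remote_fraction:
        # The account is homed at a branch that lives on a different node.
        offset = rng.randrange(1, NODES)
        account_branch = (teller_branch + offset) % BRANCHES
    else:
        account_branch = teller_branch
    return teller_branch, account_branch

def nodes_touched(txn) -> set:
    """Set of nodes that must take part in the transaction."""
    teller_branch, account_branch = txn
    return {node_for(teller_branch), node_for(account_branch)}

rng = random.Random(42)
txns = [make_txn(rng) for _ in range(100_000)]
distributed = sum(len(nodes_touched(t)) > 1 for t in txns)
share = distributed / len(txns)
print(f"distributed share: {share:.1%}")
```

Transactions that touch a single node commit locally; only the share that touches two nodes would need a distributed commit through something like DTC.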
How Much Is 1 Billion Transactions Per Day? • 1 Btpd = 11,574 tps (transactions per second) ~ 700,000 tpm (transactions/minute) • AT&T • 185 million calls (peak day worldwide) • Visa ~20 M tpd • 400 M customers • 250,000 ATMs worldwide • 7 billion transactions / year (card+cheque) in 1994 • [Chart: millions of transactions per day (Mtpd), log scale from 0.1 to 1,000, for AT&T, Visa, BofA, NYSE, and 1 Btpd]
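The rate conversion on this slide is simple division, sketched below using only the slide's own figures; the variable names are illustrative.

```python
# Convert daily transaction volumes to per-second and per-minute rates.
SECONDS_PER_DAY = 24 * 60 * 60

btpd = 1_000_000_000           # 1 billion transactions per day
tps = btpd / SECONDS_PER_DAY   # transactions per second
tpm = tps * 60                 # transactions per minute

visa_tps = 20_000_000 / SECONDS_PER_DAY   # Visa at ~20 M tpd, from the slide
att_cps = 185_000_000 / SECONDS_PER_DAY   # AT&T peak-day calls, from the slide

print(f"1 Btpd = {tps:,.0f} tps = {tpm:,.0f} tpm")
print(f"Visa   = {visa_tps:,.0f} tps")
print(f"AT&T   = {att_cps:,.0f} calls/sec")
```

These are daily averages; peak rates at any of the named organizations would be higher.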
Billion Transactions Per Day Hardware
• 45 nodes (Compaq Proliant)
• Clustered with 100 Mbps Switched Ethernet
• 140 cpu, 13 GB, 3 TB.
Per-node configuration (multiply by the node count for totals):
    Type                                  nodes  machine               CPUs  DRAM   ctlrs  disks                    RAID space
    Workflow (MTS)                        20     Compaq Proliant 2500  2     128    1      1                        2 GB
    SQL Server                            20     Compaq Proliant 5000  4     512    4      36x4.2GB + 7x9.1GB       130 GB
    Distributed Transaction Coordinator   5      Compaq Proliant 5000  4     256    1      3                        8 GB
    TOTAL                                 45                           140   13 GB  105    895                      3 TB
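The TOTAL row can be cross-checked by multiplying each tier's per-node figures by its node count. The sketch below assumes the DRAM column is in megabytes (the slide gives no unit) and that the RAID space figures are per node; the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    nodes: int
    cpus: int       # per node
    dram_mb: int    # per node; MB is an assumption, the slide gives no unit
    ctlrs: int      # per node
    disks: int      # per node
    space_gb: int   # RAID space per node

tiers = [
    Tier("Workflow (MTS)", 20, 2, 128, 1, 1, 2),
    Tier("SQL Server", 20, 4, 512, 4, 36 + 7, 130),
    Tier("Distributed Transaction Coordinator", 5, 4, 256, 1, 3, 8),
]

def total(field: str) -> int:
    """Sum a per-node field over all tiers, weighted by node count."""
    return sum(t.nodes * getattr(t, field) for t in tiers)

total_nodes = sum(t.nodes for t in tiers)
print("nodes:", total_nodes)
print("CPUs: ", total("cpus"))
print("DRAM: ", total("dram_mb") / 1024, "GB")
print("ctlrs:", total("ctlrs"))
print("disks:", total("disks"))
print("space:", total("space_gb"), "GB")
```

Under these assumptions the node, CPU, controller, and disk counts match the slide exactly, while DRAM and RAID space come out near the slide's rounded 13 GB and 3 TB.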
1.2 B tpd • 1 B tpd ran for 24 hrs. • Sized for 30 days • Linear growth • 5 micro-dollars per transaction • Out-of-the-box software • Off-the-shelf hardware • AMAZING!
Other Stunts • 100 M Web Hits/day on one server • (=1,300 hits/sec, Web Mark HTML server) • Email server (exchange) • 50 GB database (up from 16GB, limit now 16TB) • 50 k POP3 users (1.5 M msg/day) • 64-bit addressing SQL Server • SAP Failover • Theme: • conventional stuff is easy
Thesis: Scaleable Servers • Scaleable Servers • Commodity hardware allows new applications • New applications need huge servers • Clients and servers are built of the same “stuff” • Commodity software and • Commodity hardware • Servers should be able to • Scale up (grow node by adding CPUs, disks, networks) • Scale out (grow by adding nodes) • Scale down (can start small) • Key software technologies • Objects, Transactions, Clusters, Parallelism
Parallelism: The OTHER Aspect of Clusters • Clusters of machines allow two kinds of parallelism • Many little jobs: online transaction processing • TPC-A, B, C… • A few big jobs: data search and analysis • TPC-D, DSS, OLAP • Both give automatic parallelism
Kinds of Parallel Execution • Pipeline: any sequential program feeds any sequential program • Partition: outputs split N ways, inputs merge M ways, across copies of any sequential program
Data Rivers: Split + Merge Streams • N producers, M consumers, N x M data streams • Producers add records to the river • Consumers consume records from the river • Purely sequential programming • River does flow control and buffering • River does partition and merge of data records • River = Split/Merge in Gamma = Exchange operator in Volcano.
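A toy version of the river idea can be sketched with threads and bounded queues. This is only an illustration of split and merge with flow control, not the Gamma or Volcano implementation; the `river` function and its parameters are hypothetical, and it uses one shared queue per consumer instead of N x M separate streams.

```python
import queue
import threading

DONE = object()  # end-of-stream marker sent by each producer to each consumer

def river(producers, consumers, partition, buffer_size=8):
    """Run N sequential producers and M sequential consumers over a river.

    producers: callables returning an iterable of records
    consumers: callables taking an iterator of records and returning a result
    partition: maps a record to an integer used to pick the consumer
    """
    m = len(consumers)
    # One bounded queue per consumer: put() blocks when the queue is full,
    # which gives buffering and flow control.
    lanes = [queue.Queue(maxsize=buffer_size) for _ in range(m)]
    results = [None] * m

    def run_producer(produce):
        for record in produce():
            lanes[partition(record) % m].put(record)  # split
        for lane in lanes:
            lane.put(DONE)

    def merged(lane):
        remaining = len(producers)  # merge: read until every producer is done
        while remaining:
            record = lane.get()
            if record is DONE:
                remaining -= 1
            else:
                yield record

    def run_consumer(i, consume):
        results[i] = consume(merged(lanes[i]))

    threads = [threading.Thread(target=run_producer, args=(p,)) for p in producers]
    threads += [threading.Thread(target=run_consumer, args=(i, c))
                for i, c in enumerate(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def numbers(lo):
    """Build a producer that emits the 100 integers starting at `lo`."""
    return lambda: range(lo, lo + 100)

# Three producers, two consumers; even records go to consumer 0, odd to 1.
totals = river([numbers(0), numbers(100), numbers(200)], [sum, sum],
               partition=lambda r: r)
print(totals)
```

Each producer and consumer here is an ordinary sequential function; all partitioning, merging, and buffering lives inside `river`, which is the point the slide makes.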
Partitioned Execution • Spreads computation and IO among processors • Partitioned data gives NATURAL parallelism
N x M way Parallelism • N inputs, M outputs, no bottlenecks • Partitioned data • Partitioned and pipelined data flows
Clusters (Plumbing) • Single system image • naming • protection/security • management/load balance • Fault Tolerance • Wolfpack • Hot Pluggable hardware & Software
Windows NT clusters • Key goals: • Easy: to install, manage, program • Reliable: better than a single node • Scaleable: added parts add power • Microsoft & 60 vendors defining NT clusters • Almost all big hardware and software vendors involved • No special hardware needed, but it may help • Enables • Commodity fault-tolerance • Commodity parallelism (data mining, virtual reality…) • Also great for workgroups! • Initial: two-node failover • Beta testing since December 96 • SAP, Microsoft, Oracle giving demos • File, print, Internet, mail, DB, other services • Easy to manage • Each node can be 4x (or more) SMP • Next (NT5): “Wolfpack” is modest size cluster • About 16 nodes (so 64 to 128 CPUs) • No hard limit, algorithms designed to go further
So, What’s New? • When slices cost 50k$, you buy 10 or 20. • When slices cost 5k$ you buy 100 or 200. • Manageability, programmability, usability become key issues (total cost of ownership). • PCs are MUCH easier to use and program • [Figure: MPP vicious cycle: New MPP & New OS, New App (repeated four times), No Customers! CP/Commodity virtuous cycle: standard platform, apps, customers; standards allow progress and investment protection]
Thesis: Scaleable Servers • Scaleable Servers • Commodity hardware allows new applications • New applications need huge servers • Clients and servers are built of the same “stuff” • Commodity software and • Commodity hardware • Servers should be able to • Scale up (grow node by adding CPUs, disks, networks) • Scale out (grow by adding nodes) • Scale down (can start small) • Key software technologies • Objects, Transactions, Clusters, Parallelism
Objects Meet Databases: The basis for universal data servers, access, & integration • Object-oriented (COM oriented) programming interface to data • Breaks DBMS into components • Anything can be a data source • Optimization/navigation “on top of” other data sources • A way to componentize a DBMS • Makes an RDBMS an O-RDBMS (assumes optimizer understands objects) • [Figure: DBMS engine over data sources: database, spreadsheet, photos, mail, map, document]