Scaleable Systems Research at Microsoft (really: what we do at BARC) • Jim Gray, Microsoft Research • Gray@Microsoft.com • http://research.Microsoft.com/~Gray • Presented to the DARPA WindowsNT workshop, 5 Aug 1998, Seattle WA.
Outline • PowerCast, FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • WolfPack Failover • NTFS IO measurements • NT-Cluster-Sort • AlwaysUp
Telepresence • The next killer app • Space shifting: • Reduce travel • Time shifting: • Retrospective • Offer condensations • Just in time meetings. • Example: ACM 97 • NetShow and Web site. • More web visitors than attendees • People-to-People communication
Telepresence Prototypes • PowerCast: multicast PowerPoint • Streaming - pre-sends next anticipated slide • Send slides and voice rather than talking head and voice • Uses ECSRM for reliable multicast • 1000’s of receivers can join and leave any time. • No server needed; no pre-load of slides. • Cooperating with NetShow • FileCast: multicast file transfer. • Erasure encodes all packets • Receivers only need to receive as many bytes as the length of the file • Multicast IE to solve Midnight-Madness problem • NT SRM: reliable IP multicast library for NT • Spatialized Teleconference Station • Texture map faces onto spheres • Space map voices
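To make the FileCast idea concrete, here is a toy (k+1, k) parity code standing in for the real erasure code (my sketch; FileCast used a stronger code so that any sufficiently large subset of packets reconstructs the file, and the function names are mine): the file is split into k blocks, k+1 packets are multicast, and a receiver that catches any k of them rebuilds the file, so it needs to receive only about as many bytes as the file is long.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int = 4) -> list:
    """Split data into k padded blocks and append one XOR-parity block (k+1 packets)."""
    size = -(-len(data) // k)                                    # ceil(len / k)
    blocks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return blocks + [reduce(xor, blocks)]

def decode(received: dict, k: int = 4) -> bytes:
    """Rebuild the file from ANY k of the k+1 packets.
    `received` maps packet index (0..k) to payload; at most one packet may be missing."""
    missing = [i for i in range(k) if i not in received]
    if missing:                                                  # XOR of the survivors recovers the lost block
        received[missing[0]] = reduce(xor, [received[i] for i in received])
    return b"".join(received[i] for i in range(k))

packets = encode(b"multicast me to thousands of receivers!!", k=4)
arrived = {i: p for i, p in enumerate(packets) if i != 2}        # packet 2 was dropped in transit
print(decode(arrived, k=4).rstrip(b"\0"))                        # the original bytes come back
```

Because no particular packet is required, a receiver can join mid-transfer and simply keep collecting packets until it has enough, which is what lets thousands of receivers join and leave at any time without a server.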
RAGS: RAndom SQL test Generator • Microsoft spends a LOT of money on testing (60% of development, according to one source). • Idea: test SQL by • generating random correct queries • executing the queries against the database • comparing results with SQL 6.5, DB2, Oracle, Sybase • Being used in SQL 7.0 testing • 375 unique bugs found (since 2/97) • A very productive test tool
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) Sample Rags Generated Statement This Statement yields an error:SQLState=37000, Error=8623Internal Query Processor Error:Query processor could not produce a query plan.
Automation • Simpler statement with the same error: SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1) • Control statement attributes • complexity, kind, depth, ... • Multi-user stress tests • tests concurrency, allocation, recovery
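To make the approach concrete, here is a toy sketch of a grammar-driven random query generator in the spirit of RAGS. It is my illustration, not Microsoft's code: the real tool tracked column types, generated joins, updates, and deeply nested expressions, and compared results and error codes across the four engines. The table and column names are borrowed from the pubs sample database seen in the statement above.

```python
import random

# Toy schema: a few character columns from the pubs sample database.
TABLES = {"titles": ["title_id", "notes"],
          "sales":  ["ord_num", "payterms"]}

def rand_expr(table: str, depth: int) -> str:
    """A random scalar expression over one table's string columns."""
    col = f"{table}.{random.choice(TABLES[table])}"
    if depth == 0 or random.random() < 0.4:
        return col
    fn = random.choice(["UPPER", "LOWER", "LTRIM", "RTRIM"])
    return f"{fn}({rand_expr(table, depth - 1)})"

def rand_select(depth: int = 2) -> str:
    """A random, syntactically correct SELECT, possibly with a nested EXISTS subquery."""
    table = random.choice(list(TABLES))
    items = ", ".join(rand_expr(table, depth) for _ in range(random.randint(1, 3)))
    stmt = f"SELECT TOP {random.randint(1, 9)} {items} FROM {table}"
    if depth > 0 and random.random() < 0.5:
        stmt += f" WHERE EXISTS ({rand_select(depth - 1)})"
    return stmt

random.seed(42)        # a fixed seed makes a failing statement easy to replay and simplify
for _ in range(3):
    print(rand_select())
```

Each generated statement would then be run against every engine and the result sets (or error codes) compared; a mismatch, or an internal error like the 8623 shown above, becomes a "suspect" for a tester to examine.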
One 4-Vendor RAGS Test (3 of them vs. us) • 60 K SELECTs on MSS, DB2, Oracle, Sybase • 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements • Examined 10 suspects, filed 4 bugs (one duplicate); assume 3/10 are new • Note: this is the SQL Server Beta 2 product; quality is rising fast (and RAGS sees that)
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort
Billions Of Clients • Every device will be “intelligent” • Doors, rooms, cars… • Computing will be ubiquitous
Billions Of Clients Need Millions Of Servers • All clients networked to servers • May be nomadic or on-demand • Fast clients want faster servers • Servers provide • Shared Data • Control • Coordination • Communication • (Diagram: mobile and fixed clients connect to servers and super-servers.)
Thesis: Many little beat few big • 1 M SPECmarks, 1 TFLOP • 10^6 clocks to bulk ram • Event-horizon on chip • VM reincarnated • Multiprogram cache, On-Chip SMP • Smoking, hairy golf ball • How to connect the many little parts? • How to program the many little parts? • Fault tolerance? • (Diagram: the hierarchy from mainframe, mini, micro, nano, to pico processors, $1 million down to $10 K, with 10 pico-second to 10 nano-second RAM, 10 microsecond RAM, 10 millisecond disc, and 10-second tape archive; capacities from 1 MB to 100 TB; disk form factors from 14" to 1.8".)
Performance = Storage Accesses, not Instructions Executed • In the “old days” we counted instructions and IOs • Now we count memory references • Processors wait most of the time • (Chart: where the time goes, clock ticks for AlphaSort components: Sort, OS, Disc Wait, Memory Wait, I-Cache Miss, B-Cache Data Miss, D-Cache Miss; about 70 MIPS.) • “Real” apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not
Scale Up and Scale Out • Grow up with SMP: 4xP6 is now standard • Grow out with clusters: a cluster has inexpensive parts (a cluster of PCs) • (Diagram: personal system, departmental server, SMP super server.)
Microsoft TerraServer: Scaleup to Big Databases • Build a 1 TB SQL Server database • Data must be • 1 TB • Unencumbered • Interesting to everyone everywhere • And not offensive to anyone anywhere • Loaded • 1.5 M place names from Encarta World Atlas • 3 M Sq Km from USGS (1 meter resolution) • 1 M Sq Km from Russian Space agency (2 m) • On the web (world’s largest atlas) • Sell images with commerce server.
Microsoft TerraServer Background • Earth is 500 tera-square-meters; the USA is 10 tsm; 100 tsm of land lies between 70ºN and 70ºS • We have pictures of 6% of it: 3 tsm from USGS, 2 tsm from the Russian Space Agency • Compress 5:1 (JPEG) to 1.5 TB • Slice into 10 KB chunks; store the chunks in the DB • Navigate with the Encarta™ Atlas globe gazetteer, and StreetsPlus™ in the USA • Someday: a multi-spectral image of everywhere, once a day / hour • Image pyramid: 1.8x1.2 km² tile, 10x15 km² thumbnail, 20x30 km² browse image, 40x60 km² jump image
USGS Digital Ortho Quads (DOQ) • US Geological Survey: 4 terabytes; most data not yet published • Based on a CRADA; Microsoft TerraServer makes the data available • 1x1 meter resolution, 4 TB, Continental US • New data coming
SPIN-2 • Russian Space Agency (SovInformSputnik) SPIN-2; Aerial Images is the worldwide distributor • 1.5-meter geo-rectified imagery of (almost) anywhere • Almost equal-area projection • De-classified satellite photos (from 200 km) • More data coming (1 m) • Selling imagery on the Internet • Putting 2 tsm onto Microsoft TerraServer
Demo • Navigate by coverage map to the White House; download an image; buy imagery from USGS • Navigate by name to Venice; buy a SPIN-2 image & Kodak photo • Pop out to the Expedia street map of Venice • Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)
Hardware • 1 TB Database Server: AlphaServer 8400, 8 x 440 MHz Alpha CPUs, 10 GB DRAM, 324 StorageWorks disks (7 shelves of 48 x 9 GB drives), 10-drive tape library (STK 9710 TimberWolf DLT7000) • (Diagram: Internet servers, web servers, the map site server, and the SPIN-2 server connect over a DS3 link and a 100 Mbps Ethernet switch to the AlphaServer 8400 database server with its Enterprise Storage Array and STK 9710 DLT tape library.)
The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8 x 400 MHz Alpha CPUs • 10 GB DRAM • 324 9.2 GB StorageWorks disks: 3 TB raw, 2.4 TB of RAID5 • STK 9710 tape robot (4 TB) • WindowsNT 4 EE, SQL Server 7.0
Software • (Architecture diagram: web clients, an HTML browser and a Java viewer, reach the TerraServer web site over the Internet; Internet Information Server 4.0 with Active Server Pages, Microsoft Site Server EE, and the Microsoft Automap ActiveX Server sit in front of SQL Server 7 and the TerraServer DB, calling TerraServer stored procedures via MTS; an Image Delivery Application and Image Server load imagery from the image provider site(s).)
System Management & Maintenance • Backup and Recovery • STK 9710 Tape robot • Legato NetWorker™ • SQL Server 7 Backup & Restore • Clocked at 80 MBps (peak)(~ 200 GB/hr) • SQL Server Enterprise Mgr • DBA Maintenance • SQL Performance Monitor
Microsoft TerraServer File Group Layout • Convert 324 disks to 28 RAID5 sets plus 28 spare drives • Make 4 WinNT volumes (RAID 50), 595 GB per volume (drives E:, F:, G:, H:) • Build 30 20 GB files on each volume • The DB is a File Group of 120 files
Image Delivery and Load • Incremental load of 4 more TB in the next 18 months • (Diagram: DLT tape arrives as "tar" and NTBackup images; cutting machines (AlphaServer 4100s with 60 4.3 GB drives) run ImgCutter into \Drop'N'\Images; a LoadMgr with its LoadMgr DB dispatches jobs (DoJob / Wait 4 Load) over a 100 Mbit Ethernet switch into the TerraServer AlphaServer 8400 (3 x 108 9.1 GB drives, Enterprise Storage Array, STK DLT tape library); load steps: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place.)
Technical Challenge: Key Idea • Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server) • Solution: a geo-spatial search key: • Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y) • Z-transform X & Y into a single Z value; build a B-tree on Z • Adjacent images are stored next to each other • Search method: • Latitude and Longitude => X, Y, then Z • Select on the matching Z value
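A minimal sketch of such a Z-transform (bit interleaving, often called a Morton code or Z-order key). The grid sizes come from the slide; the 16-bit width, the helper names, and the Seattle example are my own illustrative assumptions, not TerraServer code.

```python
def grid_cell(lon_deg: float, lat_deg: float) -> tuple:
    """Map longitude/latitude to integer grid coordinates X, Y
    (1/48th-degree longitude by 1/96th-degree latitude rectangles)."""
    x = int((lon_deg + 180.0) * 48)
    y = int((lat_deg + 90.0) * 96)
    return x, y

def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of X and Y into one Z value.
    Nearby cells get nearby Z values, so a plain B-tree on Z keeps
    adjacent images close together."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions carry X
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions carry Y
    return z

# Example lookup key for a tile near Seattle (47.6 N, 122.3 W):
x, y = grid_cell(-122.3, 47.6)
print(x, y, z_value(x, y))    # the SELECT then matches on this single Z value
```

Because the interleaving preserves locality, a small range of Z values covers a small spatial neighborhood, which is why adjacent images end up stored next to each other without any special spatial access method.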
Sloan Digital Sky Survey • Digital Sky • 30 TB raw • 3TB cooked (1 billion 3KB objects) • Want to scan it frequently • Using cyberbricks • Current status: • 175 MBps per node • 24 nodes => 4 GBps • 5 minutes to scan whole archive
Some Tera-Byte Databases • The Web: 1 TB of HTML • TerraServer: 1 TB of images • Several other 1 TB (file) servers • Hotmail: 7 TB of email • Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked • EOS/DIS (a picture of the planet each week): 15 PB by 2007 • Federal Clearing house (images of checks): 15 PB by 2006 (7-year history) • Nuclear Stockpile Stewardship Program: 10 exabytes (???!!) • (Scale graphic: kilo, mega, giga, tera, peta, exa, zetta, yotta.)
Info Capture • You can record everything you see or hear or read. • What would you do with it? • How would you organize & analyze it? • Video: 8 PB per lifetime (10 GB/h) • Audio: 30 TB (10 KBps) • Read or write: 8 GB (words) • See: http://www.lesk.com/mlesk/ksg97/ksg.html • (Scale graphic, kilo through yotta: a letter, a novel, a movie, Library of Congress (text), LoC (image), all disks, all tapes.)
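Rough arithmetic behind the video figure (my check; the lifetime length is an assumption): 10 GB/hour x 24 hours x 365 days ≈ 88 TB per year, and roughly 90 years x 88 TB/year ≈ 8 PB.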
Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html) • Soon everything can be recorded and kept • Most data will never be seen by humans • Precious resource: human attention • Auto-summarization and auto-search will be a key enabling technology.
(Scale graphic, kilo through yotta: a letter, a novel, a movie, Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!)
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort
Scalability • Scale up: to large SMP nodes • Scale out: to clusters of SMP nodes • (Examples: 100 million web hits, 1 billion transactions, 1.8 million mail messages, 4 terabytes of data.)
Billion Transactions per Day Project • Built a 45-node Windows NT Cluster (with help from Intel & Compaq), > 900 disks • All off-the-shelf parts • Using SQL Server & DTC distributed transactions • DebitCredit Transaction • Each node has 1/20th of the DB • Each node does 1/20th of the work • 15% of the transactions are “distributed”
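A minimal sketch of how such a 20-way partitioned DebitCredit workload can be routed (my illustration with made-up names and a mod-20 rule; the actual project partitioned the database across 20 SQL Server nodes and used DTC two-phase commit for the roughly 15% of transactions that touch two partitions).

```python
import random

N_PARTITIONS = 20

def node_for(key: int) -> int:
    """Hypothetical partitioning rule: node = key mod 20 (each node owns 1/20th of the rows)."""
    return key % N_PARTITIONS

def classify(account_id: int, branch_id: int) -> str:
    """Local if the account and branch rows live on the same node; otherwise the
    transaction must be coordinated across two nodes (DTC-style two-phase commit)."""
    return "local" if node_for(account_id) == node_for(branch_id) else "distributed"

def random_txn() -> str:
    """Load-generator sketch: 15% of DebitCredit transactions pick a remote branch."""
    account = random.randrange(10_000_000)
    if random.random() < 0.15:
        branch = account + random.randrange(1, N_PARTITIONS)   # offset 1..19 lands on another node
    else:
        branch = account                                       # same node as the account
    return classify(account, branch)

sample = [random_txn() for _ in range(100_000)]
print(sample.count("distributed") / len(sample))               # ~0.15, matching the workload mix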
Billion Transactions Per Day Hardware • 45 nodes (Compaq Proliant) • Clustered with 100 Mbps Switched Ethernet • 140 CPUs, 13 GB DRAM, 3 TB of disk • Workflow (MTS): 20 x Compaq Proliant 2500, each with 2 CPUs, 128 MB DRAM, 1 ctlr, 1 disk, 2 GB RAID space • SQL Server: 20 x Compaq Proliant 5000, each with 4 CPUs, 512 MB DRAM, 4 ctlrs, 36 x 4.2 GB + 7 x 9.1 GB disks, 130 GB RAID space • Distributed Transaction Coordinator: 5 x Compaq Proliant 5000, each with 4 CPUs, 256 MB DRAM, 1 ctlr, 3 disks, 8 GB RAID space • TOTAL: 45 nodes, 140 CPUs, 13 GB DRAM, 105 ctlrs, 895 disks, 3 TB
1.2 B tpd • 1 B tpd ran for 24 hrs. • Out-of-the-box software • Off-the-shelf hardware • AMAZING! • Sized for 30 days • Linear growth • 5 micro-dollars per transaction
How Much Is 1 Billion Tpd? • 1 billion tpd = 11,574 tps ~ 700,000 tpm (transactions/minute) • AT&T: 185 million calls per peak day (worldwide) • Visa: ~20 million tpd • 400 million customers • 250 K ATMs worldwide • 7 billion transactions (card + cheque) in 1994 • New York Stock Exchange: 600,000 tpd • Bank of America • 20 million tpd checks cleared (more than any other bank) • 1.4 million tpd ATM transactions • Worldwide airline reservations: 250 M tpd
NCSA Super Cluster • National Center for Supercomputing Applications, University of Illinois @ Urbana • 512 Pentium II CPUs, 2,096 disks, SAN • Compaq + HP + Myricom + WindowsNT • A supercomputer for $3 M • Classic Fortran/MPI programming • DCOM programming model • http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort
NT Clusters (Wolfpack) • Scale DOWN to PDA: WindowsCE • Scale UP an SMP: TerraServer • Scale OUT with a cluster of machines • Single-system image • Naming • Protection/security • Management/load balance • Fault tolerance • “Wolfpack” • Hot pluggable hardware & software
Symmetric Virtual Server Failover Example • (Diagram: a browser connects to Server 1 and Server 2; each server hosts a web-site and a database virtual server, with the web-site files and database files on shared disks; if one server fails, the surviving server brings the other's web site and database online.)
Clusters & BackOffice • Research: Instant & Transparent failover • Making BackOffice PlugNPlay on Wolfpack • Automatic install & configure • Virtual Server concept makes it easy • simpler management concept • simpler context/state migration • transparent to applications • SQL 6.5E & 7.0 Failover • MSMQ (queues), MTS (transactions).
Next Steps in Availability • Study the causes of outages • Build AlwaysUp system: • Two geographically remote sites • Users have instant and transparent failover to 2nd site. • Working with WindowsNT and SQL Server groups on this.
Outline • FileCast & Reliable Multicast • RAGS: SQL Testing • TerraServer (a big DB) • Sloan Sky Survey (CyberBricks) • Billion Transactions per day • Wolfpack Failover • NTFS IO measurements • NT-Cluster-Sort
Storage Latency: How Far Away is the Data? • Registers: 1 clock (my head) • On-chip cache: 2 clocks (this room, 1 min) • On-board cache: 10 clocks (this campus, 10 min) • Memory: 100 clocks (Sacramento, 1.5 hr) • Disk: 10^6 clocks (Pluto, 2 years) • Tape / optical robot: 10^9 clocks (Andromeda, 2,000 years)
The Memory Hierarchy • Measuring & modeling sequential IO • Where is the bottleneck? • How does it scale with SMP, RAID, new interconnects? • Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s) • (Diagram: app address space and file cache in memory, connected over the memory bus and PCI to the adapter, SCSI, controller, and disks.)
PAP (peak advertised performance) vs RAP (real application performance) • Goal: RAP = PAP / 2 (the half-power point) • (Diagram: application data moves through file-system buffers to SCSI and disk; system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, application 10-15 MBps, with about 7.2 MB/s actually flowing at each stage to the disk.)
The Best Case: Temp File, NO IO • Temp file read / write hits the file system cache • The program uses a small (in-CPU-cache) buffer, so write/read time is bus move time (3x better than copy) • Paradox: the fastest way to move data is to write it, then read it • This hardware is limited to 150 MBps per processor
Bottleneck Analysis (drawn to linear scale) • Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits • Memory read/write: ~150 MBps • MemCopy: ~50 MBps • Disk R/W: ~9 MBps
3 Stripes and You're Out! • 3 disks can saturate the adapter • Similar story with UltraWide • CPU time goes down with request size • Ftdisk (striping is cheap)
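A small measurement sketch in the spirit of these sequential-IO experiments (my code, not the original NT benchmark; the file name and sizes are arbitrary): write a scratch file once, then time sequential reads at several request sizes. On a machine with enough RAM the reads are served from the file cache, which is exactly the "temp file, NO IO" best case above; against an uncached disk the same loop exposes the half-power point and shows per-request overhead shrinking as requests grow.

```python
import os, time

PATH = "testfile.bin"      # hypothetical scratch file
FILE_MB = 256              # far smaller than the 1998 study, but enough to see the trend
CHUNK = b"\0" * (1024 * 1024)

# Create the test file once (sequential write of 1 MB chunks).
with open(PATH, "wb") as f:
    for _ in range(FILE_MB):
        f.write(CHUNK)

# Time sequential reads at increasing request sizes and report throughput.
for req_kb in (2, 8, 64, 256, 1024):
    size = req_kb * 1024
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:   # unbuffered: one read() call per request
        while f.read(size):
            pass
    secs = time.perf_counter() - start
    print(f"{req_kb:5d} KB requests: {FILE_MB / secs:7.1f} MB/s")

os.remove(PATH)
```

Plotting MB/s against request size from a run like this is how you see where throughput levels off and how far the real application rate sits below the peak advertised numbers.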