BARC: Microsoft Bay Area Research Center • 6/20/97 (PUBLIC VERSION) • Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen • http://www.research.Microsoft.com/barc/
Telepresence • The next killer app • Space shifting: • Reduce travel • Time shifting: • Retrospective • Offer condensations • Just in time meetings. • Example: ACM 97 • NetShow and Web site. • More web visitors than attendees • People-to-People communication
Telepresence: Scaleable Reliable Multicast (Jim Gemmell) • Outline: • What Reliable Multicast is & why it is hard to scale • Fcast file transfer • ECSRM • Layered Telepresentations
Multiple Unicast • Sender must repeat each transmission per receiver • Links see repeated copies of the same data
Unreliable IP Multicast • A pruned broadcast: each link carries the data only once, but delivery is best-effort
Reliable Multicast • Difficult to scale: • Sender state explosion (state: receiver 1, receiver 2, … receiver n) • Message implosion at the sender
Receiver-Reliable • It is the receiver’s job to NACK for missing packets • Sender no longer keeps per-receiver state (receiver 1, receiver 2, … receiver n)
SRM Approaches • Hierarchy / local recovery • Forward Error Correction (FEC) • Suppression • *HYBRID* _________________________________ • Fcast is FEC only • ECSRM is suppression + FEC
(n,k) linear block encoding • Encode k original packets into n encoded packets (the first k are copies of the originals) • Take any k of the n to decode the k original packets
Fcast • File transfer protocol • FEC-only • Files transmitted in parallel
Fcast send order • Encoded packets 1, 2, …, k, k+1, k+2, …, n are sent round-robin across the files: index 1 of every file, then index 2, and so on • Receiver needs k packets from each row (any k of a file's n encoded packets)
Fcast reception time • Sender cycles continuously through files + FEC • A low-loss receiver joins, quickly collects k packets per file, and leaves; a high-loss receiver simply stays joined longer before leaving
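Assuming the send order above means "index 1 of every file, then index 2, and so on," a minimal sketch:

```python
def send_order(n_files, n):
    """Yield (file, packet index) pairs: index 1 of every file, then index 2, ..."""
    for idx in range(1, n + 1):          # encoded packet index 1..n
        for f in range(1, n_files + 1):  # across all files in parallel
            yield (f, idx)

print(list(send_order(n_files=2, n=3)))
# → [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3)]
```

A receiver that joins mid-cycle still collects distinct indices for every file, and can stop as soon as it has k per file.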
ECSRM - Erasure Correcting SRM • Combines: • suppression • erasure correction
Suppression • Delay a NACK or repair in the hopes that someone else will do it. • NACKs are multicast • After NACKing, re-set timer and wait for repair • If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.
ECSRM - adding FEC to suppression • Assign each packet to an EC group of size k • NACK: (group, # missing) • NACK of (g, c) suppresses all (g, x ≤ c) • Don’t re-send originals; send EC packets using (n,k) encoding
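A minimal sketch of the suppression rule combined with group-count NACKs (class and field names are illustrative, not the protocol's):

```python
import random

class Receiver:
    def __init__(self, group, missing):
        self.group = group                          # erasure-correction group id
        self.missing = missing                      # packets of the group we lost
        self.nack_timer = random.uniform(0.0, 1.0)  # random backoff before NACKing

    def hear_nack(self, group, count):
        """Called when another receiver's multicast NACK (group, count) arrives."""
        if group == self.group and count >= self.missing:
            # (g, c) suppresses every (g, x <= c): c erasure packets from an
            # (n, k) code repair ANY c losses in the group, ours included.
            self.nack_timer = None                  # suppressed: wait for repair

r = Receiver(group=1, missing=2)
r.hear_nack(group=1, count=3)     # covers our 2 losses
print(r.nack_timer)               # → None
r2 = Receiver(group=1, missing=4)
r2.hear_nack(group=1, count=3)    # does NOT cover 4 losses; keep the timer
print(r2.nack_timer is None)      # → False
```

The random backoff is what makes suppression work: whoever fires first silences everyone whose losses are covered.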
Example: EC group size (k) = 7 (figure: seven receivers each lose a different one of the group’s seven packets)
…example • One receiver NACKs: (Group 1, 1 lost) • The other receivers’ NACKs are suppressed
…example • Sender multicasts a single erasure-correcting packet, which repairs every receiver’s loss
…example: summary • Normal suppression needs 7 NACKs, 7 repairs; ECSRM requires 1 NACK, 1 repair • Large group: each packet lost by someone • Without FEC, 1/2 of traffic is repairs • With ECSRM, only 1/8 of traffic is repairs • NACK traffic reduced by a factor of 7
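The traffic fractions above are simple arithmetic; a sketch with the slide's group size:

```python
k = 7  # EC group size; large group, so each of the k packets is lost by someone

# Without FEC: every original eventually needs its own repair packet,
# so repairs are half of all traffic.
repair_frac_no_fec = k / (k + k)

# With ECSRM: one erasure-correcting packet repairs the whole group.
repair_frac_ecsrm = 1 / (k + 1)

print(repair_frac_no_fec)   # → 0.5
print(repair_frac_ecsrm)    # → 0.125
```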
Multicast PowerPoint Add-in • Control information and annotations are sent via ECSRM • Slides and the slide master are sent via Fcast
Multicast PowerPoint - Late Joiners • Viewers joining late don’t impact others: session-persistent data (e.g. the slide master) is carried by Fcast, so a late joiner picks it up on its own schedule while current slides and control data flow over ECSRM
Future Work • Adding hierarchy (e.g. PGM by Cisco) • Do we need 2 protocols?
RAGS: RAndom SQL test Generator • Microsoft spends a LOT of money on testing (60% of development, according to one source). • Idea: test SQL by • generating random but correct queries • executing the queries against the database • comparing results with SQL 6.5, DB2, Oracle, Sybase • Being used in SQL 7.0 testing. • 375 unique bugs found (since 2/97) • A very productive test tool
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND 
(-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) Sample Rags Generated Statement This Statement yields an error:SQLState=37000, Error=8623Internal Query Processor Error:Query processor could not produce a query plan.
Automation • Simpler Statement with same error SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1) • Control statement attributes • complexity, kind, depth, ... • Multi-user stress tests • tests concurrency, allocation, recovery
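The generation idea can be sketched as a tiny recursive grammar walk. The table and column names here are hypothetical; real RAGS reads the schema, covers far more of SQL, and controls complexity, kind, and depth as noted above.

```python
import random

COLUMNS = ["t.price", "t.advance", "t.royalty"]   # assumed schema

def expr(depth=0):
    """Random, syntactically valid scalar expression; recursion is depth-bounded."""
    if depth >= 3 or random.random() < 0.4:
        return random.choice(COLUMNS + [str(random.randint(-9, 9))])
    op = random.choice(["+", "-", "*"])
    return f"({expr(depth + 1)} {op} {expr(depth + 1)})"

def query():
    """Embed random expressions in a SELECT skeleton."""
    return f"SELECT {expr()}, {expr()} FROM titles t WHERE {expr()} >= {expr()}"

print(query())   # e.g. SELECT (t.price * 3), t.royalty FROM titles t WHERE ...
```

Because every generated statement is valid by construction, any difference in results (or an internal error, as in the sample statement) points at a product bug rather than at the test.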
One 4-Vendor Rags Test: 3 of them vs. Us • 60 k SELECTs on MSS, DB2, Oracle, Sybase. • 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements. • Examined 10 suspects, filed 4 bugs! One duplicate. Assume 3/10 are new. • Note: this is the SQL Server Beta 2 product. Quality is rising fast (and RAGS sees that).
RAGS Next Steps • Done: • Patents • Papers • Talks • Tech transfer to development • Next steps: • Extend to other parts of SQL and T-SQL • “Crawl” the config space (look for new holes) • Apply ideas to other domains (OLE DB).
Scaleup - Big Database • Build a 1 TB SQL Server database • Show off Windows NT and SQL Server scalability • Stress test the product • Data must be • 1 TB • Unencumbered • Interesting to everyone everywhere • And not offensive to anyone anywhere • Loaded • 1.1 M place names from Encarta World Atlas • 1 M Sq Km from USGS (1 meter resolution) • 2 M Sq Km from Russian Space agency (2 m) • Will be on web (world’s largest atlas) • Sell images with commerce server. • USGS CRDA: 3 TB more coming.
SPIN-2: The System • DEC AlphaServer 8400 • 324 StorageWorks drives (2.9 TB) • SQL Server 7.0 • USGS 1-meter data (30% of US) • Russian Space data: two-meter resolution images
World’s Largest PC! • 324 disks (2.9 terabytes) • 8 x 440 MHz Alpha CPUs • 10 GB DRAM
Hardware: SPIN-2 1 TB Database Server • AlphaServer 8400 (4x400) • 10 GB RAM • 324 StorageWorks disks • 10-drive tape library (STC TimberWolf DLT7000)
Software: Terra-Server Web Site • Web client: HTML browser or Java viewer, over the Internet • Web site: Internet Information Server 4.0 + Active Server Pages, Microsoft Site Server EE, MTS • Image delivery application: Microsoft Automap ActiveX Server (Automap Server) and Image Server • Database: SQL Server 7 (Sphinx) running the Terra-Server stored procedures against the Terra-Server DB • Imagery supplied by image provider site(s)
System Management & Maintenance • Backup and Recovery • STC 9717 tape robot • Legato NetWorker™ • Sphinx Backup/Restore Utility • Clocked at 80 MBps peak (~200 GB/hr) • SQL Server Enterprise Mgr • DBA Maintenance • SQL Performance Monitor
TerraServer File Group Layout (volumes E:, F:, G:, H:) • Convert 324 disks to 28 RAID5 sets plus 28 spare drives • Make 4 NT volumes (RAID 50), 595 GB per volume • Build 30 20-GB files on each volume • DB is a File Group of 120 files
Demo: http://TerraWeb2
Technical Challenge: Key Idea • Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server) • Solution: • Geo-spatial search key: • Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y) • Z-transform X & Y into a single Z value, build a B-tree on Z • Adjacent images are stored next to each other • Search method: • Latitude and longitude => X, Y, then Z • Select on matching Z value
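The Z-transform above is a bit interleave (Morton/Z-order) of the grid X and Y indices, so spatially adjacent tiles get nearby B-tree keys. The bit width and the grid origin at (-180°, -90°) are assumptions for illustration:

```python
def z_value(x, y, bits=16):
    """Interleave the bits of X and Y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # X bits -> even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # Y bits -> odd positions
    return z

def grid_cell(lon, lat):
    """Tile grid: 1/48th degree of longitude (X) by 1/96th degree of latitude (Y)."""
    return int((lon + 180) * 48), int((lat + 90) * 96)

x, y = grid_cell(-122.3, 37.6)    # somewhere near the Bay Area
key = z_value(x, y)               # then: SELECT ... WHERE z = key
print(z_value(3, 3))              # → 15 (bits fully interleaved)
```

Because neighboring cells differ only in low-order X/Y bits, their Z values cluster in the B-tree, which is what keeps adjacent images stored next to each other with a plain index.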
New Since S-Day • More data: 4.8 TB USGS DOQ, .5 TB Russian • Bigger server: Alpha 8400, 8 proc, 8 GB RAM, 2.5 TB disk • Improved application: better UI, uses ASP, commerce app • Cut images and load Sept & Feb • Built commerce app for USGS & Spin-2 • Launch at Fed Scalability Day, SQL 7 Beta 3 (6/24/98)
Now What? • Operate on the Internet for 18 months • Add more data (double) • Working with Sloan Digital Sky Survey: 40 TB of images, 3 TB of “objects”
NT Clusters (Wolfpack) • Scale DOWN to PDA: WindowsCE • Scale UP an SMP: TerraServer • Scale OUT with a cluster of machines • Single-system image • Naming • Protection/security • Management/load balance • Fault tolerance • “Wolfpack” • Hot pluggable hardware & software
Symmetric Virtual Server Failover Example (figure: a browser talks to two servers; each server runs a web site and a database as virtual servers, with shared access to the web site files and database files, so either server can take over the other’s virtual servers on failure)
Clusters & BackOffice • Research: Instant & Transparent failover • Making BackOffice PlugNPlay on Wolfpack • Automatic install & configure • Virtual Server concept makes it easy • simpler management concept • simpler context/state migration • transparent to applications • SQL 6.5E & 7.0 Failover • MSMQ (queues), MTS (transactions).
1.2 B tpd • 1 B tpd (transactions per day) ran for 24 hrs. • Out-of-the-box software • Off-the-shelf hardware • AMAZING! • Sized for 30 days • Linear growth • 5 micro-dollars per transaction
The Memory Hierarchy • Measured & Modeled Sequential IO • Where is the bottleneck? • How does it scale with SMP, RAID, new interconnects? • Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s) • (figure: the path from the app address space through the file cache, memory bus, PCI, adapter, and SCSI controller to disk)
Sequential IO: your mileage will vary • Measured hardware & software “out of the box”; find software fixes • Advertised UW SCSI: 40 MB/sec • Actual disk transfer: 35r-23w MB/sec • 64 KB requests (NTFS): 29r-17w MB/sec • Single disk media: 9 MB/sec • 2 KB requests (SQL Server): 3 MB/sec • 1/2 power point: 50% of peak power “out of the box”
PAP (peak advertised Performance) vsRAP (real application performance) • Goal: RAP = PAP / 2 (the half-power point) • http://research.Microsoft.com/BARC/Sequential_IO/
Bottleneck Analysis • Component limits: disk-to-adapter ~30 MBps, adapter-to-PCI ~70 MBps, memory read/write ~150 MBps • NTFS read/write with 12 disks, 4 SCSI adapters, 2 PCI buses (not measured; we had only one PCI bus available, the 2nd one was “internal”): ~120 MBps unbuffered read, ~80 MBps unbuffered write, ~40 MBps buffered read, ~35 MBps buffered write
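The analysis above amounts to taking the minimum over per-stage aggregate bandwidths. A sketch using the slide's approximate component numbers; real aggregates rarely scale perfectly linearly, so this only locates the likely bottleneck:

```python
# Per-stage aggregate bandwidths for the 12-disk, 4-SCSI, 2-PCI configuration.
stages_mbps = {
    "disks (12 x ~9 MBps media rate)": 12 * 9,
    "SCSI adapters (4 x ~30 MBps)": 4 * 30,
    "PCI buses (2 x ~70 MBps)": 2 * 70,
    "memory (~150 MBps)": 150,
}

# The pipeline runs at the rate of its slowest aggregate stage.
bottleneck = min(stages_mbps, key=stages_mbps.get)
print(bottleneck, stages_mbps[bottleneck])
```

With these numbers the stages are fairly balanced (108 to 150 MBps), which is the "balanced bottlenecks" goal stated earlier, and consistent with the measured ~120 MBps unbuffered read.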
Penny Sort Ground Rules http://research.microsoft.com/barc/SortBenchmark • How much can you sort for a penny? • Hardware and software cost • Depreciated over 3 years • A 1M$ system gets about 1 second • A 1K$ system gets about 1,000 seconds • Time (seconds) = 946,080 / SystemPrice ($) • Input and output are disk resident • Input is • 100-byte records (random data) • key is the first 10 bytes • Must create the output file and fill it with a sorted version of the input file • Daytona (product) and Indy (special) categories
PennySort • Hardware: • 266 MHz Intel P2 • 64 MB SDRAM (10 ns) • Dual UDMA 3.2 GB EIDE disks • Software: • NT Workstation 4.3 • NT 5 sort • Performance: • Sorted 15 M 100-byte records (~1.5 GB), disk to disk • Elapsed time 820 sec • CPU time = 404 sec or 100 sec