
BARC

BARC: Microsoft Bay Area Research Center, 6/20/97 (PUBLIC VERSION). Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen. http://www.research.Microsoft.com/barc/



Presentation Transcript


  1. BARC: Microsoft Bay Area Research Center • 6/20/97 (PUBLIC VERSION) • Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen • http://www.research.Microsoft.com/barc/

  2. Telepresence • The next killer app • Space shifting: • Reduce travel • Time shifting: • Retrospective • Offer condensations • Just in time meetings. • Example: ACM 97 • NetShow and Web site. • More web visitors than attendees • People-to-People communication

  3. Telepresent • Jim Gemmell • Scalable Reliable Multicast • Outline • What Reliable Multicast is & why it is hard to scale • Fcast file transfer • ECSRM • Layered Telepresentations

  4. Multiple Unicast • Sender must repeat • Link sees repeats

  5. IP Multicast • Pruned broadcast • Unreliable

  6. Reliable Multicast • Difficult to scale: • Sender state explosion • Message implosion • State: receiver 1, receiver 2, … receiver n

  7. Receiver-Reliable • Receiver’s job to NACK • State: receiver 1, receiver 2, … receiver n

  8. SRM Approaches • Hierarchy / local recovery • Forward Error Correction (FEC) • Suppression • *HYBRID* • Fcast is FEC only • ECSRM is suppression + FEC

  9. (n,k) linear block encoding • Original packets: 1, 2, …, k • Encode (copy 1st k): 1, 2, …, k, k+1, k+2, …, n • Take any k • Decode: original packets 1, 2, …, k
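As a minimal sketch of the "take any k" property, the simplest systematic case is n = k + 1: copy the k originals and append a single XOR parity packet. This is only an illustration under that assumption, not the code used by Fcast or ECSRM; general (n,k) codes with more than one redundant packet need a stronger code than plain XOR. The function names are hypothetical, and all packets are assumed to be the same length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(originals):
    """Systematic (k+1, k) code: the k originals followed by one XOR parity packet."""
    parity = reduce(xor_bytes, originals)
    return list(originals) + [parity]


def decode(received, k):
    """Recover the k originals from any k of the k+1 encoded packets.

    `received` maps packet index (0..k, where index k is the parity) to payload.
    """
    if len(received) < k:
        raise ValueError("need at least k packets")
    missing = [i for i in range(k) if i not in received]
    if not missing:
        return [received[i] for i in range(k)]
    # With at least k packets present, at most one original can be missing,
    # and the parity packet must then be among the received ones.
    (lost,) = missing
    recovered = reduce(xor_bytes, (received[i] for i in range(k + 1) if i != lost))
    return [recovered if i == lost else received[i] for i in range(k)]
```

Whichever single packet is dropped, original or parity, the receiver still decodes the full set, so the sender never needs to know which packet a given receiver lost.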

  10. Fcast • File transfer protocol • FEC-only • Files transmitted in parallel

  11. Fcast send order • [Figure: grid of packets; each row holds packet indices 1, 2, …, k, k+1, k+2, …, n; rows are grouped into File 1, File 2] • Need k from each row

  12. Fcast reception time • Sender timeline: Files + FEC, Files + FEC, Files + FEC, Files + FEC • Low loss receiver: join, leave • High loss receiver: join, leave
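One plausible reading of the send-order grid is that the sender walks it by packet index: index 1 of every row, then index 2 of every row, and so on through n, so that losses are spread evenly across rows. The sketch below assumes that reading; the function names and the (row, index) representation are hypothetical and are not taken from the Fcast implementation.

```python
def fcast_send_order(num_rows, n):
    """Yield (row, index) pairs: packet index 1 of every row, then index 2, ... up to n."""
    for index in range(1, n + 1):
        for row in range(num_rows):
            yield (row, index)


def rows_complete(received, num_rows, k):
    """True once at least k distinct packets have arrived for every row."""
    counts = [0] * num_rows
    for row, _index in set(received):
        counts[row] += 1
    return all(c >= k for c in counts)
```

Under this ordering a receiver can stop listening as soon as `rows_complete` holds, which matches the join/leave behaviour shown for low-loss and high-loss receivers.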

  13. Fcast demo

  14. ECSRM - Erasure Correcting SRM • Combines: • suppression • erasure correction

  15. Suppression • Delay a NACK or repair in the hopes that someone else will do it. • NACKs are multicast • After NACKing, re-set timer and wait for repair • If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.

  16. ECSRM - adding FEC to suppression • Assign each packet to an EC group of size k • NACK: (group, # missing) • NACK of (g,c) suppresses all (g, x ≤ c). • Don’t re-send originals; send EC packets using (n,k) encoding
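The NACK bookkeeping above can be sketched as follows. This is an illustration only: it reads the suppression rule as "a heard NACK (g, c) covers any pending NACK (g, x) with x ≤ c", which is an interpretation of the slide, and the names `Nack`, `group_of`, `nacks_needed`, and `is_suppressed` are hypothetical. Timers and the multicast transport are left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Nack:
    group: int    # EC group number
    missing: int  # number of packets this receiver is missing from the group


def group_of(seq: int, k: int) -> int:
    """EC group for a packet sequence number (0-based), with groups of size k."""
    return seq // k


def nacks_needed(lost_seqs, k):
    """One NACK per group, carrying only the count of missing packets."""
    counts = Counter(group_of(s, k) for s in lost_seqs)
    return [Nack(g, c) for g, c in sorted(counts.items())]


def is_suppressed(pending: Nack, heard: Nack) -> bool:
    """A heard NACK covers a pending one for the same group with no more losses."""
    return heard.group == pending.group and pending.missing <= heard.missing
```

Because a NACK names a count rather than specific packets, seven receivers that each lost a different packet of the same group all produce the same NACK, and one erasure-correcting repair packet serves all of them.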

  17. Example: EC group size (k) = 7 • [Figure: packet sequence 1234567 shown for the sender and seven receivers, with seven losses marked X]

  18. …example • NACK: Group 1, 1 lost • NACKs suppressed

  19. …example • Erasure correcting packet

  20. …example: summary • Normal suppression needs: 7 NACKs, 7 repairs • ECSRM requires: 1 NACK, 1 repair • Large group: each packet lost by someone • Without FEC, 1/2 of traffic is repairs • With ECSRM, only 1/8 of traffic is repairs • NACK traffic reduced by factor of 7
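The 1/2 and 1/8 figures follow from counting data packets per EC group of k = 7: seven originals plus seven repairs without FEC, against seven originals plus one erasure-correcting packet with ECSRM. A small check of that arithmetic (the function name is hypothetical):

```python
from fractions import Fraction


def repair_fraction(originals: int, repairs: int) -> Fraction:
    """Share of data traffic that is repair packets."""
    return Fraction(repairs, originals + repairs)


K = 7
without_fec = repair_fraction(K, repairs=K)   # every original is re-sent once
with_ecsrm = repair_fraction(K, repairs=1)    # one EC packet repairs all receivers
nack_reduction = Fraction(7, 1)               # 7 NACKs versus 1 NACK
```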

  21. Simulation: 112 receivers

  22. Simulation: 112 receivers

  23. Multicast PowerPoint Add-in • ECSRM: Control information, Slides, Annotations • Fcast: slide master

  24. Multicast PowerPoint - Late Joiners • Viewers joining late don’t impact others with session persistent data (slide master) • [Timeline: Fcast (join, leave); ECSRM (join); time]

  25. Future Work • Adding hierarchy (e.g. PGM by Cisco) • Do we need 2 protocols?

  26. RAGS: RAndom SQL test Generator • Microsoft spends a LOT of money on testing (60% of development according to one source). • Idea: test SQL by • generating random correct queries • executing queries against database • comparing results with SQL 6.5, DB2, Oracle, Sybase • Being used in SQL 7.0 testing. • 375 unique bugs found (since 2/97) • Very productive test tool
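The generate, execute, compare loop can be sketched in a few lines. This is not the RAGS implementation: it assumes a tiny hypothetical `sales` table with two integer columns, generates only arithmetic expressions, and uses two in-memory SQLite connections as stand-ins for the different vendors' engines, so no disagreements are expected here. Any query on which the engines disagree is recorded as a suspect.

```python
import random
import sqlite3

COLUMNS = ["qty", "price"]  # hypothetical numeric columns of the test table


def random_expr(rng, depth):
    """Random arithmetic expression over the columns and small integer constants."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(COLUMNS + [str(rng.randint(1, 9))])
    op = rng.choice(["+", "-", "*"])
    return f"({random_expr(rng, depth - 1)} {op} {random_expr(rng, depth - 1)})"


def random_select(rng, depth=3):
    """Random but syntactically valid SELECT; ORDER BY makes results comparable."""
    return (f"SELECT {random_expr(rng, depth)} FROM sales "
            f"WHERE {random_expr(rng, depth)} > {random_expr(rng, depth)} ORDER BY 1")


def make_engine():
    """Stand-in 'engine': an in-memory SQLite database with fixed test data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (qty INTEGER, price INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10), (5, 3), (20, 7)])
    return db


def differential_test(engines, seed, count):
    """Run the same random queries on every engine; return queries whose results differ."""
    rng = random.Random(seed)
    suspects = []
    for _ in range(count):
        sql = random_select(rng)
        results = [db.execute(sql).fetchall() for db in engines]
        if any(r != results[0] for r in results[1:]):
            suspects.append(sql)
    return suspects
```

Seeding the generator makes every run reproducible, which matters when a suspect query has to be re-run and reduced to a simpler statement with the same error.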

  27. SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) 
AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 ) • Sample RAGS Generated Statement • This statement yields an error: SQLState=37000, Error=8623, Internal Query Processor Error: Query processor could not produce a query plan.

  28. Automation • Simpler Statement with same error SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1) • Control statement attributes • complexity, kind, depth, ... • Multi-user stress tests • tests concurrency, allocation, recovery

  29. One 4-Vendor RAGS Test: 3 of them vs Us • 60 k Selects on MSS, DB2, Oracle, Sybase. • 17 SQL Server Beta 2 suspects: 1 suspect per 3350 statements. • Examine 10 suspects, filed 4 bugs! One duplicate. Assume 3/10 are new. • Note: This is the SS Beta 2 product. Quality rising fast (and RAGS sees that)

  30. RAGS Next Steps • Done: • Patents, • Papers, • talks, • tech transfer to development • Next steps: • Extend to other parts of SQL and T-SQL • “Crawl” the config space (look for new holes) • Apply ideas to other domains (OLE DB).

  31. Scaleup - Big Database • Build a 1 TB SQL Server database • Show off Windows NT and SQL Server scalability • Stress test the product • Data must be • 1 TB • Unencumbered • Interesting to everyone everywhere • And not offensive to anyone anywhere • Loaded • 1.1 M place names from Encarta World Atlas • 1 M Sq Km from USGS (1 meter resolution) • 2 M Sq Km from Russian Space agency (2 m) • Will be on web (world’s largest atlas) • Sell images with commerce server. • USGS CRDA: 3 TB more coming.

  32. SPIN-2 The System • DEC Alpha + 8400 • 324 StorageWorks Drives (2.9 TB) • SQL Server 7.0 • USGS 1-meter data (30% of US) • Russian Space data: two meter resolution images

  33. World’s Largest PC! • 324 disks (2.9 terabytes) • 8 x 440 MHz Alpha CPUs • 10 GB DRAM

  34. Hardware • SPIN-2 • 1TB Database Server: AlphaServer 8400 4x400, 10 GB RAM, 324 StorageWorks disks, 10 drive tape library (STC Timber Wolf DLT7000)

  35. Software • [Architecture diagram labels:] Terra-Server Web Site • Web Client • Image Server • Active Server Pages • Internet Information Server 4.0 • HTML • Java Viewer • The Internet • browser • MTS • Terra-Server Stored Procedures • Internet Info Server 4.0 • Internet Information Server 4.0 • Sphinx (SQL Server) • Microsoft Site Server EE • Microsoft Automap ActiveX Server • Automap Server • Image Delivery Application • SQL Server 7 • Terra-Server DB • Image Provider Site(s)

  36. System Management & Maintenance • Backup and Recovery • STC 9717 Tape robot • Legato NetWorker™ • Sphinx Backup/Restore Utility • Clocked at 80 MBps (peak)(~ 200 GB/hr) • SQL Server Enterprise Mgr • DBA Maintenance • SQL Performance Monitor

  37. TerraServer File Group Layout • Convert 324 disks to 28 RAID5 sets plus 28 spare drives • Make 4 NT volumes (RAID 50), 595 GB per volume • Build 30 20GB files on each volume • DB is File Group of 120 files • [Volume labels: E:, F:, G:, H:]

  38. Demo Http://TerraWeb2

  39. Technical Challenge: Key idea • Problem: Geo-Spatial Search without geo-spatial access methods (just standard SQL Server). • Solution: • Geo-spatial search key: • Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y) • Z-transform X & Y into single Z value, build B-tree on Z • Adjacent images stored next to each other • Search Method: • Latitude and Longitude => X, Y, then Z • Select on matching Z value
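A common way to turn two cell indices into one sortable key is bit interleaving (a Z-order or Morton code). The slide does not say exactly how X and Y are combined, so the sketch below assumes bit interleaving, and it also assumes the grid origin sits at longitude -180, latitude -90; the function names are hypothetical.

```python
import math


def cell_indices(lat: float, lon: float):
    """Grid cell for a point: X in 1/48-degree longitude steps, Y in 1/96-degree latitude steps.

    Assumes the grid origin is at (lon=-180, lat=-90) so both indices are non-negative.
    """
    x = math.floor((lon + 180.0) * 48)
    y = math.floor((lat + 90.0) * 96)
    return x, y


def interleave(x: int, y: int, bits: int = 16) -> int:
    """Z value: bits of x in the even positions, bits of y in the odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_key(lat: float, lon: float) -> int:
    """Latitude and longitude => X, Y, then Z, as in the search method above."""
    x, y = cell_indices(lat, lon)
    return interleave(x, y)
```

Because nearby cells share high-order bits of Z, an ordinary B-tree on the Z column tends to place adjacent images close together, and a point lookup becomes an equality select on `z_key(lat, lon)`. Sixteen bits per axis are enough here, since both indices stay below 17,280.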

  40. New Since S-Day: • More data: 4.8 TB USGS DOQ, .5 TB Russian • Bigger Server: Alpha 8400, 8 proc, 8 GB RAM, 2.5 TB Disk • Improved Application: Better UI, Uses ASP, Commerce App • Cut images and Load Sept & Feb • Built Commerce App for USGS & Spin-2 • Launch at Fed Scalability Day, SQL 7 Beta 3 (6/24/98) • Operate on Internet for 18 months • Add more data (double) • Working with Sloan Digital Sky Survey: 40 TB of images, 3 TB of “objects” • Now What?

  41. NT Clusters (Wolfpack) • Scale DOWN to PDA: WindowsCE • Scale UP an SMP: TerraServer • Scale OUT with a cluster of machines • Single-system image • Naming • Protection/security • Management/load balance • Fault tolerance • “Wolfpack” • Hot pluggable hardware & software

  42. Symmetric Virtual Server Failover Example • [Diagram: Browser; Server 1 and Server 2, each with Web site and Database; Web site files; Database files]

  43. Clusters & BackOffice • Research: Instant & Transparent failover • Making BackOffice PlugNPlay on Wolfpack • Automatic install & configure • Virtual Server concept makes it easy • simpler management concept • simpler context/state migration • transparent to applications • SQL 6.5E & 7.0 Failover • MSMQ (queues), MTS (transactions).

  44. 1.2 B tpd • 1 B tpd ran for 24 hrs. • Out-of-the-box software • Off-the-shelf hardware • AMAZING! • Sized for 30 days • Linear growth • 5 micro-dollars per transaction

  45. The Memory Hierarchy • Measured & Modeled Sequential IO • Where is the bottleneck? • How does it scale with • SMP, RAID, new interconnects • Goals: • balanced bottlenecks • Low overhead • Scale many processors (10s) • Scale many disks (100s) • [Diagram labels: Memory, App address space, Mem bus, File cache, Controller, Adapter, SCSI, PCI]

  46. Sequential IO: your mileage will vary • 40 MB/sec: Advertised UW SCSI • 35r-23w MB/sec: Actual disk transfer • 29r-17w MB/sec: 64 KB request (NTFS) • 9 MB/sec: Single disk media • 3 MB/sec: 2 KB request (SQL Server) • Measured hardware & software • Find software fixes • “out of the box” 1/2 power point: 50% of peak power “out of the box”

  47. PAP (peak advertised performance) vs RAP (real application performance) • Goal: RAP = PAP / 2 (the half-power point) • http://research.Microsoft.com/BARC/Sequential_IO/

  48. Bottleneck Analysis • NTFS Read/Write: 12 disk, 4 SCSI, 2 PCI (not measured, we had only one PCI bus available, 2nd one was “internal”) • ~ 120 MBps Unbuffered read • ~ 80 MBps Unbuffered write • ~ 40 MBps Buffered read • ~ 35 MBps Buffered write • [Diagram labels: Adapter ~30 MBps; PCI ~70 MBps; Memory Read/Write ~150 MBps; Disk; 120 MBps]

  49. Penny Sort Ground Rules • http://research.microsoft.com/barc/SortBenchmark • How much can you sort for a penny? • Hardware and Software cost • Depreciated over 3 years • 1M$ system gets about 1 second, • 1K$ system gets about 1,000 seconds. • Time (seconds) = 946,080 / SystemPrice ($) • Input and output are disk resident • Input is • 100-byte records (random data) • key is first 10 bytes. • Must create output file and fill with sorted version of input file. • Daytona (product) and Indy (special) categories
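The time budget follows from the depreciation rule: three years contain 3 × 365 × 24 × 3600 = 94,608,000 seconds, so one penny (1/100 of a dollar) buys 946,080 / SystemPrice seconds of a system. That is consistent with the two examples given: about 1 second on a $1M system and about 1,000 seconds on a $1K system. A short check, with hypothetical names:

```python
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 60 * 60      # 94,608,000
PENNY_CONSTANT = SECONDS_IN_3_YEARS // 100       # 946,080 dollar-seconds per penny


def penny_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time that one penny buys, with 3-year depreciation."""
    return PENNY_CONSTANT / system_price_dollars


def max_system_price(elapsed_seconds: float) -> float:
    """Highest system price for which a run of this length still costs one penny."""
    return PENNY_CONSTANT / elapsed_seconds
```

For example, `penny_seconds(1_000_000)` is a little under one second and `penny_seconds(1_000)` is about 946 seconds.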

  50. PennySort • Hardware • 266 MHz Intel P2 • 64 MB SDRAM (10ns) • Dual UDMA 3.2GB EIDE disk • Software • NT workstation 4.3 • NT 5 sort • Performance • sort 15 M 100-byte records (~1.5 GB) • Disk to disk • elapsed time 820 sec • cpu time = 404 sec or 100 sec
