PennySort Award Ceremony, Beijing, China, 23 October 2006
Outline • Penny Sort history and Award • What I have been doing.
Benchmark History (timeline, 1970 to 2010) • IBM TP 1-7 (CA and Tony Lukes) • Debit Credit (Gray) • Wisconsin (Bitton, Boral, DeWitt, Turbyfill) • Datamation (Anon et al.) • Sort • MCC (Boral &...) • Teradata (Bollinger &...) • TPC-A • TPC-B • TPC-C • TPC-D • PennySort • MinuteSort • TPC-H • TPC-W • ?
A Short History of Sort • April Fools 1985: Datamation Sort • Sort 1M 100-byte records • An IO benchmark: 15 min to 1 hr! • 1993: {Minute | Penny} x {Daytona | Indy} • 1998: TeraByte Sort • Web site: http://research.Microsoft.com/barc/SortBenchmark/
Ground Rules • How much can you sort for a penny (or in a minute)? • Hardware cost • Depreciated over 3 years • A 1M$ system gets about 1 second, • a 1K$ system gets about 1,000 seconds. • Time (seconds) = 946,080 / SystemPrice ($) • Input and output are disk resident • Input is • 100-byte records (random data) • key is first 10 bytes. • Must create output file and fill it with a sorted version of the input file. • Daytona (product) and Indy (special) categories
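The penny budget follows directly from the 3-year depreciation rule. A minimal sketch of the arithmetic, where the function name `penny_budget_seconds` is an illustrative assumption and not part of the benchmark definition:

```python
# Three years of wall-clock time, ignoring leap days (3 * 365 days).
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 60 * 60  # 94,608,000 seconds

def penny_budget_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time that one penny buys (hypothetical helper).

    The system price is spread evenly over three years, so one penny
    (1/100 of a dollar) buys (SECONDS_IN_3_YEARS / 100) / price seconds.
    """
    return SECONDS_IN_3_YEARS / 100 / system_price_dollars

if __name__ == "__main__":
    for price in (1_000_000, 1_000, 1_469):
        print(f"${price:>9,}: {penny_budget_seconds(price):8.1f} seconds")
```

For the 1,469$ GpuTeraSort system this gives about 644 seconds, which matches the elapsed time quoted for that entry.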
1998 PennySort • Hardware • 266 MHz Intel PPro • 64 MB SDRAM (10 ns) • Dual Fujitsu DMA 3.2 GB EIDE disks • Software • NT workstation 4.3 • NT 5 sort • Performance • sort 15 M 100-byte records (~1.5 GB) • Disk to disk • elapsed time 820 sec • cpu time = 404 sec
2004 Daytona Terabyte Sort • NEC Express/5800/1320Xd: 32x Itanium2 1.5 GHz, 128 GB, 900-disk TPC-C machine • Striped across 20 HBAs • Read and write at 3.5 GBps • Sort 34 GB in 60 seconds. • Sort 1 TB in 33 minutes • [Figure: input phase of 1 TB Nsort]
2006 Sort Records • Penny, Daytona: 344 million records (32 GB) in 1,679 seconds, Bytes-Split-Index Sort (BSIS); $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP; Xing Huang and BinHeng Song, School of Software, Tsinghua U., Beijing, China; Bo Huang, Math&CS, Hunan U. of Technology, Zhuzhou, China • Penny, Indy: 590 M records (55 GB) in 644 seconds, GpuTeraSort; 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP; Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, Jim Gray, U. North Carolina at Chapel Hill, USA • Minute, Daytona: 40 GB (400 million records), NeoSort; Windows, Fujitsu 32 Itanium2, 128 SAN disks; Chris Nyberg, Charles Koester, Ordinal Technology • Minute, Indy (2005): 116GB (125 M records) in 58.7 seconds, SCS; Linux, 80 Itanium2, 2,520 SAN disks; Jim Wyllie, IBM Almaden Research • TeraByte, Daytona (2004): 33 minutes, Nsort; Windows, 32 Itanium2, 2,350 SAN disks; Chris Nyberg, Charles Koester, Ordinal Technology • TeraByte, Indy (2005): 435 seconds (7.25 minutes), SCS; Linux, 80 Itanium2, 2,520 SAN disks; Jim Wyllie, IBM Almaden Research
Bytes Split Index Sort (BSIS) • Xing Huang & BinHeng Song, Tsinghua; Bo Huang, Hunan U. of Technology • A radix-partition sort. • Then merge the partitions. • 344 million records (32 GB) in 1,679 seconds • $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP • Phase 1: 66 MB/s, Phase 2: 28 MB/s • See http://research.microsoft.com/barc/SortBenchmark/BSIS-PennySort_2006.pdf
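The general idea of a radix-partition sort can be sketched in a few lines. This is an in-memory illustration only, not the BSIS authors' implementation: it assumes 100-byte records with a 10-byte key, partitions on the first key byte, and uses the hypothetical name `radix_partition_sort`. A disk-based sort would write each partition to a run file in phase 1 and read the partitions back in phase 2.

```python
import os

KEY_BYTES = 10       # key is the first 10 bytes of each record
RECORD_BYTES = 100   # benchmark record size

def radix_partition_sort(records):
    """Sort byte-string records by key using one radix-partition pass.

    Phase 1: scatter records into 256 partitions by the first key byte.
    Phase 2: sort each partition and concatenate them in partition order.
    Because partitions are ordered by the leading byte, the concatenation
    is globally sorted by key.
    """
    partitions = [[] for _ in range(256)]
    for record in records:
        partitions[record[0]].append(record)  # indexing bytes yields an int

    output = []
    for partition in partitions:
        partition.sort(key=lambda r: r[:KEY_BYTES])
        output.extend(partition)
    return output

if __name__ == "__main__":
    data = [os.urandom(RECORD_BYTES) for _ in range(10_000)]
    result = radix_partition_sort(data)
    print(len(result), "records sorted")
```

Partitioning first keeps each phase-2 sort small enough to fit in memory (or cache), which is the usual motivation for this structure.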
Records per Second per CPU • [Figure: records/sec/cpu on a log scale from 1.E+1 to 1.E+6, plotted for 1985 to 2005; series labeled Mini, Super, and cache conscious; annotation: slow improvement after 1995] • Sort 100-byte records (minute / penny) • Shows we hit the memory ceiling in 1995 • http://research.microsoft.com/barc/SortBenchmark/ • Sort recs/s/cpu plateaued in 1995
Technology Trends: CPU and GPU • [Figure: log of relative processing power vs. year, 2002 to 2008. CPU labels: Mobile CPU, Value Corporate DT, Mainstream Desktop, DT ‘Replacement’, Leading Edge, SW Requirements, Moore’s Law Trajectory; clock labels 0.8 GHz, 1.6 GHz, 2.2 GHz, 4.4 GHz. GPU labels: Value / UMA, Enthusiast / Specialty, Leading Edge 31 GHz, Graphics Req’mts (enhanced experience), Moore’s Law 3 for 18 mo, then Moore’s Law trajectory, Cooling (Cost) Limitations. Other data labels: 4.2, 11.2] • CPU ?
Moore’s Wall: Chip Heat Death • Processor power density going to infinity. • Solution: stabilize clock at ~5 GHz • Multi-core (aka MTA) (1,000 core?)
GPU TeraSort • Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, U. North Carolina at Chapel Hill • Use GPU for Phase 1 bitonic sort • 590 M records (55 GB) in 644 seconds • 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP • Phase 1: 185 MB/s, Phase 2: 150 MB/s • See http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-183
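Bitonic sort suits a GPU because every compare-exchange within a step is independent of the others. The sketch below is a sequential Python illustration of the bitonic sorting network, not the GpuTeraSort code; the function name `bitonic_sort` is an assumption, and the input length must be a power of two.

```python
def bitonic_sort(values):
    """Return a sorted copy of `values` using a bitonic sorting network.

    The inner `for i` loop is the part a GPU would run in parallel:
    each index pairs with `i ^ j`, and no pair overlaps another.
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2
    while k <= n:               # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:           # compare-exchange distance within a merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

if __name__ == "__main__":
    print(bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]))
```

The network performs the same sequence of comparisons regardless of the data, which is what makes it easy to map onto data-parallel hardware.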
Records per Second per CPU • [Figure: records/sec/cpu on a log scale from 1.E+1 to 1.E+6, plotted for 1985 to 2005; series labeled Mini, Super, cache conscious, and GPU; annotations: slow improvement after 1995; GPU better memory architecture, so finally more records/second] • Sort 100-byte records (minute / penny) • Shows we hit the memory ceiling in 1995 • http://research.microsoft.com/barc/SortBenchmark/ • Sort recs/s/cpu plateaued in 1995 • Had to get GPU to get better memory bandwidth • SIGMOD 2006 GpuTeraSort
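2006 PennySort Price Breakdown • BSIS: $760 • GpuTeraSort: $1470 • [Figure: pie chart of the GpuTeraSort system price by component. Slice labels: Assembly; Case, power, fan; Motherboard; Disks; CPU; GPU; Disk controller; RAM. Slice values as extracted: 2%, 3%, 16%, 33%, 12%, 18%, 6%, 10%]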
Sort Performance/Price improved • Based on parallelism and “commodity” not per-cpu performance.
Musings: PennySort=TBsort • 2 pass so 3TB of disk • = 8 disks if 400GB/disk • = 0.5 GBps (if each disk = 65 MBps) • So, 6000 seconds (3TB / 0.5 GBps) • So, node can cost 200$ • Costs 10x that today • maybe in 5 years?
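The estimate above can be reproduced as a short calculation. This is a sketch under the slide's own figures (8 disks, 65 MB/s per disk, 3 TB moved) and the 3-year depreciation rule; the function name `tb_pennysort_estimate` is hypothetical. The slide rounds the bandwidth to 0.5 GBps and the price to roughly 200$, so the unrounded numbers differ slightly.

```python
# A penny buys (3 years in seconds) / 100 / price seconds of machine time.
PENNY_SECONDS_PER_DOLLAR = 3 * 365 * 24 * 60 * 60 / 100  # 946,080

def tb_pennysort_estimate(disks=8, disk_mb_per_s=65.0, io_terabytes=3.0):
    """Return (bandwidth in GB/s, elapsed seconds, affordable node price in $)."""
    bandwidth_bytes_per_s = disks * disk_mb_per_s * 1e6
    seconds = io_terabytes * 1e12 / bandwidth_bytes_per_s
    # Invert the penny rule: the price at which `seconds` costs one penny.
    max_price = PENNY_SECONDS_PER_DOLLAR / seconds
    return bandwidth_bytes_per_s / 1e9, seconds, max_price

if __name__ == "__main__":
    gbps, seconds, price = tb_pennysort_estimate()
    print(f"{gbps:.2f} GB/s, {seconds:.0f} s, node budget ${price:.0f}")
```

With the rounded 0.5 GBps figure the elapsed time is exactly 6,000 seconds and the node budget is about 158$.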
Musings: MinuteSort=TBsort • Sorts 1TB in 1 minute • 1 pass so 1TB of RAM • 266 Gbps bisection bandwidth • 1 pass so 2TB of IO in 60 sec => 600 disks => ~80 nodes (8 disks, 2GB RAM) => interconnect with 10 Gbps Ethernet • or 300 nodes at 1 Gbps Ethernet. • doable today
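The bandwidth figures on this slide follow from moving 2 TB in 60 seconds. A rough sketch of that arithmetic, with the function name `minute_tbsort_estimate` as an illustrative assumption; the per-disk rate is not stated on the slide and is only the value implied by 600 disks:

```python
import math

def minute_tbsort_estimate(io_terabytes=2.0, seconds=60.0, disks=600, nodes=80):
    """Return the aggregate and per-component rates implied by the slide."""
    bytes_per_s = io_terabytes * 1e12 / seconds
    aggregate_gbps = bytes_per_s * 8 / 1e9          # gigabits per second
    return {
        "aggregate_gbps": aggregate_gbps,
        "per_disk_mb_per_s": bytes_per_s / disks / 1e6,
        "per_node_gbps": aggregate_gbps / nodes,
        # Nodes needed if each contributes one 1 Gbps Ethernet link.
        "nodes_at_1gbps": math.ceil(aggregate_gbps / 1.0),
    }

if __name__ == "__main__":
    for name, value in minute_tbsort_estimate().items():
        print(f"{name}: {value:.1f}")
```

Each of 80 nodes needs roughly 3.3 Gbps, which is why the slide pairs that configuration with 10 Gbps Ethernet, while about 270 to 300 nodes would be needed at 1 Gbps each.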
What I Have Been Doing • Traveling & Talking • Helping Build the SkyServer and the Virtual Observatory • Doing spatial geometry in SQL (no kidding)! • Trying to get all science literature and data online and interlinked. • and… • to blob or not to blob • disk reliability
To Blob or Not To Blob • For objects X smaller than 1MB, Select X into x from T where key = 123 is faster than h = open(X); read(h,x,n); close(h) • So, blob beats file for objects < 1MB (on SQL Server – what about other DBs?) • Because DB is CISC and FS is RISC • Most things are less than 1MB • DB should work to make this 10MB • File system should borrow ideas from DB. • “To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?” Rusty Sears, Catharine Van Ingen, Jim Gray, MSR-TR-2006-45, April 2006
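The two access paths being compared can be tried on any machine with a small harness. The sketch below substitutes Python's built-in `sqlite3` module for SQL Server, so it says nothing about the measurements in the cited report; the function name `compare_blob_and_file`, the table layout, and the object size are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
import time

def compare_blob_and_file(size_bytes=64 * 1024, reads=200):
    """Time repeated reads of one object stored as a BLOB and as a file.

    Returns (blob_ok, file_ok, blob_seconds, file_seconds).
    """
    payload = os.urandom(size_bytes)
    with tempfile.TemporaryDirectory() as tmp:
        file_path = os.path.join(tmp, "object.bin")
        with open(file_path, "wb") as f:
            f.write(payload)

        db = sqlite3.connect(os.path.join(tmp, "objects.db"))
        db.execute("CREATE TABLE T (key INTEGER PRIMARY KEY, X BLOB)")
        db.execute("INSERT INTO T (key, X) VALUES (?, ?)", (123, payload))
        db.commit()

        start = time.perf_counter()
        for _ in range(reads):
            (from_db,) = db.execute(
                "SELECT X FROM T WHERE key = ?", (123,)
            ).fetchone()
        blob_seconds = time.perf_counter() - start

        start = time.perf_counter()
        for _ in range(reads):
            with open(file_path, "rb") as f:   # open, read, close each time
                from_file = f.read()
        file_seconds = time.perf_counter() - start

        db.close()  # close before the temporary directory is removed
    return from_db == payload, from_file == payload, blob_seconds, file_seconds

if __name__ == "__main__":
    blob_ok, file_ok, blob_s, file_s = compare_blob_and_file()
    print(f"blob: {blob_s:.4f} s, file: {file_s:.4f} s")
```

Results depend heavily on the database engine, file system, caching, and object size, so a run like this is a starting point for one's own measurements and not a confirmation of the 1MB crossover.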
What About Bit Error Rates • Uncorrectable Errors on Read (UERs) • Quoted uncorrectable bit error rates: 10^-13 to 10^-15 • That’s 1 error in 1TB to 1 error in 100TB • WOW!!! • We moved 1.5 PB looking for errors • Saw 5 UER events • 3 real, 3 of them were masked by retry • Many controller fails and system security reboots • Conclusion: • UER not a useful metric – want mean time to data loss • UER better than advertised. • Empirical Measurements of Disk Failure Rates and Error Rates, Jim Gray, Catharine van Ingen, Microsoft Technical Report MSR-TR-2005-166
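The "better than advertised" conclusion can be checked against the quoted rates with one multiplication. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes) and using the hypothetical name `expected_uers`:

```python
def expected_uers(bytes_moved: float, bit_error_rate: float) -> float:
    """Expected uncorrectable read errors for a given quoted bit error rate."""
    return bytes_moved * 8 * bit_error_rate

if __name__ == "__main__":
    moved = 1.5e15  # 1.5 PB, as on the slide
    for rate in (1e-13, 1e-15):
        print(f"rate {rate:g}: about {expected_uers(moved, rate):,.0f} expected UERs")
```

At the quoted rates, moving 1.5 PB would be expected to produce somewhere between about 12 and about 1,200 UERs, compared with the 5 events observed.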
So, You Want to Copy a Petabyte? • Today, that’s 4,000 disks (read 2k write 2k) • Takes ~4 hours if they run in parallel, but… • Probably not one file. • You will see a few UERs. • What’s the best strategy? • How fast can you move a Petabyte from CERN to Pasadena? Is sneaker-net fastest and cheapest?
UER things I wish I knew • Better statistics from larger farms, and more diversity. • What is the UER on a LAN, WAN? • What is the UER over time: for a file on disk, for a disk? • What’s the best replication strategy? • Symmetric (1+1)+(1+1) or triplex (1+1)+1