Rules of Thumb in Data Engineering Jim Gray International Conference on Data Engineering San Diego, CA 4 March 2000 Gray@Microsoft.com, http://research.Microsoft.com/~Gray/Talks/
Credits & Thank You!! • Prashant Shenoy, U. Mass, Amherst: analysis of web caching rules. shenoy@cs.umass.edu • Terrance Kelly, U. Michigan: lots of advice on fixing the paper, tpkelly@mynah.eecs.umich.edu; interesting work on caching at: http://ai.eecs.umich.edu/~tpkelly/papers/wcp.pdf • Dave Lomet, Paul Larson, Surajit Chaudhuri: how big should database pages be? • Remzi Arpaci-Dusseau, Kim Keeton, Erik Riedel: discussions about balanced systems and IO • Windsor Hsu, Alan Smith, & Honesty Young: also studied TPC-C and balanced systems (very nice work!) http://golem.cs.berkeley.edu/~windsorh/DBChar/ • Anastassia Ailamaki, Kim Keeton: CPI measurements • Gordon Bell: discussions on balanced systems.
and Apology….. • Printed/published paper has MANY bugs! • Conclusions OK (sort of), but typos, flaws, errors,… • Revised version at http://research.microsoft.com/~Gray/ and in CoRR and the MS Research tech report archive, by 15 March 2000. • Sorry! Sorry! Woops!
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Trends: Moore’s Law • Performance/price doubles every 18 months • 100x per decade • Progress in next 18 months = ALL previous progress • New storage = sum of all old storage (ever) • New processing = sum of all old processing • E. coli doubles every 20 minutes!
Trends: ops/s/$ Had Three Growth Phases • 1890-1945: mechanical and relay, 7-year doubling • 1945-1985: tube, transistor,…, 2.3-year doubling • 1985-2000: microprocessor, 1.0-year doubling
Trends: Gilder’s Law: 3x bandwidth/year for 25 more years • Today: • 10 Gbps per channel • 4 channels per fiber: 40 Gbps • 32 fibers/bundle = 1.2 Tbps/bundle • In lab: 3 Tbps/fiber (400x WDM) • In theory: 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth • Aggregate bandwidth doubles every 8 months!
Trends: Magnetic Storage Densities • Amazing progress • Ratios have changed • Capacity grows 60%/y • Access speed grows 10x more slowly
Trends: Density Limits • The end is near! • Products: 11 Gbpsi; lab: 35 Gbpsi; “limit”: 60 Gbpsi • But the limit keeps rising, and there are alternatives: NEMS, fluorescent, holographic, DNA, … [Figure: bit density (b/µm² and Gb/in²) vs time, 1990-2008, showing CD, DVD, and ODD densities climbing toward the wavelength limit and the superparamagnetic limit. Adapted from Franco Vitaliano, “The NEW new media: the growing attraction of nonmagnetic storage”, Data Storage, Feb 2000, pp 21-32, www.datastorage.com]
Trends: promises NEMS (Nano Electro Mechanical Systems) (http://www.nanochip.com/), also Cornell, IBM, CMU,… • 250 Gbpsi by using a tunneling electron microscope • Disk replacement • Capacity: 180 GB now, 1.4 TB in 2 years • Transfer rate: 100 MB/sec read & write • Latency: 0.5 msec • Power: 23 W active, 0.05 W standby • 10k$/TB now, 2k$/TB in 2002
Consequence of Moore’s law: Need an address bit every 18 months. • Moore’s law gives you 2x more in 18 months. • RAM • Today we have 10 MB to 100 GB machines (24-36 bits of addressing) • In 9 years we will need 6 more bits: 30-42 bit addressing (4 TB RAM). • Disks • Today we have 10 GB to 100 TB file systems/DBs (33-47 bit file addresses) • In 9 years, we will need 6 more bits: 40-53 bit file addresses (100 PB files)
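The address-bit arithmetic is easy to check: one capacity doubling per 18 months is one extra address bit per 18 months. A minimal sketch in Python (the 100 GB figure is the slide's upper bound for today's machines):

```python
import math

# One capacity doubling per 18 months = one extra address bit per 18 months.
years = 9
extra_bits = years * 12 / 18               # 6 more bits in 9 years

today_bits = math.log2(100e9)              # ~37 bits to address a 100 GB machine
print(f"{extra_bits:.0f} extra bits; {today_bits:.0f} bits today -> "
      f"{today_bits + extra_bits:.0f} bits "
      f"(~{2 ** (today_bits + extra_bits) / 1e12:.0f} TB) in {years} years")
```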
Architecture could change this • 1-level store: • System/38, AS/400 have a 1-level store. • Never re-uses an address. • Needs 96-bit addressing today. • NUMAs and clusters • Willing to buy a 100 M$ computer? • Then add 6 more address bits. • Only 1-level store pushes us beyond 64 bits • Still, these are “logical” addresses; 64-bit physical will last many years
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Storage Latency: How Far Away is the Data? (access cost in clock ticks, with a human-scale analogy) • Registers: 1 (my head, 1 min) • On-chip cache: 2 (this room) • On-board cache: 10 (this hotel, 10 min) • Memory: 100 (Olympia, 1.5 hr) • Disk: 10^6 (Pluto, 2 years) • Tape/optical robot: 10^9 (Andromeda, 2,000 years)
Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs [Figure: two log-log charts against access time (10^-9 to 10^3 seconds). Left: price ($/MB) vs speed for cache, main memory, secondary (disc), and online/nearline/offline tape. Right: typical system capacity (bytes) vs speed for the same tiers: the faster the tier, the more it costs per MB and the less of it a typical system holds.]
Disks: Today • Disk is 8 GB to 80 GB • 10-30 MBps • 5k-15k rpm (6 ms to 2 ms rotational latency) • 12 ms to 7 ms seek • 7k$/IDE-TB, 20k$/SCSI-TB • For shared disks, most time is spent waiting in queue for access to the arm/controller [Figure: an IO's elapsed time breaks down into Wait, Seek, Rotate, Transfer, with Wait dominating for shared disks]
Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 512 MB and 3$/MB • Disk: GB and $/GB: today at 40 GB and 20$/GB • Tape: TB and $/TB: today at 40 GB and 10k$/TB (nearline) • Access time (latency) • RAM: 100 ns • Disk: 15 ms • Tape: 30 second pick, 30 second position • Transfer rate • RAM: 1-10 GB/s • Disk: 20-30 MB/s (arrays can go to 10 GB/s) • Tape: 5-15 MB/s (arrays can go to 1 GB/s)
New Storage Metrics: Kaps, Maps, SCAN • Kaps: How many kilobyte objects served per second • The file server, transaction processing metric • This is the OLD metric. • Maps: How many megabyte objects served per sec • The Multi-Media metric • SCAN: How long to scan all the data • the data mining and utility metric • And • Kaps/$, Maps/$, TBscan/$
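All three metrics fall out of a drive's spec sheet. A minimal sketch in Python (the 40 GB / 20 MBps / ~8 ms-per-access drive is an illustrative circa-2000 configuration taken from the surrounding slides):

```python
def kaps(accesses_per_sec):
    """KB objects served per second: one random access each; transfer is negligible."""
    return accesses_per_sec

def maps(accesses_per_sec, mbps):
    """MB objects served per second: one seek+rotate plus 1 MB of transfer."""
    return 1.0 / (1.0 / accesses_per_sec + 1.0 / mbps)

def scan_minutes(capacity_gb, mbps):
    """Minutes to read the whole drive sequentially."""
    return capacity_gb * 1024 / mbps / 60

# Illustrative drive: 40 GB, 20 MBps, ~8 ms per random access -> ~125 accesses/sec.
aps = 1000 / 8
print(f"Kaps ~ {kaps(aps):.0f}/s, Maps ~ {maps(aps, 20):.1f}/s, "
      f"SCAN ~ {scan_minutes(40, 20):.0f} min")
```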
Storage Ratios Changed • 10x better access time • 10x more bandwidth • 100x more capacity • Data 25x cooler (1 Kaps per 20 MB vs 1 Kaps per 500 MB) • 4,000x lower media price • 20x to 100x lower disk price • Scan takes 10x longer (3 min vs 45 min) • DRAM/disk media price ratio changed: 1970-1990 100:1; 1990-1995 10:1; 1995-1997 50:1; today ~100:1 (0.03$/MB disk vs 3$/MB DRAM)
Data on Disk Can Move to RAM in 10 Years • The DRAM:disk price ratio is 100:1, and prices decline 100x per decade, so data that lives on disk today can afford to live in RAM 10 years from now.
More Kaps and Kaps/$ but…. • Disk accesses got much less expensive: better disks, cheaper disks! • But: disk arms are expensive, the scarce resource • Scanning a 100 GB drive at 30 MB/s takes 45 minutes, vs 5 minutes in 1990
Disk vs Tape Guesstimates • Disk: 40 GB, 20 MBps, 5 ms seek time, 3 ms rotate latency, 7$/GB for drive, 3$/GB for ctlrs/cabinet, 4 TB/rack, 1 hour scan • Tape: 40 GB, 10 MBps, 10 sec pick time, 30-120 second seek time, 2$/GB for media, 8$/GB for drive+library, 10 TB/rack, 1 week scan • (CERN: 200 TB on 3480 tapes; 2 columns = 50 GB; 1 rack = 1 TB = 20 drives) • The price advantage of tape is narrowing, and the performance advantage of disk is growing • At 10k$/TB, disk is competitive with nearline tape.
It’s Hard to Archive a Petabyte. It takes a LONG time to restore it. • At 1 GBps it takes 12 days! • Store it in two (or more) places online (on disk?): a geo-plex • Scrub it continuously (look for errors) • On failure: • use the other copy until the failure is repaired, • refresh the lost copy from the safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
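The 12-day figure is straight division; a quick sketch:

```python
petabyte = 1e15          # bytes
rate = 1e9               # bytes/sec (a 1 GBps restore pipe)
days = petabyte / rate / 86400
print(f"{days:.1f} days")  # ~11.6 days, the slide's "12 days"
```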
The “Absurd” 10x (= 5-year) Disk: 1 TB, 100 MB/s, 200 Kaps • 2.5 hr scan time (poor sequential access) • 1 aps per 5 GB (VERY cold data) • It’s a tape!
How to cool disk data: • Cache data in main memory • See the 5 minute rule later in this presentation • Fewer, larger transfers • Larger pages (512 B -> 8 KB -> 256 KB) • Sequential rather than random access • Random 8 KB IO is 1.5 MBps • Sequential IO is 30 MBps (the 20:1 ratio is growing) • RAID1 (mirroring) rather than RAID5 (parity).
Stripes, Mirrors, Parity (RAID 0, 1, 5) • RAID 0: Stripes • bandwidth • RAID 1: Mirrors, Shadows,… • Fault tolerance • Reads faster, writes 2x slower • RAID 5: Parity • Fault tolerance • Reads faster • Writes 4x or 6x slower [Figure: block layouts. Striping spreads blocks 0,3,6,… / 1,4,7,… / 2,5,8,… across three disks; mirroring writes 0,1,2,… to both disks of a pair; RAID 5 rotates parity: 0,2,P2,… / 1,P1,4,… / P0,3,5,…]
RAID 5 (6 disks, 1 volume): • Performance: 675 reads/sec, 210 writes/sec • A write costs 4 logical IOs: 2 seeks + 1.7 rotates • SAVES SPACE • Performance degrades on failure • RAID 1 (6 disks, 3 pairs): • Performance: 750 reads/sec, 300 writes/sec • A write costs 2 logical IOs: 2 seeks, 0.7 rotate • SAVES ARMS • Performance improves on failure • RAID 10 (stripes of mirrors) wins: “wastes space, saves arms”
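A minimal sketch of where those rates come from, by simple IO counting. It assumes each of the 6 drives sustains about 125 random accesses/sec; the slide's exact figures come from a finer seek/rotate model, so this only lands in the same ballpark:

```python
def raid_rates(disks, aps_per_disk, ios_per_write):
    """Aggregate random read/write rates by counting physical IOs.
    A read costs 1 IO; a RAID 5 write costs 4, a RAID 1 write costs 2."""
    total_ios = disks * aps_per_disk
    return total_ios, total_ios / ios_per_write

# 6 drives, ~125 random accesses/sec each (assumed).
print("RAID 5: ~%d reads/s, ~%d writes/s" % raid_rates(6, 125, 4))
print("RAID 1: ~%d reads/s, ~%d writes/s" % raid_rates(6, 125, 2))
```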
Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 60k$ today, 10k$ in a few years. • Admin cost >> storage cost !!!! • Challenge: • Automate ALL storage admin tasks
Summarizing storage rules of thumb (1) • Moore’s law: 4x every 3 years, 100x more per decade • Implies 2 bits of addressing every 3 years. • Storage capacities increase 100x/decade • Storage costs drop 100x per decade • Storage throughput increases 10x/decade • Data cools 10x/decade • Disk page sizes increase 5x per decade.
Summarizing storage rules of thumb (2) • RAM:Disk and Disk:Tape cost ratios are 100:1 and 3:1 • So, in 10 years, disk data can move to RAM, since prices decline 100x per decade. • A person can administer a million dollars of disk storage: that is 1 TB to 100 TB today • Disks are replacing tapes as backup devices. You can’t backup/restore a petabyte quickly, so geoplex it. • Mirroring rather than parity, to save disk arms
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Standard Architecture (today) [Figure: CPUs on a system bus, bridged to PCI Bus 1 and PCI Bus 2]
Amdahl’s Balance Laws • Parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S. • Balanced system law: a system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps. • Memory law: alpha = 1: the MB/MIPS ratio (called alpha) in a balanced system is 1. • IO law: programs do one IO per 50,000 instructions.
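The parallelism law as a one-line function (a sketch; S and P are in the same arbitrary time units):

```python
def amdahl_speedup(serial, parallel):
    """Max speedup of a job with serial part S and parallel part P: (S+P)/S."""
    return (serial + parallel) / serial

# A job that is 10% serial can never run more than 10x faster,
# no matter how many processors work on the parallel 90%:
print(amdahl_speedup(0.1, 0.9))  # 10.0
```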
Amdahl’s Laws Valid 35 Years Later? • Parallelism law is algebra: so SURE! • Balanced system laws? • Look at TPC results (TPC-C, TPC-H) at http://www.tpc.org/ • Some imagination needed: • What’s an instruction (CPI varies from 1-3)? • RISC, CISC, VLIW, … clocks per instruction,… • What’s an IO?
TPC systems • Normalize for CPI (clocks per instruction)

| System | MHz/cpu | CPI | MIPS | KB/IO | IO/s/disk | Disks | Disks/cpu | MB/s/cpu | Ins/byte of IO |
|---|---|---|---|---|---|---|---|---|---|
| Amdahl | 1 | 1 | 1 | 6 | | | | | 8 |
| TPC-C (random) | 550 | 2.1 | 262 | 8 | 100 | 397 | 50 | 40 | 7 |
| TPC-H (sequential) | 550 | 1.2 | 458 | 64 | 100 | 176 | 22 | 141 | 3 |

• TPC-C has about 7 ins/byte of IO • TPC-H has 3 ins/byte of IO • TPC-H needs ½ as many disks, sequential vs random • Both use 9 GB 10 krpm disks (they need arms, not bytes)
TPC systems: What’s alpha (= MB/MIPS)? • Hard to say: • Intel: 32-bit addressing (= 4 GB limit), known CPI. • IBM, HP, Sun: 64 GB limit, unknown CPI. • Look at both, guess CPI for IBM, HP, Sun • Alpha is between 1 and 6
Amdahl’s Balance Laws Revised • Laws right, just need “interpretation” (imagination?) • Balanced System Law: a system needs 8 MIPS/MBps of IO, but the instruction rate must be measured on the workload. • Sequential workloads have low CPI (clocks per instruction), • random workloads tend to have higher CPI. • Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue. • One random IO per 50k instructions. • Sequential IOs are larger: one sequential IO per 200k instructions
PAP vs RAP • Peak Advertised Performance vs Real Application Performance [Figure: the standard architecture annotated from application and file system down to disks, with PAP vs RAP at each level: CPU 4 x 550 MHz = 2 Bips advertised, but 1-3 CPI = 170-550 mips delivered; System Bus 1600 MBps vs 500 MBps; PCI 133 MBps vs 90 MBps; SCSI 160 MBps vs 90 MBps; Disks 66 MBps vs 25 MBps]
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Ubiquitous 10 GBps SANs in 5 years • 1 Gbps Ethernet is reality now. • Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,… • 10 Gbps x4 WDM deployed now (OC192) • 3 Tbps WDM working in lab • In 5 years, expect 10x. Wow!! [Figure: today's link speeds span 5 MBps to 1 GBps: 1 GBps, 120 MBps (1 Gbps Ethernet), 80 MBps, 40 MBps, 20 MBps, 5 MBps]
Networking • WANs are getting faster than LANs. G8 = OC192 = 8 Gbps is “standard” • Link bandwidth improves 4x per 3 years • Speed of light (60 ms round trip in US) • Software stacks have always been the problem: Time = SenderCPU + ReceiverCPU + Bytes/Bandwidth. The two CPU terms have been the problem.
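The slide's cost model in runnable form (a sketch; the 25 ms per-side software overhead is an illustrative assumption):

```python
def transfer_time(nbytes, bandwidth_Bps, sender_cpu_s, receiver_cpu_s):
    """Slide's model: Time = SenderCPU + ReceiverCPU + bytes/bandwidth."""
    return sender_cpu_s + receiver_cpu_s + nbytes / bandwidth_Bps

# 1 MB over a 100 MBps link: the wire costs only 10 ms,
# but per-message software overhead on each end can dominate.
print(transfer_time(1e6, 100e6, 0.025, 0.025))  # 0.06 s total, 0.05 s of it CPU
```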
The Promise of SAN/VIA: 10x in 2 years http://www.ViArch.org/ • Yesterday: • 10 MBps (100 Mbps Ethernet) • ~20 MBps tcp/ip saturates 2 cpus • round-trip latency ~250 µs • Now: • Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,… • Fast user-level communication • tcp/ip ~100 MBps at 10% cpu • round-trip latency is 15 µs • 1.6 Gbps demoed on a WAN
How much does wire-time cost? ($/MB and time to send a MB)

| Link | Cost/MB | Time/MB |
|---|---|---|
| Gbps Ethernet | 0.2 µ$ | 10 ms |
| 100 Mbps Ethernet | 0.3 µ$ | 100 ms |
| OC12 (650 Mbps) | 0.003$ | 20 ms |
| DSL | 0.0006$ | 25 sec |
| POTS | 0.002$ | 200 sec |
| Wireless | 0.80$ | 500 sec |
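Each row follows from the link's bandwidth and its amortized monthly cost; a sketch (the 70$/month, 300 kbps DSL line is an illustrative assumption):

```python
def dollars_per_mb(link_cost_per_month, bandwidth_bps, utilization=1.0):
    """Cost to ship 1 MB, amortizing the link's monthly cost over its traffic."""
    mb_per_month = bandwidth_bps / 8 / 1e6 * 30 * 24 * 3600 * utilization
    return link_cost_per_month / mb_per_month

def seconds_per_mb(bandwidth_bps):
    """Wire time to push 1 MB (8e6 bits) through the link."""
    return 8e6 / bandwidth_bps

# Illustrative DSL line: 70$/month at 300 kbps, fully utilized.
print(f"{dollars_per_mb(70, 300e3):.4f} $/MB, {seconds_per_mb(300e3):.0f} s/MB")
```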
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
The Five Minute Rule • Trade DRAM for disk accesses • Cost of an access: DriveCost / AccessesPerSecond • Cost of a DRAM page: RAM_$_per_MB / PagesPerMB • Break-even has two terms: a technology term and an economic term • Page sizes grew to compensate for the changing ratios. • Now at 5 minutes for random IO, 10 seconds for sequential
The 5 Minute Rule Derived • T = time between references to the page • Keeping the page in RAM costs: RAM_$_per_MB / PagesPerMB • The disk access it saves, amortized over T, costs: (DiskPrice / AccessesPerSecond) / T • Breakeven: RAM_$_per_MB / PagesPerMB = (DiskPrice / AccessesPerSecond) / T • Solving for T: T = (DiskPrice × PagesPerMB) / (RAM_$_per_MB × AccessesPerSecond)
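Plugging the formula into code (a sketch; the 400$ drive at 100 accesses/sec, 3$/MB DRAM, and 8 KB pages are illustrative circa-2000 numbers from earlier slides, and they land in the same few-minute ballpark as the rule):

```python
def breakeven_seconds(disk_price, accesses_per_sec, ram_dollars_per_mb, page_kb):
    """T = DiskPrice x PagesPerMB / (RAM_$_per_MB x AccessesPerSecond)."""
    pages_per_mb = 1024 / page_kb
    return disk_price * pages_per_mb / (ram_dollars_per_mb * accesses_per_sec)

t = breakeven_seconds(disk_price=400, accesses_per_sec=100,
                      ram_dollars_per_mb=3, page_kb=8)
print(f"cache pages referenced more often than every {t:.0f} s (~{t/60:.0f} min)")
```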
Plugging in the Numbers • The trend is toward longer breakeven times, because disk $ is not changing much while RAM $ declines 100x/decade • Hence the 5 minute (random) and 10 second (sequential) rules
When to Cache Web Pages. • Caching saves user time • Caching saves wire time • Caching costs storage • Caching only works sometimes: • New pages are a miss • Stale pages are a miss
The 10 Instruction Rule • Spend up to ~10 instructions per second to save 1 byte • Cost of an instruction: I = ProcessorCost / (MIPS × LifeTime) • Cost of a byte: B = RAM_$_per_byte / LifeTime • Breakeven: N × I = B, so N = B/I = (RAM_$_per_byte × MIPS) / ProcessorCost • ~ (3E-6 × 5E8)/500 = 3 ins/byte for Intel • ~ (3E-6 × 3E8)/10 = 10 ins/byte for ARM
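The breakeven drops straight out of the two cost definitions (the lifetime term cancels); a sketch using the slide's Intel numbers:

```python
def breakeven_ins_per_byte(ram_dollars_per_byte, ips, processor_cost):
    """N = B/I = RAM_$_per_byte x instructions-per-second / ProcessorCost.
    Processor and RAM lifetimes cancel out of the ratio."""
    return ram_dollars_per_byte * ips / processor_cost

# Slide's Intel example: 3E-6 $/byte DRAM, 5E8 ips, 500$ processor.
print(breakeven_ins_per_byte(3e-6, 5e8, 500))  # -> 3.0 ins/byte
```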
Web Page Caching Saves People Time • Assume people cost 20$/hour (or 0.2$/hr ???) • Assume 20% hit in browser, 40% in proxy • Assume 3 second server time • Caching saves 28$/year to 150$/year of people time (or 0.28$ to 1.5$/year at the low valuation).
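A sketch of the arithmetic behind the yearly figures (the 30 pages/day browsing volume is an illustrative assumption; the hit rates, server time, and hourly valuations are the slide's):

```python
def yearly_savings(pages_per_day, hit_rate, secs_saved_per_hit, dollars_per_hour):
    """Dollar value of the time caching saves one user per year."""
    hits_per_year = pages_per_day * 365 * hit_rate
    hours_saved = hits_per_year * secs_saved_per_hit / 3600
    return hours_saved * dollars_per_hour

# Assumed 30 pages/day; 60% combined hit rate (20% browser + 40% proxy),
# 3 s of server time saved per hit.
for rate in (20, 0.2):  # the slide's two valuations of people time, $/hour
    print(f"~{yearly_savings(30, 0.6, 3, rate):.2f} $/year at {rate}$/hr")
```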