This presentation discusses the rules of thumb in data engineering, including Moore's Law, storage rules, networking rules, caching rules, and technology ratios. It also explores the need for increased storage capacity, the consequences of Moore's Law, and the storage hierarchy.
Rules of Thumb in Data Engineering Jim Gray UC Santa Cruz 7 May 2002 Gray@Microsoft.com, http://research.Microsoft.com/~Gray/Talks/
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Meta-Message: Technology Ratios Matter • Price and Performance change. • If everything changes in the same way, then nothing really changes. • If some things get much cheaper/faster than others, then that is real change. • Some things are not changing much: • Cost of people • Speed of light • … • And some things are changing a LOT
Trends: Moore’s Law • Performance/Price doubles every 18 months • 100x per decade • Progress in next 18 months = ALL previous progress • New storage = sum of all old storage (ever) • New processing = sum of all old processing • E. coli doubles every 20 minutes!
Trends: ops/s/$ Had Three Growth Phases 1890-1945 Mechanical Relay 7-year doubling 1945-1985 Tube, transistor,.. 2.3 year doubling 1985-2000 Microprocessor 1.0 year doubling
So: a problem • Suppose you have a ten-year compute job for the world’s fastest supercomputer. What should you do? • Commit 250M$ now? • Or program for 9 years? Software speedup: 2^6 = 64x; Moore’s law speedup: 2^6 = 64x; so ~4,000x speedup: spend 1M$ (not 250M$) on hardware; runs in 2 weeks, not 10 years. • Homework problem: What is the optimum strategy?
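The homework problem can be sketched numerically. A minimal model, under the slide's assumptions: the job needs 10 "today-years" of compute, and hardware speed doubles every 18 months, so waiting W years before buying shrinks the run to 10 / 2^(W/1.5) years. The grid search below is illustrative, not the talk's own solution.

```python
# Homework sketch: when should you buy the machine? Assumes a 10-year
# job (at today's speed) and Moore's-law doubling every 1.5 years.
def total_time(wait_years, job_years=10.0, doubling=1.5):
    """Total years until completion: wait, then run on faster hardware."""
    speedup = 2 ** (wait_years / doubling)
    return wait_years + job_years / speedup

# Scan candidate waiting periods (0 to 20 years in 0.1-year steps).
best = min((total_time(w / 10), w / 10) for w in range(0, 200))
print(f"wait {best[1]:.1f} years, finish in {best[0]:.2f} years")
```

The optimum under these assumptions is to wait roughly 3.3 years, finishing in about 5.5 years total, far better than either starting now (10 years) or waiting the full 9.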
Storage capacity beating Moore’s law 1 k$/TB today (raw disk) 100$/TB by end of 2007
Trends: Magnetic Storage Densities • Amazing progress • Ratios have changed • Improvements: Capacity 60%/y, Bandwidth 40%/y, Access time 16%/y
Trends: Density Limits • The end is near! • Products: 23 Gbpsi; Lab: 50 Gbpsi; “limit”: 60 Gbpsi • But the limit keeps rising, and there are alternatives: NEMS, fluorescent, holographic, DNA storage • [Figure: bit density (b/µm² and Gb/in²) vs time, 1990-2008, from CD and DVD at the wavelength limit up toward the superparamagnetic limit; adapted from Franco Vitaliano, “The NEW new media: the growing attraction of nonmagnetic storage”, Data Storage, Feb 2000, pp 21-32, www.datastorage.com]
Trends: promises NEMS (Nano Electro Mechanical Systems) (http://www.nanochip.com/), also Cornell, IBM, CMU,… • 250 Gbpsi by using a tunneling electron microscope • Disk replacement • Capacity: 180 GB now, 1.4 TB in 2 years • Transfer rate: 100 MB/sec read and write • Latency: 0.5 msec • Power: 23 W active, 0.05 W standby • 10k$/TB now, 2k$/TB in 2004
Consequence of Moore’s law: Need an address bit every 18 months. • Moore’s law gives you 2x more in 18 months. • RAM: Today we have 10 MB to 100 GB machines (24-36 bits of addressing); in 9 years we will need 6 more bits: 30-42 bit addressing (4 TB RAM). • Disks: Today we have 10 GB to 100 TB file systems/DBs (33-47 bit file addresses); in 9 years, we will need 6 more bits: 40-53 bit file addresses (100 PB files).
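The address-bit arithmetic above is easy to check: doubling capacity every 18 months means one extra address bit per 1.5 years, so 9 years costs 6 bits. A small sketch (the 100 GB starting point is the slide's upper RAM figure):

```python
import math

# One more address bit is needed every doubling period (1.5 years
# under Moore's law).
def address_bits(bytes_today, years_out, doubling_years=1.5):
    future_bytes = bytes_today * 2 ** (years_out / doubling_years)
    return math.ceil(math.log2(future_bytes))

print(address_bits(100 * 2**30, 0))   # bits to address 100 GB today
print(address_bits(100 * 2**30, 9))   # 6 more bits 9 years out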
Architecture could change this • 1-level store: • System/38, AS/400 have 1-level store. • Never re-uses an address. • Needs 96-bit addressing today. • NUMAs and Clusters • Willing to buy a 100 M$ computer? • Then add 6 more address bits. • Only 1-level store pushes us beyond 64 bits • Still, these are “logical” addresses; 64-bit physical will last many years
Trends: Gilder’s Law: 3x bandwidth/year for 25 more years • Today: 40 Gbps per channel (λ); 12 channels per fiber (WDM): 500 Gbps; 32 fibers/bundle = 16 Tbps/bundle • In lab: 3 Tbps/fiber (400x WDM) • In theory: 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth • Aggregate bandwidth doubles every 8 months!
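The bundle figure follows directly from the per-channel numbers on the slide; a quick check (40 Gbps per λ, 12 λ per fiber, 32 fibers per bundle):

```python
# Fiber bundle bandwidth from the slide's per-channel numbers.
per_fiber_gbps = 40 * 12                    # 480 Gbps (~500 on the slide)
bundle_tbps = per_fiber_gbps * 32 / 1000    # 32 fibers per bundle
print(f"{bundle_tbps:.2f} Tbps per bundle")
```

That gives 15.36 Tbps, which the slide rounds to 16 Tbps/bundle.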
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
How much storage do we need? Everything recorded! • Soon everything can be recorded and indexed • Most bytes will never be seen by humans • Data summarization, trend detection, and anomaly detection are key technologies • [Figure: storage scale from kilo through mega, giga, tera, peta, exa, zetta, to yotta, with landmarks: a book, a photo, a movie, all LoC books (words), all books multimedia, everything recorded] • See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
Storage Latency: How Far Away is the Data? (in clocks, with distances to the same scale) • Registers: 1 (my head, 1 min) • On-chip cache: 2 (this room) • On-board cache: 10 (this campus, 10 min) • Memory: 100 (Springfield, 1.5 hr) • Disk: 10^6 (Pluto, 2 years) • Tape/optical robot: 10^9 (Andromeda, 2,000 years)
Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs • [Figure: two log-log charts against access time (10^-9 to 10^3 seconds): price ($/MB, 10^-6 to 10^2) and typical system capacity (bytes, 10^3 to 10^15), spanning cache, main memory, secondary (disc), online tape, nearline tape, and offline tape]
Disks: Today • Disk is 18 GB to 180 GB; 10-50 MBps; 5k-15k rpm (6 ms-2 ms rotational latency); 12 ms-7 ms seek; 1k$/IDE-TB, 6k$/SCSI-TB • For shared disks, most time is spent waiting in queue for access to the arm/controller (wait, then seek, rotate, transfer)
The street price of a raw disk TB: about 1k$/TB (12/1999 through 4/2002)
Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 512MB and 200$/GB • Disk: GB and $/GB: today at 80GB and 7k$/TB • Tape: TB and $/TB: today at 40GB and 7k$/TB (nearline) • Access time (latency) • RAM: 1…100 ns • Disk: 5…15 ms • Tape: 30 second pick, 30 second position • Transfer rate • RAM: 1-10 GB/s • Disk: 10-50 MB/s - - -Arrays can go to 10GB/s • Tape: 5-15 MB/s - - - Arrays can go to 1GB/s
New Storage Metrics: Kaps, Maps, SCAN • Kaps: How many kilobyte objects served per second • The file server, transaction processing metric • This is the OLD metric. • Maps: How many megabyte objects served per sec • The Multi-Media metric • SCAN: How long to scan all the data • the data mining and utility metric • And • Kaps/$, Maps/$, TBscan/$
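These three metrics can be estimated for any device from its access time, bandwidth, and capacity. A sketch using assumed 2002-era disk figures (8 ms per random access, 40 MB/s, 160 GB, consistent with the numbers elsewhere in the deck):

```python
# Kaps / Maps / SCAN estimates for a single disk (assumed specs).
def kaps(access_ms):
    # 1 KB objects are access-time bound: one object per seek + rotate.
    return 1000.0 / access_ms

def maps(access_ms, mbps):
    # 1 MB objects pay one access plus ~1000/mbps ms of transfer.
    return 1000.0 / (access_ms + 1000.0 / mbps)

def scan_hours(capacity_gb, mbps):
    # Full sequential scan of the whole device.
    return capacity_gb * 1000.0 / mbps / 3600.0

print(f"{kaps(8):.0f} Kaps, {maps(8, 40):.1f} Maps, "
      f"{scan_hours(160, 40):.1f} h scan")
```

This yields roughly 125 Kaps, 30 Maps, and a scan of about 1.1 hours, matching the deck's "1 hour scan" for a 160 GB disk.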
For the Record (good 2002 devices packaged in a system: http://www.tpc.org/results/individual_results/Compaq/compaq.5500.99050701.es.pdf) • Tape slice is 8 TB with 1 DLT reader at 6 MBps per 100 tapes (x100).
For the Record (good 2002 devices packaged in a system: http://www.tpc.org/results/individual_results/Compaq/compaq.5500.99050701.es.pdf) • Tape is 1 TB with 4 DLT readers at 5 MBps each.
Disk Changes • Disks got cheaper: 20k$ -> 200$ • $/Kaps etc improved 100x (Moore’s law!) (or even 500x) • One-time event (went from mainframe prices to PC prices) • Disks got cooler (50x in decade) • 1990: 1 Kaps per 20 MB • 2002: 1 Kaps per 1,000 MB • Disk scans take longer (10x per decade) • 1990 disk ~ 1GB and 50Kaps and 5 minute scan • 2002 disk ~160GB and 160Kaps and 1 hour scan • So.. Backup/restore takes a long time (too long)
Storage Ratios Changed • 10x better access time • 10x more bandwidth • 100x more capacity • Data 25x cooler (1 Kaps/20 MB vs 1 Kaps/GB) • 4,000x lower media price • 20x to 100x lower disk price • Scan takes 10x longer (3 min vs 1 hr) • RAM/disk media price ratio changed: 1970-1990 100:1; 1990-1995 10:1; 1995-1997 50:1; today ~200:1 (1$/GB disk vs 200$/GB RAM)
More Kaps and Kaps/$ but… • Disk accesses got much less expensive: better disks, cheaper disks! • But disk arms are expensive: the scarce resource • 1 hour scan (100 GB at 30 MB/s) vs 5 minutes in 1990
Data on Disk Can Move to RAM in 10 years 100:1 10 years
The “Absurd” 10x (= 4 year) Disk • 1 TB, 100 MB/s, 200 Kaps • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape!
Disk vs Tape Guestimates • Disk: 160 GB, 40 MBps, 4 ms seek time, 2 ms rotate latency, 1$/GB for drive + 1$/GB for ctlrs/cabinet, 60 TB/rack, 1 hour scan • Tape: 80 GB, 10 MBps, 10 sec pick time, 30-120 second seek time, 2$/GB for media + 5$/GB for drive+library, 20 TB/rack, 1 week scan • (CERN: 200 TB, 3480 tapes; 2 columns = 50 GB; rack = 1 TB = 8 drives) • The price advantage of tape is gone, and the performance advantage of disk is growing • At 10k$/TB, disk is competitive with nearline tape.
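The scan guestimates follow from the capacity and bandwidth figures. A sketch of the rack-level comparison: disks in a rack can all scan in parallel, while a tape rack is throttled by its few shared drives (4 drives is my assumption here, not a figure from the slide):

```python
# Scan times from the slide's guesstimates.
def scan_hours(megabytes, mbps):
    return megabytes / mbps / 3600.0

disk_hours = scan_hours(160_000, 40)        # one 160 GB disk at 40 MBps
# A 20 TB tape rack funneled through 4 drives at 10 MBps each
# (pick and seek times ignored, so this is a lower bound).
tape_rack_days = scan_hours(20_000_000, 4 * 10) / 24
print(f"disk: {disk_hours:.1f} h, tape rack: {tape_rack_days:.1f} days")
```

Roughly 1.1 hours per disk versus nearly 6 days per tape rack before any pick or seek overhead, consistent with the slide's "1 hour scan" versus "1 week scan".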
Caveat: Tape vendors may innovate • Sony DTF-2 is 100 GB, 24 MBps, 30 second pick time • So, 2x better • Prices not clear • http://bpgprod.sel.sony.com/DTF/seismic/dtf2.html
It’s Hard to Archive a Petabyte: it takes a LONG time to restore it. • At 1 GBps it takes 12 days! • Store it in two (or more) places online (on disk?): a geo-plex • Scrub it continuously (look for errors) • On failure: use the other copy until the failure is repaired; refresh the lost copy from the safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
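The 12-day figure is simple arithmetic on the stated pipe:

```python
# Restoring 1 PB through a 1 GBps pipe.
petabyte_bytes = 10 ** 15
rate_bytes_per_s = 10 ** 9
restore_days = petabyte_bytes / rate_bytes_per_s / 86_400
print(f"{restore_days:.1f} days")
```

That is about 11.6 days, which the slide rounds to 12, hence the advice to geo-plex rather than rely on restore.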
Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2002 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is >5k$ today, 500$ in a few years. • Admin cost >> storage cost !!!! • Challenge: • Automate ALL storage admin tasks
How to cool disk data: • Cache data in main memory • See 5 minute rule later in presentation • Fewer-larger transfers • Larger pages (512-> 8KB -> 256KB) • Sequential rather than random access • Random 8KB IO is 1.5 MBps • Sequential IO is 30 MBps (20:1 ratio is growing) • Raid1 (mirroring) rather than Raid5 (parity).
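The random-vs-sequential gap above can be derived from access time and media rate. A sketch, assuming ~5 ms per random access and a 30 MB/s media rate (consistent with the slide's 1.5 MB/s figure for random 8 KB IOs):

```python
# Effective throughput of fixed-size random IOs on a disk
# (assumed: 5 ms seek+rotate per access, 30 MB/s media rate).
def effective_mbps(page_kb, access_ms=5.0, media_mbps=30.0):
    transfer_ms = page_kb / 1024.0 / media_mbps * 1000.0
    return (page_kb / 1024.0) / ((access_ms + transfer_ms) / 1000.0)

print(f"{effective_mbps(8):.1f} MB/s for random 8 KB IOs")
print(f"{effective_mbps(256):.1f} MB/s for 256 KB transfers")
```

Random 8 KB IOs deliver about 1.5 MB/s while 256 KB transfers approach 19 MB/s, which is why fewer-larger transfers "cool" disk data.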
Stripes, Mirrors, Parity (RAID 0, 1, 5) • RAID 0: Stripes (bandwidth); disks hold blocks 0,3,6,… / 1,4,7,… / 2,5,8,… • RAID 1: Mirrors, Shadows,… (fault tolerance; reads faster, writes 2x slower); both disks hold 0,1,2,… • RAID 5: Parity (fault tolerance; reads faster; writes 4x or 6x slower); disks interleave data and parity: 0,2,P2,… / 1,P1,4,… / P0,3,5,…
RAID 5 (6 disks, 1 volume): 675 reads/sec, 210 writes/sec; a write is 4 logical IOs, 2 seeks + 1.7 rotates; SAVES SPACE; performance degrades on failure. • RAID 1 (6 disks, 3 pairs): 750 reads/sec, 300 writes/sec; a write is 2 logical IOs, 2 seeks + 0.7 rotate; SAVES ARMS; performance improves on failure. • RAID 10 (stripes of mirrors) wins: “wastes space, saves arms”
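The 4-vs-2 IO write penalty drives these numbers. A naive sketch, assuming ~100 random IOs/s per disk arm (my assumption, not the slide's):

```python
# Ballpark RAID write rates for 6 disks. A RAID 5 small write costs
# 4 physical IOs (read old data + old parity, write new data + new
# parity); a RAID 1 write costs 2 (one per mirror).
disk_ios_per_s = 100.0   # assumed per-arm random IO rate
disks = 6

raid5_writes = disks * disk_ios_per_s / 4
raid1_writes = disks * disk_ios_per_s / 2
print(raid5_writes, raid1_writes)
```

This gives 150 vs 300 writes/sec. The slide's RAID 5 figure (210) is somewhat better than the naive 150 because the four IOs share positioning, but the roughly 2:1 mirroring advantage on writes is the point.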
Summarizing storage rules of thumb (1) • Moore’s law: 4x every 3 years, 100x more per decade • Implies 2 bits of addressing every 3 years • Storage capacities increase 100x/decade • Storage costs drop 100x per decade • Storage throughput increases 10x/decade • Data cools 10x/decade • Disk page sizes increase 5x per decade
Summarizing storage rules of thumb (2) • RAM:disk and disk:tape cost ratios are 100:1 and 1:1 • So, in 10 years, disk data can move to RAM, since prices decline 100x per decade • A person can administer a million dollars of disk storage: that is 1 TB - 100 TB today • Disks are replacing tapes as backup devices. You can’t backup/restore a petabyte quickly, so geoplex it. • Mirroring rather than parity, to save disk arms
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Standard Architecture (today) • [Figure: CPU and memory on the system bus, bridging to PCI Bus 1 and PCI Bus 2]
Amdahl’s Balance Laws • Parallelism law: If a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S. • Balanced system law: A system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps. • Memory law: α = 1: the MB/MIPS ratio (called alpha) in a balanced system is 1. • IO law: Programs do one IO per 50,000 instructions.
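The parallelism law is pure algebra, as the next slide notes. A one-liner makes the cap concrete:

```python
# Amdahl's parallelism law: speedup is capped by the serial fraction.
def max_speedup(serial, parallel):
    return (serial + parallel) / serial

# A job with 1 unit of serial work and 9 units of parallel work can
# never run more than 10x faster, however many processors are added.
print(max_speedup(1, 9))
```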
Amdahl’s Laws Valid 35 Years Later? • Parallelism law is algebra: so SURE! • Balanced system laws? • Look at tpc results (tpcC, tpcH) at http://www.tpc.org/ • Some imagination needed: • What’s an instruction (CPI varies from 1-3)? • RISC, CISC, VLIW, … clocks per instruction,… • What’s an I/O?
TPC systems

                    MHz/cpu  CPI  mips  KB/IO  IO/s/disk  Disks  Disks/cpu  MB/s/cpu  Ins/Byte IO
Amdahl                  1     1     1     6       --        --       --        --          8
TPC-C (random)        550   2.1   262     8      100       397       50        40          7
TPC-H (sequential)    550   1.2   458    64      100       176       22       141          3

• Normalize for CPI (clocks per instruction) • TPC-C has about 7 ins/byte of IO • TPC-H has 3 ins/byte of IO • TPC-H needs ½ as many disks, sequential vs random • Both use 9 GB 10 krpm disks (they need arms, not bytes)
TPC systems: What’s alpha (= MB/MIPS)? Hard to say: • Intel has 32-bit addressing (= 4 GB limit) and known CPI. • IBM, HP, Sun have a 64 GB limit and unknown CPI. • Look at both, guess CPI for IBM, HP, Sun • Alpha is between 1 and 6
Instructions per IO? • We know: 8 MIPS per MBps of IO • So an 8 KB page costs 64 K instructions • And a 64 KB page costs 512 K instructions • But sequential IO has fewer instructions/byte (3 vs 7 in TPC-H vs TPC-C) • So a sequential 64 KB page costs ~200 K instructions
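The arithmetic above is just page size times instructions per byte (8 for random/TPC-C-like work, 3 for sequential/TPC-H-like work):

```python
# Instructions consumed per IO, from the 8-MIPS-per-MBps balance rule.
def instructions_per_io(page_kb, ins_per_byte=8):
    return page_kb * 1024 * ins_per_byte

print(instructions_per_io(8))        # 8 KB page: 65,536 (~64 K)
print(instructions_per_io(64))       # 64 KB page: 524,288 (~512 K)
print(instructions_per_io(64, 3))    # sequential at 3 ins/byte: ~200 K
```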
Amdahl’s Balance Laws Revised • Laws right, just need “interpretation” (imagination?) • Balanced System Law:A system needs 8 MIPS/MBpsIO, but instruction rate must be measured on the workload. • Sequential workloads have low CPI (clocks per instruction), • random workloads tend to have higher CPI. • Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue. • One Random IO per 50k instructions. • Sequential IOs are larger One sequential IO per 200k instructions
PAP vs RAP (a y2k perspective) • Peak Advertised Performance vs Real Application Performance • [Figure: the path from application and file system down to disks. CPU: 2000 MHz x 4 = 8 Bips advertised, 1-6 CPI = 500..2,000 mips real. System bus: 1600 MBps vs 500 MBps. PCI: 133 MBps vs 90 MBps. SCSI: 160 MBps vs 90 MBps. Disks: 66 MBps vs 40 MBps.]
Outline • Moore’s Law and consequences • Storage rules of thumb • Balanced systems rules revisited • Networking rules of thumb • Caching rules of thumb
Standard IO (Infiniband™) next year? • Probably • Replace PCI with something better; will still need a mezzanine bus standard • Multiple serial links directly from the processor • Fast (10 GBps/link) for a few meters • System Area Networks (SANs) ubiquitous (VIA morphs to Infiniband?)
Ubiquitous 10 GBps SANs in 5 years • 1 Gbps Ethernet is a reality now • Also Fibre Channel, Myrinet, GigaNet, ServerNet, ATM,… • 10 Gbps x4 WDM deployed now (OC192) • 3 Tbps WDM working in lab • In 5 years, expect 10x, wow!! • [Figure: delivered link bandwidths, from the 1 GBps target down through 120 MBps (1 Gbps Ethernet), 80 MBps, 40 MBps, 20 MBps, and 5 MBps links]