Designing for 20TB Disk Drives and “enterprise storage” Jim Gray, Microsoft Research
Disk Evolution (margin graphic: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) • Capacity: 100x in 10 years; 1 TB 3.5” drive in 2005; 20 TB? in 2012?! • System on a chip • High-speed SAN • Disk replacing tape • Disk is a supercomputer!
Disks are becoming computers • Smart drives • Camera with micro-drive • Replay / TiVo / Ultimate TV • Phone with micro-drive • MP3 players • Tablet • Xbox • Many more… (Diagram: Applications (Web, DBMS, Files) / OS / Disk Ctlr + 1 GHz CPU + 1 GB RAM / Comm: Infiniband, Ethernet, radio…)
Intermediate Step: Shared Logic • Brick with 8-12 disk drives • 200 mips/arm (or more) • 2 x Gbps Ethernet • General-purpose OS • 10 k$/TB to 100 k$/TB • Shared: sheet metal, power, support/config, security, network ports • These bricks could run applications (e.g. SQL or Mail or …) (Examples: Snap ~1 TB 12x80 GB NAS; NetApp ~0.5 TB 8x70 GB NAS; Maxtor ~2 TB 12x160 GB NAS; IBM TotalStorage ~360 GB 10x36 GB NAS)
Hardware • Homogeneous machines lead to quick response through reallocation • HP desktop machines: 320 MB RAM, 3U high, 4 x 100 GB IDE drives • $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB • 3 weeks from ordering to operational Slide courtesy of Brewster Kahle, @ Archive.org
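The per-TB figures above follow from the box configuration. A minimal back-of-envelope check, assuming a per-box street price of roughly $1,600 (implied by the $4k/TB figure, not stated on the slide):

```python
# Back-of-envelope check of the Archive.org numbers above.
# Assumption: a per-box price of ~$1,600 is inferred from the quoted $4k/TB.
box_disks = 4
disk_gb = 100
box_tb = box_disks * disk_gb / 1000          # 0.4 TB per box
boxes_per_tb = 1 / box_tb                    # 2.5 boxes (= processors) per TB
ram_gb_per_tb = 0.320 * boxes_per_tb         # ~0.8 GB, i.e. "about 1 GB RAM/TB"
price_per_tb = 1600 * boxes_per_tb           # $4,000/TB if a box is ~$1,600

print(f"{boxes_per_tb:.1f} boxes/TB, {ram_gb_per_tb:.1f} GB RAM/TB, ${price_per_tb:,.0f}/TB")
```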
Disk as Tape • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive • Using removable hard drives to replace tape’s function has been successful • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. • Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster Kahle, @ Archive.org
Disk As Tape: What format? • Today I send NTFS/SQL disks. • But that is not a good format for Linux. • Solution: Ship NFS/CIFS/ODBC servers (not disks) • Plug “disk” into LAN. • DHCP then file or DB server via standard interface. • Web Service in long term
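A minimal client-side sketch of the “plug it into the LAN” idea: once the brick boots and takes a DHCP lease, a consumer talks to it through a standard interface rather than mounting a foreign file system. The hostname brick01, database name, and table are hypothetical, and pyodbc stands in for whichever ODBC driver the brick would ship with.

```python
# Hypothetical sketch: query a shipped "disk" brick over ODBC once it is on the LAN.
# The server name, database, driver string, and table below are illustrative assumptions.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=brick01;DATABASE=archive;Trusted_Connection=yes"
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM shipped_objects")  # hypothetical table on the brick
print(cursor.fetchone()[0])
conn.close()
```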
State is Expensive • Stateless clones are easy to manage • App servers are the middle tier • Cost goes to zero with Moore’s law • One admin per 1,000 clones • Good story about scale-out • Stateful servers are expensive to manage • 1 TB to 100 TB per admin • Storage cost is going to zero (2 k$ to 200 k$) • The cost of storage is the management cost
Databases (== SQL) • VLDB survey (Winter Corp). • 10 TB to 100TB DBs. • Size doubling yearly • Riding disk Moore’s law • 10,000 disks at 18GB is 100TB cooked. • Mostly DSS and data warehouses. • Some media managers
Interesting facts • No DBMSs beyond 100 TB • Most bytes are in files • The web is file centric • eMail is file centric • Science (and batch) is file centric • But… • SQL performance is better than CIFS/NFS • CISC vs RISC
BaBar: the biggest DB • 500 TB • Uses Objectivity™ • SLAC events • Linux cluster scans the DB looking for patterns
300 TB (cooked): Hotmail / Yahoo • Clone front ends: ~10,000 @ Hotmail • Application servers: ~100 @ Hotmail • Get mailbox • Get/put mail • Disk bound • ~30,000 disks • ~20 admins
AOL (msn) (1 PB?) • 10 B transactions per day (10% of that) • Huge storage • Huge traffic • Lots of eye candy • DB used for security/accounting • GUESS: AOL is a petabyte • (40 M x 10 MB = 400 x 10^12 bytes)
Google: 1.5 PB as of last spring • 8,000 no-name PCs • Each 1/3 U, 2 x 80 GB disks, 2 CPUs, 256 MB RAM • 1.4 PB online • 2 TB RAM online • 8 TeraOps • Slice price is 1 k$, so 8 M$ • 15 admins (!) (== 1 per 100 TB)
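These are round numbers; a quick back-of-envelope check of how they hang together:

```python
# Back-of-envelope check of the Google slide's round numbers (circa 2001).
pcs = 8000
disk_pb = pcs * 2 * 80 / 1e6          # 2 x 80 GB disks per PC  -> ~1.3 PB online
ram_tb = pcs * 256 / 1e6              # 256 MB RAM per PC       -> ~2 TB RAM online
cost_m = pcs * 1_000 / 1e6            # ~$1k per slice          -> ~$8M total
tb_per_admin = pcs * 2 * 80 / 1000 / 15   # ~85 TB, i.e. roughly 1 admin per 100 TB
print(disk_pb, ram_tb, cost_m, round(tb_per_admin))
```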
Astronomy • I’ve been trying to apply DB to astronomy • Today they are at 10TB per data set • Heading for Petabytes • Using Objectivity • Trying SQL (talk to me offline)
Scale Out: Buy Computing by the Slice. 709,202 tpmC == ~1 billion transactions/day • Slice: 8 CPUs, 8 GB, 100 disks (= 1.8 TB); 20 ktpmC per slice, ~300 k$/slice • Clients and 4 DTC nodes not shown
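The headline equivalence is simple arithmetic; the slice count below is derived from the quoted per-slice rate, not stated on the slide:

```python
# tpmC counts new-order transactions per minute, so scale to a day.
tpmC = 709_202
per_day = tpmC * 60 * 24            # ~1.02 billion transactions/day
slices = tpmC / 20_000              # ~35 slices at ~20 ktpmC each (derived)
cost_m = slices * 300_000 / 1e6     # ~$10.6M for the database tier (rough)
print(f"{per_day:,.0f} tx/day from ~{slices:.0f} slices (~${cost_m:.0f}M)")
```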
ScaleUp: A Very Big System! • UNISYS Windows 2000 Datacenter Limited Edition • 32 CPUs • 32 GB of RAM • 1,061 disks (15.5 TB) on 24 Fibre Channel • Will be helped by 64-bit addressing
Hardware (rack diagram) • 8 Compaq DL360 “Photon” web servers • 4 Compaq ProLiant 8500 DB servers; one SQL database per rack; each rack contains 4.5 TB • 261 total drives / 13.7 TB total • Fibre SAN switches • Metadata stored on 101 GB of “fast, small disks” (18 x 18.2 GB): SQL\Inst1 • Imagery data stored on 4 x 339 GB of “slow, big disks” (15 x 73.8 GB): SQL\Inst2, SQL\Inst3 • To add 90 x 72.8 GB disks in Feb 2001 to create an 18 TB SAN • Spare
Amdahl’s Balance Laws • parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S. • balanced system law: a system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps. • memory law: the MB/MIPS ratio (called alpha, α) is 1 in a balanced system. • IO law: programs do one IO per 50,000 instructions.
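A minimal sketch of the quantitative laws as stated above; the configuration numbers in the example calls are illustrative, not from the talk:

```python
# Amdahl's balance rules, expressed directly; example values are illustrative.
def max_speedup(serial, parallel):
    """Parallelism law: speedup is bounded by (S + P) / S."""
    return (serial + parallel) / serial

def mips_per_mbps(mips, io_mbps):
    """Balanced-system law: ~8 MIPS of CPU per MB/s of IO (1 bit of IO per instruction)."""
    return mips / io_mbps

def alpha(ram_mb, mips):
    """Memory law: MB of RAM per MIPS; ~1 in a balanced system."""
    return ram_mb / mips

print(max_speedup(serial=1, parallel=9))     # at most 10x, no matter how many CPUs
print(mips_per_mbps(mips=400, io_mbps=50))   # 8.0 -> IO-balanced
print(alpha(ram_mb=512, mips=400))           # 1.28 -> roughly memory-balanced
```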
Amdahl’s Laws Valid 35 Years Later? • Parallelism law is algebra: so SURE! • Balanced system laws? • Look at tpc results (tpcC, tpcH) at http://www.tpc.org/ • Some imagination needed: • What’s an instruction (CPI varies from 1-3)? • RISC, CISC, VLIW, … clocks per instruction,… • What’s an I/O?
TPC systems • Normalize for CPI (clocks per instruction) • TPC-C has about 7 ins/byte of IO • TPC-H has 3 ins/byte of IO • TPC-H needs ½ as many disks, sequential vs random • Both use 9 GB 10 krpm disks (need arms, not bytes)

                     MHz/cpu   CPI   mips   KB/IO   IO/s/disk   Disks   Disks/cpu   MB/s/cpu   Ins/IO byte
Amdahl                     1     1      1       6           -       -           -          -             8
TPC-C (random)           550   2.1    262       8         100     397          50         40             7
TPC-H (sequential)       550   1.2    458      64         100     176          22        141             3
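The derived columns follow from the measured ones; a small sketch of that arithmetic:

```python
# How the table's derived columns follow from the measured ones.
def derived(mhz, cpi, kb_per_io, ios_per_disk, disks_per_cpu):
    mips = mhz / cpi                                   # normalize for CPI
    mb_per_s_per_cpu = disks_per_cpu * ios_per_disk * kb_per_io / 1000
    ins_per_io_byte = mips / mb_per_s_per_cpu          # instructions per byte of IO
    return round(mips), round(mb_per_s_per_cpu), round(ins_per_io_byte)

print(derived(550, 2.1, 8, 100, 50))    # TPC-C (random):     (262, 40, 7)
print(derived(550, 1.2, 64, 100, 22))   # TPC-H (sequential): (458, 141, 3)
```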
TPC systems: What’s alpha (=MB/MIPS)? Hard to say: • Intel 32 bit addressing (= 4GB limit). Known CPI. • IBM, HP, Sun have 64 GB limit. Unknown CPI. • Look at both, guess CPI for IBM, HP, Sun • Alpha is between 1 and 6
Performance (on current SDSS data) • Run times on a 15 k$ COMPAQ server (2 CPUs, 1 GB RAM, 8 disks) • Some take 10 minutes • Some take 1 minute • Median ~22 sec • GHz processors are fast! • (10 mips/IO, 200 ins/byte) • 2.5 M records/s per CPU, ~1,000 IOs per CPU second, ~64 MB of IO per CPU second
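The three per-CPU rates are mutually consistent; a quick derivation of the implied IO and record sizes:

```python
# Consistency check of the per-CPU scan rates quoted above.
recs_per_s = 2_500_000            # ~2.5 M records scanned per CPU second
ios_per_s  = 1_000                # ~1,000 IOs per CPU second
mb_per_s   = 64                   # ~64 MB of IO per CPU second

print(mb_per_s * 1024 / ios_per_s)    # ~64 KB per IO: large, mostly sequential reads
print(mb_per_s * 1e6 / recs_per_s)    # ~26 bytes per record
print(recs_per_s / ios_per_s)         # ~2,500 records per IO
```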
How much storage do we need? • Soon everything can be recorded and indexed • Most bytes will never be seen by humans • Data summarization, trend detection, and anomaly detection are key technologies • See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ (Scale graphic: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!)
Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 512 MB and 200 $/GB • Disk: GB and $/GB: today at 80 GB and 70 k$/TB • Tape: TB and $/TB: today at 40 GB and 10 k$/TB (nearline) • Access time (latency) • RAM: 100 ns • Disk: 15 ms • Tape: 30 second pick, 30 second position • Transfer rate • RAM: 1-10 GB/s • Disk: 10-50 MB/s (arrays can go to 10 GB/s) • Tape: 5-15 MB/s (arrays can go to 1 GB/s)
New Storage Metrics: Kaps, Maps, SCAN • Kaps: How many kilobyte objects served per second • The file server, transaction processing metric • This is the OLD metric. • Maps: How many megabyte objects served per sec • The Multi-Media metric • SCAN: How long to scan all the data • the data mining and utility metric • And • Kaps/$, Maps/$, TBscan/$
More Kaps and Kaps/$ but… (today’s drive: 100 GB, 30 MB/s) • Disk accesses got much less expensive: better disks, cheaper disks! • But disk arms are expensive: the scarce resource • 1 hour scan vs 5 minutes in 1990
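A minimal sketch of the Kaps / Maps / SCAN metrics from the previous slide, applied to the drive above. The 1990 comparison point (roughly a 1 GB drive at ~3 MB/s) and the 15 ms access time are assumptions for illustration:

```python
# Kaps, Maps, and SCAN for a single drive.
def kaps(access_ms, mb_per_s, obj_kb=1):
    """Kilobyte-sized objects served per second (positioning dominates)."""
    return 1000 / (access_ms + obj_kb / 1024 / mb_per_s * 1000)

def maps(access_ms, mb_per_s, obj_mb=1):
    """Megabyte-sized objects served per second (transfer time matters)."""
    return 1000 / (access_ms + obj_mb / mb_per_s * 1000)

def scan_hours(capacity_gb, mb_per_s):
    """SCAN: time to read the whole drive sequentially, in hours."""
    return capacity_gb * 1000 / mb_per_s / 3600

print(scan_hours(100, 30))          # ~0.9 h: the "1 hour scan" today
print(scan_hours(1, 3))             # ~0.09 h: the "~5 minutes" of a 1990-era drive (assumed)
print(kaps(15, 30), maps(15, 30))   # ~66 Kaps and ~21 Maps at a 15 ms access time
```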
Data on Disk Can Move to RAM in 10 Years (graph: RAM vs disk price per byte, ~100:1 ratio, ~10-year lag)
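One way to read the 100:1 and 10-year annotations, assuming the roughly 100x-per-decade price improvement this talk uses elsewhere:

```python
# If $/MB improves ~100x per decade, a 100:1 RAM-vs-disk price gap closes in ~10 years,
# so data that is disk-resident today can be RAM-resident, at similar cost, a decade out.
import math
price_ratio = 100                         # RAM is ~100x more expensive per byte than disk
annual_improvement = 100 ** (1 / 10)      # ~1.58x per year
years = math.log(price_ratio) / math.log(annual_improvement)
print(years)                              # -> 10.0
```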
The “Absurd” 10x (= 4-year) Disk: 1 TB, 100 MB/s, 200 Kaps • 2.5 hr scan time (poor sequential access) • 1 access per second per 5 GB (VERY cold data) • It’s a tape!
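The same scan arithmetic as above, applied to this hypothetical drive:

```python
# Checking the "absurd disk" numbers.
capacity_gb, mb_per_s, accesses_per_s = 1000, 100, 200
print(capacity_gb * 1000 / mb_per_s / 3600)   # ~2.8 h to scan the whole drive
print(capacity_gb / accesses_per_s)           # 5 GB per access/s: very cold data
```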
It’s Hard to Archive a Petabyte: it takes a LONG time to restore it • At 1 GBps it takes 12 days! • Store it in two (or more) places online (on disk?): a geo-plex • Scrub it continuously (look for errors) • On failure: use the other copy until the failure is repaired, refresh the lost copy from the safe copy • Can organize the two copies differently (e.g. one by time, one by space)
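The restore-time arithmetic behind “12 days”:

```python
# A petabyte at 1 GB/s: the reason to keep two live copies instead of restoring.
petabyte_gb = 1_000_000
gb_per_s = 1
print(petabyte_gb / gb_per_s / 86_400)   # ~11.6 days of continuous copying
```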
Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 50k$ today, 5k$ in a few years. • Admin cost >> storage cost !!!! • Challenge: • Automate ALL storage admin tasks
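A toy model of the “admin cost >> storage cost” point; the fully loaded admin cost is an illustrative assumption, not a figure from the talk:

```python
# Storage hardware per admin vs the admin's own yearly cost.
admin_cost_per_year = 100_000            # assumed fully loaded $/year per DataAdmin
tb_per_admin = 5
for cost_per_tb in (10_000, 1_000):      # 5 TB ~ 50 k$ today, ~5 k$ in a few years
    storage_cost = tb_per_admin * cost_per_tb
    ratio = admin_cost_per_year / storage_cost
    print(f"storage ${storage_cost:,} vs admin ${admin_cost_per_year:,}/yr -> {ratio:.0f}x")
```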
How to cool disk data: • Cache data in main memory • See the 5-minute rule later in the presentation • Fewer, larger transfers • Larger pages (512 B -> 8 KB -> 256 KB) • Sequential rather than random access • Random 8 KB IO is 1.5 MBps • Sequential IO is 30 MBps (the 20:1 ratio is growing) • RAID1 (mirroring) rather than RAID5 (parity)
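Why the random/sequential gap is about 20:1 on this hardware; the ~5 ms per-IO positioning cost is an assumption consistent with the numbers above:

```python
# Effective bandwidth of random 8 KB reads vs a sequential stream.
seek_ms = 5.0                    # assumed per-IO positioning (seek + rotate) cost
transfer_mb_s = 30               # sequential media rate from the slide
io_kb = 8
ms_per_io = seek_ms + io_kb / 1024 / transfer_mb_s * 1000
random_mb_s = (1000 / ms_per_io) * io_kb / 1024
print(random_mb_s)               # ~1.5 MB/s random vs 30 MB/s sequential (~20:1)
```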
Data delivery costs 1 $/GB today • Rent for “big” customers: 300 $ per megabit per second per month • Improved 3x in the last 6 years (!) • That translates to ~1 $/GB at each end • You can mail a 160 GB disk for 20 $ • That’s 16x cheaper • If it arrives overnight, that’s ~4 MBps (photo: 3 x 160 GB ≈ ½ TB)
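The arithmetic behind “$1/GB at each end” and the sneakernet comparison; a 30-day month and a ~12-hour overnight delivery are simplifying assumptions:

```python
# Network rent vs mailing a disk.
seconds_per_month = 30 * 86_400
gb_per_mbps_month = 1e6 / 8 * seconds_per_month / 1e9   # ~324 GB moved by 1 Mbps in a month
net_cost_per_gb = 300 / gb_per_mbps_month                # ~$0.93/GB, paid at each end
mail_cost_per_gb = 20 / 160                              # ~$0.13/GB to ship a 160 GB disk
print(net_cost_per_gb, mail_cost_per_gb,
      2 * net_cost_per_gb / mail_cost_per_gb)            # ~15x cheaper, counting both ends
print(160e9 / (12 * 3600) / 1e6)                         # ~3.7 MB/s effective if delivered in ~12 h
```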