This talk discusses the potential future of storage, including the availability of high-capacity RAM chips, the integration of processing and memory into a single chip, the advancement of standard I/O, and the evolution of high-speed storage networks.
Storage: Alternate Futures Jim Gray, Microsoft Research http://Research.Microsoft.com/~Gray/talks IBM Almaden, 1 December 1999
Acknowledgments: Thank You!! • Dave Patterson: convinced me that processors are moving to the devices. • Kim Keeton and Erik Riedel: showed that many useful subtasks can be done by disk-processors, and quantified the execution savings. • Remzi Arpaci-Dusseau: re-validated Amdahl's laws.
Outline • The Surprise-Free Future (5 years) • 500 mips cpus for 10$ • 1 Gb RAM chips • MAD at 50 Gbpsi • 10 GBps SANs are ubiquitous • 1 Gbps WANs are ubiquitous • Some consequences • Absurd (?) consequences: • Auto-manage storage • Raid10 replaces Raid5 • Disc-packs • Disk is the archive media of choice • A surprising future? • Disks (and other useful things) become supercomputers. • Apps run “in the disk”
The Surprise-free Storage Future • 1 Gb RAM chips • MAD at 50 Gbpsi • Drives shrink one quantum • Standard IO • 10 GBps SANs are ubiquitous • 1 Gbps WANs are ubiquitous • 5 bips cpus for 1K$ and 500 mips cpus for 10$
1 Gb RAM Chips • Moving to 256 Mb chips now • 1Gb will be “standard” in 5 years, 4 Gb will be premium product. • Note: • 256Mb = 32MB: the smallest memory • 1 Gb = 128 MB: the smallest memory
System On A Chip • Integrate Processing with memory on one chip • chip is 75% memory now • 1MB cache >> 1960 supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM,… projects abound • Integrate Networking with processing on one chip • system bus is a kind of network • ATM, FiberChannel, Ethernet,.. Logic on chip. • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.
500 mips System On A Chip for 10$ • 486 is 7$ now • 233 MHz ARM system on a chip for 10$: http://www.cirrus.com/news/products99/news-product14.html • AMD/Celeron 266 ~ 30$ • In 5 years, today’s leading edge will be: • System on chip (cpu, cache, mem ctlr, multiple IO) • Low cost • Low power • Integrated IO • High end is 5 BIPS cpus
Standard IO in 5 Years • Probably • Replace PCI with something better will still need a mezzanine bus standard • Multiple serial links directly from processor • Fast (10 GBps/link) for a few meters • System Area Networks (SANS) ubiquitous (VIA morphs to SIO?)
Ubiquitous 10 GBps SANs in 5 years • 1 Gbps (120 MBps) Ethernet is reality now. • Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,… • 10 Gbps x4 WDM deployed now (OC192) • 3 Tbps WDM working in lab • In 5 years, expect 10x; progress is astonishing • Gilder’s law: bandwidth grows 3x/year: http://www.forbes.com/asap/97/0407/090.htm
Thin Clients mean HUGE servers • AOL hosting customer pictures • Hotmail allows 5 MB/user, 50 M users • Web sites offer electronic vaulting for SOHO • IntelliMirror: replicate client state on server • Terminal server: timesharing returns • … many more.
MAD at 50 Gbpsi • MAD: Magnetic Areal Density: • 3-10 Gbpsi in products • 28 Gbpsi in lab • 50 Gbpsi = superparamagnetic limit, but… people have ideas. • Capacity: rises 10x in 5 years (conservative) • Bandwidth: rises 4x in 5 years (density + rpm) • Disk: 50 GB to 500 GB • 60-80 MBps • 1k$/TB • 15 minute to 3 hour scan time.
The “Absurd” Disk • 1 TB, 100 MB/s, 200 Kaps • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape!
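The slide's numbers fall out of simple division; a minimal sketch (the model is mine, the 1 TB / 100 MB/s / 200 Kaps figures are from the slide):

```python
def scan_time_hours(capacity_gb, bandwidth_mb_s):
    """Hours to read the whole drive sequentially."""
    seconds = capacity_gb * 1000 / bandwidth_mb_s
    return seconds / 3600

def aps_per_gb(kaps, capacity_gb):
    """Accesses per second available per GB stored.
    Kaps is the talk's metric: kilobyte objects served per second,
    i.e. roughly one arm movement per object."""
    return kaps / capacity_gb

hours = scan_time_hours(1000, 100)   # the "absurd" 1 TB drive: ~2.8 hr
density = aps_per_gb(200, 1000)      # 0.2 aps/GB = 1 aps per 5 GB
print(round(hours, 1), density)
```

Capacity grows faster than bandwidth or arm rate, so both ratios keep getting worse: that is the sense in which the big disk "is a tape".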
Disk vs Tape Guestimates • Disk: 47 GB; 15 MBps; 5 ms seek time; 3 ms rotate latency; 9$/GB for drive, 3$/GB for ctlrs/cabinet; 4 TB/rack • Tape: 40 GB; 5 MBps; 30 sec pick time; many-minute seek time; 5$/GB for media, 10$/GB for drive+library; 10 TB/rack • Cern: 200 TB, 3480 tapes; 2 col = 50 GB; rack = 1 TB = 20 drives • The price advantage of tape is narrowing, and the performance advantage of disk is growing
Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 512MB and 3$/MB • Disk: GB and $/GB: today at 50GB and 10$/GB • Tape: TB and $/TB: today at 50GB and 12k$/TB (nearline) • Access time (latency) • RAM: 100 ns • Disk: 10 ms • Tape: 30 second pick, 30 second position • Transfer rate • RAM: 1 GB/s • Disk: 15 MB/s - - - Arrays can go to 1GB/s • Tape: 5 MB/s - - - striping is problematic, but “works”
New Storage Metrics: Kaps, Maps, SCAN? • Kaps: How many kilobyte objects served per second • The file server, transaction processing metric • This is the OLD metric. • Maps: How many megabyte objects served per second • The Multi-Media metric • SCAN: How long to scan all the data • the data mining and utility metric • And • Kaps/$, Maps/$, TBscan/$
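The three metrics can be estimated for the deck's "today" disk (47 GB, 15 MB/s, 5 ms seek + 3 ms rotate, from the Disk vs Tape slide); the back-of-envelope model below is my sketch, not Gray's:

```python
# Parameters from the deck's "today" disk.
SEEK_S, ROTATE_S, XFER_MB_S, CAP_GB = 0.005, 0.003, 15.0, 47.0

def kaps():
    """1 KB objects/sec: dominated by positioning time."""
    per_io = SEEK_S + ROTATE_S + (1 / 1024) / XFER_MB_S
    return 1 / per_io

def maps():
    """1 MB objects/sec: positioning plus a 1 MB transfer."""
    per_io = SEEK_S + ROTATE_S + 1 / XFER_MB_S
    return 1 / per_io

def scan_minutes():
    """Minutes to read every byte sequentially."""
    return CAP_GB * 1024 / XFER_MB_S / 60

print(round(kaps()), round(maps(), 1), round(scan_minutes()))
```

Note how Kaps is pure arm arithmetic while SCAN is pure bandwidth arithmetic; Maps sits in between, which is why the three metrics rank storage devices differently.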
The Access Time Myth • The Myth: seek or pick time dominates • The reality: (1) queuing dominates, (2) transfer dominates BLOBs, (3) disk seeks are often short • Implication: many cheap servers are better than one fast, expensive server • shorter queues • parallel transfer • lower cost/access and cost/byte • This is obvious for disk arrays • It is even more obvious for tape arrays • (figure: pie charts of wait, transfer, rotate, seek time)
Storage Ratios Changed • 10x better access time • 10x more bandwidth • 4,000x lower media price • DRAM/disk media price ratio changed: • 1970-1990: 100:1 • 1990-1995: 10:1 • 1995-1997: 50:1 • today: ~30:1 (3$/MB DRAM vs ~0.1$/MB disk)
Data on Disk Can Move to RAM • At today’s ~30:1 DRAM/disk price ratio and current trends, data now on disk can move to RAM in 6 to 8 years.
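One way to read the 6-8 year figure, under the hedged assumption that RAM $/MB halves every 18 months (a Moore's-law rate) relative to disk prices:

```python
import math

def years_to_close(ratio, halving_years=1.5):
    """Years until a price ratio closes, if the expensive medium's
    relative price halves every `halving_years`."""
    return math.log2(ratio) * halving_years

# Closing the deck's ~30:1 DRAM/disk ratio:
print(round(years_to_close(30), 1))   # lands between 6 and 8 years
```

log2(30) is about 4.9 halvings, so at one halving per 18 months the ratio closes in roughly 7.4 years, consistent with the slide's horizon.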
Outline • The Surprise-Free Future (5 years) • 500 mips cpus for 10$ • 1 Gb RAM chips • MAD at 50 Gbpsi • 10 GBps SANs are ubiquitous • 1 Gbps WANs are ubiquitous • Some consequences • Absurd (?) consequences: • Auto-manage storage • Raid10 replaces Raid5 • Disc-packs • Disk is the archive media of choice • A surprising future? • Disks (and other useful things) become supercomputers. • Apps run “in the disk”.
The (absurd?) consequences • 1 Gb RAM chips → huge main memories: now 500 MB - 64 GB, then 10 GB - 1 TB; 256-way NUMA? • MAD at 50 Gbpsi, drives shrink one quantum → huge disks: now 5-50 GB 3.5” disks, then 50-500 GB disks; petabyte storage farms (that you can’t back up or restore); disks >> tapes; “small” disks: one platter, one inch, 10 GB • 10 GBps SANs are ubiquitous → SAN convergence: 1 GBps point-to-point is easy • 500 mips cpus for 10$, 5 bips cpus at high end
The Absurd? Consequences • Further segregate processing from storage • Poor locality • Much useless data movement • Amdahl’s laws: bus: 10 B/ips; io: 1 b/ips • (figure: ~1 Tips of processors and ~1 TB of RAM connected at 10 TBps; ~100 TB of disks connected at 100 GBps)
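Amdahl's balance rules explain the figure's bandwidth arrows; a sketch of the arithmetic for the pictured ~1 Tips processor farm:

```python
TIPS = 1e12                        # instructions per second

# Amdahl's rules of thumb from the slide:
bus_bytes_s = 10 * TIPS            # 10 B/ips -> 10 TB/s of memory traffic
io_bits_s = 1 * TIPS               # 1 b/ips  -> 1 Tbps of I/O
io_gbytes_s = io_bits_s / 8 / 1e9  # 125 GB/s, ~ the 100 GBps disk link

print(bus_bytes_s, io_gbytes_s)
```

A balanced terascale system therefore needs on the order of 100 GBps between processors and disks; any design that segregates them must pay for that pipe.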
Storage Latency: How Far Away is the Data? (relative clock ticks, with human-scale analogies) • Registers: 1 (my head, 1 min) • On-chip cache: 2 (this room) • On-board cache: 10 (this hotel, 10 min) • Memory: 100 (Olympia, 1.5 hr) • Disk: 10^6 (Pluto, 2 years) • Tape/Optical robot: 10^9 (Andromeda, 2,000 years)
Consequences • AutoManage Storage • Sixpacks (for arm-limited apps) • Raid5-> Raid10 • Disk-to-disk backup • Smart disks
Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 60k$ today, 10k$ in a few years. • Admin cost >> storage cost??? • Challenge: • Automate ALL storage admin tasks
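The slide's "admin cost >> storage cost" worry is simple division; a sketch with an illustrative admin salary (the 100k$/year figure is my assumption, the 5 TB/admin and 10k$ storage price are from the slide):

```python
ADMIN_SALARY = 100_000        # $/year, assumed for illustration
TB_PER_ADMIN = 5              # the 2000 rule of thumb
STORAGE_COST_PER_TB = 10_000  # $ "in a few years", per the slide

admin_per_tb = ADMIN_SALARY / TB_PER_ADMIN   # $/TB/year of admin labor
ratio = admin_per_tb / STORAGE_COST_PER_TB   # admin cost vs hardware cost

print(admin_per_tb, ratio)
```

At these numbers the people cost is twice the hardware cost per terabyte per year, and the gap widens every year hardware gets cheaper, which is the case for automating all storage admin tasks.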
The “Absurd” Disk • 1 TB, 100 MB/s, 200 Kaps • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape!
Extreme case: 1 TB disk: Alternatives • Use all the heads in parallel: 1 TB, 500 MB/s, 200 Kaps • Scan in 30 minutes • Still one Kaps / 5 GB • Use one platter per arm: 5 × 200 GB, 500 MB/s, 1,000 Kaps • Share power/sheetmetal • Scan in 30 minutes • One Kaps per GB
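The difference between the two designs is arms, not bytes; a sketch of the access-density arithmetic (Kaps as the talk defines it: kilobyte objects served per second):

```python
def kaps_per_gb(total_kaps, capacity_gb):
    """Access density: KB objects served per second per GB stored."""
    return total_kaps / capacity_gb

# All heads in parallel: one arm, so still 200 Kaps over 1 TB.
parallel_heads = kaps_per_gb(200, 1000)        # 0.2 -> one Kaps per 5 GB

# One platter per arm: five independent arms of ~200 Kaps each.
platter_per_arm = kaps_per_gb(5 * 200, 1000)   # 1.0 Kaps per GB

print(parallel_heads, platter_per_arm)
```

Parallel heads fix bandwidth only; independent arms fix both bandwidth and access density, which is the argument for the multi-arm "6-pack" package on the next slide.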
Drives shrink (1.8”, 1”) • 150 kaps for 500 GB is VERY cold data • 3 GB/platter today, 30 GB/platter in 5 years • Most disks are ½ full • TPC benchmarks use 9 GB drives (need arms or bandwidth) • One solution: smaller form factor • More arms per GB • More arms per rack • More arms per Watt
Prediction: 6-packs • One way or another, when disks get huge • Will be packaged as multiple arms • Parallel heads gives bandwidth • Independent arms gives bandwidth & aps • Package shares power, package, interfaces…
Stripes, Mirrors, Parity (RAID 0, 1, 5) • RAID 0: Stripes • bandwidth • (disks hold 0,3,6,… / 1,4,7,… / 2,5,8,…) • RAID 1: Mirrors, Shadows,… • Fault tolerance • Reads faster, writes 2x slower • (both disks hold 0,1,2,…) • RAID 5: Parity • Fault tolerance • Reads faster • Writes 4x or 6x slower • (disks hold 0,2,P2,… / 1,P1,4,… / P0,3,5,…)
RAID 5 Performance: 225 reads/sec, 70 writes/sec • Write = 4 logical IOs, 2 seeks + 1.7 rotates • SAVES SPACE • Performance degrades on failure • RAID 1 Performance: 250 reads/sec, 100 writes/sec • Write = 2 logical IOs, 2 seeks + 0.7 rotates • SAVES ARMS • Performance improves on failure • RAID 10 (stripes of mirrors) wins: “wastes space, saves arms”
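The small-write gap can be sketched from I/O counts alone (a hedged model of mine, not the slide's exact queueing numbers): a RAID 5 small write costs four disk I/Os (read old data, read old parity, write both), a mirrored write costs two.

```python
# A single disk with 5 ms seek + 3 ms rotational latency does
# roughly 125 random I/Os per second.
DISK_IOPS = 1 / (0.005 + 0.003)

def writes_per_sec(n_disks, ios_per_write):
    """Array small-write rate if disk I/Os are the bottleneck."""
    return n_disks * DISK_IOPS / ios_per_write

raid5 = writes_per_sec(4, 4)   # read data, read parity, write both
raid10 = writes_per_sec(4, 2)  # write the two mirror copies

print(round(raid5), round(raid10))
```

On the same four arms, mirroring sustains twice the small-write rate of parity, which is the "saves arms" half of the RAID 10 argument; parity only "saves space".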
The Storage Rack Today • 140 arms, 4 TB • 24 racks, 24 storage processors, 6+1 in rack • Disks = 2.5 GBps IO • Controllers = 1.2 GBps IO • Ports = 500 MBps IO
Storage Rack in 5 Years? • 140 arms, 50 TB • 24 racks, 24 storage processors, 6+1 in rack • Disks = 14 GBps IO • Controllers = 5 GBps IO • Ports = 1 GBps IO • My suggestion: move the processors into the storage racks.
It’s hard to archive a PetaByte; it takes a LONG time to restore it. • Store it in two (or more) places online (on disk?). • Scrub it continuously (look for errors) • On failure, refresh the lost copy from the safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
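Why restore time is the killer: moving a petabyte through a handful of tape drives takes the better part of a year, while scrubbing online disk copies parallelizes across every arm. A sketch using the deck's 5 MB/s tape and 15 MB/s disk rates (the drive counts are my illustrative assumptions):

```python
PB_MB = 1e9  # 1 PB expressed in MB

def days(total_mb, streams, mb_per_s):
    """Days to move total_mb through `streams` parallel devices."""
    return total_mb / (streams * mb_per_s) / 86400

tape_restore = days(PB_MB, 10, 5)      # 10 tape drives, 5 MB/s each
disk_scrub = days(PB_MB, 1000, 15)     # 1000 disks scrubbed in place

print(round(tape_restore), round(disk_scrub, 2))
```

At ten tape drives the restore takes over 200 days; a thousand disks scrub themselves in under a day, which is why two online copies plus continuous scrubbing beats offline archive.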
Crazy Disk Ideas • Disk farm on a card: surface-mount disks • Disk (magnetic store) on a chip: (micro-machines in silicon) • Full apps (e.g. SAP, Exchange/Notes,…) in the disk controller (a processor with 128 MB DRAM + ASIC) • The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen, ISBN: 0875845851
The Disk Farm On a Card • The 500 GB disc card (14”) • An array of discs • Can be used as: • 100 discs • 1 striped disc • 50 fault-tolerant discs • …etc • LOTS of accesses/second and bandwidth
Functionally Specialized Cards (ASIC, P-mips processor, M MB DRAM) • Storage • Network • Display • Today: P = 50 mips, M = 2 MB • In a few years: P = 200 mips, M = 64 MB
Data Gravity: Processing Moves to Transducers • Move processing to data sources • Move to where the power (and sheet metal) is • Processor in: • Modem • Display • Microphones (speech recognition) & cameras (vision) • Storage: data storage and analysis
It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get: • several network interfaces • a PostScript engine • cpu • memory • software • a spooler (soon) • and… a print engine.
Disks Become Supercomputers • 100x in 10 years: 2 TB 3.5” drive • Shrunk to 1”, it is 200 GB • Disk replaces tape? • Disk is a supercomputer!
All Device Controllers will be Cray 1’s • TODAY • Disk controller is a 10 mips risc engine with 2 MB DRAM • NIC is similar power • SOON • Will become 100 mips systems with 100 MB DRAM • They are nodes in a federation (can run Oracle on NT in the disk controller) • Advantages • Uniform programming model • Great tools • Security • Economics (cyberbricks) • Move computation to data (minimize traffic) • (figure: devices on a terabyte backplane with central processor & memory)
With Tera Byte Interconnect and Super Computer Adapters • Processing is incidental to: • Networking • Storage • UI • Disk controller/NIC is: • faster than the device • close to the device • can borrow the device package & power • So use idle capacity for computation • Run the app in the device • Both Kim Keeton (UCB) and Erik Riedel (CMU) theses investigate this and show the benefits of this approach.
Implications • Conventional: • Offload device handling to NIC/HBA • Higher-level protocols: I2O, NASD, VIA, IP, TCP… • SMP and cluster parallelism is important • Radical: • Move the app to the NIC/device controller • Higher-higher level protocols: CORBA / COM+ • Cluster parallelism is VERY important
How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation • Each node does not completely trust the others • Nodes use RPC to talk to each other: CORBA? COM+? RMI? One or all of the above • Huge leverage in high-level interfaces • Same old distributed-system story • (figure: applications on each node talking over the SAN via SIO, streams, datagrams, and RPC)
Basic Argument for x-Disks • Future disk controller is a super-computer. • 1 bips processor • 128 MB dram • 100 GB disk plus one arm • Connects to SAN via high-level protocols • RPC, HTTP, DCOM, Kerberos, Directory Services,…. • Commands are RPCs • management, security,…. • Services file/web/db/… requests • Managed by general-purpose OS with good dev environment • Move apps to disk to save data movement • need programming environment in controller
The Slippery Slope • If you add function to the server • Then you add more function to the server • Function gravitates to data • Nothing = Sector Server → Something = Fixed App Server → Everything = App Server
Why Not a Sector Server? (let’s get physical!) • Good idea; that’s what we have today. • But: • cache added for performance • sector remap added for fault tolerance • error reporting and diagnostics added • SCSI commands (reserve,…) are growing • sharing is problematic (space mgmt, security,…) • Slipping down the slope to a 1-D block server
Why Not a 1-D Block Server? Put A LITTLE on the Disk Server • Tried and true design • HSC - VAX cluster • EMC • IBM Sysplex (3980?) • But look inside: • Has a cache • Has space management • Has error reporting & management • Has RAID 0, 1, 2, 3, 4, 5, 10, 50,… • Has locking • Has remote replication • Has an OS • Security is problematic • Low-level interface moves too many bytes
Why Not a 2-D Block Server? Put A LITTLE on the Disk Server • Tried and true design • Cedar → NFS • file server, cache, space,… • Open file is many fewer msgs • Grows to have: • Directories + naming • Authentication + access control • RAID 0, 1, 2, 3, 4, 5, 10, 50,… • Locking • Backup/restore/admin • Cooperative caching with client • File servers are a BIG hit: NetWare™ • SNAP! is my favorite today
Why Not a File Server? Put a Little on the Disk Server • Tried and true design • Auspex, NetApp, … • Netware • Yes, but look at NetWare: • The file interface gives you an app-invocation interface • It became an app server: mail, DB, web,… • NetWare had a primitive OS • Hard to program, so it optimized the wrong thing