
Storage: Alternate Futures

Storage: Alternate Futures. Yotta Zetta Exa Peta Tera Giga Mega Kilo. Jim Gray Microsoft Research http://Research.Microsoft.com/~Gray/talks IBM Almaden, 1 December 1999. Acknowledgments: Thank You!!. Dave Patterson: Convinced me that processors are moving to the devices.



Presentation Transcript


  1. Storage: Alternate Futures Yotta Zetta Exa Peta Tera Giga Mega Kilo Jim Gray Microsoft Research http://Research.Microsoft.com/~Gray/talks IBM Almaden, 1 December 1999

  2. Acknowledgments: Thank You!! • Dave Patterson: • Convinced me that processors are moving to the devices. • Kim Keeton and Erik Riedel • Showed that many useful subtasks can be done by disk-processors, and quantified execution interval • Remzi Dusseau • Re-validated Amdahl's laws

  3. Outline • The Surprise-Free Future (5 years) • 500 mips cpus for 10$ • 1 Gb RAM chips • MAD at 50 Gbpsi • 10 GBps SANs are ubiquitous • 1 GBps WANs are ubiquitous • Some consequences • Absurd (?) consequences. • Auto-manage storage • Raid10 replaces Raid5 • Disc-packs • Disk is the archive media of choice • A surprising future? • Disks (and other useful things) become supercomputers. • Apps run “in the disk”

  4. The Surprise-free Storage Future • 1 Gb RAM chips • MAD at 50 Gbpsi • Drives shrink one quantum • Standard IO • 10 GBps SANs are ubiquitous • 1 Gbps WANs are ubiquitous • 5 bips cpus for 1K$ and 500 mips cpus for 10$

  5. 1 Gb RAM Chips • Moving to 256 Mb chips now • 1Gb will be “standard” in 5 years, 4 Gb will be premium product. • Note: • 256Mb = 32MB: the smallest memory • 1 Gb = 128 MB: the smallest memory
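
The bit-to-byte arithmetic on this slide can be checked with a small sketch. The function name `chip_megabytes` is made up for illustration, and the sketch assumes 1 Gb = 1024 Mb and 8 bits per byte.

```python
def chip_megabytes(megabits: int) -> int:
    """Convert a DRAM chip capacity in megabits (Mb) to megabytes (MB)."""
    return megabits // 8  # 8 bits per byte


# Chip sizes named on the slide; 1 Gb is taken as 1024 Mb (an assumption).
for label, mbits in [("256 Mb", 256), ("1 Gb", 1024), ("4 Gb", 4096)]:
    print(f"{label} chip = {chip_megabytes(mbits)} MB")
```

A single chip therefore sets the smallest memory a system can ship with, which is the point of the slide's note.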

  6. System On A Chip • Integrate Processing with memory on one chip • chip is 75% memory now • 1MB cache >> 1960 supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM,… projects abound • Integrate Networking with processing on one chip • system bus is a kind of network • ATM, FiberChannel, Ethernet,.. Logic on chip. • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.

  7. 500 mips System On A Chip for 10$ • 486 now 7$ • 233 MHz ARM for 10$ system on a chip (http://www.cirrus.com/news/products99/news-product14.html) • AMD/Celeron 266 ~ 30$ • In 5 years, today’s leading edge will be • System on chip (cpu, cache, mem ctlr, multiple IO) • Low cost • Low-power • Have integrated IO • High end is 5 BIPS cpus

  8. Standard IO in 5 Years • Probably • Replace PCI with something better; will still need a mezzanine bus standard • Multiple serial links directly from processor • Fast (10 GBps/link) for a few meters • System Area Networks (SANs) ubiquitous (VIA morphs to SIO?)

  9. Ubiquitous 10 GBps SANs in 5 years • 1 Gbps Ethernet is a reality now. • Also FiberChannel, MyriNet, GigaNet, ServerNet, ATM,… • 10 Gbps x4 WDM deployed now (OC192) • 3 Tbps WDM working in lab • In 5 years, expect 10x, progress is astonishing • Gilder’s law: Bandwidth grows 3x/year (http://www.forbes.com/asap/97/0407/090.htm) [Chart labels: 1 GBps; 120 MBps (1 Gbps); 80 MBps; 40 MBps; 20 MBps; 5 MBps]

  10. Thin Clients mean HUGE servers • AOL hosting customer pictures • Hotmail allows 5 MB/user, 50 M users • Web sites offer electronic vaulting for SOHO. • IntelliMirror: replicate client state on server • Terminal server: timesharing returns • …. Many more.

  11. Remember Your Roots?

  12. MAD at 50 Gbpsi • MAD: Magnetic Areal Density: 3-10 Gbpsi in products, 28 Gbpsi in lab, 50 Gbpsi = superparamagnetic limit, but… people have ideas. • Capacity: rise 10x in 5 years (conservative) • Bandwidth: rise 4x in 5 years (density+rpm) • Disk: 50GB to 500 GB, • 60-80MBps • 1k$/TB • 15 minute to 3 hour scan time.

  13. The “Absurd” Disk • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape! [Diagram labels: 1 TB; 100 MB/s; 200 Kaps]
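
The two figures on this slide follow from the diagram labels (1 TB, 100 MB/s, 200 Kaps). The sketch below redoes that arithmetic; the function names are invented for illustration and decimal units (1 TB = 10^12 bytes) are an assumption, so the scan time comes out slightly above the slide's rounded 2.5 hr.

```python
def scan_hours(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read the whole disk sequentially, in hours."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600.0


def gb_per_access_per_s(capacity_bytes: float, accesses_per_s: float) -> float:
    """Gigabytes of stored data behind each access per second."""
    return capacity_bytes / 1e9 / accesses_per_s


TB = 1e12  # assumption: decimal units
MB = 1e6

hours = scan_hours(1 * TB, 100 * MB)        # 1 TB read at 100 MB/s
density = gb_per_access_per_s(1 * TB, 200)  # 200 kilobyte accesses per second

print(f"scan time: {hours:.1f} hr")
print(f"{density:.0f} GB per access/second")
```

One access per second for every 5 GB is what makes the data "VERY cold": capacity has grown far faster than the arm can visit it.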

  14. Disk vs Tape Guesstimates • Disk: 47 GB • 15 MBps • 5 ms seek time • 3 ms rotate latency • 9$/GB for drive, 3$/GB for ctlrs/cabinet • 4 TB/rack • Tape: 40 GB • 5 MBps • 30 sec pick time • Many minute seek time • 5$/GB for media, 10$/GB for drive+library • 10 TB/rack • Cern: 200 TB, 3480 tapes, 2 col = 50GB, Rack = 1 TB = 20 drives • The price advantage of tape is narrowing, and the performance advantage of disk is growing

  15. Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 512MB and 3$/MB • Disk: GB and $/GB: today at 50GB and 10$/GB • Tape: TB and $/TB: today at 50GB and 12k$/TB (nearline) • Access time (latency) • RAM: 100 ns • Disk: 10 ms • Tape: 30 second pick, 30 second position • Transfer rate • RAM: 1 GB/s • Disk: 15 MB/s - - - Arrays can go to 1GB/s • Tape: 5 MB/s - - - striping is problematic, but “works”

  16. New Storage Metrics: Kaps, Maps, SCAN? • Kaps: How many kilobyte objects served per second • The file server, transaction processing metric • This is the OLD metric. • Maps: How many megabyte objects served per second • The Multi-Media metric • SCAN: How long to scan all the data • the data mining and utility metric • And • Kaps/$, Maps/$, TBscan/$
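
As a rough illustration of these three metrics, the sketch below derives them for the disk in the Disk vs Tape slide (47 GB, 15 MBps, 5 ms seek, 3 ms rotate). The `Disk` class and the cost model (seek + rotate + transfer, no queuing, no caching) are simplifying assumptions, not something the talk specifies.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    capacity_bytes: float
    bandwidth_bytes_per_s: float
    seek_s: float
    rotate_s: float

    def accesses_per_s(self, object_bytes: float) -> float:
        """Random accesses per second for objects of the given size."""
        transfer_s = object_bytes / self.bandwidth_bytes_per_s
        return 1.0 / (self.seek_s + self.rotate_s + transfer_s)

    def kaps(self) -> float:
        """Kilobyte objects served per second."""
        return self.accesses_per_s(1e3)

    def maps(self) -> float:
        """Megabyte objects served per second."""
        return self.accesses_per_s(1e6)

    def scan_s(self) -> float:
        """Seconds to read every byte sequentially."""
        return self.capacity_bytes / self.bandwidth_bytes_per_s


disk = Disk(capacity_bytes=47e9, bandwidth_bytes_per_s=15e6,
            seek_s=0.005, rotate_s=0.003)
print(f"Kaps: {disk.kaps():.0f}")
print(f"Maps: {disk.maps():.1f}")
print(f"SCAN: {disk.scan_s() / 60:.0f} minutes")
```

Under this model the small-object rate is set almost entirely by the arm, while the megabyte rate is set mostly by transfer time, which is why the slide treats them as separate metrics.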

  17. The Access Time Myth • The Myth: seek or pick time dominates • The reality: (1) Queuing dominates • (2) Transfer dominates BLOBs • (3) Disk seeks often short • Implication: many cheap servers better than one fast expensive server • shorter queues • parallel transfer • lower cost/access and cost/byte • This is obvious for disk arrays • This is even more obvious for tape arrays [Diagram labels: Wait; Transfer; Rotate; Seek]
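
To see why queuing can dominate, here is a minimal sketch using the textbook M/M/1 response-time formula R = S / (1 - utilization). The queuing model, the 8 ms service time, and the 100 requests/second load are all assumptions chosen for illustration; the talk does not name a model. The comparison is between one device and ten devices of the same speed sharing the load evenly.

```python
def mm1_response_s(service_s: float, arrivals_per_s: float) -> float:
    """Mean response time (wait + service) of an M/M/1 queue."""
    utilization = service_s * arrivals_per_s
    if utilization >= 1.0:
        raise ValueError("queue is unstable: utilization >= 1")
    return service_s / (1.0 - utilization)


SERVICE_S = 0.008  # assumed 8 ms per access (seek + rotate)
LOAD = 100.0       # assumed total requests per second

one_disk = mm1_response_s(SERVICE_S, LOAD)        # 80% busy
ten_disks = mm1_response_s(SERVICE_S, LOAD / 10)  # each 8% busy

print(f"one disk:  {one_disk * 1000:.1f} ms per request")
print(f"ten disks: {ten_disks * 1000:.1f} ms per request")
```

At 80% utilization most of the response time is waiting in the queue rather than seeking, and spreading the load over more arms removes most of it.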

  18. Storage Ratios Changed • 10x better access time • 10x more bandwidth • 4,000x lower media price • DRAM/disk media price ratio changed • 1970-1990 100:1 • 1990-1995 10:1 • 1995-1997 50:1 • today ~ 30:1 (0.1$pMB disk, 3$pMB dram)

  19. Data on Disk Can Move to RAM in 8 years [Chart labels: 30:1; 6 years]

  20. Outline • The Surprise-Free Future (5 years) • 500 mips cpus for 10$ • 1 Gb RAM chips • MAD at 50 Gbpsi • 10 GBps SANs are ubiquitous • 1 GBps WANs are ubiquitous • Some consequences • Absurd (?) consequences. • Auto-manage storage • Raid10 replaces Raid5 • Disc-packs • Disk is the archive media of choice • A surprising future? • Disks (and other useful things) become supercomputers. • Apps run “in the disk”.

  21. The (absurd?) consequences • 256 way NUMA? • Huge main memories: now: 500MB - 64GB memories, then: 10GB - 1TB memories • Huge disks: now: 5-50 GB 3.5” disks, then: 50-500 GB disks • Petabyte storage farms (that you can’t back up or restore). • Disks >> tapes • “Small” disks: One platter one inch 10GB • SAN convergence: 1 GBps point to point is easy [Sidebar: 1 Gb RAM chips; MAD at 50 Gbpsi; Drives shrink one quantum; 10 GBps SANs are ubiquitous; 500 mips cpus for 10$; 5 bips cpus at high end]

  22. The Absurd? Consequences • Further segregate processing from storage • Poor locality • Much useless data movement • Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips [Diagram labels: Processors; RAM Memory ~ 1 TB; Disks; 100 GBps; 10 TBps; ~ 1 Tips; ~ 100TB]
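
The two Amdahl ratios quoted on this slide (bus: 10 bytes per instruction per second, io: 1 bit per instruction per second) can be applied to the ~1 Tips figure in the diagram. The function name `balanced_system` is hypothetical, and this is only a sketch of the arithmetic.

```python
def balanced_system(instructions_per_s: float) -> tuple:
    """Return (bus bytes/s, io bytes/s) implied by the slide's two ratios."""
    bus_bytes_per_s = 10.0 * instructions_per_s  # 10 B/ips
    io_bytes_per_s = instructions_per_s / 8.0    # 1 b/ips, converted to bytes
    return bus_bytes_per_s, io_bytes_per_s


bus, io = balanced_system(1e12)  # ~1 Tips of aggregate processing
print(f"bus: {bus / 1e12:.0f} TBps")
print(f"io:  {io / 1e9:.0f} GBps")
```

This gives 10 TBps of bus bandwidth and 125 GBps of IO, in line with the diagram's 10 TBps and (rounded) 100 GBps labels.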

  23. Storage Latency: How Far Away is the Data? • Registers: 1 (My Head, 1 min) • On Chip Cache: 2 (This Room) • On Board Cache: 10 (This Hotel, 10 min) • Memory: 100 (Olympia, 1.5 hr) • Disk: 10^6 (Pluto, 2 Years) • Tape/Optical Robot: 10^9 (Andromeda, 2,000 Years)

  24. Consequences • AutoManage Storage • Sixpacks (for arm-limited apps) • Raid5-> Raid10 • Disk-to-disk backup • Smart disks

  25. Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 60k$ today, 10k$ in a few years. • Admin cost >> storage cost??? • Challenge: • Automate ALL storage admin tasks

  26. The “Absurd” Disk • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape! [Diagram labels: 1 TB; 100 MB/s; 200 Kaps]

  27. Extreme case: 1TB disk: Alternatives • Use all the heads in parallel • Scan in 30 minutes • Still one Kaps/5GB • Use one platter per arm • Share power/sheetmetal • Scan in 30 minutes • One KAPS per GB [Diagram labels: all heads in parallel: 500 MB/s, 1 TB, 200 Kaps; one platter per arm: 500 MB/s, 200GB each, 1,000 Kaps]
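
The two alternatives can be compared with the same arithmetic as the single-arm disk. In this sketch the five-arm split (5 x 200 GB, each arm at 100 MB/s and 200 Kaps) is inferred from the diagram labels and should be read as an assumption; `summarize` is an illustrative name.

```python
def summarize(capacity_gb: float, bandwidth_mb_per_s: float, kaps: float) -> tuple:
    """Return (scan time in minutes, GB of data per Kaps)."""
    scan_minutes = capacity_gb * 1000.0 / bandwidth_mb_per_s / 60.0
    gb_per_kaps = capacity_gb / kaps
    return scan_minutes, gb_per_kaps


# (capacity GB, aggregate MB/s, aggregate Kaps)
designs = {
    "one arm, one head at a time": (1000, 100, 200),
    "all heads in parallel": (1000, 500, 200),
    "one platter per arm (5 arms)": (1000, 5 * 100, 5 * 200),
}

results = {name: summarize(*spec) for name, spec in designs.items()}
for name, (scan_minutes, gb_per_kaps) in results.items():
    print(f"{name}: scan {scan_minutes:.0f} min, {gb_per_kaps:.0f} GB per Kaps")
```

Parallel heads fix only the scan time; independent arms fix both scan time and access density, which is the argument for packaging several arms together.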

  28. Drives shrink (1.8”, 1”) • 150 kaps for 500 GB is VERY cold data • 3 GB/platter today, 30 GB/platter in 5 years. • Most disks are ½ full • TPC benchmarks use 9GB drives (need arms or bandwidth). • One solution: smaller form factor • More arms per GB • More arms per rack • More arms per Watt

  29. Prediction: 6-packs • One way or another, when disks get huge • Will be packaged as multiple arms • Parallel heads give bandwidth • Independent arms give bandwidth & aps • Package shares power, package, interfaces…

  30. Stripes, Mirrors, Parity (RAID 0,1, 5) • RAID 0: Stripes • bandwidth • RAID 1: Mirrors, Shadows,… • Fault tolerance • Reads faster, writes 2x slower • RAID 5: Parity • Fault tolerance • Reads faster • Writes 4x or 6x slower. [Diagram block layouts: RAID 0: 0,3,6,.. | 1,4,7,.. | 2,5,8,.. ; RAID 1: 0,1,2,.. | 0,1,2,.. ; RAID 5: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..]
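
A minimal sketch of the parity idea behind RAID 5, using XOR over a stripe of two data blocks plus one parity block. The block contents and the helper name `xor_blocks` are made up for illustration; real arrays work on fixed-size sectors and rotate the parity block across disks.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks and their parity, each on a different disk.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Fault tolerance: if the disk holding d1 fails, rebuild it from the survivors.
rebuilt_d1 = xor_blocks(d0, parity)

# Small write to d0: read old data, read old parity,
# then write new data and new parity (four IOs in total).
new_d0 = b"CCCC"
new_parity = xor_blocks(parity, d0, new_d0)

print(rebuilt_d1 == d1)
print(new_parity == xor_blocks(new_d0, d1))
```

The read-modify-write in the last step is where the write penalty on the slide comes from; a mirror only has to write the block twice.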

  31. RAID 10 (stripes of mirrors) Wins: “wastes space, saves arms” • RAID 5: Performance 225 reads/sec, 70 writes/sec • Write 4 logical IO, 2 seek + 1.7 rotate • SAVES SPACE • Performance degrades on failure • RAID 1: Performance 250 reads/sec, 100 writes/sec • Write 2 logical IO, 2 seek, 0.7 rotate • SAVES ARMS • Performance improves on failure

  32. The Storage Rack Today • 140 arms • 4TB • 24 racks • 24 storage processors • 6+1 in rack • Disks = 2.5 GBps IO • Controllers = 1.2 GBps IO • Ports 500 MBps IO

  33. Storage Rack in 5 years? • 140 arms • 50TB • 24 racks • 24 storage processors • 6+1 in rack • Disks = 14 GBps IO • Controllers = 5 GBps IO • Ports 1 GBps IO • My suggestion: move the processors into the storage racks.

  34. It’s hard to archive a PetaByte. It takes a LONG time to restore it. • Store it in two (or more) places online (on disk?). • Scrub it continuously (look for errors) • On failure, refresh lost copy from safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
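
The scrub-and-refresh loop described here might look roughly like the sketch below. Everything in it is an illustrative assumption: the two copies are plain dictionaries, SHA-256 digests stand in for whatever error detection a real system uses, and `scrub` is a hypothetical name.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def is_good(store: dict, key: str, expected: str) -> bool:
    """True if the object is present and matches its recorded digest."""
    return key in store and digest(store[key]) == expected


def scrub(copy_a: dict, copy_b: dict, expected: dict) -> list:
    """Check every object in both copies; refresh a bad copy from the good one."""
    repaired = []
    for key, want in expected.items():
        a_ok = is_good(copy_a, key, want)
        b_ok = is_good(copy_b, key, want)
        if a_ok and not b_ok:
            copy_b[key] = copy_a[key]
            repaired.append(("b", key))
        elif b_ok and not a_ok:
            copy_a[key] = copy_b[key]
            repaired.append(("a", key))
    return repaired


objects = {"x": b"alpha", "y": b"beta"}
expected = {key: digest(value) for key, value in objects.items()}
copy_a = dict(objects)
copy_b = dict(objects)
copy_b["y"] = b"bet?"  # simulate silent corruption in one copy
del copy_a["x"]        # simulate a lost object in the other

repaired = scrub(copy_a, copy_b, expected)
print(repaired)
```

Running the scrub continuously keeps the window small in which both copies of the same object could be bad at once; the sketch does nothing in that case, which is why more than two copies may be wanted.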

  35. Crazy Disk Ideas • Disk Farm on a card: surface mount disks • Disk (magnetic store) on a chip: (micro machines in Silicon) • Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram) [Diagram label: ASIC] • The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen, ISBN: 0875845851

  36. The Disk Farm On a Card • The 500GB disc card • An array of discs • Can be used as • 100 discs • 1 striped disc • 50 Fault Tolerant discs • ....etc • LOTS of accesses/second bandwidth [Diagram label: 14"]

  37. Functionally Specialized Cards • Storage • Network • Display • Each card: P mips processor, M MB DRAM, ASIC • Today: P=50 mips, M= 2 MB • In a few years: P= 200 mips, M= 64 MB

  38. Data Gravity: Processing Moves to Transducers • Move Processing to data sources • Move to where the power (and sheet metal) is • Processor in • Modem • Display • Microphones (speech recognition) & cameras (vision) • Storage: Data storage and analysis

  39. It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get • several network interfaces • A Postscript engine • cpu, • memory, • software, • a spooler (soon) • and… a print engine.

  40. Disks Become Supercomputers • 100x in 10 years: 2 TB 3.5” drive • Shrink to 1” is 200GB • Disk replaces tape? • Disk is super computer! [Scale: Kilo Mega Giga Tera Peta Exa Zetta Yotta]

  41. All Device Controllers will be Cray 1’s • TODAY • Disk controller is 10 mips risc engine with 2MB DRAM • NIC is similar power • SOON • Will become 100 mips systems with 100 MB DRAM. • They are nodes in a federation (can run Oracle on NT in disk controller). • Advantages • Uniform programming model • Great tools • Security • Economics (cyberbricks) • Move computation to data (minimize traffic) [Diagram labels: Central Processor & Memory; Tera Byte Backplane]

  42. With Tera Byte Interconnect and Super Computer Adapters • Processing is incidental to • Networking • Storage • UI • Disk Controller/NIC is • faster than device • close to device • Can borrow device package & power • So use idle capacity for computation. • Run app in device. • Both Kim Keeton (UCB) and Erik Riedel (CMU) theses investigate this and show benefits of this approach. [Diagram label: Tera Byte Backplane]

  43. Implications • Conventional • Offload device handling to NIC/HBA • higher level protocols: I2O, NASD, VIA, IP, TCP… • SMP and Cluster parallelism is important. • Radical • Move app to NIC/device controller • higher-higher level protocols: CORBA / COM+. • Cluster parallelism is VERY important. [Diagram labels: Central Processor & Memory; Tera Byte Backplane]

  44. How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: A federation. • Each node does not completely trust the others. • Nodes use RPC to talk to each other • CORBA? COM+? RMI? • One or all of the above. • Huge leverage in high-level interfaces. • Same old distributed system story. [Diagram labels: Applications; datagrams; streams; RPC; ?; SIO; SAN]

  45. Basic Argument for x-Disks • Future disk controller is a super-computer. • 1 bips processor • 128 MB dram • 100 GB disk plus one arm • Connects to SAN via high-level protocols • RPC, HTTP, DCOM, Kerberos, Directory Services,…. • Commands are RPCs • management, security,…. • Services file/web/db/… requests • Managed by general-purpose OS with good dev environment • Move apps to disk to save data movement • need programming environment in controller

  46. The Slippery Slope • If you add function to server • Then you add more function to server • Function gravitates to data. [Diagram labels: Nothing = Sector Server; Something = Fixed App Server; Everything = App Server]

  47. Why Not a Sector Server? (let’s get physical!) • Good idea, that’s what we have today. • But • cache added for performance • Sector remap added for fault tolerance • error reporting and diagnostics added • SCSI commands (reserve,..) are growing • Sharing problematic (space mgmt, security,…) • Slipping down the slope to a 2-D block server

  48. Why Not a 1-D Block Server? Put A LITTLE on the Disk Server • Tried and true design • HSC - VAX cluster • EMC • IBM Sysplex (3980?) • But look inside • Has a cache • Has space management • Has error reporting & management • Has RAID 0, 1, 2, 3, 4, 5, 10, 50,… • Has locking • Has remote replication • Has an OS • Security is problematic • Low-level interface moves too many bytes

  49. Why Not a 2-D Block Server? Put A LITTLE on the Disk Server • Tried and true design • Cedar -> NFS • file server, cache, space,.. • Open file is many fewer msgs • Grows to have • Directories + Naming • Authentication + access control • RAID 0, 1, 2, 3, 4, 5, 10, 50,… • Locking • Backup/restore/admin • Cooperative caching with client • File Servers are a BIG hit: NetWare™ • SNAP! is my favorite today

  50. Why Not a File Server? Put a Little on the Disk Server • Tried and true design • Auspex, NetApp, ... • Netware • Yes, but look at NetWare • File interface gives you app invocation interface • Became an app server • Mail, DB, Web,…. • Netware had a primitive OS • Hard to program, so optimized wrong thing
