

Storage Bricks Jim Gray Microsoft Research http://Research.Microsoft.com/~Gray/talks FAST 2002 Monterey, CA, 29 Jan 2002 Acknowledgements : Dave Patterson explained this to me long ago Leonard Chung Kim Keeton Erik Riedel Catharine Van Ingen. Helped me sharpen


Presentation Transcript


  1. Storage Bricks • Jim Gray, Microsoft Research • http://Research.Microsoft.com/~Gray/talks • FAST 2002, Monterey, CA, 29 Jan 2002 • Acknowledgements: Dave Patterson explained this to me long ago; Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments

  2. First Disk 1956 • IBM 305 RAMAC • 4 MB • 50x24” disks • 1200 rpm • 100 ms access • 35k$/y rent • Included computer & accounting software (tubes, not transistors)

  3. 10 years later 1.6 meters

  4. Disk Evolution • Capacity: 100x in 10 years (1 TB 3.5” drive in 2005, 20 GB 1” micro-drive) • System on a chip • High-speed SAN • Disk replacing tape • Disk is super computer! • Scale shown: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta

  5. Disks are becoming computers • Smart drives • Camera with micro-drive • Replay / Tivo / Ultimate TV • Phone with micro-drive • MP3 players • Tablet • Xbox • Many more… • Stack shown: Applications (Web, DBMS, Files); OS; Disk Ctlr + 1 GHz cpu + 1 GB RAM; Comm: Infiniband, Ethernet, radio…

  6. Data Gravity: Processing Moves to Transducers • Smart displays, microphones, printers, NICs, disks • ASIC today: P = 50 mips, M = 2 MB • ASIC in a few years: P = 500 mips, M = 256 MB • Processing decentralized • Moving to data sources • Moving to power sources • Moving to sheet metal • ? The end of computers ? • Diagram labels: Storage, Network, Display

  7. It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get: • several network interfaces • a PostScript engine • cpu • memory • software • a spooler (soon) • and… a print engine.

  8. The Absurd Design? • Segregate processing from storage • Poor locality • Much useless data movement • Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips • Diagram labels: Processors ~ 1 Tips, RAM ~ 1 TB, 10 TBps, 100 GBps, Disks ~ 100TB
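As a rough check of the diagram figures on this slide, the sketch below applies the two quoted ratios to a processor pool of about 1 Tips. It assumes "B" means bytes and "b" means bits; the function name and dictionary keys are illustrative and not from the talk.

```python
def amdahl_balance(ips: float) -> dict:
    """Bandwidth a balanced system would need, using the slide's two ratios."""
    return {
        "bus_bytes_per_s": 10 * ips,  # bus: 10 bytes per instruction/second
        "io_bytes_per_s": ips / 8,    # io: 1 bit per instruction/second, as bytes
    }


need = amdahl_balance(1e12)  # about 1 Tips of processing
print(f"bus: {need['bus_bytes_per_s'] / 1e12:.0f} TBps")
print(f"io:  {need['io_bytes_per_s'] / 1e9:.0f} GBps")
```

Under these assumptions the bus figure matches the slide's 10 TBps, and the io figure comes out at 125 GBps, which the slide appears to round to 100 GBps.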

  9. The “Absurd” Disk • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape! • Optimizations: • Reduce management costs • Caching • Sequential 100x faster than random • Disk shown: 200$, 1 TB, 100 MB/s, 200 Kaps
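The scan-time and access-density figures on this slide follow from the drive parameters. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and reading "200 Kaps" as 200 small accesses per second; the helper names are made up for illustration.

```python
def scan_hours(capacity_bytes: float, bytes_per_s: float) -> float:
    """Hours needed to read the whole drive sequentially."""
    return capacity_bytes / bytes_per_s / 3600


def gb_per_aps(capacity_bytes: float, accesses_per_s: float) -> float:
    """Gigabytes of stored data behind each access-per-second of capacity."""
    return capacity_bytes / accesses_per_s / 1e9


TB = 1e12
print(f"scan time: {scan_hours(1 * TB, 100e6):.1f} hours")
print(f"density:   1 aps per {gb_per_aps(1 * TB, 200):.0f} GB")
```

With these inputs the scan takes a little under three hours, in line with the slide's roughly 2.5 hr, and each access per second has to cover 5 GB of data.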

  10. Disk = Node • magnetic storage (1TB) • processor + RAM + LAN • Management interface (HTTP + SOAP) • Application execution environment • Application • File • DB2/Oracle/SQL • Notes/Exchange/TeamServer • SAP/Siebel/… • Quickbooks/Tivo/PC… • Stack shown: Applications; Services; DBMS; RPC, ...; File System; LAN driver; Disk driver; OS Kernel

  11. Implications • Conventional: Offload device handling to NIC/HBA; higher level protocols: I2O, NASD, VIA, IP, TCP…; SMP and Cluster parallelism is important. • Radical: Move app to NIC/device controller; higher-higher level protocols: SOAP/DCOM/RMI..; Cluster parallelism is VERY important. • Diagram labels: Terabyte/s Backplane, Central Processor & Memory

  12. Intermediate Step: Shared Logic • Brick with 8-12 disk drives • 200 mips/arm (or more) • 2 x Gbps Ethernet • General purpose OS • 10k$/TB to 50k$/TB • Shared • Sheet metal • Power • Support/Config • Security • Network ports • These bricks could run applications (e.g. SQL or Mail or..) • Examples shown: Snap ~1TB 12x80GB NAS; NetApp ~.5TB 8x70GB NAS; Maxtor ~2TB 12x160GB NAS

  13. Example • Homogeneous machines lead to quick response through reallocation • HP desktop machines, 320MB RAM, 3u high, 4 100GB IDE drives • $4k/TB (street) • 2.5 processors/TB, 1GB RAM/TB • JIT storage & processing: 3 weeks from order to deploy • Slide courtesy of Brewster Kahle, @ Archive.org

  14. What if Disk Replaces Tape? How does it work? • Backup/Restore • RAID (among the federation) • Snapshot copies (in most OSs) • Remote replicas (standard in DBMS and FS) • Archive • Use “cold” 95% of disk space • Interchange • Send computers not disks.

  15. It’s Hard to Archive a Petabyte: It takes a LONG time to restore it. • At 1GBps it takes 12 days! • Store it in two (or more) places online: a geo-plex • Scrub it continuously (look for errors) • On failure, • use other copy until failure repaired, • refresh lost copy from safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
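The restore-time figure and the scrub-and-refresh idea on this slide can be sketched as follows. This is only an illustration under stated assumptions: blocks are held in plain dictionaries, SHA-256 checksums recorded at write time stand in for whatever error detection a real system uses, and the names `restore_days`, `checksum`, and `scrub` are hypothetical.

```python
import hashlib


def restore_days(size_bytes: float, bytes_per_s: float) -> float:
    """Days needed to copy an archive back at a sustained rate."""
    return size_bytes / bytes_per_s / 86400


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def scrub(primary: dict, replica: dict, expected: dict) -> list:
    """Check both copies against recorded checksums and refresh a bad block
    from the copy that still verifies. Returns (where, block_id) pairs."""
    actions = []
    for block_id, digest in expected.items():
        ok_primary = checksum(primary.get(block_id, b"")) == digest
        ok_replica = checksum(replica.get(block_id, b"")) == digest
        if ok_primary and not ok_replica:
            replica[block_id] = primary[block_id]
            actions.append(("replica", block_id))
        elif ok_replica and not ok_primary:
            primary[block_id] = replica[block_id]
            actions.append(("primary", block_id))
        elif not ok_primary and not ok_replica:
            actions.append(("lost", block_id))  # both copies bad: cannot repair
    return actions


print(f"1 PB at 1 GBps: {restore_days(1e15, 1e9):.1f} days")
```

A petabyte at 1 GBps takes roughly 11.6 days by this arithmetic, consistent with the slide's "12 days". In a geo-plex the two dictionaries would be separate sites, and the scrub would run continuously rather than once.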

  16. Archive to Disk: 100TB for 0.5M$ + 1.5 “free” petabytes • If you have 100 TB active you need 10,000 mirrored disk arms (see TPC-C) • So you have 1.6 PB of (mirrored) storage (160GB drives) • Use the “empty” 95% for archive storage. • No extra space or extra power cost. • Very fast access (milliseconds vs hours). • Snapshot is read-only (software enforced) • Makes Admin easy (saves people costs)
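The "free petabytes" on this slide are simple arithmetic on the arm count. A small sketch, assuming the 1.6 PB figure is raw capacity across all 10,000 drives and using decimal units; the function name and keys are illustrative.

```python
def archive_capacity(arms: int, drive_gb: int, active_tb: int) -> dict:
    """Raw capacity bought for the sake of disk arms, and what is left over."""
    raw_tb = arms * drive_gb // 1000  # GB -> TB, decimal units
    free_tb = raw_tb - active_tb
    return {
        "raw_tb": raw_tb,
        "free_tb": free_tb,
        "empty_fraction": free_tb / raw_tb,
    }


cap = archive_capacity(arms=10_000, drive_gb=160, active_tb=100)
print(cap)
```

With the slide's inputs this gives 1,600 TB raw and 1,500 TB left over, so about 94% of the space is empty, which the slide rounds to 95%.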

  17. Disk as Tape Archive • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive • Using removable hard drives to replace tape’s function has been successful • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. • Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. • Slide courtesy of Brewster Kahle, @ Archive.org

  18. Disk as Tape Interchange • Tape interchange is frustrating (often unreadable) • Beyond 1-10 GB send media not data • FTP takes too long (hour/GB) • Bandwidth still very expensive (1$/GB) • Writing DVD not much faster than Internet • New technology could change this • 100 GB DVD @ 10MBps would be competitive. • Write 1TB disk in 2.5 hrs (at 100MBps) • But, how does interchange work?
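The "send media not data" rule of thumb can be explored with the slide's own rates: about an hour per GB over FTP, and 100 MBps to write a disk. The courier time below is a made-up parameter (24 hours), not a figure from the talk, and the break-even point moves with it; the sketch also ignores the time to read the disk at the far end.

```python
def ftp_hours(size_gb: float, hours_per_gb: float = 1.0) -> float:
    """Network transfer time at the slide's rough FTP rate."""
    return size_gb * hours_per_gb


def ship_hours(size_gb: float, write_mb_per_s: float = 100.0,
               courier_hours: float = 24.0) -> float:
    """Time to write a disk and ship it. courier_hours is an assumption."""
    write_hours = size_gb * 1000 / write_mb_per_s / 3600
    return write_hours + courier_hours


for size_gb in (1, 10, 100, 1000):
    net, ship = ftp_hours(size_gb), ship_hours(size_gb)
    choice = "ship" if ship < net else "network"
    print(f"{size_gb:>5} GB: network {net:7.1f} h, ship {ship:5.1f} h -> {choice}")
```

Under these assumptions the network wins for a few GB and shipping wins well before 100 GB; a faster courier or a slower link pulls the crossover down toward the slide's 1-10 GB range.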

  19. Disk As Tape Interchange: What format? • Today I send 160GB NTFS/SQL disks. • But that is not a good format for Linux/DB2 users. • Solution: Ship NFS/CIFS/ODBC servers (not disks) • Plug “disk” into LAN. • DHCP then file or DB server via standard interface. • “pull” data from server.

  20. Some Questions • What is the product? • How do I manage 10,000 nodes (disks)? • How do I program 10,000 nodes (disks)? • How does RAID work? • How do I back up a PB? • How do I restore a PB?

  21. What is the Product? • Concept: Plug it in and it works! • Music/Video/Photo appliance (home) • Game appliance • “PC” • File server appliance • Data archive/interchange appliance • Web server appliance • DB server • eMail appliance • Application appliance • Diagram labels: network, power

  22. How Does Scale Out Work? • Files: well known designs: • rooted tree partitioned across nodes • Automatic cooling (migration) • Mirrors or Chained declustering • Snapshots for backup/archive • Databases: well known designs • Partitioning, remote replication similar to files • distributed query processing. • Applications: (hypothetical) • Must be designed as mobile objects • Middleware provides object migration system • Objects externalize methods to migrate ( == backup/restore/archive) • Web services seem to have key ideas (xml representation) • Example: eMail object is mailbox
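Of the file designs listed on this slide, chained declustering is easy to show in a few lines: partition i keeps its primary copy on node i and its backup on the next node in the chain, so a single failure spreads work to a neighbour instead of losing data. The sketch below only covers placement and failover lookup; the function names are illustrative, and real systems also rebalance read load along the chain.

```python
def chained_declustering(num_nodes: int) -> dict:
    """Map partition i to (primary node, backup node) = (i, (i + 1) mod N)."""
    return {p: (p, (p + 1) % num_nodes) for p in range(num_nodes)}


def serving_node(partition: int, placement: dict, failed: set):
    """Pick a live node holding the partition, or None if both copies are down."""
    primary, backup = placement[partition]
    if primary not in failed:
        return primary
    if backup not in failed:
        return backup
    return None


placement = chained_declustering(4)
print(placement)
print(serving_node(2, placement, failed={2}))
```

Data is lost only when two adjacent nodes in the chain fail together, which is why the slide can pair this layout with snapshots and remote replicas rather than tape.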

  23. Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 50k$ today, 5k$ in a few years. • Admin cost >> storage cost !!!! • Challenge: • Automate ALL storage admin tasks
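To see why admin cost dominates, the sketch below compares hardware cost per TB (from the slide: 5 TB for 50k$ today, 5k$ in a few years) with people cost per TB at one DataAdmin per 5 TB. The salary is a hypothetical placeholder, not a figure from the talk; substitute a real loaded cost to redo the comparison.

```python
def cost_per_tb(total_dollars: float, terabytes: float) -> float:
    return total_dollars / terabytes


ADMIN_SALARY_PER_YEAR = 100_000  # hypothetical loaded cost, NOT from the talk
TB_PER_ADMIN = 5                 # slide: one DataAdmin per 5 TB

admin_per_tb_year = cost_per_tb(ADMIN_SALARY_PER_YEAR, TB_PER_ADMIN)
hardware_today = cost_per_tb(50_000, 5)  # slide: 5 TB is 50k$ today
hardware_later = cost_per_tb(5_000, 5)   # slide: 5k$ in a few years

print(f"admin:    {admin_per_tb_year:,.0f} $/TB/year")
print(f"hardware: {hardware_today:,.0f} $/TB today, {hardware_later:,.0f} $/TB later")
print(f"admin/hardware ratio later: {admin_per_tb_year / hardware_later:.0f}x per year")
```

Hardware is bought once while the admin cost recurs every year, so even a much lower salary assumption leaves people cost well above storage cost once prices drop.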

  24. Admin: TB and “guessed” $/TB (does not include cost of application, overhead, not “substance”) • Google: 1 : 100TB, 5k$/TB/y • Yahoo!: 1 : 50TB, 20k$/TB/y • DB: 1 : 5TB, 60k$/TB/y • Wall St.: 1 : 1TB, 400k$/TB/y (reported) • Hardware dominant cost only @ Google. • How can we waste hardware to save people cost?

  25. How do I manage 10,000 nodes? • You can’t manage 10,000 x (for any x). • They manage themselves. • You manage exceptional exceptions. • Auto Manage • Plug & Play hardware • Auto-load balance & placement storage & processing • Simple parallel programming model • Fault masking

  26. How do I program 10,000 nodes? • You can’t program 10,000 x (for any x). • They program themselves. • You write embarrassingly parallel programs • Examples: SQL, Web, Google, Inktomi, HotMail,…. • PVM and MPI prove it must be automatic (unless you have a PhD)! • Auto Parallelism is ESSENTIAL
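An embarrassingly parallel program in the sense of this slide needs no coordination between nodes: each node scans its own partition and only small results are combined. A minimal sketch, assuming threads stand in for storage nodes and in-memory lists stand in for each node's local data; the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def local_count(partition, predicate) -> int:
    """Work done entirely on one node: scan local records and count matches."""
    return sum(1 for record in partition if predicate(record))


def parallel_count(partitions, predicate, max_workers: int = 4) -> int:
    """Run the same scan on every partition and add up the small results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(lambda part: local_count(part, predicate), partitions))


partitions = [[1, 2, 3], [4, 5], [6]]  # one list per "node"
print(parallel_count(partitions, lambda x: x % 2 == 0))
```

The caller never says which node does what, which is the point of the slide: the same shape underlies a partitioned SQL aggregate or a sharded web search, with the system rather than the programmer placing the work.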

  27. Summary • Disks will become supercomputers, so: • Lots of computing to optimize the arm • Can put app close to the data (better modularity, locality) • Storage appliances (self-organizing) • The arm/capacity tradeoff: “waste” space to save access. • Compression (saves bandwidth) • Mirrors • Online backup/restore • Online archive (vault to other drives or geoplex if possible) • Not disks replace tapes: storage appliances replace tapes. • Self-organizing storage servers (file systems) (prototypes of this software exist)
