Storage Bricks Jim Gray, Microsoft Research http://research.microsoft.com/~Gray/talks FAST 2002, Monterey, CA, 29 Jan 2002 Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
First Disk 1956 • IBM 305 RAMAC • 4 MB • 50 x 24” disks • 1200 rpm • 100 ms access • 35k$/y rent • Included computer & accounting software (tubes, not transistors)
10 years later [photo: a disk unit, 1.6 meters tall]
Disk Evolution • Capacity: 100x in 10 years • 1 TB 3.5” drive in 2005 • 20 GB 1” micro-drive • System on a chip • High-speed SAN • Disk replacing tape • Disk is super computer! [sidebar: the capacity scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
Disks are becoming computers • Smart drives • Camera with micro-drive • Replay / Tivo / Ultimate TV • Phone with micro-drive • MP3 players • Tablet • Xbox • Many more… [diagram: the drive’s software stack: Applications (Web, DBMS, Files), OS, Disk Ctlr + 1 GHz CPU + 1 GB RAM; Comm: Infiniband, Ethernet, radio…]
Data Gravity: Processing Moves to Transducers (smart displays, microphones, printers, NICs, disks) • Processing decentralized • Moving to data sources • Moving to power sources • Moving to sheet metal • ? The end of computers ? [diagram: Storage, Network, and Display devices, each built around an ASIC; today: P = 50 mips, M = 2 MB; in a few years: P = 500 mips, M = 256 MB]
It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get: • several network interfaces • a Postscript engine • CPU, • memory, • software, • a spooler (soon) • and… a print engine.
The Absurd Design? • Segregate processing from storage • Poor locality • Much useless data movement • Amdahl’s laws (checked in the sketch below): • bus: 10 B/ips • I/O: 1 b/ips [diagram: Processors (~1 Tips) linked at 10 TBps to RAM (~1 TB), linked at 100 GBps to Disks (~100 TB)]
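To make the balance argument concrete, here is a minimal back-of-envelope check (the rules of thumb and diagram figures are the slide’s; the Python is mine):

```python
# Amdahl's balanced-system rules of thumb applied to the design above:
# ~10 bytes of memory-bus bandwidth per instruction/sec,
# ~1 bit of I/O per instruction/sec.

ips = 1e12                        # ~1 Tips of aggregate processing

bus_needed = 10 * ips             # bytes/s of bus bandwidth
io_needed = 1 * ips / 8           # bytes/s of I/O bandwidth

print(f"bus needed: {bus_needed / 1e12:.0f} TB/s (diagram: 10 TBps)")
print(f"I/O needed: {io_needed / 1e9:.0f} GB/s (diagram: 100 GBps)")
# The design meets Amdahl's ratios -- but only by building a 10 TB/s
# backplane and hauling every byte across it: hence "absurd".
```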
The “Absurd” Disk • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape! (arithmetic below) • Optimizations: • Reduce management costs • Caching • Sequential 100x faster than random [picture: a 200$, 1 TB, 100 MB/s, 200 Kaps drive]
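The scan-time and coldness figures follow directly from the drive parameters; a quick sketch of the arithmetic (slide’s numbers, my code):

```python
# How tape-like is a 1 TB drive? (slide's figures: 100 MB/s, ~200 Kaps)
capacity_gb = 1000
seq_mbps = 100                  # sequential transfer, MB/s
aps = 200                       # ~200 Kaps: KB-sized accesses per second

scan_hours = capacity_gb * 1000 / seq_mbps / 3600
print(f"full scan: {scan_hours:.1f} hours")             # ~2.8 h -> the slide's ~2.5 hr
print(f"access density: 1 aps per {capacity_gb / aps:.0f} GB")  # 1 aps / 5 GB
```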
Disk = Node • magnetic storage (1 TB) • processor + RAM + LAN • Management interface (HTTP + SOAP) • Application execution environment • Applications: • File • DB2/Oracle/SQL • Notes/Exchange/TeamServer • SAP/Siebel/… • Quickbooks / Tivo / PC… [diagram: the node’s software stack: Applications, Services, DBMS, RPC…, File System, LAN driver, Disk driver, OS Kernel]
Implications • Conventional: • Offload device handling to NIC/HBA • Higher-level protocols: I2O, NASD, VIA, IP, TCP… • SMP and cluster parallelism is important. • Radical: • Move the app to the NIC/device controller • Higher-higher-level protocols: SOAP/DCOM/RMI… • Cluster parallelism is VERY important. [diagram: central processor & memory with a terabyte/s backplane]
Intermediate Step: Shared Logic • Brick with 8-12 disk drives • 200 mips/arm (or more) • 2 x Gbps Ethernet • General-purpose OS • 10k$/TB to 50k$/TB • Shared: • Sheet metal • Power • Support/Config • Security • Network ports • These bricks could run applications (e.g. SQL or Mail or…) [pictures: Snap ~1 TB 12x80 GB NAS; NetApp ~0.5 TB 8x70 GB NAS; Maxtor ~2 TB 12x160 GB NAS]
Example • Homogeneous machines lead to quick response through reallocation • HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives • $4k/TB (street) • 2.5 processors/TB, 1 GB RAM/TB • JIT storage & processing: 3 weeks from order to deploy Slide courtesy of Brewster Kahle, @ Archive.org
What if Disk Replaces Tape? How does it work? • Backup/Restore: • RAID (among the federation) • Snapshot copies (in most OSs) • Remote replicas (standard in DBMS and FS) • Archive: • Use the “cold” 95% of disk space • Interchange: • Send computers, not disks.
It’s Hard to Archive a Petabyte. It takes a LONG time to restore it. • At 1 GBps it takes 12 days! (arithmetic below) • Store it in two (or more) places online: a geo-plex • Scrub it continuously (look for errors) • On failure: • use the other copy until the failure is repaired, • refresh the lost copy from the safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
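The 12-day figure is simple division (my arithmetic on the slide’s numbers):

```python
# Time to move a petabyte at 1 GB/s.
petabyte = 1e15                 # bytes
rate = 1e9                      # bytes/s (1 GBps)
print(f"{petabyte / rate / 86400:.1f} days")   # ~11.6 days -> "12 days"
```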
Archive to Disk: 100 TB for 0.5M$ + 1.5 “free” petabytes • If you have 100 TB active you need 10,000 mirrored disk arms (see TPC-C) • So you have 1.6 PB of (mirrored) storage (160 GB drives; arithmetic below) • Use the “empty” 95% for archive storage. • No extra space or power cost. • Very fast access (milliseconds vs hours). • Snapshot is read-only (software enforced) • Makes admin easy (saves people costs)
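Where the 1.6 PB and the ~1.5 “free” petabytes come from, as a sketch (the 10,000-arm figure is the slide’s TPC-C-style sizing; the arithmetic is mine):

```python
# 100 TB of hot data is sized by arms (access rate), not by capacity.
arms = 10_000                   # mirrored disk arms needed for the workload
drive_gb = 160
raw_tb = arms * drive_gb / 1000            # 1,600 TB = 1.6 PB raw
active_tb = 100                            # hot data

print(f"raw capacity: {raw_tb / 1000:.1f} PB")
print(f"empty: {100 * (1 - active_tb / raw_tb):.0f}%")             # ~94% -> "95%"
print(f"free for archive: ~{(raw_tb - active_tb) / 1000:.1f} PB")  # ~1.5 PB
```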
Disk as Tape Archive • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive • Using removable hard drives to replace tape’s function has been successful • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. • Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster Kahle, @ Archive.org
Disk as Tape Interchange • Tape interchange is frustrating (often unreadable) • Beyond 1-10 GB, send media, not data • FTP takes too long (an hour/GB) • Bandwidth still very expensive (1$/GB) • Writing DVD not much faster than the Internet • New technology could change this: • 100 GB DVD @ 10 MBps would be competitive. • Write a 1 TB disk in 2.5 hrs (at 100 MBps; compared below) • But how does interchange work?
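Comparing the options for moving a terabyte (rates are the slide’s; arithmetic is mine):

```python
# Moving 1 TB: network vs. writable media vs. writing a disk to ship.
tb_in_gb = 1000

ftp_hours = tb_in_gb * 1.0                     # "hour/GB" over the net
disk_hours = tb_in_gb * 1000 / 100 / 3600      # 100 MB/s -> ~2.8 h
dvd_hours = tb_in_gb * 1000 / 10 / 3600        # 10 MB/s DVD writer -> ~28 h

print(f"FTP ~{ftp_hours:.0f} h, disk ~{disk_hours:.1f} h, 100 GB DVDs ~{dvd_hours:.0f} h")
# Plus ~1$/GB (~1,000$/TB) network cost versus overnight shipping for the drive.
```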
Disk as Tape Interchange: What Format? • Today I send 160 GB NTFS/SQL disks. • But that is not a good format for Linux/DB2 users. • Solution: ship NFS/CIFS/ODBC servers (not disks) • Plug the “disk” into the LAN. • DHCP, then a file or DB server via a standard interface. • “Pull” data from the server (sketched below).
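A minimal sketch of the “pull” step, assuming (my assumption, not the talk’s) the brick joins the LAN via DHCP and serves its contents over HTTP; the hostname, manifest, and paths are hypothetical:

```python
# Hypothetical client pulling data from an interchange brick on the LAN.
import urllib.request

BRICK = "http://brick-0042.local"       # hypothetical DHCP-assigned name

# Ask the brick what it holds, then pull each item via the standard interface.
with urllib.request.urlopen(f"{BRICK}/manifest.txt") as resp:
    names = resp.read().decode().split()

for name in names:
    with urllib.request.urlopen(f"{BRICK}/data/{name}") as src, \
         open(name, "wb") as dst:
        dst.write(src.read())           # receiver pulls; the sender never pushes
```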
Some Questions • What is the product? • How do I manage 10,000 nodes (disks)? • How do I program 10,000 nodes (disks)? • How does RAID work? • How do I backup a PB? • How do I restore a PB?
What is the Product? • Concept: Plug it in and it works! • Music/Video/Photo appliance (home) • Game appliance • “PC” • File server appliance • Data archive/interchange appliance • Web server appliance • DB server • eMail appliance • Application appliance [picture: a brick with just two connections: network and power]
How Does Scale-Out Work? • Files: well-known designs: • rooted tree partitioned across nodes • Automatic cooling (migration) • Mirrors or chained declustering (sketched below) • Snapshots for backup/archive • Databases: well-known designs: • Partitioning and remote replication, similar to files • distributed query processing. • Applications: (hypothetical) • Must be designed as mobile objects • Middleware provides the object migration system • Objects externalize methods to migrate (== backup/restore/archive) • Web services seem to have the key ideas (XML representation) • Example: the eMail object is a mailbox
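Chained declustering is the one scheme here worth spelling out; a minimal sketch of the standard placement rule (my code, not the talk’s): partition i’s primary lives on node i mod n and its backup on node (i+1) mod n, so when a node fails its read load spreads over neighbors instead of doubling one mirror’s work.

```python
# Chained declustering: primary on node i % n, backup on node (i+1) % n.
def place(partition: int, n_nodes: int) -> tuple[int, int]:
    return partition % n_nodes, (partition + 1) % n_nodes

def read_node(partition: int, n_nodes: int, failed: set[int]) -> int:
    primary, backup = place(partition, n_nodes)
    return backup if primary in failed else primary

n = 4
for p in range(8):                      # with node 2 down, its reads shift to node 3
    print(p, place(p, n), "->", read_node(p, n, failed={2}))
```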
Auto-Manage Storage • 1980 rule of thumb: • a DataAdmin per 10 GB, a SysAdmin per MIPS • 2000 rule of thumb: • a DataAdmin per 5 TB • a SysAdmin per 100 clones (varies with app). • Problem: • 5 TB is 50k$ today, 5k$ in a few years. • Admin cost >> storage cost!!!! • Challenge: • Automate ALL storage admin tasks
Admin: TB per admin and “guessed” $/TB (does not include cost of application; overhead, not “substance”) • Google: 1 : 100 TB, 5k$/TB/y • Yahoo!: 1 : 50 TB, 20k$/TB/y • DB: 1 : 5 TB, 60k$/TB/y • Wall St.: 1 : 1 TB, 400k$/TB/y (reported) • Hardware is the dominant cost only @ Google. • How can we waste hardware to save people cost?
How do I manage 10,000 nodes? • You can’t manage 10,000 x (for any x). • They manage themselves. • You manage exceptional exceptions. • Auto-manage: • Plug & play hardware • Auto load-balancing & placement of storage & processing • Simple parallel programming model • Fault masking
How do I program 10,000 nodes? • You can’t program 10,000 x (for any x). • They program themselves. • You write embarrassingly parallel programs • Examples: SQL, Web, Google, Inktomi, HotMail, … • PVM and MPI prove it must be automatic (unless you have a PhD)! • Auto-parallelism is ESSENTIAL (a minimal sketch follows)
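A minimal sketch of the embarrassingly parallel style the slide calls for (standard library pattern, my example): the programmer writes only a per-item function; the runtime decides how widely to spread it.

```python
# Embarrassingly parallel: supply f(item); the runtime does the spreading.
# A local process pool stands in for the 10,000 nodes.
from multiprocessing import Pool

def count_hits(doc: str) -> int:
    # Per-item work: no shared state, no messaging -- parallelizes "for free".
    return doc.lower().count("disk")

if __name__ == "__main__":
    docs = ["Disk is tape", "Smart disk drives", "no match here"] * 1000
    with Pool() as pool:                # the pool picks the degree of parallelism
        total = sum(pool.map(count_hits, docs))
    print(total)                        # 2000
```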
Summary • Disks will become supercomputers, so: • Lots of computing to optimize the arm • Can put the app close to the data (better modularity, locality) • Storage appliances (self-organizing) • The arm/capacity tradeoff: “waste” space to save accesses. • Compression (saves bandwidth) • Mirrors • Online backup/restore • Online archive (vault to other drives or a geoplex if possible) • Not “disks replace tapes”: storage appliances replace tapes. • Self-organizing storage servers (file systems) (prototypes of this software exist)