Knowledge is Power Remzi Arpaci-Dusseau University of Wisconsin, Madison
Systems Without Knowledge • System designers often have limited knowledge • About the applications they run • About the other systems they interact with • Result: The “curse of generality” • Missed performance optimizations • Limited functionality • Costly, too
Didacticism and Systems • How to gain knowledge? • Depends on environment • Sometimes it’s easy • A scientific application w/ cooperative developers • Sometimes it’s not • Internals of Microsoft file system
What We Do • Build systems that acquire and exploit knowledge • “Gray box” techniques • Make assumptions, probe + measure, learn something about how something works • Use knowledge to control systems in unexpected ways • Result • Increase functionality, improve performance, increase robustness and manageability too
Outline • Overview • Knowledge and its applications • Gray-box file placement • Semantically-smart disks • Scientific apps, the Grid, and I/O • Conclusions
The People • Gray-box file placement • With James Nugent, Andrea Arpaci-Dusseau • Semantically-smart disks • With Muthian Sivathanu, Vijayan Prabhakaran,Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau • Scientific apps, the Grid, and I/O • With John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny
Controlled File Placement • Typical “Unix” file system: Little control over layout • Just a simple API of open(), read(), write(), close() • Some applications want more control • e.g., a web server that knows which files are often accessed together • Usual default: Use the raw disk • Harder to manage, doesn’t integrate w/ other apps
What Might Be Better • Use normal file system • Convenience • Expose control over layout to applications • Control • Do the above without changing the file system • Can’t always change the system you’re using
PLACE • A gray-box “Information and Control Layer” (ICL) • It’s just a library • Simple API for file placement • Exposes “FFS-like” groups • Place_Creat(file, mode, groupNumber); • No changes to underlying file system
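A minimal sketch of how an application might call into the PLACE library, assuming only the Place_Creat(file, mode, groupNumber) call named on the slide; the header name, group number, and error handling are illustrative assumptions, not the actual library interface.

```c
/* Hypothetical use of the PLACE library: co-locate two related files
 * in the same on-disk group.  Place_Creat() is the call named on the
 * slide; "place.h" and the group number are assumptions. */
#include <stdio.h>
#include <unistd.h>
#include "place.h"              /* assumed header exposing Place_Creat() */

int main(void)
{
    /* A web server might know these files are accessed together,
     * so it asks PLACE to put them in the same group. */
    int fd1 = Place_Creat("/www/htdocs/index.html", 0644, 7);
    int fd2 = Place_Creat("/www/htdocs/logo.png", 0644, 7);
    if (fd1 < 0 || fd2 < 0) {
        perror("Place_Creat");
        return 1;
    }
    /* From here on these are ordinary files: read(), write(), and
     * close() go straight to the unmodified file system. */
    close(fd1);
    close(fd2);
    return 0;
}
```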
PLACE Outline • Basic operation • Gray-box knowledge • Key techniques • Assessment • Accuracy • Performance • Conclusions
Allocation Knowledge • Gray-box assumption: “FFS”-like allocation • Splits disk into numerous consecutive “groups” • Spreads directories across groups • Puts files (inodes/data) that are within the same directory into the same “group” • Many variants • Our focus: ext2 (but with other variants in mind)
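A toy model of this gray-box assumption, just to pin down what PLACE relies on; this is not real ext2 or FFS code, and the constants and helper names are made up.

```c
/* Toy model of the assumed "FFS-like" policy: new directories are
 * spread across groups, while a file lands in its parent directory's
 * group.  Purely illustrative; real allocators are more involved. */
#define NUM_GROUPS 128

/* Spread directories: pick a group for the i-th directory created. */
static int group_for_new_dir(unsigned dir_count)
{
    return dir_count % NUM_GROUPS;
}

/* Keep files with their directory: inode and data go to the parent's group. */
static int group_for_new_file(int parent_dir_group)
{
    return parent_dir_group;
}
```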
Exploiting Knowledge for Control • Key structure: Shadow Directory Tree (SDT) • (Figure: the SDT is a hidden /.H directory under /, with one shadow subdirectory per group: /.H/1, /.H/2, …, /.H/n, alongside normal directories such as /foo) • To create a file /foo/bar in group 1: • Create file /.H/1/bar • Rename /.H/1/bar to /foo/bar
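The create-then-rename trick can be sketched directly with POSIX calls; this is a simplified stand-in for what the PLACE library does internally, with illustrative path names and error handling.

```c
/* Sketch of placement via the shadow directory tree: create the file
 * inside the shadow directory for the target group, then rename it to
 * its real location.  rename() only rewrites directory entries, so the
 * inode (and its on-disk group) stays put. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int place_in_group(const char *shadow_path, const char *real_path,
                          mode_t mode)
{
    /* 1. Creating under /.H/<group>/ makes the FS allocate the inode
     *    and data in that group's region of the disk. */
    int fd = open(shadow_path, O_CREAT | O_EXCL | O_WRONLY, mode);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    /* 2. Move the name to where the application wants it. */
    if (rename(shadow_path, real_path) < 0) {
        perror("rename");
        close(fd);
        return -1;
    }
    return fd;
}

/* Example from the slide: place_in_group("/.H/1/bar", "/foo/bar", 0644); */
```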
Challenge: Building the SDT • How to ensure that the shadow directory for each group K is in the right on-disk location? • Basic approach to creating a directory in a group (repeat until it lands in the desired group; see the sketch below): • Mkdir(tmp); • If (tmp is in the desired group) • Break; • Bias(); • Point of portability: Bias() routine • Must account for different allocation algorithms
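The sketch below fleshes out that loop as a guess at its structure rather than the actual PLACE code; group_of() and bias() are placeholder names for the two FS-specific pieces (mapping a new directory to its group, and nudging the allocator toward a different group), and the cleanup of failed attempts is an assumption.

```c
/* Sketch of the mkdir/check/bias loop for building the SDT. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

extern int  group_of(const char *path);  /* assumed: infer group from inode number */
extern void bias(void);                  /* assumed: skew the next allocation */

static int make_shadow_dir(const char *tmp_path, int desired_group)
{
    for (;;) {
        if (mkdir(tmp_path, 0755) < 0) {
            perror("mkdir");
            return -1;
        }
        if (group_of(tmp_path) == desired_group)
            return 0;          /* landed in the right group: keep it */
        rmdir(tmp_path);       /* wrong group: discard this attempt */
        bias();                /* push the allocator toward another group */
    }
}
```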
Some Complications • Controlled directory placement • Similar to system initialization (hence, slow) • To speed up, use a shadow cache of directories • Crash recovery • A crash may leave junk in the SDT • A periodic sweep by an SDT cleaner fixes this • Level of control depends on the underlying FS • e.g., FFS vs. ext2 behavior for large files
Does it work? • Non-place: 250 files in 1 directory • Non-place: 250 files in 10 directories • Non-place: 250 files in 100 directories • PLACE: 250 files in 100 directories into 1 group
Performance (Small Files) • Performance of 250 200-KB file reads (random)
Performance (Big Files) • Each point: Bandwidth attained reading 100-MB file
PLACE Conclusions • PLACE: Gray-box approach to file layout • Simple and effective control over placement • Main technique: Shadow Directory Tree • Use to control placement • Construction and maintenance are keys • Controlled layout can improve performance • Micro-benchmarks • Web server and I/O parameterization (see USENIX ’03)
Outline • Overview • Knowledge and its applications • Gray-box file placement • Semantically-smart disks • Scientific apps, the Grid, and I/O • Conclusions
Semantically-Smart Disk System (SDS) • Disk system that understands the file system • Data structures • Operations • Operates underneath an unmodified FS • Must discover layout + on-disk structures • Must “reverse engineer” the block stream • Exploits knowledge and “smarts” to implement a new class of services
SDS Outline • Semantic Knowledge: Acquisition • Off-line • On-line • Semantic Knowledge: Exploitation • Case studies • Conclusions
Static Knowledge: File System Layout • (Figure: per-group on-disk layout with superblock, data bitmap, inode bitmap, inodes, and data blocks, shown for Group 1 and Group 2) • Challenge: How to discover layout information? • White-box approach: Embed knowledge in SDS • Trend: FS layout does not change frequently
Layout Discovery with EOF • EOF: Extraction Of File-systems • Tool to automatically determine layout • Uses gray-box techniques • Basic operation • Start with “soft” model of file system • Probe process (P): Initiates traffic • SDS: Monitors activity from FS • Two distinct tasks: • Classifying blocks by type • Identifying fields within an inode • Result: “Hardened” model of file system structures + fields
EOF: More Details • Multi-phase procedure: • Bootstrap: Summary blocks • Data/data bitmaps • Inodes/inode bitmaps • Inode fields, directory entries • Key techniques • Known patterns: Data blocks • Isolation: Know all but one block, one block must be… • Assertions: Check assumptions at each step
EOF: Simplified Example • Create file: Touches many data structures • Directory data, directory inode, file data (known pattern),file inode, data bitmap, inode bitmap • Reset to beginning of file, write block again • File data (known pattern), file inode • Now, can classify inode block (isolation) • Assertion: only two blocks observed
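A sketch of the probe side of this example, assuming the probe runs against a scratch file system mounted at /mnt/probe (the path, pattern value, and use of O_SYNC are assumptions); the SDS-side monitor that actually does the classification is not shown.

```c
/* Probe traffic for the simplified EOF example: write a known pattern,
 * then rewrite the same block so that only the file data (recognized by
 * its pattern) and the file inode reach the disk, isolating the inode
 * block. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    char pattern[BLOCK_SIZE];
    memset(pattern, 0xAB, sizeof(pattern));  /* known pattern marks data blocks */

    /* Step 1: create a file and write one block.  This touches directory
     * data, directory inode, file data, file inode, and both bitmaps. */
    int fd = open("/mnt/probe/eof_probe", O_CREAT | O_WRONLY | O_SYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, pattern, sizeof(pattern)) != sizeof(pattern))
        perror("write");

    /* Step 2: seek back and rewrite the same block.  Now only the file
     * data (known pattern) and the file inode are written, so the one
     * unidentified block must be the inode block (isolation). */
    lseek(fd, 0, SEEK_SET);
    if (write(fd, pattern, sizeof(pattern)) != sizeof(pattern))
        perror("write");

    close(fd);
    return 0;
}
```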
EOF: Overhead and Summary • Performance: A few minutes per GB • Probably OK, only done “once” per new file system • Scales well with faster disks (sequential bandwidth) • Limitations: “FFS”-like file systems (ext2/3, BSD FFS)
Have Knowledge, Will Innovate • Knowing structures is not enough (sometimes) • Data block overloading (data, pointer, directory) • High-level operations not known (create, delete) • Requires new on-line techniques • Direct classification • Indirect classification • Block association • Operation inferencing
A Simple Example: Smarter Caching • Modern RAID may have a significant cache • Volatile (DRAM) • Non-volatile (NVRAM) • How to exploit semantic information to cache more intelligently?
Storing Meta-Data in NVRAM • Start with simple meta-data: inodes, bitmaps, etc. • Good for meta-data-intensive workloads • (Figure: a mixed block stream of data, bitmaps, inodes, and superblock, with the meta-data blocks held in the NVRAM cache)
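A minimal sketch of that caching decision inside the SDS, assuming a classifier hook and a generic cache interface (none of these names come from the actual prototype); a bounds-check classifier is sketched after the next slide.

```c
/* Semantic caching policy sketch: blocks classified as meta-data are
 * kept in NVRAM, ordinary data blocks in the volatile cache. */
#include <stdbool.h>

struct cache;                                           /* opaque cache handle */
extern bool is_metadata(unsigned long blk);             /* assumed classifier  */
extern void cache_insert(struct cache *c, unsigned long blk, const void *buf);

void sds_cache_block(struct cache *nvram, struct cache *dram,
                     unsigned long blk, const void *buf)
{
    if (is_metadata(blk))
        cache_insert(nvram, blk, buf);  /* meta-data: survives power loss */
    else
        cache_insert(dram, blk, buf);   /* data: normal volatile caching */
}
```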
Direct Classification • Given an address, determine type directly • Direct classification via bounds check • Given a disk address, can check bounds to determine its type (superblock, bitmaps, inodes, general data block)
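A sketch of that bounds check, assuming an ext2-style per-group layout (superblock/descriptors, then data bitmap, inode bitmap, inode table, then data blocks); the constants here are illustrative and would really come from the layout that EOF extracted.

```c
/* Direct classification sketch: map a disk block address to a type by
 * checking its offset within its group against the (extracted) layout. */
enum block_type { BT_SUPER, BT_DBITMAP, BT_IBITMAP, BT_INODE, BT_DATA };

#define BLOCKS_PER_GROUP    8192   /* illustrative layout constants */
#define SUPER_BLOCKS        2      /* superblock + group descriptors */
#define INODE_TABLE_BLOCKS  214

enum block_type classify_block(unsigned long blk)
{
    unsigned long off = blk % BLOCKS_PER_GROUP;   /* offset within the group */

    if (off < SUPER_BLOCKS)
        return BT_SUPER;
    if (off == SUPER_BLOCKS)
        return BT_DBITMAP;
    if (off == SUPER_BLOCKS + 1)
        return BT_IBITMAP;
    if (off < SUPER_BLOCKS + 2 + INODE_TABLE_BLOCKS)
        return BT_INODE;
    return BT_DATA;
}
```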
Getting Rid Of The Dead • If file blocks are deleted, remove them from cache • No need to keep dead blocks around • Problem: How to determine if a file is deleted? • Need to look for signs of deletion • Three different places to look: • Inode bitmaps • Directory that contains file • Inode itself • Operation inferencing via block differencing
Operation Inferencing: Detecting Deletes (Inode Bitmap) • (Figure: when an inode-bitmap block is written, the SDS reads the old version from its cache and diffs it against the new block; the diff result reveals the deleted files)
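A sketch of the bitmap diff itself, under the assumption that the SDS keeps the old copy of each inode bitmap and is handed the new copy when the FS writes it; the bitmap size and the eviction hook are made-up names.

```c
/* Operation inferencing by block differencing: bits that flip from 1 to
 * 0 between the old and new inode bitmap mark freed (deleted) inodes,
 * whose cached blocks can then be evicted. */
#include <stddef.h>
#include <stdint.h>

#define BITMAP_BYTES 1024                 /* one 8192-bit inode bitmap (assumed) */

extern void evict_inode(unsigned long group, unsigned long inode_bit);  /* assumed hook */

void diff_inode_bitmap(unsigned long group,
                       const uint8_t *old_bm, const uint8_t *new_bm)
{
    for (size_t i = 0; i < BITMAP_BYTES; i++) {
        uint8_t cleared = old_bm[i] & (uint8_t)~new_bm[i];  /* 1 -> 0 transitions */
        for (int b = 0; b < 8; b++)
            if (cleared & (1u << b))
                evict_inode(group, 8 * i + b);   /* this inode was deleted */
    }
}
```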
Operation Inferencing: Overheads • Space overhead • Block cache of inodes, indirect pointers, bitmaps, etc.(could be substantial) • Time overhead • CPU: Difference operation is like an extra copy • Disk: May require block read (if small/no cache) • [In paper: Quantified time and space overheads] • Main point: There is a CPU and memory cost
Experimental Set-up • Problem: Don’t have SDS hardware to use (yet!) • “Cost-effective” alternative: • Software prototype • Insert a driver underneath the FS • Much like software RAID • Good because… • Traffic stream is similar • Bad because… • CPU, memory not isolated from host
Fast RAID Reconstruction • Observe: When reconstructing data onto a hot spare, no need to reconstruct data that isn’t live • Trend: Less live data in performance-sensitive I/O systems • Question: How can we perform reconstruction quickly? • (Figure: a mirrored pair with a hot spare)
Traditional Approaches • Why not in the file system? • The file system doesn’t know what RAID is • Why not in the storage system? • The RAID doesn’t know which blocks are live (at best, it knows a block that has never been written cannot be live)
The Semantic Way • Easy: Scan disk, only copy live blocks • Key piece of knowledge: Bitmaps • Plus, need to watch for “unmapped” writes • Optionally, can copy “dead” blocks later • Useful if SDS doesn’t feel “sure” about its knowledge • Guaranteed correct with prioritized recovery
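A sketch of this reconstruction loop, with the bitmap and rebuild interfaces standing in for the prototype's internals (all three helper names are assumptions):

```c
/* Semantically-guided reconstruction: rebuild live blocks first, then
 * optionally copy the "dead" blocks as a safety net, which keeps the
 * result correct even if the semantic knowledge was stale. */
#include <stdbool.h>

extern unsigned long total_blocks(void);
extern bool block_is_live(unsigned long blk);   /* from the cached data bitmaps */
extern void rebuild_block(unsigned long blk);   /* reconstruct onto the hot spare */

void fast_reconstruct(bool copy_dead_later)
{
    unsigned long n = total_blocks();

    /* Pass 1: prioritized recovery of live data. */
    for (unsigned long blk = 0; blk < n; blk++)
        if (block_is_live(blk))
            rebuild_block(blk);

    /* Pass 2 (optional): copy dead blocks too, in case the SDS is not
     * "sure" about its knowledge. */
    if (copy_dead_later)
        for (unsigned long blk = 0; blk < n; blk++)
            if (!block_is_live(blk))
                rebuild_block(blk);
}
```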
Fast Reconstruction: A Graph • (Graph: reconstruction time on RAID-5 with IBM disks) • Fast reconstruction: Less live data -> less time • How data is spread across the disk affects recovery time
Semantic Conclusions • Innovation in traditional storage stack is limited • File system: high but not low-level info • Storage system: low but not high-level info • Semantically-smart disks: Best of both worlds? • Takes advantage of “smart” disk systems • Exploit low-level information… • …with high-level knowledge of file system • A remaining challenge • Overcoming the “file system obfuscation” problem
Outline • Overview • Knowledge and its applications • Gray-box file placement • Semantically-smart disks • Scientific apps, the Grid, and I/O • Conclusions
Trends in Scientific Computing • What constitutes a job is increasingly complex • Not your simple process anymore • Data demands increasing • Not just cycles anymore • Wide-area collaboration • “Grids” facilitate sharing
The Question • How to run scientific workloads on the WAN? • (Figure: a home site and a remote site connected by a WAN)
Scientific Outline • Typical “scientific” jobs • Structure • Properties • Migratory file services • Components • Performance • Conclusions
First Things First • Study of modern scientific applications • A “measure then build” approach • Suite of six applications • BLAST: Searches genomic databases for matching proteins • IBIS: Global-scale simulation of earth systems • CMS: High-energy physics testing software • Nautilus: Simulation of molecular dynamics • Messkit Hartree-Fock: Simulation of atomic interactions • AMANDA: Astrophysics simulation of cosmic events
An Example: AMANDA • (Figure: the AMANDA pipeline, with per-stage runtimes of 3601s, 955s, 42s, and 2188s, and data sets ranging from 4 KB to 505 MB) • A single “job” is a multi-process pipeline -> batch pipelined • Each process is a blue circle • There are many types of I/O • Endpoint (red): unique input/output of pipeline • Pipeline private (green): shared between pipe processes • Batch shared (yellow): shared across all pipes in batch
Some Things We Learned • Demands of a single pipeline are modest • Modern PC with disk can handle demand • Aggregation of I/O could be harder (WAN) • Lots of sharing of data within and across pipelines • Systems should (have to?) take advantage of this