
Knowledge is Power


Presentation Transcript


  1. Knowledge is Power Remzi Arpaci-Dusseau University of Wisconsin, Madison

  2. Systems Without Knowledge • System designers often have limited knowledge • About the applications they run • About the other systems they interact with • Result: The “curse of generality” • Missed performance optimizations • Limited functionality • Costly, too

  3. Didacticism and Systems • How to gain knowledge? • Depends on environment • Sometimes it’s easy • A scientific application w/ cooperative developers • Sometimes it’s not • Internals of Microsoft file system

  4. What We Do • Build systems that acquire and exploit knowledge • “Gray box” techniques • Make assumptions, probe + measure, learn something about how the system works • Use knowledge to control systems in unexpected ways • Result • Increase functionality, improve performance, increase robustness and manageability too

  5. Outline • Overview • Knowledge and its applications • Gray-box file placement • Semantically-smart disks • Scientific apps, the Grid, and I/O • Conclusions

  6. The People • Gray-box file placement • With James Nugent, Andrea Arpaci-Dusseau • Semantically-smart disks • With Muthian Sivathanu, Vijayan Prabhakaran,Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau • Scientific apps, the Grid, and I/O • With John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny

  7. Gray-box Control over File Placement

  8. Controlled File Placement • Typical “Unix” file system: Little control over layout • Just a simple API of open(), read(), write(), close() • Some applications want more control • e.g., a web server that knows which files are often accessed together • Usual default: Use the raw disk • Harder to manage, doesn’t integrate w/ other apps

  9. What Might Be Better • Use normal file system • Convenience • Expose control over layout to applications • Control • Do the above without changing the file system • Can’t always change the system you’re using

  10. PLACE • A gray-box “Information and Control Layer” (ICL) • It’s just a library • Simple API for file placement • Exposes “FFS-like” groups • Place_Creat(file, mode, groupNumber); • No changes to underlying file system
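
A minimal sketch of calling the PLACE API named above. Place_Creat(file, mode, groupNumber) is the call on the slide; its C prototype is inferred here, and the group number and helper name are illustrative assumptions.

    /* Sketch: creating a file in a chosen group via the PLACE library.
     * The Place_Creat() prototype is inferred from the slide; HOT_GROUP
     * and create_hot_file() are hypothetical. */
    #include <sys/types.h>

    extern int Place_Creat(const char *path, mode_t mode, int groupNumber);

    #define HOT_GROUP 1   /* group chosen for files often accessed together */

    int create_hot_file(const char *path)
    {
        /* Like creat(), but asks PLACE to allocate the file's inode and
         * data blocks in HOT_GROUP, so files created with the same group
         * number land near each other on disk. */
        return Place_Creat(path, 0644, HOT_GROUP);
    }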

  11. PLACE Outline • Basic operation • Gray-box knowledge • Key techniques • Assessment • Accuracy • Performance • Conclusions

  12. Allocation Knowledge • Gray-box assumption: “FFS”-like allocation • Splits disk into numerous consecutive “groups” • Spreads directories across groups • Puts files (inodes/data) that are within same directory into same “group” • Many variants • Our focus: ext2 (but with other variants in mind)
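
To make the gray-box assumption concrete, a toy rendering of the assumed policy follows; it is not ext2's actual allocator, and the group count and function names are invented for illustration.

    /* Toy sketch of an "FFS-like" allocation policy (not real ext2 code):
     * spread new directories across groups, keep each file in its parent
     * directory's group. */
    #define NGROUPS 128                         /* illustrative group count */

    static int next_dir_group;

    int choose_group(int is_directory, int parent_group)
    {
        if (is_directory)
            return next_dir_group++ % NGROUPS;  /* spread directories out */
        return parent_group;                    /* co-locate files with their directory */
    }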

  13. Exploiting Knowledge for Control • Key structure: Shadow Directory Tree (SDT) • [Figure: directory tree rooted at /, with foo/ alongside a shadow tree .H/ containing one directory per group: 1/, 2/, …, n/] • To create a file /foo/bar in group 1: • Create file /.H/1/bar • Rename /.H/1/bar to /foo/bar
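
The create-then-rename step can be sketched with ordinary POSIX calls; the paths follow the slide's example, and error handling is kept minimal.

    /* Sketch: create /foo/bar so that it lands in group 1.  The file is
     * first created inside the shadow directory for group 1 (so the FS
     * allocates its inode and data there), then renamed into place.
     * rename() only rewrites directory entries, so the on-disk placement
     * is preserved. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int place_create_bar(void)
    {
        int fd = open("/.H/1/bar", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;
        close(fd);
        return rename("/.H/1/bar", "/foo/bar");
    }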

  14. Challenge: Building the SDT • How to ensure that the shadow directory for each group K is in the right on-disk location? • Basic approach to creating a directory in a group: • Repeat: Mkdir(tmp); If (tmp is in the desired group) Break; Bias(); • Point of portability: Bias() routine • Must account for different allocation algorithms
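
A C-style rendering of the loop above, with hypothetical helpers: group_of() maps a path to its on-disk group (on ext2 this can be derived from the inode number), and bias() perturbs the allocator so the next attempt lands elsewhere.

    /* Sketch of SDT construction: keep creating temporary directories
     * until one lands in the desired group, nudging the allocator with
     * bias() between attempts.  group_of() and bias() are placeholders;
     * bias() must be tailored to the file system's allocation algorithm. */
    #include <stdio.h>
    #include <sys/stat.h>

    extern int  group_of(const char *path);   /* hypothetical: group containing path */
    extern void bias(void);                   /* hypothetical: perturb allocator state */

    int make_shadow_dir(int group, char *out, size_t len)
    {
        for (int attempt = 0; ; attempt++) {
            snprintf(out, len, "/.H/tmp-%d", attempt);
            if (mkdir(out, 0755) < 0)
                return -1;
            if (group_of(out) == group)
                return 0;          /* this directory becomes /.H/<group> */
            bias();                /* steer the next mkdir toward another group */
        }
    }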

  15. Some Complications • Controlled directory placement • Similar to system initialization (hence, slow) • To speed up, use a shadow cache of directories • Crash recovery • A crash may leave junk in the SDT • A periodic sweep of the SDT by a cleaner fixes this • Level of control depends on underlying FS • e.g., FFS vs. ext2 behavior for large files

  16. Assessment

  17. Does it work? • Non-place: 250 files in 1 directory • Non-place: 250 files in 10 directories • Non-place: 250 files in 100 directories • PLACE: 250 files in 100 directories into 1 group

  18. Performance (Small Files) • Performance of 250 200-KB file reads (random)

  19. Performance (Big Files) • Each point: Bandwidth attained reading 100-MB file

  20. PLACE Conclusions • PLACE: Gray-box approach to file layout • Simple and effective control over placement • Main technique: Shadow Directory Tree • Use to control placement • Construction and maintenance are key • Controlled layout can improve performance • Micro-benchmarks • Web server and I/O parameterization (see USENIX ‘03)

  21. Outline • Overview • Knowledge and its applications • Gray-box file placement • Semantically-smart disks • Scientific apps, the Grid, and I/O • Conclusions

  22. Semantically-smart Disk Systems

  23. Semantically-Smart Disk System (SDS) • Disk system that understands file system • Data structures • Operations • Operates underneath unmodified FS • Must discover layout + on-disk structures • Must “reverse engineer” block stream • Exploits knowledge and “smarts” to implement new class of services

  24. SDS Outline • Semantic Knowledge: Acquisition • Off-line • On-line • Semantic Knowledge: Exploitation • Case studies • Conclusions

  25. Static Knowledge: File System Layout • [Figure: on-disk layout of Group 1 and Group 2 — superblock, inode bitmap, data bitmap, inode table, and data blocks] • Challenge: How to discover layout information? • White-box approach: Embed knowledge in SDS • Trend: FS layout does not change frequently

  26. Layout Discovery with EOF • EOF: Extraction Of File-systems • Tool to automatically determine layout • Uses gray-box techniques • Basic operation • Start with “soft” model of file system • Probe process (P): Initiates traffic • SDS: Monitors activity from FS • Two distinct tasks: • Classifying blocks by type • Identifying fields within an inode • Result: “Hardened” model of file system structures + fields
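
The probe side of EOF can be sketched as follows; the file name, pattern byte, and block count are illustrative, and the real tool issues many such probes.

    /* Sketch: the user-level probe writes a file filled with a known
     * pattern so the SDS, watching the block stream beneath the FS, can
     * pick out the file's data blocks among the other traffic. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    void probe_write_pattern(const char *path, int nblocks)
    {
        char buf[BLOCK_SIZE];
        memset(buf, 0xAB, sizeof(buf));        /* the "known pattern" */
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return;
        for (int i = 0; i < nblocks; i++)
            write(fd, buf, sizeof(buf));
        fsync(fd);                             /* force the traffic down to the SDS */
        close(fd);
    }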

  27. EOF: More Details • Multi-phase procedure: • Bootstrap: Summary blocks • Data/data bitmaps • Inodes/inode bitmaps • Inode fields, directory entries • Key techniques • Known patterns: Data blocks • Isolation: Know all but one block, one block must be… • Assertions: Check assumptions at each step

  28. EOF: Simplified Example • Create file: Touches many data structures • Directory data, directory inode, file data (known pattern),file inode, data bitmap, inode bitmap • Reset to beginning of file, write block again • File data (known pattern), file inode • Now, can classify inode block (isolation) • Assertion: only two blocks observed
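
A sketch of the isolation step in the example above, on the SDS side; the record type and helper functions are assumptions, not the paper's actual interfaces.

    /* After the probe rewrites the same file block, the SDS should see
     * exactly two writes: one matches the known data pattern (file data),
     * so by isolation the other must be the file's inode block. */
    #include <assert.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    struct write_rec { long addr; char data[BLOCK_SIZE]; };

    extern char known_pattern[BLOCK_SIZE];      /* pattern the probe wrote */
    extern void mark_inode_block(long addr);    /* record the classification */

    void classify_after_rewrite(struct write_rec *w, int n)
    {
        assert(n == 2);                  /* assertion from the slide */
        for (int i = 0; i < n; i++) {
            if (memcmp(w[i].data, known_pattern, BLOCK_SIZE) == 0)
                continue;                /* file data: matches known pattern */
            mark_inode_block(w[i].addr); /* by isolation, the inode block */
        }
    }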

  29. EOF: Overhead and Summary • Performance: A few minutes per GB • Probably OK, only done “once” per new file system • Scales well with faster disks (sequential bandwidth) • Limitations: “FFS”-like file systems (ext2/3, BSD FFS)

  30. Have Knowledge, Will Innovate • Knowing structures is not enough (sometimes) • Data block overloading (data, pointer, directory) • High-level operations not known (create, delete) • Requires new on-line techniques • Direct classification • Indirect classification • Block association • Operation inferencing

  31. A Simple Example: Smarter Caching • Modern RAID may have significant cache • Volatile (DRAM) • Non-volatile (NVRAM) • How to exploit semantic information to cache more intelligently?

  32. Storing Meta-Data in NVRAM • [Figure: block stream with meta-data blocks (inodes, bitmaps, superblock) held in the NVRAM cache] • Start with simple meta-data: inodes, bitmaps, etc. • Good for meta-data intensive workloads

  33. Direct Classification • Given address, determine type directly • Direct classification via bounds check • Given disk address, can check bounds to determine type (superblock, bitmaps, inodes, general data block)
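
A sketch of the bounds check, assuming an ext2-like region order within each group; the layout structure is hypothetical and would be filled in from EOF's discovered model.

    /* Classify a block by where its address falls within its group.
     * Region order (superblock/descriptors, data bitmap, inode bitmap,
     * inode table, data) is an ext2-like assumption. */
    enum blk_type { SUPER, DATA_BITMAP, INODE_BITMAP, INODE, DATA };

    struct group_layout {                 /* start address of each region */
        long dbmap, ibmap, itable, data;
    };

    enum blk_type classify(long addr, const struct group_layout *groups,
                           long blocks_per_group)
    {
        const struct group_layout *g = &groups[addr / blocks_per_group];
        if (addr < g->dbmap)  return SUPER;         /* superblock + descriptors */
        if (addr < g->ibmap)  return DATA_BITMAP;
        if (addr < g->itable) return INODE_BITMAP;
        if (addr < g->data)   return INODE;
        return DATA;
    }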

  34. Getting Rid Of The Dead • If file blocks are deleted, remove them from cache • No need to keep dead blocks around • Problem: How to determine if a file is deleted? • Need to look for signs of deletion • Three different places to look: • Inode bitmaps • Directory that contains file • Inode itself • Operation inferencing via block differencing

  35. Operation Inferencing: Detecting Deletes (Inode Bitmap) • [Figure: on an inode-bitmap write, the SDS reads the cached old version and diffs it against the new block; the diff result identifies deleted files]
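
The diff itself is simple; a sketch, assuming 4-KB bitmap blocks and a hypothetical callback:

    /* When an inode-bitmap block is written, diff it against the cached
     * old version: bits that flip from 1 to 0 are freed inodes, i.e.
     * deleted files. */
    #include <stdint.h>

    #define BLOCK_SIZE 4096

    extern void note_deleted_inode(int inode_nr);   /* e.g. evict its cached blocks */

    void diff_inode_bitmap(const uint8_t *old_blk, const uint8_t *new_blk,
                           int first_inode)
    {
        for (int byte = 0; byte < BLOCK_SIZE; byte++) {
            uint8_t freed = old_blk[byte] & ~new_blk[byte];   /* 1 -> 0 transitions */
            for (int bit = 0; bit < 8; bit++)
                if (freed & (1u << bit))
                    note_deleted_inode(first_inode + byte * 8 + bit);
        }
    }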

  36. Operation Inferencing: Overheads • Space overhead • Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial) • Time overhead • CPU: Difference operation is like an extra copy • Disk: May require block read (if small/no cache) • [In paper: Quantified time and space overheads] • Main point: There is a CPU and memory cost

  37. Case Studies

  38. Experimental Set-up • Problem: Don’t have SDS hardware to use (yet!) • “Cost-effective” alternative: • Software prototype • Insert a driver underneath the FS • Much like software RAID • Good because… • Traffic stream similar • Bad because… • CPU, memory not isolated from host

  39. Fast RAID Reconstruction • Observe: When reconstructing data onto hot spare, no need to reconstruct data that isn’t live • Trend: Less live data in performance-sensitive I/O systems • Question: How can we perform reconstruction quickly? • [Figure: mirrored disks reconstructing onto a hot spare]

  40. Traditional Approaches • Why not in the file system? • File system doesn’t know what RAID is • Why not in the storage system? • RAID doesn’t know which blocks are live (at best, it knows a never-written block isn’t)

  41. The Semantic Way • Easy: Scan disk, only copy live blocks • Key piece of knowledge: Bitmaps • Plus, need to watch for “unmapped” writes • Optionally, can copy “dead” blocks later • Useful if SDS doesn’t feel “sure” about its knowledge • Guaranteed correct with prioritized recovery
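
A sketch of the semantic reconstruction loop; the helpers stand in for RAID-internal routines and the tracked bitmap state.

    /* Walk the address space and rebuild only blocks the data bitmaps
     * mark live; dead blocks can be swept later at low priority, which
     * keeps the result correct even if the liveness knowledge was stale. */
    extern int  block_is_live(long addr);        /* from the tracked data bitmaps */
    extern void reconstruct_block(long addr);    /* rebuild + write to the hot spare */

    void semantic_reconstruct(long nblocks)
    {
        for (long addr = 0; addr < nblocks; addr++)
            if (block_is_live(addr))
                reconstruct_block(addr);
    }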

  42. Fast Reconstruction: A Graph • [Graph: RAID-5, IBM disks] • Fast reconstruction: Less live data -> less time • How data is spread across disk affects recovery time

  43. Semantic Conclusions • Innovation in traditional storage stack is limited • File system: high but not low-level info • Storage system: low but not high-level info • Semantically-smart disks: Best of both worlds? • Takes advantage of “smart” disk systems • Exploit low-level information… • …with high-level knowledge of file system • A remaining challenge • Overcoming the “file system obfuscation” problem

  44. Outline • Overview • Knowledge and its applications • Gray-box file placement • Semantically-smart disks • Scientific apps, the Grid, and I/O • Conclusions

  45. Trends in Scientific Computing • What constitutes a job is increasingly complex • Not your simple process anymore • Data demands increasing • Not just cycles anymore • Wide-area collaboration • “Grids” facilitate sharing

  46. The Question • How to run scientific workloads on the WAN? • [Figure: home site connected to remote sites over the WAN]

  47. Scientific Outline • Typical “scientific” jobs • Structure • Properties • Migratory file services • Components • Performance • Conclusions

  48. First Things First • Study of modern scientific applications • A “measure then build” approach • Suite of six applications • BLAST: Searches genomic databases for matching proteins • IBIS: Global-scale simulation of earth systems • CMS: High-energy physics testing software • Nautilus: Simulation of molecular dynamics • Messkit Hartree-Fock: Simulation of atomic interactions • AMANDA: Astrophysics simulation of cosmic events

  49. An Example: AMANDA • [Figure: the AMANDA pipeline of processes (runtimes 42s, 955s, 2188s, 3601s) with data flows ranging from 4 KB to 505 MB] • A single “job” is a multi-process pipeline -> batch pipelined • Each process is a blue circle • There are many types of I/O • Endpoint (red): unique input/output of pipeline • Pipeline private (green): shared between pipe processes • Batch shared (yellow): shared across all pipes in batch

  50. Some Things We Learned • Demands of a single pipeline are modest • Modern PC with disk can handle demand • Aggregation of I/O could be harder (WAN) • Lots of sharing of data within and across pipelines • Systems should (have to?) take advantage of this
