
Provenance-aware Storage Systems

Provenance-aware Storage Systems. Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, Margo Seltzer. Harvard University. Provenance-aware storage systems (PASS). Provenance (lineage) is the ownership history of an object.



Presentation Transcript


  1. Provenance-aware Storage Systems. Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, Margo Seltzer. Harvard University

  2. Provenance-aware storage systems (PASS)
     • Provenance (lineage) is the ownership history of an object
     • In a FS context, provenance is “a description of the execution history that produced a persistent object”
     • Queries of provenance can answer questions like:
       • Who is using my dataset?
       • On whose data does my result depend?
     • Two possible approaches
       • Disclosed provenance
         • Depend on apps and users to record provenance
         • Rich semantic knowledge
       • Observed provenance
         • System transparently records, maintains provenance data
         • Little semantic knowledge, but is gathered for all workloads
     • Authors implemented a PASS filesystem (PASTA) which automatically gathers provenance in a UNIX environment

  3. Provenance records
     • For each file in the filesystem, record:
       • The executable that created it
       • Any input files
       • “Complete” hardware platform description
       • Command line
       • Process environment
       • Other data such as random seeds
     • Example: > sort a > b
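
The fields listed on this slide can be pictured as one record per output file. The following is a minimal sketch only, not the PASS on-disk format: the class name, field names, the `/usr/bin/sort` path, the environment value, and the platform string are all assumptions made for illustration, using the slide's `sort a > b` example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical in-memory view of the provenance of one output file."""
    output: str          # file that was produced
    executable: str      # program that created it
    inputs: tuple        # files read while producing the output
    argv: tuple          # command line
    env: tuple = ()      # process environment as (name, value) pairs
    platform: str = ""   # hardware platform description
    extra: tuple = ()    # other data, e.g. random seeds


# Record for the slide's example command `sort a > b`.
# The executable path, environment, and platform values are assumed.
record = ProvenanceRecord(
    output="b",
    executable="/usr/bin/sort",
    inputs=("a",),
    argv=("sort", "a"),
    env=(("LANG", "C"),),
    platform="hypothetical-x86-linux",
)
```

Making the record immutable (`frozen=True`) fits the idea that provenance describes history that has already happened.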

  4. Tasks enabled by PASS
     • Script generation
       • Generate a Makefile that reproduces a file
     • Detecting system changes
       • Compare provenance of two files to detect changes in environment, libraries, etc.
     • Intrusion detection
       • Detailed logs of how objects have changed
     • Retrieving compile-time flags
       • In case you forgot how you compiled something
     • Build debugging
       • Avoid needing to “make clean” after any change
     • Understand system dependencies
       • e.g., objects depended on /bin/mount because libc reads the mount table frequently
     • User can manually choose files to be ignored by PASTA
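
The "detecting system changes" task amounts to a field-by-field comparison of two provenance records. A minimal sketch, assuming each record is a plain dictionary; the function name, the field names, and the library paths in the example are hypothetical and do not come from PASS:

```python
def diff_provenance(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    fields = sorted(set(old) | set(new))
    return {
        f: (old.get(f), new.get(f))
        for f in fields
        if old.get(f) != new.get(f)
    }


# Two hypothetical records for the same output built on different days.
monday = {"argv": "cc -O2 -c x.c", "libs": ("/lib/libc-2.3.so",), "env": "LANG=C"}
friday = {"argv": "cc -O2 -c x.c", "libs": ("/lib/libc-2.4.so",), "env": "LANG=C"}

changes = diff_provenance(monday, friday)
```

Here `changes` contains only the `libs` entry, pointing at the library upgrade as the difference between the two builds.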

  5. PASS Implementation
     • Collector kernel module intercepts syscalls and generates provenance records
       • Per-process provenance information kept in memory
       • Records are written to disk
     • Duplicate elimination
       • Coalesce entries from repeated syscalls
     • Versions
       • Filesystem data is not versioned but provenance records are
     • Node merging for cycle elimination
       • Merge the provenance of sets of processes that produce cycles
     • Approx 5000 lines of in-kernel code
       • Not including in-kernel Berkeley DB
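
Node merging for cycle elimination can be illustrated by collapsing every group of mutually reachable nodes into a single node, which leaves an acyclic graph. This is a sketch of the general idea, not the algorithm PASS uses in the kernel: it merges every node on a cycle (files as well as processes), and the function names and the example graph are assumptions.

```python
def reachable(graph, start):
    """Nodes reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def merge_cycles(graph):
    """Collapse mutually reachable nodes; return the merged, acyclic graph.

    graph maps node -> set of nodes it depends on. The result maps each
    merged group (a frozenset of original nodes) to the groups it depends on.
    """
    nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    reach = {n: reachable(graph, n) for n in nodes}
    group = {
        n: frozenset(m for m in nodes
                     if m == n or (m in reach[n] and n in reach[m]))
        for n in nodes
    }
    merged = {}
    for n in nodes:
        merged.setdefault(group[n], set())
        for dep in graph.get(n, ()):
            if group[dep] != group[n]:
                merged[group[n]].add(group[dep])
    return merged


# Hypothetical cycle: process P reads A and writes B, process Q reads B and writes A.
graph = {"B": {"P"}, "P": {"A"}, "A": {"Q"}, "Q": {"B"}, "out": {"B"}}
merged = merge_cycles(graph)
```

In the example the four nodes on the cycle collapse into one group, and `out` depends on that group. The quadratic reachability check keeps the sketch short; a real implementation would more likely use a linear-time strongly connected components algorithm.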

  6. PASTA – the storage layer
     • Stacked on ext2 using FiST
     • Not clear why a storage layer is needed
       • Maybe to guarantee that the metadata follows the data?
     • In-kernel Berkeley DB (KBDB) stores five provenance tables
       • Provenance: main repository of records
       • Map: map inodes to pnodes
       • Argdata: assign sequence numbers to each command line and environment record
       • Argreverse: reverse index of Argdata
       • Argindex: secondary index of cmdline and environment components to sequence numbers
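
To make the roles of the five tables concrete, they can be modelled as in-memory dictionaries. This is only a sketch: the key and value layouts, the class and method names, and the whitespace tokenisation are assumptions, and the real tables live in the in-kernel Berkeley DB.

```python
class ProvenanceStore:
    """Toy model of the five tables; every layout here is assumed."""

    def __init__(self):
        self.provenance = {}  # pnode -> list of (attribute, value) records
        self.map = {}         # inode number -> pnode number
        self.argdata = {}     # sequence number -> command line or environment text
        self.argreverse = {}  # text -> sequence number (reverse of argdata)
        self.argindex = {}    # single component (word) -> set of sequence numbers

    def intern_arg(self, text):
        """Give each distinct command line or environment string one sequence number."""
        if text in self.argreverse:
            return self.argreverse[text]
        seq = len(self.argdata) + 1
        self.argdata[seq] = text
        self.argreverse[text] = seq
        for word in text.split():
            self.argindex.setdefault(word, set()).add(seq)
        return seq

    def add_record(self, inode, attribute, value):
        """Append a record to the pnode for this inode, creating the pnode if needed."""
        pnode = self.map.setdefault(inode, len(self.map) + 1)
        self.provenance.setdefault(pnode, []).append((attribute, value))
        return pnode


store = ProvenanceStore()
seq = store.intern_arg("sort a")
pnode = store.add_record(42, "ARGV", seq)  # 42 is a made-up inode number
```

Storing each command line once and referring to it by sequence number keeps repeated invocations cheap, and the word index supports lookups such as "every command that mentioned `sort`" via `store.argindex["sort"]`.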

  7. Queries
     • Conventional attribute lookup
     • Transitive closure of ancestry or descendancy information
     • Query tools act on the provenance databases
       • Provenance Explorer allows users to browse the filesystem and make point queries
       • Makefile Generator produces the set of commands that led to a file’s current state
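
An ancestry query is a graph traversal over "was produced from" edges, and a Makefile generator can be built on top of it. A minimal sketch, assuming provenance is available as a dictionary from each file to its inputs and the command that produced it; the function names, the dictionary layout, and the `uniq` step in the example are hypothetical, and this is not the PASS query tool itself:

```python
def ancestors(inputs_of, target):
    """Transitive closure of ancestry: every file the target was derived from."""
    seen, stack = set(), [target]
    while stack:
        for parent in inputs_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def makefile_for(records, target):
    """Emit Makefile rules that rebuild target.

    records maps file -> (list of input files, command that produced the file).
    Files with no record are treated as original sources and get no rule.
    """
    inputs_of = {name: rec[0] for name, rec in records.items()}
    needed = [target] + sorted(ancestors(inputs_of, target))
    rules = []
    for name in needed:
        if name in records:
            inputs, command = records[name]
            rules.append(f"{name}: {' '.join(inputs)}\n\t{command}")
    return "\n\n".join(rules) + "\n"


records = {
    "b": (["a"], "sort a > b"),
    "c": (["b"], "uniq b > c"),
}
makefile = makefile_for(records, "c")
```

The rule for the requested file is emitted first so that it becomes the Makefile's default target. Descendancy queries follow the same pattern with the edges reversed.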

  8. Evaluation: Performance
     • Small file microbenchmark
       • Create, read, write, sync, delete 2500 4KB files in 100 directories
       • 2X time overhead for small files
     • Large file microbenchmark
       • Write then read 100MB sequentially, write then read random 256KB chunks
       • 2-15% time overhead for large files
     • Build Linux 2.4.29 kernel then generate Makefiles for every resulting file
       • Fast: only 65ms per file thanks to DB index

  9. Evaluation: Provenance growth • After kernel build, append a comment to N random files and rebuild kernel

  10. Evaluation: One user’s experience
     • Computational biologist who uses blast, a tool to find regions of similarity between biological sequences
     • One program generates database files, blast does the comparison, then some Perl scripts clean it up
     • After workflow, biologist uses PASS query tools to generate Makefiles with specific commands
     • Reports runtime overhead of 1.65%

  11. Prototype capabilities and limitations
     • Collects and maintains provenance without a priori workload knowledge
     • Cannot generate provenance for files from machines that are not provenance-aware
     • No security and access control
       • e.g., an employee review should be readable by the employee, but includes input from colleagues that should be private
       • Future work
     • Simple query capabilities

  12. Research challenges
     • Security model
     • Cycle-breaking
     • Provenance pruning
       • e.g., when deleting a file with long chains of pnodes
     • Integrate with other provenance-gathering apps, systems
     • Network-aware PASS systems
     • Integrate with file versioning

  13. Lingering questions…
     • Overhead for system files?
     • Collect provenance for system daemons?
     • Deeper evaluation of provenance time/space costs over time
       • Provenance in aging filesystems
     • User studies
       • Who wants this?
