Provenance-aware Storage Systems

Provenance-aware Storage Systems • Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, Margo Seltzer • Harvard University

Provenance-aware storage systems (PASS)
• Provenance (lineage) is the ownership history of an object
• In a FS context, provenance is “a description of the execution history that produced a persistent object”
• Queries of provenance can answer questions like:
  • Who is using my dataset?
  • On whose data does my result depend?
• Two possible approaches
  • Disclosed provenance
    • Depend on apps and users to record provenance
    • Rich semantic knowledge
  • Observed provenance
    • System transparently records, maintains provenance data
    • Little semantic knowledge, but is gathered for all workloads
• Authors implemented a PASS filesystem (PASTA) which automatically gathers provenance in a UNIX environment

Provenance records
For each file in the filesystem, record:
• The executable that created it
• Any input files
• “Complete” hardware platform description
• Command line
• Process environment
• Other data such as random seeds
Example command: sort a > b
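As a rough illustration of the fields listed above, one provenance record could be modeled as follows. This is a minimal sketch, not the PASS on-disk format: the class name `ProvenanceRecord`, its field names, the executable path, and the environment value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical in-memory shape of one file's provenance."""
    output_file: str
    executable: str                      # program that created the file
    input_files: list                    # files the program read
    command_line: list                   # argv of the creating process
    environment: dict = field(default_factory=dict)
    platform: str = ""                   # hardware platform description
    extra: dict = field(default_factory=dict)  # e.g. random seeds


# Record for the slide's example command: sort a > b
record = ProvenanceRecord(
    output_file="b",
    executable="/usr/bin/sort",          # assumed path
    input_files=["a"],
    command_line=["sort", "a"],
    environment={"LC_ALL": "C"},         # assumed value
)
print(record.input_files)
```

The point of the sketch is only that the output file `b` carries a pointer back to its input `a` and to the process context that produced it.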

Tasks enabled by PASS
• Script generation
  • Generate a Makefile that reproduces a file
• Detecting system changes
  • Compare provenance of two files to detect changes in environment, libraries, etc.
• Intrusion detection
  • Detailed logs of how objects have changed
• Retrieving compile-time flags
  • In case you forgot how you compiled something
• Build debugging
  • Avoid needing to “make clean” after any change
• Understand system dependencies
  • e.g., objects depended on /bin/mount because libc reads the mount table frequently
  • User can manually choose files to be ignored by PASTA
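The "detecting system changes" task amounts to diffing two provenance records. A minimal sketch, assuming each record has been flattened into a dictionary; the helper `diff_provenance` and the sample field names and values are made up for illustration and are not part of PASS:

```python
def diff_provenance(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed


# Illustrative records for the same output built on two different days.
monday = {"executable": "/usr/bin/cc", "libc": "2.3.2", "flags": "-O2"}
friday = {"executable": "/usr/bin/cc", "libc": "2.3.4", "flags": "-O2"}

print(diff_provenance(monday, friday))  # {'libc': ('2.3.2', '2.3.4')}
```

Here the only difference between the two builds is the library version, which is the kind of silent environment change the slide has in mind.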

PASS Implementation
• Collector kernel module intercepts syscalls and generates provenance records
  • Per-process provenance information kept in memory
  • Records are written to disk
• Duplicate elimination
  • Coalesce entries from repeated syscalls
• Versions
  • Filesystem data is not versioned but provenance records are
• Node merging for cycle elimination
  • Merge the provenance of sets of processes that produce cycles
• Approx 5000 lines of in-kernel code
  • Not including in-kernel Berkeley DB
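Node merging for cycle elimination can be pictured as collapsing every group of mutually dependent nodes into a single node. The sketch below uses a generic reachability-based grouping (strongly connected components), not the algorithm from the PASS kernel module. It also merges every node on a cycle, files included, which may be coarser than what PASS does. The function name `merge_cycles` and the sample graph are assumptions.

```python
from collections import defaultdict


def merge_cycles(edges):
    """Collapse every cycle in a dependency graph into one merged node.

    edges: list of (node, ancestor) pairs, read as "node depends on ancestor".
    Returns (group_of, merged_edges), where group_of maps each node to the
    frozenset of nodes it was merged with. Quadratic, but simple.
    """
    graph = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        graph[src].add(dst)
        nodes.update((src, dst))

    def reachable(start):
        # Iterative depth-first walk over outgoing dependency edges.
        seen, stack = set(), [start]
        while stack:
            for nxt in graph[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    # Two nodes share a group when each can reach the other.
    group_of = {
        n: frozenset({n} | {m for m in reach[n] if n in reach[m]})
        for n in nodes
    }
    merged_edges = {
        (group_of[s], group_of[d])
        for s, d in edges
        if group_of[s] != group_of[d]
    }
    return group_of, merged_edges


# Processes P and Q each read the file the other writes (A and B),
# so A -> P -> B -> Q -> A forms a cycle. C and I sit outside it.
edges = [("A", "P"), ("P", "B"), ("B", "Q"), ("Q", "A"),
         ("C", "Q"), ("P", "I")]
group_of, merged_edges = merge_cycles(edges)
print(sorted(group_of["A"]))  # ['A', 'B', 'P', 'Q']
```

After merging, the remaining graph is acyclic: `C` depends on the merged group, and the merged group depends on `I`.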

PASTA – the storage layer
• Stacked on ext2 using FiST
  • Not clear why a storage layer is needed
  • Maybe to guarantee that the metadata follows the data?
• In-kernel Berkeley DB (KBDB) stores five provenance tables
  • Provenance: main repository of records
  • Map: map inodes to pnodes
  • Argdata: assign sequence numbers to each command line and environment record
  • Argreverse: reverse index of Argdata
  • Argindex: secondary index of cmdline and environment components to sequence numbers
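To make the relationship between the five tables concrete, here is a toy model that uses plain dictionaries in place of Berkeley DB. The key and value directions, the pnode numbering, and the names `ProvenanceStore`, `intern_args`, and `add_record` are all assumptions for illustration; the slide does not specify the actual schemas.

```python
class ProvenanceStore:
    """Toy stand-in for the five provenance tables (layouts are assumed)."""

    def __init__(self):
        self.provenance = {}  # pnode -> list of (attribute, value) records
        self.map = {}         # inode -> pnode
        self.argdata = {}     # sequence number -> cmdline/environment text
        self.argreverse = {}  # cmdline/environment text -> sequence number
        self.argindex = {}    # individual word -> set of sequence numbers

    def intern_args(self, text):
        """Store a command line or environment string once; return its number."""
        if text in self.argreverse:
            return self.argreverse[text]
        seq = len(self.argdata) + 1
        self.argdata[seq] = text
        self.argreverse[text] = seq
        for word in text.split():
            self.argindex.setdefault(word, set()).add(seq)
        return seq

    def add_record(self, inode, attribute, value):
        """Attach one provenance record to the pnode behind an inode."""
        pnode = self.map.setdefault(inode, len(self.map) + 1)
        self.provenance.setdefault(pnode, []).append((attribute, value))
        return pnode


store = ProvenanceStore()
seq = store.intern_args("sort a")
pnode = store.add_record(inode=1234, attribute="ARGV", value=seq)
again = store.intern_args("sort a")  # reuses the existing sequence number
print(seq, again, store.argindex["sort"])
```

The reverse index is what lets a repeated command line be stored once, and the word index is what lets a query such as "which commands mentioned sort" avoid scanning every record.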

Queries
• Conventional attribute lookup
• Transitive closure of ancestry or descendancy information
• Query tools act on the provenance databases
  • Provenance Explorer allows users to browse the filesystem and make point queries
  • Makefile Generator produces the set of commands that led to a file’s current state
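The transitive-closure query is a graph walk over the "direct input" relation. A minimal sketch, assuming provenance has already been loaded into a dictionary from each object to its direct inputs; the function `ancestors` and the sample data are hypothetical and not the PASS query tools:

```python
from collections import deque


def ancestors(direct_inputs, target):
    """Return every object that target transitively depends on."""
    seen = set()
    queue = deque([target])
    while queue:
        for parent in direct_inputs.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Illustrative provenance: "b" came from "sort a > b", and "report" was
# built from "b" plus a template.
direct_inputs = {
    "b": ["a"],
    "report": ["b", "template"],
}
print(sorted(ancestors(direct_inputs, "report")))  # ['a', 'b', 'template']
```

A descendancy query is the same walk over the reversed relation, which answers "who is using my dataset?".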

Evaluation: Performance
• Small file microbenchmark
  • Create, read, write, sync, delete 2500 4KB files in 100 directories
  • 2X time overhead for small files
• Large file microbenchmark
  • Write then read 100MB sequentially, write then read random 256KB chunks
  • 2-15% time overhead for large files
• Build Linux 2.4.29 kernel then generate Makefiles for every resulting file
  • Fast – only 65ms per file thanks to DB index

Evaluation: Provenance growth
• After kernel build, append a comment to N random files and rebuild kernel

Evaluation: One user’s experience
• Computational biologist who uses blast, a tool to find regions of similarity between biological sequences
• One program generates database files, blast does the comparison, then some Perl scripts clean up the output
• After the workflow runs, the biologist uses PASS query tools to generate Makefiles with the specific commands
• Reports runtime overhead of 1.65%

Prototype capabilities and limitations
• Collects and maintains provenance without a priori workload knowledge
• Cannot generate provenance for files from non-provenance-aware machines
• No security and access control
  • e.g., an employee review should be readable by the employee, but includes input from colleagues that should be private
  • Future work
• Simple query capabilities

Research challenges
• Security model
• Cycle-breaking
• Provenance pruning
  • e.g., when deleting a file with long chains of pnodes
• Integrate with other provenance-gathering apps, systems
• Network-aware PASS systems
• Integrate with file versioning

Lingering questions…
• Overhead for system files?
• Collect provenance for system daemons?
• Deeper evaluation of provenance time/space costs over time
• Provenance in aging filesystems
• User studies
• Who wants this?
