Issues in Automatic Provenance Collection

Issues in Automatic Provenance Collection May 4, 2006 Margo Seltzer Harvard University Division of Engineering and Applied Sciences

Imagine … • Every computational object you created had complete provenance. • You could identify the source and history of every object you ever received. • You could query this complete history. • All these features worked regardless of what tools you used.

What is Automatic Provenance Collection? • A system that makes all this happen. • Requires no user intervention. • Provenance collection is the default. • Works seamlessly and unobtrusively while you work using whatever tools you normally use. • Examples • Is this memo based upon confidential data? • Why do these two invocations produce different results?
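The first example question ("Is this memo based upon confidential data?") amounts to an ancestry query over recorded provenance. The following is a minimal sketch, assuming provenance is available as a mapping from each object to its direct inputs. The file names, the `PARENTS` mapping, and the `CONFIDENTIAL` set are hypothetical and do not come from the talk.

```python
def ancestors(parents, obj):
    """Return every object that obj transitively depends on."""
    seen = set()
    stack = list(parents.get(obj, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

# Hypothetical provenance records: object -> set of direct inputs.
PARENTS = {
    "memo.txt": {"editor", "summary.csv"},
    "summary.csv": {"awk", "salaries.db"},
}
# Hypothetical set of objects labeled confidential.
CONFIDENTIAL = {"salaries.db"}

# The memo is based on confidential data if any ancestor carries the label.
tainted = ancestors(PARENTS, "memo.txt") & CONFIDENTIAL
```

Here `tainted` is non-empty because `memo.txt` depends on `summary.csv`, which in turn depends on `salaries.db`.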

Provenance-Aware Storage Systems (PASS) • Storage systems (e.g., file systems) in which provenance is a first-class entity. • Provenance: • is generated and maintained as transparently as possible. • can be indexed and queried. • will be created for objects imported from non-PASS sources. • is maintained in the presence of deletes, copies, renames, etc.

Collecting Provenance • Example command: % sort a > b • Events observed in the kernel: fork, open b (W), exec “sort a”, open a (R), read a, write b, close a, close b • Process record: name=“sort”, argv=“sort a”, env=“USER…”, modules=“pasta…”, kernel=“Linux…” • Provenance records passed through the file cache to the file system: b has input=sort; sort has input=a
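The slide's example can be approximated in a few lines. This is a minimal sketch, assuming the kernel events arrive as simple tuples; the tuple layout and the `collect` function are illustrative assumptions, not the PASS implementation.

```python
from collections import defaultdict

# Hypothetical event trace for `sort a > b`; the tuple layout is an assumption.
TRACE = [
    ("open", "sort", "b", "W"),
    ("exec", "sort", "sort a"),
    ("open", "sort", "a", "R"),
    ("read", "sort", "a"),
    ("write", "sort", "b"),
    ("close", "sort", "a"),
    ("close", "sort", "b"),
]

def collect(trace):
    """Return {object: set of inputs} derived from read and write events."""
    inputs = defaultdict(set)
    for event in trace:
        kind, proc, target = event[0], event[1], event[2]
        if kind == "read":
            inputs[proc].add(target)   # the process depends on the file it read
        elif kind == "write":
            inputs[target].add(proc)   # the file depends on the process that wrote it
    return dict(inputs)

provenance = collect(TRACE)
```

Only `read` and `write` events create dependencies in this sketch; `open`, `exec`, and `close` are ignored, although a real collector would also record attributes such as `argv` and `env` at `exec` time.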

Observed vs Disclosed Provenance • Observed provenance: • Extract provenance from stream of events • System does not control events • Disclosed provenance: • Application or user identifies provenance • Provenance disclosed to database • Examples: • User annotations • Provenance-aware applications • Workflow specifications

Challenges in Observed Provenance • Granularity • Versions • Cycles • False provenance • Security

Granularity • Automatic systems track provenance at the granularity that they see (files, tuples, …). • Users think about provenance in coarser, semantically meaningful terms (experiments, projects, workflows). • This mismatch leads to problems: • Users want to know about “gcc 4.0,” not its change history from the beginning of time.

Versions • Provenance + mutable data = versioning. • Consider: • Open A, read A, write A, close A • A’s provenance changed. • We implicitly created a new version. • The provenance system must preserve versions. • Avoiding excessive versions leads to … • [Figure: process P reads A and writes A, producing A’]
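One way to picture implicit versioning is a rule that starts a new version whenever an object is written after its current version has been read. This is a minimal sketch under that assumed rule; the `VersionedObject` class and its fields are hypothetical and are not taken from PASS.

```python
class VersionedObject:
    """Tracks versions of one object; a write after a read starts a new version."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        self.read_since_bump = False
        self.ancestors = {}  # version number -> set of (name, version) inputs

    def read(self):
        # Reading pins the current version: someone may now depend on it.
        self.read_since_bump = True
        return (self.name, self.version)

    def write(self, sources):
        # A write to a version that has been read must not change that
        # version's provenance, so it goes to a fresh version instead.
        if self.read_since_bump:
            self.version += 1
            self.read_since_bump = False
        self.ancestors.setdefault(self.version, set()).update(sources)
        return (self.name, self.version)

# Open A, read A, write A, close A
a = VersionedObject("A")
seen = a.read()
first = a.write({seen})
# A second write with no intervening read reuses the same version.
second = a.write({("B", 0)})
```

Coalescing the two writes into one version keeps the version count down, which is the kind of optimization the slide hints at when it mentions avoiding excessive versions.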

Cycles • Cannot really happen: A cannot be both B’s parent and B’s child. • Violate causality. • So, how do they happen? • Open A, read A, write A, close A • A is its own parent, unless you create A’. • But what if (read A, write A) is in a loop? • One version per loop iteration? • Ideally, one version for entire loop. • How do you identify the loops?

Cycles (2) • [Figure: process P reads A and writes A’, then reads A’ and writes a further, undetermined version (?’)]

Cycles (3) • The cycles can be arbitrarily complex. • Why do they happen in observed systems and not disclosed systems? • In disclosed systems, the disclosures are made by someone who knows how to do the grouping. • Cycle detection/breaking is automatically doing what the human is doing when s/he decides where and what to disclose. • Our algorithm is not as smart as people.
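A simple form of cycle breaking can be sketched as follows: before recording a dependency, check whether it would make an object its own ancestor, and if so attach the dependency to a new version instead. This is an illustrative sketch only, not the PASS algorithm; the graph layout and the prime-suffix naming of versions are assumptions.

```python
def reaches(graph, start, goal):
    """True if goal is start itself or one of start's ancestors."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def add_dependency(graph, child, parent):
    """Record that child depends on parent and return the node that got the edge.

    If the edge would close a cycle, it is attached to a new version of
    child (named by appending a prime) rather than to child itself.
    """
    while reaches(graph, parent, child):
        child = child + "'"
    graph.setdefault(child, set()).add(parent)
    return child

# Read A, write A: A would be its own parent, so the write lands on A'.
graph = {}
v1 = add_dependency(graph, "A", "A")

# A longer cycle: B depends on A, then A is written using B.
graph2 = {}
add_dependency(graph2, "B", "A")
v2 = add_dependency(graph2, "A", "B")
```

Applied inside a read/write loop, this check yields one version per iteration, which is the costly behavior the slides describe; collapsing a whole loop into a single version would require recognizing the loop, which this sketch does not attempt.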

False Provenance • Recorded provenance that did not affect the output. • Examples: • Many utilities read one or more start-up files, but not all those startup files affect every output. • A workflow might specify an input file that is only sometimes used. • Neither observed nor disclosed systems can avoid this completely.

Security • Provenance and the data it describes have different security characteristics. • Protecting provenance requires protecting: • Attributes (e.g., command line, environment) • Relationships (e.g., ancestors) • Composition of security is hard. • Unfortunately, it is a requirement.

Conclusions • Automatic collection is useful. • It is also challenging. • There is a ton of interesting research to do.

Questions! • Thanks to: • Network Appliance • IBM Research • The Harvard PASS Team: Uri Braun, Simson Garfinkel, David Holland, Kiran-Kumar Muniswamy-Reddy • Participants in the October, 2005 PASS Workshop • Our users! http://www.eecs.harvard.edu/~margo/syrah
