Provenance for the Cloud (USENIX Conference on File and Storage Technologies (FAST '10)). Kiran-Kumar Muniswamy-Reddy, Peter Macko, and Margo Seltzer, Harvard School of Engineering and Applied Sciences.
Outline • Introduction • Background • Provenance System Property • Architecture & Protocol • Evaluation • Conclusion & Comment
Introduction • Problem to Solve • Implement a provenance-aware storage system on top of current cloud stores (using Amazon's services)
Background(1/3) • Provenance • Data has two critical components • What it is (contents) • Where it came from (ancestry) • Provenance is the description of how an object was derived • It is the metadata that describes the history of an object • Why use provenance? • Use case: Sloan Digital Sky Survey (SDSS) • Debug experimental results • Detect and avoid faulty data propagation • Improve text search results • Security
Background(2/3) • Provenance can be defined abstractly as a directed acyclic graph (DAG) • Nodes • Objects: files, processes, tuples, data sets, etc. • Have attributes • Command line arguments • Name and version number • Edges • Indicate a dependency between objects
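The DAG view above can be sketched in a few lines of Python. This is a minimal illustration only: the class name `ProvNode`, the `ancestors` helper, and the example files are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ProvNode:
    """One object in the provenance graph (file, process, tuple, ...)."""
    name: str
    attributes: dict = field(default_factory=dict)   # e.g. argv, version
    depends_on: list = field(default_factory=list)   # edges to ancestors


def ancestors(node):
    """Return the names of every object that `node` transitively depends on."""
    seen = set()
    stack = list(node.depends_on)
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen.add(current.name)
            stack.extend(current.depends_on)
    return seen


# Hypothetical example: a sort process reads in.txt and writes out.txt.
infile = ProvNode("in.txt", {"version": 1})
proc = ProvNode("sort", {"argv": "sort in.txt", "version": 1}, [infile])
outfile = ProvNode("out.txt", {"version": 1}, [proc])
```

Calling `ancestors(outfile)` walks the dependency edges and collects both the process and its input file.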
[Figure: example provenance graph for an organ donation decision. Nodes I1 to I9 (e.g. Patient Brain Death Notification, Donor Data Request, Donor Data, Blood Test Request, Blood Test Result, Decision Request, Donation Decision) are connected by edges labeled "is based on", "is caused by", "is response to", and "is justified by".]
Background(3/3) • Eventual Consistency • A weaker form of data consistency • If no new updates are sent for a sufficiently long period of time, all replicas in the system can be expected to become consistent
Provenance System Property(1/2) • Provenance Data Coupling • An object and its provenance must match • The provenance must accurately and completely describe the data • Multi-object Causal Ordering • The causal relationship among objects • A system must ensure that an object’s ancestors and their provenance are persistent before making the object itself persistent
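The multi-object causal ordering property can be illustrated with a small sketch that persists an object only after all of its ancestors. The function name, the `deps` mapping, and the list used as a stand-in for persistent storage are assumptions made for this example, not part of the paper's implementation.

```python
def persist_in_causal_order(obj, deps, persisted, store):
    """Persist `obj` only after every ancestor has been persisted.

    deps:      mapping object -> list of objects it depends on (a DAG)
    persisted: set of objects already written
    store:     list standing in for persistent storage, in write order
    """
    if obj in persisted:
        return
    for parent in deps.get(obj, []):
        persist_in_causal_order(parent, deps, persisted, store)
    store.append(obj)
    persisted.add(obj)


# Hypothetical dependency chain: out.txt <- sort <- in.txt
deps = {"out.txt": ["sort"], "sort": ["in.txt"], "in.txt": []}
write_order = []
persist_in_causal_order("out.txt", deps, set(), write_order)
```

Here `write_order` ends up listing the input file first and the output file last, so no object is ever persisted before its ancestors.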
[Figure: the example organ donation provenance graph, repeated (nodes I1 to I9 with edges labeled "is based on", "is caused by", "is response to", and "is justified by").]
Provenance System Property(2/2) • Data Independent Persistence • A system must retain an object's provenance even if the object itself is removed • Efficient Query • Provenance must be accessible to users who want to query or verify provenance properties of their data
Architecture(2) – S3 • Simple Storage Service (S3) • Amazon's storage service • An object store where the size of objects can range from 1 byte to 5 GB • With each object, clients can store up to 2 KB of metadata • Uses a SOAP or REST API • PUT, GET, HEAD, COPY, DELETE
Architecture(3) - SimpleDB • SimpleDB • An Amazon service that provides indexing and querying of data • The data model consists of items that are described by <attribute,value> pairs • Each item can have up to 256 <attribute,value> pairs • Each attribute name and value can be as large as 1 KB
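The SimpleDB limits listed above translate into a simple size check before provenance is written. The sketch below is an assumption-laden illustration: the constant and function names are invented here, and it only models the two limits stated on the slide (256 pairs per item, 1 KB per name or value).

```python
MAX_PAIRS_PER_ITEM = 256   # limit stated on the slide
MAX_BYTES = 1024           # 1 KB limit for each attribute name and value


def oversized_pairs(pairs):
    """Return the <attribute,value> pairs whose name or value exceeds 1 KB."""
    return [
        (attr, value)
        for attr, value in pairs
        if len(attr.encode("utf-8")) > MAX_BYTES
        or len(value.encode("utf-8")) > MAX_BYTES
    ]


def fits_in_one_item(pairs):
    """True if the pairs respect both the per-item and per-value limits."""
    return len(pairs) <= MAX_PAIRS_PER_ITEM and not oversized_pairs(pairs)


# Hypothetical provenance record with one value that is too large.
record = [("input", "in.txt"), ("env", "x" * 2000)]
```

For `record`, `fits_in_one_item` is false because the `env` value is larger than 1 KB, which is the case that forces large values to be stored elsewhere.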
Architecture(4) - SQS • Simple Queue Service (SQS) • A distributed messaging system that lets users exchange messages between the distributed components of their systems • Messages are limited to 8 KB • In this paper, SQS is used as a write-ahead log (WAL)
Architecture(5) -- PASS • Provenance-Aware Storage System • A storage system that automatically collects, stores, manages, and provides search for provenance • Monitors system calls • Generates provenance and sends both provenance and data to PA-S3fs
Architecture(6) – PA-S3fs • Provenance-Aware S3 File System • Caches data and provenance on the client to reduce traffic to S3 • Sends data and provenance to the cloud
Protocol(2) • Protocol 1 (P1) • Standalone cloud store • Map each file to an S3 object and store its provenance as a separate S3 object • Provenance object • Named with a UUID • Contains the name of the primary object • Primary object's metadata • Version number and UUID
Protocol(3) • P1 does not support provenance data coupling • But it can detect decoupling • Queries are inefficient • All provenance must be retrieved to answer a query • [Diagram: Client to S3: PUT provenance, OK; then PUT data, OK]
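The two-step P1 write and the way decoupling can be detected can be sketched with an in-memory dictionary standing in for an S3 bucket. Everything here is hypothetical illustration: the field names (`primary`, `version`, `prov_uuid`), the `crash_before_data` flag, and the detection rule (compare the data's version with the newest provenance version) are assumptions, not the paper's code.

```python
import uuid

s3 = {}  # in-memory stand-in for an S3 bucket: key -> (body, metadata)


def p1_put(name, data, provenance, version, crash_before_data=False):
    """Write provenance first, then data, as two separate PUTs."""
    prov_key = str(uuid.uuid4())
    # PUT 1: the provenance object records the name of its primary object.
    s3[prov_key] = ({"primary": name, "version": version,
                     "records": provenance}, {})
    if crash_before_data:
        return prov_key  # simulate a client failure between the two PUTs
    # PUT 2: the data object; its metadata carries version and uuid.
    s3[name] = (data, {"version": version, "prov_uuid": prov_key})
    return prov_key


def latest_provenance_version(name):
    """Scan every object in the store (the inefficiency noted above)."""
    versions = [body["version"] for body, _ in s3.values()
                if isinstance(body, dict) and body.get("primary") == name]
    return max(versions, default=None)


def is_coupled(name):
    """True if the stored data version matches its newest provenance."""
    if name not in s3:
        return False
    _, metadata = s3[name]
    return metadata["version"] == latest_provenance_version(name)
```

If a client writes version 1 normally and then crashes after writing only the provenance of version 2, `is_coupled` reports the mismatch. The check cannot repair it, which mirrors the point that P1 detects decoupling without preventing it.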
Protocol(5) • Protocol 2 (P2) • Cloud store with a cloud database • Store the provenance of an object as one SimpleDB item • If a provenance value is larger than SimpleDB's 1 KB limit • Store that value as an S3 object • Save a pointer to it as the attribute value
Protocol(6) • P2 provides efficient provenance queries • It does not support provenance data coupling • [Diagram: Client to S3: PUT provenance values > 1 KB, OK; Client to SimpleDB: BatchPutAttributes (provenance), OK; Client to S3: PUT data, OK]
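The P2 write path can be sketched with dictionaries standing in for S3 and SimpleDB. This is a simplified illustration under stated assumptions: the item naming scheme (`name_version`), the `s3://` pointer format, and the helper names are hypothetical and are not taken from the paper.

```python
import uuid

LIMIT = 1024      # SimpleDB's 1 KB limit per value
s3 = {}           # stand-in for an S3 bucket: key -> body
simpledb = {}     # stand-in for a SimpleDB domain: item -> list of pairs


def p2_put(name, version, data, provenance):
    """Store provenance as one SimpleDB item, then PUT the data to S3."""
    attrs = []
    for attr, value in provenance:
        if len(value.encode("utf-8")) > LIMIT:
            # Too large for SimpleDB: spill to S3 and keep only a pointer.
            key = "prov-" + str(uuid.uuid4())
            s3[key] = value
            value = "s3://" + key          # hypothetical pointer format
        attrs.append((attr, value))
    simpledb[f"{name}_{version}"] = attrs  # hypothetical item naming
    s3[name] = data


def find_items(attr, value):
    """Indexed-style lookup: items that carry the given pair."""
    return sorted(item for item, attrs in simpledb.items()
                  if (attr, value) in attrs)
```

Because provenance lives in a queryable store, a lookup such as `find_items("input", "in.txt")` no longer needs to fetch every provenance object. The data PUT is still a separate step, so a crash between the two writes can leave provenance without matching data.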
Protocol(7) • Protocol 3 (P3) • Cloud store with a cloud database and a messaging service • Use SQS as a write-ahead log (WAL) • SQS messages are limited to 8 KB • Store large objects as temporary S3 objects and record a pointer to them in the WAL • Commit daemon • Reads the log records • Assembles all the records belonging to a transaction • Ignores the records of a transaction if the client crashed before completing it
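[Diagram: P3 message sequence] • Client to S3: PUT temporary data copy, OK • Client to SQS: SendMessage (provenance), OK • Commit daemon to SQS: ReceiveMessage • Commit daemon to S3: PUT provenance values > 1 KB, OK • Commit daemon to SimpleDB: BatchPutAttributes, OK • Commit daemon to S3: COPY data, OK • Commit daemon: delete the messages and the temporary object, OK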
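The write-ahead-log idea behind P3 can be sketched with a list standing in for the SQS queue. This is a toy model, not the paper's implementation: the record layout (`tx`, `seq`, `total`), the `crash_after` flag, and the `tmp/` key prefix are assumptions, and real queue behavior such as message ordering, visibility timeouts, and the 8 KB limit is not modeled.

```python
s3, simpledb, queue = {}, {}, []   # in-memory stand-ins for the services


def client_log(txid, name, data, provenance, crash_after=None):
    """Client side: stage data in a temporary object, then log provenance."""
    temp_key = f"tmp/{txid}"
    s3[temp_key] = data
    records = [{"tx": txid, "seq": i, "total": len(provenance),
                "name": name, "temp": temp_key, "prov": pair}
               for i, pair in enumerate(provenance)]
    # crash_after=None logs everything; an integer simulates a client
    # crash after that many records were sent.
    for record in records[:crash_after]:
        queue.append(record)


def commit_daemon():
    """Assemble records per transaction and apply only complete ones."""
    by_tx = {}
    for record in queue:
        by_tx.setdefault(record["tx"], []).append(record)
    committed = []
    for txid, records in by_tx.items():
        if len(records) < records[0]["total"]:
            continue  # incomplete transaction: ignore its records
        name, temp = records[0]["name"], records[0]["temp"]
        ordered = sorted(records, key=lambda r: r["seq"])
        simpledb[name] = [r["prov"] for r in ordered]   # provenance first
        s3[name] = s3[temp]                             # COPY temp -> final
        del s3[temp]                                    # delete temp object
        committed.append(txid)
    # Delete the log messages of committed transactions.
    queue[:] = [r for r in queue if r["tx"] not in committed]
    return committed
```

In this sketch the data becomes visible under its final name only after the provenance has been written, which is how the log gives data and provenance a common commit point. Records from a crashed client simply stay in the list here.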
Evaluation(1) • Workload • CVSROOT nightly backup • IO intensive • 240 operations • Blast • Mix of compute and IO operations • Provenance tree has a depth of 5 • 10773 operations • Challenge • Mix of compute and IO operations • Provenance tree has a depth of 11 • 6179 operations
Evaluation(2) • [Figure: panels labeled "EC2 instance" and "Local machine"]
Evaluation(3) • Query performance • Q1 • Retrieve all the provenance ever recorded • Q2 • Retrieve the provenance of all versions of one object • Q3 • Find all files that were directly output by Blast • Q4 • Find all the descendants of files derived from Blast
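Queries Q3 and Q4 amount to a one-hop lookup and a graph traversal, respectively. The sketch below runs them over a small in-memory edge list; the file names, the edge list, and the function names are invented for illustration and say nothing about how the paper's queries were actually issued against S3 or SimpleDB.

```python
from collections import defaultdict

# Hypothetical dependency edges as (object, depends_on) pairs.
edges = [
    ("blast", "db.fasta"),
    ("hits.out", "blast"),
    ("report.txt", "hits.out"),
    ("summary.txt", "report.txt"),
    ("notes.txt", "editor"),
]

# Invert the edges so each object maps to what was derived from it.
children = defaultdict(set)
for obj, parent in edges:
    children[parent].add(obj)


def q3_direct_outputs(process):
    """Q3: objects that depend directly on the given process."""
    return sorted(children.get(process, ()))


def q4_descendants(process):
    """Q4: everything derived from the direct outputs of the process."""
    found = set()
    frontier = list(children.get(process, ()))
    while frontier:
        current = frontier.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                frontier.append(child)
    return sorted(found)
```

Q3 needs a single lookup, while Q4 repeats that lookup once per discovered object, so its cost grows with the size of the derived subgraph.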
Conclusion • Definition of properties that provenance systems must exhibit • Design and implementation of three protocols for storing provenance and data on the cloud • All three protocols have reasonable overhead in time and minimal financial overhead
Comment • Economy • Provenance cannot increase profit directly • Customer loyalty • Security • Provenance can ensure the correctness of files • But it may contain sensitive information