Presentation Transcript


  1. Implementation and Evaluation of a Protocol for Recording Process Documentation in the Presence of Failures Zheng Chen and Luc Moreau zc05r@ecs.soton.ac.uk L.Moreau@ecs.soton.ac.uk University of Southampton

  2. Outline • Motivation • Protocol Overview • Implementation • Experimental Setup • Experimental Results & Analysis • Conclusions & Future Work

  3. The provenance of a data product refers to the process that led to that data product • Process documentation is a computer-based representation of a past process, used to determine provenance • Process documentation consists of a set of p-assertions • Process documentation is stored in provenance stores • Provenance is obtained by querying provenance stores
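
As a rough illustration of these definitions, the sketch below models p-assertions and a provenance store in Java. The class and method names (PAssertion, ProvenanceStore, query) are illustrative assumptions, not the actual PReP data model or API.

```java
import java.util.ArrayList;
import java.util.List;

// A p-assertion documents one step of a past process, e.g. an invocation or a
// result message, as observed by one actor.
class PAssertion {
    final String interactionId;   // identifies the interaction being documented
    final String assertingActor;  // which actor made this assertion
    final String kind;            // e.g. "invocation" or "result"
    final String content;         // serialised message content

    PAssertion(String interactionId, String assertingActor, String kind, String content) {
        this.interactionId = interactionId;
        this.assertingActor = assertingActor;
        this.kind = kind;
        this.content = content;
    }
}

// Process documentation is the set of p-assertions recorded for a process; it is
// held in provenance stores and queried later to determine provenance.
class ProvenanceStore {
    private final List<PAssertion> recorded = new ArrayList<PAssertion>();

    void record(PAssertion pa) {
        recorded.add(pa);
    }

    // Provenance of a data product: the p-assertions describing the process that
    // led to it, obtained here by a trivial query over the store.
    List<PAssertion> query(String interactionId) {
        List<PAssertion> result = new ArrayList<PAssertion>();
        for (PAssertion pa : recorded) {
            if (pa.interactionId.equals(interactionId)) {
                result.add(pa);
            }
        }
        return result;
    }
}
```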

  4. PReP (Groth 04-08) • A protocol to record process documentation • Multiple provenance stores are interlinked to enable retrievability of distributed process documentation [Figure: Actor1–Actor4 exchange invocation and result messages; each actor records invocation and result p-assertions in its own provenance store (PS1–PS4), and links between the stores form a pointer chain]
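
The pointer chain can be sketched as follows: each store holds a link naming the store used by the neighbouring actor, and a query walks the chain from store to store. The Link, LinkedStore, and ChainWalker names are hypothetical, not the PReP API.

```java
import java.util.HashMap;
import java.util.Map;

// A link names the provenance store used by the neighbouring actor for the same
// interaction, so documentation split across stores can still be retrieved.
class Link {
    final String interactionId;
    final String neighbourStoreUrl;

    Link(String interactionId, String neighbourStoreUrl) {
        this.interactionId = interactionId;
        this.neighbourStoreUrl = neighbourStoreUrl;
    }
}

class LinkedStore {
    final String url;
    final Map<String, Link> links = new HashMap<String, Link>();  // keyed by interaction id

    LinkedStore(String url) {
        this.url = url;
    }
}

class ChainWalker {
    // Follow links store by store (PS1 -> PS2 -> ...). A single inaccurate link
    // breaks retrieval, which is exactly the failure mode addressed next.
    static void walk(Map<String, LinkedStore> stores, String startUrl, String interactionId) {
        String current = startUrl;
        while (current != null) {
            LinkedStore ps = stores.get(current);
            if (ps == null) {
                System.out.println("broken pointer chain at " + current);
                return;
            }
            System.out.println("retrieving p-assertions from " + ps.url);
            Link next = ps.links.get(interactionId);
            current = (next == null) ? null : next.neighbourStoreUrl;
        }
    }
}
```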

  5. Failures • Provenance store crashes, communication failures • We do not consider application failures, e.g. actor crashes • Result: poor-quality process documentation (incomplete and disconnected) [Figure: as before, but a failed recording breaks the pointer chain linking PS1–PS4]

  6. Requirements • Guaranteed Recording: after a process completes, the entire documentation of the process must eventually be recorded in provenance stores • Link Accuracy: all the links recorded during a process must eventually be accurate, to enable retrievability of distributed documentation • Efficient Recording: the protocol should be efficient and introduce minimum overhead

  7. F-PReP • A protocol for recording process documentation in the presence of failures • Derives from PReP and inherits its generic nature • Introduces an Update Coordinator to facilitate updating links (we assume the coordinator does not crash) • Actor's side: uses timeout and retransmission to record p-assertions; chooses alternative provenance stores in case of failures; requests the coordinator to update links • Provenance store: replies with an acknowledgement only after it has successfully recorded the p-assertions in its persistent storage
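
A minimal sketch of the actor-side behaviour described above, assuming hypothetical interfaces (StoreClient, Coordinator) rather than the actual client-side library API: timeout and retransmission against the primary store, fail-over to an alternative store, then a link-update request to the coordinator.

```java
import java.util.List;

interface StoreClient {
    // Returns true once the store has acknowledged durable recording within the timeout.
    boolean recordBatch(String storeUrl, List<String> pAssertions, long timeoutMs);
}

interface Coordinator {
    // Ask the update coordinator to repair links that pointed at the failed store.
    void requestLinkUpdate(String interactionId, String oldStoreUrl, String newStoreUrl);
}

class FPrepRecorder {
    private final StoreClient client;
    private final Coordinator coordinator;
    private final int maxRetries;
    private final long timeoutMs;

    FPrepRecorder(StoreClient client, Coordinator coordinator, int maxRetries, long timeoutMs) {
        this.client = client;
        this.coordinator = coordinator;
        this.maxRetries = maxRetries;
        this.timeoutMs = timeoutMs;
    }

    void record(String interactionId, List<String> batch,
                String primaryStore, String alternativeStore) {
        // 1. Timeout and retransmission against the primary store.
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (client.recordBatch(primaryStore, batch, timeoutMs)) {
                return;  // acknowledged: the p-assertions are durably recorded
            }
        }
        // 2. Primary store presumed failed: record to an alternative store instead
        //    (retrying, since the alternative may also be transiently unavailable).
        while (!client.recordBatch(alternativeStore, batch, timeoutMs)) {
            // keep retrying; the coordinator itself is assumed not to crash
        }
        // 3. Ask the update coordinator to repair links so the distributed
        //    documentation remains retrievable from the new location.
        coordinator.requestLinkUpdate(interactionId, primaryStore, alternativeStore);
    }
}
```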

  8. F-PReP [Figure: the same four-actor scenario; when PS2 fails, its p-assertions are recorded in an alternative store PS2', a repair request is sent to the Update Coordinator, and the coordinator issues updates so the pointer chain across PS1–PS4 remains intact]

  9. Implementation • Provenance Store: implemented as a Java Servlet; backend store (Berkeley DB); disk cache: OS buffers are flushed to disk before an ack is sent to the actor; Update plug-in • Client-Side Library: remedial actions that cope with failures; multithreading for the creation and recording of p-assertions; a local file store (Berkeley DB) for temporarily maintaining p-assertions • Update Coordinator: implemented as a Java Servlet; Berkeley DB is also employed to maintain request information
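
The "flush before ack" behaviour of the provenance store can be sketched as below. For brevity this uses a plain file and FileChannel.force() rather than the Berkeley DB backend, and the DurableRecorder name is illustrative.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

class DurableRecorder {
    private final FileOutputStream out;
    private final FileChannel channel;

    DurableRecorder(File storageFile) throws IOException {
        this.out = new FileOutputStream(storageFile, true);  // append mode
        this.channel = out.getChannel();
    }

    // Returns (i.e. lets the servlet send its ack) only after the write has
    // reached persistent storage, so an acknowledged p-assertion survives an
    // OS crash of the provenance store host.
    void recordAndAck(String serialisedBatch) throws IOException {
        out.write(serialisedBatch.getBytes("UTF-8"));
        channel.force(true);  // flush OS buffers (data and metadata) to disk
    }

    void close() throws IOException {
        channel.close();
        out.close();
    }
}
```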

  10. Performance Study • Throughput of provenance store and coordinator • Scalability of update coordinator • Failure-free recording performance • Overhead of taking remedial actions • Performance impact on application

  11. Experimental Setup • Iridis cluster (over 1000 processor-cores) • Gigabit Ethernet • Tomcat 5.0 container • Berkeley DB Java Edition database • Java 1.5 • A generator is used on an actor's side to inject random failure events: failure to submit a batch of p-assertions to a provenance store; failure to receive an acknowledgement from a provenance store before a timeout • A failure event is generated based on a failure rate, i.e., the ratio of failure events to the total number of recordings
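
A minimal sketch of such a failure generator, assuming a hypothetical FailureGenerator class: each recording attempt is turned into one of the two failure events above with probability equal to the failure rate.

```java
import java.util.Random;

class FailureGenerator {
    enum Event { NONE, SEND_FAILURE, ACK_TIMEOUT }

    private final double failureRate;  // fraction of recordings that fail
    private final Random random = new Random();

    FailureGenerator(double failureRate) {
        this.failureRate = failureRate;
    }

    // Decide, for one recording attempt, whether to inject a failure event.
    Event next() {
        if (random.nextDouble() >= failureRate) {
            return Event.NONE;
        }
        // Split injected failures between the two kinds of events.
        return random.nextBoolean() ? Event.SEND_FAILURE : Event.ACK_TIMEOUT;
    }
}
```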

  12. 1. Provenance Store (PS) Throughput • Setup: up to 512 clients sending 10k p-assertions to 1 PS in 10 min • Hypothesis: the disk cache may reduce a provenance store's throughput • Result: 20% decrease in throughput

  13. 2. Coordinator Throughput • Setup: up to 512 clients sending 100 requests to 1 coordinator in 10 min • Hypothesis: the coordinator's throughput is high • Result: 30,000*100 repair requests accepted in 10 min

  14. 3. Throughput Experiment with Failures (1 client) • Setup: 1 client sending 10k p-assertions to 1 PS; 1 alternative PS and 1 coordinator used in the case of failures • Hypothesis: (a) resending to the same PS is preferred over an alternative PS for transient failures; (b) the update coordinator is not a bottleneck • Analysis: a client sends at most 200*100 repair requests (the maximum is seen at a failure rate of 50%); coordinator throughput is 30,000*100 requests/10 min; this implies that the coordinator can support a large number of clients (50 - 100?) without being a bottleneck

  15. 4. Throughput Experiment with Failures (128 clients) • Setup: 128 clients sending 10k p-assertions to 1 PS; 1 alternative PS and 1 coordinator used in the case of failures • Hypothesis: (a) resending to an alternative PS is preferred over the same PS; (b) the coordinator is not a bottleneck • Analysis: 128 clients send at most 750*100 repair requests (the maximum is seen at a failure rate of 50%); coordinator throughput is 30,000*100 requests/10 min; this implies that the coordinator can support a large number of clients without being a bottleneck

  16. 5. Failure-free Recording Performance • Setup: 1 client recording 10,000 p-assertions (10 kB each) to 1 PS; 100 p-assertions shipped in a single batch • Hypothesis: the disk cache causes overhead • Results: (a) with PReP, 900 p-assertions (10 kB each) may be lost if the PS's OS crashes; (b) 13.8% overhead compared to PReP

  17. 6. Overhead of Taking Remedial Actions • Setup: 1 client recording 100 p-assertions to 1 PS; 1 alternative PS and 1 coordinator used in the case of failures • Hypothesis: remedial actions have acceptable overhead • Result: <10% overhead compared to the failure-free recording time

  18. 7. Performance Impact on Application • Amino Acid Compressibility Experiment (ACE) • High-performance and fine-grained, thus representative • One run of ACE: 20 parallel jobs; 54,000 interactions/job • Extremely detailed process documentation: 1.08 GB of p-assertions per job in 25 minutes

  19. Recording Performance in ACE • Setup: 5 PS and 1 coordinator; multithreading for the creation and recording of p-assertions • Hypothesis: F-PReP has acceptable recording overhead • Results: (a) similar overhead to PReP (12%) on application performance when no failure occurs; (b) timeout and queue management affect performance

  20. Impact of Queue Management on Performance • Hypothesis: flow control on the queue affects performance • Conclusions: (a) the result supports our hypothesis; (b) we can monitor the queue and take actions, e.g., employing the local file store
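
One way to realise conclusion (b) is sketched below, with hypothetical RecordingQueue and LocalFileStore names: a bounded queue sits between the application and the recording thread, and overflow is spilled to the local file store mentioned on the implementation slide instead of blocking the application.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class RecordingQueue {
    interface LocalFileStore {
        void save(String pAssertion);  // backed by Berkeley DB in the implementation
    }

    private final BlockingQueue<String> queue;
    private final LocalFileStore overflow;

    RecordingQueue(int capacity, LocalFileStore overflow) {
        this.queue = new ArrayBlockingQueue<String>(capacity);
        this.overflow = overflow;
    }

    // Called by the application thread for every new p-assertion.
    void enqueue(String pAssertion) {
        if (!queue.offer(pAssertion)) {   // queue full: apply flow control
            overflow.save(pAssertion);    // temporarily keep it in the local file store
        }
    }

    // Called by the recording thread before shipping the next batch.
    String dequeue() throws InterruptedException {
        return queue.take();
    }
}
```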

  21. 8. Quality of Recorded Process Documentation • Setup: using F-PReP and PReP to record p-assertions; querying the PS to verify the recorded documentation • Results: (a) PReP: incomplete; F-PReP: complete; (b) PReP: irretrievable; F-PReP: retrievable

  22. Conclusions & Future Work • The coordinator does not affect an actor's recording performance • In an application, F-PReP has a similar recording overhead to PReP when there is no failure • Although it introduces overhead in the presence of failures, we believe the overhead is acceptable, given that it records high-quality (i.e., complete and retrievable) process documentation • We are currently investigating how to create process documentation when an application has its own fault tolerance schemes for application-level failures • In future work, we plan to make use of the process documentation recorded in the presence of failures to diagnose failures

  23. Questions? Thank you!
