Kendo: Efficient Deterministic Multithreading in Software

Marek Olszewski, Jason Ansel, Saman Amarasinghe. Commit Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology.


Presentation Transcript


  1. Kendo: Efficient Deterministic Multithreading in Software
     • Marek Olszewski, Jason Ansel, Saman Amarasinghe
     • Commit Group, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

  2. Example
     • Simple OpenMP Parallel Code:
         double inv_sum = 0.0;
         /* The reduction variable must be the accumulator, inv_sum. */
         #pragma omp parallel for reduction(+: inv_sum)
         for (int i = 1; i < 10000000; i++)
           inv_sum += 1.0 / i;
         printf("inv_sum: %.64g\n", inv_sum);


  4. Example
     • Simple OpenMP Parallel Code:
         double inv_sum = 0.0;
         #pragma omp parallel for reduction(+: inv_sum)
         for (int i = 1; i < 10000000; i++)
           inv_sum += 1.0 / i;
         printf("inv_sum: %.64g\n", inv_sum);
     • [Diagram: threads compute partial sums; the reductions are combined in a critical section]
     • Run 1: inv_sum: 16.69531126586006308798459940589964389801025390625
     • Run 2: inv_sum: 16.695311265860066640698278206400573253631591796875

  5. Another Example
     • Common parallel programming paradigm:
       • Radiosity (Singh et al. 1994)
       • LocusRoute (Rose 1988)
       • Delaunay Triangulation (Kulkarni et al. 2008)
     • Threads perform repeated well-synchronized updates to global state
     • [Diagram: threads enter lock-protected critical sections and apply non-commutative updates to data items in a global data structure]

  6. Another Example
     • Non-deterministic internal states and output
     • Difficult to eliminate using today’s programming idioms
     • [Diagram: threads enter lock-protected critical sections and apply non-commutative updates to data items in a global data structure]

  7. Non-Determinism
     • Hard to create programs with repeatable results
     • Determinism is often part of program specifications, e.g.:
       • Don’t want a Verilog compiler to generate different circuits every time
       • Multi-threaded replicas in fault-tolerant systems must be deterministic

  8. Non-Determinism
     • Hard to create programs with repeatable results
     • Determinism is often part of program specifications, e.g.:
       • Don’t want a Verilog compiler to generate different circuits every time
       • Multi-threaded replicas in fault-tolerant systems must be deterministic
     • Debugging becomes more difficult
       • Heisenbugs
       • Difficult to perform cyclic debugging
         • Common debugging method used for sequential programs
     • Testing offers weak guarantees
       • Will the code pass the test again?

  9. Deterministic Execution Model
     • Non-determinism causes many problems
       • Why do we put up with it?
     • Present parallel programmer with deterministic execution model
       • Interleave critical sections deterministically
       • Only allow one interleaving
       • Find a good interleaving that preserves the parallel performance

  10-12. Token Algorithm
     • [Diagram: Thread 1 and Thread 2 on a thread-progress axis; threads racing to acquire Lock A]

  13. Token Algorithm
     • [Diagram: Thread 1 and Thread 2 on a thread-progress axis]

  14. Token Algorithm
     • [Diagram: a thread calls det_lock(A)]

  15-17. Token Algorithm
     • [Diagram: a token is shown; det_lock(A) expands to wait_for_token(); lock(A); pass_token()]

  18-20. Token Algorithm
     • [Diagram: both threads call det_lock(A); each call expands to wait_for_token(); lock(A); pass_token()]

  21-22. Token Algorithm
     • [Diagram: both threads have called det_lock(A); one expansion wait_for_token(); lock(A); pass_token() remains]

  23. Token Algorithm
     • [Diagram: det_lock(A) expands to wait_for_token(); lock(A); pass_token(), followed by det_unlock(A)]
     • Guarantees that thread 1 will always acquire lock before thread 2

  24. Token Algorithm
     • Load imbalance!
     • Allow threads to pass token outside of critical sections
     • High overhead!
     • Too much serialization!

  25. What Do We Need?
     • Method of tracking thread progress
       • Must be deterministic
       • Must match true progress of thread in physical time as closely as possible
       • Must be cheap to compute
     • Ability to pass the token in advance (before it is received)
       • Decouples threads

  26. Logical Time Algorithm
     • Each thread keeps a low-overhead counter called its “Logical Clock”
       • Incremented often, and in a way that tries to match progress of thread as closely as possible
     • Clocks collectively create a notion of “logical time”
       • Abstract counterpart to physical time
     • Threads take turns holding a “virtual token”
       • Thread’s turn when its clock is a global minimum
     • Thread passes the “virtual token” by incrementing its clock
       • Does not have to wait for its turn
       • Allows threads to execute asynchronously while outside of critical sections

  27. Logical Time Algorithm
     • [Diagram: Thread 1 and Thread 2 plotted against physical time and deterministic logical time; clock labels t=3, t=3; threads racing to acquire Lock A]

  28. Logical Time Algorithm
     • [Diagram: clock labels t=6, t=8; threads racing to acquire Lock A]

  29. Logical Time Algorithm
     • [Diagram: clock labels t=8, t=18; a thread calls det_lock(A)]

  30. Logical Time Algorithm
     • [Diagram: clock labels t=8, t=18; det_lock(A) expands to wait_for_turn(); lock(A);]

  31. Logical Time Algorithm
     • [Diagram: clock labels t=16, t=18; det_lock(A) expands to wait_for_turn(); lock(A);]

  32. Logical Time Algorithm
     • [Diagram: clock labels t=18, t=20; det_lock(A) expands to wait_for_turn(); lock(A);]

  33. Logical Time Algorithm
     • [Diagram: clock labels t=20, t=22]

  34-35. Logical Time Algorithm
     • [Diagram: clock labels t=22, t=24; det_lock(A) expands to wait_for_turn(); lock(A);]

  36. Logical Time Algorithm
     • [Diagram: clock labels t=22, t=26; det_lock(A) expands to wait_for_turn(); lock(A); det_unlock(A) is shown]

  37. Logical Time Algorithm
     • [Diagram: clock labels t=24, t=29]

  38. Logical Time Algorithm
     • [Diagram: Thread 1 and Thread 2 plotted against physical time and deterministic logical time; clock labels t=3, t=3; threads racing to acquire Lock A]

  39. Logical Time Algorithm
     • [Diagram: clock labels t=6, t=8; threads racing to acquire Lock A]

  40. Logical Time Algorithm
     • [Diagram: clock labels t=8, t=18; threads racing to acquire Lock A]

  41-42. Logical Time Algorithm
     • [Diagram: clock labels t=16, t=22; det_lock(A) expands to wait_for_turn(); lock(A);]

  43-45. Logical Time Algorithm
     • [Diagram: clock labels t=18, t=22; both threads call det_lock(A); each call expands to wait_for_turn(); lock(A);]

  46. Logical Time Algorithm
     • [Diagram: clock labels t=20, t=22; det_lock(A) expands to wait_for_turn(); lock(A);]

  47-48. Logical Time Algorithm
     • [Diagram: clock labels t=22, t=23; det_lock(A) expands to wait_for_turn(); lock(A);]

  49. Logical Time Algorithm
     • [Diagram: clock labels t=22, t=26; det_lock(A) expands to wait_for_turn(); lock(A); det_unlock(A) is shown]

  50. Logical Time Algorithm
     • [Diagram: clock labels t=24, t=29]
     • Guarantees that thread 1 will always acquire lock before thread 2
