ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi, and Todd C. Mowry
Software Errors & Analysis Tools
• Errors abundant in parallel software
  • Program crashes/vulnerabilities, limited performance
• Three main categories of analysis tools
  • Checking before, during, or after program execution
• Instruction-grain Lifeguards
  • Online detailed analysis, but with high overhead
  • Several tools available, but mostly support for single-threaded code
ParaLog: a framework for efficient analysis of parallel applications
ASPLOS '10 - ParaLog
Lifeguards and Parallel Applications
• Timesliced Execution & Analysis: DBI tools available today
  - high overhead due to serialization
• Parallel Execution & Analysis, Butterfly Analysis (previous talk): windows of uncertainty
  + software-based
  - some false positives
• Parallel Execution & Analysis, ParaLog (this talk): precise application order
  + no false positives
  + even better performance
  - new hardware required
[Figure: application threads over time under each scheme]
Low-Overhead Instruction-level Analysis [Chen et al., ISCA’08]
[Figure: online monitoring platform. The application thread runs on the application core; event capturing produces an event stream that event delivery feeds to the lifeguard thread on the lifeguard core, which maintains the metadata. Accelerators: IT, IF, MTLB.]
Example: the application instruction "add r1 r2, r4" is delivered as the event "add, r1, r2, r4" and handled by:
  add_handler() {
    i = load_state(r2);
    j = load_state(r4);
    if (check(i, j)) upd_state(r1);
    else error();
  }
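The handler on this slide can be sketched in Python as follows. This is only an illustration, not the LBA implementation: the event tuple layout, the `state` dictionary, the `HANDLERS` dispatch table, and the rule used for `check` (both sources must be marked initialized) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed event record layout: (opcode, destination, source 1, source 2).
Event = Tuple[str, str, str, str]


def add_handler(state: Dict[str, bool], event: Event, errors: List[Event]) -> None:
    """Mirror the slide's add_handler: load source metadata, check, then update or report."""
    _, dest, src1, src2 = event
    i = state.get(src1, False)  # load_state(r2)
    j = state.get(src2, False)  # load_state(r4)
    if i and j:                 # check(i, j): assumed rule, both sources initialized
        state[dest] = True      # upd_state(r1)
    else:
        errors.append(event)    # error()


# Hypothetical dispatch table: one handler per opcode in the event stream.
HANDLERS = {"add": add_handler}


def run_lifeguard(stream: List[Event], state: Dict[str, bool]) -> List[Event]:
    """Consume the event stream in order and return the events that failed their check."""
    errors: List[Event] = []
    for event in stream:
        HANDLERS[event[0]](state, event, errors)
    return errors


state = {"r2": True, "r4": True}
stream = [("add", "r1", "r2", "r4"), ("add", "r5", "r1", "r9")]
failed = run_lifeguard(stream, state)
```

The first event passes the check and marks `r1`; the second reads `r9`, which has no metadata, so it is reported instead of updating `r5`.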
Challenges in Parallel Monitoring [ParaLog]
[Figure: online parallel monitoring platform. Each application thread 1..k has its own event capturing and event stream, delivered to a matching lifeguard thread 1..k with its own accelerators (IT, IF, MTLB). The lifeguard threads share global metadata.]
Addressing the Challenges [ParaLog]
• Application event ordering
• Ensuring metadata access atomicity efficiently
• Parallelizing hardware accelerators
[Figure: online parallel monitoring platform. Each application thread 1..k feeds an event stream (event capturing, application-only order capturing) to its lifeguard thread (event delivery, order enforcing; accelerators: IT, IF, MTLB). Dependence arcs link the streams, and the lifeguard threads share global metadata.]
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
  • Capturing & enforcing application event ordering
  • Ensuring metadata access atomicity
  • Parallelizing hardware accelerators
• Evaluation
• Conclusions
Event Ordering: the Problem
• Case study: information flow analysis (e.g., TaintCheck)
[Figure: in application time, thread j executes store(A) before thread k executes load(A). In lifeguard time, st_handler(A) and ld_handler(A) run on different lifeguard threads, so nothing guarantees that st_handler(A) runs first.]
Expose happens-before information to lifeguards
Event Ordering: the Solution (1/2)
• Coherence-based ordering of application events
  • Similar to FDR, but online, focusing on application-only events
[Figure: application thread j executes store(A) at time tj; thread k executes load(A) at time tk, and the load is recorded with the dependence arc {thread j, tj}. On the lifeguard side, each thread publishes a progress counter (progressj, progressk) that advances as its handlers complete. Lifeguard thread k waits while progressj < tj, so ld_handler(A) runs only after st_handler(A).]
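The "wait while progressj < tj" rule can be sketched with ordinary threads. This is a software analogy under stated assumptions, not ParaLog's hardware mechanism: the `Progress` class, the event tuple layout, and the use of a condition variable are all hypothetical choices for the example. Timestamps start at 1, so a progress value of 0 means "nothing handled yet".

```python
import threading
from typing import List, Optional, Tuple

Arc = Tuple[int, int]                   # (producer thread id, producer timestamp)
Event = Tuple[int, str, Optional[Arc]]  # (timestamp, handler name, incoming arc)


class Progress:
    """Per-lifeguard-thread progress counters, published after each handled event."""

    def __init__(self, n_threads: int) -> None:
        self._done = [0] * n_threads
        self._cv = threading.Condition()

    def publish(self, tid: int, timestamp: int) -> None:
        with self._cv:
            self._done[tid] = timestamp
            self._cv.notify_all()

    def wait_for(self, tid: int, timestamp: int) -> None:
        # Block while progress[tid] < timestamp.
        with self._cv:
            self._cv.wait_for(lambda: self._done[tid] >= timestamp)


def lifeguard(tid: int, events: List[Event], progress: Progress, order: List[str]) -> None:
    for timestamp, name, arc in events:
        if arc is not None:
            progress.wait_for(*arc)       # enforce the dependence arc
        order.append(name)                # stand-in for running the handler
        progress.publish(tid, timestamp)  # advance this thread's progress


progress = Progress(2)
order: List[str] = []
thread_j = threading.Thread(
    target=lifeguard, args=(0, [(1, "st_handler(A)", None)], progress, order))
thread_k = threading.Thread(
    target=lifeguard, args=(1, [(1, "ld_handler(A)", (0, 1))], progress, order))
thread_k.start()  # start the consumer first to show that it really waits
thread_j.start()
thread_j.join()
thread_k.join()
```

Even though thread k starts first, its handler is held back until thread j has published timestamp 1, so `order` always records the store handler before the load handler.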
Event Ordering: the Solution (2/2)
Is monitoring coherence enough?
• Previous work has not solved the problem of Logical Races
• Both logical races and system calls resolved with Conflict Alert messages
[Figure: in application time, thread j executes free(A) (from free(A) start to free(A) end), which changes Metadata(A), while thread k executes load(A): a logical race. A Conflict Alert message creates a dependence between free(A) on one lifeguard thread and ld_handler(A) on the other.]
Metadata Atomicity
• Frequent use of locking too expensive
  • # of instructions added & synchronization cost
• Dependence arcs handle the majority of the cases
  • Sufficient conditions:
    • One-to-one data-to-metadata mapping
    • Application reads don’t become metadata writes
  • Enforcing dependence arcs gives race-free operation
• Rest of the cases handled by acquiring a lock
  • Lock used only in the load_handler(); other handlers safe
(more details in the paper)
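As a rough illustration of "lock used only in the load_handler()", the sketch below shows a load handler that stays lock-free when it only reads metadata and takes a lock only when an application read has to become a metadata write. The metadata layout (a set of reader thread ids per address) and the `LoadHandler` class are assumptions for the example; they are not taken from any ParaLog lifeguard.

```python
import threading
from typing import Dict, Set


class LoadHandler:
    """Load handler with a synchronization-free fast path and a locked slow path."""

    def __init__(self) -> None:
        # Assumed metadata: address -> ids of threads that have read it.
        self.metadata: Dict[int, Set[int]] = {}
        self._lock = threading.Lock()
        self.slow_path_writes = 0

    def on_load(self, tid: int, addr: int) -> None:
        readers = self.metadata.get(addr)
        if readers is not None and tid in readers:
            return  # fast path: metadata read only, no lock taken
        with self._lock:  # slow path: this read turns into a metadata write
            self.metadata.setdefault(addr, set()).add(tid)
            self.slow_path_writes += 1


handler = LoadHandler()


def reader(tid: int) -> None:
    for _ in range(1000):
        handler.on_load(tid, 0x1000)


threads = [threading.Thread(target=reader, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each of the four threads takes the slow path exactly once, on its first load, and every later load is served by the lock-free check.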
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
  • Metadata-TLB: fast metadata address calculation
  • Idempotent Filters: filter out redundant checking
  • Inheritance Tracking: fast tracking of dataflow paths
• Accelerators have only local view of the analysis
  • Cache locally analysis information (e.g., frequent events)
  • Important events have application-wide effects (e.g., free())
• Coherence-like issues with accelerators’ local state
  • Important events accompanied by Conflict Alerts
  • Use Conflict Alerts to flush accelerators’ state
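The interaction between Idempotent Filters and Conflict Alerts can be sketched as below. This is a simplified software model under stated assumptions: the real filter is a small hardware structure with limited capacity, whereas the hypothetical `IdempotentFilter` here uses an unbounded set, and `conflict_alert` simply flushes every filter it is given.

```python
from typing import Iterable, Set, Tuple

Event = Tuple[str, str]  # assumed layout: (kind, address), e.g. ("R", "A")


class IdempotentFilter:
    """Drops events whose check was already performed since the last flush."""

    def __init__(self) -> None:
        self._seen: Set[Event] = set()

    def deliver(self, event: Event) -> bool:
        """Return True if the event must reach the lifeguard, False if redundant."""
        if event in self._seen:
            return False
        self._seen.add(event)
        return True

    def flush(self) -> None:
        self._seen.clear()


def conflict_alert(filters: Iterable[IdempotentFilter]) -> None:
    """Model of an important event such as free(): flush local and remote filters."""
    for f in filters:
        f.flush()


if_0, if_1 = IdempotentFilter(), IdempotentFilter()
first = if_1.deliver(("R", "A"))         # first read of A: delivered
second = if_1.deliver(("R", "A"))        # repeated read of A: filtered out
conflict_alert([if_0, if_1])             # free(A) observed on another thread
after_flush = if_1.deliver(("R", "A"))   # must be checked again after the flush
```

Without the flush, the last read of A would be discarded as redundant even though free(A) has changed what the check should report.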
Experimental Framework
• Log-Based Architectures framework
  • Simics full-system simulation
  • CMP system with {2, 4, 8, 16} cores
  • {1, 2, 4, 8} application threads, each paired with a lifeguard thread
  • Sequentially Consistent memory model
• Benchmarks and multithreaded Lifeguards used
  • SPLASH-2 and PARSEC
  • TaintCheck: information flow tracking; accelerated by M-TLB, IT
  • AddrCheck: memory access checking; accelerated by M-TLB, IF
• Comparison with Timesliced Monitoring
Performance Results: AddrCheck
[Chart: results normalized to sequential, unmonitored execution; 8 application/lifeguard threads, 16 cores total]
Performance Results: AddrCheck
[Chart; data labels as extracted: 15.4, 1.9, 9.5, 6.1, 6.7, 2.9, 2.3, 2.1, 6.2, 1.9, 2.4, 1.7]
• Timesliced Monitoring is not scalable
  • On average 15x slowdown over No Monitoring (8 threads)
Performance Results: AddrCheck
• Highest overhead with 8 threads: SWAPTIONS 6x
• Lowest overhead with 8 threads: < 5%
• Average overhead with 8 threads: 26%
Performance Results: TaintCheck
[Chart; data labels as extracted: 10, 4.6, 1.7, 2.9, 2.1, 12.9, 11.5, 15.7, 1.9, 6.6, 1.9, 2.4, 2.8, 1.7]
• Timesliced Monitoring is not scalable
  • On average 23x slowdown over No Monitoring (8 threads)
Performance Results: TaintCheck
• Highest overhead with 8 threads: BARNES 2.6x
• Lowest overhead with 8 threads: LU 5%
• Average overhead with 8 threads: 48%
Other Results in the Paper
• Order capturing and order enforcing under TSO
• Performance impact of lifeguard accelerators
  • AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]
• A less expensive order capturing mechanism gets similar performance results
  • 1 timestamp per core vs. 1 timestamp per cache block
Conclusions
• ParaLog: fast and precise parallel monitoring
• Components of event ordering
  • Normal memory accesses: monitor coherence activity
  • Logical races: use of Conflict Alert messages
• Metadata Atomicity
  • Enforcing dependence arcs ensures atomicity (most cases)
• Parallel Hardware Accelerators
  • Flush local state on remote events (Conflict Alert)
• Average overhead is relatively low
  • AddrCheck: 26% and TaintCheck: 48% (8 threads)
Questions?
Backup Slides
Metadata Atomicity
• Synchronization-free fast path vs. slow path
• Concurrent application reads: no ordering available!
  • Concurrent metadata reads: follow the fast path
  • Concurrent metadata writes: follow the slow path, acquiring a lock
  • Concurrent metadata read and write: read may get either value
• In any other case dependence arcs are available
[Figure labels: LockSet, AddrCheck, TaintCheck, MemCheck]
Parallel Hardware Accelerators
• Accelerators have only local view of the analysis
• Important events have system-wide effects
• Case study: Idempotent Filters and AddrCheck
[Figure: two lifeguards (LG 0 and LG 1), each behind its own Idempotent Filter (IF), receive streams of reads such as R(A), R(B), R(C). Events marked ✔ are delivered to the lifeguard; events marked ✖ are redundant and discarded. A free(A) flushes the local and remote IF filters. Builds on remote conflict messages.]
• Details for parallel M-TLB and IT can be found in the paper
Performance Impact of Lifeguard Accelerators (TaintCheck)
• Accelerators provide a major speedup [2x – 9x]
[Chart; data labels as extracted: 11.3, 6.8, 7.3, 9.4]
Performance Impact of Lifeguard Accelerators (AddrCheck)
• Accelerators provide a major speedup [1.13x – 3.4x]
Transitive Reduction Sensitivity Study
• Limited transitive reduction
  • No major performance impact; savings in chip area
Supporting Total Store Order (TSO)
• Cycle of dependencies in relaxed memory models
  • TSO relaxes the RAW ordering
• Previous work (RTR): maintain versions of data
  • Identify SC-offending instructions; save loaded value
• This paper: maintain versions of metadata
[Figure: memory order is commit order. Thread 0 executes Wr(A), then Rd(B); Thread 1 executes Wr(B), then Rd(A).
  Log 0: P(v1, A), Wr(A), C(v0, B), Rd(B, v0)
  Log 1: P(v0, B), Wr(B), C(v1, A), Rd(A, v1)
  Lifeguard 0: produce_version(v1, A), store_handler(A), wait_until_available(v0, B), load_handler(B, v0)]
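The idea of keeping versions of metadata can be sketched as follows. The method names `produce_version`, `wait_until_available`, and `store_handler` come from the slide, but their signatures and semantics here are assumptions: in this sketch a produced version is a snapshot of the metadata taken before a store handler overwrites it, and a consumer blocks until the named version exists. The `VersionedMetadata` class and the string metadata values are hypothetical.

```python
import threading
from typing import Dict, List, Optional, Tuple


class VersionedMetadata:
    """Current metadata plus named snapshots that remote load handlers can consume."""

    def __init__(self) -> None:
        self.current: Dict[str, str] = {}
        self._versions: Dict[Tuple[str, str], Optional[str]] = {}
        self._cv = threading.Condition()

    def produce_version(self, version: str, addr: str) -> None:
        """Snapshot the current metadata of addr under the given version name."""
        with self._cv:
            self._versions[(version, addr)] = self.current.get(addr)
            self._cv.notify_all()

    def wait_until_available(self, version: str, addr: str) -> Optional[str]:
        """Block until the version has been produced, then return its metadata."""
        with self._cv:
            self._cv.wait_for(lambda: (version, addr) in self._versions)
            return self._versions[(version, addr)]

    def store_handler(self, addr: str, value: str) -> None:
        with self._cv:
            self.current[addr] = value


meta = VersionedMetadata()
meta.current["B"] = "untainted"
seen: List[Optional[str]] = []

# Load handler for a read of B that observed the value from before the remote write.
consumer = threading.Thread(
    target=lambda: seen.append(meta.wait_until_available("v0", "B")))
consumer.start()

meta.produce_version("v0", "B")    # save the old metadata of B first
meta.store_handler("B", "tainted")  # then let the store handler overwrite it
consumer.join()
```

The load handler sees the metadata that matches the value its load actually returned, even though the store handler has already moved the current metadata forward.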
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
  • Fast metadata address calculation – Metadata-TLB
  • Fast tracking of data-flow paths – Inheritance Tracking
  • Filter out redundant checking – Idempotent Filters
    • Per-instruction checking gives the same result; cache event
• Accelerators have only local view of the analysis
  • Important events have system-wide effects (e.g., free())
• Coherence-like issues with accelerators’ local state
  • Important events accompanied by Conflict Alerts
  • Use Conflict Alerts to flush state and deliver pending events
Experimental Framework
Relative Slowdown - TaintCheck
Relative Slowdown - AddrCheck
[Chart; labels as extracted: 3.0, 6.0]
Performance Results - AddrCheck
[Chart; data labels as extracted: 15.4, 1.9, 9.5, 6.1, 6.7, 2.9, 2.3, 2.1, 6.2, 1.9, 2.4, 1.7]
Performance Results - TaintCheck
[Chart; data labels as extracted: 10, 4.6, 1.7, 2.9, 2.1, 12.9, 11.5, 15.7, 1.9, 6.6, 1.9, 2.4, 2.8, 1.7]
Parallel Hardware Accelerators
• Speed up frequent lifeguard actions
  • Metadata-TLB & Inheritance Tracking (discussed in the paper)
  • Idempotent Filters: identify and filter out redundant checking
    • Per-instruction checking gives the same result
    • Cache incoming event and local state to identify redundancy
• Accelerators have only local view of the analysis
  • Important events have application-wide effects (e.g., free())
• Coherence-like issues with accelerators’ local state
  • Important events accompanied by Conflict Alerts
  • Use Conflict Alerts to flush accelerators’ state