Programming, Debugging, Profiling and Optimizing Transactional Memory Applications
PhD Thesis Proposal
Department of Computer Architecture, Universitat Politècnica de Catalunya – BarcelonaTech
Barcelona Supercomputing Center
Ferad Zyulkyarov
01 July 2010

Publications
• Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero, Discovering and Understanding Performance Bottlenecks in Transactional Applications, PACT'10
• Ferad Zyulkyarov, Tim Harris, Osman Unsal, Adrian Cristal, Mateo Valero, Debugging Programs that use Atomic Blocks and Transactional Memory, PPoPP'10
• Vladimir Gajinov, Ferad Zyulkyarov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory, ICS'09
• Ferad Zyulkyarov, Vladimir Gajinov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server, PPoPP'09
• Ferad Zyulkyarov, Sanja Cvijic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, WormBench - A Configurable Workload for Evaluating Transactional Memory Systems, MEDEA'09
• Ferad Zyulkyarov, Milos Milovanovic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Memory Management for Transaction Processing Core in Heterogeneous Chip-Multiprocessors, OSHMA'09
• Milos Milovanovic, Osman Unsal, Adrian Cristal, Ferad Zyulkyarov, Mateo Valero, Compiler Support for Using Transactional Memory in C/C++ Applications, INTERACT'07

Work Plan
[Timeline figure; segment durations as extracted: 12m, 11m, 21m, 10m, 15m, 9.5m, 7m, 2m; date marker: 01/10/2010]

Transactional Memory

atomic {
  statement1;
  statement2;
  statement3;
  statement4;
  ...
}

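The intended meaning of an atomic block can be approximated without any TM support. The following is a minimal sketch, assuming a single global lock as a stand-in for `atomic { ... }`; the macros `ATOMIC_BEGIN`/`ATOMIC_END`, the type `accounts_t` and the function `transfer` are hypothetical names introduced only for illustration. A real TM system would run such blocks optimistically and roll back on conflict instead of serializing them.

```c
#include <pthread.h>

/* Hypothetical stand-in for `atomic { ... }`: a single global lock.
   It gives the same all-or-nothing visibility as an atomic block,
   but none of the optimistic concurrency of a real TM system. */
static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

#define ATOMIC_BEGIN() pthread_mutex_lock(&atomic_lock)
#define ATOMIC_END()   pthread_mutex_unlock(&atomic_lock)

/* Illustrative shared state (not from the slides). */
typedef struct {
    long from;
    long to;
} accounts_t;

/* Moves `amount` from one balance to the other. Other threads that also
   use ATOMIC_BEGIN/ATOMIC_END never observe the intermediate state in
   which the amount has left `from` but has not yet reached `to`. */
void transfer(accounts_t *a, long amount) {
    ATOMIC_BEGIN();
    a->from -= amount;   /* plays the role of statement1 */
    a->to   += amount;   /* plays the role of statement2 */
    ATOMIC_END();
}
```

The programmer only marks the region; which locks or which TM mechanism enforce it is left to the implementation.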
The Big Questions
• Is programming with TM easy?
• Is TM competitive with locks?
• Are existing development tools sufficient?

Atomic Quake
• Parallel Quake game server
• All locks are replaced with atomic blocks
• 27,400 LOC of C code in 56 files
• Rich transactional application
  • 63 atomic blocks
• Rich uses of atomic blocks
  • Library calls, I/O, error handling, memory allocation, failure atomicity
• Various transactional characteristics
• A workload to drive research in TM

Is Programming with TM Easy?
• Yes, in large applications that have many shared objects and need efficient fine-grained synchronization.
• Example: region-based locking in tree data structures and graphs.

Where Do Transactions Fit?
Guarding different types of objects with separate locks:

switch (object->type) {              /* Lock phase */
  case KEY:    lock(key_mutex);    break;
  case LIFE:   lock(life_mutex);   break;
  case WEAPON: lock(weapon_mutex); break;
  case ARMOR:  lock(armor_mutex);  break;
}

pick_up_object(object);

switch (object->type) {              /* Unlock phase */
  case KEY:    unlock(key_mutex);    break;
  case LIFE:   unlock(life_mutex);   break;
  case WEAPON: unlock(weapon_mutex); break;
  case ARMOR:  unlock(armor_mutex);  break;
}

The same operation with an atomic block:

atomic {
  pick_up_object(object);
}

Is TM Competitive with Locks?
• No.
• 4-5x slowdown in the single-threaded version.
• But it promises to become competitive because of the good scalability obtained.
[Plot annotations: scales OK up to 4 threads; sudden increase in aborts.]

Are Existing Tools Sufficient?
• No.
• We need:
  • Richer language-level primitives and integration.
  • Mechanisms to handle I/O.
  • Dynamic error handling.
  • Debuggers.
  • Profilers.

Unstructured Use of Locks

Locks:

for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
  <statements1>
  LOCK(cl_msg_lock[c - svs.clients]);
  <statements2>
  if (!c->send_message) {
    <statements3>
    UNLOCK(cl_msg_lock[c - svs.clients]);
    <statements4>
    continue;
  }
  <statements5>
  if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
    <statements6>
    UNLOCK(cl_msg_lock[c - svs.clients]);
    <statements7>
    continue;
  }
  <statements8>
  if (c->state == cs_spawned) {
    if (frame_threads_num > 1) LOCK(par_runcmd_lock);
    <statements9>
    if (frame_threads_num > 1) UNLOCK(par_runcmd_lock);
  }
  UNLOCK(cl_msg_lock[c - svs.clients]);
  <statements10>
}

Atomic block:

bool first_if = false;
bool second_if = false;
for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
  <statements1>
  atomic {
    <statements2>
    if (!c->send_message) {
      <statements3>
      first_if = true;
    } else {
      <statements5>
      if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
        <statements6>
        second_if = true;
      } else {
        <statements8>
        if (c->state == cs_spawned) {
          if (frame_threads_num > 1) {
            atomic {
              <statements9>
            }
          } else {
            <statements9>
          }
        }
      }
    }
  }
  if (first_if) {
    <statements4>
    first_if = false;
    continue;
  }
  if (second_if) {
    <statements7>
    second_if = false;
    continue;
  }
  <statements10>
}

Problems with the atomic block version: extra variables and code, complicated conditional logic.
Solution: explicit "commit".

Various Transactional Characteristics
[Table: per-atomic block runtime statistics from Atomic Quake]
• Different execution frequency -> phased behavior.
• Very small transactions.
• Very large transactions.
• The most frequent atomic block is read-only.
• Control flow does not reach all atomic blocks.

Debugging Transactional Applications
• Existing debuggers are not aware of atomic blocks and transactional memory.
• New principles and approaches:
  • Debugging atomic blocks atomically
  • Debugging at the level of transactions
  • Managing transactions at debug-time
• Extension for WinDbg to debug programs with atomic blocks

Atomicity in Debugging
• Step over atomic blocks as if each were a single instruction.
• Abstracts whether atomic blocks are implemented with TM or lock inference.
• Good for debugging synchronization errors at the granularity of atomic blocks rather than individual statements inside the atomic blocks.
• Debugging becomes frustrating when a transaction aborts.

[Figure: the listing below shown side by side in a non-TM-aware debugger and in a TM-aware debugger]

<statement 1>
<statement 2>
atomic {
  <statement 3>
  <statement 4>
  <statement 5>
  <statement 6>
}
<statement 7>
<statement 8>

Isolation in Debugging
• What if we want to debug incorrect code within an atomic block?
• Put a breakpoint inside the atomic block.
• Validate the transaction.
• Step within the transaction.
• The user does not observe intermediate results of concurrently running transactions.
• Switch the transaction to irrevocable mode after validation.

atomic {
  <statement 1>
  <statement 2>
  <statement 3>
  <statement 4>
}

Debugging at the Level of Transactions
• Assumes that atomic blocks are implemented with transactional memory.
• Examine the internal state of the TM
  • Read/write set, re-executions, status
• TM-specific watchpoints
  • Break when a conflict happens
  • Filters
• Concurrent work with Herlihy and Lev [PACT'09].

TM-Specific Watchpoints
Break when a conflict happens AND the filter matches.
Filter: break if Address = reservation@04, Thread = T2

atomic {
  <statement 1>
  <statement 2>
  <statement 3>
  <statement 4>
}

Conflict Information
  Conflicting Threads: T1, T2
  Address: 0x84D2F0
  Symbol: reservation@04
  Readers: T1
  Writers: T2

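The combination of "break when a conflict happens" with an address and thread filter can be pictured as a simple predicate over a conflict report. The sketch below is an assumption made for illustration, not the interface of the WinDbg extension: the types `conflict_t` and `conflict_filter_t` and the function `should_break` are hypothetical, and the report is simplified to one reader and one writer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical conflict report: one reader and one writer thread
   touching the same address (a simplification of the slide's
   "Readers"/"Writers" lists). */
typedef struct {
    uintptr_t address;
    int reader;
    int writer;
} conflict_t;

/* Hypothetical user filter: each criterion is optional. */
typedef struct {
    bool match_address;
    uintptr_t address;
    bool match_thread;
    int thread;
} conflict_filter_t;

/* Returns true when the debugger should stop on this conflict:
   every enabled criterion of the filter must match. */
bool should_break(const conflict_filter_t *f, const conflict_t *c) {
    if (f->match_address && f->address != c->address)
        return false;
    if (f->match_thread && f->thread != c->reader && f->thread != c->writer)
        return false;
    return true;
}
```

With a filter on address 0x84D2F0 and thread 2, a conflict between reader T1 and writer T2 on that address triggers the break, while conflicts on other addresses or between other threads are ignored.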
Managing Transactions at Debug-Time
• At the level of atomic blocks
  • Debug-time atomic blocks
  • Splitting atomic blocks
• At the level of transactions
  • Changing the state of the TM system (i.e., adding and removing entries from the read/write set, changing the status, aborting)
  • Analogous to the functionality of existing debuggers to change the CPU state

Example: Debug-Time Atomic Blocks

<statement 1>
<statement 2>
<statement 3>
<statement 4>
<statement 5>
<statement 6>
<statement 7>
<statement 8>
<statement 9>
<statement 10>
<statement 11>
<statement 12>
<statement 13>
<statement 14>

Example: Debug-Time Atomic Blocks
The user marks the start and the end of the transaction.

<statement 1>
<statement 2>
<statement 3>
StartDebugAtomic
<statement 4>
<statement 5>
<statement 6>
<statement 7>
<statement 8>
<statement 9>
EndDebugAtomic
<statement 10>
<statement 11>
<statement 12>
<statement 13>
<statement 14>

Issues of Profiling TM Programs
• TM applications have unanticipated overheads
  • Problem raised by Pankratius [talk at ICSE'09] and Rossbach et al. [PPoPP'10]
• Difficult to profile TM applications without profiling tools and without knowing the implementation of the TM system
  • Experience of optimizing QuakeTM, Gajinov et al. [ICS'09]

Profiling TM Programs
• Design principles
  • Report results at the level of source-language constructs
  • Abstract the underlying TM system
  • Low probe effect and overhead
• Profiling techniques
  • Conflict point discovery
  • Identifying conflicting data structures
  • Visualizing transactions

Conflict Point Discovery
• Identifies the statements involved in conflicts
• Provides contextual information
• Finds the critical path

Call Context

increment() {
  counter++;
}

probability80() {
  probability = random() % 100;
  if (probability < 80) {
    atomic {
      increment();
    }
  }
}

probability20() {
  probability = random() % 100;
  if (probability >= 80) {
    atomic {
      increment();
    }
  }
}

Thread 1 and Thread 2 each run:

for (int i = 0; i < 100; i++) {
  probability80();
  probability20();
}

Bottom-up view
+ increment (100%)
  |---- probability80 (80%)
  |---- probability20 (20%)

Top-down view
+ main (100%)
  |---- probability80 (80%)
        |---- increment (80%)
  |---- probability20 (20%)
        |---- increment (20%)

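The bottom-up view attributes every conflict at `counter++` to the function that called `increment()`. A minimal sketch of that attribution follows, assuming the profiler has already recorded, for each conflict, which caller was on the stack; the enum `caller`, the sample array and the function `conflict_share_by_caller` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical identifiers for the two call sites of increment(). */
enum caller { PROBABILITY80 = 0, PROBABILITY20 = 1, NUM_CALLERS = 2 };

/* Given one sample per conflict (the caller of increment() at the time
   of the conflict), fills `share` with the percentage of conflicts
   attributed to each caller. Out-of-range samples are ignored. */
void conflict_share_by_caller(const enum caller *samples, size_t n,
                              double share[NUM_CALLERS]) {
    size_t counts[NUM_CALLERS] = { 0 };
    size_t valid = 0;

    for (size_t i = 0; i < n; i++) {
        int c = (int)samples[i];
        if (c >= 0 && c < NUM_CALLERS) {
            counts[c]++;
            valid++;
        }
    }
    for (int c = 0; c < NUM_CALLERS; c++)
        share[c] = valid ? 100.0 * (double)counts[c] / (double)valid : 0.0;
}
```

If eight of ten recorded conflicts come through `probability80` and two through `probability20`, the shares are 80% and 20%, which is the shape of the bottom-up view on the slide.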
Aborts Graph (Bayes)
• There are 15 atomic blocks and only one of them aborts most.
• Which atomic blocks cause AB3 to abort?
[Graph: nodes AB1, AB2 and AB3; edge labels as extracted: "Conf: 73%, Wasted: 63%" and "Conf: 20%, Wasted: 29%"; AB3 is annotated "72% of wasted work"]

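Edge labels of this kind can be derived from a log of aborts. The sketch below assumes one plausible definition, which the slide does not spell out: "Conf" is the share of a victim block's aborts caused by a given block, and "Wasted" is the share of the victim's discarded work caused by that block. The record layout and the names `abort_record_t`, `edge_stats_t` and `edge_stats` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical abort log entry: which atomic block caused the abort,
   which atomic block was aborted, and how much work was thrown away. */
typedef struct {
    int aborter;
    int victim;
    unsigned long wasted_cycles;
} abort_record_t;

typedef struct {
    double conflict_pct;  /* share of the victim's aborts due to aborter */
    double wasted_pct;    /* share of the victim's wasted work due to aborter */
} edge_stats_t;

/* Computes the label of the edge aborter -> victim from the abort log. */
edge_stats_t edge_stats(const abort_record_t *records, size_t n,
                        int aborter, int victim) {
    unsigned long edge_aborts = 0, all_aborts = 0;
    unsigned long edge_wasted = 0, all_wasted = 0;

    for (size_t i = 0; i < n; i++) {
        if (records[i].victim != victim)
            continue;
        all_aborts++;
        all_wasted += records[i].wasted_cycles;
        if (records[i].aborter == aborter) {
            edge_aborts++;
            edge_wasted += records[i].wasted_cycles;
        }
    }

    edge_stats_t s = { 0.0, 0.0 };
    if (all_aborts)
        s.conflict_pct = 100.0 * (double)edge_aborts / (double)all_aborts;
    if (all_wasted)
        s.wasted_pct = 100.0 * (double)edge_wasted / (double)all_wasted;
    return s;
}
```

Under this definition the two percentages can differ, as they do on the slide: a block that causes most of the conflicts may still account for a smaller share of the wasted work if the transactions it aborts are short at that point.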
Identifying Conflicting Objects

1: List list = new List();
2: list.Add(1);
3: list.Add(2);
4: list.Add(3);
...
atomic {
  list.Replace(2, 33);
}

Per-Object View
+ List.cs:1 "list" (42%)
  |--- ChangeNode (20%)
       +---- Replace (12%)
       +---- Add (8%)

[Figure: list nodes 1, 2, 3 at addresses 0x08, 0x10, 0x18, 0x20; labels: GC Root 0x08, Object Addr 0x20, InstrAddr 0x446290, GC, Memory Allocator, DbgEng, List.cs:1]

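The idea of mapping a conflicting address back to a source-level object can be pictured as a lookup in a table of allocation records. This is a deliberately simplified sketch under stated assumptions: it uses a flat table instead of the GC, memory allocator and DbgEng cooperation shown in the figure, and the names `alloc_record_t` and `site_for_address` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation record: an address range owned by one
   source-level object, plus the source location that allocated it. */
typedef struct {
    uintptr_t base;     /* first address of the range */
    size_t size;        /* length of the range in bytes */
    const char *site;   /* allocation site, e.g. "List.cs:1" */
} alloc_record_t;

/* Returns the allocation site whose range contains `addr`,
   or NULL when the address belongs to no recorded object. */
const char *site_for_address(const alloc_record_t *table, size_t n,
                             uintptr_t addr) {
    for (size_t i = 0; i < n; i++) {
        if (addr >= table[i].base && addr - table[i].base < table[i].size)
            return table[i].site;
    }
    return NULL;
}
```

Reporting conflicts per allocation site rather than per raw address is what lets the per-object view name "list" from List.cs:1 instead of an internal node address.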
Transaction Visualizer (Genome)
[Timeline figure with annotations: garbage collection; wait on barrier]
• Aborts occur at the first and last atomic blocks in program order.

Overhead and Probe Effect
• Process data offline or during GC.
[Plots: "+" profiling enabled, "-" profiling disabled]
• Normalized execution time: standard deviation for the difference 27%
• Abort rate in %: standard deviation for the difference 3.88%

Optimization Techniques
• Moving statements
• Atomic block scheduling
• Checkpoints and nested atomic blocks
• Pessimistic reads
• Early release

Moving Statements
Will this code execute the same?

atomic {
  counter++;
  <statement1>
  <statement2>
  <statement3>
}

atomic {
  <statement1>
  <statement2>
  <statement3>
  counter++;
}

No!

Checkpoints

atomic {
  <statement1>
  <statement2>
  <statement3>
  <statement4>
  <statement5>
  <statement6>
  <statement7>
}

[Figure annotations: conflicts of 2%, 15%, 4% and 79% attached to statements in the block; "Insert Checkpoint" marker]

Checkpoints

atomic {
  <statement1>
  <statement2>
  <statement3>
  <statement4>
  <statement5>
  <statement6>
  <checkpoint>
  <statement7>
}

[Figure annotations: conflicts of 2%, 15%, 4% and 79% attached to statements in the block; "Insert Checkpoint" marker]
• Wasted work for the atomic block reduced by 40%.

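Why a checkpoint reduces wasted work can be illustrated with a toy cost model. The sketch below is an assumption, not the measurement method behind the 40% figure: every statement costs one unit, `p[k]` is the probability that an abort is triggered at statement `k` (0-based), and an abort at or after the checkpoint rolls back only to the checkpoint. The function name `expected_wasted` is hypothetical.

```c
#include <stddef.h>

/* Expected number of statements that must be re-executed per abort.
   p[k]        probability that the abort is triggered at statement k
   n           number of statements in the atomic block
   checkpoint  number of statements that precede the checkpoint;
               0 means the block has no checkpoint */
double expected_wasted(const double *p, size_t n, size_t checkpoint) {
    double wasted = 0.0;

    for (size_t k = 0; k < n; k++) {
        size_t executed = k + 1;  /* statements run when the abort hits */
        size_t redone = (checkpoint > 0 && k >= checkpoint)
                            ? executed - checkpoint  /* roll back to checkpoint */
                            : executed;              /* roll back to block start */
        wasted += p[k] * (double)redone;
    }
    return wasted;
}
```

In this model the gain is largest when most conflicts occur late in the block, just after the point where the checkpoint is placed, which is the situation the slide depicts. The model ignores the cost of taking the checkpoint and the case where a conflict on data accessed before the checkpoint forces a full rollback.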
Conclusion
• Study the programmability aspects of TM
• New debugging principles and approaches for TM applications
• New profiling techniques for TM applications
• Profile-guided optimization approaches for TM applications