Using ATLAS for Performance Tuning and Debugging
Sewook Wee and Njuguna Njoroge
Computer Systems Laboratory, Stanford University
http://tcc.stanford.edu/prototypes
Tutorial Set-up
• Wireless router access
  • SSID: RAMP-DEMO
  • Passphrase: rampramp
• Team setup (Extreme Programming)
  • One member = driver, who codes on his or her laptop
  • Rest of team = passengers, who review and help the driver
• Server connection
  • ssh 10.0.0.2
  • Username and password are on your desk
• Environment variables check
  • check $BEE2_BOARD $VACATION $DLL
  • echo $BEE2 $VACATION $DLL
• Make sure that your favorite text editor is working properly
• VNC viewer
  • VNC Viewer executable
  • http://www.realvnc.com/cgi-bin/download.cgi
• Open up VNC in shared mode
Transactionalizing vacation
• vacation – part of the TCC group's STAMP benchmark suite
  • STAMP = Stanford's Transactional Applications for Multi-Processing
  • http://stamp.stanford.edu
  • Modeled after SPECjbb2000
• About vacation …
  • Implements a travel reservation system powered by a database
  • Workload consists of clients interacting with the DB manager
  • Four tables in the DB: cars, rooms, flights, and customers
  • The customers table tracks the reservations and total price
  • The tables are implemented as red-black trees
Running vacation …
• Let's run vacation in its original form
ACTION!
  %> cd $VACATION
  %> make run_seq
vacation pseudocode – main
In vacation.c, function MAIN starting on line 340
  initializeManager;
  initializeClients;
  PROFILER_ON;
  client_run;
  PROFILER_OFF;
vacation pseudocode – client work
In client.c, function client_run starting on line 109
  for (i = 0; i < numOperation; i++) {
      action = selectAction;
      switch (action) {
      case ACTION_MAKE_RESERVATION:   // 1st case
          for (j = 0; j < numQueries; j++)
              switch (query_type) { … }
      case ACTION_BILL_CUSTOMER:      // 2nd case
      case ACTION_UPDATE_TABLES:      // 3rd case
          for (j = 0; j < numUpdates; j++)
              switch (update_type) { … }
      }
  }
  ...
Verify success of sequential run
ACTION!
  %> more trace/0/atlas.stdout
  Initializing manager... done.
  Manager Stats are initialized
  Initializing clients... done.
  Transactions = 1024
  Clients = 1
  Transactions/client = 1024
  Queries/transaction = 1
  Relations = 4096
  Query percent = 99
  Query range = 4055
  Percent user = 80
  Running clients... done.
  Checking tables... done.
  Deallocating memory...
  Number of total adds = 24700
  Number of total deletes = 56
  Number of total queries = 3341
  Number of total reservations = 1618
  Number of total cancellations = 0
  Done.
Quick overview of the TCC API
• TM_PARALLEL(function_ptr, arg_ptr, numThreads)
  • function_ptr = pointer to the parallel function
  • arg_ptr = pointer to the function's arguments
  • numThreads = number of threads (for TCC, the number of CPUs)
• TM_BEGIN(), TM_END()
  • Indicate the start and end of a transaction
• TM_GET_THREAD_ID(), TM_GET_NUM_THREAD()
  • Retrieve the thread's ID and the number of threads
• A high-level language, OpenTM (resembles OpenMP), is in the works
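To make the call shapes above concrete, here is a minimal sketch built on single-threaded stand-in macros. The real definitions come from the TCC tm.h header and run on ATLAS; the macro bodies below, as well as current_id, num_threads, worker, and run_example, are assumptions made for illustration so that the fragment compiles on any machine.

```c
#include <assert.h>

/* Stand-ins for the TCC API (assumption: the real macros come from tm.h).
 * This emulation runs the "threads" one after another on a single CPU. */
static int current_id  = 0;   /* ID of the emulated thread now running */
static int num_threads = 1;   /* number of emulated threads            */

#define TM_BEGIN()           do { /* transaction would start here  */ } while (0)
#define TM_END()             do { /* transaction would commit here */ } while (0)
#define TM_GET_THREAD_ID()   (current_id)
#define TM_GET_NUM_THREAD()  (num_threads)
#define TM_PARALLEL(fn, arg, n)                                  \
    do {                                                         \
        num_threads = (n);                                       \
        for (current_id = 0; current_id < (n); current_id++)     \
            (fn)(arg);                                           \
    } while (0)

#define NUM_ITEMS   16
#define MAX_THREADS 8

/* Hypothetical parallel function: each thread sums a strided slice
 * of 0..NUM_ITEMS-1 into its own slot, one transaction per item. */
static void worker(void *arg)
{
    long *partial = (long *)arg;
    int id = TM_GET_THREAD_ID();
    int n  = TM_GET_NUM_THREAD();

    for (int i = id; i < NUM_ITEMS; i += n) {
        TM_BEGIN();
        partial[id] += i;
        TM_END();
    }
}

/* Runs worker on n emulated threads and returns the combined sum. */
long run_example(int n)
{
    long partial[MAX_THREADS] = {0};
    long total = 0;

    assert(n >= 1 && n <= MAX_THREADS);
    TM_PARALLEL(worker, (void *)partial, n);
    for (int i = 0; i < n; i++)
        total += partial[i];
    return total;
}
```

Calling run_example with any thread count from 1 to 8 returns the same sum, since the stand-ins only change how the iterations are divided, not what is computed.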
Transactionalizing vacation – Step 1
• OPEN vacation.c, CHANGE line 362 to:
  358   MEMORY_INIT();
  359   PROFILER_ON();
  360
  361   /* Run transactions */
  362   TM_PARALLEL(client_run, (void*)clients, global_params[PARAM_CLIENTS]);   [CHANGE]
  363
• Note: global_params[PARAM_CLIENTS] = number of processors
  • The command-line parsing code sets this value
Transactionalizing vacation – Step 2
• OPEN client.c
• ADD #include "tm.h" on line 19
  16 #include "client.h"
  17 #include "manager.h"
  18 #include "reservation.h"
  19 #include "tm.h"                              [ADD]
  20
• ADD int myId = TM_GET_THREAD_ID(); on line 113 and CHANGE clients[0] to clients[myId] on line 115
  112 int i;
  113 int myId = TM_GET_THREAD_ID();              [ADD]
  114 client_t** clients = (client_t**)(args);
  115 client_t* clientPtr = clients[myId];        [CHANGE]
Transactionalizing vacation – Step 3
• Still in client.c, ADD TM_BEGIN(); on line 129
  124 for (i = 0; i < numOperation; i++) {
  125
  126     int r = random_generate(randomPtr) % 100;
  127     action_t action = selectAction(r, percentUser);
  128
  129     TM_BEGIN();                              [ADD]
  130
  131     switch (action) {
• Still in client.c, ADD TM_END(); on line 242
  239     } /* switch (action) */
  240     formattingAndProtocol(&i);
  241
  242     TM_END();                                [ADD]
  243
ACTION!
  %> make run_par
Profiling File Output Format
• While vacation is running, open the sequential run's stats
ACTION!
  %> more trace/0/atlas.log
1-way ATLAS
  Polling BRAM...
  *****************************
  Profiling Info from TCC[0/1]
  *****************************
  TOTAL: 1339369387
  PERF: 595385277   BUSY: 592922718   L1_MISS: 2451797   ARBIT: 1703   COMMIT: 9059
  SYNC: 0   VIOL: 0   MISC: 0
  ...
  OVFL CYCLE: 592071059   # OVFL: 98   # LRU OVFL: 98
  # READ: 246260   # R-MISS: 10793   # WRITE: 108005   # W-MISS: 1866
  # Inst.: 271417058   # Trans: 3   # Violation: 0
  # ITLBMISS: 595   # DTLBMISS: 4775   # DStorage: 0   # SC: 0
  ITLBCYCLE: 546605   DTLBCYCLE: 8177964   DS CYCLE: 0   SC CYCLE: 0
  # SYS Inst.: 1554382   # SYS CYCLE: 8724569
  # Timeout: 1394   # TimeoutL: 0
Analyzing scalability of vacation
• Look at the reported speedup
  • It gets slower when we add more processors!
  • Violations dominate PERF time!
ACTION!
  %> less trace/8/atlas.log
Reading the TAPE violation report
• Let's look at the violation log file
ACTION!
  %> less trace/8/viol.log
  Read_PC   Object_Addr  Occurence  Loss     Write_Proc  Line
  10001500  100830e0     32         1265341  3           ..//vacation/manager.c:134
  10001448  100830e0     29         766816   4           ..//vacation/manager.c:134
  10001390  100830e0     30         6446858  1           ..//vacation/manager.c:134
  10005f4c  304492e4     3          750669   6           ..//lib/rbtree.c:105
• Report says … lines 132 to 135 of manager.c should be the largest offenders
• In manager.c, go to the above lines and examine the code
  • Function increment_stats reads and writes global variables, causing lots of conflicts between transactions
  • Incrementing these stats causes many violations
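One way to rank offenders is to add up the Loss column per source line. The sketch below does that for the four rows shown in the report above. It assumes the columns are separated by whitespace exactly as printed; total_loss and sample_rows are hypothetical names for this illustration and are not part of TAPE.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rows copied from the viol.log excerpt, header included. */
const char *sample_rows[] = {
    "Read_PC Object_Addr Occurence Loss Write_Proc Line",
    "10001500 100830e0 32 1265341 3 ..//vacation/manager.c:134",
    "10001448 100830e0 29 766816 4 ..//vacation/manager.c:134",
    "10001390 100830e0 30 6446858 1 ..//vacation/manager.c:134",
    "10005f4c 304492e4 3 750669 6 ..//lib/rbtree.c:105",
};

/* Sums the Loss column over all rows whose Line column equals `where`.
 * Rows that do not parse as six fields (such as the header) are skipped. */
long total_loss(const char *rows[], int n, const char *where)
{
    long total = 0;

    assert(rows != NULL && where != NULL);
    for (int i = 0; i < n; i++) {
        unsigned long pc, addr;
        long occurrences, loss;
        int write_proc;
        char line[128];

        if (sscanf(rows[i], "%lx %lx %ld %ld %d %127s",
                   &pc, &addr, &occurrences, &loss, &write_proc, line) == 6
            && strcmp(line, where) == 0)
            total += loss;
    }
    return total;
}
```

For the sample rows, the three entries at manager.c:134 add up to a Loss of 8479015, against 750669 for rbtree.c:105, which matches the report's conclusion that the stats code in manager.c is the place to look first.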
Fixing violations in vacation
• The problem: violations on the global stats variables
• The fix: privatize the global variables
• Simple privatization scheme
  • Make an 8-element array for each stat variable
    • e.g. int num_adds; becomes int num_adds[MAX_CPUS];
  • Each element is owned by one processor
    • e.g. num_adds[x] = processor x's element
  • In the stats printing function, aggregate the array elements into a single variable
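The scheme above can be sketched for a single counter as follows. This is a minimal illustration, not the tutorial's manager.c: count_add and total_adds are hypothetical helper names, and MAX_CPUS is assumed to be 8 to match the 8-element arrays mentioned above.

```c
#include <assert.h>

#define MAX_CPUS 8   /* assumption: matches the 8-element arrays in the text */

static int num_adds[MAX_CPUS];   /* element x belongs to processor x */

/* Called by processor `cpu`; it touches only that processor's element,
 * so two processors never update the same array element. */
void count_add(int cpu)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    num_adds[cpu]++;
}

/* Reduction step, as done in the stats printing function:
 * fold the per-processor elements into one total. */
int total_adds(void)
{
    int total = 0;

    for (int i = 0; i < MAX_CPUS; i++)
        total += num_adds[i];
    return total;
}
```

On ATLAS the index passed to count_add would come from TM_GET_THREAD_ID(), and total_adds would run once, after the parallel section has finished.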
Privatization of vacation – Step 1
• OPEN manager.c, CHANGE lines 111-115 to:
  110 #ifdef manager_stats
  111 int num_adds[MAX_CPUS];                      [CHANGE]
  112 int num_deletes[MAX_CPUS];
  113 int num_queries[MAX_CPUS];
  114 int num_reservations[MAX_CPUS];
  115 int num_cancels[MAX_CPUS];                   [CHANGE]
  116 #endif
• Then CHANGE lines 132-136 to:
  130 switch (stat)
  131 {
  132 case ADDS: num_adds[TM_GET_THREAD_ID()]++; break;                   [CHANGE]
  133 case DELETES: num_deletes[TM_GET_THREAD_ID()]++; break;
  134 case QUERIES: num_queries[TM_GET_THREAD_ID()]++; break;
  135 case RESERVATIONS: num_reservations[TM_GET_THREAD_ID()]++; break;
  136 case CANCELS: num_cancels[TM_GET_THREAD_ID()]++; break;             [CHANGE]
  137 default: break;
  138 }
Privatization of vacation – Step 2
• In function manager_initStats in manager.c, ADD lines 153-155 and CHANGE lines 156-161:
  149 void
  150 manager_initStats(void)
  151 {
  152 #ifdef manager_stats
  153     int i;                                   [ADD]
  154     for(i = 0; i < MAX_CPUS; i++)
  155     {                                        [ADD]
  156         num_adds[i] = 0;                     [CHANGE]
  157         num_deletes[i] = 0;
  158         num_queries[i] = 0;
  159         num_reservations[i] = 0;
  160         num_cancels[i] = 0;
  161     }                                        [CHANGE]
  162 #endif
  163
  164     printf("Manager Stats are initialized\n");
  165
  166 }
Privatization of vacation – Step 3
• In function manager_printStats in manager.c, CHANGE line 177 and lines 192-196
  176 #ifdef manager_stats
  177 #if 1                                        [CHANGE]
  178     int i;
  179     int num_adds_t = 0, num_deletes_t = 0, num_queries_t = 0;
  180     int num_reservations_t = 0, num_cancels_t = 0;
  181     /* aggregate stats */
  ….
  191     printf("\n");
  192     printf("Number of total adds = %d\n", num_adds_t);                   [CHANGE]
  193     printf("Number of total deletes = %d\n", num_deletes_t);
  194     printf("Number of total queries = %d\n", num_queries_t);
  195     printf("Number of total reservations = %d\n", num_reservations_t);
  196     printf("Number of total cancellations = %d\n", num_cancels_t);       [CHANGE]
ACTION!
  %> make run_par
Summary of Transactional vacation
• After ~2 minutes, observe that the 8-processor run is approx. 6 times faster than the uniprocessor configuration
• Note: In OpenTM, privatization and reduction will be automated by flagging variables
  • The compiler will insert the privatization and reduction code for us
• In this exercise, we demonstrated
  • Ease of use of transactional memory
    • Intuitive coarse-grain parallelization
    • Did not require low-level understanding of the code
  • Guided performance tuning
    • Identifies significant performance bottlenecks
    • Without the profiler and TAPE, finding such bottlenecks is like "looking for a needle in a haystack"!
Debugging Parallel Code
• There are established techniques for debugging sequential code
  • Standard debugger (e.g. GDB)
• How about parallel code?
  • Non-deterministic runtime behavior
  • Sometimes you have to understand the underlying architecture
• How can transactional memory help?
  • Atomicity & isolation
    • No intrusion from other threads inside the transaction
  • Deterministic replay
  • Infinite data-watches
• Our focus today => deterministic replay & GDB support
Functional Debugging of Transactional Apps
• Once an app is transactional, the most common type of functional bug is the atomicity violation
• What are atomicity violations?
  • In TM, the programmer splits an atomic region of code into two or more transactions
  • Intermediate values of shared data in one transaction are prematurely exposed to other transactions
  • In fine-grain lock-based programming, it is much easier to introduce such violations
• Challenge: atomicity violations are non-deterministic and hard to reproduce
Fixing atomicity violations in ATLAS
• In this session, you will debug an application with atomicity violations
• ATLAS provides a framework for deterministic replay
  • 1st Step: Run a small application with atomicity violations
  • 2nd Step: Deterministically regenerate the buggy execution
  • 3rd Step: Add monitor code to identify the origins of bugs
  • 4th Step: Fix the code!
Example Code: Doubly Linked List
• Toy example: the goal is to demonstrate the tool
• Global doubly-linked-list queue
  • Head and Tail pointers
• Each thread dequeues an item from the Head pointer and enqueues it back at the Tail
• Threads use dequeue and enqueue functions, which are individually synchronized using transactions
• Programmer's intention:
  • The order of items in the queue remains the same
  • As if a single thread repeated dequeue and enqueue
High-level Pseudo code
  run:
      for i = 0:NUM_ITERS
          item = atomic_dequeue();
          atomic_enqueue(item);
      End

  atomic_dequeue:
      TM_BEGIN()
      item = dequeue_from_Head
      TM_END()
      return item

  atomic_enqueue (item):
      TM_BEGIN()
      enqueue_to_Tail(item)
      TM_END()
Queue diagram: Head 1 2 3 4 5 6 Tail, shared by Thread A and Thread B
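For readers who want the queue operations spelled out, here is a minimal C sketch of a doubly-linked queue with the same dequeue-from-Head, enqueue-at-Tail behavior. The item_t layout, the pointer orientation, and rotate_once are assumptions for illustration; the tutorial's dll.c may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical item layout; the tutorial's dll.c may differ. */
typedef struct item {
    int id;
    struct item *prev;   /* neighbor toward Head */
    struct item *next;   /* neighbor toward Tail */
} item_t;

static item_t *Head = NULL;
static item_t *Tail = NULL;

/* Appends item at Tail. In the tutorial version this body sits
 * between TM_BEGIN() and TM_END(). */
void enqueue(item_t *item)
{
    assert(item != NULL);
    item->next = NULL;
    item->prev = Tail;
    if (Tail != NULL)
        Tail->next = item;
    else
        Head = item;          /* queue was empty */
    Tail = item;
}

/* Removes and returns the item at Head, or NULL if the queue is empty. */
item_t *dequeue(void)
{
    item_t *item = Head;

    if (item == NULL)
        return NULL;
    Head = item->next;
    if (Head != NULL)
        Head->prev = NULL;
    else
        Tail = NULL;          /* queue is now empty */
    item->prev = NULL;
    item->next = NULL;
    return item;
}

/* One iteration of run: moves the Head item to the Tail and
 * returns its id, or -1 if the queue is empty. */
int rotate_once(void)
{
    item_t *item = dequeue();

    if (item == NULL)
        return -1;
    enqueue(item);
    return item->id;
}
```

With a single caller, repeated rotate_once calls cycle through the items in their original order, which is the behavior the programmer intends to preserve when several threads share the queue.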
Execution Step 1: Test Drive
The result from ATLAS may or may not meet the spec. Here is one possible example of an undesired execution result. Try it several times; you will see different results.
ACTION!
  %> cd $DLL
  %> make run
Correct Output: 0 1 2 3 4 5 6 7
Actual Result:  0 1 2 4 3 5 6 7
Execution Step 2: Replay
• Log files
  • After the test drive, you will get atlas.stdout and commit.out
  • atlas.stdout: standard output from the application
  • commit.out: transaction order
• Now we will replay the previous run with commit.out
ACTION!
  %> make replay LOGFILE=commit.out
• Unlike in the previous step, you see the same behavior every time
TIP
  %> make replay LOGFILE=commit.correct
  %> make replay LOGFILE=commit.error
Execution Step 3: Finding the bug
• Hypothesis: the enqueuing order may be different from the dequeuing order.
• For example,
  Intended order:         Possible interleaving:
  Thread A dequeue X      Thread A dequeue X
  Thread A enqueue X      Thread B dequeue Y
  Thread B dequeue Y      Thread B enqueue Y
  Thread B enqueue Y      Thread A enqueue X
• How to test the hypothesis?
  • Let's make it simple: add printf
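The hypothesis can be checked on paper with a small single-threaded simulation that replays a fixed schedule of dequeue and enqueue steps over the queue 1..6. This is only a sketch: the array-based queue, the replay function, and the schedule encoding ('a'/'b' for a dequeue by thread A/B, 'A'/'B' for the matching enqueue) are assumptions for illustration and do not reflect the format of commit.out.

```c
#include <assert.h>

#define QLEN 6

/* Circular array queue standing in for the linked list. The simulation is
 * single-threaded, so each step plays the role of one committed transaction. */
typedef struct {
    int items[QLEN];
    int head;    /* index of the item at Head */
    int count;   /* number of items currently queued */
} queue_t;

static int dequeue(queue_t *q)
{
    assert(q->count > 0);
    int v = q->items[q->head];
    q->head = (q->head + 1) % QLEN;
    q->count--;
    return v;
}

static void enqueue(queue_t *q, int v)
{
    assert(q->count < QLEN);
    q->items[(q->head + q->count) % QLEN] = v;
    q->count++;
}

/* Replays `schedule` over the queue 1..6 and writes the final
 * Head-to-Tail order into `out`. */
void replay(const char *schedule, int out[QLEN])
{
    queue_t q = { {1, 2, 3, 4, 5, 6}, 0, QLEN };
    int held[2] = {0, 0};   /* item currently held by thread A, thread B */

    for (const char *s = schedule; *s != '\0'; s++) {
        switch (*s) {
        case 'a': held[0] = dequeue(&q); break;
        case 'A': enqueue(&q, held[0]);  break;
        case 'b': held[1] = dequeue(&q); break;
        case 'B': enqueue(&q, held[1]);  break;
        default:  assert(!"unknown schedule step");
        }
    }
    assert(q.count == QLEN);   /* every dequeued item was put back */
    for (int i = 0; i < QLEN; i++)
        out[i] = q.items[(q.head + i) % QLEN];
}
```

Replaying "aAbB" (the intended order) leaves the queue as 3 4 5 6 1 2, while "abBA" (the interleaving from the example) leaves 3 4 5 6 2 1: items 1 and 2 have swapped places, which is the kind of reordering seen in the test drive.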
Execution Step 3: Finding the bug
• Let's add monitoring code
  • A printf inside the transaction will not affect the transaction order
  • Therefore, you will get exactly the same behavior
EDIT dll.c
  96 Head = next;
  97 printf("Dequeue(%d)\n", item->id);
  98
  108
  109 printf("\t\tEnqueue(%d)\n", item->id);
  110 item->prev = NULL;
ACTION!
  %> make replay LOGFILE=commit.out
You see that your hypothesis is right.
Execution Step 4: Fix it
• Make dequeue/enqueue one atomic block
  • Untransactionalize dequeue and enqueue
  • Transactionalize the dequeue/enqueue pair
EDIT dll.c
  84 //TM_BEGIN();
  ...
  99 //TM_END();
  ...
  107 //TM_BEGIN();
  ...
  122 //TM_END();

  73 for (i = 0; i < NUM_ITERS; i+= TM_GET_NUM...
  74
  75     TM_BEGIN();
  76
  77     item = dequeue();
  78     enqueue(item);
  79
  80     TM_END();
  81
  82 }
ACTION!
  %> make run
Replay on Local Machine
• Runs sequentially, following LOGFILE
• Faster execution
• GDB support already exists
• Does not support machine-specific code
ACTION!
  %> make replay_local LOGFILE=commit.out
GDB and Replay
ACTION!
  %> echo "commit_in commit.error" > config.tcc
  %> gdb --args ./dll_local 8
About to dequeue
  (gdb) break dll.c:88
  (gdb) condition 1 Head->id==3
  (gdb) run
  (gdb) p Head->id
  (gdb) p Tail->id
  (gdb) p myid
  (gdb) break dll.c:110
  (gdb) condition 2 myid==3
  (gdb) continue
  (gdb) p myid
  (gdb) p Tail->id
1 About to enqueue 4
Debugging Conclusion
• Deterministic replay
  • Provides regeneration of the buggy scenario
  • Allows embedding monitoring code without contaminating the buggy scenario
• The "All Transactions All The Time" concept helps parallel code debugging
  • Easier deterministic replay
  • Easier GDB support