Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads

Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry. School of Computer Science, Carnegie Mellon University. †Dept. of Electrical & Computer Engineering, University of Toronto

Motivation Chip-level multiprocessing is becoming commonplace • UltraSPARC IV (2 UltraSPARC III cores) • IBM Power 4 • SUN MAJC • Sibyte SB-1250 Can multithreaded processors improve the performance of a single application? We need parallel programs - 2 -

Why Is Automatic Parallelization Difficult? Automatic parallelization today • Must statically prove threads are independent • Constructing proofs is difficult due to ambiguous data dependences • Complex control flow • Pointers and indirect references • Runtime inputs Optimistic compiler? • Limited only by true dependences  One solution: Thread-Level Speculation - 3 -

Example
while (...) {
    ...
    x = hash[index1];
    ...
    hash[index2] = y;
    ...
}
[Figure: loop iterations run as speculative threads on Processors 1 to 4 over time, each ending with check_dep(). Thread 1 loads hash[3] and stores hash[10]; Thread 2 loads hash[19] and stores hash[21]; Thread 3 loads hash[33] and stores hash[30]; Thread 4 loads hash[10] and stores hash[25]. Thread 4 is retried, because it loads hash[10], which Thread 1 stores. Threads 5, 6 and 7 follow (loads hash[31], hash[9], hash[27]; stores hash[12], hash[44], hash[32]).] - 4 -
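The dependence check in the example above can be sketched in C. This is only an illustration, not the hardware mechanism from the talk: the names `struct epoch`, `must_retry` and `slide_threads` are hypothetical, and the sketch conservatively assumes that a store by a logically earlier, concurrently running thread had not yet reached the later thread when it loaded.

```c
#include <assert.h>
#include <stddef.h>

/* One speculative thread (one loop iteration): it loads hash[load_idx]
 * and later stores hash[store_idx]. Hypothetical type for illustration. */
struct epoch {
    int load_idx;
    int store_idx;
};

/* Returns 1 if thread t loaded a location that a logically earlier thread
 * running alongside it (indices first .. t-1) stores to.
 * Assumption: the earlier store was not yet visible at the time of the load,
 * so the load saw a stale value and the thread must be retried. */
int must_retry(const struct epoch *e, size_t first, size_t t)
{
    assert(first <= t);
    for (size_t i = first; i < t; i++) {
        if (e[i].store_idx == e[t].load_idx)
            return 1;
    }
    return 0;
}

/* Threads 1 to 4 from the slide: {load index, store index}. */
const struct epoch slide_threads[4] = {
    {3, 10}, {19, 21}, {33, 30}, {10, 25}
};
```

Calling `must_retry(slide_threads, 0, 3)` reports a violation for Thread 4 (array index 3), matching the retry shown in the figure; the other three threads pass the check.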

Frequently Dependent Scalars [Figure: producer and consumer threads over time; each uses a (…=a) and defines a (a=…)] Can identify scalars that always cause dependences - 5 -

Frequently Dependent Scalars [Figure: producer and consumer threads over time; the producer executes Signal(a) after a=…, and the consumer executes Wait(a) before …=a] Dependent scalars should be synchronized [ASPLOS’02] - 6 -

Frequently Dependent Scalars [Figure: producer and consumer threads over time, with uses (…=a) and definitions (a=…)] Dataflow analysis allows us to deal with complex control flow [ASPLOS’02] - 7 -

Communicating Memory-Resident Values [Figure: producer and consumer threads over time; each executes Load *p and Store *q. Should the consumer synchronize or speculate?] Will speculation succeed? - 8 -

Speculation vs. Synchronization [Figure: sequential execution compared with speculative parallel execution of threads that each perform Load *p and Store *q] Speculation succeeds: efficient - 9 -

Speculation vs. Synchronization [Figure: sequential execution compared with speculative parallel execution of threads that each perform Load *p and Store *q; a violation occurs in the speculative execution] Speculation fails: inefficient - 10 -

Speculation vs. Synchronization [Figure: sequential execution compared with speculative parallel execution of threads that each perform Load *p and Store *q] • Frequent dependences: Synchronize • Infrequent dependences: Speculate - 11 -

Performance Potential [Figure: bar chart of normalized regional execution time (0 to 100) for go, gcc, gap, ijpeg, mcf, crafty, parser, m88ksim, perlbmk, vpr_place, gzip_comp, bzip2_comp, gzip_decomp; bars: Original, Perfect memory value prediction] • Detailed simulation: • TLS support • 4-processor CMP • 4-way issue, out-of-order superscalar • 10-cycle communication latency Reducing failed speculation improves performance - 12 -

Hardware vs. Compiler Inserted Synchronization [Figure: three producer/consumer timelines in which Store *q and Load *p communicate through memory, labeled Speculation, Hardware-inserted Synchronization [HPCA’02], and Compiler-inserted Synchronization [CGO’04]; the figure also shows Wait(), Signal() and a stalled load] - 13 -

Issues in Synchronizing Memory-Resident Values • Static analysis • Which instructions to synchronize? • Inter-procedural dependences • Runtime • Detecting and recovering from improper synchronization [Figure: producer Store *q and consumer Load *p over time] - 14 -

Outline • Static analysis • Runtime checks • Results • Conclusions [Figure: producer Store *q and consumer Load *p over time] - 15 -

Compiler Passes foo.c Front End Profile Data Dependences Create Threads Insert Synchronization Decide what to Synchronize Schedule Instructions Back End foo.exe - 16 -

Example do { push (&set, element); work(); } while (test); work() push (head, entry) - 17 -

Example do { push (&set, element); work(); } while (test); work() { if (condition(&set)) push (&set, element); } push (head, entry) - 18 -

Example
do {
    push(&set, element);
    work();
} while (test);
work() {
    if (condition(&set))
        push(&set, element);
}
push(head, entry) {
    entry->next = *head;   /* Load *head */
    *head = entry;         /* Store *head */
}
Memory accesses, labeled by call path: Load *head (push), Store *head (push), Load *head (work, push), Store *head (work, push) - 19 -

Compiler Passes foo.c Front End Profile Data Dependences Thread Creating Insert Synchronization Decide what to Synchronize Instruction Scheduling Back End foo.exe - 20 -

Example
do {
    push(&set, element);
    work();
} while (test);
work() {
    if (condition(&set))
        push(&set, element);
}
push(head, entry) {
    entry->next = *head;   /* Load *head */
    *head = entry;         /* Store *head */
}
Profile Information
========================================================
Source                      Destination                 Frequency
Store *head (push)          Load *head (push)           990
Store *head (push)          Load *head (work, push)     10
Store *head (work, push)    Load *head (push)           10
- 21 -

Dependence Graph [Figure: nodes Store *head (push), Store *head (work, push), Load *head (push), Load *head (work, push); edge weights 990, 10 and 10, as in the profile information] Infrequent dependences: occur in less than 5% of iterations. Pairs that need to be synchronized can be extracted from the dependence graph - 23 -
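A minimal sketch of how synchronized pairs might be extracted from profiled dependence edges, using the 5% cutoff stated on the slide. The type and function names are hypothetical, and the total of 1000 profiled iterations used in the usage note is an assumption; the slides give only the per-edge frequencies.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled store-to-load dependence edge (hypothetical layout). */
struct dep_edge {
    const char *store;
    const char *load;
    unsigned    frequency;  /* iterations in which the dependence occurred */
};

/* A dependence is treated as frequent when it occurs in at least 5% of
 * the profiled iterations (the slide calls "less than 5%" infrequent). */
int is_frequent(const struct dep_edge *e, unsigned total_iterations)
{
    assert(total_iterations > 0);
    /* frequency / total >= 5 / 100, kept in integer arithmetic */
    return (unsigned long long)e->frequency * 100ULL
        >= (unsigned long long)total_iterations * 5ULL;
}

/* Copies pointers to the frequent edges into out[] and returns the count.
 * out must have room for n entries. */
size_t select_sync_pairs(const struct dep_edge *edges, size_t n,
                         unsigned total_iterations,
                         const struct dep_edge **out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_frequent(&edges[i], total_iterations))
            out[k++] = &edges[i];
    }
    return k;
}

/* The three edges from the profile information slide. */
const struct dep_edge profile[3] = {
    {"Store *head (push)",       "Load *head (push)",       990},
    {"Store *head (push)",       "Load *head (work, push)",  10},
    {"Store *head (work, push)", "Load *head (push)",        10},
};
```

With an assumed total of 1000 iterations, `select_sync_pairs(profile, 3, 1000, out)` keeps only the 990-occurrence edge between the store and the load in push; the two 10-occurrence edges fall below the cutoff and are left to speculation.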

Example
Synchronize these: Store *head (push) and Load *head (push), frequency 990
do {
    push_clone(&set, element);   /* replaces push(&set, element) */
    work();
} while (test);
work() {
    if (condition(&set))
        push(&set, element);
}
push_clone(head, entry) {
    wait();
    entry->next = *head;     /* Load *head (push) */
    *head = entry;           /* Store *head (push) */
    signal(head, *head);
}
push(head, entry) {
    entry->next = *head;
    *head = entry;
}
- 25 -
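The cloned procedure can be written out as compilable C under a few stated assumptions. The slide's wait() and signal() are TLS primitives whose implementation is not shown, so `tls_wait`, `tls_signal` and `struct forward_slot` below are hypothetical sequential stand-ins that only record what would be forwarded; they do not stall or communicate between real threads.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
};

/* Hypothetical single-slot forwarding channel standing in for the
 * TLS value-forwarding mechanism. */
struct forward_slot {
    int           full;
    struct node **addr;   /* forwarded address (head)  */
    struct node  *value;  /* forwarded value   (*head) */
};

static struct forward_slot slot;

/* Stand-in for wait(): a real consumer thread would stall here until the
 * previous thread signals. This sequential sketch just empties the slot. */
static void tls_wait(void)
{
    slot.full = 0;
}

/* Stand-in for signal(head, *head): forward both address and value. */
static void tls_signal(struct node **addr, struct node *value)
{
    slot.addr  = addr;
    slot.value = value;
    slot.full  = 1;
}

/* Unsynchronized original, still used on the infrequent path via work(). */
void push(struct node **head, struct node *entry)
{
    entry->next = *head;
    *head = entry;
}

/* Synchronized clone, called directly from the loop body. */
void push_clone(struct node **head, struct node *entry)
{
    tls_wait();
    entry->next = *head;        /* Load *head  */
    *head = entry;              /* Store *head */
    tls_signal(head, *head);
}
```

Cloning keeps the synchronization on the frequent call path only: calls that reach push through work() stay speculative, which matches the profile where those dependences are rare.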

Runtime Checks [Figure: producer executes Store *q and Signal(q, *q); consumer executes Load *p later in time] • Store *q and Load *p access the same memory address • No store modifies the forwarded address between Store *q and Load *p • Producer forwards the address to ensure a match between the load and the store - 27 -

Ensuring Correctness [Figure: producer Store *q and consumer Load *p over time, with a further Store *x] • Store *q and Load *p access the same memory address • No store modifies the forwarded address between Store *q and Load *p • Hardware support • Similar to memory conflict buffer [Gallagher et al., ASPLOS’94] - 28 -

Ensuring Correctness [Figure: producer Store *q and consumer Load *p over time, with a further Store *y] • Store *q and Load *p access the same memory address • No store modifies the forwarded address between Store *q and Load *p • Hardware support: TLS hardware already knows which locations are stored to - 29 -

Experimental Framework Underlying architecture • 4-processor, single-chip multiprocessor • speculation supported through coherence Simulator • superscalar, similar to MIPS R14K • 10-cycle communication latency • models all bandwidth and contention Benchmarks • SPECint95 and SPECint2000, -O3 optimization [Figure: blocks labeled P and C connected by a crossbar] detailed simulation - 31 -

Parallel Region Coverage [Figure: bar chart of parallel region coverage (0 to 100) for go, gcc, gap, ijpeg, mcf, crafty, parser, m88ksim, perlbmk, vpr_place, gzip_comp, bzip2_comp, gzip_decomp] • Coverage is significant • Average coverage: 54% - 32 -

Compiler-Inserted Synchronization [Figure: stacked bars of normalized regional execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for go, gcc, gap, ijpeg, mcf, crafty, parser, m88ksim, perlbmk, vpr_place, gzip_comp, bzip2_comp, gzip_decomp; U = no synchronization inserted, C = compiler-inserted synchronization; speedup labels on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%] Seven benchmarks speed up by 5% to 46% - 33 -

Compiler- vs. Hardware-Inserted Synchronization [Figure: stacked bars of normalized regional execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for mcf, gcc, crafty, go, gap, ijpeg, parser, perlbmk, vpr_place, m88ksim, gzip_comp, bzip2_comp, gzip_decomp; C = compiler-inserted synchronization, H = hardware-inserted synchronization; benchmarks are grouped into "Hardware does better" and "Compiler does better"] Compiler and hardware [HPCA’02] each benefit different benchmarks - 34 -

Combining Hardware and Compiler Synchronization [Figure: stacked bars of normalized regional execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for go, gap, perlbmk, m88ksim, gzip_comp, gzip_decomp; C = compiler-inserted synchronization, H = hardware-inserted synchronization, B = both combined] The combination is more robust than each technique individually - 35 -

Related Work
Compiler-inserted         Hardware-inserted
Zhai et al. CGO’04        Steffan et al. HPCA’02
Cytron ICPP’86            Moshovos et al. ISCA’97
Tsai & Yew PACT’96        Cintra & Torrellas HPCA’02
Figure labels: Distributed Table, Centralized Table - 36 -

Conclusions Compiler-inserted synchronization for memory-resident value communication: • Effective in reducing speculation failure • Half of the benchmarks speed up by 5% to 46% (regional) • Combining hardware and compiler techniques is more robust • Neither consistently outperforms the other • Can be combined to track the best performer • Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware - 37 -

Questions? - 38 -

The Potential of Instruction Scheduling [Figure: stacked bars (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for go, gap, gcc, ijpeg, mcf, crafty, parser, perlbmk, m88ksim, vpr_place, gzip_decomp, gzip_comp_R, gzip_comp, bzip2_comp; E = Early, C = compiler-inserted synchronization, L = Late] Scheduling instructions has additional benefit for some benchmarks - 39 -

Program Performance [Figure: stacked bars (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for gcc, gap, ijpeg, mcf, crafty, go, twolf, parser, m88ksim, perlbmk, vpr_place, gzip_comp, bzip2_comp, gzip_decomp, gzip_comp_R, bzip2_decomp; U = un-optimized, C = compiler-inserted synchronization, H = hardware-inserted synchronization, B = both compiler and hardware] - 40 -

Which Technique Synchronizes This Load? [Figure: stacked bars (0 to 100) for go, gap, mcf, gcc, ijpeg, crafty, twolf, parser, gzip_comp, perlbmk, m88ksim, bzip2_comp, vpr_place, gzip_comp_R, gzip_decomp; segments: synchronized by neither technique, by compiler, by hardware, by both; U = un-optimized, C = compiler-inserted synchronization, H = hardware-inserted synchronization, B = both compiler and hardware] - 41 -

Ensuring Correctness [Figure: producer Store *q and consumer Load *p, with a further Store *x] • Store *q and Load *p access the same memory address • No store modifies the forwarded address between Store *q and Load *p • Hardware support • Similar to memory conflict buffer [Gallagher et al., ASPLOS’94] - 42 -

Ensuring Correctness [Figure: producer executes Store *x, Store *q, Signal(q) and Signal(*q); consumer executes Load *p. A decision flow at the load asks "Local store to *p?" and "q == p?", ending in either Use Forwarded Value or Use Memory Value] • Store *q and Load *p access the same memory address • No store modifies the forwarded address between Store *q and Load *p • Hardware support Use the forwarded value only if the synchronized pair is dependent - 43 -
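One plausible reading of the decision flow above, sketched in C. The branch order is an assumption reconstructed from the flattened flowchart (a local store to *p selects the memory value; otherwise the forwarded value is used only when q == p), and `struct forwarded` and `load_synchronized` are hypothetical names, not the hardware interface.

```c
#include <assert.h>
#include <stddef.h>

/* What the producer sent with Signal(q) and Signal(*q). Hypothetical type. */
struct forwarded {
    const int *addr;   /* q  */
    int        value;  /* *q */
};

/* Value a synchronized Load *p should observe in the consumer.
 * local_store_to_p: nonzero if the consumer itself already stored to *p,
 * in which case its own memory value is the most recent one. */
int load_synchronized(const int *p, const struct forwarded *fwd,
                      int local_store_to_p)
{
    assert(p != NULL && fwd != NULL);
    if (local_store_to_p)
        return *p;            /* use memory value */
    if (fwd->addr == p)
        return fwd->value;    /* pair is dependent: use forwarded value */
    return *p;                /* addresses differ: use memory value */
}
```

The address comparison is what makes an imprecise compiler decision safe: if the synchronized store and load turn out not to alias, the forwarded value is simply ignored.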

Issues in Synchronizing Memory-Resident Values • Inserting synchronization using compilers • Ensuring correctness • Reducing synchronization cost [Figure: producer Store *q and consumer Load *p] - 44 -

Reducing Cost of Synchronization [Figure: producer and consumer timelines before and after instruction scheduling] • Instruction scheduling algorithms are described in [ASPLOS’02] - 45 -

The Potential of Instruction Scheduling [Figure: stacked bars of normalized regional execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for ijpeg, gap, m88ksim, gzip_comp, vpr_place, gzip_decomp; E = perfectly predicting synchronized memory-resident values, C = compiler-inserted synchronization, L = consumer stalls until previous thread commits] Scheduling instructions could offer additional benefit - 46 -

Using More Accurate Profiling Information [Figure: stacked bars of normalized regional execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other and Busy, for gzip_comp; U = No Instruction Scheduling, C = compiler-inserted synchronization, R = compiler-inserted synchronization (profiled with the ref input set)] gzip_comp is the only benchmark sensitive to profiling input - 47 -
