Thread-Level Speculation: Towards Ubiquitous Parallelism
Greg Steffan
School of Computer Science, Carnegie Mellon University
Moore’s Law: the Original Version
(chart: log transistors on a chip vs. time)
exponentially increasing resources
Moore’s Law: the Popular Interpretation
(chart: log performance vs. time)
increase resources → increase performance?
A Superposition of Innovations
(chart: log of performance vs. time, showing gains from datapath size (8b, 16b, 32b, 64b) and from instruction-level parallelism (ILP))
ILP is running out of steam
Why ILP is Running Out of Steam
• Cross-chip wire latency (in cycles)
• Development cost
• Power density
• Probability of a defect
these problems must be addressed
How Do We Sustain the Performance Curve?
(chart: log of performance vs. time — datapath size (8b, 16b, 32b, 64b), then ILP, then "?" from now onward: we are here)
what is the next big win for micro-architecture?
A New Path: Thread-Level Parallelism
(diagram: a chip multiprocessor (CMP) with multiple processors and caches)
• Tolerate cross-chip wire latency: localized wires
• Lower development cost: stamp out processor cores
• Lower power: turn off idle processors
• Tolerate defects: disable any faulty processor
many advantages
Multithreading in Every Scale of Machine
(diagram: threads running on every scale of machine, from desktops to supercomputers)
• Simultaneous multithreading (ALPHA 21464, Intel Xeon)
• Chip multiprocessor (CMP) (IBM Power4, SUN MAJC, Sibyte SB-1250)
multithreading on a chip!
Improving Performance with a Chip Multiprocessor
Multiprogramming workload:
(diagram: several independent applications spread across the CMP's processors and caches)
improves throughput
Improving Performance with a Chip Multiprocessor
Single application:
(diagram: one application running on the CMP, leaving most processors idle)
need parallel threads to reduce execution time
How Do We Parallelize Everything?
1) Programmers write parallel code from now on
• time-consuming and frustrating
• very hard to get right
• not a broad solution
2) The system parallelizes automatically
• no burden on the programmer
• can parallelize any application
automatic parallelization is preferred
Current Technique: Prove Independence
Independent:
    for (i = 0; i < N; i++)
        A[i] = 0;
each iteration writes a different element: A[0]←0, A[1]←0, A[2]←0, ...
Dependent:
    for (i = 1; i < N; i++)
        A[i] = A[i-1];
each iteration reads the previous iteration's write: A[1]←A[0], A[2]←A[1], A[3]←A[2], ...
need to fully understand the data access pattern
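A minimal sketch of why proven independence matters, using an OpenMP pragma purely for illustration (the talk does not assume any particular parallel runtime): the independent loop can be distributed across threads directly, while the dependent one cannot.

    #define N 1024
    int A[N];

    void independent(void) {
        /* Every iteration writes a distinct element, so iterations may run
           in parallel in any order. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = 0;
    }

    void dependent(void) {
        /* Each iteration reads the value written by the previous one, so
           the loop must run sequentially unless speculation is used. */
        for (int i = 1; i < N; i++)
            A[i] = A[i-1];
    }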
Ubiquitous Parallelization: How Close Are We?
Compilers can parallelize portions of numeric programs
• scientific, floating-point, array-based codes
• usually written in Fortran
• parallelize by proving independence
What about everything else?
• general-purpose, integer codes
• written in C, C++, Java, etc.
• little (if any) success so far
proving independence is infeasible
The Main Culprit: Indirection
Indirect array references:
    for (i = 0; i < N; i++)
        A[i] = A[B[i]];
A[0]←A[B[0]]?, A[1]←A[B[1]]?, A[2]←A[B[2]]?
need to know the values of B[]
Pointers:
    while (...) {
        ... = *q;
        *p = ...;
    }
does the store through *p overwrite what the load through *q reads?
need to know the targets of p and q
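A tiny illustrative example (array contents are made up): the very same indirect loop is independent or dependent depending on what B[] holds at run time, which the compiler generally cannot know.

    /* A[i] = A[B[i]] with these contents touches only A[i]: independent. */
    int B_independent[4] = {0, 1, 2, 3};

    /* With these contents, each iteration (i > 0) reads the previous
       iteration's result: a loop-carried dependence. */
    int B_dependent[4]   = {0, 0, 1, 2};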
Summary
We need the next big performance win
• instruction-level parallelism will run out of gas
Multithreading will soon be everywhere
• we need automatically-parallelized programs
The scope of current techniques is extremely limited
• proving independence is infeasible
A solution: Thread-Level Speculation (TLS)
Thread-Level Speculation: the Basic Idea
(diagram: threads run in parallel speculatively; when a store through *p conflicts with an earlier speculative load through *q, a violation is detected and the offending thread recovers and re-executes, yet overall execution time still shrinks)
exploit available thread-level parallelism
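As a software caricature of the idea (every name below is invented; real TLS relies on hardware to make this cheap), each program region becomes a speculative epoch that is squashed and re-executed when a violation is detected:

    typedef struct {
        int epoch;      /* logical (sequential) position of this thread */
        int violated;   /* set if an earlier epoch wrote what we already read */
    } tls_epoch_t;

    void run_epoch(tls_epoch_t *e, void (*body)(void)) {
        do {
            e->violated = 0;
            body();             /* execute speculatively, buffering writes */
            /* hardware detects whether a logically earlier epoch stored to
               a location this epoch already loaded, and sets e->violated */
        } while (e->violated);  /* on violation: discard buffered state, retry */
        /* on success: wait until all older epochs commit, then commit
           this epoch's buffered writes */
    }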
Outline
The Software/Hardware Sweet Spot
• Compiler Support
• Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Support for TLS: What Do We Need?
Break programs into speculative threads
• to maximize thread-level parallelism
Track data dependences
• to determine whether speculation was safe
Recover from failed speculation
• to ensure correct execution
three key elements of every TLS system
Compiler Researchers do it in Software
LRPD Test (University of Illinois at Urbana-Champaign)
(diagram: execution timeline — run the loop in parallel, then use software dependence tracking to ask: was parallel execution safe?)
+ implemented entirely in software
– applies only to array-based code
– no partial parallelism
Architects do it in Hardware
Multiscalar (Wisconsin)
(diagram: many processors sharing an Address Resolution Buffer (ARB))
• compiler breaks the program into threads
• Address Resolution Buffer (ARB) detects data dependence violations
– highly specialized for speculation
Our Approach: Find the Sweet Spot
Compiler:
+ global view of control flow
– hard/impossible to understand data dependences
Hardware:
– operates on a small window of instructions
+ observes dynamic memory accesses
leverage their respective strengths
The Sweet Spot
• Compiler:
  • break programs into speculative threads
  • why: the compiler has a global view of control flow
• Hardware:
  • track data dependences
  • why: software comparison of all addresses is infeasible
  • recover from failed speculation
  • why: software buffering of all writes is infeasible
important: minimize additional hardware
Outline
The Software/Hardware Sweet Spot
Compiler Support
• Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Compiler Support for TLS
(pipeline: Sequential Source Code → Region Selection (which loops? guided by profile information) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable)
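As a rough sketch of what the transformation step might emit, with invented tls_* stand-ins for the inserted TLS instructions (not the actual ISA extensions):

    /* Invented stand-ins for the TLS instructions the compiler inserts. */
    extern void tls_spawn_next_epoch(void);
    extern void tls_wait_and_commit(void);

    void transformed(int *A, int *B, int N) {
        for (int i = 0; i < N; i++) {
            /* Fork the next iteration onto another processor, so iteration
               i+1 runs speculatively in parallel with iteration i. */
            tls_spawn_next_epoch();

            A[i] = A[B[i]];          /* unchanged loop body */

            /* Wait until this epoch is the oldest, then commit its buffered
               writes; the hardware squashes and restarts the epoch if a
               dependence violation was detected. */
            tls_wait_and_commit();
        }
    }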
Simple Performance Model
(diagram: 4 processors with perfect dependence tracking)
• 4 processors
• each processor issues one instruction per cycle
• no communication latency between processors
shows potential performance benefit
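The idealized speedup under this model reduces to simple arithmetic; a tiny illustrative calculation (the instruction count is made up, not a measured result):

    #include <stdio.h>

    int main(void) {
        long total_instructions = 1000000;  /* work in the parallelized region */
        int  processors = 4;                /* the model's 4 single-issue cores */

        long seq_cycles = total_instructions;                        /* 1 IPC  */
        long par_cycles = (total_instructions + processors - 1) / processors;

        printf("idealized speedup: %.2fx\n", (double)seq_cycles / par_cycles);
        return 0;
    }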
Potential Improvement
significant impact on execution time
Outline
The Software/Hardware Sweet Spot
Compiler Support
Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Goals
1) Handle arbitrary memory accesses
• i.e., not just array references
2) Preserve single-thread performance
• keep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture
• within a chip and beyond
effective, simple, scalable
Requirements
1) Recover from failed speculation
• buffer speculative writes separately from memory
2) Track data dependences
• detect data dependence violations
each has several implementation options
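For intuition, a small software caricature of the two requirements (in the actual design the cache provides both; every name below is invented for illustration):

    #include <stdbool.h>

    #define MAX_ACCESSES 1024

    typedef struct {
        void *raddr[MAX_ACCESSES];  /* addresses this epoch speculatively read */
        int   nreads;
        void *waddr[MAX_ACCESSES];  /* buffered speculative writes ...         */
        long  wval[MAX_ACCESSES];   /* ... and their values, kept out of memory */
        int   nwrites;
        bool  violated;
    } epoch_state_t;

    /* Requirement 2: track dependences -- a logically earlier epoch's store
       to an address this epoch already loaded is a violation. */
    void on_earlier_store(epoch_state_t *e, void *addr) {
        for (int i = 0; i < e->nreads; i++)
            if (e->raddr[i] == addr)
                e->violated = true;
    }

    /* Requirement 1: recover -- discard the buffered writes rather than
       undoing changes to memory. */
    void recover(epoch_state_t *e) {
        e->nwrites  = 0;
        e->nreads   = 0;
        e->violated = false;
    }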
Recover From Failed Speculation: Option 1
(diagram: processor with its store buffer)
Augment the store buffer:
+ common device in superscalar processors (facilitates non-blocking stores)
– too small
Recover From Failed Speculation: Option 2
(diagram: processor with a new dedicated buffer)
Add a new dedicated buffer:
+ can design an efficient speculation mechanism
– want to avoid large speculation-specific structures
Recover From Failed Speculation: Option 3
(diagram: processor with its cache)
Augment the cache:
+ very common structure
+ relatively large
just maintain single-thread performance
Tracking Data Dependences: Option 1
(diagram: two processors and caches report Load X and Store X to a separate Dependence Tracker, which detects the violation)
Add a dedicated "3rd-party" entity:
– want to avoid large speculation-specific structures
– does not scale
Tracking Data Dependences: Option 2
(diagram: the consumer forwards every load address to the producer, which detects the violation against its Store X)
Detection at the producer:
• producer is informed of all addresses consumed
– awkward: producer must notify the consumer of any violation
Tracking Data Dependences: Option 3
(diagram: the producer forwards every store address to the consumer, which detects the violation against its Load X)
Detection at the consumer:
• consumers are informed of all addresses produced
similar to invalidation-based cache coherence!
Augmenting the Cache
(diagram: a processor's cache, each line holding a Tag, State, and Data)
Augmenting the Cache
(diagram: each cache line gains two bits — SL (speculatively loaded) and SM (speculatively modified) — alongside the Tag, State, and Data)
modest amount of extra space
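A minimal sketch of the per-line state this implies, assuming a struct-per-line representation (field names and widths are illustrative, not the actual hardware):

    #include <stdint.h>

    typedef struct {
        uint64_t     tag;        /* address tag */
        uint8_t      state;      /* ordinary coherence state (0 = invalid here) */
        unsigned int sl : 1;     /* SL: speculatively loaded by the current epoch */
        unsigned int sm : 1;     /* SM: speculatively modified (the write is
                                    buffered in the cache, not yet committed)    */
        uint8_t      data[64];   /* cache line data */
    } spec_cache_line_t;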
Augmenting the Cache
(diagram: example cache contents while speculating, with the SL/SM bits set on some valid lines)
when speculation fails…
Augmenting the Cache
(diagram: the speculatively modified lines are simply marked invalid and the SL/SM bits cleared)
…can quickly discard speculative state
Extending Cache Coherence
(diagram: two processors and caches running epochs 4 and 5; epoch 5 has speculatively loaded X, epoch 4 then stores X, and the invalidation message carries the storing epoch's number — "invalidate X; from 4"; since 4 < 5, the receiving cache detects a violation)
straightforward extension of cache coherence
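Building on the hypothetical spec_cache_line_t sketch above, here is an illustrative version of the consumer-side check the extended protocol performs (the function and its signature are invented, not the real protocol interface):

    #include <stdbool.h>
    #include <stdint.h>

    /* An invalidation for a speculatively loaded line that comes from a
       logically earlier epoch means this epoch read the value too early. */
    bool handle_invalidation(spec_cache_line_t *line, uint64_t addr_tag,
                             int sender_epoch, int my_epoch)
    {
        if (line->tag != addr_tag)
            return false;              /* not the line being invalidated */

        if (line->sl && sender_epoch < my_epoch)
            return true;               /* violation: squash and re-execute */

        line->state = 0;               /* otherwise: ordinary invalidation */
        return false;
    }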
Detailed Performance Model
(diagram: processors and caches connected by a crossbar)
Underlying architecture
• single-chip multiprocessor
• implements speculative coherence
Simulator
• superscalar, a modernized MIPS R10K
• models all bandwidth and contention
detailed simulation!
Will it Work at All of These Scales?
(diagram: the same range of machines as before — simultaneous multithreading, chip multiprocessors, desktops, supercomputers)
yes: coherence scales up and down
Performance on Multi-Chip Systems
our scheme is scalable
Performance on General-Purpose Applications
significant performance improvements
Outline
The Software/Hardware Sweet Spot
Compiler Support
Industry-Friendly Hardware
Improving Value Communication
• Conclusions
Speculate
(diagram: the logically earlier epoch stores through *p while the logically later epoch speculatively loads through *q, both going straight to memory)
good when p != q
Synchronize (and forward)
(diagram: instead of speculating, the consumer waits (stalling); the producer stores through *p and signals, and only then does the consumer load through *q, so the value is forwarded)
good when p == q
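A minimal sketch of this pattern using C11 atomics as a stand-in for the wait/signal primitives (names like value_ready are invented; real TLS forwards values through dedicated support between epochs, not spin-waiting):

    #include <stdatomic.h>

    static int shared;
    static int *p = &shared, *q = &shared;   /* p == q: the dependence is real */
    static _Atomic int value_ready = 0;

    void producer_epoch(void) {
        *p = 42;                                             /* Store *p       */
        atomic_store_explicit(&value_ready, 1,
                              memory_order_release);         /* Signal         */
    }

    void consumer_epoch(void) {
        while (!atomic_load_explicit(&value_ready,
                                     memory_order_acquire))  /* Wait (stall)   */
            ;
        int x = *q;                                          /* Load *q        */
        (void)x;                                             /* forwarded value */
    }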
Reduce the Critical Forwarding Path
Overview:
(diagram: with a big critical path, the consumer's Wait stalls for a long time before its Load X because the producer reaches Store X and Signal late; with a small critical path the Signal arrives sooner, the stall shrinks, and execution time decreases)
decreases execution time
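One way to picture shrinking the critical forwarding path is a before/after of the producer epoch (everything below is a hypothetical illustration; signal_consumers and unrelated_work are invented stand-ins):

    extern void signal_consumers(void);   /* stand-in for the Signal          */
    extern void unrelated_work(void);     /* work that does not affect X      */

    int X;

    void producer_long_critical_path(int v) {
        unrelated_work();                 /* consumer stalls through all of this */
        X = v;                            /* Store X                             */
        signal_consumers();               /* Signal: critical path = whole epoch */
    }

    void producer_short_critical_path(int v) {
        X = v;                            /* Store X as early as possible        */
        signal_consumers();               /* Signal early: consumer resumes now  */
        unrelated_work();                 /* remaining work overlaps the consumer */
    }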