Thread-Level Speculation: Towards Ubiquitous Parallelism
Greg Steffan
School of Computer Science, Carnegie Mellon University
Moore’s Law: the Original Version
(chart: log transistors on a chip vs. time)
exponentially increasing resources
Moore’s Law: the Popular Interpretation
(chart: log performance vs. time)
increase resources → increase performance?
A Superposition of Innovations
(chart: log of performance vs. time, showing gains from datapath size (8b, 16b, 32b, 64b) and from instruction-level parallelism (ILP))
ILP is running out of steam
Why ILP is Running Out of Steam
• Cross-chip wire latency (in cycles)
• Development cost
• Power density
• Probability of a defect
these problems must be addressed
How Do We Sustain the Performance Curve?
(chart: log of performance vs. time — datapath size (8b, 16b, 32b, 64b), then ILP, then "?" from now onward: we are here)
what is the next big win for micro-architecture?
A New Path: Thread-Level Parallelism
(diagram: a chip multiprocessor (CMP) with multiple processors and caches)
• Tolerate cross-chip wire latency: localized wires
• Lower development cost: stamp out processor cores
• Lower power: turn off idle processors
• Tolerate defects: disable any faulty processor
many advantages
Multithreading in Every Scale of Machine
(diagram: threads running on every scale of machine, from desktops to supercomputers)
• Simultaneous multithreading (ALPHA 21464, Intel Xeon)
• Chip multiprocessor (CMP) (IBM Power4, SUN MAJC, Sibyte SB-1250)
multithreading on a chip!
Improving Performance with a Chip Multiprocessor
Multiprogramming workload:
(diagram: several independent applications spread across the CMP's processors and caches)
improves throughput
Improving Performance with a Chip Multiprocessor
Single application:
(diagram: one application running on the CMP, leaving most processors idle)
need parallel threads to reduce execution time
How Do We Parallelize Everything?
1) Programmers write parallel code from now on
• time-consuming and frustrating
• very hard to get right
• not a broad solution
2) The system parallelizes automatically
• no burden on the programmer
• can parallelize any application
automatic parallelization is preferred
Current Technique: Prove Independence
Independent:
    for (i = 0; i < N; i++)
        A[i] = 0;
each iteration writes a different element: A[0]←0, A[1]←0, A[2]←0, ...
Dependent:
    for (i = 1; i < N; i++)
        A[i] = A[i-1];
each iteration reads the previous iteration's write: A[1]←A[0], A[2]←A[1], A[3]←A[2], ...
need to fully understand the data access pattern
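A minimal sketch of why proven independence matters, using an OpenMP pragma purely for illustration (the talk does not assume any particular parallel runtime): the independent loop can be distributed across threads directly, while the dependent one cannot.

    #define N 1024
    int A[N];

    void independent(void) {
        /* Every iteration writes a distinct element, so iterations may run
           in parallel in any order. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = 0;
    }

    void dependent(void) {
        /* Each iteration reads the value written by the previous one, so
           the loop must run sequentially unless speculation is used. */
        for (int i = 1; i < N; i++)
            A[i] = A[i-1];
    }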
Ubiquitous Parallelization: How Close Are We?
Compilers can parallelize portions of numeric programs
• scientific, floating-point, array-based codes
• usually written in Fortran
• parallelize by proving independence
What about everything else?
• general-purpose, integer codes
• written in C, C++, Java, etc.
• little (if any) success so far
proving independence is infeasible
The Main Culprit: Indirection
Indirect array references:
    for (i = 0; i < N; i++)
        A[i] = A[B[i]];
A[0]←A[B[0]]?, A[1]←A[B[1]]?, A[2]←A[B[2]]?
need to know the values of B[]
Pointers:
    while (...) {
        ... = *q;
        *p = ...;
    }
does the store through *p overwrite what the load through *q reads?
need to know the targets of p and q
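A tiny illustrative example (array contents are made up): the very same indirect loop is independent or dependent depending on what B[] holds at run time, which the compiler generally cannot know.

    /* A[i] = A[B[i]] with these contents touches only A[i]: independent. */
    int B_independent[4] = {0, 1, 2, 3};

    /* With these contents, each iteration (i > 0) reads the previous
       iteration's result: a loop-carried dependence. */
    int B_dependent[4]   = {0, 0, 1, 2};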
Summary
We need the next big performance win
• instruction-level parallelism will run out of gas
Multithreading will soon be everywhere
• we need automatically-parallelized programs
The scope of current techniques is extremely limited
• proving independence is infeasible
A solution: Thread-Level Speculation (TLS)
Thread-Level Speculation: the Basic Idea
(diagram: threads run in parallel speculatively; when a store through *p conflicts with an earlier speculative load through *q, a violation is detected and the offending thread recovers and re-executes, yet overall execution time still shrinks)
exploit available thread-level parallelism
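As a software caricature of the idea (every name below is invented; real TLS relies on hardware to make this cheap), each program region becomes a speculative epoch that is squashed and re-executed when a violation is detected:

    typedef struct {
        int epoch;      /* logical (sequential) position of this thread */
        int violated;   /* set if an earlier epoch wrote what we already read */
    } tls_epoch_t;

    void run_epoch(tls_epoch_t *e, void (*body)(void)) {
        do {
            e->violated = 0;
            body();             /* execute speculatively, buffering writes */
            /* hardware detects whether a logically earlier epoch stored to
               a location this epoch already loaded, and sets e->violated */
        } while (e->violated);  /* on violation: discard buffered state, retry */
        /* on success: wait until all older epochs commit, then commit
           this epoch's buffered writes */
    }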
Outline
The Software/Hardware Sweet Spot
• Compiler Support
• Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Support for TLS: What Do We Need?
Break programs into speculative threads
• to maximize thread-level parallelism
Track data dependences
• to determine whether speculation was safe
Recover from failed speculation
• to ensure correct execution
three key elements of every TLS system
Compiler Researchers do it in Software
LRPD Test (University of Illinois at Urbana-Champaign)
(diagram: execution timeline — run the loop in parallel, then use software dependence tracking to ask: was parallel execution safe?)
+ implemented entirely in software
– applies only to array-based code
– no partial parallelism
Architects do it in Hardware
Multiscalar (Wisconsin)
(diagram: many processors sharing an Address Resolution Buffer (ARB))
• compiler breaks the program into threads
• Address Resolution Buffer (ARB) detects data dependence violations
– highly specialized for speculation
Our Approach: Find the Sweet Spot
Compiler:
+ global view of control flow
– hard/impossible to understand data dependences
Hardware:
– operates on a small window of instructions
+ observes dynamic memory accesses
leverage their respective strengths
The Sweet Spot
• Compiler:
  • break programs into speculative threads
  • why: the compiler has a global view of control flow
• Hardware:
  • track data dependences
  • why: software comparison of all addresses is infeasible
  • recover from failed speculation
  • why: software buffering of all writes is infeasible
important: minimize additional hardware
Outline
The Software/Hardware Sweet Spot
Compiler Support
• Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Compiler Support for TLS
(pipeline: Sequential Source Code → Region Selection (which loops? guided by profile information) → Transformation and Optimization (inserts TLS instructions) → MIPS Executable)
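As a rough sketch of what the transformation step might emit, with invented tls_* stand-ins for the inserted TLS instructions (not the actual ISA extensions):

    /* Invented stand-ins for the TLS instructions the compiler inserts. */
    extern void tls_spawn_next_epoch(void);
    extern void tls_wait_and_commit(void);

    void transformed(int *A, int *B, int N) {
        for (int i = 0; i < N; i++) {
            /* Fork the next iteration onto another processor, so iteration
               i+1 runs speculatively in parallel with iteration i. */
            tls_spawn_next_epoch();

            A[i] = A[B[i]];          /* unchanged loop body */

            /* Wait until this epoch is the oldest, then commit its buffered
               writes; the hardware squashes and restarts the epoch if a
               dependence violation was detected. */
            tls_wait_and_commit();
        }
    }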
Simple Performance Model
(diagram: 4 processors with perfect dependence tracking)
• 4 processors
• each processor issues one instruction per cycle
• no communication latency between processors
shows potential performance benefit
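The idealized speedup under this model reduces to simple arithmetic; a tiny illustrative calculation (the instruction count is made up, not a measured result):

    #include <stdio.h>

    int main(void) {
        long total_instructions = 1000000;  /* work in the parallelized region */
        int  processors = 4;                /* the model's 4 single-issue cores */

        long seq_cycles = total_instructions;                        /* 1 IPC  */
        long par_cycles = (total_instructions + processors - 1) / processors;

        printf("idealized speedup: %.2fx\n", (double)seq_cycles / par_cycles);
        return 0;
    }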
Potential Improvement
significant impact on execution time
Outline
The Software/Hardware Sweet Spot
Compiler Support
Industry-Friendly Hardware
• Improving Value Communication
• Conclusions
Goals
1) Handle arbitrary memory accesses
• i.e., not just array references
2) Preserve single-thread performance
• keep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture
• within a chip and beyond
effective, simple, scalable
Requirements
1) Recover from failed speculation
• buffer speculative writes separately from memory
2) Track data dependences
• detect data dependence violations
each has several implementation options
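For intuition, a small software caricature of the two requirements (in the actual design the cache provides both; every name below is invented for illustration):

    #include <stdbool.h>

    #define MAX_ACCESSES 1024

    typedef struct {
        void *raddr[MAX_ACCESSES];  /* addresses this epoch speculatively read */
        int   nreads;
        void *waddr[MAX_ACCESSES];  /* buffered speculative writes ...         */
        long  wval[MAX_ACCESSES];   /* ... and their values, kept out of memory */
        int   nwrites;
        bool  violated;
    } epoch_state_t;

    /* Requirement 2: track dependences -- a logically earlier epoch's store
       to an address this epoch already loaded is a violation. */
    void on_earlier_store(epoch_state_t *e, void *addr) {
        for (int i = 0; i < e->nreads; i++)
            if (e->raddr[i] == addr)
                e->violated = true;
    }

    /* Requirement 1: recover -- discard the buffered writes rather than
       undoing changes to memory. */
    void recover(epoch_state_t *e) {
        e->nwrites  = 0;
        e->nreads   = 0;
        e->violated = false;
    }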
Recover From Failed Speculation: Option 1
(diagram: processor with its store buffer)
Augment the store buffer:
+ common device in superscalar processors (facilitates non-blocking stores)
– too small
Recover From Failed Speculation: Option 2
(diagram: processor with a new dedicated buffer)
Add a new dedicated buffer:
+ can design an efficient speculation mechanism
– want to avoid large speculation-specific structures
Recover From Failed Speculation: Option 3
(diagram: processor with its cache)
Augment the cache:
+ very common structure
+ relatively large
just maintain single-thread performance
Tracking Data Dependences: Option 1
(diagram: two processors and caches report Load X and Store X to a separate Dependence Tracker, which detects the violation)
Add a dedicated "3rd-party" entity:
– want to avoid large speculation-specific structures
– does not scale
Tracking Data Dependences: Option 2
(diagram: the consumer forwards every load address to the producer, which detects the violation against its Store X)
Detection at the producer:
• producer is informed of all addresses consumed
– awkward: producer must notify the consumer of any violation
Tracking Data Dependences: Option 3
(diagram: the producer forwards every store address to the consumer, which detects the violation against its Load X)
Detection at the consumer:
• consumers are informed of all addresses produced
similar to invalidation-based cache coherence!
Augmenting the Cache
(diagram: a processor's cache, each line holding a Tag, State, and Data)
Augmenting the Cache
(diagram: each cache line gains two bits — SL (speculatively loaded) and SM (speculatively modified) — alongside the Tag, State, and Data)
modest amount of extra space
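A minimal sketch of the per-line state this implies, assuming a struct-per-line representation (field names and widths are illustrative, not the actual hardware):

    #include <stdint.h>

    typedef struct {
        uint64_t     tag;        /* address tag */
        uint8_t      state;      /* ordinary coherence state (0 = invalid here) */
        unsigned int sl : 1;     /* SL: speculatively loaded by the current epoch */
        unsigned int sm : 1;     /* SM: speculatively modified (the write is
                                    buffered in the cache, not yet committed)    */
        uint8_t      data[64];   /* cache line data */
    } spec_cache_line_t;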
Augmenting the Cache
(diagram: example cache contents while speculating, with the SL/SM bits set on some valid lines)
when speculation fails…
Augmenting the Cache
(diagram: the speculatively modified lines are simply marked invalid and the SL/SM bits cleared)
…can quickly discard speculative state
Extending Cache Coherence
(diagram: two processors and caches running epochs 4 and 5; epoch 5 has speculatively loaded X, epoch 4 then stores X, and the invalidation message carries the storing epoch's number — "invalidate X; from 4"; since 4 < 5, the receiving cache detects a violation)
straightforward extension of cache coherence
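Building on the hypothetical spec_cache_line_t sketch above, here is an illustrative version of the consumer-side check the extended protocol performs (the function and its signature are invented, not the real protocol interface):

    #include <stdbool.h>
    #include <stdint.h>

    /* An invalidation for a speculatively loaded line that comes from a
       logically earlier epoch means this epoch read the value too early. */
    bool handle_invalidation(spec_cache_line_t *line, uint64_t addr_tag,
                             int sender_epoch, int my_epoch)
    {
        if (line->tag != addr_tag)
            return false;              /* not the line being invalidated */

        if (line->sl && sender_epoch < my_epoch)
            return true;               /* violation: squash and re-execute */

        line->state = 0;               /* otherwise: ordinary invalidation */
        return false;
    }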
Detailed Performance Model
(diagram: processors and caches connected by a crossbar)
Underlying architecture
• single-chip multiprocessor
• implements speculative coherence
Simulator
• superscalar, a modernized MIPS R10K
• models all bandwidth and contention
detailed simulation!
Will it Work at All of These Scales?
(diagram: the same range of machines as before — simultaneous multithreading, chip multiprocessors, desktops, supercomputers)
yes: coherence scales up and down
Performance on Multi-Chip Systems
our scheme is scalable
Performance on General-Purpose Applications
significant performance improvements
Outline
The Software/Hardware Sweet Spot
Compiler Support
Industry-Friendly Hardware
Improving Value Communication
• Conclusions
Speculate
(diagram: the logically earlier epoch stores through *p while the logically later epoch speculatively loads through *q, both going straight to memory)
good when p != q
Synchronize (and forward)
(diagram: instead of speculating, the consumer waits (stalling); the producer stores through *p and signals, and only then does the consumer load through *q, so the value is forwarded)
good when p == q
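A minimal sketch of this pattern using C11 atomics as a stand-in for the wait/signal primitives (names like value_ready are invented; real TLS forwards values through dedicated support between epochs, not spin-waiting):

    #include <stdatomic.h>

    static int shared;
    static int *p = &shared, *q = &shared;   /* p == q: the dependence is real */
    static _Atomic int value_ready = 0;

    void producer_epoch(void) {
        *p = 42;                                             /* Store *p       */
        atomic_store_explicit(&value_ready, 1,
                              memory_order_release);         /* Signal         */
    }

    void consumer_epoch(void) {
        while (!atomic_load_explicit(&value_ready,
                                     memory_order_acquire))  /* Wait (stall)   */
            ;
        int x = *q;                                          /* Load *q        */
        (void)x;                                             /* forwarded value */
    }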
Reduce the Critical Forwarding Path
Overview:
(diagram: with a big critical path, the consumer's Wait stalls for a long time before its Load X because the producer reaches Store X and Signal late; with a small critical path the Signal arrives sooner, the stall shrinks, and execution time decreases)
decreases execution time
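One way to picture shrinking the critical forwarding path is a before/after of the producer epoch (everything below is a hypothetical illustration; signal_consumers and unrelated_work are invented stand-ins):

    extern void signal_consumers(void);   /* stand-in for the Signal          */
    extern void unrelated_work(void);     /* work that does not affect X      */

    int X;

    void producer_long_critical_path(int v) {
        unrelated_work();                 /* consumer stalls through all of this */
        X = v;                            /* Store X                             */
        signal_consumers();               /* Signal: critical path = whole epoch */
    }

    void producer_short_critical_path(int v) {
        X = v;                            /* Store X as early as possible        */
        signal_consumers();               /* Signal early: consumer resumes now  */
        unrelated_work();                 /* remaining work overlaps the consumer */
    }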