THESIS DEFENSE (Barcelona, 14 December 2005) Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente Advisors: Antonio González and Jordi Tubella
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Motivation • Programs are general by design • real-world programs • operating systems • Often designed with future expansion and code reuse in mind • Input sets have little variation • Even with aggressive compilers, repetition is relatively common
Types of Repetition • z = F (x, y) • repetition of computations • repetition of values
Repetitive Computations • [bar chart: percentage of repetitive computations, 0%–100%] • Spec CPU2000, 500 million instructions
Repetitive Values • [bar chart: percentage of repetitive values, 0%–100%] • Spec CPU2000, 500 million instructions, analysis of destination value
Objectives • To improve the memory system: exploit value repetition of store instructions • redundant store instructions • non redundant data cache • To speed up the execution of instructions: exploit computation repetition of all instructions • redundant computation buffer (ILR) • trace-level reuse (TLR) • trace-level speculative multithreaded architecture (TLS)
Experimental Framework • Methodology • Analysis of benchmarks • Definition of proposal • Evaluation of proposal • Tools • Atom • Cacti 3.0 • Simplescalar Tool Set • Benchmarks • Spec CPU95 • Spec CPU2000
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Techniques to Improve Memory • value repetition exploited through: • redundant stores • non redundant cache
Redundant Store Instructions • Memory location @i holds Value X; the program executes STORE (@i, Value Y) • If (Value X == Value Y) then the store is redundant and does NOT modify memory • Contributions • redundant store instructions • analysis of repetition in the same storage location • redundant stores applied to reduce memory traffic • Main results • 15%-25% of store instructions are redundant • 5%-20% of memory traffic reduction Molina, González, Tubella, “Reducing Memory Traffic via Redundant Store Instructions”, HPCN’99
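A minimal sketch of the redundant-store check, with a toy word-addressed memory array standing in for the data cache; the function and array names are illustrative, not the thesis implementation:

```c
/* Redundant (silent) store check: a store is redundant when the value already
 * held at the target location equals the value being written, so the write
 * (and the resulting traffic to the next memory level) can be skipped. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 1024
static uint32_t memory[MEM_WORDS];          /* toy backing store, word addressed */

/* Returns true when the store was redundant and the write was elided. */
static bool execute_store(uint32_t word_addr, uint32_t new_value)
{
    if (memory[word_addr] == new_value)     /* Value X == Value Y            */
        return true;                        /* redundant: no update, no traffic */
    memory[word_addr] = new_value;          /* regular store                 */
    return false;
}

int main(void)
{
    execute_store(7, 0x1234);                      /* first write: not redundant */
    bool redundant = execute_store(7, 0x1234);     /* same value: redundant      */
    printf("second store redundant: %d\n", redundant);
    return 0;
}
```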
Non Redundant Data Cache • Example data cache: Tag X holds Value A = 1234 and Value B = FFFF; Tag Y holds Value C = 0000 and Value D = 1234 • If (Value A == Value D) then Value Repetition across locations • Contributions • analysis of repetition in several storage locations • non redundant data cache (NRC) • Main results • on average, a value is stored 4 times at any given time • NRC: -32% area, -13% energy, -25% latency, +5% miss rate Molina, Aliagas, García, Tubella, González, “Non Redundant Data Cache”, ISLPED’03 Aliagas, Molina, García, González, Tubella, “Value Compression to Reduce Power in Data Caches”, EUROPAR’03
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Techniques to Speed Up Instruction Execution • Computation repetition is exploited by data value reuse and data value speculation • Avoid serialization caused by data dependences • Determine results of instructions without executing them • Target is to speed up the execution of programs
Techniques to Speed Up Instruction Execution • Data Value Reuse • NON SPECULATIVE !!! • buffers previous inputs and their corresponding outputs • only possible if the computation has been done in the past • inputs have to be ready at reuse test time
Techniques to Speed Up Instruction Execution • Data Value Speculation • SPECULATIVE !!! • predicts values as a function of past history • needs to confirm the speculation at a later point • solves the reuse test but introduces misspeculation penalty
Techniques to Speed Up Instruction Execution • Both data value reuse and data value speculation can be applied at two granularities • instruction level: applied to a SINGLE instruction • trace level: applied to a GROUP of instructions
Instruction Level Reuse (ILR) • [pipeline diagram: Fetch, Decode & Rename, OOO Execution, Commit, with the RCB reuse table indexed from the front end] • Contributions • performance potential of ILR • Redundant Computation Buffer (RCB) • Main results • ideal ILR speed-up of 1.5 • RCB speed-up of 1.1 (outperforms previous proposals) Molina, González, Tubella, “Dynamic Removal of Redundant Computations”, ICS’99
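A minimal sketch of an instruction-level reuse lookup in the spirit of the RCB; the table size, direct-mapped indexing and entry fields are illustrative assumptions rather than the evaluated design:

```c
/* Each entry caches the source operand values and the result of a previous
 * execution of the instruction at a given PC; if the current operands match,
 * the result is reused without re-executing the operation. */
#include <stdint.h>
#include <stdbool.h>

#define RCB_ENTRIES 1024

typedef struct {
    bool     valid;
    uint64_t pc;
    uint64_t src1, src2;     /* operand values seen last time       */
    uint64_t result;         /* result produced by that execution   */
} rcb_entry_t;

static rcb_entry_t rcb[RCB_ENTRIES];

/* Reuse test: returns true and fills *result if the computation is repeated. */
static bool rcb_lookup(uint64_t pc, uint64_t src1, uint64_t src2, uint64_t *result)
{
    rcb_entry_t *e = &rcb[(pc >> 2) % RCB_ENTRIES];   /* direct-mapped index on PC */
    if (e->valid && e->pc == pc && e->src1 == src1 && e->src2 == src2) {
        *result = e->result;                          /* skip execution            */
        return true;
    }
    return false;
}

/* Update after a normal (non-reused) execution. */
static void rcb_update(uint64_t pc, uint64_t src1, uint64_t src2, uint64_t result)
{
    rcb_entry_t *e = &rcb[(pc >> 2) % RCB_ENTRIES];
    e->valid = true; e->pc = pc; e->src1 = src1; e->src2 = src2; e->result = result;
}
```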
Trace Level Reuse (TLR) • [diagram: a trace groups a sequence of instructions I1 … I6] • Contributions • trace-level reuse • initial design issues for integrating TLR • performance potential of TLR • Main results • ideal TLR speed-up of 3.6 • 4K-entry table: 25% of reuse, average trace size of 6 González, Tubella, Molina, “Trace-Level Reuse”, ICPP’99
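A minimal sketch of the trace-level reuse test; the entry layout and the fixed live-in/live-out limit are illustrative assumptions:

```c
/* A trace entry records the starting PC, the trace's live-input values and its
 * live-output values. If every live input matches the current state, the whole
 * trace is skipped by writing its live outputs and jumping past it. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_LIVE 8

typedef struct {
    bool     valid;
    uint64_t start_pc;           /* initial PC of the trace             */
    uint64_t next_pc;            /* PC of the instruction after it      */
    int      n_in, n_out;
    int      in_reg[MAX_LIVE];   /* architectural registers read        */
    uint64_t in_val[MAX_LIVE];   /* their values when the trace ran     */
    int      out_reg[MAX_LIVE];  /* registers written by the trace      */
    uint64_t out_val[MAX_LIVE];  /* values they received                */
} trace_entry_t;

/* Returns true if the trace starting at *pc can be reused; on success the
 * register file is updated with the live outputs and *pc is advanced. */
static bool try_trace_reuse(const trace_entry_t *t, uint64_t *pc, uint64_t regs[])
{
    if (!t->valid || t->start_pc != *pc)
        return false;
    for (int i = 0; i < t->n_in; i++)          /* reuse test on live inputs  */
        if (regs[t->in_reg[i]] != t->in_val[i])
            return false;
    for (int i = 0; i < t->n_out; i++)         /* apply live outputs at once */
        regs[t->out_reg[i]] = t->out_val[i];
    *pc = t->next_pc;                          /* skip the whole trace       */
    return true;
}
```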
Trace Level Speculation (TLS) • Two orthogonal issues • microarchitecture support for trace speculation → TSMA • control and data speculation techniques → static analysis based on profiling info (compiler analysis to support TSMA) • Contributions • Trace Level Speculative Multithreaded Architecture (TSMA) • compiler analysis to support TSMA • Main results • speed-up of 1.38 with 20% of misspeculations Molina, González, Tubella, “Trace-Level Speculative Multithreaded Architecture (TSMA)”, ICCD’02 Molina, González, Tubella, “Compiler Analysis for TSMA”, INTERACT’05 Molina, Tubella, González, “Reducing Misspeculation Penalty in TSMA”, ISHPC’05
Objectives & Proposals • To improve the memory system • Redundant store instructions • Non redundant data cache • To speed up the execution of instructions • Redundant computation buffer (ILR) • Trace-level reuse buffer (TLR) • Trace-level speculative multithreaded architecture (TLS)
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Motivation • Caches occupy close to 50% of total die area • Caches are responsible for a significant part of the total power dissipated by a processor
Data Value Repetition • [chart: percentage of repetitive values vs percentage of time] • Spec CPU2000, 1 billion instructions, 256KB data cache
Conventional Cache • Example: Tag X holds Value A = 1234 and Value B = FFFF; Tag Y holds Value C = 0000 and Value D = 1234 • If (Value A == Value D) then Value Repetition
Non Redundant Data Cache • Pointer Table (tags and per-word pointers) plus Value Table holding each distinct value (1234, FFFF, 0000) only once • Die Area Reduction
Non Redundant Data Cache • Additional Hardware: Pointers • each word in the Pointer Table (under Tag X, Tag Y) points to its value (1234, FFFF, 0000) in the Value Table
Non Redundant Data Cache • Additional Hardware: Counters • each Value Table entry keeps a reference counter of how many pointers use it (here 1, 2 and 1, with 1234 shared by two words)
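A software-level sketch of the Non Redundant Cache bookkeeping just described (pointer table, value table and reference counters); the sizes, linear search and replacement handling are simplifying assumptions, not the evaluated hardware:

```c
/* The pointer table keeps, per cached word, an index into a value table; the
 * value table stores each distinct value once, with a reference counter so an
 * entry can be freed when no pointer uses it any more. */
#include <stdint.h>

#define VT_ENTRIES 64
#define PT_ENTRIES 256

typedef struct { uint32_t value; int refcount; } vt_entry_t;

static vt_entry_t value_table[VT_ENTRIES];   /* refcount == 0 means free   */
static int pointer_table[PT_ENTRIES];        /* index into value_table, -1 = empty */

static void nrc_init(void)
{
    for (int i = 0; i < PT_ENTRIES; i++)
        pointer_table[i] = -1;               /* no word points anywhere yet */
}

/* Find or allocate a value-table entry for 'value'; returns its index or -1. */
static int vt_acquire(uint32_t value)
{
    int free_slot = -1;
    for (int i = 0; i < VT_ENTRIES; i++) {
        if (value_table[i].refcount > 0 && value_table[i].value == value) {
            value_table[i].refcount++;       /* value already stored: share it */
            return i;
        }
        if (value_table[i].refcount == 0 && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {                    /* allocate a new value entry  */
        value_table[free_slot].value = value;
        value_table[free_slot].refcount = 1;
    }
    return free_slot;                        /* -1 means the value table is full */
}

/* Store 'value' into cache word 'word_idx', releasing the old value entry. */
static void nrc_store(int word_idx, uint32_t value)
{
    int old = pointer_table[word_idx];
    if (old >= 0)
        value_table[old].refcount--;         /* old value loses one reference */
    pointer_table[word_idx] = vt_acquire(value);
}
```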
Data Value Inlining • Some values can be represented with a small number of bits (Narrow Values) • Narrow values can be inlined into pointer area • Simple sign extension is applied • Benefits • enlarges effective capacity of VT • reduces latency • reduces power dissipation
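A minimal sketch of the narrow-value test that data value inlining relies on: a value is inlined if it can be represented in the pointer field and recovered by simple sign extension. The pointer width used here is an arbitrary example, not the evaluated configuration:

```c
/* Narrow-value test: the value fits in PTR_BITS bits as a signed integer,
 * so it can live in the pointer field instead of a Value Table entry. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PTR_BITS 8   /* width of the pointer field, example value only */

static bool is_narrow(int32_t v)
{
    /* narrow iff v is representable in PTR_BITS bits with sign extension */
    return v >= -(1 << (PTR_BITS - 1)) && v < (1 << (PTR_BITS - 1));
}

int main(void)
{
    printf("%d narrow: %d\n", 0x1F, is_narrow(0x1F));     /* fits in 8 bits: 1 */
    printf("%d narrow: %d\n", -3, is_narrow(-3));         /* fits: 1           */
    printf("%d narrow: %d\n", 0x1234, is_narrow(0x1234)); /* does not fit: 0   */
    return 0;
}
```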
Non Redundant Data Cache • Data Value Inlining • [diagram: narrow values (e.g. F and 0) are inlined directly in the Pointer Table entries, while the remaining values (0000, 1234, FFFF) stay in the Value Table with their counters]
Miss Rate vs Die Area • [chart: miss ratio vs die area (0.1, 0.5, 1.0 cm2) for L2 caches of 256KB, 512KB, 1MB, 2MB and 4MB; configurations CONV, VT20, VT30, VT50] • Spec CPU2000, 1 billion instructions
Results • Caches ranging from 256 KB to 4 MB
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Trace Level Speculation • Avoids serialization caused by data dependences • Skips multiple instructions in a row • Predicts values based on the past • Solves the live-input test • Introduces penalties due to misspeculations
Trace Level Speculation • Two orthogonal issues • microarchitecture support for trace speculation • control and data speculation techniques • prediction of initial and final points • prediction of live output values • Trace Level Speculative Multithreaded Architecture (TSMA) • does not introduce significant misspeculation penalties • Compiler Analysis • based on static analysis that uses profiling data
Trace Level Speculation with Live Output Test • [diagram: instruction flow through instruction speculation, instruction execution and instruction validation, split between the speculative thread (ST) and the non-speculative thread (NST); live-output update & trace speculation; miss trace speculation detection & recovery actions]
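A highly simplified sketch of trace-level speculation with a live-output test: predicted live outputs are consumed immediately and validated later against the actually computed values, rolling back to a checkpoint on a mismatch. It abstracts away the two threads, queues and buffers of the real TSMA design:

```c
/* Speculate past a trace using predicted live outputs, then validate them
 * once the actual values are available (the live-output test). */
#include <stdint.h>
#include <stdbool.h>

#define NREGS 32

typedef struct { uint64_t regs[NREGS]; uint64_t pc; } cpu_state_t;

/* Checkpoint the state and apply the predicted live outputs of the trace. */
static void speculate_trace(cpu_state_t *st, cpu_state_t *checkpoint,
                            const int out_reg[], const uint64_t pred_val[],
                            int n_out, uint64_t next_pc)
{
    *checkpoint = *st;                       /* needed if speculation fails */
    for (int i = 0; i < n_out; i++)
        st->regs[out_reg[i]] = pred_val[i];  /* predicted live outputs      */
    st->pc = next_pc;                        /* jump over the trace         */
}

/* Live-output test: compare predicted against actually computed values. */
static bool validate_trace(cpu_state_t *st, const cpu_state_t *checkpoint,
                           const uint64_t pred_val[], const uint64_t actual_val[],
                           int n_out)
{
    for (int i = 0; i < n_out; i++)
        if (pred_val[i] != actual_val[i]) {
            *st = *checkpoint;               /* misspeculation: recover     */
            return false;
        }
    return true;                             /* speculation confirmed       */
}
```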
TSMA Block Diagram • [diagram: fetch engine, instruction cache, branch predictor, decode & rename; per-thread (ST and NST) instruction windows, load/store queues, reorder buffers and architectural register files; functional units; trace speculation engine; look-ahead buffer; speculative data cache (L1SDC); non-speculative data caches (L1NSDC, L2NSDC); verification engine]
Compiler Analysis • Focuses on • developing effective trace selection schemes for TSMA • based on static analysis that uses profiling data • Trace Selection • Graph Construction (CFG & DDG) • Graph Analysis
Graph Analysis • Two important issues • initial and final point of a trace • maximize trace length & minimize misspeculations • predictability of live output values • prediction accuracy and utilization degree • Three basic heuristics • Procedure Trace Heuristic • Loop Trace Heuristic • Instruction Chaining Trace Heuristic
Trace Speculation Engine • Traces are communicated to the hardware • at program loading time • filling a special hardware structure (trace table) • Each entry of the trace table contains • initial PC • final PC • live-output values information • branch history • frequency counter
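An illustrative layout of a trace table entry following the fields listed above; the field widths, the branch-history encoding and the live-output limit are assumptions, not the evaluated design:

```c
/* One trace table entry, filled at program loading time from the
 * compiler/profiler-selected traces. */
#include <stdint.h>

#define MAX_LIVE_OUT 8

typedef struct {
    uint64_t initial_pc;                 /* where the speculated trace begins   */
    uint64_t final_pc;                   /* first instruction after the trace   */
    uint8_t  n_live_out;                 /* number of live-output values        */
    uint8_t  live_out_reg[MAX_LIVE_OUT]; /* architectural registers produced    */
    uint64_t live_out_val[MAX_LIVE_OUT]; /* their predicted values              */
    uint16_t branch_history;             /* history used to qualify the trace   */
    uint32_t freq_counter;               /* how often the trace was profiled    */
} trace_table_entry_t;
```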
Simulation Parameters • Base microarchitecture • out-of-order machine, 4 instructions per cycle • I cache: 16KB, D cache: 16KB, L2 shared: 256KB • bimodal predictor • 64-entry ROB, FUs: 4 int, 2 div, 2 mul, 4 FP • TSMA additional structures • each thread: I window, reorder buffer, register file • speculative data cache: 1KB • trace table: 128 entries, 4-way set associative • look ahead buffer: 128 entries • verification engine: up to 8 instructions per cycle
Speedup • [bar chart: speedup, axis from 1.00 to 1.45] • Spec CPU2000, 250 million instructions
Misspeculations • [chart] • Spec CPU2000, 250 million instructions
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Conclusions • Repetition is very common in programs • It can be exploited • to improve the memory system • to speed up the execution of instructions • Investigated several alternatives • Novel cache organizations • Instruction level reuse approach • Trace level reuse concept • Trace level speculation architecture
Future Work • Value repetition in instruction caches • Profiling to support data value reuse schemes • Traces starting at different PCs • Value prediction in TSMA • Multiple speculations in TSMA • Multiple threads in TSMA