THESIS DEFENSE (Barcelona, 14 December 2005) Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente Advisors: Antonio González and Jordi Tubella
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Motivation • Programs are general by design • real-world programs • operating systems • Often designed with future expansion and code reuse in mind • Input sets have little variation • Even with aggressive compilers, repetition is relatively common
Types of Repetition • z = F (x, y) • repetition of computations • repetition of values
Repetitive Computations • [bar chart: percentage of repetitive computations, 0%–100%] • Spec CPU2000, 500 million instructions
Repetitive Values • [bar chart: percentage of repetitive values, 0%–100%] • Spec CPU2000, 500 million instructions, analysis of destination value
Objectives • To improve the memory system: exploit value repetition of store instructions • redundant store instructions • non redundant data cache • To speed up the execution of instructions: exploit computation repetition of all instructions • redundant computation buffer (ILR) • trace-level reuse (TLR) • trace-level speculative multithreaded architecture (TLS)
Experimental Framework • Methodology • Analysis of benchmarks • Definition of proposal • Evaluation of proposal • Tools • Atom • Cacti 3.0 • Simplescalar Tool Set • Benchmarks • Spec CPU95 • Spec CPU2000
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Techniques to Improve Memory • value repetition exploited through: • redundant stores • non redundant cache
Redundant Store Instructions • Memory location @i holds Value X; the program executes STORE (@i, Value Y) • If (Value X == Value Y) then the store is redundant and does NOT modify memory • Contributions • redundant store instructions • analysis of repetition in the same storage location • redundant stores applied to reduce memory traffic • Main results • 15%-25% of store instructions are redundant • 5%-20% of memory traffic reduction Molina, González, Tubella, “Reducing Memory Traffic via Redundant Store Instructions”, HPCN’99
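A minimal sketch of the redundant-store check, with a toy word-addressed memory array standing in for the data cache; the function and array names are illustrative, not the thesis implementation:

```c
/* Redundant (silent) store check: a store is redundant when the value already
 * held at the target location equals the value being written, so the write
 * (and the resulting traffic to the next memory level) can be skipped. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 1024
static uint32_t memory[MEM_WORDS];          /* toy backing store, word addressed */

/* Returns true when the store was redundant and the write was elided. */
static bool execute_store(uint32_t word_addr, uint32_t new_value)
{
    if (memory[word_addr] == new_value)     /* Value X == Value Y            */
        return true;                        /* redundant: no update, no traffic */
    memory[word_addr] = new_value;          /* regular store                 */
    return false;
}

int main(void)
{
    execute_store(7, 0x1234);                      /* first write: not redundant */
    bool redundant = execute_store(7, 0x1234);     /* same value: redundant      */
    printf("second store redundant: %d\n", redundant);
    return 0;
}
```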
Non Redundant Data Cache • Example data cache: Tag X holds Value A = 1234 and Value B = FFFF; Tag Y holds Value C = 0000 and Value D = 1234 • If (Value A == Value D) then Value Repetition across locations • Contributions • analysis of repetition in several storage locations • non redundant data cache (NRC) • Main results • on average, a value is stored 4 times at any given time • NRC: -32% area, -13% energy, -25% latency, +5% miss rate Molina, Aliagas, García, Tubella, González, “Non Redundant Data Cache”, ISLPED’03 Aliagas, Molina, García, González, Tubella, “Value Compression to Reduce Power in Data Caches”, EUROPAR’03
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Techniques to Speed Up Instruction Execution • Computation repetition is exploited by data value reuse and data value speculation • Avoid serialization caused by data dependences • Determine results of instructions without executing them • Target is to speed up the execution of programs
Techniques to Speed Up Instruction Execution • Data Value Reuse • NON SPECULATIVE !!! • buffers previous inputs and their corresponding outputs • only possible if the computation has been done in the past • inputs have to be ready at reuse test time
Techniques to Speed Up Instruction Execution • Data Value Speculation • SPECULATIVE !!! • predicts values as a function of past history • needs to confirm the speculation at a later point • solves the reuse test but introduces misspeculation penalty
Techniques to Speed Up Instruction Execution • Both data value reuse and data value speculation can be applied at two granularities • instruction level: applied to a SINGLE instruction • trace level: applied to a GROUP of instructions
Instruction Level Reuse (ILR) • [pipeline diagram: Fetch, Decode & Rename, OOO Execution, Commit, with the RCB reuse table indexed from the front end] • Contributions • performance potential of ILR • Redundant Computation Buffer (RCB) • Main results • ideal ILR speed-up of 1.5 • RCB speed-up of 1.1 (outperforms previous proposals) Molina, González, Tubella, “Dynamic Removal of Redundant Computations”, ICS’99
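A minimal sketch of an instruction-level reuse lookup in the spirit of the RCB; the table size, direct-mapped indexing and entry fields are illustrative assumptions rather than the evaluated design:

```c
/* Each entry caches the source operand values and the result of a previous
 * execution of the instruction at a given PC; if the current operands match,
 * the result is reused without re-executing the operation. */
#include <stdint.h>
#include <stdbool.h>

#define RCB_ENTRIES 1024

typedef struct {
    bool     valid;
    uint64_t pc;
    uint64_t src1, src2;     /* operand values seen last time       */
    uint64_t result;         /* result produced by that execution   */
} rcb_entry_t;

static rcb_entry_t rcb[RCB_ENTRIES];

/* Reuse test: returns true and fills *result if the computation is repeated. */
static bool rcb_lookup(uint64_t pc, uint64_t src1, uint64_t src2, uint64_t *result)
{
    rcb_entry_t *e = &rcb[(pc >> 2) % RCB_ENTRIES];   /* direct-mapped index on PC */
    if (e->valid && e->pc == pc && e->src1 == src1 && e->src2 == src2) {
        *result = e->result;                          /* skip execution            */
        return true;
    }
    return false;
}

/* Update after a normal (non-reused) execution. */
static void rcb_update(uint64_t pc, uint64_t src1, uint64_t src2, uint64_t result)
{
    rcb_entry_t *e = &rcb[(pc >> 2) % RCB_ENTRIES];
    e->valid = true; e->pc = pc; e->src1 = src1; e->src2 = src2; e->result = result;
}
```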
Trace Level Reuse (TLR) • [diagram: a trace groups a sequence of instructions I1 … I6] • Contributions • trace-level reuse • initial design issues for integrating TLR • performance potential of TLR • Main results • ideal TLR speed-up of 3.6 • 4K-entry table: 25% of reuse, average trace size of 6 González, Tubella, Molina, “Trace-Level Reuse”, ICPP’99
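A minimal sketch of the trace-level reuse test; the entry layout and the fixed live-in/live-out limit are illustrative assumptions:

```c
/* A trace entry records the starting PC, the trace's live-input values and its
 * live-output values. If every live input matches the current state, the whole
 * trace is skipped by writing its live outputs and jumping past it. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_LIVE 8

typedef struct {
    bool     valid;
    uint64_t start_pc;           /* initial PC of the trace             */
    uint64_t next_pc;            /* PC of the instruction after it      */
    int      n_in, n_out;
    int      in_reg[MAX_LIVE];   /* architectural registers read        */
    uint64_t in_val[MAX_LIVE];   /* their values when the trace ran     */
    int      out_reg[MAX_LIVE];  /* registers written by the trace      */
    uint64_t out_val[MAX_LIVE];  /* values they received                */
} trace_entry_t;

/* Returns true if the trace starting at *pc can be reused; on success the
 * register file is updated with the live outputs and *pc is advanced. */
static bool try_trace_reuse(const trace_entry_t *t, uint64_t *pc, uint64_t regs[])
{
    if (!t->valid || t->start_pc != *pc)
        return false;
    for (int i = 0; i < t->n_in; i++)          /* reuse test on live inputs  */
        if (regs[t->in_reg[i]] != t->in_val[i])
            return false;
    for (int i = 0; i < t->n_out; i++)         /* apply live outputs at once */
        regs[t->out_reg[i]] = t->out_val[i];
    *pc = t->next_pc;                          /* skip the whole trace       */
    return true;
}
```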
Trace Level Speculation (TLS) • Two orthogonal issues • microarchitecture support for trace speculation → TSMA • control and data speculation techniques → static analysis based on profiling info (compiler analysis to support TSMA) • Contributions • Trace Level Speculative Multithreaded Architecture (TSMA) • compiler analysis to support TSMA • Main results • speed-up of 1.38 with 20% of misspeculations Molina, González, Tubella, “Trace-Level Speculative Multithreaded Architecture (TSMA)”, ICCD’02 Molina, González, Tubella, “Compiler Analysis for TSMA”, INTERACT’05 Molina, Tubella, González, “Reducing Misspeculation Penalty in TSMA”, ISHPC’05
Objectives & Proposals • To improve the memory system • Redundant store instructions • Non redundant data cache • To speed up the execution of instructions • Redundant computation buffer (ILR) • Trace-level reuse buffer (TLR) • Trace-level speculative multithreaded architecture (TLS)
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Motivation • Caches occupy close to 50% of total die area • Caches are responsible for a significant part of the total power dissipated by a processor
Data Value Repetition • [chart: percentage of repetitive values vs percentage of time] • Spec CPU2000, 1 billion instructions, 256KB data cache
Conventional Cache • Example: Tag X holds Value A = 1234 and Value B = FFFF; Tag Y holds Value C = 0000 and Value D = 1234 • If (Value A == Value D) then Value Repetition
Non Redundant Data Cache • Pointer Table (tags and per-word pointers) plus Value Table holding each distinct value (1234, FFFF, 0000) only once • Die Area Reduction
Non Redundant Data Cache • Additional Hardware: Pointers • each word in the Pointer Table (under Tag X, Tag Y) points to its value (1234, FFFF, 0000) in the Value Table
Non Redundant Data Cache • Additional Hardware: Counters • each Value Table entry keeps a reference counter of how many pointers use it (here 1, 2 and 1, with 1234 shared by two words)
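A software-level sketch of the Non Redundant Cache bookkeeping just described (pointer table, value table and reference counters); the sizes, linear search and replacement handling are simplifying assumptions, not the evaluated hardware:

```c
/* The pointer table keeps, per cached word, an index into a value table; the
 * value table stores each distinct value once, with a reference counter so an
 * entry can be freed when no pointer uses it any more. */
#include <stdint.h>

#define VT_ENTRIES 64
#define PT_ENTRIES 256

typedef struct { uint32_t value; int refcount; } vt_entry_t;

static vt_entry_t value_table[VT_ENTRIES];   /* refcount == 0 means free   */
static int pointer_table[PT_ENTRIES];        /* index into value_table, -1 = empty */

static void nrc_init(void)
{
    for (int i = 0; i < PT_ENTRIES; i++)
        pointer_table[i] = -1;               /* no word points anywhere yet */
}

/* Find or allocate a value-table entry for 'value'; returns its index or -1. */
static int vt_acquire(uint32_t value)
{
    int free_slot = -1;
    for (int i = 0; i < VT_ENTRIES; i++) {
        if (value_table[i].refcount > 0 && value_table[i].value == value) {
            value_table[i].refcount++;       /* value already stored: share it */
            return i;
        }
        if (value_table[i].refcount == 0 && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {                    /* allocate a new value entry  */
        value_table[free_slot].value = value;
        value_table[free_slot].refcount = 1;
    }
    return free_slot;                        /* -1 means the value table is full */
}

/* Store 'value' into cache word 'word_idx', releasing the old value entry. */
static void nrc_store(int word_idx, uint32_t value)
{
    int old = pointer_table[word_idx];
    if (old >= 0)
        value_table[old].refcount--;         /* old value loses one reference */
    pointer_table[word_idx] = vt_acquire(value);
}
```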
Data Value Inlining • Some values can be represented with a small number of bits (Narrow Values) • Narrow values can be inlined into pointer area • Simple sign extension is applied • Benefits • enlarges effective capacity of VT • reduces latency • reduces power dissipation
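A minimal sketch of the narrow-value test that data value inlining relies on: a value is inlined if it can be represented in the pointer field and recovered by simple sign extension. The pointer width used here is an arbitrary example, not the evaluated configuration:

```c
/* Narrow-value test: the value fits in PTR_BITS bits as a signed integer,
 * so it can live in the pointer field instead of a Value Table entry. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PTR_BITS 8   /* width of the pointer field, example value only */

static bool is_narrow(int32_t v)
{
    /* narrow iff v is representable in PTR_BITS bits with sign extension */
    return v >= -(1 << (PTR_BITS - 1)) && v < (1 << (PTR_BITS - 1));
}

int main(void)
{
    printf("%d narrow: %d\n", 0x1F, is_narrow(0x1F));     /* fits in 8 bits: 1 */
    printf("%d narrow: %d\n", -3, is_narrow(-3));         /* fits: 1           */
    printf("%d narrow: %d\n", 0x1234, is_narrow(0x1234)); /* does not fit: 0   */
    return 0;
}
```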
Non Redundant Data Cache • Data Value Inlining • [diagram: narrow values (e.g. F and 0) are inlined directly in the Pointer Table entries, while the remaining values (0000, 1234, FFFF) stay in the Value Table with their counters]
Miss Rate vs Die Area • [chart: miss ratio vs die area (0.1, 0.5, 1.0 cm2) for L2 caches of 256KB, 512KB, 1MB, 2MB and 4MB; configurations CONV, VT20, VT30, VT50] • Spec CPU2000, 1 billion instructions
Results • Caches ranging from 256 KB to 4 MB
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Trace Level Speculation • Avoids serialization caused by data dependences • Skips multiple instructions in a row • Predicts values based on the past • Solves the live-input test • Introduces penalties due to misspeculations
Trace Level Speculation • Two orthogonal issues • microarchitecture support for trace speculation • control and data speculation techniques • prediction of initial and final points • prediction of live output values • Trace Level Speculative Multithreaded Architecture (TSMA) • does not introduce significant misspeculation penalties • Compiler Analysis • based on static analysis that uses profiling data
Trace Level Speculation with Live Output Test • [diagram: instruction flow through instruction speculation, instruction execution and instruction validation, split between the speculative thread (ST) and the non-speculative thread (NST); live-output update & trace speculation; miss trace speculation detection & recovery actions]
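A highly simplified sketch of trace-level speculation with a live-output test: predicted live outputs are consumed immediately and validated later against the actually computed values, rolling back to a checkpoint on a mismatch. It abstracts away the two threads, queues and buffers of the real TSMA design:

```c
/* Speculate past a trace using predicted live outputs, then validate them
 * once the actual values are available (the live-output test). */
#include <stdint.h>
#include <stdbool.h>

#define NREGS 32

typedef struct { uint64_t regs[NREGS]; uint64_t pc; } cpu_state_t;

/* Checkpoint the state and apply the predicted live outputs of the trace. */
static void speculate_trace(cpu_state_t *st, cpu_state_t *checkpoint,
                            const int out_reg[], const uint64_t pred_val[],
                            int n_out, uint64_t next_pc)
{
    *checkpoint = *st;                       /* needed if speculation fails */
    for (int i = 0; i < n_out; i++)
        st->regs[out_reg[i]] = pred_val[i];  /* predicted live outputs      */
    st->pc = next_pc;                        /* jump over the trace         */
}

/* Live-output test: compare predicted against actually computed values. */
static bool validate_trace(cpu_state_t *st, const cpu_state_t *checkpoint,
                           const uint64_t pred_val[], const uint64_t actual_val[],
                           int n_out)
{
    for (int i = 0; i < n_out; i++)
        if (pred_val[i] != actual_val[i]) {
            *st = *checkpoint;               /* misspeculation: recover     */
            return false;
        }
    return true;                             /* speculation confirmed       */
}
```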
TSMA Block Diagram • [diagram: fetch engine, instruction cache, branch predictor, decode & rename; per-thread (ST and NST) instruction windows, load/store queues, reorder buffers and architectural register files; functional units; trace speculation engine; look-ahead buffer; speculative data cache (L1SDC); non-speculative data caches (L1NSDC, L2NSDC); verification engine]
Compiler Analysis • Focuses on • developing effective trace selection schemes for TSMA • based on static analysis that uses profiling data • Trace Selection • Graph Construction (CFG & DDG) • Graph Analysis
Graph Analysis • Two important issues • initial and final point of a trace • maximize trace length & minimize misspeculations • predictability of live output values • prediction accuracy and utilization degree • Three basic heuristics • Procedure Trace Heuristic • Loop Trace Heuristic • Instruction Chaining Trace Heuristic
Trace Speculation Engine • Traces are communicated to the hardware • at program loading time • filling a special hardware structure (trace table) • Each entry of the trace table contains • initial PC • final PC • live-output values information • branch history • frequency counter
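An illustrative layout of a trace table entry following the fields listed above; the field widths, the branch-history encoding and the live-output limit are assumptions, not the evaluated design:

```c
/* One trace table entry, filled at program loading time from the
 * compiler/profiler-selected traces. */
#include <stdint.h>

#define MAX_LIVE_OUT 8

typedef struct {
    uint64_t initial_pc;                 /* where the speculated trace begins   */
    uint64_t final_pc;                   /* first instruction after the trace   */
    uint8_t  n_live_out;                 /* number of live-output values        */
    uint8_t  live_out_reg[MAX_LIVE_OUT]; /* architectural registers produced    */
    uint64_t live_out_val[MAX_LIVE_OUT]; /* their predicted values              */
    uint16_t branch_history;             /* history used to qualify the trace   */
    uint32_t freq_counter;               /* how often the trace was profiled    */
} trace_table_entry_t;
```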
Simulation Parameters • Base microarchitecture • out-of-order machine, 4 instructions per cycle • I cache: 16KB, D cache: 16KB, L2 shared: 256KB • bimodal predictor • 64-entry ROB, FUs: 4 int, 2 div, 2 mul, 4 FP • TSMA additional structures • each thread: I window, reorder buffer, register file • speculative data cache: 1KB • trace table: 128 entries, 4-way set associative • look ahead buffer: 128 entries • verification engine: up to 8 instructions per cycle
Speedup • [bar chart: speedup, axis from 1.00 to 1.45] • Spec CPU2000, 250 million instructions
Misspeculations • [chart] • Spec CPU2000, 250 million instructions
Outline • Motivation & Objectives • Overview of Proposals • To improve the memory system • To speed up the execution of instructions • Non Redundant Data Cache • Trace-Level Speculative Multithreaded Arch. • Conclusions & Future Work
Conclusions • Repetition is very common in programs • It can be exploited • to improve the memory system • to speed up the execution of instructions • Investigated several alternatives • Novel cache organizations • Instruction level reuse approach • Trace level reuse concept • Trace level speculation architecture
Future Work • Value repetition in instruction caches • Profiling to support data value reuse schemes • Traces starting at different PCs • Value prediction in TSMA • Multiple speculations in TSMA • Multiple threads in TSMA