Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)
David J. Lilja
Department of Electrical and Computer Engineering, University of Minnesota
lilja@ece.umn.edu
Acknowledgements
• Graduate students (who did the real work)
  • Ying Chen
  • Resit Sendag
  • Joshua Yi
• Faculty collaborator
  • Douglas Hawkins (School of Statistics)
• Funders
  • National Science Foundation
  • IBM
  • HP/Compaq
  • Minnesota Supercomputing Institute
Problem #1
• Speculative execution is becoming more popular
  • Branch prediction
  • Value prediction
  • Speculative multithreading
• Potentially higher performance
• But what is the impact on the memory system?
  • Does speculation pollute the cache/memory hierarchy?
  • Does it lead to more misses?
Problem #2
• Computer architecture research relies on simulation
• Simulation is slow
  • Years to simulate the SPEC CPU2000 benchmarks completely
• Simulation can be wildly inaccurate
  • Did I really mean to build that system?
• Results are difficult to reproduce
→ Need statistical rigor
Outline (Part 1)
• The Superthreaded Architecture
• The Wrong Execution Cache (WEC)
• Experimental Methodology
• Performance of the WEC
[Chen, Sendag, Lilja, IPDPS 2003]
Hard-to-Parallelize Applications
• Early exit loops
• Pointers and aliases
• Complex branching behaviors
• Small basic blocks
• Small loop counts
→ Hard to parallelize with conventional techniques
Introduce "Maybe" Dependences
• Is there a data dependence? Pointer aliasing?
  • Yes
  • No
  • Maybe
• "Maybe" allows aggressive compiler optimizations
  • When in doubt, parallelize
  • Use a run-time check to correct a wrong assumption
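A minimal Python sketch of the "when in doubt, parallelize" idea (names and structure invented for illustration, not from the talk): the later iteration runs speculatively under the assumption that a maybe dependence does not exist, and a run-time check on the forwarded store addresses repairs the result if the assumption was wrong.

```python
# Hypothetical model of speculating through a "maybe" dependence.
# run_speculatively() is an invented helper, not part of any real API.

def run_speculatively(y, writes, reads):
    """y      -- shared array both iterations touch
       writes -- {index: value} stores performed by the earlier iteration
       reads  -- indices loaded by the later, speculative iteration
       Returns the values the speculative iteration ends up using."""
    # Speculative thread loads early, assuming the "maybe" dependence
    # does not actually exist.
    seen = {i: y[i] for i in reads}
    # The earlier thread performs its stores and forwards their
    # addresses (the TARGET STORE stage of the execution model).
    for i, v in writes.items():
        y[i] = v
    # Run-time check: a forwarded store address matching a speculative
    # load address means the speculation was wrong -- redo those loads.
    if any(i in writes for i in reads):
        seen = {i: y[i] for i in reads}
    return seen

# Conflict case: the earlier iteration writes y[1], which the
# speculative iteration also reads, so the load of y[1] is redone.
result = run_speculatively([0, 0, 0], {1: 5}, [1, 2])  # -> {1: 5, 2: 0}
```

When no address matches, the speculative work is kept as-is, which is where the parallel speedup comes from.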
Thread Pipelining Execution Model
[Figure: each thread executes in four pipelined stages, with Fork and Sync operations linking thread i to threads i+1 and i+2:
• CONTINUATION – values needed to fork the next thread
• TARGET STORE – forwards the addresses of maybe dependences
• COMPUTATION – forwards addresses and computed data as needed
• WRITE-BACK – commits the thread's results]
The Superthreaded Architecture
[Figure: multiple superscalar cores, each with its own registers, program counter (PC), execution unit, and communication/dependence buffer, sharing a single instruction cache and data cache.]
Wrong Path Execution Within a Superscalar Core
[Figure: execution proceeds speculatively down the predicted path of a branch. When the prediction result is wrong, the speculatively executed instructions lie on the wrong path (wrong-path execution), while the instructions on the correct path are not yet ready to be executed.]
Wrong Thread Execution
[Figure: two parallel regions separated by a sequential region. In the sequential region, the successor threads are marked as wrong threads; a wrong thread kills itself, and any remaining wrong threads from the previous parallel region are killed when the next parallel region begins.]
How Could Wrong Thread Execution Help Improve Performance?

    for (i = 0; i < 10; i++) {
        ...
        for (j = 0; j < i; j++) {
            ...
            x = y[j];
            ...
        }
        ...
    }

• The inner loop is parallelized across four thread units (TU1–TU4)
• When i=4, the correct threads (j = 0, 1, 2, 3) load y[0]–y[3], while the wrong threads forked past the loop exit load y[4], y[5], …
• When i=5, the correct threads (j = 0, …, 4) need y[0]–y[4], so the data loaded by the previous iteration's wrong threads is already in place
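The access pattern above can be modeled with a tiny Python sketch (hypothetical, not from the talk); `inner_loop_accesses` and its `wrong_threads` parameter are invented names for illustration.

```python
# Illustrative model of which y[] elements the thread units touch for
# one outer-loop iteration i, when the inner loop j = 0..i-1 is
# parallelized and the speculatively forked threads past the loop exit
# (the "wrong threads") still issue their loads.

def inner_loop_accesses(i, wrong_threads=2):
    """Return (correct, wrong): the y[] indices loaded by the correct
    inner-loop threads and by the wrong threads forked past the exit."""
    correct = list(range(i))                        # j = 0 .. i-1
    wrong = list(range(i, i + wrong_threads))       # past the loop exit
    return correct, wrong

correct, wrong = inner_loop_accesses(4)
# correct -> [0, 1, 2, 3]; wrong -> [4, 5].  The wrong threads touch
# exactly the elements iteration i=5 will need, so their squashed loads
# act as a prefetch.
```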
Operation of the WEC
[Figure: the paths taken through the cache hierarchy and the WEC during correct execution and wrong execution.]
Processor Configurations for Simulations
• SIMCA (the SIMulator for the Superthreaded Architecture)
[Table: simulated processor features and their configurations.]
Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks
[Figure: speedups for the baseline configuration.]
Performance of the wth-wp-wec Configuration on Top of the Parallel Execution
Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)
Conclusions for the WEC
• Allows loads to continue executing even after they are known to be incorrectly issued
  • But does not let them change the processor state
• 45.5% average reduction in the number of misses
• 9.7% average improvement on top of parallel execution
• 4% average improvement over a victim cache
• 5.6% average improvement over next-line prefetching
• Cost
  • 14% additional loads
  • Minor additional hardware complexity
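A toy Python model of the central mechanism (an invented simplification, not the paper's design): wrongly-issued loads may fill only a small side cache, never the regular cache, and correct-path loads can later hit in that side cache.

```python
# Hypothetical sketch of a Wrong Execution Cache next to a tiny
# direct-mapped L1.  Class and method names are invented for this
# illustration.

from collections import OrderedDict

class WECModel:
    def __init__(self, l1_lines=4, wec_lines=2):
        self.l1 = [None] * l1_lines      # direct-mapped regular cache
        self.wec = OrderedDict()         # small FIFO side cache
        self.wec_lines = wec_lines
        self.misses = 0

    def wrong_load(self, addr):
        """A load known to be incorrectly issued: it may fill only the
        WEC, so it cannot pollute the correct-path cache state."""
        self.wec[addr] = True
        if len(self.wec) > self.wec_lines:
            self.wec.popitem(last=False)  # evict the oldest entry

    def load(self, addr):
        """A correct-path load: hit in L1 or the WEC, else count a miss."""
        idx = addr % len(self.l1)
        if self.l1[idx] != addr and addr not in self.wec:
            self.misses += 1              # would fetch from memory
        self.l1[idx] = addr               # correct path fills L1 normally

c = WECModel()
c.wrong_load(8)   # a squashed wrong-path load warms the WEC...
c.load(8)         # ...so the later correct load does not miss
```

The point of the design choice is visible even in this toy: the wrong-path fill costs nothing in the regular cache, but converts a later correct-path miss into a hit.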
Typical Computer Architecture Study
1. Find an interesting problem/performance bottleneck
   • E.g., memory delays
2. Invent a clever idea for solving it
   • This is the hard part
3. Implement the idea in a processor/system simulator
   • This is the part grad students usually like best
4. Run simulations on n "standard" benchmark programs
   • This is time-consuming and boring
5. Compare performance with and without your change
   • Execution time, clocks per instruction (CPI), etc.
Problem #2 – Simulation in Computer Architecture Research
• Simulators are an important tool for computer architecture research and design
  • Low cost
  • Faster than building a new system
  • Very flexible
Performance Evaluation Techniques Used in ISCA Papers
* Some papers used more than one evaluation technique.
Simulation is Very Popular, But …
• Current simulation methodology is not
  • Formal
  • Rigorous
  • Statistically based
• There are never enough simulations
  • We design a new processor based on a few seconds of actual execution time
• What are the benchmark programs really exercising?
An Example – Sensitivity Analysis
• Which parameters should be varied? Which held fixed?
• What range of values should be used for each variable parameter?
• What values should be used for the constant parameters?
• Are there interactions between the variable and fixed parameters?
• What is the magnitude of those interactions?
Let’s Introduce Some Statistical Rigor
• Decreases the number of errors in
  • Modeling
  • Implementation
  • Setup
  • Analysis
• Helps find errors more quickly
• Provides greater insight into
  • The processor
  • The effects of an enhancement
• Provides objective confidence in results
• Provides statistical support for conclusions
Outline (Part 2)
• A statistical technique for
  • Examining the overall impact of an architectural change
  • Classifying benchmark programs
  • Ranking the importance of processor/simulation parameters
  • Reducing the total number of simulation runs
[Yi, Lilja, Hawkins, HPCA 2003]
A Technique to Limit the Number of Simulations
• Plackett and Burman designs (1946)
  • Multifactorial designs, originally proposed for mechanical assemblies
• Estimates the effects of the main factors only
  • Ignores interactions
• Logically minimal number of experiments to estimate the effects of m input parameters (factors)
  • Requires O(m) experiments instead of O(2^m), or O(v^m) for v values per parameter
Plackett and Burman Designs
• PB designs exist only in sizes that are multiples of 4
• Require X experiments for m parameters
  • X = next multiple of 4 greater than m
• PB design matrix
  • Rows = configurations
  • Columns = each parameter's value in each configuration
  • High/low values = +1 / −1
  • First row = taken from the Plackett and Burman paper
  • Each subsequent row = circular right shift of the preceding row
  • Last row = all −1
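The construction above can be sketched in a few lines of Python. The generating row used here is the commonly quoted first row for the 8-run design (enough for up to 7 parameters); `pb_matrix` is an invented helper name.

```python
# Build a Plackett-Burman design matrix: start from the published
# generating row, circularly right-shift it to produce each subsequent
# row, and finish with a row of all -1 (all parameters at their low
# values).

def pb_matrix(first_row):
    k = len(first_row)                  # number of parameter columns
    rows = [list(first_row)]
    for _ in range(k - 1):              # circular right shift of the
        prev = rows[-1]                 # preceding row
        rows.append([prev[-1]] + prev[:-1])
    rows.append([-1] * k)               # last row: all low values
    return rows

# 8-run design for up to 7 parameters.
design = pb_matrix([+1, +1, +1, -1, +1, -1, -1])
```

Every column of the resulting matrix is balanced (four +1s and four −1s) and any two distinct columns are orthogonal; that orthogonality is what lets the main effects of 7 parameters be estimated from 8 runs instead of 2^7 = 128.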
PB Design
• Only the magnitude of an effect is important
  • The sign is meaningless
• In the example, the most → least important effects are:
  • [C, D, E] → F → G → A → B
Case Study #1
• Determine the most significant parameters in a processor simulator
Determine the Most Significant Processor Parameters
• Problem
  • There are very many parameters in a simulator
  • How should parameter values be chosen?
  • How do we decide which parameters are most important?
• Approach
  • Choose reasonable upper/lower bounds for each parameter
  • Rank the parameters by their impact on total execution time
Simulation Environment
• SimpleScalar simulator
  • sim-outorder 3.0
• Selected SPEC CPU2000 benchmarks
  • gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
• MinneSPEC reduced input sets
• Compiled with gcc (PISA) at -O3
Determining the Most Significant Parameters
1. Run simulations to find the response
   • With each input parameter at its high/low (on/off) value
Determining the Most Significant Parameters
2. Calculate the effect of each parameter
   • Across all configurations
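Step 2 amounts to a dot product of each parameter's +1/−1 design column with the measured responses. The sketch below is illustrative (the matrix and response values are made up, and `parameter_effects` is an invented name), not the paper's actual data.

```python
# Estimate each parameter's effect from a +1/-1 design matrix and the
# measured response (e.g. execution time) of each configuration.

def parameter_effects(design, responses):
    """design    -- list of +1/-1 rows, one per simulated configuration
       responses -- one measured response per configuration
       Returns one effect estimate per parameter column."""
    ncols = len(design[0])
    return [sum(row[j] * r for row, r in zip(design, responses))
            for j in range(ncols)]

# Two-parameter toy example: the response is high exactly when column 0
# is high, so parameter 0 shows a large effect and parameter 1 none.
effects = parameter_effects([[+1, +1], [+1, -1], [-1, +1], [-1, -1]],
                            [10, 10, 2, 2])  # -> [16, 0]
```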
Determining the Most Significant Parameters
3. For each benchmark, rank the parameters in descending order of effect (1 = most important, …)
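Step 3 can be sketched as a simple ranking by effect magnitude; recall from the PB slide that only the magnitude matters, not the sign. The helper name and example effect values below are invented for illustration.

```python
# Rank parameters by the absolute value of their estimated effects
# (rank 1 = most important); the sign of an effect is ignored.

def rank_effects(effects):
    """Map each parameter index to its rank by |effect|."""
    order = sorted(range(len(effects)), key=lambda j: -abs(effects[j]))
    return {param: rank + 1 for rank, param in enumerate(order)}

# Parameter 1 has the largest |effect|, so it gets rank 1.
ranks = rank_effects([16, -40, 3])  # -> {1: 1, 0: 2, 2: 3}
```

Averaging these per-benchmark ranks across the benchmark suite is one natural way to obtain the overall ordering of parameters.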