CMP Design Choices Finding Parameters that Impact CMP Performance Sam Koblenski and Peter McClone
Outline • Introduction • Assumptions • Plackett & Burman Analysis • Simulation methods • Statistical Design • Plackett & Burman Results • Mean Value Analysis • MVA Implementation • MVA Results • AMVA Implementation • AMVA Results • Complementary Results • Conclusions
Introduction • Two-part study • The design space is huge; how can we reduce it? • Method 1 • Plackett & Burman (PB) Analysis finds critical parameters • Design uses extreme values of parameters • Detailed architecture design can focus on a few parameters
Introduction (cont.) • Method 2 • Mean Value Analysis Model of a CMP • Simply designed to compute throughput • Design choices can be narrowed down quickly • Intuition is gained and patterns/parameter relationships identified
Assumptions - PB Design • In-Order approximated as OoO with small window • Die Size = 300 mm² (16 MB Cache @ 65 nm) • L2 Cache Size expanded to fill the die • Discrete sizes: 4, 8, 12 MB • Associativity can be non-power-of-2 • Core size measured in Cache Byte Equivalents:
Simulation Methodology • Simics with Ruby & Opal • 16P sims used cache warmup files • 2P sims ran for more transactions • Attempted OLTP and JBB benchmarks
Plackett & Burman Design • Motivation • Narrow a huge design space • Minimize simulation runs (experiments) • Preliminaries • Performance Measure • Extreme Parameter Values • Number of Parameters (N < 4×n − 1)
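As a rough illustration of how a PB design keeps the run count low, the sketch below builds the standard 12-run design (up to 11 two-level parameters) from its well-known generator row and ranks parameters by the magnitude of their effect. The function names and the synthetic responses are hypothetical; the slides do not say which design size or performance measure was used.

```python
# Minimal sketch of a 12-run Plackett & Burman design (assumed size; the
# slides do not state how many runs or parameters the study used).
# +1 = parameter at its high extreme, -1 = parameter at its low extreme.

PB12_GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def pb12_design():
    """Return 12 runs x 11 parameters: 11 cyclic shifts plus an all-low run."""
    n = len(PB12_GENERATOR)
    rows = [
        [PB12_GENERATOR[(col - shift) % n] for col in range(n)]
        for shift in range(n)
    ]
    rows.append([-1] * n)
    return rows


def effects(design, responses):
    """Effect of each parameter: sum over runs of (level * response)."""
    n_params = len(design[0])
    return [
        sum(row[j] * y for row, y in zip(design, responses))
        for j in range(n_params)
    ]


def rank_parameters(names, design, responses):
    """Sort parameter names by |effect|, most influential first."""
    scored = zip(names, effects(design, responses))
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    design = pb12_design()
    # Synthetic responses (hypothetical): only parameters 0 and 5 matter.
    responses = [100 + 3 * row[0] + 1 * row[5] for row in design]
    names = [f"param{j}" for j in range(11)]
    for name, effect in rank_parameters(names, design, responses)[:3]:
        print(name, effect)
```

Each row is one simulation run, so 11 parameters cost 12 runs instead of the 2^11 runs of a full factorial sweep.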
PB Results • Extreme Values stressed the simulator • Have not yet completed an entire set of runs • Possibly necessary to build a custom L2 network for each run
Assumptions - MVA • The time between memory requests is exponentially distributed • All processor cores exhibit the same average behavior with respect to service times and miss rates • Doubling the cache size scales the miss rate by a factor of 1/√2 • An in-order core takes approximately the same area as 50 KB of cache
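The last two assumptions can be turned into a small area and miss-rate calculator. This is only a sketch: the 16 MB cache-equivalent die budget is borrowed from the PB assumptions slide, and applying it to the MVA model, along with all function and constant names, is an assumption made here for illustration.

```python
import math

KB = 1024
MB = 1024 * KB

# Assumed die budget in cache-byte equivalents (taken from the PB slide).
DIE_BUDGET_BYTES = 16 * MB
# One in-order core is assumed to cost the same area as 50 KB of cache.
CORE_BYTES = 50 * KB


def cores_that_fit(l2_bytes):
    """Number of in-order cores that fit in the area left over after the L2."""
    return (DIE_BUDGET_BYTES - l2_bytes) // CORE_BYTES


def scaled_miss_rate(base_miss_rate, base_bytes, new_bytes):
    """Square-root rule: each doubling of cache size multiplies misses by 1/sqrt(2)."""
    return base_miss_rate * math.sqrt(base_bytes / new_bytes)


if __name__ == "__main__":
    for l2_mb in (4, 8, 12):
        l2 = l2_mb * MB
        # The 5% base miss rate at 4 MB is a made-up illustrative value.
        miss = scaled_miss_rate(0.05, 4 * MB, l2)
        print(l2_mb, "MB L2:", cores_that_fit(l2), "cores, miss rate", round(miss, 4))
```

The trade-off the model explores is visible directly: a larger L2 lowers the miss rate but leaves area for fewer cores.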
MVA Design • Simple Closed Model:
MVA Design • Two phases of this Model design • First: Use the exact MVA equations • Use average time between memory accesses as an application parameter • Solve for throughput • Second: Use Approximate MVA (AMVA) • Use an iterative method to converge on the service time • Solve for throughput
Exact MVA • To solve the MVA equations, we determine the mean residence time at each service center: • $R_p$ – processor/L1 residence time • $R_{L2}$ – L2 residence time • $R_M$ – memory residence time • The case with one core is trivial; use it as the base case to solve for additional cores • $R_{n,p} = D_p \,(1 + Q_{n-1,p})$, where $D_p$ is the service demand at the processor and $Q_{n-1,p}$ is the mean queue length there with $n-1$ cores
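The recurrence on this slide is the textbook exact MVA step for a closed network, and it can be sketched in a few lines. The sketch assumes all three centers (processor/L1, L2, memory) are queueing centers with no think time, which is one reading of the slide rather than a confirmed detail of the authors' model; the demand values are invented for illustration.

```python
def exact_mva(demands, n_cores):
    """Exact MVA for a closed network of queueing centers.

    demands: service demand per memory request at each center,
             e.g. [D_p, D_L2, D_M] (hypothetical values below).
    Returns the system throughput for 1, 2, ..., n_cores customers.
    """
    queue = [0.0] * len(demands)  # Q_{0,k} = 0: an empty system has no queues
    throughputs = []
    for n in range(1, n_cores + 1):
        # R_{n,k} = D_k * (1 + Q_{n-1,k})
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # Little's law over the whole network: X_n = n / sum_k R_{n,k}
        x = n / sum(residence)
        # Little's law per center: Q_{n,k} = X_n * R_{n,k}
        queue = [x * r for r in residence]
        throughputs.append(x)
    return throughputs


if __name__ == "__main__":
    # Hypothetical demands in arbitrary time units.
    for n, x in enumerate(exact_mva([2.0, 1.0, 0.5], 8), start=1):
        print(n, round(x, 4))
```

With one core the throughput is simply the reciprocal of the summed demands; as cores are added it saturates at the reciprocal of the largest demand, the bottleneck center.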
Exact MVA results • Throughput was calculated using data from simulation runs • Miss rates, number of memory requests • Results are erratic • Not consistent with simulation results • Source of the problem is most likely processor service time!
Approximate MVA Design • An iterative method can be used to converge on a service time • Uses total R as an input parameter • Iterative method works well with approximate MVA • Goal is to match total average residence time of a memory request
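One way to picture this fitting step is sketched below. It is not the authors' implementation: the Schweitzer approximation is a common AMVA variant chosen here as a stand-in, the bisection search for the processor demand is an assumption, and every name and number is hypothetical.

```python
def schweitzer_amva(demands, n_cores, tol=1e-10, max_iter=10000):
    """Schweitzer approximate MVA; returns (throughput, residence_times)."""
    k = len(demands)
    queue = [n_cores / k] * k  # initial guess: customers spread evenly
    for _ in range(max_iter):
        # Approximation: Q_k(N-1) ~= (N-1)/N * Q_k(N)
        residence = [
            d * (1.0 + (n_cores - 1) / n_cores * q)
            for d, q in zip(demands, queue)
        ]
        throughput = n_cores / sum(residence)
        new_queue = [throughput * r for r in residence]
        if max(abs(a - b) for a, b in zip(new_queue, queue)) < tol:
            return throughput, residence
        queue = new_queue
    raise RuntimeError("AMVA fixed point did not converge")


def fit_processor_demand(target_total_r, other_demands, n_cores, steps=60):
    """Bisect on the processor demand until total residence time matches.

    Returns None when even a zero processor demand overshoots the target,
    i.e. the measured total R cannot be reached with this model.
    """
    def total_r(d_proc):
        _, residence = schweitzer_amva([d_proc] + list(other_demands), n_cores)
        return sum(residence)

    lo, hi = 0.0, target_total_r  # total R is at least d_proc, so hi suffices
    if total_r(lo) > target_total_r:
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if total_r(mid) < target_total_r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical L2 and memory demands, and a made-up measured total R.
    print(fit_processor_demand(16.0, [1.0, 0.5], n_cores=8))
```

In this sketch a `None` result means the target total residence time lies below what the L2 and memory centers alone produce, so no processor service time can match it.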
Approximate MVA Results • Convergence using the AMVA equations does not always occur • Total measured residence time cannot be reached with this model and parameter set. • Variation of input values without convergence implies flaws in the model structure • There is a complex relationship between the memory system and the rate at which a core issues requests that must be modeled
Complementary Results • Initial goal: use PB results to identify the parameters the MVA model should focus on • Results from both approaches could cross-verify correctness
Conclusions • Simics has a STEEP learning curve • <5 weeks is not enough time for valid/any results • Refinement of a PB Design leads to long lead times on valid results • CMPs complicate the relationship between cores and memory subsystem • Design methodologies that focus simulation runs are necessary • More results and conclusions to follow
Questions • Questions?