Diamonds are a Memory Controller’s Best Friend*
Dennis Abts (Google), Natalie Enright Jerger (University of Toronto), John Kim (KAIST), Dan Gibson (Univ of Wisconsin), Mikko Lipasti (Univ of Wisconsin)
*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.
Executive Summary
• On what tiles should memory controllers reside?
• Three-tiered simulation approach
  • Heuristic-guided search
  • Detailed network simulation
  • Full-system simulation
• Diamond MC placement works well for on-chip meshes and tori
  • Diamonds minimize maximum channel load
  • Diamonds deliver lower and more predictable runtimes
Background
• Diverse on-chip communication
  • Cache-to-cache
  • LD/ST to memory
  • Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
• Pins available for memory not rising as fast: memory bandwidth becomes more precious
• Reality: many cores, few memory controllers
• Tiled architectures gaining popularity
  • Commonly employ on-chip meshes or tori
The Problem
• What Memory Controller placement is best overall?
  • Flip-chip packaging allows flexible escape routes
  • n tiles and m ports: don’t worry, there are only C(n, m) ("n choose m") configurations!
• What are the characteristics of the best configuration?
  • Performance: low runtime for a set of objective workloads
  • Throughput: low latency as a function of offered load
  • Fairness: similar (low) average memory latency across all nodes
  • Predictability: low latency and runtime variance
• Slight simplification: assume n = k^2 and m = 2k
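To get a feel for the size of the search space, here is a minimal sketch that assumes the configuration count is the binomial coefficient C(n, m) with n = k^2 and m = 2k. That assumption reproduces the 3,268,760 figure quoted for k=5 in the exhaustive-search slide. The function name `placements` is made up for illustration.

```python
from math import comb

def placements(k):
    """Ways to pick m = 2k MC tiles out of n = k*k tiles (assumed C(n, m))."""
    return comb(k * k, 2 * k)

# Print the search-space size for a few mesh radices.
for k in range(4, 9):
    print(f"k={k}: {placements(k):,} configurations")
```

The count grows quickly with k, which is why exhaustive search only stays practical for small meshes.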
Baseline Placement: row0_7
[Figure: X-dimension traffic encounters congestion on rows with memory controllers]
• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
  • Tilera’s Tile64
    • 64 cores, 4 MCs (4 ports each, top/bottom of chip)
  • Intel TeraFLOPs
    • 80 cores, 2 MCs (8 ports each, top/bottom of chip)
Three-Tiered Approach
[Figure: the three tiers (Link Contention Simulation, Detailed Network Simulation, Full System), ranging from more runs with shorter runtimes to more detail]
Tier 0.5: Exhaustive Search
• It turns out that searching all C(k^2, 2k) configurations is tractable for k<7 (at least on the link contention simulator: only 3,268,760 possibilities for k=5)
• Patterns emerge!
• Another contender
Tier 1: Heuristic-Guided Search
• k>6: intractable to search all configurations
• Use search heuristics and random search
• Genetic Algorithm:
  • Represent designs as a population of strings (bit vectors)
  • Generate new designs by combining members of the population via genetic crossover (bit selection)
  • Occasionally, mutate new population members (swap adjacent bits)
  • Reduce population size by removing least-fit members: survival of the fittest
Genetic MC Placement
• Parents: 0x00AA550000AA5500 and 0x0000FF0000FF0000
• Child after crossover: 0x00AAF00000F25100
• Child after mutation: 0x00AAF00000F25080
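The genetic search can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The cost function is a hypothetical proxy (MC crowding in the busiest row plus the busiest column) standing in for the link contention simulator. The crossover keeps the port count fixed by drawing from bits set in exactly one parent. The population size, mutation rate, and third seed string (a row0_7-style vector) are arbitrary choices.

```python
import random

K = 8        # assumed 8x8 mesh, one bit per tile
M = 2 * K    # number of MC ports, kept constant by every operator

def cost(design):
    """Hypothetical proxy cost (lower is fitter): busiest row + busiest column."""
    rows, cols = [0] * K, [0] * K
    for bit in range(K * K):
        if design >> bit & 1:
            rows[bit // K] += 1
            cols[bit % K] += 1
    return max(rows) + max(cols)

def crossover(a, b, rng):
    """Bit selection: keep bits common to both parents, then pick the
    remaining ports at random from bits set in exactly one parent."""
    child = a & b
    candidates = [i for i in range(K * K) if (a ^ b) >> i & 1]
    for i in rng.sample(candidates, M - bin(child).count("1")):
        child |= 1 << i
    return child

def mutate(design, rng):
    """Swap one random pair of adjacent bits in the vector."""
    i = rng.randrange(K * K - 1)
    if (design >> i & 1) != (design >> (i + 1) & 1):
        design ^= 0b11 << i   # flipping both differing bits swaps them
    return design

def evolve(population, generations=200, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        children = []
        for _ in range(size):
            a, b = rng.sample(population, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        # Survival of the fittest: keep only the `size` lowest-cost designs.
        population = sorted(population + children, key=cost)[:size]
    return population[0]

seeds = [0x00AA550000AA5500, 0x0000FF0000FF0000, 0xFF000000000000FF]
best = evolve(seeds * 4)
print(f"{best:#018x}", cost(best))
```

Because survivors are chosen from parents and children together, the best cost never gets worse from one generation to the next. A real run would replace `cost` with the maximum channel load reported by the contention model.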
Link Contention Results (k=8)
• GA selected Diamond as the most fit solution for 8x8
  • Minimizes the number of MCs in any single row/column
  • Spreads DOR load
• Sanity check: GA also prefers Diamond for 4x4, 5x5, and 6x6
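A link contention estimate of this kind can be sketched in a few lines. The sketch below is not the authors' simulator. It assumes an 8x8 mesh, X-then-Y dimension-order routing (DOR), one request path and one reply path between every tile and every MC, and a diamond-like layout built here with two MCs per row and per column. The exact diamond coordinates are an assumption.

```python
from collections import Counter

K = 8  # assumed 8x8 mesh

def xy_route(src, dst):
    """Yield the directed links of an X-then-Y dimension-order route."""
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        yield (x, y), (x, ny)
        y = ny

def max_channel_load(mcs):
    """Largest number of request and reply paths sharing one directed link."""
    load = Counter()
    for tile in [(x, y) for x in range(K) for y in range(K)]:
        for mc in mcs:
            load.update(xy_route(tile, mc))   # request: tile -> MC
            load.update(xy_route(mc, tile))   # reply:   MC -> tile
    return max(load.values())

# Baseline: all MC ports on the top and bottom rows.
row0_7 = [(x, y) for y in (0, K - 1) for x in range(K)]

# Assumed diamond-like layout: two MCs in every row and every column.
diamond = [(x, y)
           for y in range(K)
           for d in [min(y, K - 1 - y)]
           for x in (K // 2 - 1 - d, K // 2 + d)]

print("row0_7 :", max_channel_load(row0_7))
print("diamond:", max_channel_load(diamond))
```

Under these assumptions the diamond-like layout ends up with a lower maximum channel load than row0_7, because reply traffic no longer funnels through the two rows that hold every MC.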
Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events (buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform random traffic
• Measure latency vs. offered load
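As an illustration of the traffic model only (not of the network simulator itself), the sketch below generates Bernoulli injections. Each core independently injects a packet with probability `rate` in every cycle and picks its destination uniformly at random. Single-flit packets, MC node ids as destinations, and the function name are all assumptions made for this example.

```python
import random

def bernoulli_injections(sources, destinations, rate, cycles, seed=0):
    """Return (cycle, src, dst) tuples; every source injects with
    probability `rate` per cycle, to a uniformly random destination."""
    rng = random.Random(seed)
    packets = []
    for cycle in range(cycles):
        for src in sources:
            if rng.random() < rate:
                packets.append((cycle, src, rng.choice(destinations)))
    return packets

cores = list(range(64))            # hypothetical node ids for an 8x8 mesh
mc_nodes = list(range(16))         # hypothetical ids of the 16 MC ports
packets = bernoulli_injections(cores, mc_nodes, rate=0.3, cycles=2000)

# Measured offered load in packets per core per cycle.
offered = len(packets) / (len(cores) * 2000)
print(f"offered load ~ {offered:.3f}")
```

Sweeping `rate` and feeding the resulting packets to a network model gives the points of a latency vs. offered load curve.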
Open-Loop Results
[Figure: latency (cycles) vs. offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements]
Closed-Loop Evaluation
• Each processor executes N memory operations
• Up to r operations outstanding at a time
  • Models MSHRs
• Uniform random requests, and real request streams with ‘hot spot’ behavior
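The closed-loop model for a single processor can be sketched as below. This is a simplification that assumes zero issue cost and takes the memory latency of each operation from a caller-supplied function. `latency_of` is a hypothetical stand-in for the network and memory round trip that a real simulation would compute.

```python
import heapq

def completion_time(n_ops, max_outstanding, latency_of):
    """Cycle at which the last of n_ops memory operations returns, with at
    most max_outstanding (r) operations in flight at any time."""
    in_flight = []          # min-heap of completion times
    now = issued = 0
    while issued < n_ops or in_flight:
        # Issue new operations while an MSHR-like slot is free.
        while issued < n_ops and len(in_flight) < max_outstanding:
            heapq.heappush(in_flight, now + latency_of(issued))
            issued += 1
        # Advance time to the next returning operation, freeing its slot.
        now = heapq.heappop(in_flight)
    return now

# 100 operations, 4 outstanding, constant 20-cycle latency.
print(completion_time(100, 4, lambda op: 20))   # prints 500
```

With a constant latency the operations complete in batches of r, so the runtime is ceil(N / r) times the latency. Longer or more variable latencies at some nodes stretch that node's completion time, which is what a per-processor completion-time histogram captures.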
Closed-Loop Results
[Figure: histogram of number of processors vs. completion time for the Diamond and row0_7 placements]
Full System Results
[Figure: average network latency (cycles) for a request to reach the memory controller, and its standard deviation, for the JBB, WEB, TPC-H, TPC-W, and TPC-W+H workloads]
• Diamond placement yields lower latency and lower latency variance.
Conclusion
• MC placement matters!
• Diamond reduces contention, improves latency, and reduces latency/runtime variance
• X does fairly well