Explore the advancements in memory controller architecture, coherence processors, and trade-offs in programmable coherence engines for scalable directory-based DSM multiprocessors. Our analytical model helps determine coherence bandwidth needs for parallel applications.
Integrated Memory Controllers with Parallel Coherence Streams
Mainak Chaudhuri (IIT Kanpur) and Mark Heinrich (University of Central Florida)
Talk in One Slide
• Ever-increasing on-die integration
  • Faster memory controllers and coherence processors
  • Leads to new trade-offs in the domain of programmable coherence engines for scalable directory-based DSM multiprocessors
• We show that multiple coherence engines are unnecessary in such environments
• We develop a useful analytical model to quickly decide the coherence bandwidth requirement of parallel applications
Sketch
• Background
• Memory controller architecture
• Analytical model
• Evaluation framework
• Simulation results
  • Validation of model for directory protocols
  • Directory-less broadcast protocols
  • Multiprogramming
• Summary
Background: Integrated MC
• A direct solution to reduce round-trip cache miss latency
• Other advantages related to maintenance and glueless multiprocessing
• Widely accepted in the high-end industry
  • Alpha 21364, IBM POWER5, AMD Opteron, Sun UltraSPARC III and IV, Sun Niagara
• Shared memory multiprocessors employing an iMC are naturally DSMs
  • Bandwidth-thrifty directory coherence is the natural choice
Background: Directory Processing
• Home-based coherence protocols
  • Each cache block has a home node, determined by the upper few bits of the physical address
• Each coherence request (miss or dirty eviction from the last level of cache) is first sent to the home node of the cache block
• At the home node, sharing information for the cache block is maintained in a data structure called the directory (can be in SRAM or DRAM)
• The coherence controller of the home node looks up the directory and takes appropriate actions
• Each node has at least one embedded directory coherence controller
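The home-node mapping above can be sketched in a few lines. The parameters here (16 nodes, 40-bit physical addresses, 128-byte cache blocks) are illustrative assumptions, not the talk's actual configuration:

```python
# Illustrative sketch: deriving the home node of a cache block from the
# upper bits of its physical address. All widths below are assumptions.

NUM_NODE_BITS = 4    # 16 nodes
PHYS_ADDR_BITS = 40  # assumed physical address width
BLOCK_BITS = 7       # assumed 128-byte cache blocks

def home_node(paddr: int) -> int:
    """The upper few bits of the physical address select the home node."""
    return (paddr >> (PHYS_ADDR_BITS - NUM_NODE_BITS)) & ((1 << NUM_NODE_BITS) - 1)

def block_address(paddr: int) -> int:
    """Drop the block offset; coherence state is tracked per cache block."""
    return paddr >> BLOCK_BITS
```

A request for any address in the top sixteenth of the physical address space, for example, would be routed to node 15.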
Background: Directory Processing
• Two different trends in directory coherence controller architecture
• Hardwired controllers
  • Less flexible, tedious to verify, often on the project’s critical path, but high-performance
  • MIT Alewife, KSR1, SGI Origin, Stanford DASH
• Custom programmable controllers
  • Execute protocol software on a protocol processor embedded in the memory controller
  • Flexible in choice of protocol and easier to verify, but lose some performance
  • Compaq Piranha, Opteron-Horus, Stanford FLASH, Sequent STiNG, Sun S3.mp
Background: Flexible Processing
• Past research reports up to 12% performance loss [Stanford FLASH]
  • The main reason why industry is shy of pursuing this option
• Coherence controller occupancy has emerged as the most important parameter
  • Naturally, hardwired controllers get the upper hand
• Past research has established the importance of multiple hardwired controllers in SMP nodes
Background: Flexible Processing
• New technology often changes the trade-offs
• Reconsider programmable directory controllers in the light of increased integration
  • Bring the programmable controller on die
  • Faster clock rates lead to lowered occupancy
• New research questions:
  • Can integrated programmable controllers offer enough coherence bandwidth?
  • Do we need multiple of them?
  • Can integrated controllers cope with the extra pressure of emerging multi-threaded nodes?
Background: Flexible Processing
• Executes coherence protocol handlers in software with hardware support
  • Does not require interrupts
• Two major architectures
  • Integrated custom protocol processor(s)
    • Can use one or more simple cores (this work considers one or two static dual-issue in-order cores, each with one dedicated level of caches)
  • Reserved protocol thread context(s) in a simultaneous multi-threaded (SMT) node
    • SMTp: SMT with one or more protocol contexts
    • Eliminates the protocol processor
Background: Flexible Processing
[Block diagram of the two node organizations. Left: an OOO SMT core (ATs) with its IL1/DL1 alongside an in-order protocol processor (PP) with its own IL1/DL1; both share the L2 cache, memory controller (MC), SDRAM banks, and router. Right (SMTp): a single OOO SMT core runs ATs+PTs over IL1/DL1, L2 cache, MC, SDRAM banks, and router, eliminating the separate PP.]
Aside: A Protocol Handler
• Computes the directory address from the requested address (simple hash)
• Loads the directory entry into a register
• Computes coherence actions based on the directory state and message header
  • Simple integer arithmetic
• Sends out coherence messages as needed
  • Custom instructions or uncached stores write the header and address to the send unit
  • Messages may carry cache line data read from DRAM
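The handler flow above can be sketched as follows. This is a hedged illustration: all names, state encodings, and the directory layout are assumptions, not the actual protocol code of FLASH or SMTp:

```python
# Hedged sketch of a read-request protocol handler: hash to the directory
# entry, inspect the state, and emit coherence messages. Illustrative only.

DIR_BASE = 0x10000000  # assumed base address of the directory region
DIR_ENTRIES = 1 << 20  # assumed number of directory entries
ENTRY_BYTES = 8        # assumed directory entry size

def directory_address(block_addr):
    """Simple hash: index the directory region by block address."""
    return DIR_BASE + (block_addr % DIR_ENTRIES) * ENTRY_BYTES

def handle_read(block_addr, directory, requester, out_messages):
    """Look up the directory entry, decide actions, append outgoing messages."""
    entry = directory.setdefault(block_addr, {"state": "UNOWNED", "sharers": set()})
    if entry["state"] == "DIRTY":
        # The owner holds the only valid copy: forward an intervention to it
        owner = next(iter(entry["sharers"]))
        out_messages.append(("INTERVENTION", owner, block_addr, requester))
    else:
        # Clean at home: reply with the line (read speculatively from DRAM)
        entry["state"] = "SHARED"
        entry["sharers"].add(requester)
        out_messages.append(("DATA_REPLY", requester, block_addr, None))
```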
Scope of this Work
• Distributed shared memory (DSM) NUMA
• Up to 16 nodes, directory-based or directory-less broadcast coherence
• Each node has an SMT processor capable of running four application threads, an integrated memory controller, one or two integrated protocol processors (PPs) or protocol threads (PTs), and an integrated hypercube router
• Six parallel applications and 4-way multiprogrammed workloads
• Applicable to multi-node chip multiprocessors as well
Contributions
• Two primary contributions
  • Evaluates two kinds of programmable coherence engines in the light of on-die integration and multi-threading
  • Develops a simple and generic analytical model relating directory protocol occupancy, DRAM latency, and DRAM channel bandwidth
    • Introduces the concept of occupancy margin
    • Offers valuable insights into hot-spot situations
    • Accurately predicts whether an additional coherence engine is helpful
Highlights
• Key results
  • A single integrated programmable controller offers sufficient coherence bandwidth for directory-based systems (in contrast with off-chip controller studies)
  • The analytical model helps explain why typical hot-spot situations (involving locks and flags) cannot be improved with parallel coherence stream processing
  • Directory-less broadcast systems (e.g., AMD Opteron) enjoy significant benefit from parallel coherence stream processing with multiple programmable “snoop” engines
Sketch
• Background
• Memory controller architecture
• Analytical model
• Evaluation framework
• Simulation results
  • Validation of model for directory protocols
  • Directory-less broadcast protocols
  • Multiprogramming
• Summary
Memory Controller Architecture: PP
[Block diagram: messages enter through the processor interface (PI Inbound), the network interface (NI Inbound, fed by the router VCs), and a software queue (SWQ head). A round-robin dispatcher with a CAM lookup schedules them into the protocol processor work queue (PPWQ) and outstanding message buffer (OMB). The protocol processor, with its own Icache and Dcache, accesses the SDRAM banks; outgoing messages leave through the send unit to PI Out and NI Out.]
Memory Controller Architecture: SMTp
[Same block diagram as the PP organization, except that the protocol processor and its caches are replaced by a protocol thread running on the main core.]
Memory Controller Architecture: SMTp
• The protocol thread participates in ICOUNT fetching when its PC is valid
• Shares pipeline resources with application threads, including the entire cache hierarchy
• Deadlock avoidance with reserved resources
  • Queue buffers, branch checkpoints, integer registers, integer and load/store queue buffers, store buffers, MSHRs
Parallel Coherence Streams
• Replicated resources
  • Multiple protocol processors or protocol threads
  • Multiple OMBs
  • Multi-ported Icache and Dcache for the protocol processors (does not apply to SMTp)
• Control flow
  • Mutual exclusion in directory access: requires a six-ported CAM lookup in the OMB and PPWQ
  • Schedules a message every cycle
  • Protocol processors/threads arbitrate for the PPWQ read port; the smallest id wins (dynamic priority)
Parallel Coherence Streams
• Critical sections
  • Conventional LL/SC and test-and-set locks
  • For higher throughput, test-and-set locks are maintained in on-chip registers
  • The software queue and its related state (e.g., occupancy) are the major shared read/write variables
• Leads to an increased average dynamic instruction count per handler
  • Trades the occupancy of an individual handler for concurrency across handlers
Parallel Coherence Streams
• Boot sequence
  • Only one protocol processor/thread executes the entire boot sequence, initializing the memory controller and peripheral state
  • The other processors/threads only initialize their architectural register state
• Out-of-order issue
  • An address conflict between the six queue heads and the PPWQ/OMB may lead to idle schedule cycles
  • Consider all requests in the five queues, not just the heads (the SWQ is still FIFO)
  • Queues need to be collapsible, with address CAMs
Sketch
• Background
• Memory controller architecture
• Analytical model
• Evaluation framework
• Simulation results
  • Validation of model for directory protocols
  • Directory-less broadcast protocols
  • Multiprogramming
• Summary
Analytical Model
• The goal of the model is to decide if a second protocol engine can improve performance
• The model is applicable to any system exercising directory-based protocols
  • Not limited to systems with integrated controllers
• We analyze the time spent by a batch of requests in the memory system
  • Focus only on the handling of read and read-exclusive requests at the home node (most time consuming)
  • These require DRAM access (initiated speculatively)
Analytical Model
• Three parts in the life of a request after it is dispatched
  • DRAM occupancy or access latency (Om)
  • Protocol handler occupancy or protocol processing latency (Op)
  • DRAM channel occupancy or cache line transfer latency (Oc)
• We look at four scenarios involving two concurrently arriving bank-parallel requests (considering only Om > Op)
  • Single- and dual-channel DRAM controllers with one and two protocol engines
Single-channel DRAM Controller
[Timing diagrams for two concurrently arriving bank-parallel requests R1 and R2 with one protocol engine (1PPU) versus two (2PPU). With one engine, R2’s handler occupancy O2 is serialized behind O1; with two engines the handler occupancies overlap with the DRAM accesses M1, M2 and the line transfers C1, C2. When 2Op > Om + 2Oc, a second engine saves 2Op − (Om + 2Oc).]
Dual-channel DRAM Controller
[The same two-request timing diagrams with two DRAM channels, so the line transfers C1 and C2 proceed in parallel. When 2Op > Om + Oc, a second protocol engine saves 2Op − (Om + Oc).]
General Formulation
• Burst arrival of requests: k at a time
• Single-channel DRAM controller:
  • The total protocol occupancy must get exposed if adding a second coherence engine is to be effective
  • Required condition: kOp > Om + kOc
  • Rearranging: Op > Om/k + Oc
• Dual-channel DRAM controller:
  • Required condition: kOp > Om + kOc/2
  • Rearranging: Op > Om/k + Oc/2
• Occupancy margin: left-hand side minus right-hand side of the rearranged inequality
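The rearranged conditions above fold into a single margin computation. The sketch below generalizes both cases to a w-channel controller (following the kOc/w form the model later uses for hot-spots); the function names are assumptions:

```python
# Occupancy margin per the formulation above: for a burst of k requests on a
# w-channel DRAM controller, a second coherence engine can only help when
# Op > Om/k + Oc/w. All latencies in ns.

def occupancy_margin(Op, Om, Oc, k, channels=1):
    """Occupancy margin in ns; positive means a second engine may help."""
    return Op - (Om / k + Oc / channels)

def second_engine_helps(Op, Om, Oc, k, channels=1):
    return occupancy_margin(Op, Om, Oc, k, channels) > 0
```

With the talk's 400 MHz Ocean measurements (Op = 36.6 ns, Om = 54.4 ns, Oc = 20 ns, k = 7, single channel), the margin comes out to about +8.8 ns, matching the validation table.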
Take-away Points
• For highly bursty requests (high k)
  • Om has a diminishing effect
  • The balance between Op and Oc becomes the most important determinant: a tension between two competing bandwidths
• For small k
  • A large occupancy margin is unlikely, as the contribution from Om would be large
  • Adding a second coherence engine would not be useful: less concurrency
• Extra DRAM bandwidth may convert a negative occupancy margin to a positive one
Hot-spots and Bank Conflicts
• Bank conflicts can delay the DRAM accesses of a burst of requests
• Hot-spots often arise from system-wide access to the same cache block: an obvious case of bank conflict
• Can multiple coherence engines help?
  • Since Om > Op on average, for two conflicting requests the total protocol occupancy (2Op) is hidden under the memory access latency (2Om)
  • Multiple coherence engines will not improve performance
Hot-spots and Bank Conflicts
• What about row buffer hits?
  • The first request in a batch will suffer a row buffer miss
  • The subsequent ones will enjoy hits with high probability
• Row buffer hits lower the average value of Om and may uncover portions of kOp even in the case of k conflicting requests
  • Required condition: kOp > Omiss + (k−1)Ohit + kOc/w for a w-channel DRAM controller
  • Simplifying (for large k, with the average Om dominated by Ohit): Op > Om + Oc/w, which contradicts Om > Op (typical of integrated controllers)
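The refined condition can be evaluated directly. In this sketch the function name and the sample latencies are assumptions chosen only to illustrate the argument:

```python
# Hot-spot margin with row buffer hits: k conflicting requests, the first
# missing the row buffer (Omiss) and the remaining k-1 hitting (Ohit), on a
# w-channel controller. Positive means a second engine could help.

def hot_spot_margin(Op, Omiss, Ohit, Oc, k, w=1):
    return k * Op - (Omiss + (k - 1) * Ohit + k * Oc / w)
```

Even with generous assumed values (Op = 10 ns, Omiss = 50 ns, Ohit = 15 ns, Oc = 20 ns, k = 8, w = 2) the margin stays well below zero, illustrating why hot-spots do not benefit from parallel coherence stream processing as long as Om > Op.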
Sketch
• Background
• Memory controller architecture
• Analytical model
• Evaluation framework
• Simulation results
  • Validation of model for directory protocols
  • Directory-less broadcast protocols
  • Multiprogramming
• Summary
Evaluation Framework
• Evaluates both integrated protocol processors (PPs) and protocol threads in SMTp
• Depending on the integration level, the memory controller and PP can be clocked at different frequencies
  • Explores 400 MHz, 800 MHz, and 1.6 GHz with a 1.6 GHz main SMT processor
  • Protocol threads in SMTp, by design, always run at full frequency (1.6 GHz)
• Each node has an SMT processor and runs up to four ATs (64-threaded applications) and two PTs
Flashback: Flexible Processing
[Repeat of the earlier node-organization block diagram: a PP-based node versus an SMTp node.]
Sketch
• Background
• Memory controller architecture
• Analytical model
• Evaluation framework
• Simulation results
  • Validation of model for directory protocols
  • Directory-less broadcast protocols
  • Multiprogramming
• Summary
400 MHz Protocol Processors: Execution Time
[Chart; an 8% difference is annotated.]
400 MHz Protocol Processors: Dispatcher’s Wait Cycles
[Chart.]
400 MHz Protocol Processors: Occupancy
[Chart.]
Model Validation: 400 MHz PP
• Oc is fixed at 20 ns (128B @ 6.4 GB/s)
• Predicted from 1PP measurements:

              Op (ns)  Om (ns)  kmax  OM (ns)
  FFT          30.3     54.6      5    -0.6
  FFTW         29.4     54.3      7     1.6
  LU           25.3     40.1      6    -1.4
  Ocean        36.6     54.4      7     8.8
  Radix-Sort   33.8     45.8      4     2.3
  Water        28.1     40.0      4    -1.9
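As a cross-check, the OM column above is reproducible from Op, Om, and kmax with the single-channel form of the model's margin, OM = Op − Om/kmax − Oc:

```python
# Cross-check of the 400 MHz PP table: the reported occupancy margin OM
# follows from OM = Op - Om/kmax - Oc with Oc fixed at 20 ns.

OC = 20.0  # ns: 128 bytes at 6.4 GB/s

table = {  # application: (Op ns, Om ns, kmax, reported OM ns)
    "FFT":        (30.3, 54.6, 5, -0.6),
    "FFTW":       (29.4, 54.3, 7,  1.6),
    "LU":         (25.3, 40.1, 6, -1.4),
    "Ocean":      (36.6, 54.4, 7,  8.8),
    "Radix-Sort": (33.8, 45.8, 4,  2.3),
    "Water":      (28.1, 40.0, 4, -1.9),
}

for app, (Op, Om, kmax, reported) in table.items():
    margin = Op - Om / kmax - OC
    assert abs(margin - reported) < 0.06, app  # matches to rounding
```

Only Ocean, and marginally FFTW and Radix-Sort, show a positive margin at 400 MHz.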
1.6 GHz Protocol Processors: Execution Time
[Chart.]
1.6 GHz Protocol Processors: Dispatcher’s Wait Cycles
[Chart.]
1.6 GHz Protocol Processors: Occupancy
[Chart.]
Model Validation: 1.6 GHz PP
• Oc is fixed at 20 ns (128B @ 6.4 GB/s)
• Predicted from 1PP measurements:

              Op (ns)  Om (ns)  kmax  OM (ns)
  FFT           8.4     54.4     12   -16.1
  FFTW          7.5     53.9     16   -15.9
  LU            6.3     40.2     16   -16.2
  Ocean         9.4     54.6     16   -14.0
  Radix-Sort    8.8     45.8      8   -16.9
  Water         7.2     40.0      8   -17.8
SMTp: Execution Time
[Chart.]
SMTp: Dispatcher’s Wait Cycles
[Chart.]
SMTp: Occupancy
[Chart.]
Model Validation: SMTp
• Oc is fixed at 20 ns (128B @ 6.4 GB/s)
• Predicted from 1PP measurements:

              Op (ns)  Om (ns)  kmax  OM (ns)
  FFT          15.3     55.2     10   -10.2
  FFTW         15.6     57.6     14    -8.5
  LU           13.1     42.3     16    -9.5
  Ocean        17.5     55.9     16    -6.0
  Radix-Sort   16.6     51.8     11    -8.1
  Water        14.4     40.1      8   -10.6
Summary: Execution Time
[Chart.]
Take-away Points
• Doubling the controller frequency is always better than adding a second controller
  • Reducing individual handler occupancy is more important than reducing the occupancy of a burst
• Ocean shows the importance of the burst mix
• Increasing frequency has diminishing returns
  • Instead of building complex hardwired protocol engines, dedicate a thread or core in emerging processors