Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
Dhiraj D. Kalamkar (Intel), Mainak Chaudhuri (IIT Kanpur), Mark Heinrich (University of Central Florida)
Talk in One Slide
• Address re-mapping improves the performance of important kernels
  • Vertical reduction and transpose in this talk
• But it requires custom hardware support in the memory controller
  • Address translation, cache line assembly
• We move this hardware support to software running on a directory protocol thread
  • Can be a thread context in an SMT core or a core in a CMP
• Enjoys 1.45 and 1.29 speedup for reduction and transpose on a 16-node DSM multiprocessor
Sketch
• Background
  • Active Memory Techniques and the AMDU
  • Flexible Directory Controller Architecture
• Deconstructing the AMDU
  • Parallel Reduction
• Simulation Environment
• Simulation Results
• Summary
Background: AM Techniques
• Focus on two kernels in this work
  • Parallel vertical reduction of an array of vectors and matrix transpose
• Consider vertically reducing each column of an NxN matrix to produce a 1xN vector
• For ease of page distribution, a block-row partitioning among processors is carried out
• Each processor reduces its portion into a private 1xN vector
• A merge phase accumulates the private contributions
Background: Parallel Reduction
[Figure: an NxN matrix block-row partitioned across P0-P3; each processor reduces its rows into a private vector, and an all-to-all merge phase combines the private vectors into the final result]
Background: AM Parallel Reduction

for j=0 to N-1
  p_x[pid][j] = e;
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j=pid*(N/P) to (pid+1)*(N/P)-1    /* merge phase: do not want this */
  for i=0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x
Background: AM Parallel Reduction
[Figure: P0-P3 write their contributions to a special shadow space (not backed by memory); on cache eviction the memory controller (MC) merges the evicted data into the physical result vector]
Background: AM Parallel Reduction

for j=0 to N-1
  p_x[pid][j] = e;
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    p_x[pid][j] = p_x[pid][j] + A[i][j];
BARRIER
for j=pid*(N/P) to (pid+1)*(N/P)-1
  for i=0 to P-1
    x[j] = x[j] + p_x[i][j];
BARRIER
Subsequent uses of x

/* AM optimized */
x' = AMInstall (x, N, sizeof (long long));
for j=0 to N-1
  for i=pid*(N/P) to (pid+1)*(N/P)-1
    x'[pid][j] = x'[pid][j] + A[i][j];
BARRIER
Subsequent uses of x
Background: Memory Control
• The memory controller does the following (see the sketch below)
  • Identify requests to the shadow space (easy)
  • Send an identity cache block to the local processor and handle coherence in the background
  • Identify shadow space writebacks and accumulate the data in the evicted block with the in-memory data (requires a translation)
  • On a normal space request, retrieve the corresponding shadow blocks from the processors (P shadow blocks contribute to one normal block), accumulate them with the in-memory data, and send the final result
• Merge removed from the critical path
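To make the three cases concrete, here is a minimal C sketch of the controller-side handling for the reduction optimization. It is only an illustration: the request type, buffer layout, and the helpers (send_block, memory_line, shadow_to_phys) are assumed names, and the real logic runs as software protocol handlers on the coherence engine rather than as this function.

#include <stdint.h>

#define LINE_BYTES 128
#define ELEMS (LINE_BYTES / sizeof(int64_t))

typedef struct { int type; uint64_t addr; int64_t data[ELEMS]; } req_t;
enum { READ_SHADOW, WRITEBACK_SHADOW, READ_NORMAL };

/* assumed helpers, stubbed for the sketch */
static void send_block(const int64_t *blk) { (void)blk; }               /* reply to the requester */
static int64_t *memory_line(uint64_t pa) { static int64_t l[ELEMS]; (void)pa; return l; }
static uint64_t shadow_to_phys(uint64_t sa) { return sa; }               /* see the translation slide */

static void handle_request(req_t *r)
{
    if (r->type == READ_SHADOW) {
        int64_t identity[ELEMS] = {0};   /* identity element e (0 for addition) */
        send_block(identity);            /* coherence is handled in the background */
    } else if (r->type == WRITEBACK_SHADOW) {
        int64_t *mem = memory_line(shadow_to_phys(r->addr));
        for (unsigned i = 0; i < ELEMS; i++)
            mem[i] += r->data[i];        /* accumulate evicted shadow data with in-memory data */
    } else { /* READ_NORMAL on a re-mapped line */
        /* the real protocol first recalls and merges the up-to-P dirty shadow
           blocks that map onto this line, then replies with the final result */
        send_block(memory_line(r->addr));
    }
}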
Background: Translation
• Suppose the memory controller receives a shadow writeback to address A
• If the starting shadow address of the result vector is S, the offset is A - S
  • S is a fixed number decided by the hardware and OS designers; also, the shadow space is contiguous
• Add A - S to the starting virtual address of the result vector (recall: the starting virtual address is communicated to the MC via AMInstall)
• Look up the memory-resident TLB with this address to get the physical address of the data to be written back (see the sketch below)
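A C sketch of this translation path follows. SHADOW_BASE and the trivial soft_tlb_lookup stub are assumptions for illustration; in the real system S is fixed by the hardware/OS designers and the virtual base of the result vector comes from AMInstall.

#include <stdint.h>

#define SHADOW_BASE 0x4000000000ULL   /* S: assumed start of the contiguous shadow space */

/* assumed: consults the memory-resident TLB (see the soft-TLB sketch later) */
static uint64_t soft_tlb_lookup(uint64_t va) { return va; /* identity placeholder */ }

static uint64_t translate_shadow_writeback(uint64_t A, uint64_t result_vec_va)
{
    uint64_t offset = A - SHADOW_BASE;         /* offset of the writeback within the shadow space */
    uint64_t va     = result_vec_va + offset;  /* virtual address of the result data */
    return soft_tlb_lookup(va);                /* physical address the merge will target */
}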
Background: Memory Control
[Block diagram: a shadow writeback arrives through the router and network interface into the data buffer pool; the coherence engine consults the TLB to fetch the physical block from SDRAM, and the merger accumulates the writeback data into it]
Background: AMDU Pipeline
[Figure: the coherence engine supplies a base address; the pipeline performs virtual address calculation, AMTLB prefetch and AMTLB lookup to obtain the physical address, and directory address calculation, feeding SDRAM/data buffers and the application data message buffer]
Background: Flexibility
• Flexibility was a primary goal of AM
  • Do not want to add new hardware for every new AM optimization
• Two key components to achieve this goal
  • A general-enough AMDU
  • Integrate the control code of the AMDU into the software coherence protocol running on the coherence engine
  • The coherence engine itself is a simple processor core in a CMP or a thread context in an SMT
• This work eliminates the AMDU and achieves the maximum possible flexibility
Background: Flexible Coherence
[Figure: PCPL1 node organization: an OOO core running the application thread (AT) and an in-order core running the protocol thread (PT), each with private IL1/DL1; L2, memory controller, AMDU, and router on die, attached to SDRAM]
Background: Flexible Coherence
[Figure: PCSL2 node organization: the in-order protocol core has only an IL1 and keeps its data in the shared L2; the rest of the node is as in PCPL1]
Background: Flexible Coherence
[Figure: PCSL2PL1 node organization: the in-order protocol core has private IL1/DL1 and also shares the L2 with the application core]
Background: Flexible Coherence
[Figure: SMTp node organization: a single OOO SMT core runs the application thread and the protocol thread in separate hardware contexts, sharing IL1/DL1, L2, memory controller, AMDU, and router]
Contributions
• Two major contributions
  • First implementation of AM techniques without any custom hardware in the MC
    • Brings AM closer to adoption in commodity systems
  • Evaluation of the new flexible AM protocols on four different directory controller architectures
    • Innovative use of contemporary dual-core and SMT nodes
Sketch
• Background
  • Active Memory Techniques and the AMDU
  • Flexible Directory Controller Architecture
• Deconstructing the AMDU
  • Parallel Reduction
• Simulation Environment
• Simulation Results
• Summary
Deconstructing the AMDU
• Involves efficiently emulating the AMDU pipeline in the protocol code
• Virtual address calculation is easy: one shift and one 32-bit addition
• Directory address calculation is easy: one shift by a constant amount and one 40-bit addition (see the sketch below)
• Challenging components
  • TLB
  • Merger
  • Dynamic cache line gather/scatter (needed for transpose)
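As a rough C illustration of the two "easy" computations: the line size, directory entry size, directory base, and the role of the shift in the virtual address calculation are all assumptions here; the actual constants depend on the machine and on how the shadow and normal spaces are scaled.

#include <stdint.h>

#define LINE_SHIFT      7                 /* assumed 128-byte cache lines */
#define DIR_ENTRY_SHIFT 3                 /* assumed 8-byte directory entries */
#define DIR_BASE        0xF000000000ULL   /* assumed base of the directory in memory */

/* virtual address: one shift (assumed here to scale the shadow offset) and one addition */
static uint64_t calc_virtual(uint64_t shadow_offset, unsigned scale_shift, uint64_t virt_base)
{
    return virt_base + (shadow_offset >> scale_shift);
}

/* directory address: one shift by a constant amount and one (40-bit) addition */
static uint64_t calc_dir_address(uint64_t phys_addr)
{
    return DIR_BASE + ((phys_addr >> LINE_SHIFT) << DIR_ENTRY_SHIFT);
}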
Deconstructing the AMDU: TLB
• Two design options
  • Emulate a direct-mapped software TLB in the protocol data area
    • Each entry holds tag, translation, permission bits, valid bit
    • Hit/miss detection in software
    • On a miss, invoke the page walker or access the memory-resident page table
    • Advantage: can be larger than a hardware TLB
  • Share the application TLB: easy in SMTp, but
    • Requires an extra port or interferes with application threads
    • Other three architectures: floor-planning issues
    • Not explored in this work
Deconstructing the AMDU: TLB
• Emulating a TLB in protocol software (see the sketch below)
  • Handling a TLB miss requires the page table base address
  • Do not want to trap to the kernel
  • Load the page table base address into an architectural register of the protocol thread at application launch (this register cannot be used by the protocol compiler)
• TLB shootdown now needs to worry about keeping the soft TLB coherent
  • Must invalidate the TLB in the protocol data area
  • The starting address and size of the TLB area should be made known to the TLB shootdown kernel
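The following C sketch shows one way the direct-mapped software TLB described above could look. The entry layout, table size, page size, and walk_page_table stub are illustrative assumptions; the real refill uses the page table base address preloaded into the protocol thread's reserved register, with no kernel trap.

#include <stdint.h>

#define SOFT_TLB_ENTRIES 1024     /* can be larger than a hardware TLB */
#define PAGE_SHIFT       12       /* assumed 4 KB pages */

typedef struct {
    uint64_t tag;     /* virtual page number */
    uint64_t pfn;     /* translation: physical frame number */
    uint8_t  perm;    /* permission bits */
    uint8_t  valid;   /* valid bit */
} soft_tlb_entry_t;

static soft_tlb_entry_t soft_tlb[SOFT_TLB_ENTRIES];   /* lives in the protocol data area */

/* assumed page walker stub over the memory-resident page table */
static uint64_t walk_page_table(uint64_t page_table_base, uint64_t vpn)
{
    (void)page_table_base;
    return vpn;   /* identity placeholder */
}

static uint64_t soft_tlb_translate(uint64_t va, uint64_t page_table_base)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    soft_tlb_entry_t *e = &soft_tlb[vpn % SOFT_TLB_ENTRIES];   /* direct-mapped index */

    if (!e->valid || e->tag != vpn) {    /* hit/miss detection in software */
        e->tag   = vpn;                  /* miss: refill without trapping to the kernel */
        e->pfn   = walk_page_table(page_table_base, vpn);
        e->perm  = 0x3;                  /* placeholder permission bits */
        e->valid = 1;
    }
    return (e->pfn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
}

A TLB shootdown would additionally walk and invalidate soft_tlb itself, which is why its starting address and size must be made known to the shootdown handler.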
Deconstructing the AMDU: Merger
[Block diagram: the memory-control figure revisited, now labeling the shadow-writeback message buffer (MEB) and the memory buffer (MYB) holding the physical block within the data buffer pool; coherence engine, TLB, merger, network interface, router, and SDRAM as before]
Deconstructing the AMDU: Merger
• Naïve approach (see the sketch below)
  • The writeback data is in a message buffer (MEB) and the physical block is loaded into a memory buffer (MYB)
  • The protocol thread can access 64 bits of data at 8-byte aligned offsets within a data buffer through uncached loads/stores (the data buffer pool is memory-mapped)
  • Load 64-bit data from MYB and MEB into two general-purpose registers, merge them, store the result back to the same offset in MYB
  • At the end of the loop, write MYB back to memory
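A C sketch of this naive loop for one 128-byte line of 64-bit addends is shown below; the volatile pointers stand in for uncached accesses to the memory-mapped buffers, and writeback_myb_to_memory is an assumed helper.

#include <stdint.h>

#define LINE_BYTES 128
#define WORDS (LINE_BYTES / 8)

/* assumed helper: pushes the merged memory buffer back to SDRAM */
static void writeback_myb_to_memory(volatile uint64_t *myb) { (void)myb; }

static void naive_merge(volatile uint64_t *meb,   /* message buffer: shadow writeback data   */
                        volatile uint64_t *myb)   /* memory buffer: physical block from SDRAM */
{
    for (unsigned i = 0; i < WORDS; i++) {
        uint64_t wb  = meb[i];    /* uncached load from MEB */
        uint64_t mem = myb[i];    /* uncached load from MYB */
        myb[i] = mem + wb;        /* merge; uncached store to the same offset in MYB */
    }
    writeback_myb_to_memory(myb); /* at the end of the loop, write MYB back to memory */
}

For a 128 B buffer this is exactly the 32 uncached loads and 16 uncached stores counted on the following slides.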
Deconstructing the AMDU: Merger
[Figure: naive merge dataflow: uncached loads bring MEB and MYB words into the register file (RF), the additions happen in registers, and uncached stores write the results back to MYB, which is finally written to SDRAM]
Deconstructing the AMDU: Merger
• Naïve approach
  • For 128 B data buffers: 32 uncached loads, 16 uncached stores (as opposed to 16 cycles in the AMDU pipe)
  • Worse: uncached operations are often implemented as non-speculative in processor pipes
• Improvement: caching the buffers
  • The data buffers are already memory-mapped
  • Treat them as standard cached memory
  • Now we can use cached loads and stores, which can issue speculatively and can be pipelined
Deconstructing the AMDU: Merger
• Caching the buffers (see the sketch below)
  • The cache block(s) holding MYB must be flushed to memory at the end of all the merges
    • Use the writeback-invalidate instruction available on all microprocessors
  • The cache block(s) holding MEB must be invalidated at the end of all the merges
    • Otherwise, the next time the same buffer is used, there is a danger that the protocol thread may see stale data in the cache
    • Use the invalidate instruction available on all microprocessors
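A C sketch of the cached variant follows. The two cache-control wrappers stand in for the machine's writeback-invalidate and invalidate instructions (they would be inline assembly on the real protocol core) and are assumptions of this sketch.

#include <stdint.h>

#define LINE_BYTES 128
#define WORDS (LINE_BYTES / 8)

/* assumed wrappers around the cache-control instructions */
static void cache_writeback_invalidate(const volatile void *line) { (void)line; }
static void cache_invalidate(const volatile void *line)           { (void)line; }

static void cached_merge(uint64_t *meb, uint64_t *myb)
{
    for (unsigned i = 0; i < WORDS; i++)
        myb[i] += meb[i];             /* cached loads/stores: speculative and pipelined */

    cache_writeback_invalidate(myb);  /* flush the merged MYB block(s) to memory */
    cache_invalidate(meb);            /* drop MEB so a reused buffer never shows stale data */
}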
Deconstructing the AMDU: Merger
[Figure: cached merge dataflow: MEB and MYB miss into the data cache (D$) and are filled, cached loads feed the adds in the register file, cached stores update MYB, then MEB is invalidated and MYB is flushed back to SDRAM]
Deconstructing the AMDU: Merger
• Why flush the cache block(s) holding MYB at the end of the merge?
  • Recall that there are P shadow blocks corresponding to one physical block
  • Potentially there could be P writeback operations accessing the same physical block
• Caching each physical block across merges (not just during a merge) improves reuse
  • Cannot cache at the same physical address though: coherence problem in the shared cache
  • Use a different address space: 0x2345680 gets cached at 0xc2345680 (bypasses MYB)
Deconstructing the AMDU: Merger
• Caching across merges (see the sketch below)
  • Flush the blocks at the end of all the merges (potentially P in number)
    • Can be decided by the protocol thread from the directory state (shadow owner vector)
  • Problem: the address of the flushed block is slightly different from the actual physical address (in the higher bits)
    • The memory controller anyway ignores address bits higher than the installed DRAM capacity
  • Must still flush MEB after every merge as usual
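The sketch below extends the cached merge to keep the physical block cached across the up-to-P merges through an aliased address, mirroring the 0x2345680 to 0xc2345680 example. The alias constant, the cache-control wrappers, and the point at which finish_merges is called (decided from the shadow owner vector in the real protocol) are assumptions.

#include <stdint.h>

#define LINE_BYTES  128
#define WORDS       (LINE_BYTES / 8)
#define MERGE_ALIAS 0xC0000000ULL     /* alias bits; lie above the installed DRAM capacity */

/* assumed wrappers around the cache-control instructions */
static void cache_writeback_invalidate(const volatile void *line) { (void)line; }
static void cache_invalidate(const volatile void *line)           { (void)line; }

/* one shadow writeback: merge MEB into the aliased, cached copy of the physical block */
static void merge_one_writeback(uint64_t phys_addr, const uint64_t *meb)
{
    uint64_t *aliased = (uint64_t *)(uintptr_t)(phys_addr | MERGE_ALIAS);   /* bypasses MYB */
    for (unsigned i = 0; i < WORDS; i++)
        aliased[i] += meb[i];        /* the block stays cached between writebacks */
    cache_invalidate(meb);           /* MEB is still invalidated after every merge */
}

/* after the last expected writeback, flush the aliased block; the memory
   controller strips the alias bits because they exceed installed DRAM */
static void finish_merges(uint64_t phys_addr)
{
    cache_writeback_invalidate((const void *)(uintptr_t)(phys_addr | MERGE_ALIAS));
}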
Deconstructing the AMDU: Merger
[Figure: dataflow when caching across merges: per writeback, MEB misses and fills, cached loads and adds update the aliased physical block in the data cache, and MEB is invalidated; only after the last merge is the block flushed to SDRAM]
Deconstructing the AMDU: Merger
• A three-dimensional optimization space
  • Caching (C) or not caching (U) MEB during a merge, MYB during a merge, and the merge results across merges
  • UUU, UCU, ..., CCC
[Figure: the eight design points arranged as a cube; some combinations are marked not viable, caching MEB is noted to hurt, and UCC is marked as the best performing point]
Sketch
• Background
  • Active Memory Techniques and the AMDU
  • Flexible Directory Controller Architecture
• Deconstructing the AMDU
  • Parallel Reduction
• Simulation Environment
• Simulation Results
• Summary
Simulation Environment
• Each node is dual-core, with one OOO SMT core and one in-order core
  • On-die memory controller and router
  • All components are clocked at 2.4 GHz
• The SMT core has 32 KB IL1 and DL1 (dual-ported), a 2 MB L2 (3-cycle tag hit), and an 18-stage pipe
• DRAM bandwidth 6.4 GB/s per channel, 40 ns page hit, 80 ns page miss
• Hop time 10 ns, link bandwidth 3.2 GB/s, 2-way bristled hypercube
• 16 nodes, each node capable of running up to two application threads
Simulation Environment
[Figure: PCPL1 node as simulated: OOO application core and in-order protocol core, each with 32 KB IL1/DL1; two configurations of the protocol core's cache are shown (PCPL1_128KB and PCPL1_2MB); L2, memory controller, AMDU, and router over SDRAM]
Simulation Environment
[Figure: PCSL2 node as simulated: the protocol core has a 32 KB IL1 and keeps its data in the shared L2]
Simulation Environment
[Figure: PCSL2PL1 node as simulated: the protocol core has private L1 caches (128 KB shown) alongside the application core's 32 KB L1s, and both cores share the L2]
Simulation Environment
[Figure: SMTp node as simulated: a single OOO SMT core runs the application and protocol threads, sharing IL1/DL1, L2, memory controller, AMDU, and router]
Benchmark Applications
• Parallel reduction [all prefetched]
  • Mean Square Average (MSA) [micro]
  • DenseMMM: C = A^T B
  • SparseFlow: flow computation in a sparse multi-source graph
  • Spark98 kernel: SMVP
• Transpose [prefetched, tiled]
  • Transpose [micro]
  • SPLASH-2 FFT: only the forward transform
    • Involves three tiled transpose phases
  • FFTW: forward and inverse
Sketch
• Background
  • Active Memory Techniques and the AMDU
  • Flexible Directory Controller Architecture
• Deconstructing the AMDU
  • Parallel Reduction
• Simulation Environment
• Simulation Results
• Summary
Simulation Results
• Two key questions to answer
  • How much speedup does our design achieve over a baseline that does not use AM protocols (both without the AMDU)?
  • How much performance penalty do we pay due to the elimination of the hardwired AMDU?
Simulation Results: Spark98
[Chart: execution time comparison; annotations note no performance loss from eliminating the AMDU and close to a 20% speedup over the non-AM baseline]
Result Summary: Reduction
• Very encouraging results
  • Architectures without the AMDU come within 3% of architectures with a complex AMDU
• SMTp+UCC and PCSL2PL1+UCC are the most attractive architectures
  • 45% and 49% speedup with 16 application threads compared to the non-AM baseline
Simulation Results: 1D FFT
[Chart: execution time comparison; the software emulation shows a 4.1% gap and a 2.3% gap relative to the AMDU in the annotated cases]
Result Summary: Transpose
• Within 13.2% of the AMDU performance
  • On average an 8.7% gap
• SMTp+SoftTr delivers 29% speedup
• PCSL2PL1+SoftTr delivers 23% speedup
• Flashback: reduction summary
  • Within 3% of the AMDU performance
  • 45% and 49% speedup for SMTp+UCC and PCSL2PL1+UCC
• Architecturally, SMTp is more attractive (the area overhead is small), but PCSL2PL1 may be easier to verify
Prior Research
• The Impulse memory controller introduced the concept of address re-mapping
  • Used in single-threaded systems
  • Software-directed cache flush for coherence
• Active memory leveraged cache coherence to do address re-mapping
  • Allowed seamless extensions to SMPs and DSMs
  • Introduced the AMDU and flexibility in AM
• This work closes the loop by bringing AM closer to commodity
Summary
• Eliminates the custom hardware support in the memory controller traditionally needed for AM
• Parallel reduction performance comes within 3% of the AMDU
• Transpose performance comes within 13.2% of the AMDU (lack of efficient pipelining)
• The protocol thread architecture achieves 45% and 29% speedup for reduction and transpose
• The protocol core architecture with private L1 and shared L2 achieves 49% and 27%
Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
THANK YOU!
Dhiraj D. Kalamkar (Intel), Mainak Chaudhuri (IIT Kanpur), Mark Heinrich (University of Central Florida)
Background: Matrix Transpose

for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A[i][j];
BARRIER
Transpose (A, A');
BARRIER
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A'[i][j];
BARRIER
Transpose (A', A);
BARRIER

/* AM optimized transpose */
A' = AMInstall (A, N, N, sizeof(Complex));
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A[i][j];
BARRIER
for i=pid*(N/P) to (pid+1)*(N/P)-1
  for j=0 to N-1
    sum += A'[i][j];
BARRIER
Background: Memory Control
[Block diagram for transpose: shadow get/put requests arrive through the router and network interface; the AMDU gathers the scattered words from SDRAM into the data buffer pool and returns the assembled block, with the coherence engine orchestrating the transfers]