L2 to Off-Chip Memory Interconnects for CMPs
Presented by Allen Lee
CS258 Spring 2008
May 14, 2008
Motivation
• In modern many-core systems, there is significant asymmetry between the number of cores and the number of memory access points
  • Tilera’s multiprocessor has 64 cores and only 4 memory controllers
• PARSEC benchmarks suggest that off-chip memory traffic increases with the number of cores for CMPs
• We explore mechanisms to lower latency and power consumption for the processor-memory interconnect
Tilera Tile64
• Five physical mesh networks
  • UDN, IDN, SDN, TDN, MDN
• TDN and MDN are used for handling memory traffic
• Memory requests transit the TDN
  • Large store requests, small load requests
• Memory responses transit the MDN
  • Large load responses, small store responses
• Includes cache-to-cache transfers and off-chip transfers
Tapered Fat-Tree
• Good for many-to-few connectivity
• Fewer hops → shorter latency
• Fewer routers → less power, less area
• Root nodes connect directly to the memory controllers
• Replace the MDN mesh network with two tapered fat-tree networks
  • One for routing requests up
  • One for routing responses down
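To make the "fewer hops" point concrete, here is a rough hop-count sketch comparing an 8x8 mesh with a tapered fat-tree. The controller attachment points, the tree arity of 4, and the helper names `mesh_hops` and `tree_hops` are all assumptions for illustration; the slides do not specify them.

```python
from itertools import product

MESH_DIM = 8  # 8x8 = 64 tiles, matching the 64-core count in the slides

# ASSUMPTION: hypothetical (column, row) tiles where the 4 memory
# controllers attach to the mesh. The slides do not give these positions.
CONTROLLERS = [(2, 0), (5, 0), (2, 7), (5, 7)]


def mesh_hops(tile):
    """Manhattan distance from a tile to its nearest controller attach point."""
    x, y = tile
    return min(abs(x - cx) + abs(y - cy) for cx, cy in CONTROLLERS)


def tree_hops(num_leaves, arity, num_roots):
    """Levels climbed from a leaf until only `num_roots` routers remain.

    ASSUMPTION: every level merges `arity` children into one router.
    """
    hops = 0
    width = num_leaves
    while width > num_roots:
        width = -(-width // arity)  # ceiling division
        hops += 1
    return hops


tiles = list(product(range(MESH_DIM), repeat=2))
avg_mesh = sum(mesh_hops(t) for t in tiles) / len(tiles)
max_mesh = max(mesh_hops(t) for t in tiles)
tft = tree_hops(num_leaves=64, arity=4, num_roots=4)

print(f"mesh: average {avg_mesh} hops, worst case {max_mesh} hops")
print(f"tapered fat-tree: {tft} hops from any tile to a root")
```

Under these assumed placements, every tile reaches a root in the same number of hops in the tree, while the mesh distance depends on where the tile sits. A different controller placement or arity would change the numbers.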
Memory Model
• Directory-based cache coherence
  • Directory cache at every node
  • Off-chip directory controller
• Tile-to-tile requests and responses transit the TDN
• Off-chip memory requests and responses transit the MDN
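The traffic split in this memory model can be sketched as a simple selection rule. The `Network` enum and `select_network` function are hypothetical names used only to illustrate the rule stated above, not part of any simulator described in the slides.

```python
from enum import Enum


class Network(Enum):
    TDN = "TDN"  # tile-to-tile requests and responses
    MDN = "MDN"  # off-chip memory requests and responses


def select_network(is_off_chip: bool) -> Network:
    """Pick the network for a message according to the memory model above."""
    return Network.MDN if is_off_chip else Network.TDN


# A cache-to-cache transfer stays on the TDN; a memory fill uses the MDN.
print(select_network(is_off_chip=False).value)
print(select_network(is_off_chip=True).value)
```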
Synthetic Benchmarks
• Statistical simulation
• Model benchmarks from the PARSEC suite
• Based on off-chip traffic with 64-byte cache lines on 64 cores
[Figure axes: working set size (small to large); sharing (more to less)]
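One common way to drive a statistical simulation is to inject requests at random with a per-core rate taken from a measured benchmark profile. The sketch below assumes Bernoulli injection and uniform controller selection; the function name `generate_requests` and the example rate are hypothetical and are not taken from the slides or from PARSEC data.

```python
import random


def generate_requests(num_cores, cycles, rate, num_controllers=4, seed=0):
    """Return a list of (cycle, core, controller) off-chip requests.

    ASSUMPTIONS: each core independently issues a request with probability
    `rate` per cycle, and the target controller is chosen uniformly.
    """
    rng = random.Random(seed)  # seeded so the trace is reproducible
    trace = []
    for cycle in range(cycles):
        for core in range(num_cores):
            if rng.random() < rate:
                trace.append((cycle, core, rng.randrange(num_controllers)))
    return trace


# Hypothetical rate: 0.01 off-chip requests per core per cycle.
trace = generate_requests(num_cores=64, cycles=1000, rate=0.01)
print(len(trace), "requests generated")
```

Raising `rate` models a more memory-intensive benchmark; a real study would fit the rate to each benchmark's measured off-chip traffic.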
Breakdown of Average Latency
• Latency of memory-intensive applications is dominated by queuing delay
• Benchmarks with little off-chip traffic save on transit time
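The split between transit time and queuing delay can be illustrated with a textbook M/D/1 queue standing in for a memory controller. This is a swapped-in analytical model, not the simulator behind the slides; the function name `latency_breakdown` and all numbers below are hypothetical.

```python
def latency_breakdown(hops, cycles_per_hop, arrival_rate, service_cycles):
    """Return (transit, queuing) latency in cycles.

    ASSUMPTION: the controller behaves like an M/D/1 queue, so the mean
    wait is rho * S / (2 * (1 - rho)) with utilization rho = lambda * S.
    """
    rho = arrival_rate * service_cycles
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    transit = hops * cycles_per_hop
    queuing = rho * service_cycles / (2 * (1 - rho))
    return transit, queuing


# Memory-intensive case: the controller is about 90% utilized.
heavy = latency_breakdown(hops=2, cycles_per_hop=1, arrival_rate=0.09, service_cycles=10)
# Light case: the controller is about 10% utilized.
light = latency_breakdown(hops=2, cycles_per_hop=1, arrival_rate=0.01, service_cycles=10)
print("heavy load (transit, queuing):", heavy)
print("light load (transit, queuing):", light)
```

With the same transit time in both cases, the queuing term grows sharply as utilization approaches 1, which matches the first bullet; at low utilization transit time is the larger share, so a topology with fewer hops helps most there.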
Power Modeling
• Orion power simulator for on-chip routers from Princeton University
• Models switching power as the sum of
  • Buffer power
  • Crossbar power
  • Arbitration power
• Specify parameters
  • Activity factor, number of input and output ports, virtual channels, size of input buffer, etc.
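The three-term sum above can be sketched with the standard dynamic power relation P = αCV²f applied to each component. This is not Orion's interface; the `RouterConfig` fields and the capacitance values are hypothetical placeholders chosen only to show the structure of the calculation.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    activity_factor: float  # fraction of cycles in which the component switches
    vdd: float              # supply voltage in volts
    freq_hz: float          # clock frequency in hertz
    buffer_cap_f: float     # ASSUMED effective switched capacitance, farads
    crossbar_cap_f: float   # ASSUMED effective switched capacitance, farads
    arbiter_cap_f: float    # ASSUMED effective switched capacitance, farads


def dynamic_power(alpha, cap_f, vdd, freq_hz):
    """Standard switching power: alpha * C * V^2 * f, in watts."""
    return alpha * cap_f * vdd ** 2 * freq_hz


def router_power(cfg):
    """Break router switching power into buffer, crossbar and arbiter terms."""
    parts = {
        "buffer": dynamic_power(cfg.activity_factor, cfg.buffer_cap_f, cfg.vdd, cfg.freq_hz),
        "crossbar": dynamic_power(cfg.activity_factor, cfg.crossbar_cap_f, cfg.vdd, cfg.freq_hz),
        "arbiter": dynamic_power(cfg.activity_factor, cfg.arbiter_cap_f, cfg.vdd, cfg.freq_hz),
    }
    parts["total"] = parts["buffer"] + parts["crossbar"] + parts["arbiter"]
    return parts


# Placeholder capacitances; a real study would take them from the power model.
cfg = RouterConfig(activity_factor=0.5, vdd=1.0, freq_hz=750e6,
                   buffer_cap_f=1e-12, crossbar_cap_f=2e-12, arbiter_cap_f=0.5e-12)
power = router_power(cfg)
print({name: f"{watts * 1e3:.4f} mW" for name, watts in power.items()})
```

Because each router contributes its own total, a topology with fewer routers lowers the network-wide sum even if the per-router figure is unchanged.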
Parameters
• 100 nm CMOS process
• VDD = 1.0 V
• Clock frequency = 750 MHz
• 32-bit flit width
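Two quantities follow directly from these parameters and the 64-byte cache line used for the synthetic benchmarks: the number of payload flits per cache line and the peak bandwidth of one link. The sketch assumes one flit per link per cycle and ignores header flits; neither detail is stated in the slides.

```python
import math

FLIT_WIDTH_BITS = 32   # from the parameter list
CLOCK_HZ = 750e6       # from the parameter list
CACHE_LINE_BYTES = 64  # cache-line size used for the synthetic benchmarks


def flits_per_payload(payload_bytes, flit_bits=FLIT_WIDTH_BITS):
    """Payload flits needed to carry `payload_bytes` (header flits ignored)."""
    return math.ceil(payload_bytes * 8 / flit_bits)


def peak_link_bandwidth(flit_bits=FLIT_WIDTH_BITS, clock_hz=CLOCK_HZ):
    """Peak bytes per second, ASSUMING one flit crosses the link every cycle."""
    return flit_bits / 8 * clock_hz


line_flits = flits_per_payload(CACHE_LINE_BYTES)
bandwidth = peak_link_bandwidth()
print(f"{line_flits} payload flits per cache line")
print(f"{bandwidth / 1e9} GB/s peak per link")
```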
Conclusion
• Physical design of the tapered fat-tree is more difficult than that of a mesh
• The TFT topology can reduce memory latency and power dissipation for many-core systems