Memory Hierarchy - POWER5 Processor
By Ajay Mathews Cheriyan, Jian Wang, Shoaib Akram
Topics
• Memory Hierarchy Overview
• Memory Subsystem
• Virtual Memory and Prefetching
POWER-X Introduction
IBM POWER: Performance Optimization With Enhanced RISC
- Used in many of IBM's servers, workstations, and supercomputers
- POWER to POWER3 (1990-1998)
- POWER4 (2001): dual-core chips, up to 8 cores, most powerful at the time
- POWER5 (2004): memory system improvements; supports simultaneous multithreading (SMT)
- POWER6 (May 2007): advanced interchip communication technology; double the performance of POWER5
- POWER7: currently in development
POWER5 Chip Overview
[Chip diagram: two processor cores (P1, P2), shared L2 cache, memory controller (MC), and L3 controller]
L1 Cache
L1 caches are not shared between processors:
- each core has its own cache
- LRU replacement (vs. FIFO in POWER4)
L1 I-cache:
- size: 64KB per processor
- associativity: 2-way (vs. direct-mapped in POWER4)
- line size: 128 bytes
- write policy: N/A
L1 D-cache:
- size: 32KB per processor
- associativity: 4-way (vs. 2-way in POWER4)
- line size: 128 bytes
- write policy: write-through
- transfer rate: 4 words/cycle
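As a rough illustration of what these cache parameters imply, here is a minimal C sketch (not IBM's implementation) that computes the set count and set index for the 32KB, 4-way, 128-byte-line L1 D-cache; the example address is arbitrary.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only: derive set count and set index from the
 * POWER5 L1 D-cache parameters listed above (32KB, 4-way, 128B lines). */
#define LINE_SIZE  128
#define ASSOC      4
#define CACHE_SIZE (32 * 1024)
#define NUM_SETS   (CACHE_SIZE / (ASSOC * LINE_SIZE))   /* = 64 sets */

static unsigned l1d_set_index(uint64_t addr)
{
    /* low 7 bits are the line offset (128 = 2^7); next 6 bits index the set */
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    uint64_t addr = 0x12345680ULL;   /* example address, chosen arbitrarily */
    printf("%d sets; address 0x%llx maps to set %u\n",
           NUM_SETS, (unsigned long long)addr, l1d_set_index(addr));
    return 0;
}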
L2 Cache
• three identical slices, shared between the processors
• slice selection: (memory address) mod 3 = slice ID (illustrated below)
• three independent controllers; each controller manages one slice
• each slice can deliver 32 bytes/cycle (60.8 GB/s) to L1
• Total size: 1.9MB (vs. 1.4MB in POWER4)
• Associativity: 10-way (vs. 8-way)
• Latency: 13 cycles, 6.8 ns (vs. 12 cycles / 7.1 ns)
• Line size: 128 bytes
• Write policy: write-back
• Transfer rate: 4 words/cycle
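To make the mod-3 mapping concrete, here is a minimal sketch; it assumes the hash is applied to the cache-line address so that a whole 128-byte line stays in one slice, which the slide does not state explicitly.

#include <stdint.h>
#include <stdio.h>

#define L2_LINE_SIZE 128

/* Hypothetical slice-selection hash: (line address) mod 3 -> slice ID. */
static unsigned l2_slice(uint64_t addr)
{
    return (unsigned)((addr / L2_LINE_SIZE) % 3);
}

int main(void)
{
    /* consecutive lines rotate across the three slices */
    for (uint64_t a = 0; a < 6 * L2_LINE_SIZE; a += L2_LINE_SIZE)
        printf("line at 0x%04llx -> slice %u\n",
               (unsigned long long)a, l2_slice(a));
    return 0;
}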
L3 Cache
• three identical slices; each slice attaches to one L2 slice
• off-chip, shared between the processors
• L3 directory/controller is on-chip
• this design reduces the off-chip delay
• Total size: 36MB (vs. 32MB in POWER4)
• Associativity: 12-way (vs. 8-way)
• Latency: 87 cycles, 45.8 ns (vs. 123 cycles / 72.3 ns)
• Line size: 256 bytes
• Write policy: write-back
• Transfer rate: <1 word/cycle
Structure Optimizations over POWER4 (1)
L3 removed from the path between the chip and the memory controller:
- why? heavy traffic on the FBC in large systems (up to 16 chips); taking L3 off that path reduces latency to the L3, which is physically closer to the CPUs
- L3 acts as a victim cache for L2
- L3 now operates at 1/2 the processor clock rate, versus 1/3 in POWER4
- this optimization increases bandwidth by about 1/2 and reduces latency by roughly 1/3
(FBC = Fabric Bus Controller)
Structure Optimizations over POWER4 (2)
Memory controller integrated into the chip:
- separate paths from the processor to L3 and from the processor to memory
- increases potential operational parallelism and bandwidth
- significantly reduces latency by eliminating communication delays to an external controller
The benefit of these optimizations (POWER4 vs. POWER5):
- L3 latency: 72.3 ns vs. 45.8 ns
- memory latency: 206 ns vs. 116 ns
- bandwidth (4 processors): 8.37 GB/s vs. 17.9 GB/s
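As a quick check on these figures: L3 latency improves by (72.3 - 45.8) / 72.3 ≈ 37% (close to the "roughly 1/3" reduction claimed on the previous slide), memory latency by (206 - 116) / 206 ≈ 44%, and 4-processor bandwidth by a factor of 17.9 / 8.37 ≈ 2.1.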
Next Topic
• Memory Hierarchy Overview
• Memory Subsystem
• Virtual Memory and Prefetching
Memory Subsystem
Path followed by requests: L2 controller → fabric bus controller (FBC) → memory controller (MC) → physical memory
[Diagram: processors (P), L2 caches, fabric bus controllers (FBC), and memory controllers (MC) leading to physical memory]
Memory Controller
[Diagram: read/write reorder queues, a scheduler that selects operations from the queues, and a FIFO-based arbiter queue]
Memory Controller
• Separate read and write reorder queues; read and write reordering is done differently
• Eight entries per queue (increasing the capacity would increase the clock cycle)
• Centralized FIFO arbiter queue prevents CPU stalls when the memory controller is under stress
• Scheduler selects operations from the queues (sketched below)
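A minimal C sketch of the structure just described follows: two 8-entry reorder queues feeding a scheduler, which hands operations to the centralized FIFO arbiter queue. The selection policy shown (oldest pending read first, then writes) is an assumption for illustration; the slides do not spell out POWER5's actual scheduling heuristics.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_ENTRIES 8

typedef struct { uint64_t addr; bool valid; bool is_read; } mem_op_t;

typedef struct {
    mem_op_t reads[QUEUE_ENTRIES];   /* read reorder queue             */
    mem_op_t writes[QUEUE_ENTRIES];  /* write reorder queue            */
    mem_op_t fifo[QUEUE_ENTRIES];    /* centralized FIFO arbiter queue */
    int fifo_count;
} mem_ctrl_t;

/* Scheduler: select one operation from the reorder queues and push it
 * onto the FIFO arbiter queue that feeds the DRAM command bus.        */
static bool schedule_one(mem_ctrl_t *mc)
{
    mem_op_t *pick = NULL;
    for (int i = 0; i < QUEUE_ENTRIES && !pick; i++)
        if (mc->reads[i].valid) pick = &mc->reads[i];
    for (int i = 0; i < QUEUE_ENTRIES && !pick; i++)
        if (mc->writes[i].valid) pick = &mc->writes[i];
    if (!pick || mc->fifo_count == QUEUE_ENTRIES)
        return false;                /* nothing pending, or arbiter full */
    mc->fifo[mc->fifo_count++] = *pick;
    pick->valid = false;
    return true;
}

int main(void)
{
    mem_ctrl_t mc = {0};
    mc.writes[0] = (mem_op_t){ 0x1000, true, false };
    mc.reads[2]  = (mem_op_t){ 0x2000, true, true  };
    while (schedule_one(&mc)) { }
    printf("%d operations handed to the FIFO arbiter\n", mc.fifo_count);
    return 0;
}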
Interface to Physical Memory
• Buses operate at twice the DRAM speed
• Memory protection: ECC and memory scrubbing
• Memory controller read/write/command buses; SMI chips match the bus width between the MC and the DIMMs
[Diagram: memory controller → SMI chips (×4) → DIMMs (×2 per port); DIMM = Dual In-line Memory Module]
Memory Requests & Multichips
- Ring topology
- Snooping mechanism
- Combined response
- Fabric buses
[Diagram: each processor chip (P) attaches to the fabric buses labelled AB, SB, RB, and DB]
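The combined-response idea can be sketched as follows; this is heavily simplified and based on how snoop protocols generally work rather than on POWER5-specific response encodings, which the slide does not give. Every chip snoops the broadcast request and returns an individual response, and the responses are merged into a single outcome that tells the requester where the data will come from.

#include <stdio.h>

/* Illustrative response values; the real POWER5 encodings are not given here. */
typedef enum { SNOOP_NULL, SNOOP_SHARED, SNOOP_MODIFIED, SNOOP_RETRY } snoop_t;

/* Merge per-chip snoop responses: any retry forces a retry; otherwise a
 * chip holding modified data sources it; otherwise memory supplies it.  */
static snoop_t combine(const snoop_t r[], int n)
{
    snoop_t out = SNOOP_NULL;
    for (int i = 0; i < n; i++) {
        if (r[i] == SNOOP_RETRY)
            return SNOOP_RETRY;
        if (r[i] == SNOOP_MODIFIED)
            out = SNOOP_MODIFIED;
        else if (r[i] == SNOOP_SHARED && out == SNOOP_NULL)
            out = SNOOP_SHARED;
    }
    return out;
}

int main(void)
{
    const char *name[] = { "memory sources the data", "a shared copy exists",
                           "cache-to-cache transfer", "request must be retried" };
    snoop_t resp[4] = { SNOOP_NULL, SNOOP_SHARED, SNOOP_MODIFIED, SNOOP_NULL };
    printf("combined response: %s\n", name[combine(resp, 4)]);
    return 0;
}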
Next Topic
• Memory Hierarchy Overview
• Memory Subsystem
• Virtual Memory and Prefetching
Virtual Memory
• 64-bit virtual address and 50-bit real address
• Two steps to address translation:
• Effective address translated to a virtual address using a 64-entry segment lookaside buffer (SLB)
• Virtual address translated to a real address using the page table
• Page table entries are cached in a 1024-entry, 4-way set-associative TLB
Virtual Memory (Contd.)
• For fast translation, two first-level translation tables are used: one for instructions and one for data
• They provide fast effective-to-real address translation; the SLB and TLB are looked up only on a miss in the first-level tables
• Data translation table: 128-entry fully associative array
• Instruction translation table: 64-entry, 2-way set-associative array
• Entries in both tables are tagged with the thread number and are not shared between threads
• TLB entries can be shared between threads
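A compact C sketch of the two-step translation can make the flow clearer. The structure sizes (64-entry SLB, 1024-entry 4-way TLB) come from the slides; the segment and page sizes, field layouts, and lookup details below are simplifying assumptions, not POWER5's exact geometry.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SLB_ENTRIES 64
#define TLB_SETS    (1024 / 4)               /* 1024 entries, 4-way    */
#define TLB_WAYS    4
#define PAGE_SHIFT  12                       /* assume 4KB pages       */
#define SEG_SHIFT   28                       /* assume 256MB segments  */

typedef struct { uint64_t esid, vsid; bool valid; } slb_entry_t;
typedef struct { uint64_t vpn,  rpn;  bool valid; } tlb_entry_t;

static slb_entry_t slb[SLB_ENTRIES];
static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

/* Step 1: effective -> virtual through the fully associative SLB. */
static bool ea_to_va(uint64_t ea, uint64_t *va)
{
    uint64_t esid = ea >> SEG_SHIFT;
    for (int i = 0; i < SLB_ENTRIES; i++)
        if (slb[i].valid && slb[i].esid == esid) {
            *va = (slb[i].vsid << SEG_SHIFT) | (ea & ((1ULL << SEG_SHIFT) - 1));
            return true;
        }
    return false;                             /* SLB miss */
}

/* Step 2: virtual -> real through the 4-way set-associative TLB. */
static bool va_to_ra(uint64_t va, uint64_t *ra)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    tlb_entry_t *set = tlb[vpn % TLB_SETS];
    for (int w = 0; w < TLB_WAYS; w++)
        if (set[w].valid && set[w].vpn == vpn) {
            *ra = (set[w].rpn << PAGE_SHIFT) | (va & ((1ULL << PAGE_SHIFT) - 1));
            return true;
        }
    return false;                             /* TLB miss: walk the page table */
}

int main(void)
{
    uint64_t ea = 0x10001234ULL, va, ra;
    slb[0] = (slb_entry_t){ ea >> SEG_SHIFT, 0x42, true };            /* map the segment */
    if (ea_to_va(ea, &va)) {
        uint64_t vpn = va >> PAGE_SHIFT;
        tlb[vpn % TLB_SETS][0] = (tlb_entry_t){ vpn, 0x7777, true };  /* map the page */
        if (va_to_ra(va, &ra))
            printf("EA 0x%llx -> VA 0x%llx -> RA 0x%llx\n",
                   (unsigned long long)ea, (unsigned long long)va,
                   (unsigned long long)ra);
    }
    return 0;
}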
Prefetching
When load instructions miss sequential cache lines, the prefetch engine initiates accesses to the following cache lines before they are referenced by future load instructions.
An L1 data cache prefetch is initiated when a load references data from a new cache line; at the same time, a new line is prefetched into L2 from memory.
Because the latency for retrieving data from memory is much higher, the prefetch engine requests data from memory 12 lines ahead of the line being referenced by the load.
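As a rough sketch of this steady-state behaviour (ignoring the ramp-up covered on the next slide, and with bookkeeping that is simplified rather than the hardware's actual state machine), the C fragment below issues a "prefetch" for the line 12 ahead whenever a load advances to the next sequential cache line.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE      128
#define PREFETCH_AHEAD 12

typedef struct { uint64_t last_line; } stream_t;   /* one prefetch stream */

/* Called on every load address: if the load has advanced to the next
 * sequential line, request the line 12 ahead from memory.              */
static void on_load(stream_t *s, uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;
    if (line == s->last_line + 1)
        printf("prefetch line %llu (12 ahead of line %llu)\n",
               (unsigned long long)(line + PREFETCH_AHEAD),
               (unsigned long long)line);
    s->last_line = line;
}

int main(void)
{
    stream_t s = { 0 };
    /* sequential 64-byte loads walking through a few 128-byte lines */
    for (uint64_t a = 0x1000; a < 0x1000 + 4 * LINE_SIZE; a += 64)
        on_load(&s, a);
    return 0;
}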
Prefetching (Contd.)
• Hardware ramps up prefetching slowly, requiring 2 additional sequential misses before it reaches steady-state prefetch sequencing
• Software prefetching is also supported: software indicates the number of lines for the hardware to prefetch
• Advantages:
• Improves performance by eliminating the initial ramp-up
• Only the required number of lines are prefetched
• Eight software prefetch streams are supported per processor
Prefetching (Contd.)
• Upon a cache miss:
• A biased guess is made as to the direction of that stream
• The guess is based on where within the cache line the missing address falls (illustrated below):
• If it is in the first 3/4 of the line, the direction is guessed as ascending
• If it is in the last 1/4, the direction is guessed as descending
• Instruction prefetching is also present in the POWER5 processor, with 4 instruction prefetch buffers (2 per thread)
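The direction bias translates almost directly into code; the sketch below assumes 128-byte lines, as elsewhere in the hierarchy, and the example addresses are arbitrary.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 128

typedef enum { ASCENDING, DESCENDING } direction_t;

/* Guess the stream direction from the miss address's offset within its line:
 * first 3/4 of the line -> ascending, last 1/4 -> descending.               */
static direction_t guess_direction(uint64_t miss_addr)
{
    uint64_t offset = miss_addr % LINE_SIZE;
    return (offset < (3 * LINE_SIZE) / 4) ? ASCENDING : DESCENDING;
}

int main(void)
{
    printf("offset  16 -> %s\n",
           guess_direction(0x1010) == ASCENDING ? "ascending" : "descending");
    printf("offset 120 -> %s\n",
           guess_direction(0x1078) == ASCENDING ? "ascending" : "descending");
    return 0;
}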
Main Memory
• Both DDR and DDR2 DIMMs can be used with POWER5
• SMI chips are provided to connect the DIMMs to the processor
• Support is provided for 2 (2-SMI mode) or 4 (4-SMI mode) SMI chips
• Each SMI chip has two ports, and each port can support up to 2 DIMMs
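For example, in 4-SMI mode with every port fully populated, this works out to 4 SMI chips × 2 ports × 2 DIMMs = 16 DIMMs behind one memory controller (a derived figure assuming all ports are populated; the slides do not state the maximum configuration explicitly).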