IBM Power 7 • Harshit Dokania (hdokani2) • Dhashrath Raguraman (raguram2) • Prasanna Giridhar (giridha2)
System Overview • Performance • 8 cores • 4 SMT threads/core • 45 nm technology, 567 sq mm • 1.2B transistors • Integrated L1, L2 and L3 (local + global) cache • Memory • On-chip eDRAM • 2 DDR3 memory controllers • Microarchitecture • Fetch, Decode, Dispatch, Rename, Execute and Commit • 12 execution units/core • 2 integer and 4 binary floating-point units • 1 vector and 1 decimal floating-point unit • 2 load/store, 1 branch, 1 condition register
Instruction Fetch • 8 instructions per cycle • I-Cache • 32 KB, 4-way, 128 B cache line • Tournament branch predictor [Figure: instruction fetch datapath. Labels: Tag, Address Translation, I-Cache (32 KB, 4-way, 16 banks, 128 B line), 21-bit GHV, Hash, Addr, Cache Hit/Miss, 8 Instructions]
Dispatch and Rename • Dispatch a group of up to 6 instructions • Physical register renaming • Global Completion Table
Issue • Separate issue queues • Unified issue queue • Branch issue queue • Unified issue queue • Partitioned across threads • Trade-off: complexity vs. performance • Issue up to 8 instructions per cycle [Figure: issue queues BRQ, UQ0 and UQ1 with the execution units they feed. Labels: Branch Ex Unit, FX0, LS0, VS0, FX1, LS1, VS1]
Pipeline [Figure: pipeline stages IF, DE, EX, MEM, Commit]
Simultaneous Multithreading and Mode Selection • Idea • A single thread cannot keep all of a core's execution resources busy • Modes • ST: one instruction execution thread per core • SMT2: two instruction execution threads per core • SMT4: four instruction execution threads per core • An 8-core POWER7 can execute 32 threads simultaneously • Dynamic SMT mode selection based on comparing throughput in single-thread mode vs. SMT mode
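The thread-count arithmetic and the throughput-based mode choice on this slide can be sketched as follows. This is only an illustration: the function names and the throughput figures are hypothetical, and the tie-breaking rule (prefer fewer threads) is an assumption, not something the slides or IBM state.

```python
# Illustrative sketch only; not IBM's actual mode-selection logic.
THREADS_PER_CORE = {"ST": 1, "SMT2": 2, "SMT4": 4}


def max_hardware_threads(cores, mode):
    """Number of hardware threads available for a given core count and mode."""
    return cores * THREADS_PER_CORE[mode]


def select_mode(measured_throughput):
    """Pick the mode with the highest measured throughput.

    measured_throughput maps a mode name to a throughput number
    (hypothetical units). Ties go to the mode with fewer threads,
    which is an assumption made for this sketch.
    """
    return max(
        measured_throughput,
        key=lambda mode: (measured_throughput[mode], -THREADS_PER_CORE[mode]),
    )


if __name__ == "__main__":
    print(max_hardware_threads(8, "SMT4"))
    print(select_mode({"ST": 1.0, "SMT2": 1.4, "SMT4": 1.3}))
```

With the made-up numbers in the usage lines, SMT2 wins because it has the highest measured throughput; a workload that scaled better with thread count would select SMT4 instead.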
Comparison [Figure: execution-unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) for four designs: single-thread out-of-order, S80 hardware multithreading, POWER5 2-way SMT, POWER7 4-way SMT. Legend: Thread 0 executing, Thread 1 executing, no thread executing]
Partitioning in SMT • Non-partitioned resources would cost • Complexity and larger area • Power consumption • In SMT4 • Thread 0 and Thread 1 use {UQ0, GPR0, FXU0, LSU0} • Thread 2 and Thread 3 use {UQ1, GPR1, FXU1, LSU1} • In ST and SMT2 • Both GPR copies store identical contents • Functional units (FX, LS) are shared [Figure: register files GPR0 and GPR1 with thread labels T0, T2 and T1, T3]
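The thread-to-resource mapping listed on this slide can be written down as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, and treating the unified queues as shared in ST and SMT2 is an assumption (the slide only says the GPR copies are identical and the FX and LS units are shared).

```python
# Illustrative sketch of the partitioning described on the slide.
UNITS = ("UQ", "GPR", "FXU", "LSU")
THREADS_IN_MODE = {"ST": 1, "SMT2": 2, "SMT4": 4}


def resources_for_thread(thread_id, mode):
    """Return the set of resource names a thread may use in a given mode."""
    if mode not in THREADS_IN_MODE:
        raise ValueError(f"unknown mode: {mode}")
    if not 0 <= thread_id < THREADS_IN_MODE[mode]:
        raise ValueError(f"thread {thread_id} does not exist in {mode}")
    if mode == "SMT4":
        # Threads 0 and 1 use half 0; threads 2 and 3 use half 1.
        halves = (thread_id // 2,)
    else:
        # ST and SMT2: both halves are usable (assumed to include the queues).
        halves = (0, 1)
    return {f"{unit}{half}" for half in halves for unit in UNITS}


if __name__ == "__main__":
    print(sorted(resources_for_thread(0, "SMT4")))
    print(sorted(resources_for_thread(3, "SMT4")))
    print(sorted(resources_for_thread(1, "SMT2")))
```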
Cache Hierarchy • L1 and L2 latency reduced • L3 on chip offers low latency and high bandwidth
L1 Cache • 32 KB, 8-way set associative, 128 B cache lines • Increased bandwidth through concurrent reads and writes • Divided into banks • Each bank has 2 read ports and 1 write port • Write-through policy: all stores go directly to L2
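The size, associativity and line length above fix the number of sets. The sketch below works through that arithmetic and splits an address into tag, index and offset. It assumes plain power-of-two modulo indexing, which is a textbook simplification; the real POWER7 index function and banking are not described in these slides. Function names are hypothetical.

```python
# Textbook set-associative cache arithmetic, used here as an illustration.
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    # L1 from the slide: 32 KB, 8-way, 128 B lines.
    sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 128)
    print(sets, offset_bits, index_bits)
    print(split_address(0x12345, offset_bits, index_bits))
```

For the L1 figures this gives 32 sets, 7 offset bits and 5 index bits. The same function applies to the 256 KB L2 or a 4 MB L3 region by changing the arguments.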
L2 Cache • Absorbs all stores from L1 at low latency • 256 KB, 8-way set associative • Management policy for the local L3 (partial victim cache) determines whether L3 is exclusive or inclusive: • Exclusive provides more capacity and associativity • Inclusive helps reduce energy
L3 Cache • 32 MB divided into 8 regions of 4 MB each: • 8-way set associative, 128 B cache lines • Each region coupled with an L2 cache • L3 built from eDRAM: • eDRAM has higher latency and cycle time than SRAM • But it occupies less area and consumes less standby energy • On chip for lower latency and higher bandwidth
L3 Cache Usage • L2 uses its local L3 as a victim cache • Other L3 regions can also use it as a victim cache • Higher weight given to the associated L2 than to other L3 regions
L2 Cast Out • Check whether a copy of the line is already present • If not, allocate a line
Lateral Cast Out • Check whether a copy of the line is already present • If not, decide whether to allocate a line based on capacity
L3 Replacement Policy • All L3 lines are divided into 2 classes: • Class 1: victim lines cast out by the associated L2 • Class 2: victim lines laterally cast out by other L3 regions • Lines are evicted in the following order: • The LRU line from class 2 • If no such line exists, the LRU line from class 1
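The eviction order on this slide can be sketched for a single cache set as below. This is a minimal model, not the hardware design: the class name, the use of an ordered dictionary for recency, and the choice to leave a line's class unchanged when it is already present are all assumptions made for illustration.

```python
from collections import OrderedDict

# Class labels follow the slide: 1 = cast out by the associated L2,
# 2 = lateral cast out from another L3 region.
LOCAL, LATERAL = 1, 2


class L3Set:
    """One set of a two-class LRU cache (illustrative model only)."""

    def __init__(self, ways=8):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> class, least recently used first

    def choose_victim(self):
        """LRU line of class 2 if one exists, otherwise LRU line of class 1."""
        for wanted in (LATERAL, LOCAL):
            for tag, line_class in self.lines.items():
                if line_class == wanted:
                    return tag
        return None

    def insert(self, tag, line_class):
        """Insert a cast-out line; return the evicted tag, or None."""
        if tag in self.lines:
            # A copy already exists: only refresh its recency (assumption).
            self.lines.move_to_end(tag)
            return None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted = self.choose_victim()
            del self.lines[evicted]
        self.lines[tag] = line_class
        return evicted


if __name__ == "__main__":
    s = L3Set(ways=3)
    for tag, line_class in [("a", LOCAL), ("b", LATERAL), ("c", LATERAL)]:
        s.insert(tag, line_class)
    print(s.insert("d", LOCAL))  # a class 2 line is evicted before any class 1 line
```

In the usage lines, line "a" is the oldest overall, yet "b" is evicted first because lateral cast-out lines are always preferred as victims.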
Memory Subsystem • Total sustained bandwidth ~100 GB/s [Figure: memory controllers MC0 and MC1, buffer chips, DDR3 DRAM chips]
On-Chip Interconnect [Figure: on-chip interconnect. Labels: Coherence Bus, 4x16B Data]
Reliability, Availability, Serviceability (RAS) • Stores previous architected states • On error detection, rolls back • Unlike POWER6, no separate recovery unit is required • The register renaming mechanism is used instead • Flushes all instructions in the pipeline • Fences the core from the rest of the system • Runs an automatic built-in self-test
Energy Management • Despite having 4 times as many cores on the chip, POWER7 stays in the same power envelope as POWER6 • Reasons • Reduced frequency (from 2 cores at 3.6 to 5 GHz to 8 cores at 2.4 to 4.25 GHz) • Microarchitecture optimization (for example, combining GPR, FPR and VR) • On-chip eDRAM (about one fifth the standby power of equivalent SRAM)
Power Saving Techniques • Turn off threads in a core • Remaining threads share the core execution resources • Turn off each core individually • Cache, eDRAM reallocation • Better core-level performance • Per-core frequency control • Operating requirements • Clock gating