IBM POWER7
Adam Kunk, Anil John, Pete Bohman
UIUC - CS 433
Quick Facts
• Released by IBM in 2010 (~ February)
• Successor of the POWER6
• ISA: Power ISA v2.06 (RISC)
• Clock Rate: 2.4 GHz - 4.25 GHz
• Feature size: 45 nm
• Cores: 4, 6, or 8
• Cache: L1, L2, and L3, all on chip
References: [1], [5]
Why the POWER7?
• PERCS: Productive, Easy-to-use, Reliable Computing System
• DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, 2006)
• The contract was to develop a petascale supercomputer architecture before 2011 as part of the HPCS (High Productivity Computing Systems) project.
• IBM, Cray, and Sun Microsystems received HPCS grants for Phase II.
• IBM was chosen for Phase III in 2006.
References: [1], [2]
Blue Waters
• Side note:
• The Blue Waters system was meant to be the first supercomputer using PERCS technology.
• However, the contract was cancelled because of cost and complexity.
History of Power
POWER4/4+ (2001): First Dual Core in Industry
• Dual Core
• Chip Multi Processing
• Distributed Switch
• Shared L2
• Dynamic LPARs (32)
• 180nm
POWER5/5+ (2004): Hardware Virtualization for Unix & Linux
• Dual Core & Quad Core Md
• Enhanced Scaling
• 2 Thread SMT
• Distributed Switch +
• Core Parallelism +
• FP Performance +
• Memory bandwidth +
• 130nm, 90nm
POWER6/6+ (2007): Fastest Processor in Industry
• Dual Core
• High Frequencies
• Virtualization +
• Memory Subsystem +
• Altivec
• Instruction Retry
• Dyn Energy Mgmt
• 2 Thread SMT +
• Protection Keys
• 65nm
POWER7/7+ (2010): Most POWERful & Scalable Processor in Industry
• 4, 6, 8 Core
• 32MB On-Chip eDRAM
• Power Optimized Cores
• Mem Subsystem ++
• 4 Thread SMT++
• Reliability +
• VSM & VSX
• Protection Keys+
• 45nm, 32nm
POWER8: Future
References: [3]
POWER7 Demo • IBM POWER7 Demo
POWER7 Layout
[Figure: chip floorplan. Labels: Core (x8), L2 (x8), L3 Cache eDRAM, Memory Interface, SMP Fabric, POWER, GX Bus, Memory++]
Cores:
• 8 Intelligent Cores / chip (socket)
• 4 and 6 Intelligent Cores available on some models
• 12 execution units per core
• Out of order execution
• 4 Way SMT per core
• 32 threads per chip
• L1: 32 KB I Cache / 32 KB D Cache per core
• L2: 256 KB per core
Chip:
• 32MB Intelligent L3 Cache on chip
References: [3]
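
The per-core figures on this slide can be rolled up into per-chip totals. The short Python sketch below does that arithmetic for the 8-core part; the constant and function names are made up for illustration, and the even L3 split per core is only an arithmetic average, not a statement about how the hardware partitions the cache.

```python
# Figures copied from the layout slide (8-core part).
CORES_PER_CHIP = 8
L1_I_KB = 32      # L1 instruction cache per core
L1_D_KB = 32      # L1 data cache per core
L2_KB = 256       # L2 per core
L3_MB = 32        # shared on-chip L3 (eDRAM)


def chip_cache_totals_kb():
    """Aggregate on-chip cache capacity per level, in KB."""
    return {
        "L1": CORES_PER_CHIP * (L1_I_KB + L1_D_KB),
        "L2": CORES_PER_CHIP * L2_KB,
        "L3": L3_MB * 1024,
    }


totals = chip_cache_totals_kb()
# Average L3 capacity per core if the 32 MB were divided evenly (assumption).
l3_share_mb = L3_MB / CORES_PER_CHIP

for level, kb in totals.items():
    print(f"{level}: {kb} KB per chip")
print(f"Average L3 per core: {l3_share_mb} MB")
```
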
POWER7 Options (8, 6, 4 cores) References: [3]
POWER7 Core
• Each core implements “aggressive” out-of-order (OoO) instruction execution
• The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues
• Up to eight instructions per cycle can be issued to the Instruction Execution units
References: [4]
Instruction Fetch
• 8 instructions fetched from L2 to the L1 I-cache or fetch buffer
• Balanced instruction rates across active threads
• Instruction grouping
• Instructions belonging to a group are dispatched together
• Groups contain independent instructions
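
The grouping idea on this slide can be illustrated with a toy model. The Python sketch below is a deliberate simplification and not IBM's actual group-formation logic: it assumes a maximum group size of six (borrowed from the six-per-cycle dispatch width quoted for the core) and closes a group as soon as an instruction touches a register already written inside that group. The `Instr` type and `form_groups` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_GROUP_SIZE = 6  # assumption: same as the dispatch width of six


@dataclass
class Instr:
    name: str
    dest: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()      # registers read


def form_groups(instrs, max_size=MAX_GROUP_SIZE):
    """Greedily pack instructions, in program order, into groups whose
    members do not depend on each other (toy rule, see lead-in)."""
    groups, current, written = [], [], set()
    for ins in instrs:
        depends = any(s in written for s in ins.srcs) or ins.dest in written
        if current and (depends or len(current) >= max_size):
            groups.append(current)          # close the group and start a new one
            current, written = [], set()
        current.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("ld", "r1", ("r9",)),
    Instr("ld", "r2", ("r9",)),
    Instr("add", "r3", ("r1", "r2")),   # reads r1/r2, so it starts a new group
    Instr("mul", "r4", ("r9", "r9")),
    Instr("st", None, ("r3", "r9")),    # reads r3, so it starts a new group
]
group_names = [[i.name for i in g] for g in form_groups(program)]
print(group_names)
```
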
Instruction Fetch (cont.)
• Branch Prediction
Execution Units
• Each POWER7 core has 12 execution units:
• 2 fixed point units
• 2 load store units
• 4 double precision floating point units (2x POWER6)
• 1 vector unit
• 1 branch unit
• 1 condition register unit
• 1 decimal floating point unit
References: [4]
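
As a quick sanity check on the list above, the Python sketch below tallies the execution units and sets them against the dispatch and issue widths quoted for the core (six and eight instructions per cycle). The dictionary keys and variable names are illustrative only.

```python
# Execution units per POWER7 core, as listed on the slide.
EXECUTION_UNITS = {
    "fixed point": 2,
    "load store": 2,
    "double precision floating point": 4,
    "vector": 1,
    "branch": 1,
    "condition register": 1,
    "decimal floating point": 1,
}

DISPATCH_WIDTH = 6   # instructions dispatched per cycle (POWER7 Core slide)
ISSUE_WIDTH = 8      # instructions issued per cycle (POWER7 Core slide)

total_units = sum(EXECUTION_UNITS.values())
print(f"Execution units per core: {total_units}")
print(f"Dispatch width {DISPATCH_WIDTH}, issue width {ISSUE_WIDTH}")

# There are more units than issue slots, so not every unit
# can be handed a new instruction in the same cycle.
units_exceed_issue = total_units > ISSUE_WIDTH
```
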
SMT
• Simultaneous Multithreading
• SMT1: Single instruction execution thread per core
• SMT2: Two instruction execution threads per core
• SMT4: Four instruction execution threads per core
• This means that an 8-core POWER7 can execute 32 threads simultaneously
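
The thread counts follow directly from cores times threads per core. The Python sketch below tabulates them for the 4-, 6-, and 8-core options; the function and variable names are hypothetical.

```python
# Threads per core in each SMT mode, as defined on the slide.
SMT_MODES = {"SMT1": 1, "SMT2": 2, "SMT4": 4}
CORE_OPTIONS = (4, 6, 8)   # POWER7 core counts


def hardware_threads(cores, smt_mode):
    """Simultaneous hardware threads on one chip."""
    return cores * SMT_MODES[smt_mode]


thread_table = {
    (cores, mode): hardware_threads(cores, mode)
    for cores in CORE_OPTIONS
    for mode in SMT_MODES
}

for (cores, mode), threads in thread_table.items():
    print(f"{cores} cores, {mode}: {threads} threads")
```
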
Multithreading History
[Figure: execution-unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) for four designs: Single thread Out of Order, S80 HW Multi-thread, POWER5 2 Way SMT, and POWER7 4 Way SMT. Legend: Thread 0 Executing, Thread 1 Executing, Thread 2 Executing, Thread 3 Executing, No Thread Executing]
References: [3]
Memory Access • (Look at section 2.1.4 in http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf)
L1 Cache
• 2 read ports, 1 write port
• Writes take priority over reads
• Write-through, so no L1 cast-outs are required
• Binary-tree LRU replacement
• Way prediction bits reduce hit latency
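
To make the replacement bullet concrete, here is a minimal Python sketch of a binary-tree pseudo-LRU policy for a single cache set. It assumes that "B-Tree LRU" on the slide means the usual tree-based pseudo-LRU scheme and that a set has 8 ways; both are assumptions, and the `TreePLRU` class is a hypothetical illustration rather than a description of the POWER7 circuit.

```python
class TreePLRU:
    """Binary-tree pseudo-LRU state for one cache set (ways = power of two)."""

    def __init__(self, ways=8):
        assert ways >= 2 and ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        # One bit per internal tree node, stored in heap order.
        # 0 means "next victim is in the left half", 1 means "right half".
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """Record an access: every node on the path points away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1      # accessed left, so victim is on the right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0      # accessed right, so victim is on the left
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root to the way to replace next."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


plru = TreePLRU(ways=8)
for way in range(8):          # touch ways 0..7 in order
    plru.touch(way)
first_victim = plru.victim()  # way 0, which is also the true LRU way here
plru.touch(0)
second_victim = plru.victim() # way 4; true LRU would pick way 1
print(first_victim, second_victim)
```

The second result shows why the scheme is only an approximation of LRU: it needs just 7 bits per 8-way set, but it can pick a way that is not the least recently used one.
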
L2 Cache
• Inclusive of L1
• The L3 acts as a partial victim cache for the L2
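
The interaction between a write-through L1 and an inclusive L2 can be shown with a toy model. In the Python sketch below, a plain dictionary stands in for the L2, the L1 evicts in FIFO order, and stores that miss in L1 do not allocate a line; all three choices are simplifying assumptions made for illustration, and `WriteThroughL1` is a hypothetical name.

```python
class WriteThroughL1:
    """Toy write-through L1 in front of an inclusive L2 (a dict)."""

    def __init__(self, capacity, l2):
        self.capacity = capacity
        self.lines = {}    # address -> value; dict order gives FIFO eviction
        self.l2 = l2       # stands in for the inclusive L2

    def write(self, addr, value):
        self.l2[addr] = value            # write-through: L2 sees every store
        if addr in self.lines:
            self.lines[addr] = value     # keep the L1 copy coherent if present

    def read(self, addr):
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                victim = next(iter(self.lines))
                del self.lines[victim]   # no cast-out needed: L2 is already current
            # Fill L2 first (default 0) so that L2 always includes L1's contents.
            self.lines[addr] = self.l2.setdefault(addr, 0)
        return self.lines[addr]


l2 = {}
l1 = WriteThroughL1(capacity=2, l2=l2)
l1.write(0x10, 7)        # store miss: goes to L2 only
value_a = l1.read(0x10)  # brings the line into L1
l1.write(0x10, 9)        # updates both L1 and L2
l1.read(0x20)
l1.read(0x30)            # L1 is full, so 0x10 is dropped without a write-back
print(value_a, l2[0x10], sorted(l1.lines))
```
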
L3 Cache • Details of the L3 Cache …. (leads up to eDRAM)
eDRAM
• eDRAM: Embedded dynamic random-access memory
• This means the L3 cache (shared, 32 MB) is on-chip
• Faster access because of the shorter distance to the cores
• Less area and less power
• On-chip interconnects provide each core with 32-byte buses to and from the L3 cache
• Side note: eDRAM is also used in many game consoles (PS2, GameCube, Wii, etc.)
References: [5], [6]
eDRAM (cont.)
• Compared with off-chip eDRAM, eDRAM in the POWER7 provides 1/6 the latency and twice the bandwidth
• Compared with SRAM, it uses 1/5 the standby power in 1/3 the required area
References: [5]
References
1. http://en.wikipedia.org/wiki/POWER7
2. http://en.wikipedia.org/wiki/PERCS
3. Central PA PUG POWER7 review.ppt, http://www.ibm.com/developerworks/wikis/download/attachments/135430247/Central+PA+PUG+POWER7+review.ppt
References (cont.)
4. http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf
5. http://www.serc.iisc.ernet.in/~govind/243/Power7.pdf
6. http://en.wikipedia.org/wiki/EDRAM