900 likes | 1.11k Views
POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer. Hsien-Hsin S. Lee Georgia Tech with Dong Hyuk Woo Georgia Tech Josh Fryman Intel Corporation Allan Knies Intel Research Berkeley Marsha Eng Intel Corporation. Performance Scalability Area Scalability
E N D
POD: A Heterogeneous Many-Core Platform Using a 3D-integrated Acceleration Layer Hsien-Hsin S. Lee Georgia Tech with Dong Hyuk Woo Georgia Tech Josh Fryman Intel Corporation Allan Knies Intel Research Berkeley Marsha Eng Intel Corporation
Performance Scalability • Area Scalability • How many cores on a die? • Power Scalability • How much power per core? ‘Many-Core’? Are We There Yet? • What is new? • On-Die Performance Scalability
Cooler jet Pure copper Cooligy’s channel 3D Cooler Pro PS3 Grill Cooking-Aware Computing (Source: Kevin Skadron, UVa) Power Issue 1: Thermal Control
Power Issue 2: Supply Google Server Farms (Oregon)
Scalable Power Consumption? * The above pictures (>=8) do not represent any current or future Intel products.
A Thousand Cores, Power Consumption? • UIUC: Programming Model for one thousand cores [DAC-44] • Mark Horowitz (Stanford): You will never see a one-thousand core processor [C2S2/GSRC review, Dec 2007] ? ? ? ? ?
Some Known Facts on Interconnection Interconnection Network related • One main energy hog! • Alpha 21364: • 20% (25W / 125W) • MIT RAW: • 40% of individual tile power • UT TRIPS: • 25% • Mesh-based MIMD many-core? • Excessive energy in its router logic • Unpredictable communication pattern • Unfriendly to low-power techniques
Many Many-Cores . . . To Name a Few P* : Homogeneous c* : Homogeneous (Energy Efficient) P+c*: Heterogeneous
Power-Equivalent Performance P+c* c* P* Woo and Lee. To appear in IEEE Computer
Power-Equivalent Performance per Joule P+c* c* P* Woo and Lee. To appear in IEEE Computer
Woo, Fryman, Knies, Eng, and Lee. To appear in IEEE MICRO, July 2008 PODParallel-On-DemandParallel-On-Die
Modern Constraints and Issues • Area & Power scalability (efficiency) • On-chip wire delay • Backward/forward binary compatibility • Extensibility • Low NRE
Instruction Bus (IBUS) ORTree A “Snap-On” Broad-Purpose Accelerator Interleaved 2D Torus Links Data Return Buffer (DRB) Memory Bus (MBUS) Processing Element Memory ring Row Response Queue (RRQ) External TLB (xTLB) x86 Processor Arbiter (ARB) Memory Controller L2 Cache
3D Die Stacking, Extending Moore’s Law Analog Layer Face to Back DRAM Layer Back to Back Cache Layer Face to Face Processor Layer Heat Sink
3D-integrated POD Die 0 Die 1 You live and work here You get your beer here
Host Processor Host Processor
POD Array (with 8x8 PEs) Host Processor
Instruction Bus (IBus) Host Processor
ORTree: Status Monitoring Flag status Host Processor
DRB DRB DRB DRB DRB DRB DRB DRB Data Return Buffer (DRB) Host Processor
DRB 0 7 1 6 2 5 3 4 DRB DRB DRB DRB DRB DRB DRB 2D Torus P2P Links Host Processor
DRB DRB DRB DRB DRB DRB DRB DRB 2D Torus P2P Links Host Processor
RRQ DRB RRQ DRB RRQ DRB RRQ DRB DRB RRQ RRQ DRB RRQ DRB RRQ DRB RRQ & MBus: DMA Row Response Queue Memory Bus Host Processor
DRB RRQ RRQ DRB System memory DRB RRQ RRQ DRB MC DRB RRQ MC RRQ DRB RRQ DRB RRQ DRB ARB xTLB LLC On-chip MC / Ring Memory Controller Arbitrator Host Processor
RRQ RRQ D D RRQ D D RRQ D D RRQ D D RRQ D D RRQ D D RRQ D D Wire Delay Aware Design D D D D D D D EFLAGS ! MemFence IBUS
Simplified PE Pipelining N S E W N S E W DEC / RF G Local SRAM X 96-bit VLIW instruction M 80 MBUS 96 IBUS
Inter-PE Communication • Fully synchronized and orchestrated by the host processor • No crossbar • No contention / arbitration • No large buffer of conventional on-chip routers • Nothing unpredictable from the standpoint of programmers!
Memory Organization LLC Ctrl Ring: ~2B Address Ring: ~8B RRQn-1 MCj-1 System memory PE Array RRQ Ctrl Ring: ~2B MC0 RRQ0 xTLB Data Ring: ~66B ARB Worst-case: ~78B LLC x86 Host Core
Execution Model • Heterogeneous ISAs • Host Processor (e.g., x86 CISC) • For backward compatibility • RISC-type VLIW POD • For area- and power-efficiency • Host orchestrates the entire execution • Instruction Encapsulation • SendBits (x86 extension) • Followed by 96-bit POD VLIW instructions
Programming Model • Example code: Reduction char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 }
Programming Model (2x2 POD) • Malloc memory at the host memory space char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 }
Programming Model (2x2 POD) • Initialize ai char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 0 17: 0 18: 0 19: 0
Programming Model (2x2 POD) • Load the address of my PE ID char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
Programming Model (2x2 POD) • Load my PE ID char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } r15 2 r15 2 r15 2 r15 2 base_addr: 16 16: 1 17: 2 18: 3 19: 4
Programming Model (2x2 POD) • Broadcast base_addr char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } r15 2 r15 3 r15 0 r15 1 base_addr: 16 16: 1 17: 2 18: 3 19: 4
Programming Model (2x2 POD) • Calculate the pointer to my ai char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } r14 16 r14 16 r15 2 r15 3 r14 16 r14 16 r15 0 r15 1 base_addr: 16 16: 1 17: 2 18: 3 19: 4
Programming Model (2x2 POD) • Load my ai from the host memory space char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } r14 16 r14 16 r15 18 r15 19 r14 16 r14 16 r15 16 r15 17 base_addr: 16 16: 1 17: 2 18: 3 19: 4
Programming Model (2x2 POD) • Wait until it is loaded char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } r14 16 r14 16 r15 18 r15 19 r14 16 r14 16 r15 16 r15 17 base_addr: 16 16: 1 17: 2 18: 3 19: 4
Programming Model (2x2 POD) • Update S char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } r1 3 r1 4 r14 16 r14 16 r15 18 r15 19 r1 1 r1 2 r14 16 r14 16 r15 16 r15 17 base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 r1 r1 r1 3 1 2 4 r2 r2 r2 r2 2 1 4 3 r14 r14 r14 r14 16 16 16 16 r15 r15 r15 r15 19 16 17 18 Programming Model (2x2 POD) • 1st iteration char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 r1 r1 r1 3 1 2 4 r2 r2 r2 r2 2 1 4 3 r14 r14 r14 r14 16 16 16 16 r15 r15 r15 r15 19 16 17 18 Programming Model (2x2 POD) • Transfer my ai to my east-side neighbor char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 3 r1 4 r2 3 r2 4 r14 16 r14 16 r15 18 r15 19 r1 1 r1 2 r2 1 r2 2 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • Update S char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 4 r1 3 r2 7 r2 7 r14 16 r14 16 r15 18 r15 19 r1 2 r1 1 r2 3 r2 3 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • Copy rowsum to r1 char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 7 r1 7 r2 7 r2 7 r14 16 r14 16 r15 18 r15 19 r1 3 r1 3 r2 3 r2 3 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • 1st iteration char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 r1 r2 7 r2 7 r14 16 r14 16 r15 18 r15 19 r1 r1 r2 3 r2 3 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • Transfer my rowsum to my north-side neighbor char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } 7 7 3 3 base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 7 r1 7 r2 7 r2 7 r14 16 r14 16 r15 18 r15 19 r1 3 r1 3 r2 3 r2 3 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • Update S char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 3 r1 3 r2 10 r2 10 r14 16 r14 16 r15 18 r15 19 r1 7 r1 7 r2 10 r2 10 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • Done! char* base_addr = (char*) malloc(npe); for ( i=0; I < npe; i++ ) { *(base_addr+i) = i+1; } $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] POD_movl( r14, base_addr ); $ASM add r15 = r15, r14 $ASM ld r1 = sys [ r15 + 0 ] POD_mfence(); $ASM add r2 = r2, r1 for ( i=0; i< (npeX-1); i++ ) { $ASM xfer.east r1 = r1 $ASM add r2 = r2, r1 } $ASM add r1 = r2, 0 for ( i=0; i< (npeY-1); i++ ) { $ASM xfer.north r1 = r1 $ASM add r2 = r2, r1 } base_addr: 16 16: 1 17: 2 18: 3 19: 4
r1 3 r1 3 r2 10 r2 10 r14 16 r14 16 r15 18 r15 19 r1 7 r1 7 r2 10 r2 10 r14 16 r14 16 r15 16 r15 17 Programming Model (2x2 POD) • Broadcast the pointer to ‘result’ int result; POD_movl( r14, &result ); $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] $ASM sub r15 = r15, 0 $ASM pushmask.e $ASM st sys[ r14 + 0 ] = r2 $ASM popmask POD_mfence(); base_addr: 16 16: 1 17: 2 18: 3 19: 4 128: <- addr of ‘result’
r1 3 r1 3 r2 10 r2 10 r14 128 r14 128 r15 18 r15 19 r1 7 r1 7 r2 10 r2 10 r14 128 r14 128 r15 16 r15 17 Programming Model (2x2 POD) • Only PE0 will write-back! int result; POD_movl( r14, &result ); $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] $ASM sub r15 = r15, 0 $ASM pushmask.e $ASM st sys[ r14 + 0 ] = r2 $ASM popmask POD_mfence(); base_addr: 16 128: <- addr of ‘result’ 16: 1 17: 2 18: 3 19: 4
r1 3 r1 3 r2 10 r2 10 r14 128 r14 128 r15 2 r15 2 r1 7 r1 7 r2 10 r2 10 r14 128 r14 128 r15 2 r15 2 Programming Model (2x2 POD) • Load PE_ID int result; POD_movl( r14, &result ); $ASM mov r15 = MY_PE_ID_POINTER $ASM ld r15 = local [ r15 + 0 ] $ASM sub r15 = r15, 0 $ASM pushmask.e $ASM st sys[ r14 + 0 ] = r2 $ASM popmask POD_mfence(); base_addr: 16 128: <- addr of ‘result’ 16: 1 17: 2 18: 3 19: 4