CUDA Lecture 10: Architectural Considerations. Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Objective • To understand the major factors that dictate performance when using a GPU as a compute accelerator for the CPU • The feeds and speeds of the traditional CPU world • The feeds and speeds when employing a GPU • To form a solid knowledge base for performance programming in modern GPUs • Knowing yesterday, today, and tomorrow • The PC world is becoming flatter • Outsourcing of computation is becoming easier… Architectural Considerations – Slide 2
Outline • Topic 1 (next): The GPU as Part of the PC Architecture • Topic 2: Threading Hardware in the G80 • Topic 3: Memory Hardware in the G80 Architectural Considerations – Slide 3
Recall: Typical Structure of a CUDA Program • Global variable declarations • Function prototypes • __global__ void kernelOne(…) • main() • allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes) • transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…) • execution configuration setup • kernel call – kernelOne<<<execution configuration>>>( args… ); • transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…) • optional: compare against golden (host computed) solution • Kernel – __global__ void kernelOne(type args,…) • variable declarations – __local__, __shared__ • automatic variables transparently assigned to registers or local memory • __syncthreads()… repeat as needed Architectural Considerations – Slide 4
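A minimal compilable sketch of that structure is below; the kernel body, the name d_data, and the size N are illustrative placeholders rather than anything from a specific lecture example.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernelOne(float *d_data, int n)
{
    /* Automatic variables such as i are transparently assigned to registers. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main(void)
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_data[N];
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);                                 /* allocate memory space on the device  */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  /* transfer data from host to device    */

    kernelOne<<<(N + 255) / 256, 256>>>(d_data, N);             /* execution configuration + kernel call */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  /* transfer results from device to host */
    cudaFree(d_data);

    printf("h_data[1] = %f (expect 2.0)\n", h_data[1]);         /* optional: compare against host-computed solution */
    return 0;
}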
Bandwidth: Gravity of Modern Computer Systems • The bandwidth between key components ultimately dictates system performance • Especially true for massively parallel systems processing massive amounts of data • Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases • Ultimately, performance falls back to what the “speeds and feeds” dictate Architectural Considerations – Slide 5
Classic PC Architecture • Northbridge connects the three components that must communicate at high speed • CPU, DRAM, video • Video also needs to have first-class access to DRAM • Earlier NVIDIA cards connected through AGP, up to 2 GB/sec transfers • Southbridge serves as a concentrator for slower I/O devices • The northbridge/southbridge pair forms the core logic chipset Architectural Considerations – Slide 6
(Original) PCI Bus Specification • Connected to the southbridge • Originally 33 MHz, 32-bit wide, 132 MB/sec peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/sec peak • Upstream bandwidth remains slow for devices (256 MB/sec peak) • Shared bus with arbitration • Winner of arbitration becomes bus master and can connect to CPU or DRAM through the southbridge and northbridge Architectural Considerations – Slide 7
PCI as Memory-Mapped I/O • PCI device registers are mapped into the CPU’s physical address space • Accessed through loads/stores (kernel mode) • Addresses assigned to the PCI devices at boot time • All devices listen for their addresses Architectural Considerations – Slide 8
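As a hedged illustration of what “memory-mapped” means here, the sketch below maps a page of physical address space into a Linux process and reads a 32-bit register with an ordinary load. The address REG_BASE is a made-up placeholder (real BAR addresses are assigned at boot and reported by tools such as lspci), and in practice this access is performed by a kernel-mode driver rather than from user space.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define REG_BASE 0xFEB00000UL   /* placeholder physical address of a device BAR */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);   /* requires root privileges */
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    /* Map one page of the device's register space into this process. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, REG_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Ordinary loads/stores through this pointer become device register accesses. */
    printf("register 0 = 0x%08x\n", (unsigned)regs[0]);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}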
PCI Express (PCIe) • Switched, point-to-point connection • Each card has a dedicated “link” to the central switch; no bus arbitration • Packet-switched messages form virtual channels • Prioritized packets for quality of service, e.g., real-time video streaming Architectural Considerations – Slide 9
PCIe Links and Lanes • Each link consists of one or more lanes • Each lane is 1-bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/sec in one direction) • Upstream and downstream are now simultaneous and symmetric • Each link can combine 1, 2, 4, 8, 12, or 16 lanes – x1, x2, etc. Architectural Considerations – Slide 10
PCIe Links and Lanes (cont.) • Each link consists of one or more lanes • Each byte of data is 8b/10b encoded into 10 bits with an equal number of 1’s and 0’s; net data rate 2 Gb/sec per lane each way • Thus, the net data rates are 250 MB/sec (x1), 500 MB/sec (x2), 1 GB/sec (x4), 2 GB/sec (x8), 4 GB/sec (x16), each way Architectural Considerations – Slide 11
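The quoted figures follow directly from the encoding overhead; the small host-side check below reproduces them from the 2.5 Gb/sec raw lane rate and the 8b/10b overhead given above.

#include <stdio.h>

int main(void)
{
    double raw_gbps = 2.5;                             /* raw signaling rate per lane, Gb/sec       */
    double net_gbps = raw_gbps * 8.0 / 10.0;           /* 8b/10b: 10 transmitted bits per data byte */
    double lane_mb_per_sec = net_gbps * 1000.0 / 8.0;  /* Gb/sec -> MB/sec                          */

    int widths[] = { 1, 2, 4, 8, 16 };
    for (int i = 0; i < 5; i++)
        printf("x%-2d link: %4.0f MB/sec each way\n",
               widths[i], lane_mb_per_sec * widths[i]);   /* 250, 500, 1000, 2000, 4000 */
    return 0;
}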
PCIe PC Architecture • PCIe forms the interconnect backbone • Northbridge/Southbridge are both PCIe switches • Some Southbridge designs have built-in PCI-PCIe bridge to allow old PCI cards • Some PCIe cards are PCI cards with a PCI-PCIe bridge Architectural Considerations – Slide 12
Today’s Intel PC Architecture: Single Core System • FSB connection between processor and northbridge (82925X) • Memory control hub • Northbridge handles “primary” PCIe to video/GPU and DRAM • PCIe x16 bandwidth at 8 GB/sec (4 GB/sec each direction) • Southbridge (ICH6RW) handles other peripherals Architectural Considerations – Slide 13
Today’s Intel PC Architecture: Dual Core System • Bensley platform • Blackford Memory Control Hub (MCH) is now a PCIe switch that integrates the northbridge and southbridge (NB/SB) • FBD (fully buffered DIMMs) allow simultaneous R/W transfers at 10.5 GB/sec per DIMM • PCIe links form the backbone Source: http://www.2cpu.com/review.php?id=109 Architectural Considerations – Slide 14
Today’s Intel PC Architecture: Dual Core System (cont.) • Bensley platform • PCIe device upstream bandwidth now equal to downstream • Workstation version has x16 GPU link via the Greencreek MCH Source: http://www.2cpu.com/review.php?id=109 Architectural Considerations – Slide 15
Today’s Intel PC Architecture: Dual Core System (cont.) • Two CPU sockets • Dual independent buses to the CPUs, each basically an FSB • CPU feeds at 8.5–10.5 GB/sec per socket • Compared to the current front-side bus CPU feed of 6.4 GB/sec • PCIe bridges to legacy I/O devices Source: http://www.2cpu.com/review.php?id=109 Architectural Considerations – Slide 16
Today’s AMD PC Architecture • AMD HyperTransport™ Technology bus replaces the front-side bus architecture • HyperTransport™ similarities to PCIe: • Packet-based, switching network • Dedicated links for both directions • Shown in a 4-socket configuration, 8 GB/sec per link Architectural Considerations – Slide 17
Today’s AMD PC Architecture (cont.) • Northbridge/HyperTransport™ is on die • Glueless logic • to DDR, DDR2 memory • PCI-X/PCIe bridges (usually implemented in the southbridge) Architectural Considerations – Slide 18
Today’s AMD PC Architecture (cont.) • “Torrenza” technology • Allows licensing of coherent HyperTransport™ to 3rd party manufacturers to make socket-compatible accelerators/co-processors Architectural Considerations – Slide 19
Today’s AMD PC Architecture (cont.) • “Torrenza” technology • Allows 3rd party PPUs (Physics Processing Unit), GPUs, and co-processors to access main system memory directly and coherently Architectural Considerations – Slide 20
Today’s AMD PC Architecture (cont.) • “Torrenza” technology • Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory Architectural Considerations – Slide 21
HyperTransport™ Feeds and Speeds • Primarily a low-latency direct chip-to-chip interconnect; supports mapping to board-to-board interconnects such as PCIe Courtesy HyperTransport™ Consortium Source: “White Paper: AMD HyperTransport Technology-Based System Architecture” Architectural Considerations – Slide 22
HyperTransport™ Feeds and Speeds (cont.) • HyperTransport™ 1.0 Specification • 800 MHz max, 12.8 GB/sec aggregate bandwidth (6.4 GB/sec each way) Courtesy HyperTransport™ Consortium Source: “White Paper: AMD HyperTransport Technology-Based System Architecture” Architectural Considerations – Slide 23
HyperTransport™ Feeds and Speeds (cont.) • HyperTransport™ 2.0 Specification • Added PCIe mapping • 1.0–1.4 GHz clock, 22.4 GB/sec aggregate bandwidth (11.2 GB/sec each way) Courtesy HyperTransport™ Consortium Source: “White Paper: AMD HyperTransport Technology-Based System Architecture” Architectural Considerations – Slide 24
HyperTransport™ Feeds and Speeds (cont.) • HyperTransport™ 3.0 Specification • 1.8–2.6 GHz clock, 41.6 GB/sec aggregate bandwidth (20.8 GB/sec each way) • Added AC coupling to extend HyperTransport™ to long-distance, system-to-system interconnect Courtesy HyperTransport™ Consortium Source: “White Paper: AMD HyperTransport Technology-Based System Architecture” Architectural Considerations – Slide 25
GeForce 7800 GTX Board Details • 256 MB / 256-bit DDR3 memory at 600 MHz (8 pieces of 8M×32) • 16x PCI Express interface • SLI connector, single-slot cooling, sVideo TV out, 2× DVI outputs Architectural Considerations – Slide 26
Topic 2: Threading in G80 • Single-Program Multiple-Data (SPMD) • CUDA integrated CPU + GPU application C program • Serial C code executes on CPU • Parallel Kernel C code executes on GPU thread blocks Architectural Considerations – Slide 27
SPMD (cont.) • CPU serial code, then Grid 0: GPU parallel kernel KernelA<<< nBlk, nTid >>>(args); • CPU serial code, then Grid 1: GPU parallel kernel KernelB<<< nBlk, nTid >>>(args); Architectural Considerations – Slide 28
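A minimal sketch of that serial/parallel alternation is below; KernelA, KernelB, nBlk, and nTid are the placeholder names from the slide, and the kernel bodies are left empty for illustration.

#include <cuda_runtime.h>

__global__ void KernelA(float *d) { /* work for grid 0 */ }
__global__ void KernelB(float *d) { /* work for grid 1 */ }

void runPhases(float *d, int nBlk, int nTid)
{
    /* ... CPU serial code ... */
    KernelA<<<nBlk, nTid>>>(d);     /* grid 0 executes in parallel on the GPU   */
    cudaDeviceSynchronize();        /* wait before dependent CPU code continues */

    /* ... more CPU serial code ... */
    KernelB<<<nBlk, nTid>>>(d);     /* grid 1 executes in parallel on the GPU   */
    cudaDeviceSynchronize();
}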
Grids and Blocks • A kernel is executed as a grid of thread blocks • All threads share global memory space Architectural Considerations – Slide 29
Grids and Blocks (cont.) • A thread block is a batch of threads that can cooperate with each other by: • Synchronizing their execution using a barrier • Efficiently sharing data through a low-latency shared memory • Two threads from two different blocks cannot cooperate Architectural Considerations – Slide 30
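A minimal sketch of such cooperation, assuming a block of at most 256 threads: each thread stages one element in shared memory, the barrier guarantees all writes are visible, and then threads read values written by other threads in the same block. The kernel name and the reversal example are illustrative, not from the slides.

#include <cuda_runtime.h>

__global__ void reverseWithinBlock(float *d_data)
{
    __shared__ float tile[256];          /* low-latency memory shared by the block */
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = d_data[base + t];          /* each thread writes one element         */
    __syncthreads();                     /* barrier: wait until every write lands  */

    /* Now it is safe to read an element written by another thread in this block. */
    d_data[base + t] = tile[blockDim.x - 1 - t];
}

/* Launch with at most 256 threads per block, e.g.
   reverseWithinBlock<<<numBlocks, 256>>>(d_data);  */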
CUDA Thread Block: Review • Programmer declares a (thread) block: • Block size 1 to 512 concurrent threads • Block shape 1D, 2D, or 3D • Block dimensions in threads (Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, …, m, all running the same thread program; courtesy John Nickolls, NVIDIA) Architectural Considerations – Slide 31
CUDA Thread Block: Review (cont.) • All threads in a block execute the same thread program • Threads share data and synchronize while doing their share of the work • Threads have thread id numbers within the block • The thread program uses the thread id to select work and address shared data (Figure as on the previous slide; courtesy John Nickolls, NVIDIA) Architectural Considerations – Slide 32
GeForce-8 Series Hardware Overview (Figure: the streaming processor array is built from texture processor clusters; each TPC contains two streaming multiprocessors plus a texture unit (TEX); each SM contains an instruction L1 cache, a data L1 cache, instruction fetch/dispatch logic, shared memory, 8 streaming processors (SPs), and 2 super function units (SFUs)) Architectural Considerations – Slide 33
CUDA Processor Terminology • SPA: Streaming Processor Array (variable across the GeForce 8-series; 8 TPCs in the GeForce 8800) • TPC: Texture Processor Cluster (2 SM + TEX) • SM: Streaming Multiprocessor (8 SP) • Multi-threaded processor core • Fundamental processing unit for a CUDA thread block • SP: Streaming Processor • Scalar ALU for a single CUDA thread Architectural Considerations – Slide 34
Streaming Multiprocessor • Streaming Multiprocessor (SM) • 8 Streaming Processors (SP) • 2 Super Function Units (SFU) • Multi-threaded instruction dispatch • 1 to 512 threads active • Shared instruction fetch per 32 threads • Covers latency of texture/memory loads (Figure: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs) Architectural Considerations – Slide 35
Streaming Multiprocessor (cont.) • 20+ GFLOPS • 16 KB shared memory • Texture and global memory access Architectural Considerations – Slide 36
G80 Thread Computing Pipeline (Figure: G80 graphics-mode pipeline – host, input assembler, setup/raster/ZCull, vertex/geometry/pixel thread issue, thread processor array of SPs with texture filter (TF) and L1 units, L2 caches, and frame buffer (FB) partitions) • The future of GPUs is programmable processing • So: build the architecture around the processor Architectural Considerations – Slide 37
G80 Thread Computing Pipeline (cont.) (Figure: G80 compute-mode pipeline – host, input assembler, thread execution manager, SMs with parallel data caches and texture units, load/store units, and global memory) • Processors execute computing threads • Alternative operating mode specifically for computing • The thread execution manager generates thread grids based on kernel calls Architectural Considerations – Slide 38
Thread Life Cycle in Hardware (Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2-D array of blocks, e.g. Block (1, 1), and each block is an array of threads, e.g. Thread (2, 1)) • Grid is launched on the streaming processor array (SPA) • Thread blocks are serially distributed to all the streaming multiprocessors (SMs) • Potentially >1 thread block per SM • Each SM launches warps of threads • 2 levels of parallelism Architectural Considerations – Slide 39
Thread Life Cycle in Hardware (cont.) • SM schedules and executes warps that are ready to run • As warps and thread blocks complete, resources are freed • SPA can distribute more thread blocks Architectural Considerations – Slide 40
Streaming Multiprocessor Executes Blocks (Figure: blocks are assigned to SM 0 and SM 1; each SM runs threads t0, t1, t2, …, tm with its multithreaded instruction unit (MT IU), SPs, and shared memory, and shares texture units, L1/L2 caches, and memory) • Threads are assigned to SMs at block granularity • Up to 8 blocks to each SM as resources allow • An SM in G80 can take up to 768 threads • Could be 256 (threads/block) × 3 blocks • Or 128 (threads/block) × 6 blocks, etc. Architectural Considerations – Slide 41
Streaming Multiprocessor Executes Blocks (cont.) • Threads run concurrently • SM assigns/maintains thread id numbers • SM manages/schedules thread execution Architectural Considerations – Slide 42
Thread Scheduling/Execution (Figure: Block 1 and Block 2 warps, each warp consisting of threads t0, t1, t2, …, t31, scheduled onto a streaming multiprocessor) • Each thread block is divided into 32-thread warps • This is an implementation decision, not part of the CUDA programming model • Warps are the scheduling units in an SM Architectural Considerations – Slide 43
Thread Scheduling/Execution (cont.) • If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? • Each block is divided into 256/32 = 8 warps • There are 8 × 3 = 24 warps • At any point in time, only one of the 24 warps will be selected for instruction fetch and execution Architectural Considerations – Slide 44
Streaming Multiprocessor Warp Scheduling (Figure: the SM multithreaded warp scheduler interleaves warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96, …) • SM hardware implements zero-overhead warp scheduling • Warps whose next instruction has its operands ready for consumption are eligible for execution • Eligible warps are selected for execution on a prioritized scheduling policy • All threads in a warp execute the same instruction when selected Architectural Considerations – Slide 45
Streaming Multiprocessor Warp Scheduling (cont.) • Four clock cycles are needed to dispatch the same instruction for all threads in a warp in G80 • If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency Architectural Considerations – Slide 46
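The 13-warp figure can be checked from the assumptions stated on the slide (4 issue cycles per warp instruction, one 200-cycle global access every 4 instructions); the short host-side sketch below just does that arithmetic.

#include <stdio.h>

int main(void)
{
    int issue_cycles_per_instr = 4;    /* G80: 4 cycles to issue one warp instruction   */
    int instrs_between_loads   = 4;    /* one global memory access every 4 instructions */
    int memory_latency         = 200;  /* cycles of latency to hide                     */

    /* Cycles a single warp keeps the SM busy before it stalls on a load. */
    int busy_cycles_per_warp = issue_cycles_per_instr * instrs_between_loads;   /* 16 */

    /* Warps needed so some warp is always ready while others wait: ceil(200/16). */
    int warps_needed = (memory_latency + busy_cycles_per_warp - 1) / busy_cycles_per_warp;

    printf("warps needed = %d\n", warps_needed);   /* prints 13 */
    return 0;
}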
SM Instruction Buffer: Warp Scheduling • Fetch one warp instruction/cycle • from the instruction L1 cache • into any instruction buffer slot • Issue one “ready-to-go” warp instruction/cycle • from any warp – instruction buffer slot • operand scoreboarding used to prevent hazards • Issue selection based on round-robin/age of warp • SM broadcasts the same instruction to the 32 threads of a warp (Figure: SM datapath – instruction L1 cache (I$), multithreaded instruction buffer, register file (RF), constant L1 cache (C$), shared memory, operand select, MAD units, and SFU) Architectural Considerations – Slide 47
Scoreboarding • All register operands of all instructions in the instruction buffer are scoreboarded • instruction becomes ready after the needed values are deposited • prevents hazards • cleared instructions are eligible for issue Architectural Considerations – Slide 48
Scoreboarding (cont.) • Decoupled memory/processor pipelines • any thread can continue to issue instructions until scoreboarding prevents issue • allows memory/processor ops to proceed in shadow of other waiting memory/processor ops Architectural Considerations – Slide 49
Granularity Considerations • For Matrix Multiplication, should I use 4×4, 8×8, 16×16 or 32×32 tiles? • For 4×4, we have 16 threads per block. • Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. • However, each SM can only take up to 8 blocks, thus there will be only 128 threads in each SM! • There are 8 warps but each warp is only half full. Architectural Considerations – Slide 50
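The same reasoning for the other candidate tile sizes can be sketched with the G80 limits quoted earlier (768 threads and 8 blocks per SM, 512 threads per block); the values below are derived from those limits rather than stated on this slide.

#include <stdio.h>

int main(void)
{
    const int maxThreadsPerSM  = 768;   /* G80 thread capacity per SM */
    const int maxBlocksPerSM   = 8;     /* G80 block limit per SM     */
    const int maxThreadsPerBlk = 512;   /* G80 block-size limit       */
    int tiles[] = { 4, 8, 16, 32 };

    for (int i = 0; i < 4; i++) {
        int t = tiles[i];
        int threadsPerBlock = t * t;
        if (threadsPerBlock > maxThreadsPerBlk) {
            printf("%2dx%-2d tiles: %d threads/block exceeds the %d-thread block limit\n",
                   t, t, threadsPerBlock, maxThreadsPerBlk);
            continue;
        }
        int blocks = maxThreadsPerSM / threadsPerBlock;       /* capacity-limited    */
        if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM; /* block-count-limited */
        printf("%2dx%-2d tiles: %d blocks/SM, %3d threads/SM\n",
               t, t, blocks, blocks * threadsPerBlock);
        /* 4x4: 8 blocks, 128 threads/SM; 8x8: 8 blocks, 512; 16x16: 3 blocks, 768 */
    }
    return 0;
}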