Lecture 3: Laws, Equality, and Inside a Cell
John Cavazos
Dept. of Computer & Information Sciences, University of Delaware
www.cis.udel.edu/~cavazos/cisc879
Lecture 3: Overview • Know the Laws • All are NOT Created Equal • Inside a Cell
Two Important Laws • Amdahl’s Law • Gene Amdahl’s observation, 1967 • Speedup is limited by the serial portion of a program • Assumes a fixed workload and a fixed problem size • Gustafson’s Law • John Gustafson’s observation, 1988 • Rescues parallel processing from Amdahl’s Law • Proposes fixed time and increasing work • Sequential portions have a diminishing effect
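For reference, here are both laws in their usual form (the formulas are not written out on the slides, but the worked examples below follow them). For a program whose runtime has parallelizable fraction $p$ and serial fraction $s = 1 - p$, running on $n$ processors:

$$\text{Amdahl: } S(n) = \frac{1}{(1 - p) + p/n} \qquad\qquad \text{Gustafson: } S_{\text{scaled}}(n) = s + p \cdot n$$

Amdahl fixes the problem size and asks how much faster it runs; Gustafson fixes the time and asks how much more work gets done.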
Amdahl’s Law [Figure: a five-part program, 100 time units per part; parts 1, 3, and 5 are sequential, parts 2 and 4 are parallelizable] • Parallelize parts 2 and 4 with 2 processors: each drops from 100 to 50 time units • Total time: 500 → 400 units • Speedup: 500/400 = 1.25× (25%)
Amdahl’s Law (cont’d) [Figure: the same five-part program] • Parallelize parts 2 and 4 with 4 processors: each drops from 100 to 25 time units • Total time: 500 → 350 units • Speedup: 500/350 ≈ 1.43× (about 40%)
Amdahl’s Law (cont’d) [Figure: the same five-part program] • Parallelize parts 2 and 4 with infinite processors: their time drops to 0 • Total time: 500 → 300 units • Speedup: 500/300 ≈ 1.67×, only about 70% faster even with infinite processors • Multicore doesn’t look very appealing!
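A quick sanity check of the three Amdahl slides, plugging the slides’ 60/40 serial/parallel split into the formula above (a minimal sketch, not from the original lecture):

#include <stdio.h>

/* The slides' program: 300 of 500 time units (60%) are serial,
 * 200 of 500 (parts 2 and 4, 40%) are parallelizable. */
static double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main(void)
{
    printf("2 processors:  %.2fx\n", amdahl(0.4, 2.0));  /* 1.25x: 25% faster   */
    printf("4 processors:  %.2fx\n", amdahl(0.4, 4.0));  /* ~1.43x: ~40% faster */
    printf("n -> infinity: %.2fx\n", 1.0 / (1.0 - 0.4)); /* ~1.67x: hard limit  */
    return 0;
}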
Gustafson’s Law [Figure: the same five-part program, but the boxes now contain units of work, not time] • With 2 processors, parts 2 and 4 each complete 200 units of work in 100 units of time • Still 500 units of time, but 700 units of work • Scaled speedup: 700/500 = 1.4× (40%)
Gustafson’s Law (cont’d) [Figure: same program, units of work] • With 4 processors, parts 2 and 4 each complete 400 units of work in 100 units of time • Still 500 units of time, but 1100 units of work • Scaled speedup: 1100/500 = 2.2× (120%)
Gustafson’s Law (cont’d) • Gustafson’s key observation: as processor counts grow, people scale up the problem size • Serial bottlenecks do not grow with problem size • Increasing the processor count then gives near-linear scaled speedup • 20 processors are roughly twice as fast as 10 • This is why supercomputers are successful • More processors allow larger dataset sizes Reference: http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
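The same check for the two Gustafson slides, again a minimal sketch using the scaled-speedup formula with the slides’ numbers:

#include <stdio.h>

/* Gustafson: the serial fraction s of the (fixed) runtime stays serial;
 * the parallel fraction does n processors' worth of work. */
static double gustafson(double s, double n) { return s + (1.0 - s) * n; }

int main(void)
{
    printf("2 processors: %.1fx\n", gustafson(0.6, 2.0)); /* 1.4x:  700 work / 500 time */
    printf("4 processors: %.1fx\n", gustafson(0.6, 4.0)); /* 2.2x: 1100 work / 500 time */
    return 0;
}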
Lecture 3: Overview • Know the Laws • All are NOT Created Equal • Inside a Cell
All Multicores Not Equal • Multicore CPUs and GPUs are very different! • CPUs run general-purpose programs well • GPUs run graphics (or similar programs) well • General-purpose programs have • Less parallelism • More complex control requirements • GPU programs are • Highly parallel • Arithmetic-intensive • Simple in their control requirements
Floating-Point Operations [Figure: 32-bit floating-point operations per second, CPU vs. GPU, over successive hardware generations] • GPUs have more computational units and take better advantage of them Slide Source: NVIDIA CUDA Programming Guide 1.1
CPUs versus GPUs CPUs devote lots of area to control and storage. GPUs devote most area to computational units. Slide Source: NVIDIA CUDA Programming Guide 1.1
CPU Programming Model • Scalar programming model • No native data parallelism • Few arithmetic units, occupying very little die area • Optimized for complex control • Optimized for low latency, not high bandwidth Slide Source: John Owens, EEC 227 Graphics Arch course
AMD K7 “Deerhound” Slide Source: John Owens, EEC 227 Graphics Arch course
GPU Programming Model • Streams • Collections of data records • Amenable to data parallelism • Kernels • Inputs and outputs are streams • Perform the same computation on each element of a stream • No dependencies between stream elements • Stream storage • Not a cache (input read once, output written once) • Exploits producer-consumer locality Slide Source: John Owens (EEC 227 Graphics Arch) and Pat Hanrahan (Stream Prog. Env., GP^2 Workshop)
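In this model a kernel is just a pure function mapped over every stream element. A minimal C sketch of the idea (illustrative only; this is not a GPU API):

/* A stream "kernel": out[i] depends only on the inputs at index i,
 * so every iteration could run in parallel on a separate ALU. */
void saxpy_kernel(int n, float a, const float *x, const float *y, float *out)
{
    for (int i = 0; i < n; i++)   /* conceptually: all i execute at once */
        out[i] = a * x[i] + y[i];
}

Because no element depends on another, the hardware is free to run thousands of these element computations concurrently.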
Lecture 3: Overview • Know the Laws • All are NOT Created Equal • Inside a Cell
Cell B.E. Design Goals • An accelerator extension to Power • Exploits parallelism and achieves high frequency • Sustains high memory bandwidth through DMA • Designed for flexibility • Heterogeneous architecture • PPU for control and general-purpose code • SPUs for computation-intensive code with little control flow • Applicable to a wide variety of applications • The Cell architecture has characteristics of both a CPU and a GPU.
Cell Chip Highlights • 241M Transistors • 9 cores, 10 threads • >200 GFlops (SP) • >20 GFlops (DP) • > 300 GB/s EIB • 3.2 GHz shipping • Top freq. 4.0 GHz (in lab) Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Cell Details • Heterogeneous multicore architecture • Power Processor Element (PPE) for control tasks • Synergistic Processor Elements (SPEs) for data-intensive processing • SPE features • No cache • Large unified register file • Memory Flow Controller (MFC) • Interface to the high-performance Element Interconnect Bus (EIB) Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Cell PPE Details • Power Processor Element (PPE) • General-purpose 64-bit PowerPC RISC processor • 2-way hardware multithreaded • L1: 32 KB instruction + 32 KB data • L2: 512 KB • Runs the operating system and handles program control Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Cell SPE Details • Synergistic Processor Element (SPE) • 128-bit SIMD architecture • Dual-issue • Register file: 128 × 128-bit • Local store: 256 KB (no cache; loads and stores go here) • Simplified branch architecture • No hardware branch predictor • Compiler-managed branch hints • Memory Flow Controller (MFC) • Dedicated DMA engine, up to 16 outstanding requests (see the sketch below) Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
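To make the MFC concrete, here is a minimal SPU-side sketch of pulling one chunk of main memory into the local store over DMA (assumed usage of the SDK's spu_mfcio.h interface; the buffer name and chunk size are illustrative):

#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 16384  /* 16 KB, the largest single DMA transfer */

/* Destination buffer in the 256 KB local store; DMA favors 128-byte alignment. */
static volatile uint8_t buf[CHUNK] __attribute__((aligned(128)));

void fetch_chunk(uint64_t ea)            /* ea: effective address in main memory */
{
    const unsigned int tag = 0;          /* DMA tag group to track this transfer */
    mfc_get(buf, ea, CHUNK, tag, 0, 0);  /* enqueue: main memory -> local store  */
    mfc_write_tag_mask(1 << tag);        /* select which tag group to wait on    */
    mfc_read_tag_status_all();           /* block until the transfer completes   */
}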
Compiler Tools • GNU-based C/C++ compiler (Sony) • ppu-gcc/ppu-g++ generate PPU code • spu-gcc/spu-g++ generate SPU code • GDB debugger • Supports both PPU and SPU debugging • Different modes of execution Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
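A typical build therefore has two halves that get linked together. A hedged example (file names are illustrative; ppu-embedspu wraps an SPU executable as a PPU object so the PPU linker can embed it):

spu-gcc -O2 -o hello_spu hello_spu.c                        # compile the SPU program
ppu-embedspu hello_spu hello_spu hello_spu-embed.o          # embed the SPU binary in a PPU object
ppu-gcc -O2 -o hello hello_ppu.c hello_spu-embed.o -lspe2   # link the PPU program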
Compiler Tools • The XL C/C++ compiler • ppuxlc/ppuxlc++ generate PPU code • spuxlc/spuxlc++ generate SPU code • Optimization levels • -O0: almost no optimization • -O2: strong, low-level optimization • -O3: intense, low-level optimizations with basic loop optimizations • -O4: all of -O3, plus detailed loop analysis and good whole-program analysis • -O5: all of -O4, plus detailed whole-program analysis Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
Performance Tools • GNU-based tools • OProfile: system-level profiler (PPU only) • gprof: generates call graphs • IBM tools • Static analysis tool (spu_timing) • Annotates an assembly file with scheduling and instruction-issue estimates • Dynamic analysis tool (Cell BE system simulator) • Can run your code on an x86 machine • Can collect a variety of statistics Slide Source: Michael Perrone, MIT 6.189 Fall 2007 course
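For instance, spu_timing works on the assembly the compiler emits (file name illustrative; the annotated output lands next to the input):

spu-gcc -O3 -S kernel_spu.c    # emit assembly: kernel_spu.s
spu_timing kernel_spu.s        # writes kernel_spu.s.timing with per-instruction pipeline estimates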
Compiling with the SDK • README_build_env.txt (IMPORTANT! read this first) • Provides details on the build environment features, including files, structure, and variables • make.footer • Specifies all of the build rules needed to properly build binaries • Must be included in all SDK Makefiles (referenced relatively if $CELL_TOP is not defined) • Includes make.header • make.header • Specifies definitions needed to process the Makefiles • Includes make.env • make.env • Specifies the default compilers and tools to be used by make • make.footer and make.header should not be modified Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Compiling with the SDK • Defaults to gcc • Selected in make.env by variables set to gcc or xlc • PPU32_COMPILER • PPU64_COMPILER • PPU_COMPILER (overrides PPU32_COMPILER and PPU64_COMPILER) • SPU_COMPILER • Can be changed from the command line, e.g.: • PPU_COMPILER=xlc SPU_COMPILER=xlc make • make -e PPU64_COMPILER:=gcc -e PPU32_COMPILER:=gcc -e SPU_COMPILER:=gcc • export PPU_COMPILER=xlc SPU_COMPILER=xlc ; make Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Compiling with the SDK • Use CELL_TOP or maintain the relative directory structure:

ifdef CELL_TOP
include $(CELL_TOP)/make.footer
else
include ../../../make.footer
endif

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Makefile variables • DIRS • list of subdirectories to build first • PROGRAM_ppu PROGRAMS_ppu • 32-bit PPU program (or list of programs) to build. • PROGRAM_ppu64 PROGRAMS_ppu64 • 64-bit PPU program (or list of programs) to build. • PROGRAM_spu PROGRAMS_spu • SPU program (or list of programs) to build. • If written as a standalone binary, can run without being embedded in a PPU program. Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Makefile variables (cont’d) • LIBRARY_embed LIBRARY_embed64 • Creates a linked library from an SPU program to be embedded into a 32-bit or 64-bit PPU program. • CC_OPT_LEVEL • Optimization level for compiler to use • CFLAGS, CFLAGS_gcc, CFLAGS_xlc • Additional flags for compiler to use (general or specific to gcc/xlc) • TARGET_INSTALL_DIR • Specifies where built targets are installed Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
Sample Project [Figure: directory layout of an SDK sample project; see the sketch below] Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
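A minimal sketch of how these variables combine in a small project, modeled on the SDK samples (directory and program names are illustrative; IMPORTS is assumed here as the make.footer variable that links the embedded SPU library into the PPU binary):

# Makefile (top level): build the SPU side before the PPU side
DIRS := spu ppu
ifdef CELL_TOP
include $(CELL_TOP)/make.footer
else
include ../../make.footer
endif

# spu/Makefile: build the SPU program and wrap it as an embeddable library
PROGRAM_spu   := simple_spu
LIBRARY_embed := lib_simple_spu.a
ifdef CELL_TOP
include $(CELL_TOP)/make.footer
else
include ../../../make.footer
endif

# ppu/Makefile: build the PPU program and pull in the embedded SPU code
PROGRAM_ppu := simple
IMPORTS     := ../spu/lib_simple_spu.a -lspe2
ifdef CELL_TOP
include $(CELL_TOP)/make.footer
else
include ../../../make.footer
endif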
Next Time • Chapters 1-3 of the NVIDIA CUDA Programming Guide, version 1.1 • All of Chapter 29 from GPU Gems 2 • Links on the course website