Explore parallel architectures, levels of parallelism, memory organization, network topologies, programming modes, performance measures, and optimization techniques.
CPS 258 Announcements • http://www.cs.duke.edu/~nikos/cps258 • Lecture calendar with slides • Pointers to related material
Parallel Architectures (continued)
Parallelism Levels • Job • Program • Instruction • Bit
Parallel Architectures • Pipelining • Multiple execution units • Superscalar • VLIW • Multiple processors
Pipelining Example
  for i = 1:n
    z(i) = x(i) + y(i);
  end
• Prologue • Loop body • Epilogue
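A minimal C sketch (an illustration, not the course's code; the name vadd is made up) of the structure software pipelining gives this loop, with the prologue, steady-state body, and epilogue made explicit:

  /* Software-pipelined vector add: the store for iteration i-1 overlaps
     with the load/add for iteration i. */
  void vadd(const double *x, const double *y, double *z, int n)
  {
      if (n <= 0) return;
      double sum = x[0] + y[0];          /* prologue: start first iteration */
      for (int i = 1; i < n; i++) {      /* loop body: steady state         */
          z[i - 1] = sum;                /*   store result of iteration i-1 */
          sum = x[i] + y[i];             /*   compute result of iteration i */
      }
      z[n - 1] = sum;                    /* epilogue: drain the pipeline    */
  }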
Generic Computer • CPU • Memory • Bus
Memory Organization • Distributed memory • Shared memory
Network Topologies • Ring • Torus • Tree • Star • Hypercube • Cross-bar
Flynn’s Taxonomy • SISD • SIMD • MISD • MIMD
Programming Modes • Data Parallel • Message Passing • Shared Memory • Multithreaded (control parallelism)
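A hedged sketch of the shared-memory, data-parallel mode, assuming OpenMP as one possible realization (the slides do not name a library); each thread works on a chunk of the iterations over the same shared arrays:

  #include <omp.h>

  void vadd_shared(const double *x, const double *y, double *z, int n)
  {
      #pragma omp parallel for           /* fork threads, split the iterations */
      for (int i = 0; i < n; i++)
          z[i] = x[i] + y[i];
  }

In the message-passing mode the same computation would instead scatter x and y across processes (e.g., with MPI) and gather z, since the processes share no address space.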
Performance measures • FLOPS • Theoretical vs. actual • MFLOPS, GFLOPS, TFLOPS • Speedup(P) = execution time on 1 processor / execution time on P processors • Benchmarks: LINPACK, LAPACK, SPEC (System Performance Evaluation Cooperative)
Speedup • Speedup(P) = best execution time on 1 processor / execution time on P processors • Parallel Efficiency(P) = Speedup(P) / P
Example Suppose a program runs in 10 seconds and 80% of the time is spent in a subroutine F that can be perfectly parallelized. What is the best speedup I can achieve?
Amdahl’s Law Speedup is limited by the fraction of the code that must be executed sequentially
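Working the example above through Amdahl's law, with serial fraction s = 0.2:

  Speedup(P) = 1 / (s + (1 - s)/P) = 1 / (0.2 + 0.8/P)

As P grows the second term vanishes, so the best possible speedup is 1/0.2 = 5: the 10-second run can never take less than its 2 seconds of serial work.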
“Secrets” to Success • Overlap communication with computation • Communicate minimally • Avoid synchronizations • T = T_comp + T_comm + T_sync
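A hedged MPI sketch of the first point, overlapping communication with computation; the halo-exchange setting and the helpers compute_interior / compute_boundary are illustrative assumptions, not code from the course:

  #include <mpi.h>

  void compute_interior(double *u, int n);   /* hypothetical: work needing no halo  */
  void compute_boundary(double *u, int n);   /* hypothetical: work needing the halo */

  void exchange_and_compute(double *u, int n, int left, int right)
  {
      MPI_Request req[4];
      /* Post non-blocking halo receives and sends. */
      MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(&u[n - 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
      MPI_Isend(&u[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

      compute_interior(u, n);                      /* T_comm hidden behind T_comp */
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
      compute_boundary(u, n);
  }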
Processors • CISC • Many complex, multi-cycle instructions • Few registers • Instructions access memory directly • RISC • Few “orthogonal” instructions • Large register files • Memory accessed only through load/store units
Common μProcessors • Intel X86 • Advanced Micro Devices • Transmeta Crusoe • PowerPC • SPARC • MIPS
Cache Memory Hierarchies • Memory speed improves much more slowly than processor speed • Memory Locality • Spatial • Temporal • Data Placement • Direct mapping • Set associative • Data Replacement
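A small C sketch of spatial locality, assuming row-major storage as in C: the first traversal touches consecutive addresses and reuses every cache line it brings in, the second strides through memory and does not:

  #define N 1024
  double a[N][N];

  double sum_rows(void)                 /* good spatial locality */
  {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];             /* consecutive addresses */
      return s;
  }

  double sum_cols(void)                 /* poor spatial locality */
  {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];             /* stride of N doubles   */
      return s;
  }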
Example • Matrix multiplication • As dot products • As sub-matrix products
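A hedged C sketch of the two formulations named on the slide; the block size BS is an assumption that would be tuned to the cache, and the blocked version assumes C is zeroed and n is a multiple of BS:

  #define BS 64

  /* C(i,j) as a dot product of row i of A with column j of B. */
  void matmul_dot(int n, const double *A, const double *B, double *C)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              double dot = 0.0;
              for (int k = 0; k < n; k++)
                  dot += A[i*n + k] * B[k*n + j];
              C[i*n + j] = dot;
          }
  }

  /* C as a sum of BS-by-BS sub-matrix products, so each block of A and B
     is reused from cache many times before being evicted. */
  void matmul_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += BS)
          for (int kk = 0; kk < n; kk += BS)
              for (int jj = 0; jj < n; jj += BS)
                  for (int i = ii; i < ii + BS; i++)
                      for (int k = kk; k < kk + BS; k++)
                          for (int j = jj; j < jj + BS; j++)
                              C[i*n + j] += A[i*n + k] * B[k*n + j];
  }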
Vector Architectures • Single Instruction Multiple Data • Exploit uniformity of operations • Multiple execution units • Pipelining • Hardware assisted loops • Vectorizing compilers
Compiler techniques for vectorization • Scalar expansion • Statement reordering • Loop transformations: distribution, reordering, merging, splitting, skewing, unrolling, peeling, collapsing
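A hedged before/after sketch of the first technique, scalar expansion (variable and function names are illustrative): the scalar t makes every iteration write the same location, and expanding it into an array removes that dependence so the loop can be vectorized:

  void before(int n, const double *x, const double *y, double *z)
  {
      double t;                        /* one scalar shared by all iterations */
      for (int i = 0; i < n; i++) {
          t = x[i] + y[i];
          z[i] = t * t;
      }
  }

  void after(int n, const double *x, const double *y, double *z, double *t)
  {
      for (int i = 0; i < n; i++) {    /* t[] has n elements: scalar expanded */
          t[i] = x[i] + y[i];
          z[i] = t[i] * t[i];
      }
  }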
Epilogue • Distributed memory systems win • The memory hierarchy is critical to performance • Compilers do a good job with ILP, but programmers are still important • System modeling is still inadequate for tuning to optimal performance