Explore parallel architectures, levels of parallelism, memory organization, network topologies, programming modes, performance measures, and optimization techniques.
CPS 258 Announcements • http://www.cs.duke.edu/~nikos/cps258 • Lecture calendar with slides • Pointers to related material
Parallel Architectures (continued)
Parallelism Levels • Job • Program • Instruction • Bit
Parallel Architectures • Pipelining • Multiple execution units • Superscalar • VLIW • Multiple processors
Pipelining Example
  for i = 1:n
    z(i) = x(i) + y(i);
  end
• Prologue • Loop body • Epilogue
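A minimal C sketch (an illustration, not the course's code; the name vadd is made up) of the structure software pipelining gives this loop, with the prologue, steady-state body, and epilogue made explicit:

  /* Software-pipelined vector add: the store for iteration i-1 overlaps
     with the load/add for iteration i. */
  void vadd(const double *x, const double *y, double *z, int n)
  {
      if (n <= 0) return;
      double sum = x[0] + y[0];          /* prologue: start first iteration */
      for (int i = 1; i < n; i++) {      /* loop body: steady state         */
          z[i - 1] = sum;                /*   store result of iteration i-1 */
          sum = x[i] + y[i];             /*   compute result of iteration i */
      }
      z[n - 1] = sum;                    /* epilogue: drain the pipeline    */
  }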
Generic Computer • CPU • Memory • Bus
Memory Organization • Distributed memory • Shared memory
Network Topologies • Ring • Torus • Tree • Star • Hypercube • Cross-bar
Flynn’s Taxonomy • SISD • SIMD • MISD • MIMD
Programming Modes • Data Parallel • Message Passing • Shared Memory • Multithreaded (control parallelism)
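A hedged sketch of the shared-memory, data-parallel mode, assuming OpenMP as one possible realization (the slides do not name a library); each thread works on a chunk of the iterations over the same shared arrays:

  #include <omp.h>

  void vadd_shared(const double *x, const double *y, double *z, int n)
  {
      #pragma omp parallel for           /* fork threads, split the iterations */
      for (int i = 0; i < n; i++)
          z[i] = x[i] + y[i];
  }

In the message-passing mode the same computation would instead scatter x and y across processes (e.g., with MPI) and gather z, since the processes share no address space.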
Performance measures • FLOPS • Theoretical vs. actual • MFLOPS, GFLOPS, TFLOPS • Speedup(P) = execution time on 1 processor / execution time on P processors • Benchmarks: LINPACK, LAPACK, SPEC (System Performance Evaluation Cooperative)
Speedup • Speedup(P) = best execution time on 1 processor / execution time on P processors • Parallel Efficiency(P) = Speedup(P) / P
Example Suppose a program runs in 10 seconds and 80% of the time is spent in a subroutine F that can be perfectly parallelized. What is the best speedup I can achieve?
Amdahl’s Law Speedup is limited by the fraction of the code that must be executed sequentially
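Working the example above through Amdahl's law, with serial fraction s = 0.2:

  Speedup(P) = 1 / (s + (1 - s)/P) = 1 / (0.2 + 0.8/P)

As P grows the second term vanishes, so the best possible speedup is 1/0.2 = 5: the 10-second run can never take less than its 2 seconds of serial work.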
“Secrets” to Success • Overlap communication with computation • Communicate minimally • Avoid synchronizations • T = T_comp + T_comm + T_sync
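A hedged MPI sketch of the first point, overlapping communication with computation; the halo-exchange setting and the helpers compute_interior / compute_boundary are illustrative assumptions, not code from the course:

  #include <mpi.h>

  void compute_interior(double *u, int n);   /* hypothetical: work needing no halo  */
  void compute_boundary(double *u, int n);   /* hypothetical: work needing the halo */

  void exchange_and_compute(double *u, int n, int left, int right)
  {
      MPI_Request req[4];
      /* Post non-blocking halo receives and sends. */
      MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(&u[n - 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
      MPI_Isend(&u[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

      compute_interior(u, n);                      /* T_comm hidden behind T_comp */
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
      compute_boundary(u, n);
  }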
Processors • CISC • Many complex, multi-cycle instructions • Few registers • Instructions access memory directly • RISC • Few “orthogonal” instructions • Large register files • Memory accessed only through load/store units
Common μProcessors • Intel X86 • Advanced Micro Devices • Transmeta Crusoe • PowerPC • SPARC • MIPS
Cache Memory Hierarchies • Memory speed improves much more slowly than processor speed • Memory Locality • Spatial • Temporal • Data Placement • Direct mapping • Set associative • Data Replacement
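A small C sketch of spatial locality, assuming row-major storage as in C: the first traversal touches consecutive addresses and reuses every cache line it brings in, the second strides through memory and does not:

  #define N 1024
  double a[N][N];

  double sum_rows(void)                 /* good spatial locality */
  {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];             /* consecutive addresses */
      return s;
  }

  double sum_cols(void)                 /* poor spatial locality */
  {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];             /* stride of N doubles   */
      return s;
  }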
Example • Matrix multiplication • As dot products • As sub-matrix products
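A hedged C sketch of the two formulations named on the slide; the block size BS is an assumption that would be tuned to the cache, and the blocked version assumes C is zeroed and n is a multiple of BS:

  #define BS 64

  /* C(i,j) as a dot product of row i of A with column j of B. */
  void matmul_dot(int n, const double *A, const double *B, double *C)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              double dot = 0.0;
              for (int k = 0; k < n; k++)
                  dot += A[i*n + k] * B[k*n + j];
              C[i*n + j] = dot;
          }
  }

  /* C as a sum of BS-by-BS sub-matrix products, so each block of A and B
     is reused from cache many times before being evicted. */
  void matmul_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += BS)
          for (int kk = 0; kk < n; kk += BS)
              for (int jj = 0; jj < n; jj += BS)
                  for (int i = ii; i < ii + BS; i++)
                      for (int k = kk; k < kk + BS; k++)
                          for (int j = jj; j < jj + BS; j++)
                              C[i*n + j] += A[i*n + k] * B[k*n + j];
  }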
Vector Architectures • Single Instruction Multiple Data • Exploit uniformity of operations • Multiple execution units • Pipelining • Hardware assisted loops • Vectorizing compilers
Compiler techniques for vectorization • Scalar expansion • Statement reordering • Loop transformations: distribution, reordering, merging, splitting, skewing, unrolling, peeling, collapsing
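A hedged before/after sketch of the first technique, scalar expansion (variable and function names are illustrative): the scalar t makes every iteration write the same location, and expanding it into an array removes that dependence so the loop can be vectorized:

  void before(int n, const double *x, const double *y, double *z)
  {
      double t;                        /* one scalar shared by all iterations */
      for (int i = 0; i < n; i++) {
          t = x[i] + y[i];
          z[i] = t * t;
      }
  }

  void after(int n, const double *x, const double *y, double *z, double *t)
  {
      for (int i = 0; i < n; i++) {    /* t[] has n elements: scalar expanded */
          t[i] = x[i] + y[i];
          z[i] = t[i] * t[i];
      }
  }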
Epilogue • Distributed memory systems win • The memory hierarchy is critical to performance • Compilers do a good job with ILP, but programmers are still important • System modeling is still inadequate for tuning to optimal performance