OR682/Math685/CSI700 Lecture 12 Fall 2000
High Performance Computing • Computer architectures • Computer memory • Floating-point operations • Compilers • Profiling • Optimization of programs
My Goals • Provide you with resources for dealing with large computational problems • Explain the basic workings of high-performance computers • Talk about compilers and their capabilities • Discuss debugging (this week) and profiling (12/13) tools in Matlab
Changes in Architectures • Then (1980s): • supercomputers (cost: $10M and up) • only a few in existence (often at government laboratories); custom made • (peak) speed: several hundred “megaflops” (millions of floating-point operations per second) • Now: • (clusters of) microprocessors (inexpensive) • can be easily assembled by almost anyone • commercial, off-the-shelf components • (peak) speed: gigaflops and higher
Modern “Supercomputers” • Multiprocessor • Based on commercial RISC (reduced instruction set computer) processors • Linked by high-speed interconnect or network • Communication by message passing (perhaps disguised from the user) • Hierarchy of local/non-local memory
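For concreteness, here is a minimal sketch of the message-passing style in C using MPI. MPI is not covered in these slides; it is simply one common library for this kind of communication, and the buffer size and tag below are illustrative choices. Rank 0 sends a small vector to rank 1:

    /* Minimal message-passing sketch (C + MPI).  Compile with an MPI
       wrapper such as mpicc and run on at least 2 processes. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            /* Send 4 doubles to rank 1, message tag 0. */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }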
Why Learn This? • Compilers have limited ability to match your algorithm/calculation to the computer • You will be better able to write software that will execute efficiently, by playing to the strengths of the compiler and the machine
Some Basics • Memory • main memory • cache • registers • Languages • machine • assembly • high-level (Fortran, C/C++) • Matlab?
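A small C sketch of why the memory hierarchy matters: C stores two-dimensional arrays row by row, so the loop order decides whether successive accesses fall in the same cache line. The array size here is an illustrative assumption; actual timings depend on the machine.

    /* Same sum, two loop orders.  The first walks memory sequentially
       (cache-friendly); the second strides N doubles between accesses
       and takes far more cache misses on typical hardware. */
    #include <stdio.h>
    #define N 1024

    static double a[N][N];

    int main(void) {
        double sum = 0.0;
        int i, j;

        for (i = 0; i < N; i++)        /* row order: consecutive addresses */
            for (j = 0; j < N; j++)
                sum += a[i][j];

        for (j = 0; j < N; j++)        /* column order: stride of N doubles */
            for (i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %g\n", sum);
        return 0;
    }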
Microprocessors • Old Technology: CISC (complex instruction set computer) • assembly language instructions that resembled high-level language instructions • many tasks could be performed in hardware • reduced (slow) memory fetches for instructions • reduced (precious) memory requirements
Weaknesses of CISC? • None until relatively recently • Harder for compilers to exploit • Complicated processor design • hard to fit on a single chip • Hard to pipeline • pipeline: processing multiple instructions simultaneously in small stages
RISC Processors • Reduce # of instructions, and fit processor on a single chip (faster, cheaper, more reliable) • Other operations must be performed in software (slower) • All instructions the same length (32 bits); pipelining is possible • More instructions must be fetched from memory • Programs take up more space in memory
Early Examples • First became prominent in (Unix-based) scientific workstations: • Sun • Silicon Graphics • Apollo • IBM RS/6000
Characteristics of RISC • Instruction pipelining • Pipelining of floating-point operations • Uniform instruction length • Delayed branching • Load/Store architecture • Simple addressing modes • Note: modern RISC processors are no longer “simple” architectures
Pipelines • Clock & clock speed (cycles) • Goal: 1 instruction per clock cycle • Divide instruction into stages, & overlap: • instruction fetch (from memory) • instruction decode • operand fetch (from register or memory) • execute • write back (of results to register or memory)
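A rough worked example (ignoring stalls and startup effects): with a 5-stage pipeline issuing one instruction per cycle, n instructions finish in about 5 + (n - 1) cycles instead of 5n. For n = 1000 that is 1004 cycles rather than 5000, so throughput approaches the goal of one instruction per clock cycle.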
Complications • Complicated memory fetch • stalls pipeline • Branch • may be a “no op” [harmless] • otherwise, need to flush pipeline (wasteful) • Branches occur every 5-10 instructions in many programs
Pipelined Floating-Point • Execution of a floating-point instruction can take many clock cycles (especially for multiplication and division) • These operations can also be pipelined • Modern hardware has reduced the time for a floating-point operation to 1-3 cycles
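A sketch of how this shows up in code: in a straightforward sum, each add depends on the previous result, so a multi-cycle floating-point add leaves the pipeline mostly idle; splitting the sum into independent partial sums (by hand, or by the compiler) keeps it busy. The array size and the factor of four are illustrative choices.

    /* Dependent vs. independent floating-point adds (C sketch). */
    #include <stdio.h>
    #define N 100000               /* divisible by 4 */

    static double x[N];

    int main(void) {
        int i;
        double sum = 0.0, s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

        for (i = 0; i < N; i++)
            x[i] = 1.0;

        /* One accumulator: every add waits for the previous add to finish. */
        for (i = 0; i < N; i++)
            sum += x[i];

        /* Four accumulators: independent adds can overlap in the pipeline. */
        for (i = 0; i < N; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }

        printf("%g %g\n", sum, s0 + s1 + s2 + s3);
        return 0;
    }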
Uniform Instruction Length • CISC instructions came in varying length • length not known until it was decoded • this could stall the pipeline • For RISC processors, instructions are uniform length (32 bits) • no additional memory access required to decode instruction • Pipeline flows more smoothly
Delayed Branches • Branches lead to pipeline inefficiencies • Three possible approaches: • branch delay slot • potentially useful instruction inserted (by compiler) after the branch instruction • branch prediction • based on previous result of branch during execution of program • conditional execution (next slide)
Conditional Execution • Replace a branch with a conditional instruction:

  IF (B < C) THEN
    A = D
  ELSE
    A = E
  END

becomes

  COMPARE B < C
  IF TRUE   A = D
  IF FALSE  A = E

• Pipeline operates effectively.
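In C the same idea is often written with the conditional operator; many compilers turn this into a conditional-move or predicated instruction rather than a branch, although whether they do depends on the compiler and the target processor. The function names below are only for illustration.

    /* Branch version and branch-free version of the same assignment. */
    double with_branch(double b, double c, double d, double e) {
        double a;
        if (b < c)          /* a real branch: mispredictions flush the pipeline */
            a = d;
        else
            a = e;
        return a;
    }

    double branch_free(double b, double c, double d, double e) {
        /* Candidate for a conditional move: both values are available,
           the comparison merely selects one of them. */
        return (b < c) ? d : e;
    }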
Load/Store Architectures • Instructions limit memory references: • only explicit load and store instructions (no implicit or cascaded memory references) • only one memory reference per instruction • Keeps instructions the same length • Keeps pipeline simple (only one execution stage) • Memory load/store requests are already “slower” (complications would further stall the pipeline) • by the time the result is needed, the load/store is complete (you hope)
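As a sketch, a statement such as a[i] = b[i] + c on a load/store machine is compiled into explicit loads, a register-to-register add, and a store. The instruction names in the comment are schematic, not any particular instruction set.

    /* C source and, per iteration, the kind of code a load/store
       machine executes for it. */
    void add_scalar(double *a, const double *b, double c, int n) {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + c;   /* roughly:
                                  LOAD  r1, b[i]     (memory reference)
                                  ADD   r2, r1, rc   (registers only; c kept in rc)
                                  STORE r2, a[i]     (memory reference)  */
    }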
Simple Addressing Modes • Avoid: • complicated address calculations • multiple memory references per instruction • Simulate complicated requests with a sequence of simple instructions
2nd Generation RISC Processors • Faster clock rate (smaller processor) • “Superscalar” processors: Duplicate compute elements (execute two instructions at once) • hard for compiler writers, hardware designers • “Superpipelining”: double the number of stages in the pipeline (each one twice as fast) • Speculative computation
For Next Class • Homework: see web site • Reading: • Dowd: chapters 3, 4, and 5