Instruction Set Issues • MIPS easy • Instructions are only committed at MEM/WB transition • Other architectures are more difficult • Instructions may update state early • FP more difficult • Memory updating ops (e.g. string moves)
Instruction Set Issues (cont.) • Difficult architectural features • “Odd” bits of state (e.g. condition codes) • May need saving/restoring on exceptions • Implicitly set condition codes • Complicate branch resolution • Explicit setting helps here (still a RAW hazard) • Multicycle operations • Widely differing execution times, lots of potential data hazards, etc.
Instruction Set Issues • VAX suffers from many of these problems • Solution: pipeline the microcode • Intel 32-bit 80x86 processors since 1995 use a similar approach
A.5. Handling Multicycle Operations • MIPS: FP operations • Long latency (EX repeated) • Several functional units • Structural hazards • Data hazards
DLX: FP Design • Four functional units: • Integer ALU • as before • FP multiplier • also used for integer multiplication • FP adder • addition, subtraction and conversion • FP divider • also used for integer division
Hazards • Divides • Structural hazard (the divide unit is not pipelined) • Multiple register writes possible in a cycle • Out-of-order completion • WAW hazards • Exception-handling complications • RAW hazards increase
Potential RAW Hazards • Example (SPARC syntax):
    ldd   [%fp-8], %f4
    fmuld %f4, %f6, %f0
    faddd %f0, %f8, %f2
    std   %f2, [%fp-16]
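The dependence chain in this example can be traced with a small model. This is a minimal sketch, not a description of any real pipeline: it assumes in-order issue of one instruction per cycle with full forwarding, and the `LATENCY` values (extra cycles a dependent instruction must wait) are illustrative assumptions, as is the helper name `raw_stalls`.

```python
# Assumed, illustrative latencies: extra cycles a consumer must wait
# beyond the normal one-cycle spacing between instructions.
LATENCY = {"ldd": 1, "fmuld": 6, "faddd": 3, "std": 0}

def raw_stalls(program):
    """Return the RAW stall cycles seen by each instruction, in order."""
    ready = {}    # register -> earliest cycle a consumer may issue
    cycle = 0     # issue cycle of the previous instruction
    stalls = []
    for op, sources, dest in program:
        earliest = cycle + 1
        issue = max([earliest] + [ready.get(reg, 0) for reg in sources])
        stalls.append(issue - earliest)
        if dest is not None:
            ready[dest] = issue + LATENCY[op] + 1
        cycle = issue
    return stalls

# The four-instruction sequence from the slide: (opcode, sources, destination).
PROGRAM = [
    ("ldd",   ["%fp"],        "%f4"),
    ("fmuld", ["%f4", "%f6"], "%f0"),
    ("faddd", ["%f0", "%f8"], "%f2"),
    ("std",   ["%f2", "%fp"], None),
]

print(raw_stalls(PROGRAM))
```

Under these assumed latencies every instruction after the load stalls on its predecessor, which is why long-latency FP units make RAW hazards more frequent.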
Multiple Writes • Up to four instructions may need to write in the same cycle • Solution • Track writes in ID • Stall at instruction issue (simpler: all stalls at one point) • Alternatively: • Stall at MEM or WB • Stall the instruction with the shorter latency (letting the longer-latency instruction write first may release instructions waiting on a RAW hazard)
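One way to track writes in ID is a reservation shift register for the write port. The sketch below is an assumption-laden illustration: the class name `WritePortTracker`, the single write port, and the cycle counts in the usage lines are all hypothetical.

```python
class WritePortTracker:
    """Reservation shift register for a single register-file write port."""

    def __init__(self, depth=32):
        # reserved[k] is True if the write port is taken k cycles from now.
        self.reserved = [False] * depth

    def try_issue(self, cycles_until_write):
        """Reserve the port for an issuing instruction; False means stall in ID."""
        if self.reserved[cycles_until_write]:
            return False
        self.reserved[cycles_until_write] = True
        return True

    def tick(self):
        """Advance one clock cycle: every reservation moves one slot closer."""
        self.reserved = self.reserved[1:] + [False]

tracker = WritePortTracker()
tracker.try_issue(7)             # long-latency op writes back in 7 cycles
for _ in range(3):
    tracker.tick()               # three cycles later its slot is 4 cycles away
conflict = tracker.try_issue(4)  # a shorter op wanting the same cycle must stall
tracker.tick()
retry = tracker.try_issue(4)     # one cycle later the slot is free
print(conflict, retry)
```

Because the check happens at issue, all stalls are taken at one point in the pipeline, which matches the "simpler" option on the slide.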
WAW Hazards • Example:
    faddd %f4, %f6, %f2
    …                        ! Integer op
    ldd   [%fp-8], %f2
WAW Hazards (cont.) • Rare • Compiler scheduling may still produce such unlikely instruction sequences, so the hazard must be detected • Solutions: • Stall issue of ldd • Prevent write by faddd
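Both solutions can be sketched as checks made when the later instruction issues. This is a hedged illustration only: `pending_writes`, `waw_stall_cycles`, and `cancel_older_write` are hypothetical names, and the cycle counts are made-up values.

```python
def waw_stall_cycles(pending_writes, dest, latency):
    """Cycles to hold the new instruction so its write lands after the older one.

    pending_writes maps a destination register to the cycles remaining
    before an in-flight instruction writes it.
    """
    remaining = pending_writes.get(dest)
    if remaining is None:
        return 0  # nobody else is going to write this register
    return max(0, remaining - latency + 1)

def cancel_older_write(pending_writes, dest):
    """Alternative: drop the older write, since its result would be overwritten."""
    return {reg: cycles for reg, cycles in pending_writes.items() if reg != dest}

# Assumed state: faddd will write %f2 in 4 cycles; ldd would write it in 2.
pending = {"%f2": 4}
print(waw_stall_cycles(pending, "%f2", 2))
print(cancel_older_write(pending, "%f2"))
```

Stalling keeps the hardware simple; cancelling the older write avoids the stall but needs a way to suppress a result that is already in flight.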
Maintaining Precise Exceptions • Out-of-order completion:
    fdivd %f2, %f4, %f0
    faddd %f10, %f8, %f10    ! add and sub complete long before fdivd
    fsubd %f12, %f14, %f12
• Sub may cause an exception after the add has completed, but before the div has • No longer precise
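The completion order for this sequence can be worked out from issue order and unit latency. A minimal sketch, assuming one instruction issues per cycle and using illustrative latencies that are not taken from the slides:

```python
# Assumed, illustrative execution latencies in cycles.
LATENCY = {"fdivd": 24, "faddd": 4, "fsubd": 4}

def completion_order(ops):
    """Return opcodes sorted by the cycle in which they complete."""
    finish = [(issue_cycle + LATENCY[op], op) for issue_cycle, op in enumerate(ops)]
    return [op for _, op in sorted(finish)]

print(completion_order(["fdivd", "faddd", "fsubd"]))
```

With these numbers the divide issues first but finishes last, so an exception raised by the subtract arrives when the add has already written its result and the divide has not.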
Maintaining Precise Exceptions • It may be very difficult to handle exceptions precisely • E.g. the add has destroyed one of its operands! • Four solutions: • Accept imprecise exceptions • But precise exceptions are needed for VM & IEEE FP • Allow switching between precise and imprecise modes
Maintaining Precise Exceptions • Solutions (cont.) • Buffer results until earlier instructions complete • Buffers may grow very large, and extensive forwarding required • History files: restore original register values • Future files: store new register values • Software executes intervening instructions to get “up to date” before returning from exception
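The history-file idea can be sketched as an undo log of overwritten register values. This is an illustration under stated assumptions: the class `HistoryFile`, the sequence numbers, and the register values are hypothetical, and WAW hazards are assumed to be prevented elsewhere.

```python
class HistoryFile:
    """Register file plus a log of old values, so early writes can be undone."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.log = []  # entries: (program-order sequence number, register, old value)

    def write(self, seq, reg, value):
        """Record the old value, then perform the (possibly early) write."""
        self.log.append((seq, reg, self.registers[reg]))
        self.registers[reg] = value

    def rollback(self, faulting_seq):
        """Undo writes made by instructions later than the faulting one."""
        for seq, reg, old_value in reversed(self.log):
            if seq > faulting_seq:
                self.registers[reg] = old_value
        self.log = [entry for entry in self.log if entry[0] <= faulting_seq]

regs = HistoryFile({"%f0": 1.0, "%f10": 2.0, "%f12": 3.0})
regs.write(1, "%f10", 10.0)  # faddd completes early
regs.write(2, "%f12", 30.0)  # fsubd completes early
regs.rollback(0)             # the earlier fdivd (seq 0) raises an exception
print(regs.registers)
```

A future file takes the opposite approach: new values are kept aside and copied into the architectural registers only when all earlier instructions have completed.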
Maintaining Precise Exceptions • Solutions (cont.) • Hybrid scheme • Instructions are only issued when it is certain that preceding instructions will not cause an exception • May require stalling the pipeline
Performance of the MIPS FP Pipeline • Structural Hazards (divide unit) • Very low: 0-2 cycles per FP operation • RAW hazards • Divide: 12-24 cycles, average 14.2 • Add: 0.7-2.3 cycles, average 1.7 • In general, about 0.5 × latency
Overall MIPS FP Performance • Stalls per instruction • 0.65-1.21 cycles • Average: 0.87 • 82% from FP RAW hazards
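As a quick check on these figures, the share attributed to FP RAW hazards can be converted into stalls per instruction. The variable names below are only for illustration; the two input numbers come from the slide.

```python
average_stalls = 0.87  # average stalls per instruction (from the slide)
raw_share = 0.82       # fraction of stalls caused by FP RAW hazards (from the slide)

raw_stalls = average_stalls * raw_share
other_stalls = average_stalls - raw_stalls

print(round(raw_stalls, 2), round(other_stalls, 2))
```

So roughly 0.71 of the 0.87 stall cycles per instruction come from FP RAW hazards, leaving about 0.16 for everything else.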
A.6. Putting It All Together: MIPS R4000 Pipeline • 64-bit instruction set • Eight-stage pipeline • superpipelining • IF + IS: instruction fetch • RF: decode/register fetch • EX: execution • DF + DS + TC: data cache access • WB: write back
MIPS R4000 Pipeline • Performance • Load delay: two cycles • Branch delay: three cycles • Delayed branch (one cycle) • Predict-not-taken strategy, with annulling • Increased forwarding requirements • Three stages between EX and WB now
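Both delays can be derived from stage positions in the eight-stage pipeline. The sketch assumes that load data is available at the end of DS, that the branch condition is resolved at the end of EX, and that results are forwarded; the helper name `delay` is hypothetical.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def delay(produced_at_end_of, needed_at_start_of):
    """Extra cycles a following instruction loses waiting for a result."""
    return STAGES.index(produced_at_end_of) - STAGES.index(needed_at_start_of)

# Assumption: load data is ready at the end of DS; a consumer needs it in EX.
load_delay = delay("DS", "EX")
# Assumption: the branch outcome is known at the end of EX; the target fetch starts in IF.
branch_delay = delay("EX", "IF")

print(load_delay, branch_delay)
```

Under these assumptions the stage distances reproduce the two-cycle load delay and three-cycle branch delay listed on the slide.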
MIPS R4000 Pipeline • Floating Point • Three functional units • Divider, multiplier, adder • Shared components (8 sub-units) • Latency: 2–112 cycles • Initiation interval: 1–111 cycles • Complicated stall handling
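Latency and initiation interval together determine how long a run of operations on one unit takes. A minimal sketch, assuming independent back-to-back operations on the same unit with no other stalls; `total_cycles` is a hypothetical helper, and the two parameter pairs are the extremes quoted on the slide.

```python
def total_cycles(n_ops, latency, interval):
    """Cycles until the last of n independent operations on one unit completes."""
    return latency + (n_ops - 1) * interval

fastest = total_cycles(3, latency=2, interval=1)      # shortest FP operation
slowest = total_cycles(3, latency=112, interval=111)  # longest FP operation

print(fastest, slowest)
```

When the initiation interval is close to the latency, as at the slow end of the range, the unit is barely pipelined and consecutive operations are almost fully serialized.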
MIPS R4000 Pipeline • Performance: • CPI between 1.2 and 2.8 for SPEC92 benchmarks • Average: 2.0 • Integer: 1.54 • FP: 2.48 • Integer apps: mainly branch delays • FP apps: mainly FP data hazard stalls (RAW)