Lecture 5: Interrupts, Superscalar. Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008. Admin. Homework #1 Due Today Homework #2 Assigned Reading H&P Chapter 2 & 3 (suggested) Research papers (not yet ready to read, but will be soon!):
Lecture 5: Interrupts, Superscalar Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008
Admin • Homework #1 Due Today • Homework #2 Assigned • Reading • H&P Chapter 2 & 3 (suggested) • Research papers (not yet ready to read, but will be soon!): • Hinton et al: “The Microarchitecture of the Pentium 4 Processor” • Palacharla, Jouppi, and Smith: “Complexity-Effective Superscalar Processors” • Akkary, Rajwar, and Srinivasan: “Checkpoint Processing and Recovery”
Review: Hazards • Data Hazards • RAW: the only one that can occur in a simple 5-stage pipeline • WAR, WAW • Data Forwarding (Register Bypassing) • send data from one stage to another, bypassing the register file • still have a load-use delay • Structural Hazards • replicate hardware, scheduling • Control Hazards • compute condition and target early (delayed branch)
Review: Dynamic Branch Prediction • Solution: 2-bit counter where the prediction changes only if it mispredicts twice • Increment for taken, decrement for not-taken • States: 00, 01, 10, 11 • Helps when target is known before condition • [Figure: state diagram with two Predict Taken states and two Predict Not Taken states; transitions labeled T (taken) and NT (not taken)]
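The counter rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the counter saturates at 00 and 11 and that states 10 and 11 predict taken; the names `TwoBitCounter` and `count_mispredictions` are made up for this example.

```python
class TwoBitCounter:
    """Hypothetical 2-bit predictor: states 0..3, predict taken when state >= 2."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Increment for taken, decrement for not-taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, initial_state=3):
    """Run one counter over a list of branch outcomes (True = taken)."""
    counter = TwoBitCounter(initial_state)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken three times, falls through once, then taken again.
loop_outcomes = [True, True, True, False, True, True, True]
print(count_mispredictions(loop_outcomes))
```

Starting from the strongly-taken state, the single not-taken outcome moves the counter to 10, which still predicts taken, so the loop exit costs one misprediction rather than two.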
Review: Correlating Branches • Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch's behavior) • Tournament: choose between alternative predictors • How do you choose? • [Figure: the branch address and a 2-bit global branch history select one of the 2-bit per-branch predictors, which supplies the prediction]
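A rough sketch of the correlating idea: keep a short global history of recent outcomes and use it, together with the branch address, to pick a 2-bit counter. The class name, table size, indexing function, and initial counter value below are all assumptions for illustration, not the organization of any particular processor.

```python
class CorrelatingPredictor:
    """Hypothetical predictor: (branch address, global history) selects a 2-bit counter."""

    def __init__(self, history_bits=2, table_entries=16):
        self.history_bits = history_bits
        self.table_entries = table_entries
        self.history = 0          # most recent outcomes, newest in the low bit
        self.counters = {}        # (index, history) -> counter value 0..3

    def _key(self, pc):
        return ((pc >> 2) % self.table_entries, self.history)

    def predict(self, pc):
        # Unseen entries start at 1 (weakly not taken), an arbitrary choice.
        return self.counters.get(self._key(pc), 1) >= 2

    def update(self, pc, taken):
        key = self._key(pc)
        value = self.counters.get(key, 1)
        self.counters[key] = min(3, value + 1) if taken else max(0, value - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run(predictor, pc, outcomes):
    """Return a list of booleans, True where the prediction was wrong."""
    misses = []
    for taken in outcomes:
        misses.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return misses


# A branch that strictly alternates taken / not taken.
alternating = [i % 2 == 0 for i in range(100)]
misses = run(CorrelatingPredictor(), 0x40, alternating)
print(sum(misses), sum(misses[50:]))
```

Once the history-indexed counters warm up, the alternating pattern is predicted without further misses, whereas a single 2-bit counter can mispredict this pattern on every outcome, depending on its starting state.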
Review: Need Address @ Same Time as Prediction • Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken) • Note: must check for a branch match now, since can't use the wrong branch address • Procedure return addresses are predicted with a stack • [Figure: the PC of the instruction to fetch is compared (=) against BTB entries 0 … n-1, each holding a branch PC, a predicted PC, and a taken/not-taken prediction; Yes: use the predicted PC; No: not a branch]
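The BTB lookup, including the tag check that the note calls for, might look like the sketch below. It assumes a direct-mapped table, 4-byte instructions, and a fall-through of PC + 4; `BranchTargetBuffer` and its methods are hypothetical names.

```python
class BranchTargetBuffer:
    """Hypothetical direct-mapped BTB: each entry is (branch_pc, target, predict_taken)."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _index(self, pc):
        return (pc >> 2) % self.num_entries

    def insert(self, pc, target, predict_taken=True):
        self.entries[self._index(pc)] = (pc, target, predict_taken)

    def next_fetch_pc(self, pc):
        entry = self.entries[self._index(pc)]
        # The stored branch PC acts as a tag: without this match check, a
        # different instruction mapping to the same entry would be redirected.
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + 4  # not a known taken branch: fetch sequentially


btb = BranchTargetBuffer()
btb.insert(0x100, 0x200)
print(hex(btb.next_fetch_pc(0x100)))  # known branch, predicted taken
print(hex(btb.next_fetch_pc(0x140)))  # same index, different PC: falls through
```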
Interrupts and Exceptions • Unnatural change in control flow • warning: varying terminology • “exception” sometimes refers to all cases • “Trap” software trap, hardware trap • Exception is potential problem with program • condition occurs within the processor • segmentation fault • bus error • divide by 0 • Don’t want my bug to crash the entire machine • page fault (virtual memory…)
Interrupts and Exceptions • Interrupt is external event • devices: disk, network, keyboard, etc. • clock for timeslicing • These are useful events, must do something when they occur. • Trap is user-requested exception • operating system call (syscall)
Handling an Exception/Interrupt • Invoke specific kernel routine based on type of interrupt • interrupt/exception handler • Must determine what caused interrupt • could use software to examine each device • PC = interrupt_handler • Vectored Interrupts • PC = interrupt_table[i] • Kernel initializes table at boot time • Clear the interrupt • May return from interrupt (RETT) to different process (e.g., context switch) • Similar mechanism is used to handle interrupts, exceptions, traps • [Figure: User Program (ld, add, st, div, beq, ld, sub, bne) transfers control to the Interrupt Handler, which ends with RETT]
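As a software analogy for vectored interrupts, the table lookup `PC = interrupt_table[i]` can be sketched as an array of handler functions filled in once at "boot". The cause numbers and handler names here are invented for illustration; real hardware performs the indexing itself and the handlers are kernel code.

```python
def handle_divide_by_zero():
    return "terminate offending process"


def handle_timer():
    return "run scheduler"


def handle_disk():
    return "wake process waiting on I/O"


# Built once at "boot"; the index plays the role of the interrupt number i.
INTERRUPT_TABLE = [handle_divide_by_zero, handle_timer, handle_disk]


def take_interrupt(cause, interrupted_pc):
    """Dispatch on the cause and report where RETT would resume by default."""
    handler = INTERRUPT_TABLE[cause]   # PC = interrupt_table[i]
    action = handler()
    # The kernel may instead return to a different process (context switch).
    return action, interrupted_pc


print(take_interrupt(1, 0x400))
```

The non-vectored alternative would jump to one fixed handler that polls each device in software to find the cause.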
Execution Mode • What if interrupt occurs while in interrupt handler? • Problem: could lose information for one interrupt • clearing interrupt #1 clears both #1 and #2 • Solution: disable interrupts • Disabling interrupts is a protected operation • Only the kernel can execute it • user vs. kernel mode • mode bit in CPU status register • Other protected operations • installing interrupt handlers • manipulating CPU state (saving/restoring status registers) • Changing modes • interrupts • system calls (syscall instruction)
A System Call (syscall) • Special instruction to change modes and invoke service • read/write I/O device • create new process • Invokes specific kernel routine based on argument • kernel defined interface • May return from trap to different process (e.g., context switch) • RETT, instruction to return to user process • [Figure: User Program (ld, add, st, TA 6, beq, ld, sub, bne) traps into the Kernel's Trap Handler, which calls the Service Routines and ends with RETT]
Interrupts/exceptions • classifying interrupts • terminal (fatal) vs. restartable (control returned to program) • synchronous (internal) vs. asynchronous (external) • user vs. coerced • maskable (ignorable) vs. non-maskable • between instructions vs. within instruction
Precise Exceptions • “unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state” • 2nd postulate of quantum mechanics • system → processor, observation → interrupt • what is the “well-defined” state? • von Neumann: “sequential, instruction atomic execution” • precise state at interrupt • all instructions older than interrupt are complete • all instructions younger than interrupt haven’t started • implies interrupts are taken in program order • necessary for VM (why?), “highly recommended” by IEEE
Pipelining Complications • Interrupts (Exceptions) • 5 instructions executing in 5 stage pipeline • How to stop the pipeline? • How to restart the pipeline? • Who caused the interrupt?
Stage  Problem interrupts occurring
IF     Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic interrupt
MEM    Page fault on data fetch; misaligned memory access; memory-protection violation
Pipelining Complications • Simultaneous exceptions in > 1 pipeline stage • Load with data page fault in MEM stage • Add with instruction page fault in IF stage • Solution #1 • Interrupt status vector per instruction • Defer check until last stage, kill state update if exception • Solution #2 • Interrupt ASAP • Restart everything that is incomplete • Another advantage for state update late in pipeline!
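Solution #1 can be illustrated with a small model: each instruction carries its own exception status, whichever stage detects a problem records it, and the status is only examined when the instruction reaches the last stage. The sketch below is a simplification under that assumption; `Instr` and `retire_in_order` are hypothetical names, and the list order stands for program order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    name: str
    exception: Optional[str] = None  # set by whichever stage detects a problem


def retire_in_order(instrs):
    """Check exception status at the last stage, oldest instruction first.

    Returns (committed instruction names, (faulting name, exception) or None).
    Instructions younger than the faulting one never update state.
    """
    committed = []
    for ins in instrs:
        if ins.exception is not None:
            return committed, (ins.name, ins.exception)
        committed.append(ins.name)
    return committed, None


# The add faults in IF before the older load faults in MEM, but the load's
# exception is the one reported because the check happens in program order.
program = [
    Instr("sub"),
    Instr("load", "data page fault"),
    Instr("add", "instruction page fault"),
]
print(retire_in_order(program))
```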
Interrupts/Exceptions are Nasty • odd bits of state must be precise (e.g., condition codes) • delayed branches • what if instruction in delay slot takes an interrupt? • Out of order Writes (e.g., autoinc, multicycle ops) • must undo write (e.g., future-file, history-file) • some machines had precise interrupts only in integer pipe • sufficient for implementing VM (e.g., VAX/Alpha) • Lucky for us, there’s a nice, clean way to handle precise state • We’ll see how this is done in a couple of lectures ...
Pipelining x86 • The x86 ISA has some really nasty instructions - how did Intel ever figure out how to build a pipelined x86 microprocessor? • Solution: at runtime, “crack” x86 instructions (macro-ops) into RISC-like micro-ops • First used in P6 (Pentium Pro) • Used in all subsequent x86 processors, including those from AMD • What are the potential challenges for implementing this solution?
Where are We • principles of pipelining • pipeline depth: clock rate vs. number of stalls (CPI) • hazards • structural • data (RAW, WAR, WAW) • control • Branch prediction • multi-cycle operations • structural hazards, WAW hazards • interrupts • precise state • Next up: CPI < 1
Getting CPI < 1: Issuing Multiple Instructions/Cycle • “Flynn bottleneck” • single issue performance limit is CPI = IPC = 1 • hazards + overhead → CPI >= 1 (IPC <= 1) • diminishing returns from deep pipelines • solution: issue multiple instructions per cycle • Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler (statically scheduled) or by HW (Tomasulo; dynamically scheduled) • First superscalar: IBM America → RS6000 → Power1 • Pentium 4, IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA-8000
Base Implementation • statically scheduled (in-order) superscalar • executes unmodified sequential programs • Figures out on its own what can be done in parallel • e.g., Sun UltraSPARC, Alpha 21164 • we’ll start with this one • What has to change from single issue to multiple issue?
CPI < 1: Issuing Multiple Instructions/Cycle • Ex 2-way superscalar: 1 FP & 1 anything else • Fetch 64 bits/clock cycle; Int on left, FP on right • Can only issue 2nd instruction if 1st instruction issues • More ports for FP registers to do FP load & FP op in a pair
Type              Pipe stages (one column per clock cycle)
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
• 1 cycle load delay expands to 3 instructions in SS • instruction in right half can’t use it, nor instructions in next slot
Implications of Superscalar • what is involved in • fetching two instructions per cycle? • decoding two instructions per cycle? • executing two ALU operations per cycle? • accessing the data cache twice per cycle? • writing back two results per cycle? • what about 4 or 8 instructions per cycle?
Wide Fetch • Fetch N instructions per cycle • if instructions are sequential... • and on same cache line → nothing really • and on different cache lines → banked I$ + combining network • if instructions are not sequential... • more difficult • two serial I$ accesses (access1 → predict target → access2)? no • note: embedded branches OK as long as predicted NT • serial access + prediction in parallel • if prediction is T, discard serial part after branch • Trace Cache…
Wide Decode • Decode N instructions per cycle • actually decoding instructions? • easy if fixed length instructions (multiple decoders) • harder (but possible) if variable length • reading input register values? • 2N register read ports (register file read latency ~2N) • actually less than 2N, since most values come from bypasses • what about the stall logic to enforce RAW dependences?
N² Dependence Check Logic • remember stall logic for single issue pipeline • rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W) • same for rs2(D) • full-bypassing reduces to rs1(D) == rd(D/X) && op(D/X) == LOAD • doubling issue width (N) quadruples stall logic! • not only 2 instructions in D, but two instructions in every stage • (rs1(D1) == rd(D/X1) && op(D/X1) == LOAD) || (rs1(D1) == rd(D/X2) && op(D/X2) == LOAD) • repeat for rs1(D2), rs2(D1), rs2(D2) • also check dependence of 2nd instruction on 1st: rs1(D2) == rd(D1) • “N² dependence cross-check” • for N-wide pipeline, stall (and bypass) circuits grow as N²
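To make the quadratic growth concrete, the sketch below counts register comparisons and evaluates the stall condition for one decode group. It assumes full bypassing (so only loads in D/X force a stall), two source registers per instruction, and that any dependence inside the decode group stalls it; the tuple layouts and function names are choices made for this example.

```python
def stall_comparisons(n):
    """Register comparisons needed by an n-wide pipeline under the assumptions above."""
    against_dx = 2 * n * n       # 2 sources x n decoding instrs x n instrs in D/X
    within_group = n * (n - 1)   # 2 sources x n(n-1)/2 older/younger pairs in D
    return against_dx + within_group


def must_stall(decode_group, dx_group):
    """decode_group: list of (rd, rs1, rs2) in program order.
    dx_group: list of (op, rd) for the instructions currently in D/X."""
    for i, (_, rs1, rs2) in enumerate(decode_group):
        for src in (rs1, rs2):
            # Load-use check against every instruction in D/X.
            if any(op == "LOAD" and src == rd for op, rd in dx_group):
                return True
            # Cross-check against older instructions in the same decode group.
            if any(src == older_rd for older_rd, _, _ in decode_group[:i]):
                return True
    return False


for width in (1, 2, 4, 8):
    print(width, stall_comparisons(width))
```

Under this counting the number of comparators goes 2, 10, 44, 184 for widths 1, 2, 4, 8, so each doubling of N costs a bit more than four times as much.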
Superscalar Stalls • invariant: stalls propagate upstream to younger instructions • what if older instruction in issue “pair” (inst0) stalls? • younger instruction (inst1) stalls too, cannot pass it • what if younger instruction (inst1) stalls? • can the oldest instruction from the next group (inst2) move up? • Rigid pipeline: No • Fluid pipeline: Yes
Wide Execute • What does it take to execute N instructions per cycle? • multiple execution units...N of every kind? • N ALUs? OK, ALUs are small • N FP dividers? no, FP dividers are huge (and fdiv is uncommon) • typically have some mix (proportional to instruction mix) • RS/6000: 1 ALU/memory/branch + 1 FP • Pentium: 1 any + 1 ALU • Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch • Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store
N² Bypass • N² bypass logic... OK • only 5-bit quantities • compare to generate 1-bit outcomes • similar to stall logic • N² bypass buses... not even close to OK • 32-bit or 64-bit quantities • broadcast, route, and multiplex (mux) • difficult to lay out and route all the wires • wide (SLOW) muxes • big design problem today
One Solution to N² Bypass: Clustering • group functional units into clusters • full bypass within cluster • no bypass between clusters • ~(N/k) inputs at each mux • ~(N/k)² routed buses in each cluster • steer instructions to different clusters • dependent instructions to same cluster • exploit intra-cluster bypass • static or dynamic steering is possible • e.g., Alpha 21264 • 4-wide, 300MHz • full bypass didn’t fit into 1 clock cycle • 2 clusters with full intra-cluster bypass
Wide Memory Access • what is involved in accessing memory for multiple instructions per cycle? • multi-banked D$ • requires bank assignment and conflict-detection logic • (rough) instruction mix: 20% loads, 15% stores • for width N, we need about 0.2*N load ports, 0.15*N store ports
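The port estimate is simple arithmetic on the instruction mix. A small sketch, assuming the 20%/15% mix quoted on the slide and rounding up to whole ports (the rounding and the function name `memory_ports` are assumptions of this example):

```python
import math


def memory_ports(width, load_pct=20, store_pct=15):
    """Estimate (load ports, store ports) for an issue width, rounding up."""
    load_ports = math.ceil(width * load_pct / 100)
    store_ports = math.ceil(width * store_pct / 100)
    return load_ports, store_ports


for width in (2, 4, 8):
    print(width, memory_ports(width))
```

With this rounding, a 4-wide machine needs one load port and one store port, and an 8-wide machine needs two of each.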
Wide Writeback • what is involved in writing back multiple instructions per cycle? • nothing too special, just another port on the register file • everything else is taken care of earlier in pipeline • adding ports isn’t free, though • increases area • increases access latency
Multiple Issue Summary • superscalar problem spots • fetch, branch prediction → trace cache? • decode (N² dependence cross-check) • execute (N² bypass) → clustering?
Can we do better? • Problem: Stall in ID stage if any data hazard. • Your task: Teams of two, propose a design to eliminate these stalls.
MULD F2, F3, F4   (long latency…)
ADDD F1, F2, F3
ADDD F3, F4, F5
ADDD F1, F4, F5
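As a starting point for the exercise, the following sketch lists the register hazards in the four-instruction sequence. It assumes the operand order is destination, source 1, source 2, and it checks every older/younger pair without accounting for intervening writes, so it is a rough classifier rather than a scheduler; `find_hazards` is a hypothetical helper.

```python
def find_hazards(instrs):
    """instrs: list of (opcode, dest, src1, src2) in program order.
    Returns (kind, older_index, younger_index, register) tuples."""
    hazards = []
    for i, (_, dest_i, *srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            _, dest_j, *srcs_j = instrs[j]
            if dest_i in srcs_j:
                hazards.append(("RAW", i, j, dest_i))  # younger reads older's result
            if dest_j in srcs_i:
                hazards.append(("WAR", i, j, dest_j))  # younger overwrites older's source
            if dest_i == dest_j:
                hazards.append(("WAW", i, j, dest_i))  # both write the same register
    return hazards


sequence = [
    ("MULD", "F2", "F3", "F4"),
    ("ADDD", "F1", "F2", "F3"),
    ("ADDD", "F3", "F4", "F5"),
    ("ADDD", "F1", "F4", "F5"),
]
for hazard in find_hazards(sequence):
    print(hazard)
```

Only the first ADDD truly needs the MULD result (RAW on F2). The last two ADDDs read only F4 and F5, so what holds them back behind the stalled ADDD is in-order issue plus the WAR on F3 and the WAW on F1.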
Next Time • Dynamic Scheduling • Read papers • HW #2 Assigned