1.74k likes | 1.98k Views
Review of ECE301: Computer Organization. AMD Barcelona: 4 cores. Abstractions. Abstraction helps us deal with complexity Hide lower-level detail Instruction set architecture (ISA) The hardware/software interface Application binary interface The ISA plus system software interface
E N D
Review of ECE301: Computer Organization AMD Barcelona: 4 cores ECE610 - Fall 2013
Abstractions • Abstraction helps us deal with complexity • Hide lower-level detail • Instruction set architecture (ISA) • The hardware/software interface • Application binary interface • The ISA plus system software interface • Implementation • The details underlying and interface E. D. Dijkstra “… the main challenge of computer science is how not to get lost in the complexities of their own making.” ECE610 - Fall 2013
Defining Performance • Which airplane has the best performance? ECE610 - Fall 2013
Response Time and Throughput • Response time • How long it takes to do a task • Throughput • Total work done per unit time • e.g., tasks/transactions/… per hour • How are response time and throughput affected by • Replacing the processor with a faster version? • Adding more processors? • We’ll focus on response time for now… ECE610 - Fall 2013
Relative Performance • Define Performance = 1/Execution Time • “X is n time faster than Y” • Example: time taken to run a program • 10s on A, 15s on B • Execution TimeB / Execution TimeA= 15s / 10s = 1.5 • So A is 1.5 times faster than B ECE610 - Fall 2013
Measuring Execution Time • Elapsed time • Total response time, including all aspects • Processing, I/O, OS overhead, idle time • Determines system performance • CPU time • Time spent processing a given job • Discounts I/O time, other jobs’ shares • Comprises user CPU time and system CPU time • Different programs are affected differently by CPU and system performance ECE610 - Fall 2013
CPU Clocking • Operation of digital hardware governed by a constant-rate clock Clock period Clock (cycles) Data transferand computation Update state • Clock period: duration of a clock cycle • e.g., 250ps = 0.25ns = 250×10–12s • Clock frequency (rate): cycles per second • e.g., 4.0GHz = 4000MHz = 4.0×109Hz ECE610 - Fall 2013
CPU Time • Performance improved by • Reducing number of clock cycles • Increasing clock rate • Hardware designer must often trade off clock rate against cycle count ECE610 - Fall 2013
CPU Time Example • Computer A: 2GHz clock, 10s CPU time • Designing Computer B • Aim for 6s CPU time • Can do faster clock, but causes 1.2 × clock cycles • How fast must Computer B clock be? ECE610 - Fall 2013
Levels of Program Code • High-level language • Level of abstraction closer to problem domain • Provides for productivity and portability • Assembly language • Textual representation of instructions • Hardware representation • Binary digits (bits) • Encoded instructions and data ECE610 - Fall 2013
Instruction Count and CPI • Instruction Count for a program • Determined by program, ISA and compiler • Average cycles per instruction • Determined by CPU hardware • If different instructions have different CPI • Average CPI affected by instruction mix ECE610 - Fall 2013
CPI Example • Computer A: Cycle Time = 250ps, CPI = 2.0 • Computer B: Cycle Time = 500ps, CPI = 1.2 • Same ISA • Which is faster, and by how much? A is faster… …by this much ECE610 - Fall 2013
CPI in More Detail • If different instruction classes take different numbers of cycles • Weighted average CPI Relative frequency ECE610 - Fall 2013
CPI Example • Alternative compiled code sequences using instructions in classes A, B, C • Sequence 1: IC = 5 • Clock Cycles= 2×1 + 1×2 + 2×3= 10 • Avg. CPI = 10/5 = 2.0 • Sequence 2: IC = 6 • Clock Cycles= 4×1 + 1×2 + 1×3= 9 • Avg. CPI = 9/6 = 1.5 ECE610 - Fall 2013
Performance Summary The BIG Picture • Performance depends on • Algorithm: affects IC, possibly CPI • Programming language: affects IC, CPI • Compiler: affects IC, CPI • Instruction set architecture: affects IC, CPI, Tc ECE610 - Fall 2013
Power Trends • In CMOS IC technology (source: intel.com) ×30 5V → 1V ×1000 ECE610 - Fall 2013
Reducing Power • Suppose a new CPU has • 85% of capacitive load of old CPU • 15% voltage and 15% frequency reduction • The power wall • We can’t reduce voltage further • We can’t remove more heat • How else can we improve performance? ECE610 - Fall 2013
Uniprocessor Performance Constrained by power, instruction-level parallelism, memory latency ECE610 - Fall 2013
Multiprocessors • Multicore microprocessors • More than one processor per chip • Requires explicitly parallel programming • Compare with instruction level parallelism • Hardware executes multiple instructions at once • Hidden from the programmer • Hard to do • Programming for performance • Load balancing • Optimizing communication and synchronization (source: Intel Inc. via Embedded.com) ECE610 - Fall 2013
Manufacturing ICs • Yield: proportion of working dies per wafer ECE610 - Fall 2013
AMD Opteron X2 Wafer • X2: 300mm wafer, 117 chips, 90nm technology • X4: 45nm technology ECE610 - Fall 2013
Integrated Circuit Cost • Nonlinear relation to area and defect rate • Wafer cost and area are fixed • Defect rate determined by manufacturing process • Die area determined by architecture and circuit design ECE610 - Fall 2013
Example ECE610 - Fall 2013
SPEC CPU Benchmark • Programs used to measure performance • Supposedly typical of actual workload • Standard Performance Evaluation Corp (SPEC) • Develops benchmarks for CPU, I/O, Web, … • SPEC CPU2006 • Elapsed time to execute a selection of programs • Negligible I/O, so focuses on CPU performance • Normalize relative to reference machine • Summarize as geometric mean of performance ratios • CINT2006 (integer) and CFP2006 (floating-point) ECE610 - Fall 2013
CINT2006 for Opteron X4 2356 High cache miss rates ECE610 - Fall 2013
Processor design ECE610 - Fall 2013
Instruction Execution • PC instruction memory, fetch instruction • Register numbers register file, read registers • Depending on instruction class • Use ALU to calculate • Arithmetic result • Memory address for load/store • Branch target address • Access data memory for load/store • PC target address or PC + 4 ECE610 - Fall 2013
MIPS Instruction Set Microprocessor without Interlocked Pipeline Stages ECE610 - Fall 2013
Introduction • CPU performance factors • Instruction count • Determined by ISA and compiler • CPI and Cycle time • Determined by CPU hardware • We will examine two MIPS implementations • A simplified version • A more realistic pipelined version • Simple subset, shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq ECE610 - Fall 2013
Three Instruction Classes ECE610 - Fall 2013
CPU Overview ECE610 - Fall 2013
Multiplexers • Can’t just join wires together • Use multiplexers ECE610 - Fall 2013
Control ECE610 - Fall 2013
Full Datapath ECE610 - Fall 2013
Datapath With Control ECE610 - Fall 2013
R-Type Instruction ECE610 - Fall 2013
Load Instruction ECE610 - Fall 2013
Branch-on-Equal Insn. ECE610 - Fall 2013
Performance Issues • Longest delay determines clock period • Critical path: load instruction • Instruction memory register file ALU data memory register file • Not feasible to vary period for different instructions • Violates design principle • Making the common case fast • We will improve performance by pipelining ECE610 - Fall 2013
Pipeline Performance • Assume time for stages is • 100ps for register read or write • 200ps for other stages • Compare pipelined datapath with single-cycle datapath ECE610 - Fall 2013
Pipeline Performance Single-cycle (Tc= 800ps) Pipelined (Tc= 200ps) ECE610 - Fall 2013
MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch from memory • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register ECE610 - Fall 2013
Pipeline Speedup • If all stages are balanced • i.e., all take the same time • Time between instructionspipelined= Time between instructionsnonpipelined Number of stages • If not balanced, speedup is less • Speedup due to increased throughput • Latency (time for each instruction) does not decrease ECE610 - Fall 2013
Hazards • Situations that prevent starting the next instruction in the next cycle • Structure hazards • A required resource is busy • Data hazard • Need to wait for previous instruction to complete its data read/write • Control hazard • Deciding on control action depends on previous instruction ECE610 - Fall 2013
Data Hazards • An instruction depends on completion of data access by a previous instruction • add $s0, $t0, $t1sub $t2, $s0, $t3 ECE610 - Fall 2013
Forwarding (aka Bypassing) • Use result when it is computed • Don’t wait for it to be stored in a register • Requires extra connections in the datapath ECE610 - Fall 2013
Load-Use Data Hazard • Can’t always avoid stalls by forwarding • If value not computed when needed • Can’t forward backward in time! ECE610 - Fall 2013
Code Scheduling to Avoid Stalls • Reorder code to avoid use of load result in the next instruction • C code for A = B + E; C = B + F; lw $t1, 0($t0) lw $t2, 4($t0) add $t3, $t1, $t2 sw $t3, 12($t0) lw $t4, 8($t0) add $t5, $t1, $t4 sw $t5, 16($t0) lw $t1, 0($t0) lw$t2, 4($t0) lw$t4, 8($t0) add $t3, $t1, $t2 sw $t3, 12($t0) add $t5, $t1, $t4 sw $t5, 16($t0) stall stall 13 cycles 11 cycles ECE610 - Fall 2013
Control Hazards • Branch determines flow of control • Fetching next instruction depends on branch outcome • Pipeline can’t always fetch correct instruction • Still working on ID stage of branch • In MIPS pipeline • Need to compare registers and compute target early in the pipeline • Add hardware to do it in ID stage ECE610 - Fall 2013
Stall on Branch • Wait until branch outcome determined before fetching next instruction ECE610 - Fall 2013