
Review of ECE301: Computer Organization

Presentation Transcript


  1. Review of ECE301: Computer Organization AMD Barcelona: 4 cores ECE610 - Fall 2013

  2. Abstractions • Abstraction helps us deal with complexity • Hide lower-level detail • Instruction set architecture (ISA) • The hardware/software interface • Application binary interface • The ISA plus system software interface • Implementation • The details underlying an interface • E. W. Dijkstra: “… the main challenge of computer science is how not to get lost in the complexities of their own making.”

  3. Defining Performance • Which airplane has the best performance?

  4. Response Time and Throughput • Response time • How long it takes to do a task • Throughput • Total work done per unit time • e.g., tasks/transactions/… per hour • How are response time and throughput affected by • Replacing the processor with a faster version? • Adding more processors? • We’ll focus on response time for now…

  5. Relative Performance • Define Performance = 1/Execution Time • “X is n times faster than Y” means Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n • Example: time taken to run a program • 10s on A, 15s on B • Execution Time_B / Execution Time_A = 15s / 10s = 1.5 • So A is 1.5 times faster than B
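As a minimal sketch of the definition on this slide, the ratio can be computed directly from the two execution times. The function name `relative_performance` is an illustrative choice, not something from the lecture.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that "X is n times faster than Y".

    Uses Performance = 1 / Execution Time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    return time_y / time_x


# Slide example: 10 s on A, 15 s on B.
n = relative_performance(time_x=10.0, time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```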

  6. Measuring Execution Time • Elapsed time • Total response time, including all aspects • Processing, I/O, OS overhead, idle time • Determines system performance • CPU time • Time spent processing a given job • Discounts I/O time, other jobs’ shares • Comprises user CPU time and system CPU time • Different programs are affected differently by CPU and system performance

  7. CPU Clocking • Operation of digital hardware governed by a constant-rate clock • [Figure: clock waveform; each clock period covers data transfer and computation, followed by a state update] • Clock period: duration of a clock cycle • e.g., 250ps = 0.25ns = 250×10^-12 s • Clock frequency (rate): cycles per second • e.g., 4.0GHz = 4000MHz = 4.0×10^9 Hz
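Clock period and clock frequency are reciprocals of each other, which is why the two examples on this slide (250ps and 4.0GHz) describe the same clock. A small sketch, with illustrative function names:

```python
def period_to_frequency_hz(period_s: float) -> float:
    """Clock frequency in Hz for a clock period given in seconds."""
    return 1.0 / period_s


def frequency_to_period_s(frequency_hz: float) -> float:
    """Clock period in seconds for a clock frequency given in Hz."""
    return 1.0 / frequency_hz


period_s = 250e-12  # 250 ps, from the slide
frequency_hz = period_to_frequency_hz(period_s)
print(f"{period_s * 1e12:.0f} ps -> {frequency_hz / 1e9:.1f} GHz")
```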

  8. CPU Time • CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate • Performance improved by • Reducing number of clock cycles • Increasing clock rate • Hardware designer must often trade off clock rate against cycle count

  9. CPU Time Example • Computer A: 2GHz clock, 10s CPU time • Designing Computer B • Aim for 6s CPU time • Can do faster clock, but causes 1.2 × clock cycles • How fast must Computer B clock be?
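The question on slide 9 can be worked through with the relation CPU Time = Clock Cycles / Clock Rate. The sketch below is one way to do the arithmetic; the function names are illustrative, and the numbers are the ones given on the slide.

```python
def clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """Clock Cycles = CPU Time x Clock Rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate_hz(cycles: float, target_time_s: float) -> float:
    """Clock Rate = Clock Cycles / CPU Time."""
    return cycles / target_time_s


cycles_a = clock_cycles(cpu_time_s=10.0, clock_rate_hz=2e9)  # Computer A
cycles_b = 1.2 * cycles_a                                    # B needs 1.2x the cycles
rate_b = required_clock_rate_hz(cycles_b, target_time_s=6.0)
print(f"Computer B needs a {rate_b / 1e9:.1f} GHz clock")
```

Computer B therefore has to run at twice the clock rate of A to cut CPU time from 10s to 6s, because 20% of the gain is lost to the extra cycles.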

  10. Levels of Program Code • High-level language • Level of abstraction closer to problem domain • Provides for productivity and portability • Assembly language • Textual representation of instructions • Hardware representation • Binary digits (bits) • Encoded instructions and data

  11. Instruction Count and CPI • Instruction Count for a program • Determined by program, ISA and compiler • Average cycles per instruction • Determined by CPU hardware • If different instructions have different CPI • Average CPI affected by instruction mix

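  12. CPI Example • Computer A: Cycle Time = 250ps, CPI = 2.0 • Computer B: Cycle Time = 500ps, CPI = 1.2 • Same ISA • Which is faster, and by how much? • CPU Time_A = IC × 2.0 × 250ps = IC × 500ps • CPU Time_B = IC × 1.2 × 500ps = IC × 600ps • CPU Time_B / CPU Time_A = 600ps / 500ps = 1.2, so A is faster by a factor of 1.2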
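Because both computers implement the same ISA, the instruction count cancels and only CPI × cycle time matters. A short sketch of that comparison, with an illustrative function name:

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction = CPI x Clock Cycle Time."""
    return cpi * cycle_time_ps


time_a = time_per_instruction_ps(cpi=2.0, cycle_time_ps=250.0)
time_b = time_per_instruction_ps(cpi=1.2, cycle_time_ps=500.0)

faster = "A" if time_a < time_b else "B"
ratio = max(time_a, time_b) / min(time_a, time_b)
print(f"{faster} is faster by a factor of {ratio:.1f}")
```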

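  13. CPI in More Detail • If different instruction classes take different numbers of cycles • Clock Cycles = Σ_i (CPI_i × Instruction Count_i) • Weighted average CPI = Clock Cycles / Instruction Count = Σ_i (CPI_i × Instruction Count_i / Instruction Count) • Instruction Count_i / Instruction Count is the relative frequency of class i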

  14. CPI Example • Alternative compiled code sequences using instructions in classes A, B, C (the cycle counts below use CPI = 1, 2, 3 for classes A, B, C) • Sequence 1: IC = 5 (A: 2, B: 1, C: 2) • Clock Cycles = 2×1 + 1×2 + 2×3 = 10 • Avg. CPI = 10/5 = 2.0 • Sequence 2: IC = 6 (A: 4, B: 1, C: 1) • Clock Cycles = 4×1 + 1×2 + 1×3 = 9 • Avg. CPI = 9/6 = 1.5
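The weighted-average CPI calculation can be sketched as follows. The per-class CPI values (1, 2, 3) and the per-class instruction counts are read off the arithmetic on the slide; the dictionary layout and function names are illustrative.

```python
# CPI per instruction class, as implied by the slide's cycle-count arithmetic.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}


def clock_cycles(counts: dict) -> int:
    """Clock Cycles = sum over classes of CPI_i x Instruction Count_i."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Weighted average CPI = Clock Cycles / total Instruction Count."""
    return clock_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}  # IC = 5
sequence_2 = {"A": 4, "B": 1, "C": 1}  # IC = 6

for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    print(f"{name}: cycles = {clock_cycles(seq)}, avg CPI = {average_cpi(seq):.1f}")
```

Sequence 2 executes more instructions but takes fewer cycles, which is why instruction count alone is not a reliable performance measure.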

  15. Performance Summary: The BIG Picture • Performance depends on • Algorithm: affects IC, possibly CPI • Programming language: affects IC, CPI • Compiler: affects IC, CPI • Instruction set architecture: affects IC, CPI, T_c (clock cycle time)

  16. Power Trends • In CMOS IC technology: Power = Capacitive load × Voltage² × Frequency • Over the period shown: power ×30, voltage 5V → 1V, frequency ×1000 (source: intel.com)

  17. Reducing Power • Suppose a new CPU has • 85% of capacitive load of old CPU • 15% voltage and 15% frequency reduction • The power wall • We can’t reduce voltage further • We can’t remove more heat • How else can we improve performance?
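The effect of the reductions on slide 17 can be estimated with the usual CMOS dynamic-power relation, Power = Capacitive load × Voltage² × Frequency. The sketch below normalizes the old CPU to 1.0 in every factor; the function name is illustrative.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """CMOS dynamic power: P = C x V^2 x f (arbitrary units)."""
    return capacitive_load * voltage ** 2 * frequency


old_power = dynamic_power(1.0, 1.0, 1.0)
# New CPU: 85% of the capacitive load, 15% lower voltage, 15% lower frequency.
new_power = dynamic_power(0.85, 0.85, 0.85)

ratio = new_power / old_power
print(f"New CPU uses about {ratio:.2f} of the old CPU's power")
```

The voltage term is squared, so the ratio is 0.85⁴ ≈ 0.52: roughly half the power for a 15% drop in frequency. That lever stops working once voltage cannot be lowered any further, which is the power wall the slide describes.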

  18. Uniprocessor Performance • Constrained by power, instruction-level parallelism, memory latency

  19. Multiprocessors • Multicore microprocessors • More than one processor per chip • Requires explicitly parallel programming • Compare with instruction level parallelism • Hardware executes multiple instructions at once • Hidden from the programmer • Hard to do • Programming for performance • Load balancing • Optimizing communication and synchronization (source: Intel Inc. via Embedded.com)

  20. Manufacturing ICs • Yield: proportion of working dies per wafer

  21. AMD Opteron X2 Wafer • X2: 300mm wafer, 117 chips, 90nm technology • X4: 45nm technology

  22. Integrated Circuit Cost • Nonlinear relation to area and defect rate • Wafer cost and area are fixed • Defect rate determined by manufacturing process • Die area determined by architecture and circuit design
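The cost formulas for slide 22 did not survive in the transcript. The sketch below assumes the commonly taught textbook approximation: cost per die = wafer cost / (dies per wafer × yield), dies per wafer ≈ wafer area / die area, and yield = 1 / (1 + defects per area × die area / 2)². The wafer cost, die areas, and defect density are hypothetical values chosen only to show the nonlinearity.

```python
import math


def wafer_area_cm2(diameter_cm: float) -> float:
    """Area of a circular wafer."""
    return math.pi * (diameter_cm / 2.0) ** 2


def die_yield(defects_per_cm2: float, die_area_cm2: float) -> float:
    """Assumed yield model: 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2


def cost_per_die(wafer_cost: float, diameter_cm: float,
                 die_area_cm2: float, defects_per_cm2: float) -> float:
    """Cost per die = wafer cost / (dies per wafer * yield)."""
    dies_per_wafer = wafer_area_cm2(diameter_cm) / die_area_cm2  # ignores edge loss
    return wafer_cost / (dies_per_wafer * die_yield(defects_per_cm2, die_area_cm2))


# Hypothetical inputs: 300mm wafer costing 5000 units, 0.5 defects per cm^2.
small = cost_per_die(5000.0, 30.0, die_area_cm2=1.0, defects_per_cm2=0.5)
large = cost_per_die(5000.0, 30.0, die_area_cm2=2.0, defects_per_cm2=0.5)
print(f"1 cm^2 die: {small:.2f}, 2 cm^2 die: {large:.2f}, ratio: {large / small:.2f}")
```

Under this model, doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them work.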

  23. Example

  24. SPEC CPU Benchmark • Programs used to measure performance • Supposedly typical of actual workload • Standard Performance Evaluation Corp (SPEC) • Develops benchmarks for CPU, I/O, Web, … • SPEC CPU2006 • Elapsed time to execute a selection of programs • Negligible I/O, so focuses on CPU performance • Normalize relative to reference machine • Summarize as geometric mean of performance ratios • CINT2006 (integer) and CFP2006 (floating-point)
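The summary step described on this slide (normalize each program to a reference machine, then take the geometric mean of the ratios) can be sketched as below. The benchmark names and times are hypothetical placeholders, not SPEC data.

```python
import math


def spec_ratio(reference_time_s: float, measured_time_s: float) -> float:
    """Performance ratio relative to the reference machine (higher is better)."""
    return reference_time_s / measured_time_s


def geometric_mean(ratios: list) -> float:
    """n-th root of the product of n ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (reference time, measured time) pairs in seconds.
runs = {
    "prog_a": (1000.0, 500.0),
    "prog_b": (2400.0, 300.0),
}
ratios = [spec_ratio(ref, measured) for ref, measured in runs.values()]
print(f"Geometric mean of ratios: {geometric_mean(ratios):.2f}")
```

The geometric mean is used because it gives the same relative ranking of two machines regardless of which machine is chosen as the reference.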

  25. CINT2006 for Opteron X4 2356 • High cache miss rates

  26. Processor design

  27. Instruction Execution • PC → instruction memory, fetch instruction • Register numbers → register file, read registers • Depending on instruction class • Use ALU to calculate • Arithmetic result • Memory address for load/store • Branch target address • Access data memory for load/store • PC ← target address or PC + 4
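The execution steps on this slide can be mirrored in a toy interpreter. This is only a sketch under simplifying assumptions: instructions are stored as hypothetical `(op, x, y, z)` tuples rather than 32-bit encodings, registers and data memory are plain dictionaries, and only a few MIPS-style operations are handled.

```python
def step(pc: int, regs: dict, data_mem: dict, instr_mem: list) -> int:
    """Execute the instruction at `pc` and return the next PC."""
    op, x, y, z = instr_mem[pc // 4]      # PC -> instruction memory (fetch)
    if op == "add":                        # add rd, rs, rt
        regs[x] = regs[y] + regs[z]
    elif op == "sub":                      # sub rd, rs, rt
        regs[x] = regs[y] - regs[z]
    elif op == "lw":                       # lw rt, offset(rs) as (rt, rs, offset)
        regs[x] = data_mem[regs[y] + z]    # ALU computes the address
    elif op == "sw":                       # sw rt, offset(rs) as (rt, rs, offset)
        data_mem[regs[y] + z] = regs[x]
    elif op == "beq":                      # beq rs, rt, word_offset
        if regs[x] == regs[y]:
            return pc + 4 + 4 * z          # PC <- branch target
    else:
        raise ValueError(f"unsupported op: {op}")
    return pc + 4                          # PC <- PC + 4


def run(instr_mem: list, regs: dict, data_mem: dict) -> None:
    """Run from PC = 0 until the PC moves past the last instruction."""
    pc = 0
    while pc // 4 < len(instr_mem):
        pc = step(pc, regs, data_mem, instr_mem)


program = [
    ("lw", "$t1", "$t0", 0),
    ("lw", "$t2", "$t0", 4),
    ("beq", "$t1", "$t2", 1),              # skips the sub when $t1 == $t2
    ("sub", "$t3", "$t1", "$t2"),
    ("sw", "$t3", "$t0", 12),
]
regs = {"$t0": 0, "$t1": 0, "$t2": 0, "$t3": 0}
data_mem = {0: 7, 4: 5}
run(program, regs, data_mem)
print(data_mem[12])
```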

  28. MIPS Instruction Set • Microprocessor without Interlocked Pipeline Stages

  29. Introduction • CPU performance factors • Instruction count • Determined by ISA and compiler • CPI and Cycle time • Determined by CPU hardware • We will examine two MIPS implementations • A simplified version • A more realistic pipelined version • Simple subset, shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq

  30. Three Instruction Classes

  31. CPU Overview

  32. Multiplexers • Can’t just join wires together • Use multiplexers

  33. Control

  34. Full Datapath

  35. Datapath With Control

  36. R-Type Instruction

  37. Load Instruction

  38. Branch-on-Equal Insn.

  39. Performance Issues • Longest delay determines clock period • Critical path: load instruction • Instruction memory → register file → ALU → data memory → register file • Not feasible to vary period for different instructions • Violates design principle • Making the common case fast • We will improve performance by pipelining

  40. Pipeline Performance • Assume time for stages is • 100ps for register read or write • 200ps for other stages • Compare pipelined datapath with single-cycle datapath

  41. Pipeline Performance • Single-cycle (T_c = 800ps) • Pipelined (T_c = 200ps)

  42. MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch from memory • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register

  43. Pipeline Speedup • If all stages are balanced • i.e., all take the same time • Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages • If not balanced, speedup is less • Speedup due to increased throughput • Latency (time for each instruction) does not decrease
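The 800ps and 200ps clock periods from slide 41 follow from the stage times assumed on slide 40 (100ps for register read or write, 200ps for the other stages). The sketch below assigns those times to the five MIPS stages and compares the two designs; the function names are illustrative.

```python
# Stage times in picoseconds: ID reads registers, WB writes them.
STAGE_TIME_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period_ps(stage_times: dict) -> int:
    """One long cycle must fit every stage of the slowest instruction (lw)."""
    return sum(stage_times.values())


def pipelined_period_ps(stage_times: dict) -> int:
    """The pipelined clock is set by the slowest stage."""
    return max(stage_times.values())


def single_cycle_total_ps(n_instructions: int, stage_times: dict) -> int:
    return n_instructions * single_cycle_period_ps(stage_times)


def pipelined_total_ps(n_instructions: int, stage_times: dict) -> int:
    """First instruction fills the pipeline; then one completes per cycle."""
    fill_cycles = len(stage_times) - 1
    return (n_instructions + fill_cycles) * pipelined_period_ps(stage_times)


speedup = single_cycle_period_ps(STAGE_TIME_PS) / pipelined_period_ps(STAGE_TIME_PS)
print(f"Time between instructions improves by {speedup:.1f}x with 5 stages")
print(single_cycle_total_ps(3, STAGE_TIME_PS), pipelined_total_ps(3, STAGE_TIME_PS))
```

The improvement is 4x rather than the ideal 5x because the stages are not balanced, and for a short run of three loads the pipeline fill time reduces the gain further (2400ps vs. 1400ps).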

  44. Hazards • Situations that prevent starting the next instruction in the next cycle • Structure hazards • A required resource is busy • Data hazard • Need to wait for previous instruction to complete its data read/write • Control hazard • Deciding on control action depends on previous instruction

  45. Data Hazards • An instruction depends on completion of data access by a previous instruction •
      add $s0, $t0, $t1
      sub $t2, $s0, $t3

  46. Forwarding (aka Bypassing) • Use result when it is computed • Don’t wait for it to be stored in a register • Requires extra connections in the datapath

  47. Load-Use Data Hazard • Can’t always avoid stalls by forwarding • If value not computed when needed • Can’t forward backward in time!

  48. Code Scheduling to Avoid Stalls • Reorder code to avoid use of load result in the next instruction • C code for A = B + E; C = B + F;
      Original order (13 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
        add $t3, $t1, $t2     (stall: uses $t2 right after its load)
        sw  $t3, 12($t0)
        lw  $t4, 8($t0)
        add $t5, $t1, $t4     (stall: uses $t4 right after its load)
        sw  $t5, 16($t0)
      Reordered (11 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
        lw  $t4, 8($t0)
        add $t3, $t1, $t2
        sw  $t3, 12($t0)
        add $t5, $t1, $t4
        sw  $t5, 16($t0)
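The 13-cycle and 11-cycle figures can be reproduced with a simple model: a five-stage pipeline with forwarding, where the only stall is one cycle when an instruction reads a register loaded by the immediately preceding `lw`. The `Instr` tuple and function names below are illustrative, and the model ignores every other hazard.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]    # registers read


def load_use_stalls(program: Sequence[Instr]) -> int:
    """Count one stall per instruction that uses the previous lw's result."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


def total_cycles(program: Sequence[Instr], stages: int = 5) -> int:
    """Cycles = instructions + pipeline fill (stages - 1) + stall cycles."""
    return len(program) + (stages - 1) + load_use_stalls(program)


lw_t1 = Instr("lw", "$t1", ("$t0",))
lw_t2 = Instr("lw", "$t2", ("$t0",))
lw_t4 = Instr("lw", "$t4", ("$t0",))
add_t3 = Instr("add", "$t3", ("$t1", "$t2"))
add_t5 = Instr("add", "$t5", ("$t1", "$t4"))
sw_t3 = Instr("sw", None, ("$t3", "$t0"))
sw_t5 = Instr("sw", None, ("$t5", "$t0"))

original = [lw_t1, lw_t2, add_t3, sw_t3, lw_t4, add_t5, sw_t5]
reordered = [lw_t1, lw_t2, lw_t4, add_t3, sw_t3, add_t5, sw_t5]

print(total_cycles(original), total_cycles(reordered))
```

Moving the third load up separates each load from the instruction that consumes its result, so both stalls disappear without changing what the code computes.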

  49. Control Hazards • Branch determines flow of control • Fetching next instruction depends on branch outcome • Pipeline can’t always fetch correct instruction • Still working on ID stage of branch • In MIPS pipeline • Need to compare registers and compute target early in the pipeline • Add hardware to do it in ID stage

  50. Stall on Branch • Wait until branch outcome determined before fetching next instruction
