Computer Architecture (Arquitectura de Computadores): Metrics of Performance Evaluation
Semester 2013-1. Professor: Sebastián Isaza
Bibliography and evaluation
Arquitectura de Computadores (2013-1) – Prof. Sebastián Isaza
Bibliography
• Lecture slides
• Chapter 1 of Computer Architecture: A Quantitative Approach, J. Hennessy and D. Patterson, Morgan Kaufmann, 5th Edition, 2011 (previous editions may be good too)
Evaluation
• Quiz (10%)
How good is a computer?
• We can think of many parameters:
  • Clock rate of computer
  • Power consumed by a program
  • Execution time for a program
  • Number of tasks done per second
  • Reliability
  • Aesthetic appearance
  • Social repercussion, etc…
• These are the metrics: the things we want to estimate or measure (not all of them are easy to measure, though)
• How should we compare two computer systems?
Pareto optimality
Performance: Latency vs. Throughput
• Latency: time to finish a fixed task
• Throughput: number of tasks per unit of time
• They are different: parallelism can be exploited for throughput, not for latency
• Usually a trade-off: latency vs. throughput
• Choose the definition of performance that matches your goals
  • Scientific program: latency; web server: throughput?
• Example: transport people 10 km
  • Car: capacity = 5, speed = 60 km/h
  • Bus: capacity = 60, speed = 20 km/h
  • Latency: car = 10 min, bus = 30 min
  • Throughput: car = 15 PPH (people per hour, counting the return trip), bus = 60 PPH
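The car/bus numbers can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes each vehicle returns empty at the same speed before its next trip.

```python
def latency_min(distance_km, speed_kmh):
    """One-way travel time in minutes."""
    return 60.0 * distance_km / speed_kmh


def throughput_pph(capacity, distance_km, speed_kmh):
    """People delivered per hour, counting the empty return trip."""
    round_trip_h = 2.0 * distance_km / speed_kmh
    return capacity / round_trip_h


# name -> (capacity, speed in km/h), taken from the slide
vehicles = {"car": (5, 60), "bus": (60, 20)}

results = {
    name: (latency_min(10, speed), throughput_pph(cap, 10, speed))
    for name, (cap, speed) in vehicles.items()
}

for name, (lat, thr) in results.items():
    print(f"{name}: latency = {lat:.0f} min, throughput = {thr:.0f} PPH")
```

The car wins on latency and the bus on throughput, which is why the metric has to match the goal.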
Comparing Performance
• A is X times faster than B if
  • Latency(A) = Latency(B) / X
  • Throughput(A) = Throughput(B) * X
• A is X% faster than B if
  • Latency(A) = Latency(B) / (1 + X/100)
  • Throughput(A) = Throughput(B) * (1 + X/100)
• Car/bus example
  • Latency? Car is 3 times (and 200%) faster than bus
  • Throughput? Bus is 4 times (and 300%) faster than car
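The two definitions differ only by an offset of one: "X times faster" is a ratio, "X% faster" is that ratio minus one, expressed in percent. A minimal sketch with illustrative helper names, using the car/bus values from the slides:

```python
def times_faster_latency(latency_a, latency_b):
    """X such that Latency(A) = Latency(B) / X."""
    return latency_b / latency_a


def times_faster_throughput(throughput_a, throughput_b):
    """X such that Throughput(A) = Throughput(B) * X."""
    return throughput_a / throughput_b


def percent_faster(times):
    """Convert 'X times faster' into 'X% faster'."""
    return (times - 1.0) * 100.0


car_over_bus = times_faster_latency(10, 30)      # latencies in minutes
bus_over_car = times_faster_throughput(60, 15)   # throughputs in PPH

print(car_over_bus, percent_faster(car_over_bus))
print(bus_over_car, percent_faster(bus_over_car))
```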
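CPU Performance Equation
• 3 components to execution time:
  $\text{CPU time} = \frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$
• Factors affecting CPU execution time:
  • Instructions / program: program, compiler, ISA
  • Cycles / instruction: program, compiler, ISA, micro-architecture
  • Seconds / cycle: micro-architecture, technology parameters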
Cycles per Instruction (CPI)
• CPI is program dependent!
• It depends on the instruction:
  $CPI_i = \text{Execution time of instruction } i \times \text{Clock rate}$
• Computing the total CPI:
  $CPI = \sum_i CPI_i \times F_i$, where $F_i$ is the fraction of executed instructions that are of type $i$
• Example:
Another CPI Example
• Assume a processor with these instruction frequencies and costs:
  • Integer ALU: 50%, 1 cycle
  • Load: 20%, 5 cycles
  • Store: 10%, 1 cycle
  • Branch: 20%, 2 cycles
• Which change would improve performance more?
  • A: Faster branch prediction to reduce branch cost to 1 cycle?
  • B: Better data cache to reduce load cost to 3 cycles?
• Compute CPI
  • Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2
  • A = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*1 = 1.8
  • B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 (winner)
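The same comparison as a short Python sketch. The dictionary layout and names are illustrative only; the frequencies and cycle counts are the ones on the slide.

```python
def total_cpi(mix):
    """Weighted CPI; mix maps instruction class -> (frequency, cycles)."""
    return sum(freq * cycles for freq, cycles in mix.values())


base = {
    "alu": (0.5, 1),
    "load": (0.2, 5),
    "store": (0.1, 1),
    "branch": (0.2, 2),
}
option_a = dict(base, branch=(0.2, 1))  # faster branch prediction
option_b = dict(base, load=(0.2, 3))    # better data cache

cpi_base = total_cpi(base)
cpi_a = total_cpi(option_a)
cpi_b = total_cpi(option_b)

print(f"base = {cpi_base:.1f}, A = {cpi_a:.1f}, B = {cpi_b:.1f}")
```

Option B wins even though it improves a less frequent... no, an equally frequent class: loads and branches are both 20% of the mix, but the load fix saves 2 cycles per instruction while the branch fix saves only 1.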
Mean (Average) Performance Numbers
• Arithmetic: $\frac{1}{N} \sum_{i=1}^{N} \text{Latency}_i$
  • For units that are proportional to time (e.g., latency)
  • You can add latencies, but not throughputs
  • 1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
  • Average is not 60 miles/hour: the 2 miles take 1/30 + 1/90 hours, so the average is 45 miles/hour
• Harmonic: $N \Big/ \sum_{i=1}^{N} \frac{1}{\text{Throughput}_i}$
  • For units that are inversely proportional to time (e.g., throughput)
• Geometric: $\sqrt[N]{\prod_{i=1}^{N} \text{Speedup}_i}$
  • For unitless quantities (e.g., speedup ratios)
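A minimal sketch of the three means, written out by hand rather than taken from a library so that the definitions stay visible. The speedup values in the last line are hypothetical.

```python
import math


def arithmetic_mean(values):
    """For quantities proportional to time (e.g., latencies)."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """For quantities inversely proportional to time (e.g., throughputs)."""
    return len(values) / sum(1.0 / v for v in values)


def geometric_mean(values):
    """For unitless ratios (e.g., speedups)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


speeds_mph = [30, 90]  # 1 mile at each speed, as on the slide
naive = arithmetic_mean(speeds_mph)
correct = harmonic_mean(speeds_mph)
speedup_mean = geometric_mean([2.0, 8.0])  # hypothetical speedups

print(f"arithmetic = {naive:.1f} mph, harmonic = {correct:.1f} mph")
print(f"geometric mean of speedups = {speedup_mean:.1f}")
```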
IPC, MIPS and GHz
• The metrics you are most likely to see in marketing are IPC (instructions per cycle), MIPS (millions of instructions per second) and GHz
• How are they incomplete? Back to the CPU time formula:
  seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle)
  • IPC only covers cycles / instruction (1/IPC)
  • GHz only covers seconds / cycle (1/GHz)
  • MIPS covers their product, seconds / instruction (1/MIPS), but ignores instructions / program
• Which processor would you buy?
  • Processor A: CPI = 2, clock = 5 GHz
  • Processor B: CPI = 1, clock = 3 GHz
  • Probably A, but B is faster (assuming same ISA/compiler): A needs 2/5 = 0.4 ns per instruction, B only 1/3 ≈ 0.33 ns
• Meta-point: danger of partial performance metrics!
  • GHz can be boosted artificially by design, at the expense of the other two terms
  • E.g., the 800 MHz Pentium III was faster than the 1 GHz Pentium 4!
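The processor A/B comparison can be reproduced by converting each (CPI, clock) pair into MIPS. This sketch assumes, as the slide does, that both processors run the same instruction stream; the helper name is illustrative.

```python
def mips(cpi, clock_hz):
    """Millions of instructions per second for a given CPI and clock."""
    return clock_hz / (cpi * 1e6)


processors = {
    "A": {"cpi": 2, "clock_hz": 5e9},
    "B": {"cpi": 1, "clock_hz": 3e9},
}

mips_by_name = {name: mips(**p) for name, p in processors.items()}
fastest = max(mips_by_name, key=mips_by_name.get)

for name, value in mips_by_name.items():
    print(f"Processor {name}: {value:.0f} MIPS")
print(f"Fastest: {fastest}")
```

Judging by GHz alone would pick A; combining clock rate with CPI picks B.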
Inter-Insn Parallelism: Pipelining
• Pipelining: cut datapath into N stages (here 5)
• Separate each stage of logic by latches
• Clock period: maximum logic + wire delay of any stage = max(T_insn-mem, T_regfile, T_ALU, T_data-mem)
• Base CPI = 1, but actual CPI > 1: pipeline must often stall
• Individual insn latency increases (pipeline overhead), but that is not the point
Pipelining: Clock Frequency vs. IPC
• Increase number of pipeline stages ("pipeline depth")
  • Keep cutting datapath into finer pieces
• Increases clock frequency (decreases clock period)
  • Latch overhead and unbalanced stages cause sub-linear scaling
  • Doubling the number of stages won't quite double the frequency
• Decreases IPC (increases CPI)
  • At some point, actually causes performance to decrease
  • "Optimal" pipeline depth is program and technology specific
• Classic example
  • Pentium III: 12-stage pipeline, 800 MHz
  • Pentium 4: 22-stage pipeline, 1 GHz
  • Actually slower (because of lower IPC)
  • Core 2: 15-stage pipeline
  • Intel learned its lesson
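The sub-linear frequency scaling can be illustrated with a toy model in which every stage pays a fixed latch overhead. The model and both delay values are assumptions chosen for illustration (perfectly balanced stages, 10 ns of total logic, 0.5 ns per latch); they are not taken from the slides, and the model says nothing about the IPC loss.

```python
def clock_period_ns(total_logic_ns, latch_ns, stages):
    """Clock period of a perfectly balanced pipeline with latch overhead."""
    return total_logic_ns / stages + latch_ns


def frequency_ghz(total_logic_ns, latch_ns, stages):
    """Clock frequency in GHz (1 / period in ns)."""
    return 1.0 / clock_period_ns(total_logic_ns, latch_ns, stages)


TOTAL_LOGIC_NS = 10.0  # hypothetical un-pipelined datapath delay
LATCH_NS = 0.5         # hypothetical per-stage latch overhead

f5 = frequency_ghz(TOTAL_LOGIC_NS, LATCH_NS, 5)
f10 = frequency_ghz(TOTAL_LOGIC_NS, LATCH_NS, 10)
scaling = f10 / f5

print(f"5 stages: {f5:.2f} GHz, 10 stages: {f10:.2f} GHz, ratio: {scaling:.2f}")
```

In this model the frequency can never exceed 1 / latch delay, no matter how deep the pipeline gets.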
CPI and Clock Frequency
• Clock frequency refers to the CPU clock
  • Other system components have their own clocks (or not)
  • E.g., increasing the processor clock doesn't accelerate memory
• Example: a 1 GHz processor with
  • 50% memory instructions, 50% non-memory, all with 1-cycle latency
  • Base: CPI is 1, frequency is 1 GHz → MIPS is 1000
• Impact of doubling the core clock frequency?
  • Without speeding up the memory
  • Non-memory instructions retain 1-cycle latency
  • Memory instructions now have 2-cycle latency
  • CPI = (50% * 1) + (50% * 2) = 1.5
  • New: CPI is 1.5, frequency is 2 GHz → MIPS is 1333
  • Speedup = 1333/1000 = 1.33 << 2
• What about an infinite clock frequency?
  • Only a factor of 2 speedup (example of Amdahl's Law)
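The example generalizes to any clock multiplier. The sketch below assumes that memory latency stays constant in nanoseconds, so a memory instruction costs as many cycles as the clock multiplier (fractional cycle counts are allowed for simplicity); the function names are illustrative.

```python
def cpi_after_scaling(mem_fraction, clock_scale):
    """CPI when only the core clock is multiplied by clock_scale."""
    non_mem_cycles = 1.0          # still 1 cycle
    mem_cycles = clock_scale      # same time in ns, more (shorter) cycles
    return (1.0 - mem_fraction) * non_mem_cycles + mem_fraction * mem_cycles


def speedup(mem_fraction, clock_scale):
    """Performance relative to the base machine (CPI = 1)."""
    return clock_scale / cpi_after_scaling(mem_fraction, clock_scale)


doubled = speedup(0.5, 2)
near_infinite = speedup(0.5, 1e9)

print(f"2x clock: {doubled:.2f}x faster")
print(f"1e9x clock: {near_infinite:.2f}x faster")
```

As the clock multiplier grows, the speedup approaches 1 / mem_fraction, which is 2 for the 50% memory mix on the slide.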
Measuring CPI
• How are CPI and execution time actually measured?
  • Execution time? Stopwatch timer (Unix "time" command)
  • CPI = (CPU time * clock frequency) / dynamic insn count
  • How is dynamic insn count measured?
• More useful is a CPI breakdown (CPI_CPU, CPI_MEM, etc.)
  • So we know what the performance problems are and what to fix
• Hardware event counters
  • Available in most processors today
  • One way to measure dynamic instruction count
  • Calculate CPI using counter frequencies / known event costs
• Cycle-level micro-architecture simulation (e.g., SimpleScalar)
  • Measure exactly what you want … and the impact of potential fixes!
  • Method of choice for many micro-architects
Performance Trends
• Historically, clock frequency provided 75%+ of performance gains…
  • Achieved via both faster transistors and deeper pipelines
• … but that has changed: 1 GHz in '99, 2 GHz in '01, 3 GHz in '02, 4 GHz?
  • Deep pipelining is not power efficient
  • Physical scaling limits are approaching
Improving CPI: Caching and Parallelism
• Examples:
  • Caching, speculation, multiple issue, out-of-order issue, vectors, multiprocessing, more…
• Moore's Law can help IPC: "more transistors"
  • Best examples are caches (to improve the memory component of CPI)
• Parallelism:
  • IPC > 1 implies instructions executing in parallel
  • And now multi-processors (multi-cores)
  • But also speculation, wide issue, out-of-order issue, vectors…
• All roads lead to multi-core
  • Why multi-core over still bigger caches, yet wider issue?
  • Diminishing returns, limited ILP in programs
  • Multi-core can provide linear performance with transistor count (really?)
Gene Amdahl
• American computer architect
• Born in 1922
• Worked for IBM until 1970
• Founded Amdahl Corporation to compete in the mainframe market against IBM
• Proposed what later became known as "Amdahl's Law" at the 1967 Spring Joint Computer Conference
Amdahl's law
• Suppose an enhancement speeds up a fraction $f$ of a task by a factor of $S_f$:
  $\text{Speedup} = \dfrac{1}{(1 - f) + \dfrac{f}{S_f}}$
• If $f$ is small, $S_f$ doesn't matter
• Concentrate effort on improving frequently occurring events or frequently used
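A small sketch of Amdahl's law in its usual form, speedup = 1 / ((1 - f) + f / Sf), where f is the enhanced fraction and Sf its local speedup. The function name and the sample values are illustrative.

```python
def amdahl_speedup(f, s_f):
    """Overall speedup when a fraction f of the task is sped up by s_f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    return 1.0 / ((1.0 - f) + f / s_f)


# Half of the program made infinitely fast: the other half still remains.
half_removed = amdahl_speedup(0.5, float("inf"))

# A huge local speedup applied to a small fraction barely matters.
small_fraction = amdahl_speedup(0.01, 1000.0)

print(f"f = 0.5, Sf = inf   -> {half_removed:.2f}x")
print(f"f = 0.01, Sf = 1000 -> {small_fraction:.3f}x")
```

The upper bound on the overall speedup is 1 / (1 - f), regardless of how large Sf is.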
Practicing Amdahl's law
Performance Rules of Thumb
• Amdahl's Law
  • Literally: total speedup is limited by the non-accelerated piece
  • Example: can optimize 50% of program A
  • Even a "magic" optimization that makes this 50% disappear…
  • …only yields a 2X speedup
• Corollary: build a balanced system
  • Don't optimize 1% to the detriment of the other 99%
  • Don't over-engineer capabilities that cannot be utilized
• Design for actual performance, not peak performance
  • Peak performance: "Performance you are guaranteed not to exceed"
  • Greater than "actual" or "average" or "sustained" performance
  • Why? Cache misses, branch mispredictions, limited ILP, etc.
  • For actual performance X, machine capability must be > X
Which programs to measure
Summary
• Latency = seconds / program = (instructions / program) * (cycles / instruction) * (seconds / cycle)
• Instructions / program: dynamic instruction count
  • Function of program, compiler, instruction set architecture (ISA)
• Cycles / instruction: CPI
  • Function of program, compiler, ISA, micro-architecture
• Seconds / cycle: clock period
  • Function of micro-architecture, technology parameters
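The three-term product translates directly into code. In this sketch the instruction count, CPI and clock frequency are hypothetical values picked only to exercise the formula.

```python
def cpu_time_s(insn_count, cpi, clock_hz):
    """seconds/program = insns/program * cycles/insn * seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return insn_count * cpi * seconds_per_cycle


# Hypothetical program: 1e9 dynamic instructions, CPI 1.5, 2 GHz clock.
baseline = cpu_time_s(1e9, 1.5, 2e9)

# Halving any single term halves the execution time.
fewer_insns = cpu_time_s(0.5e9, 1.5, 2e9)
lower_cpi = cpu_time_s(1e9, 0.75, 2e9)
faster_clock = cpu_time_s(1e9, 1.5, 4e9)

print(f"baseline: {baseline:.3f} s")
print(f"half the instructions: {fewer_insns:.3f} s")
print(f"half the CPI: {lower_cpi:.3f} s")
print(f"double the clock: {faster_clock:.3f} s")
```

In practice the terms are not independent: a change that improves one (a deeper pipeline, a different ISA) usually affects another.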
The Power Wall
[Figure: power density trend. Source: Fred Pollack, Intel, keynote speech at Micro32, 1999.]
Taken from J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th Ed., 2011.
Power and Energy
• Why are they important?
  • Power produces heat that must be dissipated
  • Energy consumption limits the use of mobile devices
  • The energy bill must be paid
  • Environment!
• Remember that power is energy per unit of time
• What does each of them tell us? Example:
  • Processor A has a 20% higher average power consumption than processor B
  • Processor A executes a task in 70% of the time needed by B
  • Processor A consumes less energy to do the same job: 1.2 × 0.7 = 0.84 of the energy used by B
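Since energy is average power multiplied by time, the comparison reduces to a product of two ratios. A minimal sketch with an illustrative function name, using the ratios from the slide:

```python
def relative_energy(power_ratio, time_ratio):
    """Energy of A relative to B, given A/B ratios of power and time."""
    return power_ratio * time_ratio


# Processor A: 20% more average power, 70% of the execution time of B.
energy_a_over_b = relative_energy(1.2, 0.7)

print(f"A uses {energy_a_over_b:.2f}x the energy of B")
```

The processor with the higher power draw is still the more energy-efficient one for this task, because it finishes sooner.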
CMOS Power Consumption
• Power consumption of a CMOS gate: $P = P_{sw} + P_{sc} + P_{lk}$, where:
  • $P_{sw}$ = Switching or dynamic power
  • $P_{sc}$ = Short-circuit power
  • $P_{lk}$ = Leakage or static power
• In older technologies (0.25 µm and above), $P_{lk}$ was marginal with respect to switching power:
  • Switching power minimization was the primary objective
• In deep sub-micron processes, $P_{lk}$ becomes critical:
  • Leakage accounts for around 5-10% of the power budget at 180 nm; this grows to 20-25% at 130 nm and to 35-50% at 32 nm
  • Leakage power minimization must be addressed from the design standpoint, not just at the technology/process level
Power Dissipation Due to Switching: Dynamic Power
• Switching power of a CMOS gate: $P_{sw} = 0.5 \, V_{dd}^2 \, f_{Clock} \, C_L \, E_{SW}$, where:
  • $V_{dd}$ = Supply voltage
  • $f_{Clock}$ = Clock frequency
  • $C_L$ = Output load capacitance
  • $E_{SW}$ = Switching activity factor
• $E_{SW}$ represents the probability that the output node makes a transition at each clock cycle
  • It models the fact that, in general, switching does not occur at the clock frequency
  • It is called the switching activity of the gate
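The switching power formula, Psw = 0.5 * Vdd^2 * fClock * CL * ESW, can be evaluated directly. All numeric values in this sketch are hypothetical and only meant to show how each parameter enters the result.

```python
def switching_power_w(vdd_v, f_clock_hz, c_load_f, e_sw):
    """Dynamic power of a CMOS gate: 0.5 * Vdd^2 * f * CL * Esw (watts)."""
    return 0.5 * vdd_v ** 2 * f_clock_hz * c_load_f * e_sw


# Hypothetical gate: 1.0 V supply, 1 GHz clock, 1 fF load, 10% activity.
p_base = switching_power_w(1.0, 1e9, 1e-15, 0.1)

# Power is linear in activity and frequency, quadratic in supply voltage.
p_double_activity = switching_power_w(1.0, 1e9, 1e-15, 0.2)
p_half_voltage = switching_power_w(0.5, 1e9, 1e-15, 0.1)

print(f"base: {p_base:.2e} W")
print(f"2x activity: {p_double_activity / p_base:.2f}x power")
print(f"0.5x Vdd: {p_half_voltage / p_base:.2f}x power")
```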
Power Dissipation Due to Leakage: Static Power
• Leakage power of a CMOS gate: $P_{Leakage} = I_L \, V_{dd}$, where:
  • $V_{dd}$ = Supply voltage
  • $I_L$ = Leakage current
• Leakage current $I_L$ consists of two major contributions: $I_L = I_{sub} + I_{gate}$, where:
  • $I_{sub}$ = Sub-threshold current caused by low threshold voltage
  • $I_{gate}$ = Gate current caused by reduced thickness of the gate oxide
• $I_{sub}$ dominates, growing by 5X per generation
• $I_{gate}$ is less relevant, but grows much faster (500X per generation)
Power reduction techniques
• Clock gating: disconnect the clock from a module
• DFS: Dynamic Frequency Scaling
• DVS: Dynamic Voltage Scaling
  • Which one is more effective?
• DVFS: Dynamic Voltage and Frequency Scaling, a combination of the previous two
• Power gating: disconnect the power of a module… leads to dark silicon
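One way to explore the DFS versus DVS question is to apply the switching power model, Psw = 0.5 * Vdd^2 * fClock * CL * ESW, to a task with a fixed cycle count. This is a sketch under strong assumptions: only switching power is counted (no leakage or short-circuit power), the voltage and frequency values are hypothetical, and DVS is modeled as if Vdd could be lowered without touching the clock, which real circuits generally do not allow.

```python
def switching_power_w(vdd, f_clock, c_load, e_sw):
    """Dynamic power: 0.5 * Vdd^2 * f * CL * Esw."""
    return 0.5 * vdd ** 2 * f_clock * c_load * e_sw


def task_energy_j(vdd, f_clock, c_load, e_sw, cycles):
    """Energy for a task that needs a fixed number of clock cycles."""
    return switching_power_w(vdd, f_clock, c_load, e_sw) * (cycles / f_clock)


BASE = dict(vdd=1.0, f_clock=1e9, c_load=1e-9, e_sw=0.1)  # hypothetical chip
CYCLES = 1e9                                              # hypothetical task

scenarios = {
    "DFS": dict(BASE, f_clock=0.5e9),             # half the frequency
    "DVS": dict(BASE, vdd=0.7),                   # 70% of the voltage
    "DVFS": dict(BASE, vdd=0.7, f_clock=0.5e9),   # both
}

base_power = switching_power_w(**BASE)
base_energy = task_energy_j(cycles=CYCLES, **BASE)

ratios = {
    name: (
        switching_power_w(**cfg) / base_power,
        task_energy_j(cycles=CYCLES, **cfg) / base_energy,
    )
    for name, cfg in scenarios.items()
}

for name, (p, e) in ratios.items():
    print(f"{name}: power x{p:.3f}, energy per task x{e:.3f}")
```

Under these assumptions, lowering the frequency alone cuts power but not the energy per task, because the task runs proportionally longer, while lowering the voltage reduces both quadratically.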
Example