Advanced Computer Architecture, Nov 19th 2013. Qiang Cao, Information Storage Division, Wuhan National Laboratory for Optoelectronics; School of Computer Science and Technology, Huazhong University of Science and Technology
Computer Engineering Methodology • (Design-cycle diagram) Market and Applications drive Benchmarks and Workloads • Evaluate Existing Systems for Bottlenecks • Simulate New Designs and Organizations, guided by Technology Trends and Implementation Complexity • Implement Next Generation System • repeat
Metrics used to Compare Designs • Cost • Die cost and system cost • Execution Time • average and worst-case • Latency vs. Throughput • Energy and Power • Also peak power and peak switching current • Reliability • Resiliency to electrical noise, part failure • Robustness to bad software, operator error • Maintainability • System administration costs • Compatibility • Software costs dominate
What is Performance? • Latency (or response time or execution time) • time to complete one task • Bandwidth (or throughput) • tasks completed per unit time
Definition: Performance • performance(X) = 1 / execution_time(X) • Performance is in units of things per second, so bigger is better • If we are primarily concerned with response time, "X is n times faster than Y" means n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
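A minimal sketch of these definitions in Python (the function names are mine, for illustration):

```python
def performance(execution_time):
    """Performance is the reciprocal of execution time (tasks per second)."""
    return 1.0 / execution_time

def times_faster(time_x, time_y):
    """'X is n times faster than Y': n = perf(X)/perf(Y) = time(Y)/time(X)."""
    return time_y / time_x

# Example: X finishes in 2 s, Y in 6 s -> X is 3 times faster than Y.
print(times_faster(2.0, 6.0))  # 3.0
```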
Performance: What to measure • Usually rely on benchmarks vs. real workloads • To increase predictability, collections of benchmark applications -- benchmark suites -- are popular • SPEC CPU: popular desktop benchmark suite • CPU only, split between integer and floating-point programs • SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs • SPEC CPU2006 to be announced Spring 2006 • SPECSFS (NFS file server) and SPECWeb (Web server) added as server benchmarks • Transaction Processing Council measures server performance and cost-performance for databases • TPC-C: complex query for Online Transaction Processing • TPC-H: models ad hoc decision support • TPC-W: a transactional web benchmark • TPC-App: application server and web services benchmark
Summarizing Performance: which system is faster?
System A: Rate (Task 1) = 10, Rate (Task 2) = 20
System B: Rate (Task 1) = 20, Rate (Task 2) = 10
… depends who’s selling:
Average throughput: A = (10, 20), average 15; B = (20, 10), average 15
Throughput relative to B: A = (0.50, 2.00), average 1.25; B = (1.00, 1.00), average 1.00
Throughput relative to A: A = (1.00, 1.00), average 1.00; B = (2.00, 0.50), average 1.25
Summarizing Performance over a Set of Benchmark Programs • Arithmetic mean of execution times t_i (in seconds): (1/n) Σ_i t_i • Harmonic mean of execution rates r_i (MIPS/MFLOPS): n / [Σ_i (1/r_i)] • Both equivalent to a workload where each program is run the same number of times • Can add weighting factors to model other workload distributions
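A short Python sketch of the two means; the numbers are made up, chosen to show that the harmonic mean of rates is consistent with the arithmetic mean of times:

```python
def arithmetic_mean_times(times):
    """Mean of execution times t_i (seconds): (1/n) * sum(t_i)."""
    return sum(times) / len(times)

def harmonic_mean_rates(rates):
    """Mean of execution rates r_i (tasks/s): n / sum(1/r_i)."""
    return len(rates) / sum(1.0 / r for r in rates)

times = [2.0, 4.0]                 # seconds per program
rates = [1.0 / t for t in times]   # programs per second
print(arithmetic_mean_times(times))  # 3.0 s
print(harmonic_mean_rates(rates))    # 0.333 tasks/s = 1 / 3.0 s, consistent
```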
Normalized Execution Time and Geometric Mean • Measure speedup relative to a reference machine: ratio = t_Ref / t_A • Average the time ratios using the geometric mean: (Π_i ratio_i)^(1/n) • Insensitive to machine chosen as reference • Insensitive to run time of individual benchmarks • Used by SPEC89, SPEC92, SPEC95, …, SPEC2006 • But beware that the choice of reference machine can suggest what is a "normal" performance profile
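A Python sketch (with hypothetical benchmark times) showing the geometric mean's insensitivity to the reference machine:

```python
import math

def geometric_mean(ratios):
    """n-th root of the product of time ratios t_ref / t_machine."""
    return math.prod(ratios) ** (1.0 / len(ratios))

# Hypothetical times (seconds) for two benchmarks on a reference
# machine and on machines A and B.
t_ref = [100.0, 10.0]
t_a = [50.0, 10.0]
t_b = [100.0, 5.0]

score_a = geometric_mean([r / a for r, a in zip(t_ref, t_a)])
score_b = geometric_mean([r / b for r, b in zip(t_ref, t_b)])

# A's standing relative to B is independent of the reference machine,
# and equals the direct geomean of per-benchmark speedups:
print(score_a / score_b)                                  # 1.0
print(geometric_mean([b / a for b, a in zip(t_b, t_a)]))  # 1.0
```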
Vector/Superscalar Speedup • 100 MHz Cray J90 vector machine versus 300 MHz Alpha 21164 • [LANL Computational Physics Codes, Wasserman, ICS’96] • Vector machine peaks on a few codes?
Superscalar/Vector Speedup • 100 MHz Cray J90 vector machine versus 300 MHz Alpha 21164 • [LANL Computational Physics Codes, Wasserman, ICS’96] • Scalar machine peaks on one code?
How to Mislead with Performance Reports • Select pieces of workload that work well on your design, ignore others • Use unrealistic data set sizes for application (too big or too small) • Report throughput numbers for a latency benchmark • Report latency numbers for a throughput benchmark • Report performance on a kernel and claim it represents an entire application • Use 16-bit fixed-point arithmetic (because it’s fastest on your system) even though application requires 64-bit floating-point arithmetic • Use a less efficient algorithm on the competing machine • Report speedup for an inefficient algorithm (bubblesort) • Compare hand-optimized assembly code with unoptimized C code • Compare your design using next year’s technology against competitor’s year old design (1% performance improvement per week) • Ignore the relative cost of the systems being compared • Report averages and not individual results • Report speedup over unspecified base system, not absolute times • Report efficiency not absolute times • Report MFLOPS not absolute times (use inefficient algorithm) [ David Bailey “Twelve ways to fool the masses when giving performance results for parallel supercomputers” ]
Amdahl’s Law • Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced] • Best you could ever hope to do: Speedup_maximum = 1 / (1 − Fraction_enhanced)
Amdahl’s Law example • New CPU 10X faster • I/O bound server, so 60% of time is spent waiting for I/O • Speedup_overall = 1 / [(1 − 0.4) + 0.4/10] = 1 / 0.64 = 1.56 • Apparently, it’s human nature to be attracted by "10X faster" vs. keeping in perspective that it’s just 1.6X faster
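A quick Python check of this example; the function is a straightforward transcription of Amdahl's Law above:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the workload is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# I/O-bound server: 60% of time waits on I/O, so only 40% benefits
# from a 10x faster CPU.
print(amdahl_speedup(0.4, 10))  # 1.5625, not 10x
print(1.0 / (1.0 - 0.4))        # 1.67: ceiling even with an infinitely fast CPU
```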
Computer Performance • CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
What affects each factor:
Program: Inst Count X, CPI X
Compiler: Inst Count X, CPI (X)
Instruction Set: Inst Count X, CPI X
Organization: CPI X, Clock Rate X
Technology: Clock Rate X
Cycles Per Instruction (Throughput) • "Average Cycles per Instruction": CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count • CPU time = Cycle Time × Σ_i (CPI_i × IC_i) • "Instruction Frequency": CPI = Σ_i (CPI_i × F_i), where F_i = IC_i / Instruction Count
Example: Calculating CPI bottom up • Run benchmark and collect workload characterization (simulate, machine counters, or sampling) • Base Machine (Reg / Reg), typical mix of instruction types in a program:
Op: Freq, Cycles, CPI(i), (% Time)
ALU: 50%, 1, 0.5, (33%)
Load: 20%, 2, 0.4, (27%)
Store: 10%, 2, 0.2, (13%)
Branch: 20%, 2, 0.4, (27%)
Total CPI = 1.5
Design guideline: make the common case fast. MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.
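A Python sketch reproducing this bottom-up CPI calculation; the instruction count and clock rate at the end are hypothetical:

```python
# Weighted-CPI calculation for the instruction mix above.
mix = {                # op: (frequency, cycles)
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(cpi)  # 1.5

# Share of execution time per class: freq * cycles / CPI.
for op, (freq, cycles) in mix.items():
    print(f"{op}: {freq * cycles / cpi:.0%}")  # ALU 33%, Load 27%, ...

# CPU time = instruction count x CPI / clock rate.
insts, clock_hz = 1e9, 2e9
print(insts * cpi / clock_hz, "seconds")       # 0.75 s
```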
Power and Energy • Energy to complete operation (Joules) • Corresponds approximately to battery life • (Battery energy capacity actually depends on rate of discharge) • Peak power dissipation (Watts = Joules/second) • Affects packaging (power and ground pins, thermal design) • di/dt, peak change in supply current (Amps/second) • Affects power supply noise (power and ground pins, decoupling capacitors)
CMOS Power Equations • P = A·C·V²·f + τ·A·V·I_short·f + V·I_leak • First term: dynamic power consumption (charging/discharging capacitance) • Second term: power due to short-circuit current during transition • Third term: power due to leakage current • To cut dynamic power: reduce the supply voltage V (quadratic effect) • To keep speed at lower V: reduce the threshold voltage Vt, at the cost of more leakage
CMOS Scaling • Historic CMOS scaling • Doubling every two years (Moore’s law) • Feature size • Device density • Device switching speed improves 30-40%/generation • Supply & threshold voltages decrease (Vdd, Vth) • Projected CMOS scaling • Feature size, device density scaling continues • ~10 year roadmap out to sub-10nm generation • Switching speed improves ~20%/generation • Voltage scaling tapers off quickly • SRAM cell stability becomes an issue at ~0.7V Vdd
Dynamic Power • P_dynamic ≈ A·C·V²·f • Static CMOS: current flows when active • Combinational logic evaluates new inputs • Flip-flop, latch captures new value (clock edge) • Terms • C: capacitance of circuit • wire length, number and size of transistors • V: supply voltage • A: activity factor • f: frequency • Future: fundamentally power-constrained
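A small Python illustration of the dynamic power equation; all component values are hypothetical:

```python
def dynamic_power(c_farads, v_volts, activity, f_hz):
    """Dynamic switching power: P = A * C * V^2 * f."""
    return activity * c_farads * v_volts ** 2 * f_hz

# Hypothetical block: 1 nF effective switched capacitance, 1.0 V,
# activity factor 0.1, 2 GHz clock.
p = dynamic_power(1e-9, 1.0, 0.1, 2e9)
print(p, "W")  # 0.2 W

# Quadratic payoff: dropping V by 20% (and f with it) saves ~half the power.
print(dynamic_power(1e-9, 0.8, 0.1, 1.6e9) / p)  # ~0.51
```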
Reducing Dynamic Power • Reduce capacitance • Simpler, smaller design (yeah right) • Reduced IPC • Reduce activity • Smarter design • Reduced IPC • Reduce frequency • Often in conjunction with reduced voltage • Reduce voltage • Biggest hammer due to quadratic effect, widely employed • Can be static (binning/sorting of parts), and/or • Dynamic (power modes) • E.g. Transmeta LongRun, AMD PowerNow!, Intel SpeedStep
Frequency/Voltage relationship • Lower voltage implies lower frequency • Lower Vdd increases delay to sense/latch 0/1 • Conversely, higher voltage enables higher frequency • Overclocking • Sorting/binning and setting various Vdd & Vth • Characterize device, circuit, chip under varying stress conditions • Black art: very empirical & closely guarded trade secret • Implications on reliability • Safety margins, product lifetime • This is why overclocking is possible
Frequency/Voltage Scaling • Voltage/frequency scaling rule of thumb: • +/- 1% performance buys -/+ 3% power (3:1 rule) • Hence, any power-saving technique that saves less than 3x power over performance loss is uninteresting • Example 1: • New technique saves 12% power • However, performance degrades 5% • Useless, since 12 < 3 x 5 • Instead, reduce f by 5% (also V), and get 15% power savings • Example 2: • New technique saves 5% power • Performance degrades 1% • Useful, since 5 > 3 x 1 • Does this rule always hold?
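A sketch of this rule of thumb, assuming the usual simplified model that power scales roughly as f³ when voltage is scaled along with frequency (an approximation, not an exact law):

```python
# Rule-of-thumb model: P ~ V^2 * f with V ~ f gives P ~ f^3,
# i.e. roughly 3% power per 1% performance.
def power_ratio(freq_scale):
    return freq_scale ** 3

# Example 1: a technique saving 12% power at a 5% performance cost
# is beaten by simply scaling f (and V) down 5%:
print(1 - power_ratio(0.95))  # ~0.14 -> ~14% power saved for the same 5% loss

# Example 2: 5% power for 1% performance beats the ~3% scaling would give:
print(1 - power_ratio(0.99))  # ~0.03
```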
Leakage Power (Static/DC) • Transistors aren’t perfect on/off switches • Even in static CMOS, transistors leak • Channel (source/drain) leakage • Gate leakage through insulator • High-K dielectric replacing SiO2 helps • Leakage compounded by • Low threshold voltage • Low Vth => fast switching, more leakage • High Vth => slow switching, less leakage • Higher temperature • Temperature increases with power • Power increases with C, V², A, f • Rough approximation: leakage proportional to area • Transistors aren’t free, unless they’re turned off • Huge problem in future technologies • Estimates are 40%-50% of total power • (Figure: transistor cross-section with Source, Gate, Drain)
Power vs. Energy • (Figure: power-vs-time curves; energy is the area under each) • Energy: integral of power (area under the curve) • Energy & power driven by different design constraints • Power issues: • Power delivery (supply current @ right voltage) • Thermal (don’t fry the chip) • Reliability effects (chip lifetime) • Energy issues: • Limited energy capacity (battery) • Efficiency (work per unit energy) • Different usage models drive tradeoffs
Power vs. Energy • With a constant time base, the two are "equivalent": 10% reduction in power => 10% reduction in energy • Once time changes, must treat them as separate metrics • E.g. reduce frequency to save power => reduce performance => increase time to completion => consume more energy (perhaps) • Metric: energy-delay product per unit of work • Tries to capture both effects; accounts for quadratic savings from DVS • Others advocate energy-delay² (accounts for the cubic effect) • Best to consider all: plot performance (time), energy, ED, ED²
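An illustrative Python comparison of E, ED, and ED² for two made-up design points:

```python
# Compare design points on energy, energy-delay, and energy-delay^2.
# Numbers are invented for illustration only.
designs = {
    #           (time s, avg power W)
    "base":     (1.00, 10.0),
    "dvs_slow": (1.25,  5.1),   # slower, but lower voltage and power
}

for name, (t, p) in designs.items():
    e = p * t   # energy = integral of power; ~ P * T for constant power
    print(f"{name}: E={e:.2f} J  ED={e*t:.2f}  ED2={e*t*t:.2f}")
# dvs_slow wins on E and ED, but the gap narrows on ED^2,
# which penalizes the longer runtime more heavily.
```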
Usage Models • Thermally limited (worst case) => dynamic power dominates • Max power ("power virus" contest at Intel) • Must deliver adequate power (or live within budget) • Must remove heat • From chip, from case, from room, from building • Chip hot spots cause problems • Efficiency (average case) => dynamic & static power matter • E.g. energy per DVD frame • Analogy: cell-phone "talk time" • Longevity (best case) => static power dominates • Minimum power while still "awake" • Cellphone "standby" time • Laptop still responds quickly • Not suspend/hibernate • "Power state" management very important • SpeedStep, PowerNow!, LongRun
Architectural Techniques • Multicore chips (later) • Clock gating (dynamic power) • 70% of dynamic power in IBM Power5 [Jacobson et al., HPCA 04] • Inhibit clock for • Functional block • Pipeline stage • Pipeline register (sub-stage) • Widely used in real designs today • Control overhead, timing complexity (violates fully synchronous design rules) • Power gating (leakage power) • (Big) sleep transistor cuts off Vdd or ground path • Apply to FU, cache subarray, even entire core in CMP
Architectural Techniques • Cache reconfiguration (leakage power) • Not all applications or phases require full L1 cache capacity • Power gate portions of cache memory • State-preservation • Flush/refill (non-state preserving) [Powell et al., ISLPED 2000] • Drowsy cache (state preserving) [Flautner et al., ISCA 2002] • Complicates a critical path (L1 cache access) • Does not apply to lower level caches • High Vth (slower) transistors already prevent leakage
Architectural Techniques • Filter caches (dynamic power) • Many references are required for correctness but result in misses • External snoops [Jetty, HPCA ‘01] • Load/store alias checks [Sethumadhavan et al., MICRO ‘03] • Filter caches summarize cache contents (e.g. Bloom filter) • Much smaller filter cache lookup avoids lookup in large/power-hungry structure • Heterogeneous cores [Kumar et al., MICRO-36] • Prior-generation simple core consumes small fraction of die area • Use simple core to run low-ILP workloads • And many others…check proceedings of • ISLPED, MICRO, ISCA, HPCA, ASPLOS, PACT
Variability • Shrinking device dimensions lead to sensitivity to minor processing variations “No two transistors are the same” • Die-to-die variations • Across multiple die on same wafer, across wafers • Within-die variations • Systematic and random • E.g. line edge roughness due to sub-wavelength lithography or dopant variations (~10 molecules) • Dynamic variations • E.g. temperature-induced variability (hot spots)
Peak Power versus Lower Energy • (Figure: two power-vs-time curves, Peak A above Peak B; integrate each power curve to get energy) • System A has higher peak power, but lower total energy • System B has lower peak power, but higher total energy
Fixed Chip Power Budget • (Figure: execution timeline; parallel fraction f runs on n CPUs, serial fraction 1−f on one) • Amdahl’s Law: Speedup = 1 / [(1 − f) + f/n] • Ignores the (power) cost of n cores • Revised Amdahl’s Law: more cores => each core is slower • Parallel speedup < n • Serial portion (1−f) takes longer • Also, interconnect and scaling overhead
Fixed Power Scaling • (Figure: speedup vs. number of cores under a fixed power budget) • A fixed power budget forces slow cores • Serial code quickly dominates
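A hedged sketch of this effect, assuming per-core power ∝ f³ (so n cores sharing a fixed budget each run at relative frequency n^(−1/3)); this is a simplified model of my own, not a formula from the slides:

```python
# "Revised Amdahl" under a fixed chip power budget, assuming per-core
# power ~ f^3 (V scales with f). With n cores sharing the budget,
# each core runs at relative frequency n**(-1/3).
def fixed_power_speedup(f_parallel, n):
    core_speed = n ** (-1.0 / 3.0)            # slower cores
    serial = (1.0 - f_parallel) / core_speed  # serial part takes longer
    parallel = f_parallel / (n * core_speed)
    return 1.0 / (serial + parallel)

for n in (1, 2, 4, 8, 16):
    print(n, round(fixed_power_speedup(0.9, n), 2))
# 1.0, 1.44, 1.94, 2.35, 2.54: speedup saturates quickly, and with
# still more cores the slowed serial portion makes it fall.
```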
Concepts from Probability Theory • Cumulative distribution function (CDF): F(t) = prob[x ≤ t] = ∫₀^t f(x) dx • Probability density function (pdf): f(t) = prob[t ≤ x ≤ t + dt] / dt = dF(t)/dt • Expected value of x: E_x = ∫_−∞^+∞ x f(x) dx = Σ_k x_k f(x_k) • Variance of x: σ_x² = ∫_−∞^+∞ (x − E_x)² f(x) dx = Σ_k (x_k − E_x)² f(x_k) • Covariance of x and y: ψ_x,y = E[(x − E_x)(y − E_y)] = E[xy] − E_x E_y • (Figure: lifetimes of 20 identical systems)
Layers of Safeguards • (Figure: four layers of safeguards, each with 1% miss probability, giving 10⁻⁸ overall miss probability if misses are independent) • With multiple layers of safeguards, a system failure occurs only if warning symptoms and compensating actions are missed at every layer, which is quite unlikely • Is it really? The computer engineering literature is full of examples of mishaps when two or more layers of protection failed at the same time • Multiple layers increase the reliability significantly only if the "holes" in the representation above are fairly randomly and independently distributed, so that the probability of their being aligned is negligible • Dec. 1986: ARPANET had 7 dedicated lines between NY and Boston; a backhoe accidentally cut all 7 (they went through the same conduit)
Reliability and MTTF • Two-state nonrepairable system: start in the Up state, move to Down on Failure • Reliability R(t): probability that the system remains in the "Good" state throughout the interval [0, t] • R(t) = 1 − F(t), where F(t) is the CDF of the system lifetime (its unreliability) • Hazard function z(t): R(t + dt) = R(t)[1 − z(t) dt] • Mean time to failure: MTTF = ∫₀^∞ t f(t) dt = ∫₀^∞ R(t) dt • Expected value of lifetime = area under the reliability curve (easily provable) • Exponential reliability law: constant hazard function z(t) = λ gives R(t) = e^(−λt) (system failure rate is independent of its age)
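A small Python sketch of the exponential law, checking MTTF = 1/λ both directly and as the area under R(t); the failure rate is made up:

```python
import math

def reliability_exp(t, lam):
    """Exponential reliability law: R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

lam = 1e-4  # failures per hour (hypothetical) -> MTTF = 10,000 h
print(1 / lam)                     # MTTF, hours
print(reliability_exp(1000, lam))  # ~0.905: 90.5% chance of surviving 1000 h

# MTTF as the area under R(t), approximated by a Riemann sum:
dt = 1.0
print(sum(reliability_exp(t, lam) * dt for t in range(0, 200_000)))  # ~10,000
```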
Failure Distributions of Interest • Exponential: z(t) = λ, R(t) = e^(−λt), MTTF = 1/λ • Rayleigh: z(t) = 2λ(λt), R(t) = e^(−(λt)²), MTTF = (1/λ)√π/2 • Weibull (generalized exponential): z(t) = αλ(λt)^(α−1), R(t) = e^(−(λt)^α), MTTF = (1/λ)Γ(1 + 1/α); α = 1 gives exponential, α = 2 gives Rayleigh • Erlang: MTTF = k/λ; generalizes the exponential • Gamma: generalizes Erlang (becomes Erlang for integer shape parameter b) • Normal: reliability and MTTF formulas are complicated • Discrete versions: Geometric R(k) = q^k, Discrete Weibull, Binomial
Elaboration on Weibull Distribution • Weibull: z(t) = αλ(λt)^(α−1), R(t) = e^(−(λt)^α) • α < 1: infant mortality (decreasing hazard) • α = 1: constant hazard rate (exponential) • 1 < α < 4: rising hazard (fatigue, corrosion) • α > 4: rising hazard (rapid wearout) • Weibull plot is linear: ln ln[1/R(t)] = α(ln t + ln λ) • The following diagrams are from: http://www.rpi.edu/~albenr/presentations/Reliabilty.ppt • (Example figure: one cycle, α = 2.6)
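A Python sketch of the Weibull formulas, checking how the hazard behaves across the α regimes above (λ is arbitrary):

```python
import math

def weibull_R(t, lam, alpha):
    """Weibull reliability: R(t) = exp(-(lambda*t)^alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    """Weibull hazard: z(t) = alpha * lam * (lam*t)^(alpha-1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

lam = 0.001
for alpha in (0.5, 1.0, 2.6):  # infant mortality, exponential, wearout
    ratio = weibull_hazard(1000, lam, alpha) / weibull_hazard(10, lam, alpha)
    print(alpha, round(ratio, 3))  # <1 falling, 1 constant, >1 rising hazard

# MTTF = (1/lam) * Gamma(1 + 1/alpha):
print((1 / lam) * math.gamma(1 + 1 / 2.6))  # ~888 for alpha = 2.6
```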
Comparing Reliabilities • (Figure: reliability functions for Systems 1 and 2) • Reliability difference: R₂ − R₁ • Reliability gain: R₂ / R₁ • Reliability improvement factor: RIF₂/₁ = [1 − R₁(t_M)] / [1 − R₂(t_M)]; example: [1 − 0.9] / [1 − 0.99] = 10 • Reliability improvement index: RII = log R₁(t_M) / log R₂(t_M) • Mission time extension: MTE₂/₁(r_G) = T₂(r_G) − T₁(r_G) • Mission time improvement factor: MTIF₂/₁(r_G) = T₂(r_G) / T₁(r_G)
Analog of Amdahl’s Law for Reliability • Reliability improvement index: RII = log R_original / log R_improved • Amdahl’s law: if in a unit-time computation a fraction f doesn’t change and the remaining fraction 1 − f is sped up to run p times as fast, the overall speedup is s = 1 / (f + (1 − f)/p) • Consider a system with two parts, having failure rates φ and λ − φ • Improve the failure rate of the second part by a factor p, to (λ − φ)/p • R_original = exp(−λt), R_improved = exp[−(φ + (λ − φ)/p)t] • RII = λ / (φ + (λ − φ)/p) • Letting φ/λ = f, we have RII = 1 / (f + (1 − f)/p)
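A direct Python transcription of the RII formula, showing the Amdahl-like ceiling of 1/f:

```python
def rii(f, p):
    """Reliability improvement index when a fraction f of the failure
    rate is untouched and the rest is improved by a factor p."""
    return 1.0 / (f + (1.0 - f) / p)

# Improving one part 10x when it accounts for half the failure rate:
print(rii(0.5, 10))   # ~1.82
print(rii(0.5, 1e9))  # -> 2.0 ceiling, like Amdahl's 1/f limit
```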
2.3 Availability, MTTR, and MTBF • (Fig. 2.5: two-state repairable system; Up and Down states, Failure rate λ, Repair rate μ, with 1/μ = MTTR) • (Interval) availability A(t): fraction of time that the system is in the "Up" state during the interval [0, t] • Pointwise availability a(t): probability that the system is available at time t • A(t) = (1/t) ∫₀^t a(x) dx • Steady-state availability: A = lim_{t→∞} A(t) • A = MTTF / (MTTF + MTTR) = MTTF / MTBF = μ / (λ + μ) (will justify this equation later) • Availability = reliability, when there is no repair • Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair) • In general, μ >> λ, leading to A ≈ 1
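A minimal Python sketch of steady-state availability; the MTTF/MTTR figures are hypothetical:

```python
def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = mu / (lambda + mu)."""
    return mttf / (mttf + mttr)

# Hypothetical server: fails every 10,000 h on average, repaired in 4 h.
a = steady_state_availability(10_000, 4)
print(a)                                                  # ~0.9996
print((1 - a) * 365 * 24, "hours of downtime per year")   # ~3.5 h
```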
1.2 A Motivating Case Study • Data availability and integrity concerns • Distributed DB system with 5 sites • Full connectivity, dedicated links • Only direct communication allowed • Sites and links may malfunction • Redundancy improves availability • S: probability of a site being available • L: probability of a link being available • Single-copy availability = SL • Unavailability = 1 − SL = 1 − 0.99 × 0.95 = 5.95% • (Fig. 1.2: data replication methods, and a challenge) • File duplication: home / mirror sites • File triplication: home / backup 1 / backup 2 • Are there availability improvement methods with less redundancy?
Data Duplication: Home and Mirror Sites • (Figure: a user reaches file Fi at its home site or, failing that, at its mirror site) • S: site availability, e.g., 99% • L: link availability, e.g., 95% • A = SL + (1 − SL)·SL: first term, primary site can be reached; second term, primary site inaccessible but mirror site can be reached • Duplicated availability = 2SL − (SL)² • Unavailability = 1 − 2SL + (SL)² = (1 − SL)² = 0.35% • Data unavailability reduced from 5.95% to 0.35% • Availability improved from 94% to 99.65%
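The case-study numbers, checked in a few lines of Python:

```python
S, L = 0.99, 0.95  # site and link availability from the example

single = S * L
duplicated = 2 * S * L - (S * L) ** 2  # = SL + (1 - SL) * SL

print(f"single copy: available {single:.2%}, unavailable {1 - single:.2%}")
print(f"home+mirror: available {duplicated:.4%}, "
      f"unavailable {(1 - S * L) ** 2:.2%}")
# single copy: available 94.05%, unavailable 5.95%
# home+mirror: available 99.6460%, unavailable 0.35%
```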