
CS295: Modern Systems - The End of Conventional Processor Performance Scaling


Presentation Transcript


  1. CS295: Modern Systems - The End of Conventional Processor Performance Scaling • Sang-Woo Jun, Spring 2019

  2. Moore’s Law • Typically cast as: “Performance doubles every X months” • Actually closer to: “Number of transistors per unit cost doubles every two years” The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. […] Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. -- Gordon Moore, Electronics, 1965 Why is Moore’s Law conflated with processor performance?

  3. Dennard Scaling • “Power density stays constant as transistors get smaller” • Intuitively: • Smaller transistors → shorter propagation delay → faster frequency • Smaller transistors → smaller capacitance → lower voltage • Moore’s law → Faster performance @ Constant power!
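
A back-of-the-envelope version of this argument, written out as a sketch (my own reconstruction of the standard constant-field scaling rules, not text from the slide): shrink capacitance and voltage by a factor k while frequency rises by k, and per-transistor power falls by exactly the factor by which density rises.

```latex
% Classical (constant-field) Dennard scaling with factor k per shrink:
%   C -> C/k,   V -> V/k,   f -> k*f,   transistors per unit area -> x k^2
P_{\text{transistor}} = \alpha\, C V^{2} f
  \;\longrightarrow\;
  \alpha \cdot \frac{C}{k} \cdot \frac{V^{2}}{k^{2}} \cdot k f
  = \frac{\alpha\, C V^{2} f}{k^{2}},
\qquad
\text{power density} \;\propto\; \underbrace{k^{2}}_{\text{density}} \cdot \underbrace{k^{-2}}_{\text{per-transistor power}} \;=\; 1
```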

  4. Single-Core Performance Scaling Projection What happened?

  5. More Accurate Processor Power Consumption • Dynamic power (the prominent culprit) • Static power
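
The equation these labels annotate is presumably the standard CMOS power model; a reconstruction (the symbols are the conventional ones, not copied from the slide image): α is the switching activity, C the switched capacitance, V the supply voltage, f the clock frequency, and I_leak the leakage current.

```latex
P_{\text{total}}
  = \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic (switching) power}}
  \;+\; \underbrace{V \cdot I_{\text{leak}}}_{\text{static (leakage) power}}
```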

  6. Deeper Look At Leakage Current • Subthreshold leakage: power leaked before the voltage reaches the threshold • Can be reduced by increasing the threshold voltage (Vth) or decreasing the supply voltage (V) • Lower voltage at the same threshold voltage → unstable circuit • Threshold voltage does not scale well at small feature sizes (without reducing frequency … long story) • → Voltage cannot scale!
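
The "Constants" labels on the original slide point at an equation image; it is most likely the subthreshold-leakage model from the Kim et al. article cited on the last slide. A reconstruction (treat the exact form as an assumption): K1 and n are experimentally derived constants, W is the gate width, and V_θ = kT/q is the thermal voltage. Lowering Vth (or raising temperature) increases leakage exponentially, which is why Vth, and with it V, stopped scaling.

```latex
I_{\text{sub}}
  = K_{1}\, W\, e^{-V_{th}/(n\,V_{\theta})}\,\bigl(1 - e^{-V/V_{\theta}}\bigr),
\qquad V_{\theta} = \frac{kT}{q}
```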

  7. End of Dennard Scaling • Option 1: Continue scaling frequency at increased power budget • Chip cannot be cooled! • Thermal runaway: hotter chip → increased resistance → hotter chip → …

  8. Option 1: Continue Scaling Frequency at Increased Power Budget [figure: projection down to the 0.014 μm (14 nm) process node]

  9. Option 2: Stop Frequency Scaling • Dennard Scaling ended (~2006) • Danowitz et al., "CPU DB: Recording Microprocessor History," Communications of the ACM, 2012

  10. Looking Back: Change of Predictions

  11. But Moore’s Law Continues Beyond 2006

  12. State of Things at This Point (2006) • Single-thread performance scaling ended • Frequency scaling ended (Dennard Scaling) • Instruction-level parallelism scaling stalled … also around 2005 • Moore's law continues • Transistor count doubles every two years • What do we do with them? [chart: instruction-level parallelism and CPU trends, credit K. Olukotun, "Intel CPU Trends"]

  13. Crisis Averted With Manycores?

  14. Crisis Averted With Manycores?

  15. What Happened? • Dynamic power: gate oxide stopped scaling; voltage stopped scaling due to leakage • Static power: stopped scaling due to leakage • Only one knob left to turn

  16. What Used to Be: Intel Tick-Tock Model • What happened (as of 2019): in 2016, Intel deprecated tick-tock in favor of "process–architecture–optimization" • The 14 nm Skylake architecture has been stretched across a series of optimization steps (Skylake, Kaby Lake, Coffee Lake, Whiskey Lake), with no 10 nm chip in volume production yet

  17. Not All Transistors Can Be Active! • Utilization wall: "With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints." -- Venkatesh et al., ASPLOS '10 • The following slides are adapted from Michael Taylor's 2012 talk "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse" – marked '**'
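
A back-of-the-envelope derivation of why utilization drops exponentially, written as a sketch of the standard post-Dennard argument (the idealized scaling factors are my assumption, not figures from the paper): per generation the transistor count doubles (S² with S ≈ 1.4) and frequency can still rise by S, but with V fixed the capacitance reduction only buys 1/S, leaving per-transistor power flat. Running the whole chip at full speed therefore needs roughly 2× more power every generation against a flat budget, so the usable fraction halves.

```latex
% Post-Dennard scaling per generation (S ~ 1.4, so S^2 = 2x transistors):
%   C -> C/S,   f -> S f,   V -> V (leakage-limited, no longer scales)
\frac{P^{\text{full chip}}_{n+1}}{P^{\text{full chip}}_{n}}
  = \underbrace{S^{2}}_{\text{transistor count}}
    \cdot \underbrace{S^{-1}}_{C}
    \cdot \underbrace{1}_{V^{2}}
    \cdot \underbrace{S}_{f}
  = S^{2} \approx 2
\quad\Longrightarrow\quad
\text{utilization after } n \text{ generations} \;\approx\; \frac{P_{\text{budget}}}{P^{\text{full chip}}} \;\sim\; 2^{-n}
```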

  18. Tradeoffs Between Cores And Frequency** • This generation: 4 cores @ 1.8 GHz • Next generation … • 4 cores @ 2 × 1.8 GHz (12 cores dark) • 2×4 cores @ 1.8 GHz (8 cores dark, 8 dim) • 4×4 cores @ 0.9 GHz (all 16 dim)
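
The arithmetic behind these options, as a runnable sketch (my own illustration under simplified assumptions: dynamic power per core scales with C × f at fixed voltage, and two process shrinks give 4× the core slots while halving C per core); every option then lands on the same power budget as the baseline 4 cores @ 1.8 GHz.

```python
# Sketch (illustrative, not from the slides): dark-silicon tradeoff between
# core count and frequency under a fixed power budget.
# Assumptions: per-core dynamic power ~ C * f (voltage fixed, post-Dennard);
# two process generations give 4x the core slots and halve C per core.

BASELINE_CORES = 4
BASELINE_FREQ_GHZ = 1.8
POWER_BUDGET = BASELINE_CORES * 1.0 * BASELINE_FREQ_GHZ  # old-gen C = 1.0 (arbitrary units)

NEXT_GEN_CAP = 0.5          # capacitance per core after two shrinks (1/S^2, S ~ 1.41)
NEXT_GEN_CORE_SLOTS = 16    # 4x the transistors -> room for 16 baseline-sized cores

options = [
    ("4 fast cores",   4, 2 * BASELINE_FREQ_GHZ),  # 12 of 16 cores dark
    ("2x4 cores",      8, BASELINE_FREQ_GHZ),      # 8 dark, 8 dim (not sped up to 3.6 GHz)
    ("4x4 dim cores", 16, BASELINE_FREQ_GHZ / 2),  # nothing dark, everything dim
]

for name, cores, freq in options:
    power = cores * NEXT_GEN_CAP * freq
    dark = NEXT_GEN_CORE_SLOTS - cores
    print(f"{name:14s}: {cores:2d} cores @ {freq:.2f} GHz -> "
          f"power {power:.1f} vs budget {POWER_BUDGET:.1f}, {dark} cores dark")
```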

  19. The Four Horsemen** • What do we do with this dark silicon? • “Four top contenders, each of which seemed like an unlikely candidate from the beginning, carrying unwelcome burdens in design, manufacturing and programming. None is ideal, but each has its benefit and the optimal solution probably incorporates all four of them…”

  20. The Shrinking Horseman (#1)** • "Area is expensive. Chip designers will just build smaller chips instead of having dark silicon in their designs!" • First, dark silicon doesn't mean useless silicon; it just means silicon that is under-clocked or not used all of the time. • There's lots of dark silicon in current chips: • The on-chip GPU on AMD Fusion or Intel Sandy Bridge, when running non-graphics code such as GCC • L3 cache is very dark for applications with small working sets • SSE units for integer apps • Many FPGA resources (DSP blocks, PCI-E, Gig-E, etc.) go unused by many designs

  21. The Shrinking Horseman (#1)** • Competition and margins • If there is an advantage to be had from using dark silicon, you have to use it too, to keep up with the Joneses. • Diminishing returns (e.g., $10 of silicon selling for $200 today) • Savings shrink exponentially: $5, $2.50, $1.25, $0.63, … • Overheads: packaging, test, marketing, etc. • Chip structures like I/O pad area do not scale • Exponential increase in power density → exponential rise in temperature • But some chips will shrink: nasty low-margin, high-competition chips, or a monopoly (Sony Cell)

  22. The Dim Horseman (#2)** • Spatial dimming: have enough cores to exceed the power budget, but underclock them • Gen 1 & 2 multicores (higher core counts, lower frequencies) • Near-Threshold Voltage (NTV) operation • Delay loss > energy gain, but make it up with lots of dim cores • Watch for non-ideal speedups / Amdahl's Law (sketched below)
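
A quick illustration of that last caveat (my own sketch, not from the slides): Amdahl's Law caps the speedup by the serial fraction of the program, and dim cores run that serial fraction slower too.

```python
# Sketch: Amdahl's Law with dim (underclocked) cores.
# Speedup over one full-speed core = relative_clock / ((1 - p) + p / n).

def amdahl_speedup(parallel_fraction: float, cores: int, relative_clock: float = 1.0) -> float:
    """Speedup vs. one full-speed core when `cores` cores run at `relative_clock` x frequency."""
    serial = 1.0 - parallel_fraction
    return relative_clock / (serial + parallel_fraction / cores)

# 90%-parallel code: 16 half-speed dim cores barely beat 4 full-speed cores.
print(amdahl_speedup(0.90, cores=4,  relative_clock=1.0))  # ~3.1x
print(amdahl_speedup(0.90, cores=16, relative_clock=0.5))  # ~3.2x
print(amdahl_speedup(0.99, cores=16, relative_clock=0.5))  # ~7.0x: dim cores only pay off
                                                           # when the code is very parallel
```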

  23. The Dim Horseman (#2)** • Temporal dimming: have enough cores to exceed the power budget, but use them only in bursts • Dim cores, but overclock them while the chip is cold, e.g., Intel Turbo Boost • E.g., the ARM Cortex-A15 core in a mobile phone: its power usage is well above what a phone can sustain, so it runs in roughly 10-second bursts at most (big.LITTLE)

  24. The Specialized Horseman (#3)** • “We will use all of that dark silicon area to build specialized cores, each of them tuned for the task at hand (10-100x more energy efficient), and only turn on the ones we need…” • Insights: • Power is now more expensive than area • Specialized logic can improve energy efficiency by 10-1000x

  25. Fine-Grained Parallelism of Special-Purpose Circuits • Example: calculating gravitational force, Ret = (G × m1 × m2) / ((x1 - x2)² + (y1 - y2)²) • 8 instructions on a CPU → 8 cycles**: A = G × m1; B = A × m2; C = x1 - x2; D = C²; E = y1 - y2; F = E²; G = D + F; Ret = B / G • Much fewer cycles on a special-purpose circuit: • 4 cycles with the same basic operations (independent operations run in parallel) • 3 cycles with compound operations: A = G × m1 × m2; B = (x1 - x2)²; C = (y1 - y2)²; D = B + C; Ret = A / D • 1 cycle with even further compound operations (may slow down the clock)
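
A small sketch of why the cycle counts work out this way (my own illustration, not from the slides): on a spatial circuit the cycle count is the depth of the dataflow graph, while a simple single-issue CPU pays one cycle per instruction.

```python
from functools import lru_cache

# The 8-operation gravitational-force dataflow; each op lists the ops it depends on.
# ("DEN" is the denominator D + F, renamed here to avoid clashing with the constant G.)
deps = {
    "A": [],              # A = G * m1
    "B": ["A"],           # B = A * m2          (numerator)
    "C": [],              # C = x1 - x2
    "D": ["C"],           # D = C^2
    "E": [],              # E = y1 - y2
    "F": ["E"],           # F = E^2
    "DEN": ["D", "F"],    # DEN = D + F         (denominator)
    "Ret": ["B", "DEN"],  # Ret = B / DEN
}

@lru_cache(maxsize=None)
def finish_cycle(op: str) -> int:
    """Earliest cycle in which `op` can complete, if every ready op runs in parallel."""
    return 1 + max((finish_cycle(d) for d in deps[op]), default=0)

print("single-issue CPU:", len(deps), "cycles")            # 8 cycles
print("spatial circuit :", finish_cycle("Ret"), "cycles")  # 4 cycles
```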

  26. The Specialized Horseman (#3)** • C-cores Approach: • Fill dark silicon with Conservation Cores, or c-cores, which are automatically-generated, specialized energy-saving coprocessors that save energy on common apps • Execution jumps among c-cores (hot code) and a host CPU (cold code) • Power-gate HW that is not currently in use • Coherent Memory & Patching Support for C-cores

  27. Typical Energy Savings**

  28. The Specialized Horseman (#3) -- Pssst • Another active thrust in this area is reconfigurable hardware acceleration using Field-Programmable Gate Arrays (FPGA) • A single FPGA fabric can be configured at runtime to act like any C-core • Not as efficient as a prefabricated C-core, but can cover any function at runtime • More on this later!

  29. The Deus Ex Machina Horseman (#4)** • Deus Ex Machina: "A plot device whereby a seemingly unsolvable problem is suddenly and abruptly solved with the unexpected intervention of some new event, character, ability or object." • "MOSFETs are the fundamental problem" • "FinFETs, Tri-Gate, High-K, nanotubes, and 3D give one-time improvements, but none are sustainable solutions across process generations."

  30. The Deus Ex Machina Horseman (#4)** • Possible "Beyond CMOS" Device Directions • Nano-electro-mechanical relays? • Tunnel Field-Effect Transistors (TFETs)? • Spin-Transfer Torque MRAM (STT-MRAM)? • Graphene? • Human brain? • DNA computing?

  31. Where Do GPUs Fit into This? • GPUs have hundreds of threads running at gigahertz clock rates! • Much simpler processor architecture • Dozens of threads scheduled together in a SIMD fashion • Much simpler microarchitecture (doesn't need to boot Linux!) • Much higher power budget • CPUs try to maintain a ~100 W power budget (from the Pentium 4 until now) • GPUs regularly exceed 400 W

  32. The Exascale Challenge • The Department of Energy requests an exaflop machine by 2020: 1,000,000,000,000,000,000 floating point operations per second • Using 2016 technology, such a machine would draw about 200 MW • For comparison, the MIT research nuclear reactor produces 6 MW [photo: Lynn Freeny, Department of Energy]
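
The arithmetic behind these numbers, as a quick sketch (using only the figures on the slide; the GFLOPS-per-watt framing is my addition):

```python
# Sketch: putting the slide's exascale power numbers side by side.
EXAFLOP = 1e18               # target: 10**18 floating point operations per second
POWER_2016_TECH_W = 200e6    # slide: ~200 MW with 2016 technology
MIT_REACTOR_W = 6e6          # slide: MIT research reactor output, 6 MW

print("Efficiency delivered at 200 MW:",
      EXAFLOP / POWER_2016_TECH_W / 1e9, "GFLOPS/W")   # 5.0
print("MIT research reactors needed:",
      POWER_2016_TECH_W / MIT_REACTOR_W)               # ~33
```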

  33. Coming Up • Before we go into newer technologies, let's first make sure we make good use of what we have: SIMD (SSE, AVX), cache-optimized code, etc. • "Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best-published FPGA-based claims." • Li et al. (Intel), "Architecting to Achieve a Billion Requests Per Second Throughput on a Single Key-Value Store Server Platform," ISCA 2015 • A pure-software Intel implementation of memcached

  34. References • Kim, Nam Sung, et al. "Leakage Current: Moore's Law Meets Static Power." IEEE Computer 36.12 (2003): 68-75.
