This article explores the challenges faced in high-performance computer architecture and discusses solutions to handle these challenges. Topics include clock speed improvements, parallelism, transistor advancements, and power consumption.
High Performance Computer Architecture Challenges
Rajeev Balasubramonian, School of Computing, University of Utah
Dramatic Clock Speed Improvements!!
• The 1st Intel processor: 108 KHz
• Intel Pentium 4: 3.2 GHz
Clock Speed = Performance?
• The Intel Pentium4 has a higher clock speed than the IBM Power4 – does the Pentium4 execute your program faster?
• [Timing diagram: Case 1 and Case 2 compare instructions completing per clock tick over time]
The Basic Pipeline
• Consider an automobile assembly line with four stages (Stage 1 through Stage 4, 1 day each): a new car rolls out every day
• Split the work into shorter stages (half a day each): a new car rolls out every half day
• In each case, it takes 4 days to build a car, but more stages → more parallelism and less time between cars
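The assembly-line arithmetic above can be sketched directly; the stage counts and stage times come from the example, and the function name is just illustrative:

```python
def pipeline_metrics(num_stages, stage_time_days):
    """Latency and steady-state throughput of an ideal pipeline."""
    latency = num_stages * stage_time_days   # time to finish one car
    throughput = 1 / stage_time_days         # cars completed per day
    return latency, throughput

# Four stages of 1 day each: a car rolls out every day.
print(pipeline_metrics(4, 1.0))   # (4.0, 1.0)

# Eight stages of half a day each: same 4-day latency, twice the rate.
print(pipeline_metrics(8, 0.5))   # (4.0, 2.0)
```

Note that pipelining never shortens the latency of one car (or one instruction); it only raises the completion rate.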
What Determines Clock Speed?
• Clock speed is a function of work done in each stage – in the earlier examples, the effective "clock speeds" were 1 car/day and 2 cars/day
• Similarly, it takes plenty of "work" to execute an instruction, and this work is broken into stages
• If each stage of the execution of a single instruction takes 250ps → 4GHz clock speed
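The 250ps → 4GHz relationship is just the reciprocal of the stage delay; a one-line sketch (the function name is illustrative):

```python
def clock_frequency_ghz(stage_delay_ps):
    """Clock frequency implied by the work done in one pipeline stage.

    1 GHz is one cycle per nanosecond, and there are 1000 ps in a ns,
    so frequency in GHz is 1000 / (stage delay in ps).
    """
    return 1000.0 / stage_delay_ps

print(clock_frequency_ghz(250))   # 4.0 (GHz), the slide's example
print(clock_frequency_ghz(500))   # 2.0: halving the stage work doubles the clock
```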
Clock Speed Improvements
• Why have we seen such dramatic improvements in clock speed?
• Work has been broken up into more stages: early Intel chips executed work equivalent to approximately 56 logic gates per stage; today's chips execute about 12 logic gates' worth of work
• Transistors have become faster: as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubles every 5-6 years)
Will these Improvements Continue?
• Transistors will continue to shrink and become faster for at least 10 more years
• Each pipeline stage is already pretty small – improvements from this factor will cease
• If clock speed improvements stagnate, should we turn our focus to parallelism?
Microprocessor Blocks
[Block diagram: Branch Predictor, L1 Instr Cache, Decode & Rename, Issue Logic, Register File, four ALUs, L1 Data Cache, L2 Cache]
Innovations: Branch Predictor
• Improve prediction accuracy by detecting frequent patterns
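The slide does not name a specific predictor; a minimal sketch of the classic two-bit saturating-counter scheme, one common way to exploit frequently repeating branch behavior (the table size and indexing are illustrative):

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not-taken, 2-3 taken."""

    def __init__(self, table_size=1024):
        self.counters = [2] * table_size   # start "weakly taken"
        self.mask = table_size - 1

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
# A loop branch that is taken 9 times, then falls through once:
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (bp.predict(0x40) == taken)
    bp.update(0x40, taken)
print(correct)  # 9 of 10 predictions correct
```

The two-bit hysteresis means a single anomalous outcome (the loop exit) does not flip the prediction for the next visit to the loop.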
Innovations: Out-of-order Issue
• Out-of-order issue: if later instructions do not depend on earlier ones, execute them first
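A toy sketch of the issue decision above; the instruction encoding and register names are made up for illustration:

```python
# Toy out-of-order issue: an instruction may issue as soon as its source
# registers are ready, even if an earlier instruction is still waiting.
def issuable(instructions, ready_regs):
    return [name for name, dest, srcs in instructions
            if all(s in ready_regs for s in srcs)]

program = [
    ("load r1, [mem]",   "r1", []),            # cache miss: r1 not ready for a while
    ("add  r2, r1, r3",  "r2", ["r1", "r3"]),  # depends on the load -> must wait
    ("mul  r4, r5, r6",  "r4", ["r5", "r6"]),  # independent: can go ahead of the add
]
print(issuable(program, ready_regs={"r3", "r5", "r6"}))
# ['load r1, [mem]', 'mul  r4, r5, r6']
```

The `mul` issues ahead of the stalled `add`, which is exactly the reordering the slide describes.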
Innovations: Superscalar Architectures
• Multiple ALUs: increase execution bandwidth
Innovations: Data Caches
• 2K papers on caches: efficient data layout, stride prefetching
Summary
• Historically, computer engineers have focused on performance
• Performance is a function of clock speed and parallelism
• As technology improves, clock speeds will improve, although at a slower rate
• Parallelism has been gradually improving, and plenty of low-hanging fruit has been picked
Outline
• Recent Microprocessor History
• Current Trends and Challenges
• Solutions for Handling these Challenges
Trend I: An Opportunity
• Transistors on a chip have been doubling every two years (Moore's Law)
• In the past, transistors have been used for out-of-order logic, large caches, etc.
• In the future, transistors can be employed for multiple processors on a single chip
Chip Multiprocessors (CMP)
• The IBM Power4 has two processors on a die
• Sun has announced the 8-processor Niagara
[Diagram: processors P1-P4 sharing an L2 cache]
The Challenge
• Nearly every chip will have multiple processors, but where are the threads?
• Some applications will truly benefit – they can be easily decomposed into threads
• Some applications are inherently sequential – can we execute speculative threads to speed up these programs? (open problem!)
Trend II: Power Consumption
• Power ∝ a · f · C · V², where a is activity factor, f is frequency, C is capacitance, and V is voltage
• Every new chip has higher frequency, more transistors (higher C), and slightly lower voltage – the net result is an increase in power consumption
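The dynamic power relation above can be evaluated directly; all the parameter values below are hypothetical, chosen only to show how the terms trade off:

```python
def dynamic_power(activity, freq_hz, capacitance_f, voltage_v):
    """Dynamic power P = a * f * C * V^2, in watts."""
    return activity * freq_hz * capacitance_f * voltage_v ** 2

# Hypothetical chip: a = 0.1, f = 3 GHz, C = 100 nF switched, V = 1.2 V
p = dynamic_power(0.1, 3e9, 100e-9, 1.2)
print(round(p, 1))   # 43.2 (watts)

# Frequency and capacitance enter linearly, but the V^2 term makes
# voltage scaling the most effective lever:
print(round(dynamic_power(0.1, 3e9, 100e-9, 1.0), 1))   # 30.0
```

This is why the slide's observation matters: higher f and higher C push power up linearly, and the "slightly lower voltage" of each generation is not enough to offset them.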
Scary Slide!
• Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)
Impact of Power Increases
• Well, Utah Power sends you fatter bills every month
• To maintain constant chip temperature, heat produced on a chip has to be dissipated away – every additional watt increases the cooling cost of a chip by approximately $4!!
• If the temperature of a chip rises, the power dissipated also increases (almost exponentially) → a vicious cycle!
Trend III: Wire Delays
• As technology improves, logic gates shrink → their speed increases and clock speeds improve
• As logic gates shrink, wires shrink too – unfortunately, their speed improves only marginally
• In relative terms, future chips will have fast transistors/gates and slow wires
• Computation is cheap, communication is expensive!
Impact of Wire Delays
• Crossing the chip used to take one cycle
• In the future, crossing the chip can take up to 30 cycles
• Many structures on a chip are wire-constrained (register file, cache) – their access times slow down → throughput decreases as instructions sit around waiting for values
• Long wires also consume power
Trend IV: Soft Errors
• High-energy particles constantly collide with objects and deposit charge
• Transistors are becoming smaller and on-chip voltages are being lowered → it doesn't take much to toggle the state of a transistor
• The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period
Impact of Soft Errors
• When a particle strike occurs, the component is not rendered permanently faulty – only the value it contains is erroneous
• Hence, this is termed a transient fault or soft error
• The error propagates when other instructions read this faulty value
• This is already a problem for mission-critical apps (space, defense, highly-available servers) and may soon be a problem in other domains
Summary of Trends
• More transistors, more processors on a single chip
• High power consumption
• Long wire delays
• Frequent soft errors
• We are attempting to exploit transistors to increase parallelism – in light of the above challenges, we'd be happy to even preserve parallelism
Transistors & Wire Delays
• Bring in a large window of instructions so you can find high parallelism
• Distribute instructions across processors so that communication is minimized
[Diagram: a window of instructions mapped across processors]
Difficult Branches
• Mispredicted branches result in poor parallelism and wasted work (power)
• Solution: when you arrive at a fork, take both directions – execute on low-frequency units to control power dissipation levels
Thermal Emergencies
• Heterogeneous units allow you to reduce cooling costs
• If a chip's peak power is 110W, allow enough cooling to handle 100W average – save $40/chip!
• If the application starts consuming more than 100W and temperature starts to rise, start favoring the low-power processor cores – intelligent management allows you to make forward progress even in a thermal emergency
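The $40/chip figure follows directly from the earlier ~$4-per-watt cooling cost; a one-line sketch of the arithmetic (the function name is illustrative):

```python
COOLING_COST_PER_WATT = 4.0   # from the slides: ~$4 per additional watt

def cooling_savings(peak_watts, provisioned_watts):
    """Dollars saved per chip by provisioning cooling below peak power."""
    return (peak_watts - provisioned_watts) * COOLING_COST_PER_WATT

print(cooling_savings(110, 100))   # 40.0 dollars per chip
```

The catch, as the slide notes, is that the chip must then throttle itself (favor low-power cores) whenever the application pushes power above the provisioned 100W.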
Handling Long Wire Delays
• Wires can be designed to have different properties
• Knob 1: wire width and spacing – fat wires are faster, but have low bandwidth
Handling Wire Capacitance
• Knob 2: repeaters/buffers on wires – many, large buffers → low delay, high power consumption
Mapping Data to Wires
• We can optimize wires for delay, bandwidth, or power
• Different data transfers on a chip have different latency and bandwidth needs – an intelligent mapping of data to wires can improve performance and lower power consumption
Handling Soft Errors
• Errors can be detected and corrected by providing redundancy – execute two copies of a program (perhaps on a CMP) and compare results
• Note that this doubles power consumption!
[Diagram: a leading thread and a trailing thread running the same program]
Handling Soft Errors
• The trailing thread is capable of higher performance than the leading thread (peak throughput: 2 BIPS vs. 1 BIPS) – it never fetches data from memory and never guesses at branches
• But there's no point catching up – hence, artificially slow the trailing thread by lowering its frequency → lower power dissipation
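The detection step of the redundant-thread scheme reduces to comparing the two threads' result streams; a minimal sketch (the result values and function name are made up for illustration):

```python
# Sketch of soft-error detection via redundant execution: run the same
# computation on two threads and flag any mismatch as a transient fault.
def detect_soft_errors(leading_results, trailing_results):
    """Return the indices at which the two threads' results disagree."""
    return [i for i, (a, b) in enumerate(zip(leading_results, trailing_results))
            if a != b]

leading  = [10, 20, 31, 40]   # a particle strike corrupted the third result
trailing = [10, 20, 30, 40]
print(detect_soft_errors(leading, trailing))   # [2]
```

Because a soft error flips a value rather than breaking the hardware, a mismatch at index 2 can be repaired by re-executing from that point, rather than by replacing the component.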
Summary of Solutions
• Heterogeneous wires and processors
• Instructions and data have different needs: map them to appropriate wires and processors
• Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies
Conclusions
• Performance has improved because of clock speed and parallelism advances
• Clock speed improvements will continue at a slower rate
• Parallelism is on a downward trend because of technology trends and because low-hanging fruit has been picked
• We must find creative ways to preserve or even improve parallelism in the future