This article explores the challenges faced in high-performance computer architecture and discusses solutions to handle these challenges. Topics include clock speed improvements, parallelism, transistor advancements, and power consumption.
High Performance Computer Architecture Challenges
Rajeev Balasubramonian, School of Computing, University of Utah
Dramatic Clock Speed Improvements!!
• The 1st Intel processor: 108 KHz
• Intel Pentium 4: 3.2 GHz
Clock Speed = Performance?
• The Intel Pentium4 has a higher clock speed than the IBM Power4 – does the Pentium4 execute your program faster?
• [Timing diagram: Case 1 and Case 2 compare instructions completing per clock tick over time]
The Basic Pipeline
• Consider an automobile assembly line with four stages (Stage 1 through Stage 4, 1 day each): a new car rolls out every day
• Split the work into shorter stages (half a day each): a new car rolls out every half day
• In each case, it takes 4 days to build a car, but more stages → more parallelism and less time between cars
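The assembly-line arithmetic above can be sketched directly; the stage counts and stage times come from the example, and the function name is just illustrative:

```python
def pipeline_metrics(num_stages, stage_time_days):
    """Latency and steady-state throughput of an ideal pipeline."""
    latency = num_stages * stage_time_days   # time to finish one car
    throughput = 1 / stage_time_days         # cars completed per day
    return latency, throughput

# Four stages of 1 day each: a car rolls out every day.
print(pipeline_metrics(4, 1.0))   # (4.0, 1.0)

# Eight stages of half a day each: same 4-day latency, twice the rate.
print(pipeline_metrics(8, 0.5))   # (4.0, 2.0)
```

Note that pipelining never shortens the latency of one car (or one instruction); it only raises the completion rate.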
What Determines Clock Speed?
• Clock speed is a function of work done in each stage – in the earlier examples, the effective "clock speeds" were 1 car/day and 2 cars/day
• Similarly, it takes plenty of "work" to execute an instruction, and this work is broken into stages
• If each stage of the execution of a single instruction takes 250ps → 4GHz clock speed
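The 250ps → 4GHz relationship is just the reciprocal of the stage delay; a one-line sketch (the function name is illustrative):

```python
def clock_frequency_ghz(stage_delay_ps):
    """Clock frequency implied by the work done in one pipeline stage.

    1 GHz is one cycle per nanosecond, and there are 1000 ps in a ns,
    so frequency in GHz is 1000 / (stage delay in ps).
    """
    return 1000.0 / stage_delay_ps

print(clock_frequency_ghz(250))   # 4.0 (GHz), the slide's example
print(clock_frequency_ghz(500))   # 2.0: halving the stage work doubles the clock
```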
Clock Speed Improvements
• Why have we seen such dramatic improvements in clock speed?
• Work has been broken up into more stages: early Intel chips executed work equivalent to approximately 56 logic gates per stage; today's chips execute about 12 logic gates' worth of work
• Transistors have become faster: as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubles every 5-6 years)
Will these Improvements Continue?
• Transistors will continue to shrink and become faster for at least 10 more years
• Each pipeline stage is already pretty small – improvements from this factor will cease
• If clock speed improvements stagnate, should we turn our focus to parallelism?
Microprocessor Blocks
[Block diagram: Branch Predictor, L1 Instr Cache, Decode & Rename, Issue Logic, Register File, four ALUs, L1 Data Cache, L2 Cache]
Innovations: Branch Predictor
• Improve prediction accuracy by detecting frequent patterns
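The slide does not name a specific predictor; a minimal sketch of the classic two-bit saturating-counter scheme, one common way to exploit frequently repeating branch behavior (the table size and indexing are illustrative):

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not-taken, 2-3 taken."""

    def __init__(self, table_size=1024):
        self.counters = [2] * table_size   # start "weakly taken"
        self.mask = table_size - 1

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
# A loop branch that is taken 9 times, then falls through once:
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (bp.predict(0x40) == taken)
    bp.update(0x40, taken)
print(correct)  # 9 of 10 predictions correct
```

The two-bit hysteresis means a single anomalous outcome (the loop exit) does not flip the prediction for the next visit to the loop.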
Innovations: Out-of-order Issue
• Out-of-order issue: if later instructions do not depend on earlier ones, execute them first
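A toy sketch of the issue decision above; the instruction encoding and register names are made up for illustration:

```python
# Toy out-of-order issue: an instruction may issue as soon as its source
# registers are ready, even if an earlier instruction is still waiting.
def issuable(instructions, ready_regs):
    return [name for name, dest, srcs in instructions
            if all(s in ready_regs for s in srcs)]

program = [
    ("load r1, [mem]",   "r1", []),            # cache miss: r1 not ready for a while
    ("add  r2, r1, r3",  "r2", ["r1", "r3"]),  # depends on the load -> must wait
    ("mul  r4, r5, r6",  "r4", ["r5", "r6"]),  # independent: can go ahead of the add
]
print(issuable(program, ready_regs={"r3", "r5", "r6"}))
# ['load r1, [mem]', 'mul  r4, r5, r6']
```

The `mul` issues ahead of the stalled `add`, which is exactly the reordering the slide describes.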
Innovations: Superscalar Architectures
• Multiple ALUs: increase execution bandwidth
Innovations: Data Caches
• 2K papers on caches: efficient data layout, stride prefetching
Summary
• Historically, computer engineers have focused on performance
• Performance is a function of clock speed and parallelism
• As technology improves, clock speeds will improve, although at a slower rate
• Parallelism has been gradually improving, and plenty of low-hanging fruit has been picked
Outline
• Recent Microprocessor History
• Current Trends and Challenges
• Solutions for Handling these Challenges
Trend I: An Opportunity
• Transistors on a chip have been doubling every two years (Moore's Law)
• In the past, transistors have been used for out-of-order logic, large caches, etc.
• In the future, transistors can be employed for multiple processors on a single chip
Chip Multiprocessors (CMP)
• The IBM Power4 has two processors on a die
• Sun has announced the 8-processor Niagara
[Diagram: processors P1-P4 sharing an L2 cache]
The Challenge
• Nearly every chip will have multiple processors, but where are the threads?
• Some applications will truly benefit – they can be easily decomposed into threads
• Some applications are inherently sequential – can we execute speculative threads to speed up these programs? (open problem!)
Trend II: Power Consumption
• Power ∝ a · f · C · V², where a is activity factor, f is frequency, C is capacitance, and V is voltage
• Every new chip has higher frequency, more transistors (higher C), and slightly lower voltage – the net result is an increase in power consumption
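The dynamic power relation above can be evaluated directly; all the parameter values below are hypothetical, chosen only to show how the terms trade off:

```python
def dynamic_power(activity, freq_hz, capacitance_f, voltage_v):
    """Dynamic power P = a * f * C * V^2, in watts."""
    return activity * freq_hz * capacitance_f * voltage_v ** 2

# Hypothetical chip: a = 0.1, f = 3 GHz, C = 100 nF switched, V = 1.2 V
p = dynamic_power(0.1, 3e9, 100e-9, 1.2)
print(round(p, 1))   # 43.2 (watts)

# Frequency and capacitance enter linearly, but the V^2 term makes
# voltage scaling the most effective lever:
print(round(dynamic_power(0.1, 3e9, 100e-9, 1.0), 1))   # 30.0
```

This is why the slide's observation matters: higher f and higher C push power up linearly, and the "slightly lower voltage" of each generation is not enough to offset them.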
Scary Slide!
• Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)
Impact of Power Increases
• Well, Utah Power sends you fatter bills every month
• To maintain constant chip temperature, heat produced on a chip has to be dissipated away – every additional watt increases the cooling cost of a chip by approximately $4!!
• If the temperature of a chip rises, the power dissipated also increases (almost exponentially) → a vicious cycle!
Trend III: Wire Delays
• As technology improves, logic gates shrink → their speed increases and clock speeds improve
• As logic gates shrink, wires shrink too – unfortunately, their speed improves only marginally
• In relative terms, future chips will have fast transistors/gates and slow wires
• Computation is cheap, communication is expensive!
Impact of Wire Delays
• Crossing the chip used to take one cycle
• In the future, crossing the chip can take up to 30 cycles
• Many structures on a chip are wire-constrained (register file, cache) – their access times slow down → throughput decreases as instructions sit around waiting for values
• Long wires also consume power
Trend IV: Soft Errors
• High-energy particles constantly collide with objects and deposit charge
• Transistors are becoming smaller and on-chip voltages are being lowered → it doesn't take much to toggle the state of a transistor
• The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period
Impact of Soft Errors
• When a particle strike occurs, the component is not rendered permanently faulty – only the value it contains is erroneous
• Hence, this is termed a transient fault or soft error
• The error propagates when other instructions read this faulty value
• This is already a problem for mission-critical apps (space, defense, highly-available servers) and may soon be a problem in other domains
Summary of Trends
• More transistors, more processors on a single chip
• High power consumption
• Long wire delays
• Frequent soft errors
• We are attempting to exploit transistors to increase parallelism – in light of the above challenges, we'd be happy to even preserve parallelism
Transistors & Wire Delays
• Bring in a large window of instructions so you can find high parallelism
• Distribute instructions across processors so that communication is minimized
[Diagram: a window of instructions mapped across processors]
Difficult Branches
• Mispredicted branches result in poor parallelism and wasted work (power)
• Solution: when you arrive at a fork, take both directions – execute on low-frequency units to control power dissipation levels
Thermal Emergencies
• Heterogeneous units allow you to reduce cooling costs
• If a chip's peak power is 110W, allow enough cooling to handle 100W average – save $40/chip!
• If the application starts consuming more than 100W and temperature starts to rise, start favoring the low-power processor cores – intelligent management allows you to make forward progress even in a thermal emergency
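The $40/chip figure follows directly from the earlier ~$4-per-watt cooling cost; a one-line sketch of the arithmetic (the function name is illustrative):

```python
COOLING_COST_PER_WATT = 4.0   # from the slides: ~$4 per additional watt

def cooling_savings(peak_watts, provisioned_watts):
    """Dollars saved per chip by provisioning cooling below peak power."""
    return (peak_watts - provisioned_watts) * COOLING_COST_PER_WATT

print(cooling_savings(110, 100))   # 40.0 dollars per chip
```

The catch, as the slide notes, is that the chip must then throttle itself (favor low-power cores) whenever the application pushes power above the provisioned 100W.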
Handling Long Wire Delays
• Wires can be designed to have different properties
• Knob 1: wire width and spacing – fat wires are faster, but have low bandwidth
Handling Wire Capacitance
• Knob 2: repeaters/buffers on wires – many, large buffers → low delay, high power consumption
Mapping Data to Wires
• We can optimize wires for delay, bandwidth, or power
• Different data transfers on a chip have different latency and bandwidth needs – an intelligent mapping of data to wires can improve performance and lower power consumption
Handling Soft Errors
• Errors can be detected and corrected by providing redundancy – execute two copies of a program (perhaps on a CMP) and compare results
• Note that this doubles power consumption!
[Diagram: a leading thread and a trailing thread running the same program]
Handling Soft Errors
• The trailing thread is capable of higher performance than the leading thread (peak throughput: 2 BIPS vs. 1 BIPS) – it never fetches data from memory and never guesses at branches
• But there's no point catching up – hence, artificially slow the trailing thread by lowering its frequency → lower power dissipation
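The detection step of the redundant-thread scheme reduces to comparing the two threads' result streams; a minimal sketch (the result values and function name are made up for illustration):

```python
# Sketch of soft-error detection via redundant execution: run the same
# computation on two threads and flag any mismatch as a transient fault.
def detect_soft_errors(leading_results, trailing_results):
    """Return the indices at which the two threads' results disagree."""
    return [i for i, (a, b) in enumerate(zip(leading_results, trailing_results))
            if a != b]

leading  = [10, 20, 31, 40]   # a particle strike corrupted the third result
trailing = [10, 20, 30, 40]
print(detect_soft_errors(leading, trailing))   # [2]
```

Because a soft error flips a value rather than breaking the hardware, a mismatch at index 2 can be repaired by re-executing from that point, rather than by replacing the component.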
Summary of Solutions
• Heterogeneous wires and processors
• Instructions and data have different needs: map them to appropriate wires and processors
• Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies
Conclusions
• Performance has improved because of clock speed and parallelism advances
• Clock speed improvements will continue at a slower rate
• Parallelism is on a downward trend because of technology trends and because low-hanging fruit has been picked
• We must find creative ways to preserve or even improve parallelism in the future