
Challenges in High-Performance Computer Architecture

This article explores the challenges faced in high-performance computer architecture and discusses solutions to handle these challenges. Topics include clock speed improvements, parallelism, transistor advancements, and power consumption.


Presentation Transcript


  1. High Performance Computer Architecture Challenges
  Rajeev Balasubramonian, School of Computing, University of Utah

  2. Dramatic Clock Speed Improvements!!
  The 1st Intel processor: 108 kHz → Intel Pentium 4: 3.2 GHz

  3. Clock Speed = Performance?
  • The Intel Pentium 4 has a higher clock speed than the IBM Power4 – does the Pentium 4 execute your program faster?

  4. Clock Speed = Performance?
  • The Intel Pentium 4 has a higher clock speed than the IBM Power4 – does the Pentium 4 execute your program faster?
  (Diagram: two cases comparing completed instructions per clock tick over time)

  5. Performance = Clock Speed × Parallelism

  6. What About Parallelism?

  7. Dramatic Clock Speed Improvements!!
  The 1st Intel processor: 108 kHz → Intel Pentium 4: 3.2 GHz

  8. The Basic Pipeline
  Consider an automobile assembly line with four stages of 1 day each: a new car rolls out every day.
  Split the same work into half-day stages: a new car rolls out every half day.
  In each case, it takes 4 days to build a car, but…
  More stages → more parallelism and less time between cars
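
The assembly-line arithmetic above can be sketched in a few lines. This is a toy model, not any real pipeline simulator: splitting the same total work into more (balanced) stages leaves the per-car build time unchanged but shrinks the interval between finished cars.

```python
# Toy model of the assembly-line analogy: more stages means the same
# total latency per car, but a shorter interval between finished cars.

def pipeline_metrics(total_work_days, num_stages):
    """Return (latency per car, interval between cars) for a balanced pipeline."""
    stage_time = total_work_days / num_stages
    latency = stage_time * num_stages  # total build time is unchanged
    interval = stage_time              # one car finishes per stage-time
    return latency, interval

# 4 stages of 1 day each: one car per day
print(pipeline_metrics(4, 4))  # (4.0, 1.0)
# 8 stages of half a day each: one car every half day, same 4-day latency
print(pipeline_metrics(4, 8))  # (4.0, 0.5)
```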

  9. What Determines Clock Speed?
  • Clock speed is a function of the work done in each stage – in the earlier examples, the clock speeds were 1 car/day and 2 cars/day
  • Similarly, it takes plenty of “work” to execute an instruction, and this work is broken into stages
  (Diagram: execution of a single instruction)

  10. What Determines Clock Speed?
  • Clock speed is a function of the work done in each stage – in the earlier examples, the clock speeds were 1 car/day and 2 cars/day
  • Similarly, it takes plenty of “work” to execute an instruction, and this work is broken into stages
  • A 250 ps stage delay → a 4 GHz clock speed
  (Diagram: execution of a single instruction)
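
The 250 ps → 4 GHz figure is just the reciprocal relationship between clock period and frequency, worked out here as a one-liner:

```python
# The clock period equals the per-stage delay, so frequency is its
# reciprocal: 250 ps per stage gives a 4 GHz clock.

def clock_ghz(stage_delay_ps):
    # 1 GHz corresponds to a 1000 ps period, so GHz = 1000 / ps
    return 1000.0 / stage_delay_ps

print(clock_ghz(250))  # 4.0
```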

  11. Clock Speed Improvements
  • Why have we seen such dramatic improvements in clock speed?
  • Work has been broken up into more stages – early Intel chips executed work equivalent to approximately 56 logic gates per stage; today’s chips execute 12 logic gates’ worth of work
  • Transistors have been becoming faster – as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubles every 5-6 years)

  12. Will these Improvements Continue?
  • Transistors will continue to shrink and become faster for at least 10 more years
  • Each pipeline stage is already pretty small – improvements from this factor will cease
  • If clock speed improvements stagnate, should we turn our focus to parallelism?

  13. Microprocessor Blocks
  (Diagram: Branch Predictor, L1 Instr Cache, Decode & Rename, Issue Logic, Register File, ALUs, L1 Data Cache, L2 Cache)

  14. Innovations: Branch Predictor
  Improve prediction accuracy by detecting frequent patterns
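
The slide describes pattern-based prediction only in general terms; a classic concrete example (not necessarily what any particular chip uses) is the 2-bit saturating counter. It must mispredict twice in a row before it flips its prediction, so a single outlier in a run of same-direction branches is tolerated:

```python
# A 2-bit saturating-counter branch predictor: a classic textbook
# scheme, shown here purely as an illustration of pattern detection.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 -> predict not-taken; 2,3 -> predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # saturating increment/decrement toward the observed outcome
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

p = TwoBitPredictor()
for outcome in [True, True, True]:  # warm up on a mostly-taken loop branch
    p.update(outcome)
print(p.predict())  # True
p.update(False)     # one not-taken outlier (e.g. the loop exit)
print(p.predict())  # still True: a single outlier does not flip the prediction
```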

  15. Innovations: Out-of-order Issue
  Out-of-order issue: if later instructions do not depend on earlier ones, execute them first
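
The benefit of issuing past a stalled instruction can be shown with a minimal scheduler sketch. This is not any real issue logic: it assumes one instruction issues per cycle, and an instruction is ready once all of its source registers have been produced.

```python
# Toy issue model: in-order must stall behind the oldest instruction,
# while out-of-order may skip past it to an independent instruction.

def issue_cycles(instrs, in_order):
    """instrs: list of (name, dest_reg, src_regs, latency) tuples."""
    ready_at = {}   # register -> cycle its value becomes available
    issued = {}     # instruction name -> cycle at which it issued
    cycle, pending = 0, list(instrs)
    while pending:
        for idx, (name, dest, srcs, lat) in enumerate(pending):
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                issued[name] = cycle
                ready_at[dest] = cycle + lat
                pending.pop(idx)
                break
            if in_order:
                break  # cannot issue past a stalled older instruction
        cycle += 1
    return issued

prog = [("load", "r1", [], 3),      # long-latency load
        ("add1", "r2", ["r1"], 1),  # depends on the load
        ("add2", "r3", [], 1)]      # independent of both
print(issue_cycles(prog, in_order=True))   # {'load': 0, 'add1': 3, 'add2': 4}
print(issue_cycles(prog, in_order=False))  # {'load': 0, 'add2': 1, 'add1': 3}
```

The independent `add2` issues at cycle 1 out of order instead of waiting until cycle 4 behind the load's dependent.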

  16. Innovations: Superscalar Architectures
  Multiple ALUs: increase execution bandwidth

  17. Innovations: Data Caches
  2K papers on caches: efficient data layout, stride prefetching

  18. Summary
  • Historically, computer engineers have focused on performance
  • Performance is a function of clock speed and parallelism
  • As technology improves, clock speeds will improve, although at a slower rate
  • Parallelism has been gradually improving, and plenty of the low-hanging fruit has been picked

  19. Outline
  • Recent Microprocessor History
  • Current Trends and Challenges
  • Solutions for Handling These Challenges

  20. Trend I : An Opportunity
  • Transistors on a chip have been doubling every two years (Moore’s Law)
  • In the past, transistors have been used for out-of-order logic, large caches, etc.
  • In the future, transistors can be employed for multiple processors on a single chip

  21. Chip Multiprocessors (CMP)
  • The IBM Power4 has two processors on a die
  • Sun has announced the 8-processor Niagara
  (Diagram: processors P1–P4 sharing an L2 cache)

  22. The Challenge
  • Nearly every chip will have multiple processors, but where are the threads?
  • Some applications will truly benefit – they can be easily decomposed into threads
  • Some applications are inherently sequential – can we execute speculative threads to speed up these programs? (open problem!)

  23. Trend II : Power Consumption
  • Power ∝ a · f · C · V², where a is the activity factor, f is frequency, C is capacitance, and V is voltage
  • Every new chip has higher frequency, more transistors (higher C), and slightly lower voltage – the net result is an increase in power consumption
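
The trend can be put in numbers with the dynamic-power formula itself. The values below are invented purely for illustration; they show how a higher frequency and more switched capacitance can outweigh a slightly lower supply voltage:

```python
# Dynamic power: P = a * C * V^2 * f.
# All parameter values below are illustrative, not from any real chip.

def dynamic_power(a, c_farads, v_volts, f_hz):
    return a * c_farads * v_volts**2 * f_hz

old_chip = dynamic_power(0.2, 1e-9, 1.5, 2e9)  # ~0.9 W
new_chip = dynamic_power(0.2, 2e-9, 1.3, 3e9)  # ~2.0 W despite the lower voltage
print(old_chip, new_chip)
```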

  24. Scary Slide!
  • Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)

  25. Impact of Power Increases
  • Well, UtahPower sends you fatter bills every month
  • To maintain constant chip temperature, heat produced on a chip has to be dissipated away – every additional watt increases the cooling cost of a chip by approximately $4!!
  • If the temperature of a chip rises, the power dissipated also increases (almost exponentially) → a vicious cycle!

  26. Trend III : Wire Delays
  • As technology improves, logic gates shrink → their speed increases and clock speeds improve
  • As logic gates shrink, wires shrink too – unfortunately, their speed improves only marginally
  • In relative terms, future chips will have fast transistors/gates and slow wires
  • Computation is cheap, communication is expensive!

  27. Impact of Wire Delays
  • Crossing the chip used to take one cycle
  • In the future, crossing the chip can take up to 30 cycles
  • Many structures on a chip are wire-constrained (register file, cache) – their access times slow down → throughput decreases as instructions sit around waiting for values
  • Long wires also consume power

  28. Trend IV : Soft Errors
  • High-energy particles constantly collide with objects and deposit charge
  • Transistors are becoming smaller and on-chip voltages are being lowered → it doesn’t take much to toggle the state of a transistor
  • The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period

  29. Impact of Soft Errors
  • When a particle strike occurs, the component is not rendered permanently faulty – only the value it contains is erroneous
  • Hence, this is termed a transient fault or soft error
  • The error propagates when other instructions read this faulty value
  • This is already a problem for mission-critical apps (space, defense, highly available servers) and may soon be a problem in other domains

  30. Summary of Trends
  • More transistors, more processors on a single chip
  • High power consumption
  • Long wire delays
  • Frequent soft errors
  • We are attempting to exploit transistors to increase parallelism – in light of the above challenges, we’d be happy to even preserve parallelism

  31. Transistors & Wire Delays
  • Bring in a large window of instructions so you can find high parallelism
  • Distribute instructions across processors so that communication is minimized
  (Diagram: a window of instructions mapped onto processors)

  32. Difficult Branches
  • Mispredicted branches result in poor parallelism and wasted work (power)
  • Solution: when you arrive at a fork, take both directions – execute on low-frequency units to control power dissipation levels
  (Diagram: instructions from both branch paths mapped onto processors)

  33. Thermal Emergencies
  • Heterogeneous units allow you to reduce cooling costs
  • If a chip’s peak power is 110 W, allow enough cooling to handle a 100 W average – save $40/chip!
  • If the application starts consuming more than 100 W and the temperature starts to rise, start favoring the low-power processor cores – intelligent management allows you to make forward progress even in a thermal emergency
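
The $40/chip figure follows directly from the earlier estimate of roughly $4 of cooling cost per additional watt:

```python
# The slide's arithmetic: provisioning cooling for the 100 W average
# rather than the 110 W peak, at ~$4 of cooling cost per watt.

COOLING_COST_PER_WATT = 4  # dollars, approximate figure from the slides

def cooling_savings(peak_w, provisioned_w):
    return (peak_w - provisioned_w) * COOLING_COST_PER_WATT

print(cooling_savings(110, 100))  # 40 dollars per chip
```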

  34. Handling Long Wire Delays
  • Wires can be designed to have different properties
  • Knob 1: wire width and spacing – fat wires are faster, but have low bandwidth

  35. Handling Wire Capacitance
  • Knob 2: wires have repeaters/buffers – many, large buffers → low delay, high power consumption
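
The slides describe these knobs qualitatively; the standard first-order RC picture behind them is that an unrepeated wire's delay grows quadratically with length (resistance and capacitance both grow linearly), and inserting repeaters splits the wire into short segments, making delay roughly linear in length at the cost of repeater power. The unit resistance, capacitance, and repeater delay below are purely illustrative:

```python
# First-order RC wire-delay sketch (illustrative units, not real values).

def wire_delay(length_mm, r_per_mm, c_per_mm, n_segments=1, t_repeater=0.0):
    seg = length_mm / n_segments
    per_segment = 0.5 * (r_per_mm * seg) * (c_per_mm * seg)  # Elmore-style 0.5*R*C
    return n_segments * per_segment + (n_segments - 1) * t_repeater

# Doubling an unrepeated wire quadruples its delay...
print(wire_delay(10, 1.0, 1.0))  # 50.0
print(wire_delay(20, 1.0, 1.0))  # 200.0
# ...but with repeaters every 5 mm the growth is close to linear.
print(wire_delay(20, 1.0, 1.0, n_segments=4, t_repeater=2.0))  # 56.0
```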

  36. Mapping Data to Wires
  • We can optimize wires for delay, bandwidth, or power
  • Different data transfers on a chip have different latency and bandwidth needs – an intelligent mapping of data to wires can improve performance and lower power consumption

  37. Handling Soft Errors
  • Errors can be detected and corrected by providing redundancy – execute two copies of a program (perhaps on a CMP) and compare results
  • Note that this doubles power consumption!
  (Diagram: a leading thread and a trailing thread)

  38. Handling Soft Errors
  • The trailing thread is capable of higher performance than the leading thread – but there’s no point catching up – hence, artificially slow the trailing thread by lowering its frequency → lower power dissipation
  • The trailing thread never fetches data from memory and never guesses at branches
  (Diagram: leading thread with a peak throughput of 1 BIPS, trailing thread with 2 BIPS)
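
The detect-and-recover idea can be sketched abstractly: run two copies of the same work, compare, and retry on a mismatch (soft errors are transient, so a retry normally succeeds). The bit-flip fault injection below is purely illustrative and not how real hardware is modeled:

```python
# Sketch of redundant execution for soft-error detection and recovery.

def run(program, x, flip_bit=None):
    result = program(x)
    if flip_bit is not None:
        result ^= 1 << flip_bit  # model a particle strike flipping one bit
    return result

def redundant_execute(program, x, leading_faults=()):
    """leading_faults: bit to flip on each successive leading-thread
    attempt (for demonstration); the final retry is always fault-free."""
    for fault in list(leading_faults) + [None]:
        leading = run(program, x, fault)
        trailing = run(program, x)  # redundant copy of the same work
        if leading == trailing:
            return leading  # the two copies agree: accept the result
        # mismatch detected: discard both results and retry

square = lambda x: x * x
print(redundant_execute(square, 7))                      # 49, clean run
print(redundant_execute(square, 7, leading_faults=[3]))  # 49, fault caught and retried
```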

  39. Summary of Solutions
  • Heterogeneous wires and processors
  • Instructions and data have different needs: map them to appropriate wires and processors
  • Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies

  40. Conclusions
  • Performance has improved because of clock speed and parallelism advances
  • Clock speed improvements will continue at a slower rate
  • Parallelism is on a downward trend because of technology trends and because the low-hanging fruit has been picked
  • We must find creative ways to preserve or even improve parallelism in the future

