
AMD Bulldozer Microarchitecture


posy

Presentation Transcript


  1. AMD Bulldozer Microarchitecture

  2. Overview
     • Two cores per module, to provide high throughput per thread
     • A Bulldozer module can execute two threads via a combination of shared and dedicated resources.
     • AMD’s design focuses on multithreading.

  3. High-Level Block Diagram
     The figure is taken from [3].

  4. Branch Prediction & Fetch
     • Prediction structures are shared between the two threads.
     • Multilevel BTBs (branch target buffers)
     • Guess!!!
     • Prediction runs ahead of the instruction fetch (IF) pipeline during fetch misses or other stalls.
     • Instructions are prefetched into the L1 instruction cache using the prediction queue.
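The run-ahead behaviour on this slide can be sketched as a toy model: the predictor pushes one predicted fetch address per cycle into a prediction queue, even while fetch is stalled, and each queued address is used to prefetch a line into the L1 instruction cache. The class name, the queue depth, the line size, the dictionary standing in for the BTB, and the instant prefetch are all assumptions made for illustration, not details of AMD's design.

```python
from collections import deque

LINE_BYTES = 64    # assumed cache-line size for this sketch
QUEUE_DEPTH = 8    # assumed prediction-queue depth


class DecoupledFrontEnd:
    """Toy model: branch prediction runs ahead of fetch through a queue."""

    def __init__(self, start_pc, btb=None):
        self.next_predict_pc = start_pc
        self.btb = btb or {}       # line address -> predicted target line
        self.pred_queue = deque()  # predicted fetch addresses, oldest first
        self.icache = set()        # line addresses present in the L1 I-cache
        self.fetched = []          # lines handed on to decode, in order

    def predict(self):
        """Push one predicted fetch address if the queue has room."""
        if len(self.pred_queue) >= QUEUE_DEPTH:
            return
        pc = self.next_predict_pc
        self.pred_queue.append(pc)
        self.icache.add(pc)        # prefetch the predicted line (instant here)
        # Follow a predicted-taken branch if the BTB has one, else fall through.
        self.next_predict_pc = self.btb.get(pc, pc + LINE_BYTES)

    def fetch(self, stalled):
        """Consume the oldest prediction unless fetch is stalled."""
        if stalled or not self.pred_queue:
            return
        self.fetched.append(self.pred_queue.popleft())

    def cycle(self, fetch_stalled=False):
        self.predict()
        self.fetch(fetch_stalled)
```

Calling `cycle(fetch_stalled=True)` a few times shows the point of the decoupling: the queue and the cache keep filling while nothing is fetched, so the lines are already present once the stall clears.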

  5. Decode
     • Fetch lines are queued in an instruction byte buffer.
     • The decode unit extracts and decodes up to four x86 instructions per cycle.
     • Decoded instructions are dispatched to one integer core.
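The four-wide decode limit on this slide can be illustrated with a small sketch that pulls complete instructions out of a byte buffer, at most four per cycle. The encoding is a deliberate simplification and an assumption of this example: the first byte of each instruction holds its total length, which is not how real x86 length decoding works. The function name `decode_cycle` is likewise hypothetical.

```python
from collections import deque

DECODE_WIDTH = 4   # from the slide: up to four x86 instructions per cycle


def decode_cycle(byte_buffer, width=DECODE_WIDTH):
    """Extract up to `width` complete instructions from the byte buffer.

    Toy encoding (an assumption, NOT real x86): the first byte of each
    instruction holds that instruction's total length in bytes.
    """
    decoded = []
    while len(decoded) < width and byte_buffer:
        length = byte_buffer[0]
        if length < 1:
            raise ValueError("toy encoding requires a length byte >= 1")
        if length > len(byte_buffer):
            break                  # instruction not fully fetched yet
        decoded.append(bytes(byte_buffer.popleft() for _ in range(length)))
    return decoded
```

Instructions beyond the fourth, and any instruction whose bytes have not all arrived, stay in the buffer for a later cycle.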

  6. Integer Core
     • Replicated (two integer cores per module)
     • The scheduler handles out-of-order execution.
     • Core transparency
     • Avoids complexity
     • Lean hardware
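What "the scheduler handles out-of-order execution" means can be sketched with a minimal oldest-first scheduler: each cycle it issues the oldest waiting instruction whose source registers are ready, so a younger independent instruction can overtake an older one that is still waiting on a long-latency result. The one-issue-per-cycle limit, the latencies, and the instruction tuple format are assumptions of this sketch, which also assumes each register is written only once (that is, registers are already renamed).

```python
def issue_order(instructions, ready_regs):
    """Return [(cycle, name)] in the order a toy scheduler issues instructions.

    instructions: list of (name, dest, sources, latency) in program order.
    ready_regs:   registers whose values are available at cycle 0.
    """
    ready_at = {r: 0 for r in ready_regs}  # register -> cycle its value is ready
    waiting = list(instructions)
    order = []
    cycle = 0
    while waiting:
        issued = False
        for ins in waiting:                # scan oldest first
            name, dest, sources, latency = ins
            if all(ready_at.get(s, float("inf")) <= cycle for s in sources):
                waiting.remove(ins)
                order.append((cycle, name))
                ready_at[dest] = cycle + latency   # wake dependants later
                issued = True
                break                      # at most one issue per cycle
        if not issued and all(t <= cycle for t in ready_at.values()):
            raise ValueError("a source register is never produced")
        cycle += 1
    return order
```

With a three-cycle load followed by a dependent add and an independent subtract, the subtract issues in cycle 1 while the add waits until cycle 3.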

  7. Integer Core
     The figure is taken from [1].

  8. Floating Point Unit
     • Single floating-point unit
     • Shared between the integer cores
     • Floating-point operations are pipelined and can therefore exploit SMT.
     • Interfaces with the decode unit to receive cops and with the load/store unit for data transfer

  9. Floating Point Unit
     The figure is taken from [1].

  10. Register Renaming
     • PRF (physical register file)-based renaming
     • A table holds the mappings of register names to locations (tags).
     • Issued instructions execute after reading their operands from the PRF.
     • Snapshots are used to recover from branch mispredictions and exceptions.
     • Separate register files for the integer cores and the floating-point unit

  11. Register Renaming
     • Advantages
         • Eliminates data replication by not using distributed reservation stations.
         • Less common data bus (CDB) overhead
     • Disadvantages
         • Increased latency, because tags are fetched instead of values.
         • Complicated recovery mechanism for branch mispredictions
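The PRF-based scheme on the two renaming slides can be sketched as a map table from architectural registers to physical-register tags, a free list, and snapshots taken at branches. The class and method names, the register counts, and the simplified recovery are assumptions for illustration. In particular, this sketch does not model freeing registers at commit, and it does not discard snapshots of younger branches on recovery.

```python
class Renamer:
    """Toy PRF-based renamer with snapshot recovery."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i maps to physical register i.
        self.map = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))
        self.prf = [0] * num_phys      # values live here, not in the scheduler
        self.snapshots = {}

    def rename(self, dest, sources):
        """Return (new destination tag, source tags) for one instruction."""
        src_tags = [self.map[s] for s in sources]   # read before overwriting
        if not self.free:
            raise RuntimeError("no free physical register; rename must stall")
        new_tag = self.free.pop(0)
        self.map[dest] = new_tag
        return new_tag, src_tags

    def snapshot(self, branch_id):
        """Save the map table and free list at a branch."""
        self.snapshots[branch_id] = (list(self.map), list(self.free))

    def recover(self, branch_id):
        """Restore the state saved at a mispredicted branch."""
        self.map, self.free = self.snapshots.pop(branch_id)
```

Instructions carry only tags through the scheduler and read values from `prf` at execution, which is the source of both the advantage (no value copies in reservation stations) and the extra latency listed on the slide.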

  12. Multithreading
     • Shared front end (vertical multithreading)
         • More resources available in single-thread mode
         • Makes full use of the fetch bandwidth
     • Dedicated integer execution core (single thread)
         • Keeps the integer execution core small and simple
         • Makes it possible to run at a higher frequency
     • Shared FPU (SMT)
         • Consumes a great deal of area and power
         • Rarely utilized to full capacity
     • Shared L2 (thread agnostic)
         • Good when the two threads share an instruction/data image
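The vertical multithreading of the shared front end can be sketched as a per-cycle ownership schedule: with two active threads the front end alternates between them, and with one active thread that thread owns every cycle, which is the "single-thread mode" benefit listed above. Plain round-robin selection and the function name are assumptions of this sketch; the real selection policy may take stalls and other conditions into account.

```python
def front_end_schedule(active_threads, cycles):
    """Return which thread owns the shared front end in each cycle.

    Assumes simple round-robin among the active threads (a simplification).
    """
    if not active_threads:
        return [None] * cycles         # front end idle
    n = len(active_threads)
    return [active_threads[c % n] for c in range(cycles)]
```

For example, `front_end_schedule([0, 1], 4)` alternates between threads 0 and 1, while `front_end_schedule([0], 3)` gives thread 0 every cycle.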

  13. Cache Hierarchy
     The figure is taken from [1].

  14. TLB Hierarchy
     The figure is taken from [1].

  15. Conclusion
     • Decoupling branch prediction from instruction fetch enables instruction prefetch.
     • PRF-based renaming makes the design more power efficient.
     • Unconventional multithreading

  16. References
     [1] Bulldozer: An Approach to Multithreaded Compute Performance (2011). http://home.dei.polimi.it/sami/architetture_avanzate/AMDbulldozer.pdf
     [2] AMD Bulldozer Microarchitecture (2010). http://www.realworldtech.com/bulldozer/
     [3] Bulldozer (microarchitecture). http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
     [4] Register Renaming. http://en.wikipedia.org/wiki/Register_renaming
