AMD Bulldozer Microarchitecture


Overview
• Two integer cores per module, to sustain high throughput per thread
• A Bulldozer module executes two threads using a combination of shared and dedicated resources.
• AMD’s design focuses on multithreading.

High Level Block Diagram
(Figure taken from [3].)

Branch Prediction & Fetch
• Prediction structures are shared between the two threads.
• Multilevel BTBs (branch target buffers)
• Guess!!!
• Prediction runs ahead of the instruction fetch (IF) pipeline during fetch misses or other stalls.
• Instructions are prefetched into the L1 instruction cache using the prediction queue.
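The decoupling described on this slide can be sketched as a small toy model. This is only an illustration under stated assumptions, not AMD's actual design: the function name `run_frontend`, the one-prediction-per-cycle rate, and the `queue_depth` parameter are all hypothetical. The point it shows is that predicted addresses keep accumulating in a queue while fetch is stalled, and those queued addresses can be used as prefetch targets.

```python
from collections import deque


def run_frontend(predicted_lines, fetch_stall_cycles, queue_depth=8):
    """Toy model of a predictor decoupled from fetch by a prediction queue.

    predicted_lines: fetch-line addresses the predictor will produce, in order.
    fetch_stall_cycles: set of cycle numbers in which fetch cannot proceed.
    queue_depth: hypothetical queue capacity (not a documented Bulldozer value).
    """
    pending = deque(predicted_lines)  # addresses not yet predicted
    queue = deque()                   # prediction queue between predictor and fetch
    fetched, prefetched = [], []
    cycle = 0
    while pending or queue:
        # The predictor keeps running every cycle while the queue has room.
        if pending and len(queue) < queue_depth:
            queue.append(pending.popleft())
        if cycle in fetch_stall_cycles:
            # Fetch is stalled: queued predictions become prefetch requests.
            for addr in queue:
                if addr not in prefetched:
                    prefetched.append(addr)
        elif queue:
            # Fetch consumes the oldest prediction.
            fetched.append(queue.popleft())
        cycle += 1
    return fetched, prefetched


fetched, prefetched = run_frontend([0x1000, 0x1040, 0x2000, 0x2040], {1, 2})
```

With fetch stalled in cycles 1 and 2, the predictor still enqueues two addresses, and both are prefetched before fetch resumes; the fetch order itself is unchanged.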

Decode
• Fetch lines are queued in an instruction byte buffer.
• The decode unit extracts and decodes up to four x86 instructions per cycle.
• Decoded instructions are dispatched to one of the two integer cores.
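A minimal sketch of a shared decoder feeding per-thread integer cores follows. The decode width of four comes from the slide; everything else is an assumption for illustration: the round-robin choice of thread per cycle, the function name `decode_schedule`, and the idea that each thread has its own buffer are hypothetical simplifications, not a description of the real arbitration policy.

```python
from collections import deque

DECODE_WIDTH = 4  # "up to four x86 instructions per cycle" (from the slide)


def decode_schedule(buffers):
    """Return a list of (cycle, thread, decoded_group) tuples.

    buffers: dict mapping thread id -> list of queued instructions.
    Assumption: one thread is decoded per cycle, chosen round-robin
    among the threads that still have instructions queued.
    """
    queues = {t: deque(insns) for t, insns in buffers.items()}
    threads = sorted(queues)
    schedule = []
    cycle = 0
    turn = 0
    while any(queues.values()):
        # Find the next thread, starting from 'turn', that has work queued.
        for i in range(len(threads)):
            t = threads[(turn + i) % len(threads)]
            if queues[t]:
                break
        # Decode at most DECODE_WIDTH instructions; all go to thread t's core.
        count = min(DECODE_WIDTH, len(queues[t]))
        group = [queues[t].popleft() for _ in range(count)]
        schedule.append((cycle, t, group))
        turn = (threads.index(t) + 1) % len(threads)
        cycle += 1
    return schedule


schedule = decode_schedule({0: ["a", "b", "c", "d", "e"], 1: ["x", "y"]})
```

In this example thread 0 decodes four instructions in cycle 0, thread 1 decodes its two in cycle 1, and thread 0 finishes its last instruction in cycle 2.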

Integer Core
• Replicated (two integer cores)
• The scheduler handles out-of-order execution.
• Core transparency
• Avoids complexity
• Lean hardware

Integer Core
(Figure taken from [1].)

Floating Point Unit
• Single floating-point unit.
• Shared between the two integer cores.
• Floating-point operations are pipelined and can therefore exploit SMT.
• Interfaces with the decode unit to receive cops and with the load/store unit for data transfer.

Floating Point Unit
(Figure taken from [1].)

Register Renaming
• PRF (physical register file)-based renaming
• A table holds the mappings from register names to locations (tags).
• Issued instructions read their operands from the PRF before executing.
• Snapshots are used to recover from branch mispredictions and exceptions.
• Separate register files for the integer cores and the floating-point unit.

Register Renaming
• Advantages
  • Eliminates data replication by not using distributed reservation stations.
  • Less CDB (common data bus) overhead.
• Disadvantages
  • Increased latency, because instructions carry tags and must read the values from the PRF.
  • Complicated recovery mechanism for branch mispredictions.
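The PRF-based scheme from the two Register Renaming slides can be sketched in a few lines. This is a simplified illustration, not Bulldozer's implementation: the class name `Renamer`, its methods, and the choice to snapshot the free list together with the map table are assumptions, and retirement (which returns old tags to the free list) is not modelled.

```python
class Renamer:
    """Toy PRF-based renamer: a map table of tags plus one physical register file."""

    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))              # arch register -> physical tag
        self.free = list(range(num_arch, num_phys))   # unallocated physical tags
        self.prf = [0] * num_phys                     # values live only here
        self.snapshots = {}                           # branch id -> saved state

    def rename(self, dst, srcs):
        """Look up source tags, then allocate a fresh tag for the destination."""
        src_tags = [self.map[s] for s in srcs]
        new_tag = self.free.pop(0)
        self.map[dst] = new_tag
        return new_tag, src_tags

    def execute_add(self, dst_tag, src_tags):
        """Operands are read from the PRF by tag at execution time."""
        self.prf[dst_tag] = sum(self.prf[t] for t in src_tags)

    def snapshot(self, branch_id):
        """Save the map table (and, as a simplification, the free list)."""
        self.snapshots[branch_id] = (list(self.map), list(self.free))

    def recover(self, branch_id):
        """On a misprediction, restore the state saved at the branch."""
        self.map, self.free = self.snapshots.pop(branch_id)


r = Renamer(num_arch=4, num_phys=8)
r.prf[r.map[1]], r.prf[r.map[2]] = 5, 7
tag, srcs = r.rename(dst=0, srcs=[1, 2])   # r0 = r1 + r2
r.execute_add(tag, srcs)
r.snapshot("b0")                           # branch predicted here
r.rename(dst=1, srcs=[0, 0])               # wrong-path instruction
r.recover("b0")                            # misprediction: undo the wrong-path mapping
```

After `recover`, architectural register 1 maps to its original tag again and the tag taken by the wrong-path instruction is back on the free list; no values were copied, only tags.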

Multithreading
• Shared front end (vertical multithreading)
  • Larger resources available in single-thread mode
  • Utilizes fetch bandwidth
• Dedicated integer execution cores (single thread each)
  • Keeps the integer execution core small and simple
  • Makes it possible to run at a higher frequency
• Shared FPU (SMT)
  • Consumes a great deal of area and power
  • Rarely utilized to full capacity
• Shared L2 (thread agnostic)
  • Good when the two threads share an instruction/data image
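The trade-off between dedicated integer cores and a shared FPU can be illustrated with a rough cycle-count model. Everything here is a sketch under assumptions: the function `simulate_module`, the `fp_issue_slots` parameter, one op per thread per cycle, and the alternating priority between threads are hypothetical and do not describe the real FPU scheduler.

```python
def simulate_module(thread_ops, fp_issue_slots=1):
    """Count cycles for two threads sharing one module.

    thread_ops: two lists of ops, each op either "int" or "fp".
    "int" ops always issue, because each thread owns its integer core.
    "fp" ops compete for fp_issue_slots shared FPU slots per cycle
    (hypothetical parameter; priority alternates between the threads).
    """
    pcs = [0, 0]          # next op index for each thread
    cycles = 0
    priority = 0          # thread that gets first pick of FPU slots this cycle
    while any(pc < len(ops) for pc, ops in zip(pcs, thread_ops)):
        slots = fp_issue_slots
        for t in (priority, 1 - priority):
            if pcs[t] >= len(thread_ops[t]):
                continue  # this thread has finished
            op = thread_ops[t][pcs[t]]
            if op == "int":
                pcs[t] += 1           # dedicated core, never blocked by the other thread
            elif slots > 0:
                slots -= 1
                pcs[t] += 1           # won a shared FPU slot
            # otherwise the fp op waits for a later cycle
        priority = 1 - priority
        cycles += 1
    return cycles


int_only = simulate_module([["int"] * 4, ["int"] * 4])
fp_only = simulate_module([["fp"] * 4, ["fp"] * 4])
mixed = simulate_module([["int"] * 4, ["fp"] * 4])
```

Under these assumptions, two integer-only threads finish in 4 cycles, as does the mixed case, while two FP-only threads take 8 cycles with a single shared slot. That matches the slide's reasoning: sharing the FPU costs little as long as it is rarely used to full capacity by both threads at once.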

Cache Hierarchy
(Figure taken from [1].)

TLB Hierarchy
(Figure taken from [1].)

Conclusion
• Decoupling branch prediction from instruction fetch enables instruction prefetch.
• PRF-based renaming makes the design power efficient.
• Non-conventional multithreading

References
[1] Bulldozer: An Approach To Multithreaded Compute Performance (2011) http://home.dei.polimi.it/sami/architetture_avanzate/AMDbulldozer.pdf
[2] AMD Bulldozer Microarchitecture (2010) http://www.realworldtech.com/bulldozer/
[3] Bulldozer (microarchitecture) http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)
[4] Register Renaming http://en.wikipedia.org/wiki/Register_renaming
