AMD Bulldozer Microarchitecture
Overview • Two integer cores per module, so that each thread keeps high throughput • A Bulldozer module executes two threads using a combination of shared and dedicated resources. • AMD’s design focuses on multithreading.
High Level Block Diagram The figure is taken from [3]
Branch Prediction & Fetch • Prediction structures are shared between the two threads • Multilevel BTBs (branch target buffers) • Guess!!! • Prediction runs ahead of the instruction fetch (IF) pipeline during fetch misses or other stalls. • Instructions are prefetched into the L1 instruction cache using the prediction queue.
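To make the run-ahead idea concrete, here is a minimal sketch of a predictor that keeps filling a prediction queue while fetch is stalled on a miss, with queued addresses used to prefetch lines. It is a toy model, not AMD's design: the 64-byte line size, the 3-cycle miss latency, the queue depth, the zero-latency prefetch, and the names `simulate` and `LINE_BYTES` are all assumptions made for illustration.

```python
from collections import deque

LINE_BYTES = 64  # assumed cache line size (not stated in the slides)


def simulate(predicted_addrs, miss_latency=3, queue_depth=8, prefetch=True):
    """Toy decoupled front end. Returns (cycles, prefetched_lines)."""
    icache = set()                    # lines present in the L1 instruction cache
    queue = deque()                   # prediction queue between predictor and fetch
    pending = deque(predicted_addrs)  # addresses the predictor will produce
    prefetched = []
    stall = 0
    cycles = 0
    fetched = 0
    while fetched < len(predicted_addrs):
        cycles += 1
        # The predictor runs every cycle, even while fetch is stalled.
        if pending and len(queue) < queue_depth:
            queue.append(pending.popleft())
        if stall > 0:
            stall -= 1
            if prefetch:
                # Use queued predictions (behind the stalled head) to prefetch
                # at most one missing line per cycle. Assumed to arrive instantly.
                for addr in list(queue)[1:]:
                    line = addr // LINE_BYTES
                    if line not in icache:
                        icache.add(line)
                        prefetched.append(line)
                        break
            continue
        if not queue:
            continue
        line = queue[0] // LINE_BYTES
        if line in icache:
            queue.popleft()           # hit: fetch consumes the prediction
            fetched += 1
        else:
            icache.add(line)          # demand miss: line usable after the stall
            stall = miss_latency


    return cycles, prefetched


addrs = [0, 64, 128, 192]
with_prefetch = simulate(addrs, prefetch=True)
without_prefetch = simulate(addrs, prefetch=False)
```

In this toy model the first miss still costs the full latency, but the lines for the following predictions are already present when fetch resumes, so only the first access stalls.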
Decode • Fetch lines are queued in an instruction byte buffer. • The decode unit extracts and decodes up to four x86 instructions per cycle. • Decoded instructions are dispatched to one of the two integer cores.
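The decode step above can be sketched as a shared decoder that takes up to four instructions per cycle from one thread's buffer and hands the group to that thread's integer core. The four-wide limit comes from the slide; the round-robin choice between threads and the names `decode_schedule` and `DECODE_WIDTH` are assumptions for illustration only.

```python
from collections import deque

DECODE_WIDTH = 4  # "up to four x86 instructions per cycle" (from the slide)


def decode_schedule(buffers):
    """buffers maps thread id -> deque of instructions waiting to be decoded.

    Returns a list of (cycle, thread_id, group), where each group is sent
    to the integer core that belongs to thread_id.
    """
    schedule = []
    threads = sorted(buffers)
    cycle = 0
    while any(buffers[t] for t in threads):
        # Assumed policy: alternate threads each cycle (round-robin).
        t = threads[cycle % len(threads)]
        if not buffers[t]:
            # The chosen thread has nothing queued; give the slot to another.
            t = next(x for x in threads if buffers[x])
        count = min(DECODE_WIDTH, len(buffers[t]))
        group = [buffers[t].popleft() for _ in range(count)]
        schedule.append((cycle, t, group))
        cycle += 1
    return schedule


buffers = {
    0: deque(f"t0_i{k}" for k in range(6)),
    1: deque(f"t1_i{k}" for k in range(3)),
}
schedule = decode_schedule(buffers)
```

Each dispatch group contains instructions from a single thread, which matches the point that a group goes to one integer core.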
Integer Core • Replicated (two integer cores per module) • Each core's scheduler handles out-of-order execution. • Core transparency • Avoids complexity • Lean hardware
Integer Core The figure is taken from [1]
Floating Point Unit • Single floating point unit per module. • Shared between the two integer cores. • Floating point operations are pipelined, so the unit can exploit SMT by interleaving operations from both threads. • Interfaces with the decode unit to receive cops (complex ops) and with the load/store units for data transfer.
Floating Point Unit The figure is taken from [1]
Register Renaming • PRF (Physical Register File)-based renaming • A map table holds the mapping from architectural register names to physical register locations (tags). • Issued instructions read their operands from the PRF and then execute. • Uses snapshots of the map to recover from branch mispredictions and exceptions. • Separate register files for the integer cores and the floating point unit.
Register Renaming • Advantages • Eliminates data replication: values live only in the PRF instead of being copied into distributed reservation stations. • Less overhead on the common data bus (CDB). • Disadvantages • Added latency, because schedulers hold tags and the operand values must still be read from the PRF. • More complicated recovery mechanism for branch mispredictions.
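A minimal sketch of PRF-based renaming, assuming a map table, a free list, and full-copy snapshots, may help here. The class `PRFRenamer` and its method names are hypothetical, and restoring the free list from the snapshot is a simplification that ignores registers freed at retirement; it is not a description of Bulldozer's actual recovery logic.

```python
class PRFRenamer:
    """Toy PRF-based renamer: map table + free list + snapshots."""

    def __init__(self, num_arch_regs, num_phys_regs):
        assert num_phys_regs > num_arch_regs
        # Architectural register i initially lives in physical register i.
        self.map_table = list(range(num_arch_regs))
        self.free_list = list(range(num_arch_regs, num_phys_regs))
        self.prf = [0] * num_phys_regs  # the only copy of each value
        self.snapshots = {}

    def rename(self, dest, srcs):
        """Rename one instruction; returns (dest_tag, src_tags)."""
        # Sources are looked up before the destination is remapped,
        # so "r1 = r1 + r2" reads the old r1.
        src_tags = [self.map_table[s] for s in srcs]
        dest_tag = self.free_list.pop(0)
        self.map_table[dest] = dest_tag
        return dest_tag, src_tags

    def execute_add(self, dest_tag, src_tags):
        # Values are read from the PRF by tag at execute time.
        self.prf[dest_tag] = sum(self.prf[t] for t in src_tags)

    def snapshot(self, branch_id):
        # Assumed simplification: copy the whole map table and free list.
        self.snapshots[branch_id] = (list(self.map_table), list(self.free_list))

    def recover(self, branch_id):
        # On a misprediction, discard every mapping made after the branch.
        self.map_table, self.free_list = self.snapshots.pop(branch_id)


r = PRFRenamer(num_arch_regs=4, num_phys_regs=8)
r.prf[1], r.prf[2] = 5, 7
d1, s1 = r.rename(dest=1, srcs=[1, 2])   # r1 = r1 + r2
r.execute_add(d1, s1)
r.snapshot("b0")                          # branch predicted here
d2, s2 = r.rename(dest=1, srcs=[1, 1])   # wrong-path instruction
r.recover("b0")                           # branch was mispredicted
```

After `recover`, architectural register 1 maps back to the physical register allocated before the branch, and the register taken by the wrong-path instruction is free again. Note that only tags moved during renaming; the values stayed in `prf`.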
Multithreading • Shared front end (vertical multithreading): a single thread gets the larger resource in single-thread mode, and fetch bandwidth is better utilized • Dedicated integer execution core per thread (single-threaded): keeps the integer core small and simple, which makes it possible to run at a higher frequency • Shared FPU (SMT): an FPU consumes a great deal of area and power and is rarely utilized to full capacity • Shared L2 (thread agnostic): works well when the two threads share an instruction/data image
Cache Hierarchy The figure is taken from [1]
TLB Hierarchy The figure is taken from [1]
Conclusion • Decoupling branch prediction from instruction fetch enables instruction prefetching • PRF-based renaming improves power efficiency • Non-conventional multithreading
References [1] Bulldozer: An Approach to Multithreaded Compute Performance, http://home.dei.polimi.it/sami/architetture_avanzate/AMDbulldozer.pdf (2011) [2] AMD Bulldozer Microarchitecture, http://www.realworldtech.com/bulldozer/ (2010) [3] Bulldozer (microarchitecture), http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture) [4] Register Renaming, http://en.wikipedia.org/wiki/Register_renaming