Bulldozer: An Approach to Multithreaded Compute Performance, by Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas. This paper appears in: IEEE Micro, March/April 2011 (vol. 31, no. 2), pp. 6-15. Microprocessor Architecture, speaker: 박세준
Contents
• Motivation
• Introduction
• Block diagram
• Key features
• Function block highlights
• Bulldozer-based SoC
Motivation
AMD has been focusing on core count and highly parallel server workloads.
• Two basic observations
  • Future SoCs will support multiple execution threads, so the core must be the smallest possible building block
  • Cores will operate in a power-constrained environment
• Power reduction techniques: filtering, speculation reduction, data movement minimization
• The goal: performance per watt
Introduction
Bulldozer is a new direction in microarchitecture.
• Bulldozer is the first x86 design to share substantial hardware between multiple cores
• Bulldozer is a hierarchical design with sharing at nearly every level
• Bulldozer is a CPU optimized for high frequency
• The design raises average performance rather than peak performance
Introduction
• Major contributions
  • Scaling of the core structures
  • Aggressive frequency goal
  • Few gates per clock cycle
Block diagram
• Combines two independent cores into one module
• Implements a shared level 2 cache
• Improves area and power efficiency
• The module can fetch and decode up to four x86 instructions per clock
• Each core can service two loads per cycle
• Shared front end
• Decoupled predict and fetch pipelines
Block diagram
[Figure not reproduced; only its two labels survive]
• ALU performance 33% decrease, FPU performance 33% increase
• ALU performance 33% increase, FPU performance 33% increase
Key features
1. Multithreading microarchitecture
  • Appropriate mix of replicated and shared hardware
  • The main advantage comes from sharing the instruction cache and branch prediction
  • Reinforced front end (larger ROB and BTB)
2. Branch prediction decoupled from the instruction fetch pipeline
  • Enables instruction prefetch using the prediction queue
  • Instruction control unit (reorder buffer) increased to 128 entries
3. Register renaming and operand delivery
  • Scheduling and operand handling are the biggest power consumers in the integer execution unit
  • PRF-based renaming microarchitecture for power efficiency
  • Eliminates data replication
4. FMAC and media extensions
  • FMAC (floating-point multiply-accumulate) delivers significant peak execution bandwidth
  • Built once per module and used like a coprocessor
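The prefetch benefit of a decoupled predictor (feature 2) can be illustrated with a toy cycle model. This is only a sketch under stated assumptions: the function `run_frontend`, the queue depth, the miss penalty, and the addresses are all hypothetical and are not Bulldozer's real parameters. The predictor keeps filling a prediction queue while fetch is stalled on an instruction cache miss, and the queued addresses are used to prefetch lines that would otherwise miss later.

```python
from collections import deque

def run_frontend(predicted, cold_lines, miss_penalty=3, queue_depth=4):
    """Toy model of a predictor that runs ahead of fetch through a queue.

    predicted   -- fetch addresses in predicted program order
    cold_lines  -- addresses that miss in the instruction cache at first touch
    All numbers are illustrative assumptions, not Bulldozer's real values.
    Returns (fetched addresses, prefetched addresses, cycles taken).
    """
    pending = deque(predicted)  # predictions not yet produced
    queue = deque()             # prediction queue between predict and fetch
    cached = set()              # lines brought in by a demand miss or a prefetch
    fetched, prefetched = [], []
    stall = 0
    cycles = 0
    while pending or queue:
        cycles += 1
        # Predict stage: one new prediction per cycle while the queue has room.
        if pending and len(queue) < queue_depth:
            queue.append(pending.popleft())
        # Fetch stage, stalled: use the queued predictions to prefetch.
        if stall > 0:
            stall -= 1
            for addr in list(queue)[1:]:
                if addr in cold_lines and addr not in cached:
                    cached.add(addr)
                    prefetched.append(addr)
            continue
        # Fetch stage, not stalled: a cold head entry starts a demand miss.
        head = queue[0]
        if head in cold_lines and head not in cached:
            cached.add(head)
            stall = miss_penalty
            continue
        fetched.append(queue.popleft())
    return fetched, prefetched, cycles

addrs = [0x100, 0x140, 0x180, 0x1C0]
cold = {0x100, 0x180}
decoupled = run_frontend(addrs, cold, queue_depth=4)
coupled = run_frontend(addrs, cold, queue_depth=1)  # depth 1: no run-ahead
print(decoupled[2], coupled[2])
```

With a queue depth of 1 the predictor cannot run ahead, so the second cold line is discovered only when fetch reaches it and the full miss penalty is paid twice.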
Function block highlights
• Branch prediction
  • Multilevel BTB
• Instruction cache
  • 64 Kbyte, two-way set-associative
  • Shared between both threads
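To make the instruction cache geometry concrete, the sketch below derives the number of sets and the address split for a 64 Kbyte, two-way set-associative cache. The 64-byte line size is an assumption (the slide does not state it), and the helper names `cache_geometry` and `split_address` are hypothetical.

```python
def cache_geometry(size_bytes=64 * 1024, ways=2, line_bytes=64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    size_bytes and ways come from the slide; line_bytes=64 is an assumption.
    Assumes line_bytes and the resulting set count are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    return sets, offset_bits, index_bits

def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

sets, offset_bits, index_bits = cache_geometry()
print(sets, offset_bits, index_bits)                    # 512 6 9
print(split_address(0x12345, offset_bits, index_bits))  # (2, 141, 5)
```

Under the assumed line size, each of the 512 sets holds two lines, so two threads whose hot code maps to the same set compete for only two ways.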
Function block highlights
• Decode
  • Branch fusion (Intel's term: macro fusion)
  • Four x86 instructions per cycle
• [Figure: Bulldozer execution pipeline]
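Branch fusion can be sketched as a decode-time pass that merges a flag-setting compare with the conditional jump that follows it into a single operation. Which instruction pairs the real hardware fuses is not stated on the slide; the `cmp`/`test` plus conditional-jump rule below, the tuple encoding, and the function name `fuse_branches` are illustrative assumptions.

```python
FLAG_SETTERS = {"cmp", "test"}  # assumed fusible first instructions

def is_conditional_jump(mnemonic):
    """True for jcc-style mnemonics such as jne or jl, but not jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"

def fuse_branches(instrs):
    """Merge adjacent (compare, conditional jump) pairs into one operation.

    instrs is a list of (mnemonic, operands) tuples. The result is a list of
    ("single", instr) or ("fused", compare, jump) entries.
    """
    ops = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if cur[0] in FLAG_SETTERS and nxt and is_conditional_jump(nxt[0]):
            ops.append(("fused", cur, nxt))
            i += 2  # both instructions are consumed by one operation
        else:
            ops.append(("single", cur))
            i += 1
    return ops

program = [
    ("mov", "rax, 1"),
    ("cmp", "rax, rbx"),
    ("jne", "loop"),
    ("add", "rcx, 1"),
]
for op in fuse_branches(program):
    print(op)
```

Four instructions become three operations here, so a fused pair occupies one slot downstream of decode instead of two.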
Function block highlights
• Integer scheduler and execution
  • Renaming through the PRF (physical register file)
• Floating point
  • The FPU is a coprocessor shared between the two integer cores
• L2 cache
  • The two cores share the unified L2 cache
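PRF-based renaming can be sketched as a rename map plus a free list in front of a single physical register file. In this minimal model, data values live only in the register file and everything else carries register indices, which is the sense in which data replication is avoided. The class `PRFRenamer` and the register counts are hypothetical and far smaller than any real design.

```python
class PRFRenamer:
    """Minimal rename map and free list over one physical register file."""

    def __init__(self, num_arch=4, num_phys=8):
        self.prf = [0] * num_phys                  # the only copy of each value
        self.rename_map = list(range(num_arch))    # arch reg -> phys reg
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_dest, src_phys, old_dest).

        Sources are read through the map before the destination is remapped,
        so an instruction such as r1 = r1 + r2 reads the previous r1.
        Raises IndexError if no physical register is free.
        """
        src_phys = [self.rename_map[s] for s in srcs]
        new_dest = self.free_list.pop(0)
        old_dest = self.rename_map[dest]
        self.rename_map[dest] = new_dest
        return new_dest, src_phys, old_dest

    def retire(self, old_dest):
        """At retirement the previous mapping of the destination is freed."""
        self.free_list.append(old_dest)

r = PRFRenamer()
first = r.rename(dest=1, srcs=[1, 2])   # r1 = r1 op r2
second = r.rename(dest=1, srcs=[1, 3])  # r1 = r1 op r3
print(first)   # (4, [1, 2], 1)
print(second)  # (5, [4, 3], 4)
```

The second instruction reads physical register 4, the result of the first, while both writes to r1 target different physical registers, so the write-after-write dependence disappears.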
Bulldozer-based SoC
• Summary
  • Peak single-thread performance is sacrificed in exchange for higher throughput
  • For single-threaded workloads, the FPU matters more
  • Server workloads need ALU performance
  • Bulldozer can deliver a significant performance improvement within the same power budget