Bulldozer: An Approach to Multithreaded Compute Performance

by Michael Butler, Leslie Barnes, Debjit Das Sarma, and Bob Gelinas
This paper appears in: IEEE Micro, March/April 2011 (vol. 31, no. 2), pp. 6-15
Microprocessor Architecture
Speaker: 박세준

Contents
• Motivation
• Introduction
• Block diagram
• Key features
• Function block highlights
• Bulldozer-based SoC

Motivation
AMD has been focusing on core count and highly parallel server workloads.
• Two basic observations
• Future SoCs will support multiple execution threads
• The core should be the smallest possible building module
• Cores will operate in a constrained power environment
• Power reduction techniques: filtering, speculation reduction, data movement minimization
Performance per watt!

Introduction
Bulldozer is a new direction in microarchitecture.
• Bulldozer is the first x86 design to share substantial hardware between multiple cores
• Bulldozer is a hierarchical design with sharing at nearly every level
• Bulldozer is a CPU optimized for high frequency
• It raises average performance rather than peak performance

Introduction
• Major contributions
• Scaling the core structures
• Aggressive frequency goal
• Low gates per clock

Block diagram
• Combines two independent cores into one module
• Implements a shared level 2 cache
• Improved area and power efficiency
• The module can fetch and decode up to four x86 instructions per clock
• Each core can service two loads per cycle
• Shared front end
• Decoupled predict and fetch pipelines
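The decoupled predict and fetch pipelines can be pictured as a producer and a consumer joined by a queue: the predictor runs ahead and enqueues predicted fetch addresses, and the fetch unit drains them. The sketch below is a toy model only. The queue depth, the 32-byte fetch block, the sequential-only prediction, and the names `simulate`, `FETCH_BLOCK`, and `QUEUE_DEPTH` are illustrative assumptions, not details from the paper.

```python
from collections import deque

FETCH_BLOCK = 32   # bytes per fetch block (assumed for illustration)
QUEUE_DEPTH = 4    # prediction queue entries (assumed for illustration)


def simulate(cycles, fetch_stalls, start_pc=0x1000):
    """Toy model: the predictor enqueues one address per cycle while the
    queue has room; fetch dequeues one address per cycle unless stalled."""
    queue = deque()
    next_pc = start_pc
    fetched = []
    for cycle in range(cycles):
        # Predict stage: keeps running ahead as long as the queue is not full.
        if len(queue) < QUEUE_DEPTH:
            queue.append(next_pc)
            next_pc += FETCH_BLOCK  # sequential prediction only in this toy
        # Fetch stage: consumes the oldest prediction unless it is stalled
        # (for example, by an instruction cache miss).
        if cycle not in fetch_stalls and queue:
            fetched.append(queue.popleft())
    return fetched, list(queue)


if __name__ == "__main__":
    fetched, pending = simulate(cycles=6, fetch_stalls={1, 2})
    print([hex(a) for a in fetched])
    print([hex(a) for a in pending])
```

While fetch is stalled, the predictor keeps filling the queue. The addresses still pending in the queue are the kind of information an instruction prefetcher could act on.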

Block diagram
• ALU performance 33% decrease, FPU performance 33% increase
• ALU performance 33% increase, FPU performance 33% increase

Key features
1. Multithreading microarchitecture
• Appropriate use of replication and shared hardware
• The main advantage comes from sharing the instruction cache and branch prediction
• Strengthened front end (larger ROB and BTB)
2. Branch prediction decoupled from the instruction fetch pipeline
• Enables instruction prefetch using the prediction queue
• Instruction control unit (reorder buffer) enlarged to 128 entries
3. Register renaming and operand delivery
• Scheduling and operand handling are the biggest power consumers in the integer execution unit
• PRF-based renaming microarchitecture for power efficiency
• Eliminates data replication
4. FMAC and media extensions
• FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth
• One FPU per module, acting like a coprocessor
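To illustrate why PRF-based renaming avoids data replication, here is a minimal sketch of a renamer in which values live only in the physical register file and the pipeline passes register tags around. The register counts, the FIFO free-list policy, and the `Renamer` class with its method names are hypothetical choices for illustration, not Bulldozer's actual structures.

```python
class Renamer:
    """Toy PRF-based renamer: a map table points each architectural
    register at a physical register; values live only in the PRF."""

    def __init__(self, num_arch=4, num_phys=8):
        # Initially architectural register i maps to physical register i.
        self.map_table = list(range(num_arch))
        self.free_list = list(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction.

        Returns (phys_dst, phys_srcs, old_phys_dst)."""
        # Sources read the current mapping: only register tags move,
        # no operand values are copied at rename time.
        phys_srcs = [self.map_table[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dst = self.free_list.pop(0)
        old_phys_dst = self.map_table[dst]
        self.map_table[dst] = phys_dst
        return phys_dst, phys_srcs, old_phys_dst

    def retire(self, old_phys_dst):
        """At retirement, the destination's previous mapping is freed."""
        self.free_list.append(old_phys_dst)


if __name__ == "__main__":
    r = Renamer()
    print(r.rename(dst=1, srcs=[2, 3]))  # r1 = r2 op r3
    print(r.rename(dst=2, srcs=[1, 1]))  # r2 = r1 op r1, reads the new r1
```

The second instruction reads the physical register allocated by the first, so the dependency is tracked purely through tags.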

Function block highlights
• Branch prediction
• Multilevel BTB
• Instruction cache
• 64 Kbyte, two-way set-associative
• Cache shared between both threads
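As a worked example of what "64 Kbyte, two-way set-associative" implies, the sketch below derives the number of sets and the address bit split. The 64-byte line size is an assumption, since the slide does not state it, and the helper names are hypothetical. The arithmetic assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways, and line size give power-of-two set counts."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    # 64 Kbyte, two-way, with an ASSUMED 64-byte line.
    sets, offset_bits, index_bits = cache_geometry(64 * 1024, 2, 64)
    print(sets, offset_bits, index_bits)
    print(split_address(0x12345, offset_bits, index_bits))
```

Under the assumed line size this gives 512 sets, 6 offset bits, and 9 index bits. A different line size changes the split but not the method.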

Function block highlights
• Decode
• Branch fusion (Intel calls this macro fusion); four x86 instructions per cycle
• Bulldozer execution pipeline
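Branch fusion can be sketched as a decode-time pass that merges a compare and the conditional branch that immediately follows it into a single op. The example below is a toy illustration: the sets of fusible mnemonics, the text representation of instructions, and the `fuse_branches` function are assumptions, not the actual fusion rules of the Bulldozer decoder.

```python
COMPARE_OPS = {"cmp", "test"}               # assumed fusible first halves
COND_BRANCHES = {"je", "jne", "jl", "jge"}  # assumed fusible branches (subset)


def fuse_branches(instructions):
    """Return a list of ops in which each adjacent compare + conditional
    branch pair is merged into a single fused op."""
    ops = []
    i = 0
    while i < len(instructions):
        mnemonic = instructions[i].split()[0]
        has_next = i + 1 < len(instructions)
        if (mnemonic in COMPARE_OPS and has_next
                and instructions[i + 1].split()[0] in COND_BRANCHES):
            # The pair occupies one op slot instead of two.
            ops.append(instructions[i] + " ; " + instructions[i + 1])
            i += 2
        else:
            ops.append(instructions[i])
            i += 1
    return ops


if __name__ == "__main__":
    program = ["add eax, 1", "cmp eax, ebx", "jne loop", "mov ecx, eax"]
    for op in fuse_branches(program):
        print(op)
```

In this toy program four instructions become three ops, which shows how fusion can stretch a fixed per-cycle decode and dispatch width.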

Function block highlights
• Integer scheduler and execution
• Renaming by PRF (physical register file)
• Floating point
• The FPU is a coprocessor shared between the two integer cores
• L2 cache
• The two cores share the unified L2 cache

Bulldozer-based SoC
• Summary
• Single-thread peak performance is sacrificed to increase throughput
• In single-threaded operation, the FPU is more important
• Servers need ALU performance
• Bulldozer can deliver a significant performance improvement within the same power budget

The end
