AMD Opteron Overview Michael Trotter (mjt5v) Tim Kang (tjk2n) Jeff Barbieri (jjb3v)
Introduction
• AMD Opteron
• Focuses on Barcelona
• Barcelona is AMD's 65nm 4-core CPU
Fetch
• Fetches 32B from the L1 instruction cache into the pre-decode/pick buffer
• For simplicity, Barcelona uses pre-decode information to mark the end of each instruction
Inst. Decode
• The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle
• Pre-decode information is stored in the ECC bits of the L1I, L2, and L3 caches, alongside each line of instructions (see the sketch below)
• Instructions are then passed through the sideband stack optimizer
• x86 includes instructions that directly manipulate the stack of each thread
• AMD introduced the sideband stack optimizer to remove these stack manipulations from the instruction stream
• Thus, many stack operations can be processed in parallel
• This frees up the reservation stations, re-order buffers, and regular ALUs for other work
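A minimal sketch of how end-of-instruction marks stored with a fetch block can be used to pick out variable-length x86 instructions without re-scanning the bytes. The bitmask layout and function names are illustrative assumptions, not AMD's actual structures.

```c
#include <stdint.h>

#define FETCH_BYTES 32   /* bytes fetched per cycle into the pick buffer */

/* end_bits: bit i set means byte i is the last byte of an instruction.
 * Records each instruction's start offset and length; returns the count. */
int pick_instructions(uint32_t end_bits,
                      uint8_t starts[FETCH_BYTES], uint8_t lens[FETCH_BYTES])
{
    int count = 0, start = 0;
    for (int i = 0; i < FETCH_BYTES; i++) {
        if (end_bits & (1u << i)) {
            starts[count] = (uint8_t)start;          /* instruction begins here */
            lens[count]   = (uint8_t)(i - start + 1); /* ends at the marked byte */
            count++;
            start = i + 1;
        }
    }
    return count;   /* instructions whose boundaries are already known */
}
```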
Branch Prediction
• A branch selector chooses between a bi-modal predictor and a global predictor
• The bi-modal predictor and branch selector are both stored in the ECC bits of the instruction cache, as pre-decode information
• The global predictor combines the relative instruction pointer (RIP) of a conditional branch with a global history register
• Tracks the last 12 branches with a 16K-entry prediction table of 2-bit saturating counters (see the sketch below)
• The branch target address calculator (BTAC) checks the targets of relative branches
• Can correct mis-predictions with a two-cycle penalty
• Barcelona uses an indirect predictor
• Specifically designed to handle branches with multiple targets (e.g. switch or case statements)
• Return address stack has 24 entries
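A hedged sketch of a global predictor with the parameters quoted above: a 12-bit global history register and a 16K-entry table of 2-bit saturating counters. The exact way Barcelona folds the RIP with the history is not described here, so the index function is an assumption.

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_ENTRIES 16384              /* 16K-entry prediction table   */
#define HISTORY_BITS  12                 /* last 12 branch outcomes      */

static uint8_t  counters[TABLE_ENTRIES]; /* 2-bit saturating counters, 0..3 */
static uint16_t ghr;                     /* global history register      */

/* Fold the branch address with the global history (assumed hash). */
static unsigned index_of(uint64_t rip)
{
    return (unsigned)((rip ^ ((uint64_t)ghr << 2)) & (TABLE_ENTRIES - 1));
}

bool predict_taken(uint64_t rip)
{
    return counters[index_of(rip)] >= 2; /* weakly/strongly taken */
}

void update(uint64_t rip, bool taken)
{
    uint8_t *c = &counters[index_of(rip)];
    if (taken  && *c < 3) (*c)++;        /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;        /* saturate at 0 */
    ghr = (uint16_t)(((ghr << 1) | (taken ? 1u : 0u)) & ((1u << HISTORY_BITS) - 1));
}
```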
Pipeline
• Uses a 12-stage pipeline
OO (ROB)
• The Pack Buffer (post-decoding buffer) sends groups of 3 micro-ops to the re-order buffer (ROB)
• The re-order buffer contains 24 entries, with 3 lanes per entry
• Holds a total of 72 instructions (see the sketch below)
• Instructions can be moved between lanes to avoid a congested reservation station or to observe issue restrictions
• From the ROB, instructions issue to the appropriate scheduler
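A minimal sketch of the 24-entry by 3-lane layout described above (72 micro-ops in flight). Field names and the retirement check are illustrative; they only show the row/lane organization, not AMD's internal format.

```c
#include <stdbool.h>
#include <stdint.h>

#define ROB_ROWS  24   /* entries                 */
#define ROB_LANES 3    /* micro-ops per entry     */

struct rob_slot {
    uint64_t uop;        /* decoded micro-op       */
    uint8_t  dest_reg;   /* destination register   */
    bool     completed;  /* result ready           */
    bool     valid;
};

struct rob {
    struct rob_slot rows[ROB_ROWS][ROB_LANES];
    int head;            /* oldest row, retires in order        */
    int tail;            /* next row to allocate, 3 uops at a time */
};

/* A row retires only when every valid lane in it has completed. */
static bool row_can_retire(const struct rob *r)
{
    for (int lane = 0; lane < ROB_LANES; lane++)
        if (r->rows[r->head][lane].valid && !r->rows[r->head][lane].completed)
            return false;
    return true;
}
```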
Integer Future File and Register File (IFFRF)
• The IFFRF contains 40 registers broken up into three distinct sets
• The Architectural Register File contains 16 x 64-bit non-speculative registers
• Instructions modify the Architectural Register File only when they commit
• Speculative instructions read from and write to the Future File, which contains the most recent speculative state of the 16 architectural registers
• The last 8 registers are scratchpad registers used by microcode
• Should a branch mis-prediction or an exception occur, the pipeline rolls back and the Architectural Register File overwrites the contents of the Future File (see the sketch below)
• There are three reservation stations, i.e. schedulers, within the integer cluster
• Each station is tied to a specific lane in the ROB and holds 8 instructions
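A hedged sketch of the future-file idea: speculative writes go to the future file, commitment updates the architectural file, and a mis-prediction restores the future file from architectural state. The 16 + 16 + 8 register split follows the slide; the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

#define ARCH_REGS 16

struct iffrf {
    uint64_t arch[ARCH_REGS];     /* committed, non-speculative state */
    uint64_t future[ARCH_REGS];   /* most recent speculative state    */
    uint64_t scratch[8];          /* microcode scratchpad registers   */
};

/* Speculative instructions read and write the future file. */
static void speculative_write(struct iffrf *rf, int reg, uint64_t v)
{
    rf->future[reg] = v;
}

/* Retirement makes the result architectural. */
static void commit(struct iffrf *rf, int reg, uint64_t v)
{
    rf->arch[reg] = v;
}

/* On a mis-prediction or exception, the architectural file
 * overwrites the future file and speculation restarts. */
static void rollback(struct iffrf *rf)
{
    memcpy(rf->future, rf->arch, sizeof rf->arch);
}
```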
Integer Execution
• Barcelona uses three symmetric ALUs which can execute almost any integer instruction
• Three full-featured ALUs require more die area and power
• Can provide higher performance for certain edge cases
• Enables a simpler design for the ROB and schedulers
Floating Point Execution
• Floating point operations are first sent to the FP Mapper and Renamer
• In the Renamer, up to 3 FP instructions each cycle are assigned a destination register from the 120-entry FP register file (see the sketch below)
• Once the micro-ops have been renamed, they may be issued to the three FP schedulers
• Operands can be obtained from either the FP register file or the forwarding network
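A minimal sketch of the renaming step as described above: each incoming micro-op's destination is mapped to a free entry of a 120-entry physical FP register file, up to three per cycle. The free-list handling is an assumption for illustration only.

```c
#include <stdint.h>

#define FP_PHYS_REGS 120      /* FP register file entries        */
#define RENAME_WIDTH 3        /* up to 3 FP ops renamed per cycle */

static int16_t free_list[FP_PHYS_REGS];
static int     free_count;

/* Mark every physical register as free at startup. */
static void init_free_list(void)
{
    for (int i = 0; i < FP_PHYS_REGS; i++)
        free_list[i] = (int16_t)i;
    free_count = FP_PHYS_REGS;
}

/* Pop a free physical register for a micro-op's destination; -1 if none. */
static int alloc_phys_reg(void)
{
    return free_count > 0 ? free_list[--free_count] : -1;
}

/* Rename one group of up to three micro-ops; returns how many got a register. */
int rename_group(int ndests, int16_t phys_out[RENAME_WIDTH])
{
    int done = 0;
    for (int i = 0; i < ndests && i < RENAME_WIDTH; i++) {
        int p = alloc_phys_reg();
        if (p < 0)
            break;            /* stall: no free physical registers */
        phys_out[done++] = (int16_t)p;
    }
    return done;
}
```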
Floating Point Execution (SIMD)
• The FPUs are 128 bits wide so that Streaming SIMD Extension (SSE) instructions can execute in a single pass (see the example below)
• Similarly, the load-store units and the FMISC unit handle 128-bit wide data to improve SSE performance
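A small example of the kind of 128-bit SSE operation the slide refers to: four single-precision additions carried in one packed instruction (ADDPS). On a 128-bit FPU such an operation can issue in a single pass rather than being split into two 64-bit halves.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);   /* one 128-bit wide packed add (ADDPS) */

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}
```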
Memory Hierarchy
• 4 separate 128KB 2-way set associative L1 caches
• Latency = 3 cycles
• Write-back to L2
• The data paths to and from the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive)
• 4 separate 512KB 16-way set associative L2 caches
• Latency = 12 cycles
• Line size is 64B (see the worked example below)
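A worked example of the L2 geometry quoted above: a 512KB, 16-way cache with 64B lines has 512KB / 64B = 8192 lines and 8192 / 16 = 512 sets, so an address splits into a 6-bit line offset, a 9-bit set index, and a tag. This is standard cache arithmetic, not an AMD-specific address layout.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES   64u
#define CACHE_BYTES  (512u * 1024u)
#define WAYS         16u
#define SETS         (CACHE_BYTES / LINE_BYTES / WAYS)   /* 512 sets */

int main(void)
{
    uint64_t addr = 0x7f1234567ABCull;           /* arbitrary example address */
    uint64_t off  = addr % LINE_BYTES;           /* bits 0..5                 */
    uint64_t set  = (addr / LINE_BYTES) % SETS;  /* bits 6..14                */
    uint64_t tag  = addr / LINE_BYTES / SETS;    /* remaining upper bits      */

    printf("sets=%u offset=%llu set=%llu tag=%#llx\n",
           SETS, (unsigned long long)off, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```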
L3 Cache
• Shared 2MB 32-way set associative L3
• Latency = 38 cycles
• Uses 64B lines
• The L3 cache was designed with data sharing in mind
• When a line is requested, if it is likely to be shared, then it will remain in the L3
• This leads to duplication which would not happen in an exclusive hierarchy
• In the past, a pseudo-LRU algorithm would evict the oldest line in the cache
• In Barcelona's L3, the replacement algorithm has been changed to prefer evicting unshared lines (see the sketch below)
• Access to the L3 must be arbitrated since the L3 is shared between four different cores
• A round-robin algorithm is used to give access to one of the four cores each cycle
• Each core has 8 data prefetchers (a total of 32 per device)
• The prefetchers fill the L1D cache and can have up to 2 outstanding fetches to any address
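A hedged sketch of the replacement preference described above: when a victim is needed in a 32-way set, prefer the oldest line that is not marked shared, and fall back to the overall oldest line if every way is shared. The age counter and the "shared" hint are illustrative bookkeeping, not Barcelona's actual pseudo-LRU state.

```c
#include <stdbool.h>
#include <stdint.h>

#define L3_WAYS 32   /* 32-way set associative */

struct l3_line {
    uint64_t tag;
    uint32_t age;     /* larger = older (stand-in for pseudo-LRU state) */
    bool     shared;  /* line likely used by more than one core         */
    bool     valid;
};

int pick_victim(const struct l3_line set[L3_WAYS])
{
    int oldest = 0, oldest_unshared = -1;

    for (int w = 0; w < L3_WAYS; w++) {
        if (!set[w].valid)
            return w;                               /* a free way wins outright */
        if (set[w].age > set[oldest].age)
            oldest = w;
        if (!set[w].shared &&
            (oldest_unshared < 0 || set[w].age > set[oldest_unshared].age))
            oldest_unshared = w;
    }
    /* Prefer evicting an unshared line; otherwise evict the oldest line. */
    return oldest_unshared >= 0 ? oldest_unshared : oldest;
}
```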
Memory Controllers
• Each memory controller supports independent 64B transactions
• The integrated DDR2 memory controller ensures that an L3 cache miss is resolved in less than 60 nanoseconds
TLB
• Barcelona offers non-speculative memory access re-ordering in the form of the Load Store Units (LSU)
• Thus, some memory operations can be issued out-of-order
• In the 12-entry LSU1, the oldest operations translate their addresses from the virtual address space to the physical address space using the L1 DTLB
• During this translation, the lower 12 bits of the load operation's address are tested against previously stored addresses (see the sketch below)
• If they are different, the load proceeds ahead of the store
• If they are the same, load-store forwarding occurs
• Should a miss in the L1 DTLB occur, the L2 DTLB will be checked
• Once the load or store has located its address in the cache, the operation moves on to LSU2
• LSU2 holds up to 32 memory accesses, where they stay until they are removed
• The LSU2 handles any cache or TLB misses via scheduling and probing
• In the case of a cache miss, the LSU2 will look in the L2, L3, and then memory
• In the case of a TLB miss, it will look in the L2 TLB and then main memory
• The LSU2 also holds store instructions, which are not allowed to actually modify the caches until retirement, to ensure correctness
• Thus, the LSU2 contains the majority of the complexity in the memory pipeline
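A sketch of the address check described above: while a load's virtual address is still being translated, its low 12 bits (identical in the virtual and physical address for 4KB pages) are compared against older, not-yet-retired stores. A mismatch lets the load pass the stores; a match triggers load-store forwarding. The queue structure and function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xFFFu   /* low 12 bits: the 4KB page offset */

struct pending_store {
    uint64_t vaddr;   /* virtual address of an older, unretired store */
    uint64_t data;
    bool     valid;
};

/* Returns true and forwards data if an older store matches the load's page
 * offset; returns false if the load may proceed ahead of the stores. */
bool try_forward(const struct pending_store *stores, int n,
                 uint64_t load_vaddr, uint64_t *out)
{
    for (int i = n - 1; i >= 0; i--) {            /* youngest matching store first */
        if (stores[i].valid &&
            (stores[i].vaddr & PAGE_OFFSET_MASK) ==
            (load_vaddr      & PAGE_OFFSET_MASK)) {
            *out = stores[i].data;                /* load-store forwarding */
            return true;
        }
    }
    return false;                                 /* load goes ahead of the stores */
}
```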
Hypertransport
• Barcelona has four HyperTransport 3.0 links for inter-processor communications and I/O devices
• HyperTransport 3.0 adds a feature called 'unganging', or lane-splitting
• The HT 3.0 links are composed of two 16-bit lanes (one in each direction)
• Each can be split into a pair of independent 8-bit wide links (see the sketch below)
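A toy model of unganging under the assumptions above: a link that is 16 bits wide in each direction can run either ganged as one x16 link or split into two independent x8 sublinks. The struct and function names are hypothetical and only illustrate the width bookkeeping.

```c
#include <stdbool.h>
#include <stdio.h>

struct ht_link {
    int  width_bits;   /* 16 when ganged, 8 per sublink when split      */
    bool unganged;     /* true: operates as two independent x8 sublinks */
};

/* Split one x16 link into a pair of independent x8 sublinks. */
void ungang(const struct ht_link *ganged, struct ht_link out[2])
{
    for (int i = 0; i < 2; i++) {
        out[i].width_bits = ganged->width_bits / 2;  /* 8 bits each */
        out[i].unganged   = true;
    }
}

int main(void)
{
    struct ht_link link = { .width_bits = 16, .unganged = false };
    struct ht_link pair[2];
    ungang(&link, pair);
    printf("sublink width: %d bits\n", pair[0].width_bits);
    return 0;
}
```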
Shanghai
• The latest model of the Opteron series
• Several improvements over Barcelona:
• 45nm process
• 6MB L3 cache
• Improved clock speeds
• A host of other improvements