Finishing out EECS 470: A few snapshots of the real world
Real processors: How they are different from your project • What we’ve talked about so far isn’t grounded by the real world in any meaningful way. • That is, we haven’t really looked at how real processors do things • Today we’ll look at two processors • We’ll start with a 2003 core from AMD • Lots of details available, close to your project • Jump to the latest Intel core. • Look at a performance issue
AMD 64-bit core • Most taken from http://www.chip-architect.com/
Bit-interleaved busses running “North-South”
Integer Decode/Dispatch • 3 types of instructions • Direct path • RISC-like • Vector path • Broken into smaller instructions via microcode. • Double • 128-bit instructions that can be broken into two independent 64-bit instructions (called Doubles) • Others are done via microcode • Most 128-bit SSE and SSE2 instructions are made into Doubles.
RS • Each cycle an instruction is issued into one of 3 lanes. • Each lane has • 8 RSs • 1 ALU • 1 AGU (Address Generation Unit) • Each RS sees broadcasts from all ALUs, AGUs, L/S units etc.
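A minimal sketch of the broadcast/wakeup idea on the RS slide, assuming each RS entry holds two source operands that either carry a value or wait on a tag. The struct layout and the names `rs_entry`, `rs_lane`, `rs_broadcast` and `rs_select` are made up for illustration and are not taken from the AMD design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RS_PER_LANE 8   /* 8 RSs per lane, as on the slide */

/* One reservation station entry (hypothetical layout). */
typedef struct {
    bool     busy;          /* entry holds a waiting instruction */
    bool     src_ready[2];  /* operand value already captured? */
    int      src_tag[2];    /* tag to watch for when not ready */
    uint64_t src_val[2];    /* operand value once captured */
} rs_entry;

typedef struct {
    rs_entry e[RS_PER_LANE];
} rs_lane;

/* Deliver one broadcast (tag, value) to every waiting operand in the lane.
 * Returns how many operands captured the value. */
int rs_broadcast(rs_lane *lane, int tag, uint64_t value)
{
    int captured = 0;
    assert(lane != NULL);
    for (int i = 0; i < RS_PER_LANE; ++i) {
        rs_entry *e = &lane->e[i];
        if (!e->busy)
            continue;
        for (int s = 0; s < 2; ++s) {
            if (!e->src_ready[s] && e->src_tag[s] == tag) {
                e->src_val[s] = value;
                e->src_ready[s] = true;
                ++captured;
            }
        }
    }
    return captured;
}

/* Return the index of the first busy entry with both operands ready,
 * or -1 if nothing can issue yet. */
int rs_select(const rs_lane *lane)
{
    assert(lane != NULL);
    for (int i = 0; i < RS_PER_LANE; ++i) {
        const rs_entry *e = &lane->e[i];
        if (e->busy && e->src_ready[0] && e->src_ready[1])
            return i;
    }
    return -1;
}
```

In hardware every entry compares against every broadcast bus in parallel; the loops here only model that comparison one broadcast at a time.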
Rename • Break the physical register file into 2 parts (sort of like P6 scheme with ARF/RoB) • 72 in-flight instructions are kept in the RoB • The other structure is the IFFRF: Integer Future File and Register File • 16 registers of committed state • 16 “future registers” • 8 scratch-pad registers
Future file • In the P6 scheme we had to look 3 places for the data • The ARF • The RoB • The CDB (later) • Here we look in the FF or the CDB-like-things later. • The FF holds the speculative value if it is known. • At execution complete, instructions check to see if they were the last thing to dispatch that writes to a given architectural register (AR). • This is done by tagging the FF with the RoB number. • If they were the last to have that AR as a destination, they update the FF.
How does the • At issue we: • Check the FF for source operands • Reserve a spot in the RoB • Place our tag (RoB number) in the FF • Mark the FF entry as invalid • At EX complete we: • Send RoB number and data to the CDB • Send data to the RoB • Update FF if tag matches • At retire: • Update ARF value (from RoB) • At mispredict: • Copy ARF value into FF.
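The issue / EX complete / retire / mispredict steps above can be sketched in C roughly as follows. This is a toy model under stated assumptions, not the AMD implementation: the RoB itself is left out (the caller passes RoB numbers and retired values in), and all names (`rename_state`, `ff_issue`, `ff_read`, `ff_complete`, `ff_retire`, `ff_mispredict`) are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ARCH_REGS 16   /* 16 registers of committed state, per the Rename slide */

/* One future-file entry (hypothetical layout). */
typedef struct {
    uint64_t value;  /* speculative value, meaningful only when valid */
    int      tag;    /* RoB number of the last dispatched writer */
    bool     valid;  /* true once that writer's result has arrived */
} ff_entry;

typedef struct {
    uint64_t arf[NUM_ARCH_REGS];  /* committed state */
    ff_entry ff[NUM_ARCH_REGS];   /* speculative state */
} rename_state;

void rename_init(rename_state *s)
{
    for (int r = 0; r < NUM_ARCH_REGS; ++r) {
        s->arf[r] = 0;
        s->ff[r] = (ff_entry){ .value = 0, .tag = -1, .valid = true };
    }
}

/* Issue: place our RoB number in the FF and mark the entry invalid. */
void ff_issue(rename_state *s, int dest, int rob_tag)
{
    assert(dest >= 0 && dest < NUM_ARCH_REGS);
    s->ff[dest].tag = rob_tag;
    s->ff[dest].valid = false;
}

/* Issue: check the FF for a source operand. Returns true and fills *value
 * if the value is known; otherwise fills *wait_tag with the tag to watch. */
bool ff_read(const rename_state *s, int src, uint64_t *value, int *wait_tag)
{
    assert(src >= 0 && src < NUM_ARCH_REGS);
    if (s->ff[src].valid) {
        *value = s->ff[src].value;
        return true;
    }
    *wait_tag = s->ff[src].tag;
    return false;
}

/* EX complete: update the FF only if we are still the last writer. */
void ff_complete(rename_state *s, int dest, int rob_tag, uint64_t result)
{
    assert(dest >= 0 && dest < NUM_ARCH_REGS);
    if (!s->ff[dest].valid && s->ff[dest].tag == rob_tag) {
        s->ff[dest].value = result;
        s->ff[dest].valid = true;
    }
}

/* Retire: the value held in the RoB becomes committed state. */
void ff_retire(rename_state *s, int dest, uint64_t result)
{
    assert(dest >= 0 && dest < NUM_ARCH_REGS);
    s->arf[dest] = result;
}

/* Mispredict: throw away speculative state by copying the ARF into the FF. */
void ff_mispredict(rename_state *s)
{
    for (int r = 0; r < NUM_ARCH_REGS; ++r)
        s->ff[r] = (ff_entry){ .value = s->arf[r], .tag = -1, .valid = true };
}
```

Note that an older writer completing after a younger one has issued leaves the FF untouched; its result only reaches the register state through the RoB at retire.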
What did the FF buy us? • P6-like advantages • No free-list for PRF • Can just clear the RAT on mis-predict. • But no need to access the RoB looking for data • RoB data only written once (EX complete) and only read once (Commit) • Some pain • Early branch resolution looks hard
ROB • It uses an 8-bit descriptor for 72 entries.
1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched. 2) A value 0..23 that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23. 3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.
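A sketch of how such a descriptor could be packed and compared, assuming 2 bits for the lane, 5 bits for the cycle counter and 1 wrap bit (2 + 5 + 1 = 8). The bit positions, the lane-0-is-oldest ordering within a cycle, and all function names are assumptions made for illustration; the slides give only the three fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROB_LANES  3
#define ROB_CYCLES 24   /* cycle counter runs 0..23 */

typedef struct {
    unsigned lane;   /* 0..2 */
    unsigned cycle;  /* 0..23 */
    unsigned wrap;   /* 0 or 1 */
} rob_id;

/* Assumed layout: bit 7 = wrap, bits 6..2 = cycle, bits 1..0 = lane. */
uint8_t rob_pack(rob_id id)
{
    assert(id.lane < ROB_LANES && id.cycle < ROB_CYCLES && id.wrap < 2);
    return (uint8_t)((id.wrap << 7) | (id.cycle << 2) | id.lane);
}

rob_id rob_unpack(uint8_t d)
{
    rob_id id = { .lane = d & 0x3u, .cycle = (d >> 2) & 0x1Fu, .wrap = d >> 7 };
    return id;
}

/* Flat slot number 0..71 if the three 24-entry RoBs are viewed as one. */
unsigned rob_index(rob_id id)
{
    return id.cycle * ROB_LANES + id.lane;
}

/* Descriptor for the same lane in the next dispatch cycle; the wrap bit
 * flips when the cycle counter goes from 23 back to 0. */
rob_id rob_next_cycle(rob_id id)
{
    if (++id.cycle == ROB_CYCLES) {
        id.cycle = 0;
        id.wrap ^= 1u;
    }
    return id;
}

/* True if a was dispatched before b. Only meaningful while both are in
 * flight, i.e. fewer than 24 dispatch cycles apart. */
bool rob_older(rob_id a, rob_id b)
{
    if (a.wrap != b.wrap)          /* counter wrapped between the two */
        return a.cycle > b.cycle;
    if (a.cycle != b.cycle)
        return a.cycle < b.cycle;
    return a.lane < b.lane;        /* assumption: lane 0 is oldest in a cycle */
}
```

The wrap bit is what lets the age comparison stay a simple compare even though the cycle counter is circular.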
More on the RoB • What is basically happening is that we have three RoBs • Each one has 24 entries • We cycle through each one so that none gets ahead of the others. • Reduces read/write ports!
Mispredictions • It looks like they wait until retirement to resolve all exceptions. • Mispredictions are treated as exceptions! • They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF
More details. • Each x86 instruction can launch both an ALU and an AGU operation • Because x86 has lots of memory operations this makes sense. • ALUs broadcast result tag one cycle early • So RS can launch data to the ALU before data arrives.
Lane 8
Intel’s Haswell • Latest Intel microarchitecture • 22nm process • 4-wide OoO processor • x86 • An evolution, not a revolution • Very similar to architectures from the last 8 years. • http://www.anandtech.com/show/6355/intels-haswell-architecture
Basics • Converts x86 instructions into microops • RISC-like instructions • Even more basic than RISC in some cases • Loads and Stores generally turn into two instructions • Address compute and memory access
What’s interesting? • Seeing how things have changed compared to previous microarchitectures • Transactional support • Power issues
Buffer sizes • 192 RoB entries • 60 RS • 72 Loads • 42 stores
Other key features • Transactional synchronization • Execute lock-protected section • Don’t acquire lock • If someone else is doing the same thing at the same time • Undo all memory accesses • Do again with locks. • Why? • New sleep states • More like handheld devices.
Microarchitecture and performance

void tightloop() {
    unsigned j;
    for (j = 0; j < N; ++j)
        counter += j;
}

void foo() {
}

void loop_with_extra_call() {
    unsigned j;
    for (j = 0; j < N; ++j) {
        __asm__("call foo");
        counter += j;
    }
}

• tightloop() runs in 0.68 sec • loop_with_extra_call() runs in 0.60 sec • Why? • http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/
0000000000400530 <tightloop>:
  400530: xor    %eax,%eax
  400532: nopw   0x0(%rax,%rax,1)
  400538: mov    0x200b01(%rip),%rdx   # 601040 <counter>
  40053f: add    %rax,%rdx
  400542: add    $0x1,%rax
  400546: cmp    $0x17d78400,%rax
  40054c: mov    %rdx,0x200aed(%rip)   # 601040 <counter>
  400553: jne    400538 <tightloop+0x8>
  400555: repz retq
  400557: nopw   0x0(%rax,%rax,1)

0000000000400560 <foo>:
  400560: repz retq

0000000000400570 <loop_with_extra_call>:
  400570: xor    %eax,%eax
  400572: nopw   0x0(%rax,%rax,1)
  400578: callq  400560 <foo>
  40057d: mov    0x200abc(%rip),%rdx   # 601040 <counter>
  400584: add    %rax,%rdx
  400587: add    $0x1,%rax
  40058b: cmp    $0x17d78400,%rax
  400591: mov    %rdx,0x200aa8(%rip)   # 601040 <counter>
  400598: jne    400578 <loop_with_extra_call+0x8>
  40059a: repz retq
  40059c: nopl   0x0(%rax)