Last lecture

Last lecture • Some misc. stuff • An older real processor • Class review/overview

Misc. Status issues • HW5 • Answers posted • Returned on Wednesday (next week) • Project presentation signup at http://tinyurl.com/470W14talks • Locations are all over (see link) • Need to be there for whole time slot. • Exam review etc. • Q&A session • 1-2:45pm on Wednesday 2/23 • Office hours • see calendar.

Stuff still to do • Oral report • Don’t forget to be there for the whole hour • PowerPoint or other slides • Either bring portable or USB stick • Written report • Due 9pm Tuesday via e-mail. • Exam 4/25 • 1:30-3:30pm • This room (1670 BBB)

AMD 64-bit core • Most taken from http://www.chip-architect.com/

Bit-interleaved busses running “North-South”

Integer Decode/Dispatch • 3 types of instructions • Direct path • RISC-like • Vector path • Broken into smaller instructions via microcode. • Double • 128-bit instructions which can be broken into 2 independent 64-bit instructions (called Doubles) • Others are done via microcode • Most 128-bit SSE and SSE2 instructions are made into doubles.

RS • Each cycle an instruction is issued into one of 3 lanes. • Each lane has • 8 RSs • 1 ALU • 1 AGU (Address Generation Unit) • Each RS sees broadcasts from all ALUs, AGUs, L/S units etc.

Rename • Break the physical register file into 2 parts (sort of like P6 scheme with ARF/RoB) • 72 in-flight instructions are kept in the RoB • The other structure is the IFFRF: Integer Future File and Register File • 16 registers of committed state • 16 “future registers” • 8 scratch-pad registers

Future file • In the P6 scheme we had to look in 3 places for the data • The PRF • The RoB • The CDB (later) • Here we look in the FF or the CDB-like-things later. • The FF holds the speculative value if it is known. • At execution complete, instructions check to see if they were the last thing to dispatch that writes to a given architectural register (AR). • This is done by tagging the FF with the RoB number. • If they were the last to have that AR as a destination, they update the FF.

How do we use the FF? • At issue we: • Check the FF for source operands • Reserve a spot in the RoB • Place our tag (RoB number) in the FF • Mark the FF entry as invalid • At EX complete we: • Send RoB number and data to the CDB • Send data to the RoB • Update FF if tag matches • At retire • update ARF value (from RoB) • At mispredict • Copy ARF value into FF.
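The four steps above (issue, EX complete, retire, mispredict) can be sketched as a small software model. This is a simplified illustration, not the actual AMD design: the class and field names are made up, the RoB is an unbounded dictionary rather than 72 entries, and a mispredict is assumed to be handled only once every older instruction has retired.

```python
from dataclasses import dataclass
from typing import Optional

NUM_ARCH_REGS = 16  # the slides list 16 registers of committed state


@dataclass
class FFEntry:
    value: int = 0
    valid: bool = True
    tag: Optional[int] = None  # RoB number of the last issued writer


@dataclass
class RobEntry:
    dest: int
    value: Optional[int] = None
    done: bool = False


class FutureFileModel:
    """Hypothetical model of a future file (FF) plus ARF and RoB."""

    def __init__(self):
        self.arf = [0] * NUM_ARCH_REGS
        self.ff = [FFEntry() for _ in range(NUM_ARCH_REGS)]
        self.rob = {}       # RoB number -> RobEntry (unbounded, a simplification)
        self.next_rob = 0   # next RoB number to hand out
        self.head = 0       # oldest un-retired RoB number

    def issue(self, dest, srcs):
        # Check the FF for source operands: a value if valid, else a tag to wait on.
        operands = []
        for s in srcs:
            e = self.ff[s]
            operands.append(("value", e.value) if e.valid else ("tag", e.tag))
        # Reserve a RoB slot, place our tag in the FF, mark the entry invalid.
        rob_num = self.next_rob
        self.next_rob += 1
        self.rob[rob_num] = RobEntry(dest=dest)
        self.ff[dest].valid = False
        self.ff[dest].tag = rob_num
        return rob_num, operands

    def complete(self, rob_num, value):
        # Send data to the RoB.
        entry = self.rob[rob_num]
        entry.value, entry.done = value, True
        # Update the FF only if we are still the last issued writer of this AR.
        ff = self.ff[entry.dest]
        if ff.tag == rob_num:
            ff.value, ff.valid = value, True
        return rob_num, value  # what would be broadcast on the CDB

    def retire(self):
        # Retire the oldest instruction in order, updating the ARF from the RoB.
        entry = self.rob.get(self.head)
        if entry is None or not entry.done:
            return False
        self.arf[entry.dest] = entry.value
        del self.rob[self.head]
        self.head += 1
        return True

    def mispredict(self):
        # Squash everything in flight and copy the ARF values into the FF.
        self.rob.clear()
        self.head = self.next_rob
        for i in range(NUM_ARCH_REGS):
            self.ff[i] = FFEntry(value=self.arf[i], valid=True, tag=None)
```

Note that `issue` reads the sources before it tags the destination, so an instruction such as R1 = R1 + 1 sees the older producer of R1 rather than itself.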

What did the FF buy us? • P6-like advantages • No free-list for PRF • Can just clear the RAT on mis-predict. • But no need to access the RoB looking for data • RoB data only written once (EX complete) and only read once (Commit) • Some pain • Early branch resolution looks hard

RoB: An 8-bit descriptor for 72 entries • 1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched. • 2) A value 0..23 that identifies the “cycle” in which the instruction was dispatched. The “cycle counter” wraps to 0 after reaching 23. • 3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.
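A minimal sketch of how such a descriptor could be packed and how two descriptors could be compared for age. The bit layout (wrap bit on top, then 5 cycle bits, then 2 lane bits) and the rule that lane 0 is oldest within a cycle are assumptions made for illustration; the slide only gives the three fields. The function names are hypothetical.

```python
LANES = 3
CYCLES = 24  # 3 lanes x 24 cycles = 72 RoB entries


def pack(wrap, cycle, lane):
    """Pack the three fields into 8 bits: [wrap:1][cycle:5][lane:2] (assumed layout)."""
    if wrap not in (0, 1) or not 0 <= cycle < CYCLES or not 0 <= lane < LANES:
        raise ValueError("field out of range")
    return (wrap << 7) | (cycle << 2) | lane


def unpack(tag):
    """Return (wrap, cycle, lane) from an 8-bit descriptor."""
    return (tag >> 7) & 0x1, (tag >> 2) & 0x1F, tag & 0x3


def next_cycle(wrap, cycle):
    """Advance the cycle counter; flip the wrap bit when it rolls over from 23 to 0."""
    cycle += 1
    if cycle == CYCLES:
        return wrap ^ 1, 0
    return wrap, cycle


def is_older(a, b):
    """True if descriptor a was dispatched before descriptor b.

    Only meaningful when both instructions are in flight at the same time,
    i.e. they were dispatched fewer than 24 cycles apart.
    """
    wa, ca, la = unpack(a)
    wb, cb, lb = unpack(b)
    if wa == wb:
        # Same pass through the counter: smaller cycle is older,
        # ties broken by lane (assumption: lane 0 dispatches first).
        return (ca, la) < (cb, lb)
    # Different wrap bits: the counter wrapped in between,
    # so the larger cycle number belongs to the older instruction.
    return ca > cb
```

For example, an instruction dispatched in cycle 23 before the wrap is older than one dispatched in cycle 0 after it, even though 23 > 0.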

More on the RoB • What is basically happening is that we have three RoBs • Each one size 24 • We cycle through each one so that none get ahead of the other. • Reduces read/write ports!

Mispredictions • It looks like they wait until retirement to resolve all exceptions. • Mispredictions are treated as exceptions! • They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF

More details • Each x86 instruction can launch both an ALU and an AGU operation • Because x86 has lots of memory operations this makes sense. • ALUs broadcast the result tag one cycle early • So an RS can launch a dependent instruction to the ALU before the result data arrives.

Lane 8

Class summary • Major topics • ILP in hardware (Out-of-order processors) • How they work AND why we use them • Caches and Virtual Memory • Multi-processor • ILP in software (Compiler, IA-64) • Power • Less major topics • Memory disambiguation • Branch prediction • Direction and target • Advanced OoO issues • Superscalar, instruction scheduling, multi-threading, etc.

The big questions • What is computer architecture? • What are the metrics of performance? • What are the techniques we use to maximize these metrics?

ILP in hardware (1/2) • ILP definitions • Hazards vs dependencies • Data, Name and Control dependencies • What ILP means and finding it. • Dynamic Scheduling • Tomasulo’s (three versions!) • You can be promised a question on this! • Branch Prediction • Local, global, hybrid/correlating • Tournament and gshare • BTBs
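As a reminder of how gshare from the list above works, here is a minimal sketch: the global branch history is XORed with PC bits to index a table of 2-bit saturating counters. The table size, the `pc >> 2` alignment shift, and the weakly-not-taken initial state are illustrative assumptions, not taken from any particular processor.

```python
class Gshare:
    """Hypothetical gshare direction predictor with 2-bit saturating counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global history register
        self.counters = [1] * (1 << index_bits)   # 1 = weakly not-taken

    def _index(self, pc):
        # XOR PC bits with global history (assumes 4-byte aligned instructions).
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Train the counter that made the prediction, then shift in the outcome.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A single branch that alternates taken/not-taken defeats a lone 2-bit counter, but here the two history patterns map to different counters, so the predictor learns the alternation after a warm-up period.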

ILP in hardware (2/2) • Multiple Issue • Static • Static Superscalar • VLIW • Dynamic superscalar • Speculation • Branch, data • ILP limit studies

ILP in hardware: Questions • True or False • The original T-algorithm only allows reordering within basic blocks • In P6, if it weren’t for precise interrupts, it would be okay to retire instructions out-of-order as long as they had finished executing and a branch isn’t skipped over. • ILP in hardware is limited in scope due to the “instruction window” which is basically the size of the RS.

Quick idea: SMT • One processor, two threads.

Caching (1/2) • There is a huge amount of stuff associated with caching. The important stuff: • Locality • Temporal/Spatial • 3’Cs model • Stack distance model • Nuts-and-bolts • Replacement policies (LRU, pseudo-LRU) • Performance (hit rate, Thit, Tmiss, average access time) • Write back/Write thru • Block size • Basic improvement • Multi-level cache • Critical word first • Write buffers
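The average access time for a multi-level cache can be computed by working from memory back toward L1. The sketch below assumes the common convention that each level's hit time is always paid and that a miss adds the access time of the next level (so Tmiss is treated as an extra penalty, with local miss rates per level). Other conventions exist; the function name and parameters are illustrative.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles.

    levels: list of (hit_time, local_miss_rate) tuples, ordered from L1 outward.
    memory_latency: time to service an access that misses in every level.
    """
    penalty = memory_latency
    # Fold from the last cache level back to L1:
    # T_level = T_hit + miss_rate * T_next_level
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty
```

For example, with a 1-cycle L1 that misses 10% of the time, a 10-cycle L2 that misses 50% of the time, and 100-cycle memory, the L2 contributes 10 + 0.5 × 100 = 60 cycles per L1 miss, giving 1 + 0.1 × 60 = 7 cycles on average.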

Caching (2/2) • Non-standard caches • Hash • Victim • Skew • Misc. • Virtual addresses and caching • Impact of prefetching • Latency hiding with OO execution

Cache: Questions (1/2) • Changing __________ has an impact on compulsory misses. • A victim cache is more likely to help with ________ than ________ though it can help both (3’Cs) • At least _____ bits are required to keep exact track of LRU in a 5-way associative cache.

Cache question (2/2) • A ____________ cache has a number of sets equal to the number of lines in the cache. • A fully-associative cache with N lines will miss an access that has a stack distance of ________ (state the largest range you can).

Multi-processor • Amdahl’s law as it applies to MP. • Bus-based multi-processor • Snooping • MESI • Bus transaction types (BRL etc.) • Distributed-shared • Directory schemes • Synchronization • Critical sections • Spin-locks
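Amdahl's law applied to multi-processors can be written as a one-line function. This is a sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of execution time that can be parallelized and n is the number of processors; the function name is illustrative.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup over one processor when only part of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0 or processors < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and processors >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)
```

With 90% of the work parallelizable, 10 processors give a speedup of only about 5.3, and no number of processors can exceed 1 / (1 - 0.9) = 10.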

Multi-processor: Question • Under the MESI protocol what is the advantage of having a distinct clean and dirty exclusive state?

Software techniques for ILP (1/2) • Pipeline scheduling • Reordering instructions in a basic block to remove pipe stalls • Loop unrolling • Static information passed to processor • Static branch prediction • Static dependence information • Loop issues • Detecting loop dependencies • Software pipelining

Software techniques for ILP (2/2) • Global code scheduling • Predicated instruction and CMOV • Memory reference speculation • Issues with preserving exception behavior • IA-64 as a case study of hardware support for software ILP techniques • Speculative loads • Advanced loads • Software pipelining optimizations

Software techniques for ILP: Questions • What is the most significant disadvantage of loop unrolling? • Using CMOV, rewrite the following code snippet, removing the branch. Don’t change exception behavior, and assume DIV only causes an exception if R3=0.
      BNE R1 R2 skip
      R1=R2/R3
skip: nop

Power • Understand why it’s important • Power vs. Energy • How it’s related to the existence of multi-core • Understand voltage scaling issues
