CPU Continued

CPU Continued

Pentium II • Fetch/Decode Unit • pulls instructions from the cache in order predicted by BTB. • 3 decoders to break instruction into uops (micro instructions). • 2 restricted decoders, 1 general decoder • If more than four uops, instruction is sent to MIS unit • can generate 6 uops/clock cycle

ReOrder Buffer (ROB) - instruction pool • circular buffer. • Handles register renaming (40) • Reservation Station • passes uops to available execution units • acts as buffer --> stores up to 20 uops and data

Dispatch/Execute • checks each uop in ROB to make sure it can be processed • searches L1 D-cache then L2 cache • Instead of being idle, execute unit looks at each uop in ROB until it finds one it can execute. - speculative execution.

After uop is processed, execution has to compare it to what the BTB predicted. If BTB has failed, then Jump Execution Unit, moves pointer in ROB appropriately. • BTB updates so that next time it can predict better.

At same time, the Retirement Unit is inspecting the ROB to see if the uop at the head of the ROB has been executed. If so, it checks 2nd and 3rd as well. Finally, it sends all 3 to store buffer. • While in store buffer, results checked before sent to L2 cache.’

Pipelining • decompose sequential process into subprocesses • performed by collection of processing segments, output of one is input to another. • Clock line is used to connect all registers and is pulsed after enough time has passed.

Example - A[I] x B[I] + C[I], I = 1, 2, …7 • suboperations: Segment 1: R1  A[ i ], R2  B[ i ] Input A[ i ] and B[ i ] Segment 2: R3  R1 x R2, R4  C[ i ] Multiply and input C[ i ] Segment 3: R5  R3 + R4 Add C[ i ] to product

After 3rd clock pulse, results start being output. • Potential problem: empty pipeline • P2 and P3 have 12 stage pipelines in the integer ALUs

Branch Prediction • pipelining works best on linear code • programs are not linear code only if (i == 0) CMP i, 0 :compare i to 0 k = 1; BNE else :branch to else if not equal else then: MOV k, 1 :move 1 to k k = 2; BR next :unconditional branch else: MOV k,2 :move 2 to k next: code fragment same code fragment in generic assembly language

fetch occurs before know if instruction is a branch and which branch to take. • Delay slot - instruction after a branch -- NOPs • stall pipeline v.s. prediction • secret registers • rollback • solution is complex

Superscalar Architecture • program steps - sequential in nature • superscalar provides 2 or more execution paths.

Pentium has 2 five-stage pipelines (u and v) • P2 and P3 - single pipeline with multiple Functional • Units

Out of Order Execution • difficult to divide workload equally • dependencies cause trouble - RAW, WAR • Results are held in buffers until all out of order instructions finish and results are put back together

Register Renaming • to avoid register access problems, secret registers are used • these are dynamically allocated and not visible to programmers • so any register reference has to be switched to the secret registers and then the chip has to remember which is used by which

RISC • John Cocke 80/20 rule • Optimize the 20% instructions and then redo the 80% as combinations of the 20%

RISC characteristics • single-cycle or better execution of instructions • Uniformity of instructions (same length) • Lack of Microcode - hardwired logic • software does work of optimizing • Design Simplicity

Micro-Ops and CISC/RISC computers • CISC executed on RISC • uops or AMDs R-ops

Single Instruction - Multiple Data • single microprocessor, multiple data bytes • 64-bit in MMX • Streaming SIMD (SSE) • augmented with 70 new instructions • Katmai New Instructions

Operating Modes • real mode • protected mode • most commonly used • flat memory model, demand paging, no addressing limits • virtual 8086 mode

Microprocessor Competition • Moore’s Law -- “Microprocessor power double’s every 18 months” • micron sized parts (expect miniaturization wall in 2017) • other methods • To keep getting better, either switch media or improve processing power through techniques such as parallel processing.

AMD • Advanced Micro Devices, 1969 • cross-licensing agreement with Intel • With 386, Intel separated from partnership • AMD built own 386 • Intel sued, AMD won

AMD chips • K5 (586) • RISC core • socket compatible with Pentium • six stage pipeline • matches P133

K6, K6-2 (with 3D Now!) • fits Intel’s Socket 7 • core built by NexGen • 2 six stage pibpelines • RISC • no 32-bit instructions, but no slowdown • P-rating --> rate speed: actual speed

K6-3 (Sharptooth) • 64KB L1, 256 KB L2, 1MB L3 -- Tri-Level Cache • seven state pipelines (2) • performance equivalent to P3 at one speed higher (450:400 in P-rating) • 21.3 x 106 transistors

K7 • nine instruction superscalar design • 3 parallel decoders • 3 integer pipelines, 3 multimedia • 128K L1, 64 bit back-side bus • 512K - 8 M L2 cache

Cyrix Corporation • MII • MMX pipeline -- 10 stages • Jalapeno -- 3D graphics, new FPU.

IDT/Centaur • WinChip C6 • Intel compatible • owned by Integrated Device Technology

Intel Itanium • 64 - bit RISC • dual mode - IA 32 and IA 64 • instructions processed in bundles • compilers break instructions apart and arrange scheduling (not CPU)

branching done using predication • predicate registers help with conditions for instructions

CPU Continued

CPU Continued

Presentation Transcript

Continued……

CPU

CPU Performance Pipelined CPU

CPU

TOBI, continued (continued)

CPU

CPU 排程 (CPU Scheduling)

Continued…

Continued..

CPU

CPU

CPU

CPU