Understanding CPU Performance Factors

Chapter 4 The Processor

Introduction (§4.1 Introduction)
• CPU performance factors
  • Instruction count: determined by the ISA and the compiler
  • CPI and cycle time: determined by the CPU hardware
• We will examine two MIPS implementations
  • A simplified version
  • A more realistic pipelined version
• A simple subset of instructions shows most aspects
  • Memory reference: lw, sw
  • Arithmetic/logical: add, sub, and, or, slt
  • Control transfer: beq, j
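The factors listed above combine in the classic performance equation, CPU time = instruction count × CPI × clock cycle time. The following is a minimal Python sketch of that equation; the function name `cpu_time` and the workload numbers are illustrative assumptions, not values from the slides.

```python
def cpu_time(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (in seconds)."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical workload: one million instructions, CPI of 1, 800 ps cycle.
seconds = cpu_time(1_000_000, 1.0, 800e-12)
print(f"CPU time: {seconds * 1e6:.1f} microseconds")
```

The compiler and ISA move the first argument; the hardware implementation moves the other two.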

Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
  • Use the ALU to calculate
    • Arithmetic result
    • Memory address for load/store
    • Branch target address
  • Access data memory for load/store
  • PC ← target address or PC + 4

CPU Overview

Multiplexers
• Can't just join wires together
• Use multiplexers
[Figure labels: program counter, instruction memory, register file, data memory]

Control

Logic Design Basics (§4.2 Logic Design Conventions)
• Information is encoded in binary
  • Low voltage = 0, high voltage = 1
  • One wire per bit
  • Multi-bit data is encoded on multi-wire buses
• Combinational elements
  • Operate on data
  • Output is a function of input
• State (sequential) elements
  • Store information

Combinational Elements
• AND gate: Y = A & B
• Adder: Y = A + B
• Arithmetic/Logic Unit: Y = F(A, B)
• Multiplexer: Y = S ? I1 : I0
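The four elements above are pure functions of their inputs, so they can be modeled directly as functions. This is a minimal sketch assuming 32-bit buses; the function names and the set of ALU operations chosen here are illustrative, not taken from the slides.

```python
MASK32 = 0xFFFFFFFF  # assumption: all buses are 32 bits wide


def and_gate(a: int, b: int) -> int:
    """Y = A & B"""
    return a & b


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapping around at 32 bits."""
    return (a + b) & MASK32


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0"""
    return i1 if s else i0


def alu(f: str, a: int, b: int) -> int:
    """Y = F(A, B) for a small, illustrative set of functions F."""
    if f == "add":
        return (a + b) & MASK32
    if f == "sub":
        return (a - b) & MASK32
    if f == "and":
        return a & b
    if f == "or":
        return a | b
    raise ValueError(f"unsupported ALU function: {f}")
```

None of these functions keeps any state between calls, which is what distinguishes them from the sequential elements.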

Sequential Elements
• Register: stores data in a circuit
  • Uses a clock signal to determine when to update the stored value
  • Edge-triggered: updates when Clk changes from 0 to 1 (rising edge)

Sequential Elements
• Register with write control
  • Only updates on a clock edge when the write control input is 1
  • Used when the stored value is required later
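The behavior of an edge-triggered register with write control can be sketched as a small class. The class name `Register` and the `tick` method are hypothetical; the model only captures the rule stated above (update on a 0 to 1 clock transition, and only when Write is 1).

```python
class Register:
    """Edge-triggered register with a write-enable input (illustrative model)."""

    def __init__(self, value: int = 0) -> None:
        self.q = value        # stored value, visible on output Q
        self._prev_clk = 0    # last clock level seen, used to detect edges

    def tick(self, clk: int, d: int, write: int = 1) -> int:
        """Present clock level, data input D and Write; return output Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d
        self._prev_clk = clk
        return self.q


reg = Register()
reg.tick(clk=1, d=7, write=0)   # rising edge, but Write = 0: value kept
reg.tick(clk=0, d=7, write=1)   # falling edge: value kept
reg.tick(clk=1, d=7, write=1)   # rising edge with Write = 1: value updated
```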

Clocking Methodology
• Combinational logic transforms data during clock cycles
  • Between clock edges
  • Input from state elements, output to a state element
  • The longest delay determines the clock period

Building a Datapath (§4.3 Building a Datapath)
• Datapath
  • Elements that process data and addresses in the CPU
  • Registers, ALUs, muxes, memories, …
• We will build a MIPS datapath incrementally
  • Refining the overview design

Instruction Fetch
[Figure annotations: the PC is a 32-bit register; it is incremented by 4 for the next instruction (PC + 4)]

R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
[Figure annotation: a 5-bit register number selects one of 2^5 = 32 registers]

Load/Store Instructions
• Read register operands
• Calculate the address using the 16-bit offset
  • Use the ALU, but sign-extend the offset
• Load: read memory and update the register
• Store: write the register value to memory
[Sign-extension examples: 0110 → 0000 0110; 1010 → 1111 1010]
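The sign-extension and address calculation described above can be sketched as follows. The helper names `sign_extend` and `effective_address` are hypothetical; the first two checks reproduce the 4-bit to 8-bit examples on the slide.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Replicate the sign bit of a from_bits-wide value out to to_bits."""
    sign_bit = 1 << (from_bits - 1)
    value &= (1 << from_bits) - 1
    signed = (value ^ sign_bit) - sign_bit      # interpret as two's complement
    return signed & ((1 << to_bits) - 1)        # re-encode at the wider width


def effective_address(base: int, offset16: int) -> int:
    """Load/store address = base register + sign-extended 16-bit offset."""
    return (base + sign_extend(offset16, 16, 32)) & MASK32


print(format(sign_extend(0b0110, 4, 8), "08b"))
print(format(sign_extend(0b1010, 4, 8), "08b"))
```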

Branch Instructions
• Read register operands
• Compare operands
  • Use the ALU: subtract and check the Zero output
• Calculate the target address
  • Sign-extend the displacement
  • Shift left 2 places (word displacement)
  • Add to PC + 4
    • Already calculated by instruction fetch
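The target-address steps above (sign-extend, shift left 2, add to PC + 4) and the Zero test can be written out as a short sketch. The function names are hypothetical, and the model assumes a 32-bit PC.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Sign-extend a 16-bit field to a 32-bit two's-complement pattern."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    pc_plus_4 = (pc + 4) & MASK32
    displacement = (sign_extend16(offset16) << 2) & MASK32
    return (pc_plus_4 + displacement) & MASK32


def next_pc_for_beq(pc: int, offset16: int, rs_val: int, rt_val: int) -> int:
    """The ALU subtracts; its Zero output selects target or PC + 4."""
    zero = ((rs_val - rt_val) & MASK32) == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & MASK32
```

For example, a displacement field of 25 words from a branch at address 0x1000 gives 0x1004 + 100 = 0x1068.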

Branch Instructions
[Figure annotations: the shift left 2 just re-routes wires; sign extension replicates the sign-bit wire]

Composing the Elements
• The first-cut datapath does an instruction in one clock cycle
  • Each datapath element can only do one function at a time
  • Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions

R-Type/Load/Store Datapath
Example instructions: add $t1, $t1, $s6; addi $s3, $s3, 1; lw $t0, 32($s3); sw $t0, 48($s3)

Full Datapath
Example instructions: add $t1, $t1, $s6; addi $s3, $s3, 1; lw $t0, 32($s3); sw $t0, 48($s3)

ALU Control (§4.4 A Simple Implementation Scheme)
• The ALU is used for
  • Load/Store: F = add
  • Branch: F = subtract
  • R-type: F depends on the funct field

ALU Control
• Assume a 2-bit ALUOp derived from the opcode
  • Combinational logic derives the ALU control
• The funct field matters only for R-type instructions
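The table that accompanied this slide did not survive in the transcript. The sketch below fills it in with the ALUOp and funct encodings commonly given for this simplified MIPS design (ALUOp 00 for lw/sw, 01 for beq, 10 for R-type); treat the specific bit patterns as an assumption to check against the original slide.

```python
# Assumed 4-bit ALU control codes for the simplified MIPS ALU.
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# funct field -> ALU control, used only when ALUOp says "R-type".
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Combinational mapping from (ALUOp, funct) to the ALU control input."""
    if alu_op == 0b00:      # load/store: always add
        return ALU_ADD
    if alu_op == 0b01:      # branch: always subtract
        return ALU_SUB
    if alu_op == 0b10:      # R-type: decided by the funct field
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unexpected ALUOp: {alu_op:02b}")
```

Splitting the decision in two levels keeps the main control unit small: it only has to produce the 2-bit ALUOp, and funct is ignored unless the instruction is R-type.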

The Main Control Unit
• Control signals are derived from the instruction

Format      31:26      25:21  20:16  15:11  10:6   5:0
R-type      0          rs     rt     rd     shamt  funct
Load/Store  35 or 43   rs     rt     address (15:0)
Branch      4          rs     rt     address (15:0)

• Bits 31:26: opcode
• rs (25:21): always read
• rt (20:16): read, except for load
• rd (15:11) for R-type, rt (20:16) for load: register to write
• address (15:0): sign-extend and add
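Given the bit ranges above, extracting the fields is a matter of shifting and masking. This is a small sketch; the function name `decode_fields` and the dictionary keys are illustrative choices.

```python
def decode_fields(instr: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode":  (instr >> 26) & 0x3F,   # bits 31:26
        "rs":      (instr >> 21) & 0x1F,   # bits 25:21
        "rt":      (instr >> 16) & 0x1F,   # bits 20:16
        "rd":      (instr >> 11) & 0x1F,   # bits 15:11 (R-type only)
        "shamt":   (instr >> 6) & 0x1F,    # bits 10:6  (R-type only)
        "funct":   instr & 0x3F,           # bits 5:0   (R-type only)
        "address": instr & 0xFFFF,         # bits 15:0  (load/store/branch)
    }


# add $t1, $t2, $t3 encodes as 0x014B4820 ($t1 = 9, $t2 = 10, $t3 = 11).
fields = decode_fields(0x014B4820)
print(fields["opcode"], fields["rs"], fields["rt"], fields["rd"], fields["funct"])
```

Because rs and rt sit at the same bit positions in every format, the register file can be read before the opcode has been fully decoded.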

Datapath With Control

R-Type Instruction

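R-Type Instruction: execution steps
• Instruction executed: add $t1, $t2, $t3
• Steps
  1. Fetch the instruction and increment the PC by 4.
  2. Read $t2 and $t3 from the register file; at the same time, the control unit decodes the instruction (computes the values of the required control signals).
  3. The ALU operates on $t2 and $t3 according to the function code (funct).
  4. The ALU result is written to the destination register ($t1).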

Load Instruction
Format: opcode 35 or 43 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
Example: lw $t0, 32($s3) has field values 35 | 19 | 8 | 32
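To make the field values concrete, the sketch below packs them into a 32-bit word using the bit positions from the format line. The helper name `encode_i_format` is hypothetical.

```python
def encode_i_format(opcode: int, rs: int, rt: int, immediate: int) -> int:
    """Pack opcode (31:26), rs (25:21), rt (20:16) and a 16-bit field (15:0)."""
    return (
        ((opcode & 0x3F) << 26)
        | ((rs & 0x1F) << 21)
        | ((rt & 0x1F) << 16)
        | (immediate & 0xFFFF)
    )


# lw $t0, 32($s3): opcode 35, rs = $s3 = 19, rt = $t0 = 8, offset 32.
word = encode_i_format(35, 19, 8, 32)
print(hex(word))
```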

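Load Instruction: execution steps
• Instruction executed: lw $t0, 32($s3)
• Steps
  1. Fetch the instruction and increment the PC by 4.
  2. Read $s3 from the register file.
  3. The ALU computes the sum of $s3 and the sign-extended offset 32.
  4. The ALU result is used as the address input of the data memory.
  5. The value returned by the data memory is written to the register file ($t0).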

Branch-on-Equal Instruction
Format: opcode 4 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
Example: beq $s1, $s2, 100 has field values 4 | 17 | 18 | 25

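Branch-on-Equal Instruction: execution steps
• Instruction executed: beq $s0, $s1, 100
• Steps
  1. Fetch the instruction and increment the PC by 4.
  2. Read $s0 and $s1 from the register file.
  3. The ALU subtracts the two register values; in parallel, PC + 4 is added to the sign-extended offset shifted left by 2, which gives the branch target address.
  4. The Zero output of the ALU decides which adder result is written back to the PC.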

Implementing Jumps
Format: opcode 2 (31:26) | address (25:0)
• Jump uses a word address
• Update the PC with the concatenation of
  • The top 4 bits of the old PC
  • The 26-bit jump address
  • 00 (binary)
• Needs an extra control signal (Jump) decoded from the opcode
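The concatenation above can be sketched with shifts and masks. The function name `jump_target` is hypothetical, and the sketch assumes the upper 4 bits are taken from PC + 4, as in the usual single-cycle datapath where the incremented PC is already available.

```python
def jump_target(pc: int, instr: int) -> int:
    """New PC = top 4 bits of (PC + 4) : 26-bit jump address : 00."""
    address26 = instr & 0x03FFFFFF             # bits 25:0 of the instruction
    upper4 = ((pc + 4) & 0xFFFFFFFF) & 0xF0000000
    return upper4 | (address26 << 2)           # the shift appends the two 0 bits


# A jump (opcode 2) with address field 0x100, fetched from 0x40000000.
j_instr = (2 << 26) | 0x100
print(hex(jump_target(0x40000000, j_instr)))
```

Since only 28 bits come from the instruction, a jump can reach anywhere within the current 256 MB region of the address space.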

Datapath With Jumps Added

Performance Issues
• The longest delay determines the clock period
  • Critical path: load instruction
  • Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary the period for different instructions
• Violates a design principle
  • Making the common case fast
• We will improve performance by pipelining

Pipelining Analogy (§4.5 An Overview of Pipelining)
• Pipelined laundry: overlapping execution
  • Parallelism improves performance
• Four loads: speedup = 8/3.5 ≈ 2.3
• Non-stop: speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
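The two speedup figures above follow from one formula if the laundry is assumed to have four stages of 0.5 hours each, which is what the terms 2n and 0.5n + 1.5 imply. A short sketch with a hypothetical function name:

```python
def laundry_speedup(n_loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Sequential time divided by pipelined time for n loads of laundry."""
    sequential = n_loads * stages * stage_hours          # 2n hours
    pipelined = (stages + n_loads - 1) * stage_hours     # 0.5n + 1.5 hours
    return sequential / pipelined


for n in (4, 100, 10_000):
    print(n, round(laundry_speedup(n), 3))
```

As n grows, the fixed 1.5 hours needed to fill the pipeline matters less and the speedup approaches the number of stages.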

MIPS Pipeline
• Definition of pipelining
  • An instruction is divided into several steps that are executed at the same time by different functional units, which improves the overall performance of the program
• Five stages, one step per stage
  1. IF: instruction fetch from memory
  2. ID: instruction decode and register read
  3. EX: execute operation or calculate address
  4. MEM: access memory operand
  5. WB: write result back to register

Pipeline Performance
• Assume the time for stages is
  • 100 ps for register read or write
  • 200 ps for other stages
• Compare the pipelined datapath with the single-cycle datapath

Pipeline Performance
[Figure panels: single-cycle (Tc = 800 ps); pipelined (Tc = 200 ps)]
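The two clock periods in the figure can be derived from the stage times: the single-cycle clock must cover every stage of the slowest instruction (lw), while the pipelined clock only has to cover the slowest stage. The sketch below assumes register read happens in ID and register write in WB; the names are illustrative.

```python
# Stage times in picoseconds for lw, which uses all five stages.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

single_cycle_tc = sum(STAGE_PS.values())   # whole instruction in one cycle
pipelined_tc = max(STAGE_PS.values())      # one stage per cycle


def total_time_single_cycle(n_instructions: int) -> int:
    """Each instruction takes one long cycle."""
    return n_instructions * single_cycle_tc


def total_time_pipelined(n_instructions: int) -> int:
    """Fill the pipeline once, then finish one instruction per cycle."""
    return (len(STAGE_PS) + n_instructions - 1) * pipelined_tc


print(single_cycle_tc, pipelined_tc)
print(total_time_single_cycle(3), total_time_pipelined(3))
```

For long instruction streams the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.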

Pipeline Speedup
• If all stages are balanced (i.e., all take the same time):
  • Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, the speedup is less
• The speedup is due to increased throughput
  • Latency (time for each instruction) does not decrease

Pipelining and ISA Design
• The MIPS ISA is designed for pipelining
  • All instructions are 32 bits
    • Easier to fetch and decode in one cycle
    • c.f. x86: 1- to 17-byte instructions
  • Few and regular instruction formats
    • Can decode and read registers in one step
  • Load/store addressing
    • Can calculate the address in the 3rd stage and access memory in the 4th stage
  • Alignment of memory operands
    • Memory access takes only one cycle

Hazards
• Situations that prevent starting the next instruction in the next cycle
• Hazard types
  • Structure hazards
    • A required resource is busy
  • Data hazards
    • Need to wait for a previous instruction to complete its data read/write
  • Control hazards
    • Deciding on a control action depends on a previous instruction

Structure Hazards
• Conflict for use of a resource
• In a MIPS pipeline with a single memory
  • Load/store requires data access
  • Instruction fetch would have to stall for that cycle
    • Would cause a pipeline "bubble"
• Hence, pipelined datapaths require separate instruction/data memories
  • Or separate instruction/data caches

Data Hazards
• An instruction depends on completion of data access by a previous instruction
  • add $s0, $t0, $t1
  • sub $t2, $s0, $t3

Forwarding (aka Bypassing)
• Use the result when it is computed
  • Don't wait for it to be stored in a register
  • Requires extra connections in the datapath

Load-Use Data Hazard
• Can't always avoid stalls by forwarding
  • If the value is not computed when needed
  • Can't forward backward in time!

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of a load result in the next instruction
• C code: A = B + E; C = B + F;

Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
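The cycle counts for the two orderings can be checked with a small model. It assumes a five-stage pipeline with full forwarding, so the only stall is one cycle when an instruction uses the result of the load immediately before it. The tuple layout (operation, destination, sources) and the function name are illustrative choices.

```python
def count_cycles(program, stages: int = 5) -> int:
    """Cycles = instructions + pipeline fill + one stall per load-use pair."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1          # load result is not ready for the next instruction
    return len(program) + (stages - 1) + stalls


# Each entry: (operation, destination register, source registers).
original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Move the third load up so that no add directly follows the load it needs.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(count_cycles(original), count_cycles(reordered))
```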

Control Hazards
• A branch determines the flow of control
  • Fetching the next instruction depends on the branch outcome
  • The pipeline can't always fetch the correct instruction
    • Still working on the ID stage of the branch
• In the MIPS pipeline
  • Need to compare registers and compute the target early in the pipeline
  • Add hardware to do it in the ID stage

Stall on Branch
• Wait until the branch outcome is determined before fetching the next instruction

Branch Prediction
• Longer pipelines can't readily determine the branch outcome early
  • The stall penalty becomes unacceptable
• Predict the outcome of the branch
  • Only stall if the prediction is wrong
• In the MIPS pipeline
  • Can predict branches not taken
  • Fetch the instruction after the branch, with no delay

MIPS with Predict Not Taken
[Figure panels: prediction correct; prediction incorrect]

More-Realistic Branch Prediction
• Static branch prediction
  • Based on typical branch behavior
  • Example: loop and if-statement branches
    • Predict backward branches taken
    • Predict forward branches not taken
• Dynamic branch prediction
  • Hardware measures actual branch behavior
    • e.g., record the recent history of each branch
  • Assume future behavior will continue the trend
    • When wrong, stall while re-fetching, and update the history
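The slide does not say how the history is recorded. One common scheme, used here purely as an illustrative assumption, is a 2-bit saturating counter per branch address: the prediction changes only after two wrong guesses in a row. The class and method names are hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self) -> None:
        self.counters = {}   # branch address -> counter value in 0..3

    def predict(self, pc: int) -> bool:
        # Unseen branches start at 1 (weakly not taken), an arbitrary choice.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc: int, taken: bool) -> None:
        counter = self.counters.get(pc, 1)
        self.counters[pc] = min(3, counter + 1) if taken else max(0, counter - 1)


# A loop branch that is taken 9 times and then falls through, run twice.
outcomes = ([True] * 9 + [False]) * 2
predictor = TwoBitPredictor()
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x400) != taken:
        mispredictions += 1      # hardware would stall and re-fetch here
    predictor.update(0x400, taken)

print(mispredictions, "mispredictions out of", len(outcomes))
```

After the first pass the counter stays in a "taken" state across the loop exit, so the second pass mispredicts only on its final iteration.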
