Understanding CPU Performance Factors

This chapter examines the factors behind CPU performance: instruction count, which is determined by the ISA and the compiler, and CPI and cycle time, which are determined by the CPU hardware. It covers two MIPS implementations, a simplified single-cycle version and a more realistic pipelined version, built around a small instruction subset of memory reference, arithmetic/logical, and control transfer instructions. The chapter also covers instruction execution steps, multiplexers, logic design basics, and clocking methodology, and then builds a datapath step by step for R-format, load/store, and branch instructions before composing the elements into a complete CPU.

jewelld

Presentation Transcript


  1. Chapter 4 The Processor

  2. Introduction (§4.1 Introduction) • CPU performance factors • Instruction count: determined by ISA and compiler • CPI and cycle time: determined by CPU hardware • Two MIPS implementations: a simplified version; a more realistic pipelined version • Simple subset, shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq, j Chapter 4 — The Processor — 2

  3. Instruction Execution • PC → instruction memory, fetch instruction • Register numbers → register file, read registers • Depending on instruction class: use ALU to calculate arithmetic result, memory address for load/store, or branch target address; access data memory for load/store; PC ← target address or PC + 4

  4. CPU Overview

  5. Multiplexers • Can't just join wires together • Use multiplexers (Figure labels: program counter, instruction memory, register file, data memory)

  6. Control

  7. Logic Design Basics (§4.2 Logic Design Conventions) • Information encoded in binary: low voltage = 0, high voltage = 1; one wire per bit; multi-bit data encoded on multi-wire buses • Combinational element: operates on data; output is a function of input • State (sequential) elements: store information

  8. Combinational Elements • AND gate: Y = A & B • Adder: Y = A + B • Arithmetic/Logic Unit: Y = F(A, B) • Multiplexer: Y = S ? I1 : I0
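
The four combinational elements on this slide can be sketched as pure functions, since each output depends only on the current inputs. This is a minimal sketch, not the slide's circuit: the 32-bit width, the function names, and the string operation codes passed to `alu` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: model every bus as 32 bits wide


def and_gate(a: int, b: int) -> int:
    """Y = A & B"""
    return a & b & MASK32


def adder(a: int, b: int) -> int:
    """Y = A + B, wrapping around at 32 bits like a hardware adder."""
    return (a + b) & MASK32


def mux(s: int, i0: int, i1: int) -> int:
    """Y = S ? I1 : I0"""
    return i1 if s else i0


def alu(f: str, a: int, b: int) -> int:
    """Y = F(A, B); the operation names here are hypothetical labels."""
    ops = {"and": a & b, "or": a | b, "add": a + b, "sub": a - b}
    return ops[f] & MASK32
```

Because the functions hold no state, calling them twice with the same inputs always gives the same output, which is the defining property of a combinational element.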

  9. Sequential Elements • Register: stores data in a circuit • Uses a clock signal to determine when to update the stored value • Edge-triggered: update when Clk changes from 0 to 1

  10. Sequential Elements • Register with write control • Only updates on clock edge when write control input is 1 • Used when stored value is required later
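
The edge-triggered register with write control described on slides 9 and 10 can be modeled in a few lines. This is a sketch under assumptions: the class name `Register` and the `tick` method are hypothetical, and the clock is sampled as a sequence of 0/1 values rather than simulated continuously.

```python
class Register:
    """Edge-triggered register: Q changes only on a 0 -> 1 clock transition
    and only when the write control input is 1."""

    def __init__(self, value: int = 0) -> None:
        self.q = value
        self._prev_clk = 0  # last sampled clock level

    def tick(self, clk: int, d: int, write: int = 1) -> int:
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d  # capture D on the rising edge
        self._prev_clk = clk
        return self.q
```

Leaving `write` at its default of 1 gives the plain register of slide 9; passing `write=0` holds the stored value across clock edges, which is what a value needed several cycles later requires.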

  11. Clocking Methodology • Combinational logic transforms data during clock cycles • Between clock edges • Input from state elements, output to state element • Longest delay determines clock period

  12. Building a Datapath (§4.3 Building a Datapath) • Datapath: elements that process data and addresses in the CPU • Registers, ALUs, muxes, memories, … • Build a MIPS datapath incrementally, refining the overview design

  13. Instruction Fetch • PC: 32-bit register • Increment by 4 for next instruction

  14. R-Format Instructions • Read two register operands • Perform arithmetic/logical operation • Write register result • The 5-bit register-number inputs select among 2^5 = 32 registers

  15. Load/Store Instructions • Read register operands • Calculate address using 16-bit offset: use ALU, but sign-extend offset • Load: read memory and update register • Store: write register value to memory • Sign-extension examples: 0110 → 0000 0110; 1010 → 1111 1010
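
Sign extension and the load/store address calculation can be sketched as follows. The function names and the 32-bit wrap-around are assumptions for illustration; the 4-bit to 8-bit cases reproduce the two examples on the slide.

```python
def sign_extend(value: int, from_bits: int, to_bits: int = 32) -> int:
    """Replicate the sign bit of a from_bits-wide value up to to_bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):  # sign bit is set
        # Fill every bit above the original width with 1s.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend(offset16, 16)) & 0xFFFFFFFF
```

For example, `sign_extend(0b0110, 4, 8)` gives `0b00000110` and `sign_extend(0b1010, 4, 8)` gives `0b11111010`, matching the slide. A negative offset such as 0xFFFC (that is, -4) moves the address below the base register value.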

  16. Branch Instructions • Read register operands • Compare operands: use ALU, subtract and check Zero output • Calculate target address: sign-extend displacement; shift left 2 places (word displacement); add to PC + 4, which is already calculated by instruction fetch

  17. Branch Instructions • Shift left 2: just re-routes wires • Sign-extend: sign-bit wire replicated
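
The branch steps on slides 16 and 17 (subtract, check Zero, sign-extend, shift left 2, add to PC + 4) can be sketched like this. The function names are hypothetical, and Python integers stand in for 32-bit wires.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc: int, offset16: int) -> int:
    """Target = (PC + 4) + (sign-extended 16-bit word offset << 2)."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:       # negative displacement
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & MASK32


def next_pc(pc: int, offset16: int, rs_value: int, rt_value: int) -> int:
    """beq: the ALU subtracts; its Zero output picks target or PC + 4."""
    zero = ((rs_value - rt_value) & MASK32) == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & MASK32
```

With a branch at address 0x1000 and a word offset of 25, the taken target is 0x1004 + 100 = 0x1068, and the fall-through address is 0x1004.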

  18. Composing the Elements • First-cut datapath does an instruction in one clock cycle • Each datapath element can only do one function at a time • Hence, we need separate instruction and data memories • Use multiplexers where alternate data sources are used for different instructions

  19. R-Type/Load/Store Datapath • Example instructions: add $t1, $t1, $s6; addi $s3, $s3, 1; lw $t0, 32($s3); sw $t0, 48($s3)

  20. Full Datapath • Example instructions: add $t1, $t1, $s6; addi $s3, $s3, 1; lw $t0, 32($s3); sw $t0, 48($s3)

  21. ALU Control (§4.4 A Simple Implementation Scheme) • ALU used for • Load/Store: F = add • Branch: F = subtract • R-type: F depends on funct field

  22. ALU Control • Assume 2-bit ALUOp derived from opcode • Combinational logic derives ALU control • The funct field matters only for R-type instructions
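
The two-level decoding described on slides 21 and 22 can be sketched as a small lookup. The slide's own table did not survive in this transcript, so the specific bit patterns below (ALUOp values, funct codes, and 4-bit ALU control codes) are assumptions that follow the convention commonly used in textbook treatments of the MIPS subset; treat them as illustrative rather than as the slide's content.

```python
# Assumed 4-bit ALU control codes (textbook convention, not from the slide).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# Assumed MIPS funct-field values for the R-type subset.
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the ALU function from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:   # load/store: F = add, funct ignored
        return ALU_ADD
    if alu_op == 0b01:   # branch: F = subtract, funct ignored
        return ALU_SUB
    if alu_op == 0b10:   # R-type: F depends on funct
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError(f"unsupported ALUOp: {alu_op:#04b}")
```

The funct field is consulted only in the R-type case, which is the point the slide makes: for loads, stores, and branches the opcode alone fixes the ALU operation.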

  23. The Main Control Unit • Control signals derived from instruction
      Format       31:26      25:21  20:16  15:11  10:6   5:0
      R-type       0          rs     rt     rd     shamt  funct
      Load/Store   35 or 43   rs     rt     address (15:0)
      Branch       4          rs     rt     address (15:0)
      • 31:26: opcode • rs (25:21): always read • rt (20:16): read, except for load • rd (R-type) or rt (load): written for R-type and load • address (15:0): sign-extend and add

  24. Datapath With Control

  25. R-Type Instruction

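  26. R-Type Instruction • Example: add $t1, $t2, $t3 • Execution steps • 1. Fetch the instruction and increment the PC by 4 • 2. Read $t2 and $t3 from the register file; meanwhile the control unit decodes the instruction (computes the required control signal values) • 3. The ALU operates on $t2 and $t3 according to the funct field • 4. The ALU result is written to the destination register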

  27. Load Instruction • Format: opcode (31:26) = 35 or 43, rs (25:21), rt (20:16), address (15:0) • lw $t0, 32($s3) has the field values 35, 19, 8, 32
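
The field values on this slide can be packed into a 32-bit machine word using the bit ranges given in the format. This is a minimal sketch; `encode_i` and `decode_i` are hypothetical helper names.

```python
def encode_i(opcode: int, rs: int, rt: int, address: int) -> int:
    """Pack opcode (31:26), rs (25:21), rt (20:16), address (15:0)."""
    return (
        ((opcode & 0x3F) << 26)
        | ((rs & 0x1F) << 21)
        | ((rt & 0x1F) << 16)
        | (address & 0xFFFF)
    )


def decode_i(word: int) -> tuple[int, int, int, int]:
    """Split a 32-bit word back into (opcode, rs, rt, address)."""
    return (
        (word >> 26) & 0x3F,
        (word >> 21) & 0x1F,
        (word >> 16) & 0x1F,
        word & 0xFFFF,
    )
```

For lw $t0, 32($s3), `encode_i(35, 19, 8, 32)` yields 0x8E680020. The same helper covers the branch format, since beq uses the same field layout with opcode 4.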

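  28. Load Instruction • Example: lw $t0, 32($s3) • Execution steps • 1. Fetch the instruction and increment the PC by 4 • 2. Read $s3 from the register file • 3. The ALU computes the sum of $s3 and the sign-extended offset 32 • 4. The ALU result is used as the address input of the data memory • 5. The value returned by the data memory is written to the register file ($t0)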

  29. Branch-on-Equal Instruction • Format: opcode (31:26) = 4, rs (25:21), rt (20:16), address (15:0) • beq $s1, $s2, 100 has the field values 4, 17, 18, 25

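  30. Branch-on-Equal Instruction • Example: beq $s0, $s1, 100 • Execution steps • 1. Fetch the instruction and increment the PC by 4 • 2. Read $s0 and $s1 from the register file • 3. The ALU subtracts the two operands; in parallel, an adder sums PC + 4 and the sign-extended offset shifted left by 2, giving the branch target address • 4. The ALU's Zero output selects which adder result is written back to the PC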

  31. Implementing Jumps • Format: opcode (31:26) = 2, address (25:0) • Jump uses word address • Update PC with concatenation of: top 4 bits of old PC; 26-bit jump address; 00 (binary) • Need an extra control signal decoded from opcode
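
The concatenation that forms the jump target can be sketched with shifts and masks. One assumption is labeled here: the "old PC" is taken to be PC + 4, as in common descriptions of MIPS; the slide itself only says "old PC". The function name is hypothetical.

```python
def jump_target(pc: int, address26: int) -> int:
    """Concatenate: top 4 bits of PC + 4 | 26-bit word address | 00."""
    upper = (pc + 4) & 0xF0000000           # assumption: PC + 4 supplies the top 4 bits
    lower = (address26 & 0x03FFFFFF) << 2   # word address -> byte address (appends 00)
    return upper | lower
```

Because only 26 + 2 = 28 bits come from the instruction, a jump can reach targets only within the same 256 MB region as the instruction that follows it.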

  32. Datapath With Jumps Added

  33. Performance Issues • Longest delay determines clock period • Critical path: load instruction • Instruction memory → register file → ALU → data memory → register file • Not feasible to vary period for different instructions • Violates design principle: making the common case fast • We will improve performance by pipelining

  34. Pipelining Analogy (§4.5 An Overview of Pipelining) • Pipelined laundry: overlapping execution • Parallelism improves performance • Four loads: speedup = 8/3.5 = 2.3 • Non-stop: speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
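
The laundry speedup figures can be checked numerically. The sketch below assumes what the formula implies: four stages of 0.5 hours each, so one load takes 2 hours sequentially, and n pipelined loads take 0.5n + 1.5 hours. The function names are hypothetical.

```python
def sequential_hours(n: int) -> float:
    """n loads done one after another, 2 hours each."""
    return 2.0 * n


def pipelined_hours(n: int) -> float:
    """First load takes 2 hours; each later load finishes 0.5 hours after the previous one."""
    return 0.5 * n + 1.5


def speedup(n: int) -> float:
    return sequential_hours(n) / pipelined_hours(n)
```

For four loads this gives 8 / 3.5, about 2.3. As n grows, the 1.5-hour fill time becomes negligible and the ratio approaches 4, the number of stages.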

  35. MIPS Pipeline • Definition of pipelining: divide each instruction into several steps that different functional units carry out at the same time, improving overall program performance • Five stages, one step per stage • IF: instruction fetch from memory • ID: instruction decode & register read • EX: execute operation or calculate address • MEM: access memory operand • WB: write result back to register

  36. Pipeline Performance • Assume time for stages is 100ps for register read or write, 200ps for other stages • Compare pipelined datapath with single-cycle datapath

  37. Pipeline Performance • Single-cycle (Tc = 800ps) • Pipelined (Tc = 200ps)

  38. Pipeline Speedup • If all stages are balanced (i.e., all take the same time): time between instructions (pipelined) = time between instructions (nonpipelined) / number of stages • If not balanced, speedup is less • Speedup due to increased throughput • Latency (time for each instruction) does not decrease
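
The clock periods on slide 37 and the balanced-stage formula on slide 38 can be reproduced from the stage times on slide 36. The assignment of 100ps to ID (register read) and WB (register write) and 200ps to IF, EX, and MEM is an assumption consistent with those slides; the function names are hypothetical.

```python
# Assumed per-stage latencies in picoseconds for a load instruction.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(stages: dict[str, int]) -> int:
    """One long cycle must cover every stage in sequence."""
    return sum(stages.values())


def pipelined_period(stages: dict[str, int]) -> int:
    """The slowest stage sets the clock period."""
    return max(stages.values())


def total_time_single(n: int, stages: dict[str, int]) -> int:
    return n * single_cycle_period(stages)


def total_time_pipelined(n: int, stages: dict[str, int]) -> int:
    """Fill the pipeline once, then finish one instruction per cycle."""
    return (n + len(stages) - 1) * pipelined_period(stages)
```

With these numbers the periods are 800ps and 200ps. Perfectly balanced stages would give 800 / 5 = 160ps between instructions, so the unbalanced pipeline's steady-state speedup is 800 / 200 = 4 rather than 5. Each instruction still takes 5 × 200 = 1000ps from fetch to write-back, so latency does not improve.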

  39. Pipelining and ISA Design • MIPS ISA designed for pipelining • All instructions are 32 bits: easier to fetch and decode in one cycle (c.f. x86: 1- to 17-byte instructions) • Few and regular instruction formats: can decode and read registers in one step • Load/store addressing: can calculate address in 3rd stage, access memory in 4th stage • Alignment of memory operands: memory access takes only one cycle

  40. Hazards • Situations that prevent starting the next instruction in the next cycle • Hazard types • Structure hazards: a required resource is busy • Data hazard: need to wait for previous instruction to complete its data read/write • Control hazard: deciding on control action depends on previous instruction

  41. Structure Hazards • Conflict for use of a resource • In MIPS pipeline with a single memory: load/store requires data access; instruction fetch would have to stall for that cycle; would cause a pipeline "bubble" • Hence, pipelined datapaths require separate instruction/data memories, or separate instruction/data caches

  42. Data Hazards • An instruction depends on completion of data access by a previous instruction • add $s0, $t0, $t1 • sub $t2, $s0, $t3

  43. Forwarding (aka Bypassing) • Use result when it is computed • Don't wait for it to be stored in a register • Requires extra connections in the datapath

  44. Load-Use Data Hazard • Can't always avoid stalls by forwarding • If value not computed when needed • Can't forward backward in time!
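
The effect of forwarding on slides 42 to 44 can be summarized as a stall count between a producing and a consuming instruction. This is a sketch under stated assumptions, not the slides' own model: a five-stage pipeline, a register file that writes in the first half of a cycle and reads in the second half, an ALU result available at the end of EX, and a loaded value available at the end of MEM. The function name is hypothetical.

```python
def stalls_needed(distance: int, producer_is_load: bool = False,
                  forwarding: bool = True) -> int:
    """Bubbles needed when the consumer issues `distance` instructions
    after the producer (distance 1 = the very next instruction)."""
    if distance < 1:
        raise ValueError("consumer must come after producer")
    if not forwarding:
        # Consumer's ID stage must not precede the producer's WB stage.
        return max(0, 3 - distance)
    if producer_is_load:
        # Loaded value exists only after MEM; consumer's EX must follow it.
        return max(0, 2 - distance)
    # ALU result is forwarded from the end of EX straight into the next EX.
    return 0
```

For the add/sub pair on slide 42, this gives 2 bubbles without forwarding and 0 with it. A load followed immediately by a use still costs 1 bubble, which is the case where forwarding cannot reach backward in time.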

  45. Code Scheduling to Avoid Stalls • Reorder code to avoid use of load result in the next instruction • C code for A = B + E; C = B + F;
      Original order (13 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
          (stall)
          add $t3, $t1, $t2
          sw  $t3, 12($t0)
          lw  $t4, 8($t0)
          (stall)
          add $t5, $t1, $t4
          sw  $t5, 16($t0)
      Reordered (11 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
          lw  $t4, 8($t0)
          add $t3, $t1, $t2
          sw  $t3, 12($t0)
          add $t5, $t1, $t4
          sw  $t5, 16($t0)

  46. Control Hazards • Branch determines flow of control • Fetching next instruction depends on branch outcome • Pipeline can't always fetch correct instruction: still working on ID stage of branch • In MIPS pipeline: need to compare registers and compute target early in the pipeline; add hardware to do it in ID stage

  47. Stall on Branch • Wait until branch outcome determined before fetching next instruction

  48. Branch Prediction • Longer pipelines can't readily determine branch outcome early: stall penalty becomes unacceptable • Predict outcome of branch: only stall if prediction is wrong • In MIPS pipeline: can predict branches not taken; fetch instruction after branch, with no delay

  49. MIPS with Predict Not Taken • Prediction correct • Prediction incorrect

  50. More-Realistic Branch Prediction • Static branch prediction: based on typical branch behavior • Example: loop and if-statement branches; predict backward branches taken; predict forward branches not taken • Dynamic branch prediction: hardware measures actual branch behavior, e.g., record recent history of each branch • Assume future behavior will continue the trend • When wrong, stall while re-fetching, and update history
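
Both prediction styles on this slide can be sketched briefly. The static rule is the one stated (backward taken, forward not taken). For the dynamic case, the slide only says "record recent history of each branch"; the one-bit, last-outcome table below is one simple way to do that and is an assumption, as are all names and the initial not-taken guess.

```python
def static_predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Static rule: backward branches (loops) taken, forward branches not taken."""
    return target_pc < branch_pc


class OneBitPredictor:
    """Dynamic sketch: remember each branch's most recent outcome."""

    def __init__(self) -> None:
        self.last_outcome: dict[int, bool] = {}

    def predict(self, pc: int) -> bool:
        # Assumption: a branch never seen before is predicted not taken.
        return self.last_outcome.get(pc, False)

    def update(self, pc: int, taken: bool) -> None:
        self.last_outcome[pc] = taken


def count_mispredictions(predictor: OneBitPredictor, pc: int,
                         outcomes: list[bool]) -> int:
    """Run one branch through the predictor and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1                 # hardware would stall and re-fetch here
        predictor.update(pc, taken)     # update history either way
    return misses
```

For a loop branch that is taken three times and then falls through, executed twice in a row, this one-bit scheme mispredicts 4 of the 8 branches: once on entering the loop and once on leaving it, each time around.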
