RISC Processors Chapter 14 S. Dandamudi
Outline
• Introduction
• Evolution of CISC processors
• RISC design principles
• PowerPC processor
  • Architecture
  • Addressing modes
  • Instruction set
• Itanium processor
  • Architecture
  • Addressing modes
  • Instruction set
  • Instruction-level parallelism
  • Branch handling
  • Speculative execution

Introduction
• CISC
  • Complex instruction set
  • Pentium is the most popular example
• RISC
  • Simple instructions
  • Reduced complexity
  • Modern processors use this design philosophy
    • PowerPC, MIPS, SPARC, Intel Itanium
    • Borrow some features from CISC
  • No precise definition
    • We can identify some common characteristics

Evolution of CISC Designs
• Motivation: use expensive resources efficiently
  • Processor
  • Memory
• High-density code
  • Complex instructions
  • Hardware complexity is handled by microprogramming
• Microprogramming also helps to
  • Reduce the impact of memory access latency
  • Offer flexibility
    • Low-cost members of the same family
  • Tailor instructions to high-level language constructs

Evolution of CISC Designs (cont’d)
Example
• Autoincrement addressing mode of the VAX
  • A single instruction performs the following actions:
      (R2) = (R2) + R3; R2 = R2 + 1
• RISC equivalent
      R4 = (R2)
      R4 = R4 + R3
      (R2) = R4
      R2 = R2 + 1

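The equivalence between the single VAX instruction and the four-instruction RISC sequence can be checked with a small simulation. This is only an illustrative sketch: the dictionary-based memory and register file, and the function names, are assumptions made for the example and do not model real VAX or RISC hardware.

```python
def cisc_autoincrement_add(mem, regs):
    """One complex instruction: (R2) = (R2) + R3; R2 = R2 + 1."""
    mem[regs["R2"]] = mem[regs["R2"]] + regs["R3"]
    regs["R2"] += 1


def risc_sequence(mem, regs):
    """The same work as four simple instructions using a scratch register R4."""
    regs["R4"] = mem[regs["R2"]]          # load:  R4 = (R2)
    regs["R4"] = regs["R4"] + regs["R3"]  # add:   R4 = R4 + R3
    mem[regs["R2"]] = regs["R4"]          # store: (R2) = R4
    regs["R2"] = regs["R2"] + 1           # add:   R2 = R2 + 1


def run(step):
    """Run one variant on a fresh machine state; return (memory word, R2)."""
    mem = {100: 7}
    regs = {"R2": 100, "R3": 5, "R4": 0}
    step(mem, regs)
    return mem[100], regs["R2"]
```

Both variants leave the same memory word and the same R2; the RISC version only needs the extra register R4.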
Why RISC?
• Simple instructions are preferred
  • Complex instructions are mostly ignored by compilers
    • Due to the semantic gap
• Simple data structures
  • Complex data structures are used relatively infrequently
  • Better to support a few simple data types efficiently
    • Synthesize complex ones
• Simple addressing modes
  • Complex addressing modes lead to variable-length instructions
    • These lead to inefficient instruction decoding and scheduling

Why RISC? (cont’d)
• Large register set
  • Efficient support for procedure calls and returns
    • Patterson and Sequin’s study
      • Procedure call/return: 12-15% of HLL statements
      • Constitute 31-33% of machine language instructions
      • Generate nearly half (45%) of memory references
  • Small activation record
    • Tanenbaum’s study
      • Only 1.25% of the calls have more than 6 arguments
      • More than 93% have fewer than 6 local scalar variables
  • A large register set can avoid memory references

RISC Design Principles
• Simple operations
  • Simple instructions that can execute in one cycle
• Register-to-register operations
  • Only load and store operations access memory
  • All other operations work on a register-to-register basis
• Simple addressing modes
  • A few addressing modes (1 or 2)
• Large number of registers
  • Needed to support register-to-register operations
  • Minimize the procedure call and return overhead

RISC Design Principles (cont’d)
[Figure: Register windows storing activation records]

RISC Design Principles (cont’d)
• Fixed-length instructions
  • Facilitate efficient instruction execution
• Simple instruction format
  • Fixed boundaries for the various fields
    • opcode, source operands, …
• Other features
  • Tend to use Harvard architecture
  • Pipelining is visible at the architecture level

PowerPC
• Registers
  • 32 general-purpose registers (GPR0 – GPR31)
  • 32 floating-point registers (FPR0 – FPR31)
  • Condition register (CR)
    • Similar to Pentium’s flags register
    • Divided into 8 CR fields (4 bits each)
      • “less than” (LT), “greater than” (GT), “equal to” (EQ), summary overflow (SO)
    • CR1 is for floating-point exceptions
    • Other CR fields can be used for integer or FP exceptions
    • Branch instructions can test a specific CR field bit

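The division of the 32-bit CR into eight 4-bit fields can be sketched as below. The sketch assumes PowerPC's convention of numbering bits from the most significant end (bit 0 is the MSB), so CR0 occupies bits 0-3 in the order LT, GT, EQ, SO. The function names are hypothetical and chosen only for this example.

```python
def cr_field(cr, n):
    """Return the 4-bit field CRn (n = 0..7) of a 32-bit CR value."""
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be 0..7")
    return (cr >> (28 - 4 * n)) & 0xF


def decode_cr_field(field):
    """Name the four bits of one CR field, most significant bit first."""
    return {
        "LT": bool(field & 0b1000),
        "GT": bool(field & 0b0100),
        "EQ": bool(field & 0b0010),
        "SO": bool(field & 0b0001),
    }


def cr_bit(cr, index):
    """Return a single CR bit (index 0..31, bit 0 = MSB), as a branch would test it."""
    return (cr >> (31 - index)) & 1
```

For example, with `cr = 0x20000004`, field CR0 has only EQ set and field CR7 has only GT set, so a branch testing CR bit 2 sees the EQ bit of CR0.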
PowerPC (cont’d)
• The XER register serves two distinct purposes
  • Bits 0, 1, and 2 capture
    • Summary overflow (SO), overflow (OV), carry (CA)
    • OV and CA are similar to Pentium’s overflow and carry flags
    • Once set, SO can be cleared only by a special instruction
  • Bits 25 to 31 (7 bits)
    • Specify the number of bytes to be transferred between memory and registers
    • Used by two instructions
      • Load string word indexed (lswx)
      • Store string word indexed (stswx)
    • Can load/store all 32 registers (GPR0 – GPR31)

PowerPC (cont’d)
• Link register (LR)
  • Used to store the procedure return address
    • Stores the effective address of the instruction following the procedure call instruction
  • Procedure calls use the branch instructions
    • Example: b = branch, bl = procedure call
• Count register (CTR)
  • Maintains the loop count value
    • Similar to Pentium's ECX register
  • Branch instructions can test the value
• 32-bit PowerPC implementations use segmentation like the Pentium

PowerPC (cont’d)
• Addressing modes
  • Load/store instructions support three addressing modes
    • All can use GPRs
  • Register Indirect
    • Effective address = contents of rA, or 0
      • Specifying 0 in the rA field generates address 0 (not the contents of GPR0)
  • Register Indirect with Immediate Index
    • Effective address = (contents of rA, or 0) + imm16
  • Register Indirect with Index
    • Effective address = (contents of rA, or 0) + contents of rB

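The three effective-address calculations can be summarized in one helper. This is a sketch under stated assumptions: a 32-bit address space, a list of 32 integers standing in for the GPRs, and a 16-bit immediate that is sign-extended before the addition. The function names are invented for the example.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(gpr, rA, rB=None, imm16=0):
    """Compute a 32-bit effective address.

    rA == 0 selects the constant 0 rather than the contents of GPR0.
    With rB given, the mode is register indirect with index; otherwise
    imm16 (default 0) gives register indirect with immediate index.
    """
    base = 0 if rA == 0 else gpr[rA]
    offset = gpr[rB] if rB is not None else sign_extend16(imm16)
    return (base + offset) & 0xFFFFFFFF
```

With `imm16=0` and no `rB`, the helper reduces to plain register indirect addressing.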
PowerPC (cont’d)
[Figure: PowerPC instruction format]

PowerPC (cont’d)
• Bits 0-5
  • Specify the primary opcode
• Other fields specify suboperations
  • Depend on the instruction type
• AA bit
  • 1 (use absolute address)
  • 0 (use relative address)
• LK bit
  • 0 (no link: plain branch)
  • 1 (link: turns the branch into a procedure call)

PowerPC Instruction Set
• Data transfer instructions
  • Byte loads
      lbz   rD,disp(rA)   ;Load byte and zero
      lbzu  rD,disp(rA)   ;Load byte and zero with update
    • Effective address = contents of rA + disp
      lbzx  rD,rA,rB      ;Load byte and zero indexed
      lbzux rD,rA,rB      ;Load byte and zero with update indexed
    • Effective address = contents of rA + contents of rB
  • Upper three bytes of rD are zeroed
  • Update versions: rA ← effective address

PowerPC Instruction Set (cont’d)
• Similar instructions for halfword and word loads
      lhz, lhzu, lhzx, lhzux
      lwz, lwzu, lwzx, lwzux
• For halfword loads, sign extension is possible
      lha, lhau, lhax, lhaux
• Multiword load
      lmw rD,disp(rA)
  • Loads n consecutive words starting at EA into registers rD, …, r31

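The difference between the "load and zero" forms and the "load algebraic" (sign-extending) forms can be illustrated with a short sketch. It assumes big-endian byte order and a `bytes` object standing in for memory; the Python functions merely borrow the mnemonic names and return the 32-bit value that would end up in rD.

```python
def lbz(mem, ea):
    """Load one byte; the upper three bytes of the result are zero."""
    return mem[ea]


def lhz(mem, ea):
    """Load a halfword (big-endian) and zero-extend it to 32 bits."""
    return int.from_bytes(mem[ea:ea + 2], "big")


def lha(mem, ea):
    """Load a halfword (big-endian) and sign-extend it to 32 bits."""
    value = lhz(mem, ea)
    if value & 0x8000:          # sign bit of the halfword is set
        value |= 0xFFFF0000     # copy it into the upper 16 bits
    return value
```

The halfword 0xFFFE loads as 0x0000FFFE with `lhz` but as 0xFFFFFFFE (that is, -2) with `lha`.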
PowerPC Instruction Set (cont’d)
• Similar instructions for store
      stb, stbu, stbx, stbux
      sth, sthu, sthx, sthux
      stw, stwu, stwx, stwux
• Multiword store
      stmw rS,disp(rA)
  • Stores registers rS, …, r31 into n consecutive words starting at EA

PowerPC Instruction Set (cont’d)
Arithmetic Instructions
• Add instructions
      add   rD,rA,rB   ; rD ← rA + rB
  • Status and overflow bits of CR0 and XER are not altered
      add.  rD,rA,rB   ; alters LT, GT, EQ, SO of CR0
      addo  rD,rA,rB   ; alters SO, OV of XER
      addo. rD,rA,rB   ; alters LT, GT, EQ, SO of CR0 and SO, OV of XER
  • These four instructions do not alter the CA bit of XER

PowerPC Instruction Set (cont’d)
• To alter the CA bit, use
      adde rD,rA,rB
  • adde also adds the incoming CA bit: rD ← rA + rB + CA
  • To alter the other bits, use adde., addeo, addeo.
• Immediate operand version
      addi rD,rA,Simm16
• We can use addi to implement other instructions
      li   rD,value      as   addi rD,0,value
      la   rD,disp(rA)   as   addi rD,rA,disp
      subi rD,rA,value   as   addi rD,rA,-value

PowerPC Instruction Set (cont’d)
• Subtract instructions
      subf rD,rA,rB   ; rD ← rB - rA
  • subf = subtract from
  • Like add, other forms are available
      subf., subfo, subfo.
• Negate instruction
      neg rD,rA       ; rD ← 0 - rA

PowerPC Instruction Set (cont’d)
• Multiply instructions
  • Two instructions are needed to get the upper and lower 32 bits of the 64-bit result
      mullw  rD,rA,rB   ; signed/unsigned multiply
    • Stores the lower-order 32 bits of the result
  • Use the following to get the upper 32 bits
      mulhw  rD,rA,rB   ; signed
      mulhwu rD,rA,rB   ; unsigned
  • Immediate form
      mulli  rD,rA,Simm16
    • Stores only the lower 32 bits of the 48-bit result

PowerPC Instruction Set (cont’d)
• Divide instructions
  • Two divide instructions
    • Signed (divw)
      divw rD,rA,rB   ; rD = rA/rB
    • Unsigned (divwu)
  • Both give only the quotient
  • For quotient and remainder, use
      divw  rD,rA,rB   ; quotient in rD
      mullw rX,rD,rB
      subf  rC,rX,rA   ; remainder in rC

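The three-instruction remainder sequence can be traced with a small model of the instructions involved. The sketch assumes 32-bit two's-complement registers, a nonzero divisor, and a signed division that truncates toward zero; the Python functions only mimic the mnemonics and are not a full model of the processor (overflow cases such as dividing the most negative value by -1 are ignored).

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def divw(ra, rb):
    """Signed divide, quotient truncated toward zero (rb must be nonzero)."""
    a, b = to_signed32(ra), to_signed32(rb)
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK32


def mullw(ra, rb):
    """Low-order 32 bits of the product."""
    return (to_signed32(ra) * to_signed32(rb)) & MASK32


def subf(ra, rb):
    """Subtract from: result = rb - ra."""
    return (rb - ra) & MASK32


def quotient_and_remainder(ra, rb):
    rd = divw(ra, rb)    # divw  rD,rA,rB
    rx = mullw(rd, rb)   # mullw rX,rD,rB
    rc = subf(rx, ra)    # subf  rC,rX,rA  (rC = rA - rX)
    return to_signed32(rd), to_signed32(rc)
```

Because the quotient is truncated toward zero, the remainder takes the sign of the dividend, as the -17 / 5 case shows when run.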
PowerPC Instruction Set (cont’d)
• Logical instructions
      and    rD,rS,rB
      and.   rD,rS,rB
      andi.  rD,rS,Uimm16
      andis. rD,rS,Uimm16
      andc   rD,rS,rB
      andc.  rD,rS,rB
  • andis. = left shift Uimm16 by 16 bit positions before ANDing
  • andc = complement rB before ANDing
  • Dot versions update the LT, GT, EQ, SO bits of CR0
• Logical OR also has six versions
• The move register instruction is implemented using OR
      mr rA,rS   is equivalent to   or rA,rS,rS
• NOP is implemented as
      ori 0,0,0

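The effect of the "shifted" immediate form is easiest to see side by side with the plain immediate form. A minimal sketch, assuming 32-bit registers and ignoring the CR0 update that the dot forms also perform; the function names are hypothetical.

```python
def andi(rs, uimm16):
    """AND rS with the zero-extended 16-bit immediate (affects the low halfword)."""
    return rs & (uimm16 & 0xFFFF)


def andis(rs, uimm16):
    """AND rS with the immediate shifted left by 16 bits (affects the high halfword)."""
    return rs & ((uimm16 & 0xFFFF) << 16)
```

Applied to 0x12345678 with an immediate of 0xFFFF, the first form keeps the low halfword and the second keeps the high halfword.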
PowerPC Instruction Set (cont’d)
• Other logical operations
  • NAND
    • nand, nand.
  • NOR
    • nor, nor.
  • XOR
    • xor, xor.
    • xori, xoris
  • Equivalence (exclusive-NOR)
    • eqv, eqv.

PowerPC Instruction Set (cont’d)
• Shift and rotate instructions
  • Shift left
      slw rA,rS,rB   ; shift left word
    • Shifts the word in rS left by rB positions and stores the result in rA
    • Vacated bit positions are filled with zeros
    • A dot version is also available: slw.
  • Shift right
      srw   srw.    (logical)
      sraw  sraw.   (arithmetic)
  • Rotate left instructions
      rlwnm rA,rS,rB,MB,ME
    • Rotates rS left by rB positions, then ANDs the result with a mask of 1s from bit MB to bit ME
      rotlw rA,rS,rB   is equivalent to   rlwnm rA,rS,rB,0,31

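The rotate-then-mask behaviour of rlwnm can be sketched as follows. The sketch assumes PowerPC bit numbering (bit 0 is the most significant bit), a mask that covers bits MB through ME inclusive and wraps around when MB > ME, and a rotate count taken from the low five bits of rB. Function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n positions (n taken modulo 32)."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def mask32(mb, me):
    """Build a mask with 1s from bit mb through bit me (bit 0 = MSB), wrapping if mb > me."""
    mask = 0
    i = mb
    while True:
        mask |= 1 << (31 - i)
        if i == me:
            break
        i = (i + 1) % 32
    return mask


def rlwnm(rs, rb, mb, me):
    """Rotate left word then AND with mask."""
    return rotl32(rs, rb & 31) & mask32(mb, me)


def rotlw(rs, rb):
    """Plain rotate: the mask covers all 32 bits."""
    return rlwnm(rs, rb, 0, 31)
```

With MB = 24 and ME = 31 the mask keeps only the low byte, so rotating by 8 extracts the original top byte.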
PowerPC Instruction Set (cont’d)
• Compare instructions
  • Two versions
    • For signed and unsigned operands
  • Two formats
    • Register and immediate
  • Register compare
      cmp crfD,rA,rB
    • Updates the LT (rA < rB), GT (rA > rB), EQ, SO bits in crfD
    • If crfD is not specified, CR0 is used
  • Immediate version
      cmpi crfD,rA,Simm16

PowerPC Instruction Set (cont’d)
• Branch instructions
  • Used for both branches (LK = 0) and procedure calls (LK = 1)
  • Can use an absolute (AA = 1) or relative address (AA = 0)
      b   target   (AA=0, LK=0)   Branch
      ba  target   (AA=1, LK=0)   Branch Absolute
      bl  target   (AA=0, LK=1)   Branch then link
      bla target   (AA=1, LK=1)   Branch Absolute then link
  • The last two are procedure calls
• Three types of conditional branches
  • Direct address
  • Register indirect
    • CTR or LR

PowerPC Instruction Set (cont’d)
• Conditional branch instructions (direct address)
      bc   BO,BI,target   (AA=0, LK=0)   Branch Conditional
      bca  BO,BI,target   (AA=1, LK=0)   Branch Conditional Absolute
      bcl  BO,BI,target   (AA=0, LK=1)   Branch Conditional then link
      bcla BO,BI,target   (AA=1, LK=1)   Branch Conditional Absolute then link
  • BO = branch options (5 bits), specifies the branch condition
  • BI = branch input (5 bits), specifies a bit in the CR

PowerPC Instruction Set (cont’d)
• Nine different branch conditions can be specified
  • Decrement CTR; branch if CTR ≠ 0 AND cond = false
  • Decrement CTR; branch if CTR = 0 AND cond = false
  • Decrement CTR; branch if CTR ≠ 0 AND cond = true
  • Decrement CTR; branch if CTR = 0 AND cond = true
  • Branch if cond = false
  • Branch if cond = true
  • Decrement CTR; branch if CTR ≠ 0
  • Decrement CTR; branch if CTR = 0
  • Branch always

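The nine conditions follow from how the five BO bits are interpreted. The sketch below assumes the usual PowerPC encoding, which the slides do not spell out and which should be checked against the architecture manual: with BO[0] as the most significant bit, BO[0] = 1 means "ignore the condition", BO[1] is the value the tested CR bit must have, BO[2] = 1 means "do not decrement or test CTR", BO[3] selects CTR = 0 (1) or CTR ≠ 0 (0), and BO[4] is a prediction hint that is ignored here. The function name `bc_taken` is invented for the example.

```python
def bc_taken(bo, cond_bit, ctr):
    """Return (taken, new_ctr) for a conditional branch.

    bo       -- 5-bit branch options field
    cond_bit -- value (0 or 1) of the CR bit selected by BI
    ctr      -- current 32-bit count register value
    """
    ignore_cond = (bo >> 4) & 1   # BO[0]
    want_cond = (bo >> 3) & 1     # BO[1]
    skip_ctr = (bo >> 2) & 1      # BO[2]
    want_zero = (bo >> 1) & 1     # BO[3]

    if not skip_ctr:
        ctr = (ctr - 1) & 0xFFFFFFFF   # decrement happens before the test

    ctr_ok = bool(skip_ctr) or ((ctr == 0) == bool(want_zero))
    cond_ok = bool(ignore_cond) or (cond_bit == want_cond)
    return ctr_ok and cond_ok, ctr
```

For instance, `bo = 0b10000` gives the classic loop-closing branch: CTR is decremented and the branch is taken while the new CTR value is nonzero.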
PowerPC Instruction Set (cont’d)
• LR-based branch instructions
      bclr  BO,BI   (LK=0)   Branch Conditional to Link Register
      bclrl BO,BI   (LK=1)   Branch Conditional to Link Register then Link
  • Target address is taken from LR
    • Used to return from procedure calls
• CTR-based branch instructions
      bcctr  BO,BI   (LK=0)
      bcctrl BO,BI   (LK=1)
  • CTR instead of LR is used to get the target

Itanium
• Intel’s 64-bit processor
• RISC based
• Based on the EPIC design philosophy
  • Explicit Parallel Instruction Computing
• Support for ILP
  • 3-instruction wide word
• Speculative computation
  • Hides memory latency
• Predication
  • Improves branch handling
• Large number of registers
  • 128 integer and 128 FP
  • Aids in efficient procedure calls

Itanium (cont’d)
• Registers
  • 128 general-purpose registers (gr0 – gr127)
    • 64 bits wide
    • NaT (Not-a-Thing) bit
      • Used in speculative loading
  • Divided into static and stacked
    • Static
      • First 32 registers (gr0 – gr31)
      • gr0 is read-only (always provides zero)
    • Stacked
      • Remaining registers (gr32 – gr127)
      • Available for programs
      • Used as register stack frame

Itanium (cont’d)
• Registers
  • Branch registers
    • 8 in total (br0 – br7)
    • 64 bits wide
    • Specify the target address for
      • Conditional branches
      • Procedure calls
      • Returns
  • User mask register
    • Alignment, byte ordering, …
  • Other registers
    • Predicate registers, application registers, current frame marker

Itanium (cont’d)
• Addressing modes
  • Only load/store instructions can access memory
  • Specify three registers: r1, r2, r3
    • r2 and r3 are used to compute the effective address
    • r1 receives/supplies data
  • Register indirect addressing
    • Effective address = contents of r3
  • Register indirect with immediate addressing
    • Memory is accessed at the address in r3
    • r3 is then updated: r3 = contents of r3 + imm9
  • Register indirect with index addressing
    • Memory is accessed at the address in r3
    • r3 is then updated: r3 = contents of r3 + contents of r2

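The three load forms can be modelled in a few lines. This sketch assumes the base-update behaviour described in Intel's architecture manual, where the access uses the address held in r3 and the addition of imm9 or r2 is written back to r3 afterwards (a post-increment). The dictionary memory, the register list, and the function name `ld` are stand-ins invented for the example.

```python
def ld(mem, gr, r1, r3, r2=None, imm9=None):
    """Model ld r1=[r3], ld r1=[r3],r2 and ld r1=[r3],imm9.

    The load always reads from the address currently in gr[r3];
    the optional base update is applied after the access.
    """
    address = gr[r3]
    gr[r1] = mem[address]
    if r2 is not None:
        gr[r3] = address + gr[r2]     # register indirect with index
    elif imm9 is not None:
        gr[r3] = address + imm9       # register indirect with immediate
```

Walking an array therefore takes a single instruction per element: each load both fetches the current element and advances the pointer for the next one.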
Itanium (cont’d)
• Instruction format
      [(qp)] mnemonic[.comp] dests = srcs
  • qp = qualifying predicate
    • Specifies a predicate register
      • 64 1-bit registers
    • Instruction is executed if the specified PR is 1
    • Otherwise, the instruction is treated as a NOP
  • mnemonic
    • Identifies an instruction (e.g., compare)
  • comp
    • Gives more information to completely specify the instruction
    • E.g., the type of comparison is equality

Itanium (cont’d)
• Examples
  • Add instructions
      add r1 = r2,r3
      add r1 = r2,r3,1
  • Predicated instruction
      (p4) add r1 = r2,r3
  • Compare instructions
      cmp.eq p3 = r2,r4
      cmp.gt p2,p3 = r3,r4
  • Branch instruction
      br.cloop.sptk loop_back

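How a compare feeds a predicated instruction can be shown with a toy register model. The sketch assumes that a two-target compare writes the result to the first predicate and its complement to the second, and that p0 always reads as 1; the helper names and the use of -1 as a marker value are assumptions made for the example.

```python
def cmp_gt(pr, p_true, p_false, a, b):
    """cmp.gt p_true,p_false = a,b: set one predicate and its complement."""
    pr[p_true] = a > b
    pr[p_false] = not (a > b)


def predicated_add(gr, pr, qp, r1, r2, r3):
    """(qp) add r1 = r2,r3: executes only if the qualifying predicate is 1."""
    if pr[qp]:
        gr[r1] = gr[r2] + gr[r3]
    # otherwise the instruction behaves like a NOP


def demo(r3_value, r4_value):
    gr = [0] * 128
    pr = [False] * 64
    pr[0] = True                         # p0 is assumed to be always true
    gr[3], gr[4] = r3_value, r4_value
    gr[1] = -1                           # marker: stays -1 if the add is skipped
    cmp_gt(pr, 2, 3, gr[3], gr[4])       # cmp.gt p2,p3 = r3,r4
    predicated_add(gr, pr, 2, 1, 3, 4)   # (p2) add r1 = r3,r4
    return gr[1]
```

No branch is needed to skip the add: when the predicate is 0 the instruction simply has no effect, which is how predication improves branch handling.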
Instruction-level Parallelism
• Itanium provides
  • Runtime support for explicit parallelism
    • Compiler/assembler can indicate parallelism
      • Instruction groups
  • Large number of registers
• Instruction groups
  • Set of instructions that do not have conflicting dependencies
    • Can be executed in parallel
  • Compiler/assembler indicates the end of a group with the ;; notation

Instruction-level Parallelism (cont’d)
• Example: logical expression with four terms
      if (r10 || r11 || r12 || r13) {
          /* if-block code */
      }
  can be done using or-tree evaluation
      or r1 = r10,r11      /* Group 1 */
      or r2 = r12,r13 ;;
      or r3 = r1,r2 ;;     /* Group 2 */
      other instructions   /* Group 3 */
• The processor can execute as many instructions from a group as it can
  • Depends on the available resources

Itanium Instruction Bundle
• Each instruction is encoded using 41 bits
• Three instructions are bundled together
  • 128-bit instruction bundle (3 × 41 instruction bits + 5 template bits)
  • No conflicting dependencies among the three instructions
  • Aids in instruction-level parallelism
• 5-bit template
  • Specifies the mapping of instruction slots to execution unit types
• Six instruction types
  • Integer ALU, non-ALU integer, memory, branch, FP, extended

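The arithmetic behind the 128-bit bundle (3 × 41 + 5 = 128) can be made concrete by packing and unpacking one. The sketch assumes the template occupies the five least significant bits with the three slots following in order; that layout, and the function names, are assumptions for illustration and should be checked against the architecture manual.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) from a packed bundle."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots
```

Even with every field at its maximum value, the packed bundle needs exactly 128 bits.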
Itanium Instructions
• Data transfer instructions
  • Load and store instructions are more complicated than those of a typical RISC processor
  • Load instructions
      (qp) ldSZ.ldtype.ldhint r1=[r3]
      (qp) ldSZ.ldtype.ldhint r1=[r3],r2
      (qp) ldSZ.ldtype.ldhint r1=[r3],imm9
    • Loads SZ bytes from memory
      • SZ can be 1, 2, 4, or 8 to load 1, 2, 4, or 8 bytes
      • Example: ld8 r5 = [r6]
    • ldtype: special load operations (advanced, speculative)
    • ldhint: locality of memory access

Itanium Instructions (cont’d)
• ldtype
  • This completer can be used to specify special load operations
    • Advanced
      ld8.a r5 = [r6]
    • Speculative
      ld8.s r5 = [r6]
• ldhint
  • Locality of memory access
      None: temporal locality, level 1
      nt1: no temporal locality, level 1
      nta: no temporal locality, all levels

Itanium Instructions (cont’d)
• Store instructions
  • Simpler than load instructions
      (qp) stSZ.sttype.sthint [r3] = r2
      (qp) stSZ.sttype.sthint [r3] = r2,imm9
• Move instructions
      (qp) mov r1 = r3
      (qp) mov r1 = imm22
      (qp) mov r1 = imm64
  • The first two are pseudo-instructions
    • Implemented using other processor instructions

Itanium Instructions (cont’d)
• Arithmetic instructions
  • Simpler than load instructions
      (qp) add r1 = r2,r3
      (qp) add r1 = r2,r3,1
      (qp) add r1 = imm,r4
    • imm can be imm14 or imm22
  • The move instruction
      (qp) mov r1 = r3
    is implemented as
      (qp) add r1 = 0,r3
  • The move instruction
      (qp) mov r1 = imm22
    is implemented as
      (qp) add r1 = imm22,r0