Intel and FAER’s Reach to Teach

Intel and FAER’s Reach to Teach. A Program on Computer Architecture. Part 1: PIPELINED PROCESSORS R. Govindarajan and Matthew Jacob SERC, Indian Institute of Science, Bangalore. Pipelined Processor Architecture. 1. Terminology and assumptions

Presentation Transcript


  1. Intel and FAER’s Reach to Teach A Program on Computer Architecture Part 1: PIPELINED PROCESSORS R. Govindarajan and Matthew Jacob SERC, Indian Institute of Science, Bangalore

  2. Pipelined Processor Architecture 1. Terminology and assumptions 2. Review: Computer organization; Data representation 3. Pipelined processor architecture 4. ILP (Instruction Level Parallelism) processor architecture

  3. What is Computer Architecture? • Architecture in the English dictionary: • art and science of designing and building habitable structures (structures → computer systems; inhabitants → computer programs) • a structure, or structures collectively • a style and method of design and construction (e.g., Moghul architecture) • Computer architecture: the study of computer structures; design, evaluation, description

  4. Computer Architect vs Computer Designer vs Logic Designer • Computer Architect develops the Instruction Set Architecture (ISA: description of instructions which are allowed and semantics of what each instruction does when executed) and computer system architecture • Computer Designer develops detailed machine organization (blocks, specifications, testing) • Logic Designer implements these blocks

  5. Basics: Computer Organization • Registers • General Purpose: Integer Registers, FP Registers • Special Purpose: Program Counter, Stack Pointer, Link Register, Instruction Register • [Block diagram components: CPU, ALU, Registers, CU, MMU, Cache, Memory, Bus, I/O devices]

  6. Basics: Laws, Principles, Rules • Amdahl’s Law (speedup): The performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the slower mode is used
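Amdahl's Law is usually written as speedup = 1 / ((1 - f) + f / s), where f is the fraction of the original execution time that can use the faster mode and s is how much faster that mode is. The slide's own formula did not survive in the transcript, so the sketch below uses this standard form; the function and parameter names are illustrative assumptions.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the run time is sped up by a given factor.

    fraction_enhanced: share of the original execution time that can use the
                       faster mode (between 0 and 1).
    speedup_enhanced:  how many times faster the faster mode is.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    # The time not covered by the faster mode (1 - f) is unchanged;
    # the covered time shrinks by the factor s.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Even a very fast mode cannot push the overall speedup past 1 / (1 - f).
for s in (2, 10, 100):
    print(s, round(amdahl_speedup(0.9, s), 2))
```

With 90% of the time enhanced, the overall speedup stays below 10 no matter how large s becomes, which is the limit the slide describes.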

  7. Principle of Locality of Reference • A program property; programs tend to reuse instructions and data • 90-10 rule: 90% of execution time spent in 10% of code • Temporal locality: recently accessed things are likely to be accessed in near future • Spatial locality: things whose addresses are close in space tend to be accessed close together in time

  8. General Principle of Locality • Denning SJCC 1972, Blevin & Ramamurthy IEEE Trans Comp 1976 • During any interval of time, resource demands are non-uniformly distributed • Correlation between immediate past and immediate future resource demand patterns tends to be high, and correlation between disjoint resource demand patterns tends to 0 as the distance between them tends to infinity • Correlation: direction and strength of the linear relationship between 2 random variables

  9. [Figure: Processor-Memory Performance Gap, 1980 to 2000. Performance (log scale, 1 to 1000) vs. time. CPU (µProc, "Moore’s Law"): 60%/yr. (2X/1.5yr). Memory (DRAM): 9%/yr. (2X/10 yrs). The gap grows 50% / year.]

  10. Background: Data Representation • Binary, bit, Byte • Commonly used representations are: • Character data: ASCII code • Signed Integer data: 2s complement, 1s complement, sign-magnitude • Real data: Floating point • Example: IEEE single precision floating point standard

  11. 2s Complement Representation • The n bit quantity $x_{n-1} x_{n-2} \ldots x_1 x_0$ ($x_0$ is the least significant bit) represents the signed integer value $-x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i$
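As a small illustration of 2s complement, the sketch below evaluates an n-bit string by giving the most significant bit the weight -2^(n-1) and every other bit its usual positive weight. The function names are assumptions made for this example.

```python
def twos_complement_value(bits: str) -> int:
    """Signed value of a 2s complement bit string (most significant bit first)."""
    n = len(bits)
    if n == 0 or any(b not in "01" for b in bits):
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    # The most significant bit carries weight -2^(n-1).
    value = -int(bits[0]) * 2 ** (n - 1)
    # The remaining bits carry weights 2^(n-2) down to 2^0.
    for i, b in enumerate(reversed(bits[1:])):
        value += int(b) * 2 ** i
    return value


def to_twos_complement(value: int, n: int) -> str:
    """n-bit 2s complement bit string for a value in [-2^(n-1), 2^(n-1) - 1]."""
    if not -(2 ** (n - 1)) <= value < 2 ** (n - 1):
        raise ValueError("value does not fit in n bits")
    # Masking to n bits maps negative values onto their 2s complement pattern.
    return format(value & ((1 << n) - 1), f"0{n}b")


print(twos_complement_value("1111"), to_twos_complement(-8, 4))
```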

  12. IEEE Floating Point Representation • 32 bit value (s, f, e), where s is a 1 bit sign, f is a 23 bit fraction and e an 8 bit exponent, evaluates to $(-1)^s \times 1.f \times 2^{e-127}$ • Normalized form • Special forms (zero, infinity, NaN, denormals)
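To make the single precision layout concrete, here is a minimal sketch that splits a 32 bit word into sign, exponent and fraction fields and evaluates the normalized and special forms by hand. The helper names are illustrative; `reference` uses Python's standard `struct` module only as a cross-check.

```python
import math
import struct


def decode_single(word: int) -> float:
    """Evaluate a 32 bit IEEE single precision bit pattern."""
    s = (word >> 31) & 0x1       # 1 bit sign
    e = (word >> 23) & 0xFF      # 8 bit biased exponent
    f = word & 0x7FFFFF          # 23 bit fraction
    sign = -1.0 if s else 1.0
    if e == 0xFF:                # special forms: infinity or NaN
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                   # special forms: zero or denormal (no hidden 1)
        return sign * (f / 2 ** 23) * 2.0 ** -126
    # Normalized form: hidden leading 1, exponent bias of 127.
    return sign * (1 + f / 2 ** 23) * 2.0 ** (e - 127)


def reference(word: int) -> float:
    """Same bit pattern decoded by the struct module, for comparison."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]


for pattern in (0x3F800000, 0xC0000000, 0x40490FDB):
    print(hex(pattern), decode_single(pattern), reference(pattern))
```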

  13. Instruction Set Architecture • Description of machine from view of the programmer/compiler • Example: Intel x86 ISA • Includes specification of • The different kinds of instructions available (instruction set) • How operands are specified (addressing modes) • What each instruction looks like (instruction format)

  14. Kinds of Instructions • Arithmetic/logical instructions • Add, subtract, multiply, divide, compare (int/fp) • Or, and, not, xor • Shift (left/right, arithmetic/logical), rotate • Data transfer instructions • Load (Move data value to a register from memory) • Store (Move data value to memory location from register) • Move • Control transfer instructions • Jump, conditional branch, function call, return • Other instructions • Example: halt

  15. Operand Addressing Modes • Operands to an instruction • Source: input value to instruction • Destination: where result is to go • Addressing Mode • How the location of operand is specified • An operand can be either • in a memory location • in a register

  16. Addressing Modes How the location of operands is specified • Register Direct - in a register, add R1, R2, R3 • Immediate - part of the instrn, add R1, R1, #4 • Register indirect - in memory, register specifying the address of memory, add R1, R2, (R3) • Base-Displacement - memory addr. is sum of base (reg.) and offset, add R1, 8(R3) • Absolute - memory addr. specified in instrn • Indexed - addr. is sum of base + index • Others (Auto increment/decrement, PC relative)
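The addressing modes listed above can be sketched as a single lookup function over a toy register file and memory. Everything here (the dictionaries, the mode names as strings, the parameter names) is a hypothetical model for illustration, not any real ISA's encoding.

```python
def operand_value(mode, regs, mem, reg=None, const=None, index_reg=None):
    """Return the source operand selected by an addressing mode."""
    if mode == "register_direct":      # operand is in a register
        return regs[reg]
    if mode == "immediate":            # operand is part of the instruction
        return const
    if mode == "register_indirect":    # register holds the memory address
        return mem[regs[reg]]
    if mode == "base_displacement":    # address = base register + offset
        return mem[regs[reg] + const]
    if mode == "absolute":             # address is given in the instruction
        return mem[const]
    if mode == "indexed":              # address = base register + index register
        return mem[regs[reg] + regs[index_reg]]
    raise ValueError(f"unknown addressing mode: {mode}")


# Toy machine state (made-up values).
regs = {"R3": 100, "R4": 8}
mem = {100: 11, 108: 22, 200: 33}

# The source operand 8(R3) reads memory at address 100 + 8.
print(operand_value("base_displacement", regs, mem, reg="R3", const=8))
```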

  17. Terms: Byte addressable • Memory: A sequence of locations, each containing some information referenced by an address • Address Space • Memory address space, Register address space • Addressability: how much data in a location? Example: In byte-addressable memory, each location contains 8 bits (1 byte) • Word: data in a set of contiguous locations • Word Length: Maximum data accessed in a single fetch

  18. Terms: Byte ordering, Alignment • Data (in hex) at addresses 400 to 407 (in dec): 1A C8 B2 46 F0 8C 1E DF • Word at 400 • Big Endian byte ordering: 1AC8 B246 = 0001 1010 1100 1000 1011 0010 0100 0110 (Decimal: 449,360,454) • Little Endian byte ordering: 46B2 C81A = 0100 0110 1011 0010 1100 1000 0001 1010 (Decimal: 1,186,121,754) • Word aligned: at a word boundary • Word at 400 is word aligned • Word at 402 is not, but it is short-word aligned
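The byte ordering example on this slide can be checked with a few lines of Python. The sketch assumes the eight data bytes sit at addresses 400 to 407 and that a word is 4 bytes; `memory`, `read_word` and `is_aligned` are names invented for the example.

```python
# Byte-addressable memory: one byte per address, starting at address 400.
DATA = bytes.fromhex("1AC8B246F08C1EDF")
memory = {400 + offset: byte for offset, byte in enumerate(DATA)}


def read_word(addr: int, byteorder: str) -> int:
    """Read the 4 bytes at addr..addr+3 as an unsigned word.

    byteorder is "big" (lowest address holds the most significant byte)
    or "little" (lowest address holds the least significant byte).
    """
    raw = bytes(memory[addr + i] for i in range(4))
    return int.from_bytes(raw, byteorder)


def is_aligned(addr: int, size: int) -> bool:
    """True if addr is a multiple of the access size in bytes."""
    return addr % size == 0


print(read_word(400, "big"), read_word(400, "little"))
print(is_aligned(400, 4), is_aligned(402, 4), is_aligned(402, 2))
```

The two printed word values match the decimal values given on the slide, and address 402 comes out as aligned for a 2 byte short word but not for a 4 byte word.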

  19. ISA Example: MIPS32 ISA • Registers: 32 integer GPRs (R0,R1,…,R31) • R0 is hardwired to 0 • R31 is implicitly used by jal instruction • HI and LO: Special purpose registers used implicitly by multiply and divide instructions • Addressing modes • Register direct • Base displacement (by loads and stores) • Immediate • Absolute (by jump instructions) • PC relative (by branch instructions)

  20. MIPS32 ISA (each entry: instruction: mnemonics; example; meaning. SE: sign extension, ZE: zero extension) • Data Transfer Instructions • Load: LB, LBU, LH, LHU, LUI, LW; lw R2, 4(R3); R2 ← Mem[R3+4] • Store: SB, SH, SW; sb R2, -8(R4); Mem[R4-8] ← R2 • Int. ALU Instructions • Add: ADD, ADDI, ADDIU; add R1, R2, R3; R1 ← R2 + R3 • Subtract: SUB, SUBU; sub R1, R2, R3; R1 ← R2 - R3 • Multiply: MULT, MULTU; mult R1, R2; LO ← LSW(R1*R2), HI ← MSW(R1*R2) • Divide: DIV, DIVU; div R1, R2; LO ← R1 div R2, HI ← R1 mod R2 • Logical: AND, ANDI, OR, ORI, NOR, XOR, XORI; ori R1, R2, #0xF0; R1 ← R2 | ZE(0xF0) • Shift: SLL, SLLV, SRA, SRL; srl R1, R2, #4; R1 ← 0000 || (R2) bits 31..4 • Comparison: SLT, SLTI, SLTU; slti R1, R2, #16; R1 ← 1 if R2 < SE(16), R1 ← 0 otherwise
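A few of the "Meaning" entries on this slide can be modelled directly on 32 bit values. The sketch below covers only multiply (HI/LO), logical right shift by 4 style shifts, and set-on-less-than-immediate; the Python function names are assumptions, and register contents are passed as unsigned 32 bit integers.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x: int) -> int:
    """Interpret a 32 bit pattern as a 2s complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def sign_extend16(imm: int) -> int:
    """SE(imm): interpret a 16 bit immediate as a signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def mult(rs: int, rt: int):
    """mult: signed 64 bit product; returns (HI, LO) = (MSW, LSW)."""
    product = (to_signed32(rs) * to_signed32(rt)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32


def srl(rt: int, shamt: int) -> int:
    """Logical right shift: zeros enter from the left."""
    return (rt & MASK32) >> shamt


def slti(rs: int, imm16: int) -> int:
    """1 if rs < SE(imm16) as signed values, 0 otherwise."""
    return 1 if to_signed32(rs) < sign_extend16(imm16) else 0


hi, lo = mult(0x00010000, 0x00010000)   # 2^16 * 2^16 = 2^32
print(hex(hi), hex(lo), hex(srl(0xF0000000, 4)), slti(5, 16))
```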

  21. MIPS32 ISA (continued; each entry: instruction: mnemonics; example; meaning) • Control Transfer Instructions • Conditional Branch: BEQ, BGEZ, BLTZ, BLEZ, BGTZ, BNE; bltz R2, -16; PC ← PC - 12 if R2 < 0 • Jump: J, JR; j <target>; PC ← (PC) bits 31..28 || target || 00 • Jump & Link: JAL, JALR; jalr R2; R31 ← PC + 8, PC ← R2 • System Call: SYSCALL; syscall • Notation we will use for instructions: Opcode Destination, Source1, Source2 • Example: ADD R1, R2, R3 is written ADD R1 ← R2, R3
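The control transfer meanings on this slide reduce to simple address arithmetic. The sketch below follows the slide's own formulas (a branch offset of -16 yields PC - 12, i.e. PC + 4 + offset); the function names and the byte-offset convention for `offset` are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc: int, offset: int) -> int:
    """Taken-branch target: relative to the instruction after the branch."""
    return (pc + 4 + offset) & MASK32


def jump_target(pc: int, target26: int) -> int:
    """j: top 4 bits of PC, then the 26 bit target field, then two zero bits."""
    return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


def jalr(pc: int, rs_value: int):
    """jalr: returns (new R31, new PC) = (PC + 8, contents of the register)."""
    return (pc + 8) & MASK32, rs_value & MASK32


pc = 0x00001000
print(hex(branch_target(pc, -16)))          # PC - 12
print(hex(jump_target(0x40001000, 0x100)))  # stays in the same 256 MB region
print([hex(v) for v in jalr(0x2000, 0x8000)])
```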

  22. Steps in Instruction Processing (PC: Program Counter; IR: Instruction Register) • Fetch the instruction from memory • Get instruction whose address is in PC from memory into IR • Increment PC • Decode the instruction • Understand instruction, addressing modes, etc • Calculate effective addresses of the operands to the instruction and fetch the operand values • Execute the instruction • Do the required operation • Write back the result of the instruction
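The steps above can be sketched as a tiny interpreter loop. This is a toy model under stated assumptions: instructions are Python tuples of the form (opcode, destination, source1, source2), only ADD, SUB and HALT exist, all operands are registers, and the PC counts instructions rather than bytes.

```python
def run(program, regs):
    """Process instructions one at a time: fetch, decode, execute, write back."""
    pc = 0
    while True:
        ir = program[pc]                 # fetch: instruction at PC goes into IR
        pc += 1                          # increment PC
        opcode, dest, src1, src2 = ir    # decode the instruction
        if opcode == "HALT":
            break
        a, b = regs[src1], regs[src2]    # fetch the operand values
        if opcode == "ADD":              # execute: do the required operation
            result = a + b
        elif opcode == "SUB":
            result = a - b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        regs[dest] = result              # write back the result
    return regs


program = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R2"),
    ("HALT", None, None, None),
]
print(run(program, {"R1": 0, "R2": 5, "R3": 7, "R4": 0}))
```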

  23. Timeline of events • PC to memory → Instruction in IR → PC++; Decode → Op1 eff add calc → Op1 fetched → Op2 eff add calc → Op2 fetched → Op done → Write result • Processor/Memory Speed disparity: 2-3 orders of magnitude

  24. Assumptions • Activity is overlapped in time where possible. • PC increment and instruction fetch? • Instruction decode and effective address calc? • Load-store ISA: the only instructions that take operands from memory are loads & stores • Main memory delays not typically seen by instruction processor • Cache memories (more on this in a later lecture) • Register file with 2 read ports and 1 write port

  25. Processor cycle time: time required to do any one of • Cache memory access • Register access + some logic (like decode) • ALU operation • An instruction can be processed in 3-5 cycles • Jump: IFetch, Decode/OpFetch, DoOp • ALU: IFetch, Decode/OpFetch, DoOp, WriteReg • Load: IFetch, Decode, EffAddr, Cache, WriteReg

  26. Performance of Processor • Which is more important? • execution time of an instruction, or • throughput of instruction execution (number of instructions executed per unit time) • Cycles per instruction (CPI) • In our example, CPI between 3 and 5 • Objective of Pipelining • To improve CPI; make it close to 1
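To see how a CPI between 3 and 5 arises, the sketch below averages the per-instruction cycle counts from the previous slide (jump: 3, ALU: 4, load: 5) over an instruction mix. The mix itself is a made-up example, not a measurement from the lecture.

```python
# Cycles per instruction class, from the step lists on the previous slide.
CYCLES = {"jump": 3, "alu": 4, "load": 5}


def average_cpi(mix):
    """Weighted average CPI for a mix given as {class: relative frequency}."""
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("instruction mix must have positive total weight")
    return sum(CYCLES[kind] * weight for kind, weight in mix.items()) / total


# Hypothetical instruction mix, for illustration only.
example_mix = {"jump": 0.2, "alu": 0.5, "load": 0.3}
print(round(average_cpi(example_mix), 2))
```

Whatever the mix, the unpipelined average stays between 3 and 5, which is why pipelining aims at a CPI close to 1 instead.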

  27. Steps in Instruction Processing • IF (Instruction Fetch): instruction is fetched from memory and PC is incremented • ID (Instruction Decode): instruction is decoded and register operands fetched • EX (Execute): execute if arithmetic operation; else, calculate effective address • MEM (Memory operation): if Load/Store, do memory access • WB (Write Back): write back computed value to destination register

  28. Pipelining [Figure: four instructions overlapped in time, each passing through IF, ID, EX, MEM, WB and each starting one cycle after the previous one] • Instruction execution time: 5 cycles • Instruction execution throughput: 1 instruction per cycle • It may not always be possible for instructions to progress through the pipeline in this way
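The ideal pipeline timing on this slide can be sketched as follows: instruction i enters IF in cycle i, so n instructions finish in 5 + n - 1 cycles when nothing stalls. The function names are assumptions made for the example.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def ideal_schedule(num_instructions):
    """Row i lists (cycle, stage) pairs for instruction i, which enters IF in cycle i."""
    return [
        [(i + k, stage) for k, stage in enumerate(STAGES)]
        for i in range(num_instructions)
    ]


def total_cycles(num_instructions):
    """Cycles needed to complete all instructions with no stalls."""
    if num_instructions == 0:
        return 0
    return len(STAGES) + num_instructions - 1


def render(num_instructions):
    """Text version of the pipeline diagram, one row per instruction."""
    width = total_cycles(num_instructions)
    lines = []
    for i, row in enumerate(ideal_schedule(num_instructions)):
        cells = [""] * width
        for cycle, stage in row:
            cells[cycle] = stage
        lines.append(f"i+{i}: " + " ".join(f"{c:<3}" for c in cells).rstrip())
    return "\n".join(lines)


print(render(4))
print(total_cycles(1000) / 1000)  # average cycles per instruction approaches 1
```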

  29. Pipeline Hazards Hazard: a situation that prevents the next instruction of the program from executing during its designated clock cycle • Structural hazard: Happens due to request for the same hardware resource by 2 or more instructions at the same time • Data hazard: Happens when one instruction depends on the result of previous instruction that is still in the pipeline • Control hazard: Happens due to control transfer instructions

  30. 1. Structural Hazards • MEM & IF need to use memory • i: LW R3 ← mem[8(R2)] [Figure: pipeline diagram of instructions i to i + 3; the IF of i + 3 falls in the same cycle as the MEM of i, so i + 3 is delayed by a bubble (B)]

  31. 2. Data Hazards • i: add R3 ← R1, R2 • i + 1: sub R4 ← R3, R8 [Figure: pipeline diagram; i + 1 needs R3 before i has written it back, so bubbles (B) delay the ID, EX, MEM and WB stages of i + 1]

  32. A Data Hazard Solution • Interlock: Hardware that detects data dependency and stalls dependent instructions • Timing (cycle numbers in parentheses) • ADD: IF (0), ID (1), EX (2), MEM (3), WB (4) • SUB: IF (1), stall (2), stall (3), ID (4), EX (5), MEM (6) • OR: stall (2), stall (3), IF (4), ID (5), EX (6)
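The interlock timing on this slide can be reproduced with a small model. It rests on assumptions chosen to match the slide: there is no forwarding, a register written in WB can be read by an ID stage in the same cycle, and an instruction is fetched in the cycle the previous one enters ID. The register names come from the add/sub/or sequence used in the forwarding example; the function name and tuple format are invented here.

```python
def schedule_with_interlock(program):
    """program: list of (name, dest, sources). Returns per-instruction stage cycles."""
    wb_cycle_of = {}   # register -> cycle in which its latest producer writes back
    timeline = []
    prev_decode = None
    for name, dest, sources in program:
        # In-order fetch: IF happens when the previous instruction moves into ID.
        fetch = 0 if prev_decode is None else prev_decode
        # ID must wait until every source register has been written back.
        ready = max((wb_cycle_of.get(r, 0) for r in sources), default=0)
        decode = max(fetch + 1, ready)
        timeline.append({
            "name": name, "IF": fetch, "ID": decode, "EX": decode + 1,
            "MEM": decode + 2, "WB": decode + 3,
            "stalls": decode - (fetch + 1),   # cycles spent waiting after IF
        })
        if dest is not None:
            wb_cycle_of[dest] = decode + 3
        prev_decode = decode
    return timeline


program = [
    ("ADD", "R3", ["R1", "R2"]),
    ("SUB", "R5", ["R3", "R4"]),
    ("OR",  "R7", ["R3", "R6"]),
]
for row in schedule_with_interlock(program):
    print(row)
```

Under these assumptions SUB waits two cycles for R3, and OR is simply fetched later because SUB is holding up the front of the pipeline.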

  33. Another Data Hazard Solution • Forwarding or Bypassing: forward the result as soon as available to EX • add R3 ← R1, R2: IF ID EX MEM WB • sub R5 ← R3, R4: IF ID EX MEM • or R7 ← R3, R6: IF ID EX [Figure: each instruction starts one cycle after the previous one]

  34. Other Data Hazards Solutions • Delayed loads • Require that instruction that uses load value be separated from the load instruction • Instruction Scheduling • Reorder instructions so that dependent instructions are far enough apart • Compile time vs run time instruction scheduling

  35. Instruction Scheduling • Before Scheduling: LW R3 ← 0(R1); ADDI R5 ← R3, #1 (1 stall); ADD R2 ← R2, R3; LW R13 ← 0(R11); ADD R12 ← R13, R3 (1 stall) • Total: 2 stalls (following load) • After Scheduling: LW R3 ← 0(R1); LW R13 ← 0(R11); ADDI R5 ← R3, #1; ADD R2 ← R2, R3; ADD R12 ← R13, R3 • Total: 0 stalls
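The stall counts on this slide can be checked with a simple load-use model. It assumes that only an instruction immediately following a load and reading the loaded register stalls, and that it stalls for exactly one cycle, which is consistent with the "1 stall" annotations here. The tuple format and function name are invented for the example.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls caused by using a loaded value in the very next instruction.

    program: list of (opcode, dest, sources) tuples.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


before = [
    ("LW",   "R3",  ["R1"]),
    ("ADDI", "R5",  ["R3"]),
    ("ADD",  "R2",  ["R2", "R3"]),
    ("LW",   "R13", ["R11"]),
    ("ADD",  "R12", ["R13", "R3"]),
]
after = [
    ("LW",   "R3",  ["R1"]),
    ("LW",   "R13", ["R11"]),
    ("ADDI", "R5",  ["R3"]),
    ("ADD",  "R2",  ["R2", "R3"]),
    ("ADD",  "R12", ["R13", "R3"]),
]
print(load_use_stalls(before), load_use_stalls(after))
```

Moving the second load up gives both loaded values time to arrive before their first use, so the reordered sequence incurs no load-use stalls under this model.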

  36. 3. Control Hazards • BEQZ R3, out [Figure: pipeline diagram of the branch and the instructions after it. The branch condition & target are resolved partway through the branch's stages. For each fetch slot before that point the question is: fetch instrn. (i + 1) or from target? Bubbles (B) fill those slots. Once the branch is resolved, the appropriate instruction is correctly fetched.]

  37. Lecture Summary • Computer architecture is the study of computer structures; design, evaluation, description • It builds on a background of computer organization, the study of how data can be represented and manipulated • Pipelined processors improve program execution time (instruction execution throughput) by overlapping in time the execution of many instructions

  38. Next Week • Instruction Level Parallelism (ILP) and how it is exploited by current processors to improve program execution time even more
