Intel and FAER’s Reach to Teach

Intel and FAER’s Reach to Teach: A Program on Computer Architecture • Part 1: PIPELINED PROCESSORS • R. Govindarajan and Matthew Jacob, SERC, Indian Institute of Science, Bangalore

Pipelined Processor Architecture 1. Terminology and assumptions 2. Review: Computer organization; Data representation 3. Pipelined processor architecture 4. ILP (Instruction Level Parallelism) processor architecture

What is Computer Architecture? architecture in the English dictionary • art and science of designing and building habitable structures • structures → computer systems • inhabitants → computer programs • a structure, or structures collectively • a style and method of design and construction (e.g., Moghul architecture) • The study of computer structures; design, evaluation, description

Computer Architect vs Computer Designer vs Logic Designer • Computer Architect develops the Instruction Set Architecture (ISA: description of instructions which are allowed and semantics of what each instruction does when executed) and computer system architecture • Computer Designer develops detailed machine organization (blocks, specifications, testing) • Logic Designer implements these blocks

Basics: Computer Organization (block diagram) • CPU: ALU, Registers, CU, MMU, Cache • Registers, General Purpose: Integer Registers, FP Registers • Registers, Special Purpose: Program Counter, Stack Pointer, Link Register, Instruction Register • Memory • Bus • I/O devices

Basics: Laws, Principles, Rules • Amdahl’s Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the slower mode is used • Speedup: execution time without the faster mode divided by execution time with it
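A minimal sketch of Amdahl's Law in its usual textbook form, speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that can use the faster mode and s is the speedup of that mode. The function name and the sample values are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of execution time is sped up.

    fraction_enhanced: share of the original time that can use the faster mode (0..1)
    speedup_enhanced:  how many times faster that share runs in the faster mode
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    # Time that stays slow, plus the accelerated share of the time
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Hypothetical case: 90% of the time can be accelerated.
for s in (2, 10, 100):
    print(f"faster mode x{s}: overall speedup {amdahl_speedup(0.9, s):.2f}")
```

However large the speedup of the faster mode, the overall speedup in this example stays below 1 / (1 - 0.9) = 10, because the remaining 10% of the time still runs in the slower mode.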

Principle of Locality of Reference • A program property; programs tend to reuse instructions and data • 90-10 rule: 90% of execution time spent in 10% of code • Temporal locality: recently accessed things are likely to be accessed in near future • Spatial locality: things whose addresses are close in space tend to be accessed close together in time

General Principle of Locality • Denning SJCC 1972, Blevin & Ramamurthy IEEE Trans Comp 1976 • During any interval of time, resource demands are non-uniformly distributed • Correlation between immediate past and immediate future resource demand patterns tends to be high, and correlation between disjoint resource demand patterns tends to 0 as the distance between them tends to infinity • Correlation: direction and strength of the linear relationship between 2 random variables

Figure: Processor-Memory Performance Gap, 1980 to 2000 (x-axis: Time; y-axis: Performance, log scale from 1 to 1000) • µProc (CPU): 60%/yr. (2X/1.5 yr), `Moore’s Law’ • DRAM (Memory): 9%/yr. (2X/10 yrs) • Processor-Memory Performance Gap grows 50% / year

Background: Data Representation • Binary, bit, byte • Commonly used representations are: • Character data: ASCII code • Signed integer data: 2s complement; also 1s complement, sign-magnitude • Real data: Floating point • Example: IEEE single precision floating point standard

2s Complement Representation • The n bit quantity $x_{n-1} x_{n-2} \ldots x_1 x_0$ ($x_0$ is the least significant bit) represents the signed integer value $-x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i$
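A small sketch of how an n-bit 2s complement pattern is evaluated: the most significant bit carries weight -2^(n-1) and every other bit carries its usual positive weight. The function name is a hypothetical helper chosen for illustration.

```python
def twos_complement_value(bits: str) -> int:
    """Signed value of a 2s complement bit string, most significant bit first."""
    if not bits or any(b not in "01" for b in bits):
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    n = len(bits)
    msb = int(bits[0])
    # Remaining n-1 bits are an ordinary unsigned number
    rest = int(bits[1:], 2) if n > 1 else 0
    return -msb * (1 << (n - 1)) + rest


for pattern in ("0111", "1000", "1111", "11111110"):
    print(pattern, "->", twos_complement_value(pattern))
```

With 4 bits the representable range is -8 to 7, which is why "1000" is the most negative value and "0111" the most positive.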

IEEE Floating Point Representation • 32 bit value (s, f, e), where s is a 1 bit sign, f is a 23 bit fraction and e an 8 bit exponent • Normalized form: evaluates to $(-1)^s \times 1.f \times 2^{e-127}$ • Special forms (zero, infinity, NaN, denormals)
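As an illustration, the sketch below splits a 32-bit IEEE single precision pattern into its sign, exponent and fraction fields and evaluates the normalized form (-1)^s × 1.f × 2^(e - 127). The helper names are assumptions for this example; the bit pattern of a Python float is obtained with the standard `struct` module.

```python
import struct


def float_to_bits(x: float) -> int:
    """32-bit IEEE single precision pattern of x, as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def decode_single(bits: int) -> tuple[int, int, int]:
    """Split a 32-bit pattern into (sign, exponent, fraction) fields."""
    s = (bits >> 31) & 0x1       # 1 bit sign
    e = (bits >> 23) & 0xFF      # 8 bit biased exponent
    f = bits & 0x7FFFFF          # 23 bit fraction
    return s, e, f


def normalized_value(s: int, e: int, f: int) -> float:
    """Value of a normalized number; special forms are rejected."""
    if e == 0 or e == 255:
        raise ValueError("e = 0 or e = 255 encodes a special form, not a normalized value")
    return (-1.0) ** s * (1.0 + f / 2**23) * 2.0 ** (e - 127)


bits = float_to_bits(-2.5)
s, e, f = decode_single(bits)
print(f"bits=0x{bits:08X} s={s} e={e} f=0x{f:06X} value={normalized_value(s, e, f)}")
```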

Instruction Set Architecture • Description of machine from view of the programmer/compiler • Example: Intel x86 ISA • Includes specification of • The different kinds of instructions available (instruction set) • How operands are specified (addressing modes) • What each instruction looks like (instruction format)

Kinds of Instructions • Arithmetic/logical instructions • Add, subtract, multiply, divide, compare (int/fp) • Or, and, not, xor • Shift (left/right, arithmetic/logical), rotate • Data transfer instructions • Load (Move data value to a register from memory) • Store (Move data value to memory location from register) • Move • Control transfer instructions • Jump, conditional branch, function call, return • Other instructions • Example: halt

Operand Addressing Modes • Operands to an instruction • Source: input value to instruction • Destination: where result is to go • Addressing Mode • How the location of operand is specified • An operand can be either • in a memory location • in a register

Addressing Modes How the location of operands is specified • Register Direct - in a register, add R1, R2, R3 • Immediate - part of the instrn, add R1, R1, #4 • Register indirect - in memory, register specifying the address of memory, add R1, R2, (R3) • Base-Displacement - memory addr. is sum of base (reg.) and offset, add R1, 8(R3) • Absolute - memory addr. specified in instrn • Indexed - addr. is sum of base + index • Others (Auto increment/decrement, PC relative)

Terms: Byte addressable • Memory: A sequence of locations, each containing some information referenced by an address • Address Space • Memory address space, Register address space • Addressability: how much data in a location? Example: In byte-addressable memory, each location contains 8 bits (1 byte) • Word: data in a set of contiguous locations • Word Length: Maximum data accessed in a single fetch

Terms: Byte ordering, Alignment
Address (in dec) | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407
Data (in hex) | 1A | C8 | B2 | 46 | F0 | 8C | 1E | DF
Word at 400
• Big Endian byte ordering: 1AC8 B246 = 0001 1010 1100 1000 1011 0010 0100 0110 (Decimal: 449,360,454)
• Little Endian byte ordering: 46B2 C81A = 0100 0110 1011 0010 1100 1000 0001 1010 (Decimal: 1,186,121,754)
Word aligned: at a word boundary
• Word at 400 is word aligned
• Word at 402 is not, but it is short-word aligned
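The byte ordering example can be checked with a short sketch that reads the same four bytes under both orderings. The 8 data bytes and the base address 400 come from the slide; the helper names `word_at` and `is_aligned` are assumptions, and a 4-byte word with a 2-byte short word is assumed.

```python
# Bytes stored at addresses 400..407, as on the slide
BASE_ADDRESS = 400
DATA = bytes.fromhex("1AC8B246F08C1EDF")


def word_at(address: int, byteorder: str, size: int = 4) -> int:
    """Interpret `size` bytes starting at `address` as an unsigned integer."""
    offset = address - BASE_ADDRESS
    return int.from_bytes(DATA[offset:offset + size], byteorder)


def is_aligned(address: int, size: int) -> bool:
    """An access is aligned when its address is a multiple of its size."""
    return address % size == 0


print(f"big endian:    {word_at(400, 'big'):,}")
print(f"little endian: {word_at(400, 'little'):,}")
print("400 word aligned:", is_aligned(400, 4))
print("402 word aligned:", is_aligned(402, 4), "| short-word aligned:", is_aligned(402, 2))
```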

ISA Example: MIPS32 ISA • Registers: 32 integer GPRs (R0,R1,…,R31) • R0 is hardwired to 0 • R31 is implicitly used by jal instruction • HI and LO: Special purpose registers used implicitly by multiply and divide instructions • Addressing modes • Register direct • Base displacement (by loads and stores) • Immediate • Absolute (by jump instructions) • PC relative (by branch instructions)

MIPS32 ISA
Instruction | Mnemonic | Example | Meaning
Data Transfer Instructions
Load | LB, LBU, LH, LHU, LUI, LW | lw R2, 4(R3) | R2 ← Mem[R3+4]
Store | SB, SH, SW | sb R2, -8(R4) | Mem[R4 - 8] ← R2
Int. ALU Instructions
Add | ADD, ADDI, ADDIU | add R1, R2, R3 | R1 ← R2 + R3
Subtract | SUB, SUBU | sub R1, R2, R3 | R1 ← R2 - R3
Multiply | MULT, MULTU | mult R1, R2 | LO ← LSW(R1*R2); HI ← MSW(R1*R2)
Divide | DIV, DIVU | div R1, R2 | LO ← R1 div R2; HI ← R1 mod R2
Logical | AND, ANDI, OR, ORI, NOR, XOR, XORI | ori R1, R2, #0xF0 | R1 ← R2 | ZE(0xF0)
Shift | SLL, SLLV, SRA, SRL | srl R1, R2, #4 | R1 ← 0000 || (R2)[31..4]
Comparison | SLT, SLTI, SLTU | slti R1, R2, #16 | R1 ← 1 if R2 < SE(16); R1 ← 0 otherwise
(SE: sign-extended; ZE: zero-extended; LSW/MSW: least/most significant word)

MIPS32 ISA (continued)
Instruction | Mnemonic | Example | Meaning
Control Transfer Instructions
Conditional Branch | BEQ, BGEZ, BLTZ, BLEZ, BGTZ, BNE | bltz R2, -16 | PC ← PC - 12 if R2 < 0
Jump | J, JR | j <target> | PC ← (PC)[31..28] || target || 00
Jump & Link | JAL, JALR | jalr R2 | R31 ← PC + 8; PC ← R2
System Call | SYSCALL | syscall |
Notation we will use for instructions: Opcode Destination, Source1, Source2
Example: ADD R1, R2, R3 is written ADD R1 ← R2, R3
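A rough sketch of the control transfer address arithmetic as the slide states it: a branch offset is applied relative to PC + 4, a jump keeps the top 4 bits of the PC and appends the 26-bit target followed by two zero bits, and jump-and-link saves PC + 8. The function names are hypothetical, and the model follows the slide's notation rather than any particular MIPS implementation.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc: int, offset_bytes: int) -> int:
    """Taken-branch address: offset applied to the next instruction (PC + 4)."""
    return (pc + 4 + offset_bytes) & MASK32


def jump_target(pc: int, target26: int) -> int:
    """PC[31..28] || target || 00, with a 26-bit target field."""
    return (pc & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


def link_address(pc: int) -> int:
    """Return address saved in R31 by jump-and-link, as given on the slide."""
    return (pc + 8) & MASK32


pc = 0x40001000
print(hex(branch_target(pc, -16)))    # bltz R2, -16 when taken: PC - 12
print(hex(jump_target(pc, 0x100)))    # j with target field 0x100
print(hex(link_address(pc)))          # value written to R31 by jalr
```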

Steps in Instruction Processing • Fetch the instruction from memory • Get instruction whose address is in PC (Program Counter) from memory into IR (Instruction Register) • Increment PC • Decode the instruction • Understand instruction, addressing modes, etc • Calculate effective addresses of the operands to the instruction and fetch the operand values • Execute the instruction • Do the required operation • Write back the result of the instruction

Timeline of events • PC to memory → Instruction in IR → PC++; Decode → Op1 eff addr calc → Op1 fetched → Op2 eff addr calc → Op2 fetched → Op done → Write result • Processor/Memory speed disparity: 2-3 orders of magnitude

Assumptions • Activity is overlapped in time where possible. • PC increment and instruction fetch? • Instruction decode and effective address calc? • Load-store ISA: the only instructions that take operands from memory are loads & stores • Main memory delays not typically seen by instruction processor • Cache memories (more on this in a later lecture) • Register file with 2 read ports and 1 write port

Processor cycle time: time required to do • Cache memory access • Register access + some logic (like decode) • ALU operation • Instruction can be processed in 3-5 cycles • Jump: IFetch, Decode/OpFetch, DoOp • ALU: IFetch, Decode/OpFetch, DoOp, WriteReg • Load: IFetch, Decode, EffAddr, Cache, WriteReg

Performance of Processor • Which is more important? • execution time of an instruction, or • throughput of instruction execution (number of instructions executed per unit time) • Cycles per instruction (CPI) • In our example, CPI between 3 and 5 • Objective of Pipelining • To improve CPI; make it close to 1
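To make the CPI figure concrete, the sketch below computes an average CPI for the unpipelined processor from the per-class cycle counts given earlier (jump 3, ALU 4, load 5). The instruction mix is a made-up assumption for illustration, as is the function name.

```python
# Cycles per instruction class, from the step counts on the earlier slide
CYCLES = {"jump": 3, "alu": 4, "load": 5}


def average_cpi(mix: dict[str, float]) -> float:
    """Weighted average CPI for an instruction mix whose fractions sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * CYCLES[kind] for kind, fraction in mix.items())


# Hypothetical mix: 50% ALU, 30% loads, 20% jumps
mix = {"alu": 0.5, "load": 0.3, "jump": 0.2}
print(f"average CPI: {average_cpi(mix):.2f}")
```

Any mix of these three classes gives a CPI between 3 and 5, which is the gap that pipelining aims to close by bringing CPI close to 1.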

Steps in Instruction Processing • IF (Instruction Fetch): instruction is fetched from memory and PC is incremented • ID (Instruction Decode): instruction is decoded and register operands fetched • EX (Execute): execute if arithmetic operation; else, calculate effective address • MEM (Memory operation): if Load/Store, do memory access • WB (Write back): write back computed value to destination register

Pipelining
time → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
i | IF | ID | EX | MEM | WB | | |
i + 1 | | IF | ID | EX | MEM | WB | |
i + 2 | | | IF | ID | EX | MEM | WB |
i + 3 | | | | IF | ID | EX | MEM | WB
• Instruction execution time: 5 cycles
• Instruction execution throughput: 1 instruction per cycle
• It may not always be possible for instructions to progress through the pipeline in this way
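A minimal sketch of the ideal (hazard-free) 5-stage pipeline: instruction k enters IF one cycle after instruction k - 1, so n instructions finish in 5 + n - 1 cycles instead of 5n. The function names are assumptions made for this example.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")


def pipeline_chart(n: int) -> list[list[str]]:
    """One row per instruction; row k is shifted right by k cycles."""
    return [[""] * k + list(STAGES) for k in range(n)]


def pipelined_cycles(n: int, depth: int = len(STAGES)) -> int:
    """Cycles to complete n instructions with no stalls."""
    return 0 if n == 0 else depth + n - 1


def unpipelined_cycles(n: int, depth: int = len(STAGES)) -> int:
    """Cycles when each instruction runs all stages before the next starts."""
    return n * depth


for row in pipeline_chart(4):
    print(" ".join(f"{stage:<4}" for stage in row))

n = 1000
print("pipelined CPI:  ", pipelined_cycles(n) / n)
print("unpipelined CPI:", unpipelined_cycles(n) / n)
```

Each instruction still takes 5 cycles from fetch to write back; only the throughput improves, and the effective CPI approaches 1 as the instruction count grows.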

Pipeline Hazards Hazard: a situation that prevents the next instruction of the program from executing during its designated clock cycle • Structural hazard: Happens due to request for the same hardware resource by 2 or more instructions at the same time • Data hazard: Happens when one instruction depends on the result of previous instruction that is still in the pipeline • Control hazard: Happens due to control transfer instructions

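1. Structural Hazards
MEM & IF need to use memory
time → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
i: LW R3 ← mem[8(R2)] | IF | ID | EX | MEM | WB | | |
i + 1 | | IF | ID | EX | MEM | WB | |
i + 2 | | | IF | ID | EX | MEM | WB |
bubble | | | | B | B | B | B | B
i + 3 | | | | | IF | ... | |
(B: bubble. The fetch of i + 3 would need memory in the same cycle as the MEM stage of i, so it is delayed.)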

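2. Data Hazards
time → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
i: add R3 ← R1, R2 | IF | ID | EX | MEM | WB | | |
i + 1: sub R4 ← R3, R8 | | IF | B | B | ID | EX | MEM | WB
(B: bubble. Without the bubbles, i + 1 would read R3 in ID before i has written it in WB.)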

A Data Hazard Solution
• Interlock: Hardware that detects data dependency and stalls dependent instructions
time | 0 | 1 | 2 | 3 | 4 | 5 | 6
ADD | IF | ID | EX | MEM | WB | |
SUB | | IF | stall | stall | ID | EX | MEM
OR | | | stall | stall | IF | ID | EX
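The interlock timeline can be reproduced with a small model, sketched here under stated assumptions: there is no forwarding, a register written in WB can be read by an ID stage in the same cycle, and an instruction is fetched in the cycle its predecessor enters ID. The operands are borrowed from the forwarding example (add R3 ← R1, R2; sub R5 ← R3, R4; or R7 ← R3, R6), and `interlock_timeline` is a hypothetical name.

```python
def interlock_timeline(program):
    """Return (name, IF cycle, ID cycle, stall cycles) for each instruction.

    program: list of (name, destination register, source registers).
    """
    wb_cycle = {}   # register -> cycle in which its producer reaches WB
    rows = []
    prev_id = 0     # ID cycle of the previous instruction (first IF is cycle 0)
    for name, dest, sources in program:
        if_cycle = prev_id
        earliest_id = if_cycle + 1
        # ID must wait until every pending source register has been written
        pending = [wb_cycle[r] for r in sources if r in wb_cycle]
        id_cycle = max([earliest_id] + pending)
        rows.append((name, if_cycle, id_cycle, id_cycle - earliest_id))
        wb_cycle[dest] = id_cycle + 3   # ID -> EX -> MEM -> WB
        prev_id = id_cycle
    return rows


program = [
    ("ADD", "R3", ("R1", "R2")),
    ("SUB", "R5", ("R3", "R4")),
    ("OR", "R7", ("R3", "R6")),
]
for name, if_c, id_c, stalls in interlock_timeline(program):
    print(f"{name:<4} IF@{if_c} ID@{id_c} stalls={stalls}")
```

Under these assumptions SUB stalls for 2 cycles waiting for R3, and OR is simply delayed behind it without stalling on its own account.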

Another Data Hazard Solution
• Forwarding or Bypassing: forward the result as soon as available to EX
add R3 ← R1, R2 | IF | ID | EX | MEM | WB
sub R5 ← R3, R4 | | IF | ID | EX | MEM
or R7 ← R3, R6 | | | IF | ID | EX

Other Data Hazards Solutions • Delayed loads • Require that instruction that uses load value be separated from the load instruction • Instruction Scheduling • Reorder instructions so that dependent instructions are far enough apart • Compile time vs run time instruction scheduling

Instruction Scheduling
Before Scheduling: 2 stalls
LW R3 ← 0(R1)
ADDI R5 ← R3, #1 (1 stall, following load)
ADD R2 ← R2, R3
LW R13 ← 0(R11)
ADD R12 ← R13, R3 (1 stall, following load)
After Scheduling: 0 stalls
LW R3 ← 0(R1)
LW R13 ← 0(R11)
ADDI R5 ← R3, #1
ADD R2 ← R2, R3
ADD R12 ← R13, R3
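The stall counts for the two instruction orders can be checked with a simple load-use model, sketched here assuming that forwarding handles every other dependence and that only an instruction immediately following a load it depends on costs one stall. The tuple encoding and the name `load_use_stalls` are illustrative assumptions.

```python
def load_use_stalls(program) -> int:
    """Count 1-cycle stalls caused by using a loaded value in the next instruction.

    program: list of (opcode, destination register, source registers).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LW" and dest in current[2]:
            stalls += 1
    return stalls


before = [
    ("LW", "R3", ("R1",)),
    ("ADDI", "R5", ("R3",)),
    ("ADD", "R2", ("R2", "R3")),
    ("LW", "R13", ("R11",)),
    ("ADD", "R12", ("R13", "R3")),
]
after = [
    ("LW", "R3", ("R1",)),
    ("LW", "R13", ("R11",)),
    ("ADDI", "R5", ("R3",)),
    ("ADD", "R2", ("R2", "R3")),
    ("ADD", "R12", ("R13", "R3")),
]
print("before scheduling:", load_use_stalls(before), "stalls")
print("after scheduling: ", load_use_stalls(after), "stalls")
```

Moving the second load up separates each load from its first use, which is the reordering a compiler or the hardware performs when it schedules instructions.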

3. Control Hazards
Pipeline diagram (time →), B: bubble
• BEQZ R3, out: IF ID EX MEM WB (the branch condition & target are resolved in one of these stages)
• Next two fetch slots: Fetch instrn. (i + 1) or from target? The pipeline carries bubbles (B) for these slots
• Following slot: Branch resolved; appropriate instruction correctly fetched (IF ID EX MEM ...)
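The cost of control hazards can be estimated with a one-line model, sketched here assuming an ideal CPI of 1 and a fixed number of bubble cycles for every branch. The branch fraction and penalty values are hypothetical, and the function name is an assumption.

```python
def cpi_with_branch_stalls(branch_fraction: float, penalty_cycles: int,
                           base_cpi: float = 1.0) -> float:
    """Effective CPI when every branch inserts `penalty_cycles` bubbles."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must lie between 0 and 1")
    if penalty_cycles < 0:
        raise ValueError("penalty_cycles must not be negative")
    return base_cpi + branch_fraction * penalty_cycles


# Hypothetical program in which 20% of instructions are branches
for penalty in (1, 2, 3):
    print(f"penalty {penalty}: CPI = {cpi_with_branch_stalls(0.2, penalty):.2f}")
```

The earlier the branch condition and target are resolved in the pipeline, the smaller the penalty term and the closer the CPI stays to 1.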

Lecture Summary • Computer architecture is the study of computer structures; design, evaluation, description • It builds on a background of computer organization, the study of how data can be represented and manipulated • Pipelined processors improve program execution time (instruction execution throughput) by overlapping in time the execution of many instructions

Next Week • Instruction Level Parallelism (ILP) and how it is exploited by current processors to improve program execution time even more
