Chapter 21: IA-64 Architecture (Think Intel Itanium), also known as EPIC – Explicitly Parallel Instruction Computing
Superpipelined & Superscalar Machines
Superpipelined machine:
• Superpipelined machines overlap many pipe stages
• Rely on a stage being able to begin its operation before the previous one has completed
Superscalar machine:
• A superscalar machine employs multiple independent pipelines to execute multiple independent instructions in parallel
• Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently
Why a New Architecture Direction? Processor designers' obvious choices for using the increasing number of transistors on a chip and the extra speed:
• Bigger caches: diminishing returns
• A higher degree of superscaling by adding more execution units: a complexity wall (more logic, improved branch prediction needed, more renaming registers, more complicated dependencies)
• Multiple processors: a challenge to use them effectively in general-purpose computing
• Longer pipelines: a greater penalty for misprediction
IA-64: Background
• Explicitly Parallel Instruction Computing (EPIC), jointly developed by Intel & Hewlett-Packard (HP)
• New 64-bit architecture
• Not an extension of the x86 series
• Not an adaptation of HP's 64-bit RISC architecture
• Designed to exploit increasing chip transistor counts and increasing speeds
• Utilizes systematic parallelism
• A departure from the superscalar trend
Note: Became the architecture of the Intel Itanium
Basic Concepts for IA-64
• Instruction-level parallelism: EXPLICIT in the machine instructions, rather than determined at run time by the processor
• Long or very long instruction words (LIW/VLIW): fetch bigger chunks that are already "preprocessed"
• Predicated execution: marking groups of instructions for a late decision on whether their "execution" commits
• Control speculation (speculative loading): go ahead and fetch, decode, and even load early, but keep track of these instructions so the decision to commit them, or not, can be practically made later
• Data speculation (advanced loads): go ahead and move a load ahead of a store that may touch the same data so it is ready when needed, and have a practical way to recover if the speculation proves wrong
• Software pipelining: multiple iterations of a loop can be executed in parallel
• "Revolvable" register stack: stack frames are programmable and used to reduce unnecessary movement of data on procedure calls
IA-64 Key Hardware Features
• Large number of registers: the IA-64 instruction format assumes 256 registers
• 128 × 64-bit integer, logical & general-purpose registers
• 128 × 82-bit floating-point and graphics registers
• 64 × 1-bit predicate registers (to support a high degree of parallelism)
• Multiple execution units: probably 8 or more, pipelined
Predicate Registers
• Used as a flag for instructions that may or may not be executed
• A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch)
• Only instructions with a predicate value of true commit their results
• When it becomes known that the instructions will be executed, their predicate is set to true and all instructions with that predicate can now be completed
• Instructions whose predicate is false are discarded and become candidates for cleanup
Instruction Format: 128-bit bundles
• Can fetch one or more bundles at a time
• A bundle holds three instructions plus a template
• Instructions are 41 bits long
• Most instructions have an associated predicate register field
• The template contains info on which instructions can be executed in parallel
• Parallel groups are not confined to a single bundle; e.g. a stream of 8 instructions may be executed in parallel
• The compiler will have re-ordered instructions to form contiguous bundles
• Can mix dependent and independent instructions in the same bundle
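The arithmetic works out exactly: one 5-bit template plus three 41-bit instruction slots is 5 + 3 × 41 = 128 bits. Below is a minimal C sketch of unpacking such a bundle, assuming the documented layout (template in bits 0–4, slot 0 in bits 5–45, slot 1 in bits 46–86, slot 2 in bits 87–127); the two-word struct and function names are illustrative, not a hardware interface.

#include <stdint.h>
#include <stdio.h>

/* A 128-bit bundle held as two 64-bit words (lo = bits 0..63, hi = bits 64..127). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} bundle_t;

/* The 5-bit template occupies bits 0..4. */
static unsigned bundle_template(bundle_t b) {
    return (unsigned)(b.lo & 0x1F);
}

/* Each of the three instruction slots is 41 bits wide. */
static uint64_t bundle_slot(bundle_t b, int slot) {
    int start = 5 + slot * 41;                 /* bit position of the slot's LSB */
    uint64_t mask = (1ULL << 41) - 1;
    if (start >= 64)                           /* slot 2: entirely in the high word */
        return (b.hi >> (start - 64)) & mask;
    if (start + 41 <= 64)                      /* slot 0: entirely in the low word  */
        return (b.lo >> start) & mask;
    /* slot 1 straddles the 64-bit boundary: combine pieces from both words */
    return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
}

int main(void) {
    bundle_t b = { 0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL };  /* arbitrary bits */
    printf("template = %u\n", bundle_template(b));
    for (int s = 0; s < 3; s++)
        printf("slot %d  = 0x%011llx\n", s, (unsigned long long)bundle_slot(b, s));
    return 0;
}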
IA-64 Execution Units
• I-Unit: integer arithmetic, shift and add, logical, compare, integer multimedia ops
• M-Unit: load and store (between register and memory), some integer ALU operations
• B-Unit: branch instructions
• F-Unit: floating-point instructions
Field Encoding & Instruction Set Mapping
Note: a bar indicates a stop: possible dependencies between instructions before the stop and instructions after it
Assembly Language Format
[qp] mnemonic [.comp] dest = srcs ;; // comment
• qp – predicate register: if 1 at execution time, the instruction executes and its result is committed to hardware; if 0, the result is discarded
• mnemonic – name of the instruction
• comp – one or more instruction completers used to qualify the mnemonic
• dest – one or more destination operands
• srcs – one or more source operands
• ;; – a stop, marking the end of an instruction group: a sequence of instructions without register hazards (read-after-write, write-after-write, ...)
• // – a comment follows
Assembly Example – Register Dependency
ld8 r1 = [r5] ;;    // first group
add r3 = r1, r4     // second group
• The second instruction depends on the value in r1, which is changed by the first instruction
• The two cannot be in the same group for parallel execution
• Note: ;; ends the group of instructions that can be executed in parallel
Assembly Example – Multiple Register Dependencies
ld8 r1 = [r5]        // first group
sub r6 = r8, r9 ;;   // first group
add r3 = r1, r4      // second group
st8 [r6] = r12       // second group
• The last instruction stores to the memory location whose address is in r6, which is established by the second instruction
Assembly Example – Predicated Code
Consider the following program with branches:
if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;
Assembly Example – Predicated Code
Source Code:
if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;
Pentium Assembly Code:
    cmp a, 0       ; compare a with 0
    je L1          ; branch to L1 if a = 0
    cmp b, 0
    je L1
    add j, 1       ; j = j + 1
    jmp L3
L1: cmp c, 0
    je L2
    add k, 1       ; k = k + 1
    jmp L3
L2: sub k, 1       ; k = k - 1
L3: add i, 1       ; i = i + 1
IA-64 Code:
      cmp.eq p1, p2 = 0, a ;;
(p2)  cmp.eq p1, p3 = 0, b
(p3)  add j = 1, j
(p1)  cmp.ne p4, p5 = 0, c
(p4)  add k = 1, k
(p5)  add k = -1, k
      add i = 1, i
Example of Predication
IA-64 Code:
      cmp.eq p1, p2 = 0, a ;;   // p1 = (a == 0), p2 = (a != 0)
(p2)  cmp.eq p1, p3 = 0, b      // only if a != 0: p1 = (b == 0), p3 = (b != 0)
(p3)  add j = 1, j              // executed only if a != 0 and b != 0
(p1)  cmp.ne p4, p5 = 0, c      // only if !(a && b): p4 = (c != 0), p5 = (c == 0)
(p4)  add k = 1, k              // k = k + 1
(p5)  add k = -1, k             // k = k - 1
      add i = 1, i              // always executed
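Every one of these instructions can be fetched and scheduled without a branch; the predicate only decides whether the result is committed. Below is a minimal C model of the same predicate logic, assuming sample input values; the if statements stand in for the 1-bit predicate guards, and the variable and predicate names mirror the code above.

#include <stdbool.h>
#include <stdio.h>

int main(void) {
    int a = 1, b = 0, c = 1;        /* sample inputs (illustrative) */
    int i = 0, j = 0, k = 0;

    /* cmp.eq p1, p2 = 0, a : p1 and p2 are complementary */
    bool p1 = (a == 0);
    bool p2 = (a != 0);

    /* (p2) cmp.eq p1, p3 = 0, b : rewrites the predicates only if p2 is true */
    bool p3 = false;
    if (p2) { p1 = (b == 0); p3 = (b != 0); }

    /* (p3) add j = 1, j */
    if (p3) j = j + 1;

    /* (p1) cmp.ne p4, p5 = 0, c */
    bool p4 = false, p5 = false;
    if (p1) { p4 = (c != 0); p5 = (c == 0); }

    /* (p4) add k = 1, k   and   (p5) add k = -1, k */
    if (p4) k = k + 1;
    if (p5) k = k - 1;

    /* add i = 1, i : unpredicated, always committed */
    i = i + 1;

    printf("i=%d j=%d k=%d\n", i, j, k);   /* i=1 j=0 k=1 for these inputs */
    return 0;
}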
Speculation (Speculative Loading)
• Load data from memory before it is needed
• What might go wrong? The hoisted load could fault or lie on a path that is never executed (control speculation), or a store between the load's new and original positions could change the value (data speculation)
• Need a subsequent check on the loaded value
Assembly Example – Control Speculation
Consider the following code:
(p1) br some_label      // cycle 0
     ld8 r1 = [r5] ;;   // cycle 0 (indirect memory op – 2 cycles)
     add r1 = r1, r3    // cycle 2
Assembly Example – Control Speculation
Original code:
(p1) br some_label      // cycle 0
     ld8 r1 = [r5] ;;   // cycle 0
     add r1 = r1, r3    // cycle 2
Speculated code:
     ld8.s r1 = [r5] ;; // cycle -2: speculative load, hoisted above the branch
     // other instructions
(p1) br some_label      // cycle 0
     chk.s r1, recovery // cycle 0: check for a deferred fault
     add r2 = r1, r3    // cycle 0
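The point of ld8.s / chk.s is that a load hoisted above its guarding branch must not fault immediately; any fault is recorded and only acted on if control actually reaches the check. Below is a rough C sketch of that idea, using a NULL pointer as the only possible "fault" and a struct flag in place of the NaT bit; the names ld8_s, compute, and the recovery path are illustrative assumptions.

#include <stddef.h>
#include <stdio.h>

typedef struct {
    long value;
    int  deferred_fault;    /* stands in for the NaT bit set by a failed ld8.s */
} spec_load_t;

/* "ld8.s": load speculatively; record a deferred fault instead of trapping. */
static spec_load_t ld8_s(const long *addr) {
    spec_load_t r = { 0, 1 };
    if (addr != NULL) { r.value = *addr; r.deferred_fault = 0; }
    return r;
}

static long compute(const long *p, long r3, int take_branch) {
    spec_load_t r1 = ld8_s(p);          /* ld8.s r1 = [r5], hoisted above the branch */

    if (take_branch)                    /* (p1) br some_label: result never consumed */
        return 0;

    if (r1.deferred_fault) {            /* chk.s r1, recovery */
        r1.value = *p;                  /* recovery: redo the load non-speculatively;
                                           a real fault would surface here, where the
                                           original code would have faulted anyway   */
        r1.deferred_fault = 0;
    }
    return r1.value + r3;               /* add r2 = r1, r3 */
}

int main(void) {
    long x = 40;
    printf("%ld\n", compute(&x, 2, 0)); /* prints 42 */
    return 0;
}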
Assembly Example – Data Speculation
Consider the following code:
st8 [r4] = r12       // cycle 0
ld8 r6 = [r8] ;;     // cycle 0 (indirect memory op – 2 cycles)
add r5 = r6, r7 ;;   // cycle 2
st8 [r18] = r5       // cycle 3
What if r4 and r8 point to the same address?
Assembly Example – Data Speculation
Without data speculation:
st8 [r4] = r12       // cycle 0
ld8 r6 = [r8] ;;     // cycle 0
add r5 = r6, r7 ;;   // cycle 2
st8 [r18] = r5       // cycle 3
With data speculation:
ld8.a r6 = [r8] ;;   // cycle -2: advanced load
// other instructions
st8 [r4] = r12       // cycle 0
ld8.c r6 = [r8]      // cycle 0: check load
add r5 = r6, r7 ;;   // cycle 0
st8 [r18] = r5       // cycle 1
Note: The Advanced Load Address Table (ALAT) is checked for an entry for the load. It should be there; if another access has been made to that target address, the entry will have been removed and the check load re-executes the load.
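The bookkeeping behind ld8.a / ld8.c is the ALAT: the advanced load records its target address, any later store to that address removes the entry, and the check load reloads only if the entry is gone. Below is a minimal single-entry C model of that mechanism; the one-entry table and the function names ld8_a, st8, ld8_c are illustrative simplifications (the real table has many entries).

#include <stdio.h>

/* A one-entry stand-in for the ALAT: remembers the address of the most
   recent advanced load. */
static const long *alat_addr = NULL;

static long ld8_a(const long *addr) {       /* ld8.a: load and record the address */
    alat_addr = addr;
    return *addr;
}

static void st8(long *addr, long value) {   /* st8: store and snoop the ALAT */
    if (alat_addr == addr)
        alat_addr = NULL;                   /* conflicting store: drop the entry */
    *addr = value;
}

static long ld8_c(const long *addr, long speculated) {   /* ld8.c: check load */
    if (alat_addr == addr)
        return speculated;                  /* entry still present: value is good */
    return *addr;                           /* entry removed: reload the real value */
}

int main(void) {
    long mem[1] = { 10 };
    long *r4 = &mem[0], *r8 = &mem[0];      /* worst case: r4 and r8 alias */
    long r7 = 5, r12 = 99;

    long r6 = ld8_a(r8);                    /* advanced load, hoisted early       */
    st8(r4, r12);                           /* store hits the same address        */
    r6 = ld8_c(r8, r6);                     /* check load: reloads because it did */
    long r5 = r6 + r7;

    printf("r5 = %ld\n", r5);               /* 104, not the stale 15 */
    return 0;
}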
Assembly Example – Data Speculation
Consider the same code when there is an additional data dependency: the add consumes r6 before the potentially conflicting store, so a simple check load is no longer enough.
Speculation (check load, as before):
ld8.a r6 = [r8] ;;   // cycle -2: advanced load
// other instructions
st8 [r4] = r12       // cycle 0
ld8.c r6 = [r8]      // cycle 0
add r5 = r6, r7 ;;   // cycle 0
st8 [r18] = r5       // cycle 1
Speculation with the data dependency (advanced check plus recovery code):
        ld8.a r6 = [r8] ;;   // cycle -3: advanced load
        // other instructions
        add r5 = r6, r7      // cycle -1: uses r6
        // other instructions
        st8 [r4] = r12       // cycle 0
        chk.a r6, recover    // cycle 0: check
back:                        // return point
        st8 [r18] = r5       // cycle 0
recover:
        ld8 r6 = [r8] ;;     // get r6 from [r8] again
        add r5 = r6, r7 ;;   // re-execute the add
        br back              // jump back
Software Pipelining
Consider a loop that computes y[i] = x[i] + c:
L1: ld4 r4 = [r5], 4 ;;   // cycle 0: load, post-increment address by 4
    add r7 = r4, r9 ;;    // cycle 2: r9 holds c
    st4 [r6] = r7, 4      // cycle 3: store, post-increment address by 4
    br.cloop L1 ;;        // cycle 3
• Adds a constant to one vector and stores the result in another
• No opportunity for instruction-level parallelism within one iteration
• The instructions of iteration x are all executed before iteration x+1 begins
Pipeline – Unrolled Loop
Original loop:
L1: ld4 r4 = [r5], 4 ;;   // cycle 0: load, post-increment address by 4
    add r7 = r4, r9 ;;    // cycle 2
    st4 [r6] = r7, 4      // cycle 3: store, post-increment address by 4
    br.cloop L1 ;;        // cycle 3
Unrolled loop (5 iterations):
ld4 r32 = [r5], 4 ;;      // cycle 0
ld4 r33 = [r5], 4 ;;      // cycle 1
ld4 r34 = [r5], 4         // cycle 2
add r36 = r32, r9 ;;      // cycle 2
ld4 r35 = [r5], 4         // cycle 3
add r37 = r33, r9         // cycle 3
st4 [r6] = r36, 4 ;;      // cycle 3
ld4 r36 = [r5], 4         // cycle 3
add r38 = r34, r9         // cycle 4
st4 [r6] = r37, 4 ;;      // cycle 4
add r39 = r35, r9         // cycle 5
st4 [r6] = r38, 4 ;;      // cycle 5
add r40 = r36, r9         // cycle 6
st4 [r6] = r39, 4 ;;      // cycle 6
st4 [r6] = r40, 4 ;;      // cycle 7
Pipeline display: loads, adds, and stores from different iterations overlap in the same cycles (figure not reproduced).
Mechanisms for "Unrolling" Loops
• Automatic register renaming: r32–r127, fr32–fr127, and pr16–pr63 are capable of rotation, giving automatic renaming of registers
• Predication of loops: each instruction in a given loop is predicated; in the prolog, one additional stage predicate becomes true each cycle; in the kernel, all n stage predicates are true; in the epilog, one additional predicate becomes false each cycle
• Special loop-termination instructions: the loop count and epilog count are used to determine when the loop is complete and the process stops
A sketch of the resulting prolog/kernel/epilog structure follows.
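Here is a C-level picture of what the rotating registers and stage predicates achieve for y[i] = x[i] + c: each trip of the loop below is one "cycle" in which up to three different iterations are in flight at once. This is a conceptual sketch, not IA-64 code; the p_load/p_add/p_store flags stand in for the stage predicates, and the two scalar variables stand in for rotating registers.

#include <stdio.h>

#define N 5          /* number of loop iterations */
#define STAGES 3     /* load -> add -> store      */

int main(void) {
    long x[N] = { 1, 2, 3, 4, 5 };
    long y[N];
    long c = 100;

    long loaded = 0, summed = 0;   /* stand-ins for rotating registers */

    /* One trip = one kernel "cycle".  The stage predicates switch on one by
       one during the prolog and switch off one by one during the epilog,
       giving N + STAGES - 1 = 7 trips for 5 iterations. */
    for (int t = 0; t < N + STAGES - 1; t++) {
        int p_store = (t >= 2) && (t - 2 < N);
        int p_add   = (t >= 1) && (t - 1 < N);
        int p_load  = (t < N);

        /* Within a trip, three different iterations are worked on "in parallel".
           Ordering store -> add -> load in C preserves the values that real
           hardware would keep in separate rotating registers. */
        if (p_store) y[t - 2] = summed;     /* store stage: iteration t-2 */
        if (p_add)   summed = loaded + c;   /* add stage:   iteration t-1 */
        if (p_load)  loaded = x[t];         /* load stage:  iteration t   */
    }

    for (int i = 0; i < N; i++)
        printf("y[%d] = %ld\n", i, y[i]);
    return 0;
}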
Unrolled Loop Example – Observations
• Completes 5 iterations in 7 cycles, compared with 20 cycles for the original code (4 cycles per iteration × 5 iterations)
• Assumes two memory ports, so a load and a store can be done in parallel
IA-64 Register Stack
• The register stack mechanism avoids unnecessary movement of register data during procedure call and return (r32–r127 are used in rotation)
• The number of local and pass/return (output) registers is specifiable
• Register renaming allows a caller's locals to become hidden and its pass/return registers to become the callee's locals on a call, and to change back on a return
• If the stacking mechanism runs out of physical registers, the oldest stacked registers are spilled to memory
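A minimal C model of the register-stack idea: each procedure sees its stacked registers starting at r32, a call simply moves a base index so that the caller's output registers become the callee's first registers, and no register contents are copied. The array, the reg() helper, and the frame sizes are illustrative assumptions; the real hardware renames within the 96 stacked registers and lets a Register Stack Engine spill and refill frames to memory when they no longer fit.

#include <stdio.h>

static long stacked[1024];    /* big enough that this toy never has to spill */
static int  base = 0;         /* index of the current frame's r32            */

/* reg(n) is the current frame's architectural register r(32+n). */
static long *reg(int n) { return &stacked[base + n]; }

static long callee(void) {
    /* The caller's two output registers are now this frame's r32 and r33. */
    long sum = *reg(0) + *reg(1);
    *reg(0) = sum;            /* leave the result where the caller will see it */
    return sum;
}

int main(void) {
    /* Caller frame: 3 locals (r32..r34) + 2 outputs (r35, r36). */
    int locals = 3;
    *reg(3) = 40;             /* r35 = 40 (first output)  */
    *reg(4) = 2;              /* r36 = 2  (second output) */

    /* "Call": rename so the outputs become the callee's r32, r33.
       Only the base moves; no data is copied between registers or to memory. */
    int saved_base = base;
    base += locals;
    callee();
    base = saved_base;        /* "Return": restore the caller's frame */

    printf("result in r35: %ld\n", *reg(3));   /* prints 42 */
    return 0;
}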