ECE 463/563, Fall '18
Additional ILP topic #5: VLIW
Also: ISA topics
Prof. Eric Rotenberg
ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg
Static / Dynamic Scheduling
• Dynamic: SUPERSCALAR PROCESSOR
• Static: VLIW PROCESSOR
[Figure: two block diagrams, each showing a compiler feeding code to hardware with four functional units (FU). SUPERSCALAR PROCESSOR: the hardware dynamically re-orders instructions from moderately scheduled code. VLIW PROCESSOR: the compiler statically re-orders instructions.]
VLIW challenge #1: branches
• Branches limit the static scheduling window
• “Basic block” (compiler and architecture term)
  • Formal definition: a code fragment with a single entry and a single exit
  • Practical definition (a little different): a code fragment ending in a branch
• Local scheduling, limited to within a single basic block, does not expose enough ILP
• How to form larger scheduling regions across multiple branches:
  • Loop unrolling
    • Unroll the loop body N times for a larger loop body
    • Also reduces the number of dynamic branches (lower IC)
  • Software pipelining (modulo scheduling)
    • Schedule dependent instructions from the same original iteration across software-pipelined iterations
    • This puts independent instructions from different original iterations in the same software-pipelined iteration
    • Like loop unrolling, without literal unrolling
  • Trace scheduling: speculate a path and branch to fix-up code if mispredicted
  • Predication: execute both paths of a branch, commit one path
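As a rough illustration of loop unrolling, here is a sketch in Python rather than real compiler output. The function names, the unroll factor of 4, and the use of an iteration counter as a stand-in for dynamic loop branches are all assumptions made for this example. The unrolled body contains four independent additions that a static scheduler could place side by side, and it takes one loop-back branch per four elements.

```python
def sum_rolled(a):
    """Original loop: one loop-back branch per element."""
    total = 0
    branches = 0  # proxy for dynamic loop branches (one per iteration)
    i = 0
    while i < len(a):
        total += a[i]
        i += 1
        branches += 1
    return total, branches


def sum_unrolled4(a):
    """Loop body unrolled 4 times, plus a cleanup loop for leftover elements."""
    s0 = s1 = s2 = s3 = 0  # separate accumulators keep the four adds independent
    branches = 0
    i = 0
    n4 = len(a) - len(a) % 4  # largest multiple of 4 that fits
    while i < n4:
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
        i += 4
        branches += 1
    total = s0 + s1 + s2 + s3
    while i < len(a):  # cleanup loop for the remaining 0 to 3 elements
        total += a[i]
        i += 1
        branches += 1
    return total, branches
```

Both functions return the same sum; only the number of loop iterations (and hence loop branches) differs.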
Predication
• ISA role
  • Provide predicate registers
  • Provide predicate-setting instructions (e.g., compare)
  • Some (partial predication) or all (full predication) opcodes can be guarded with predicates
• Compiler role
  • Replace branch with predicate computation
  • Guard alternate paths with <predicate> and <!predicate>
• Hardware role
  • Execute all predicated code, i.e., both paths after a branch
  • Do not commit either path until the predicate is known
  • Conditionally commit or squash, depending on the predicate

Source code:
    if (x > 0) y += z;
    else y -= z;

Assembly code:
    A: blte r1,#0,D
    B: add r2,r2,r3
    C: jump E
    D: sub r2,r2,r3
    E:

Predicated assembly code (branch-free):
    A: cgt p1,r1,#0       // p1 = (r1 > 0)
    B: add r2,r2,r3,p1    // (p1 ? r2=r2+r3 : NOP)
    C: sub r2,r2,r3,!p1   // (!p1 ? r2=r2-r3 : NOP)
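The if-conversion above can be mimicked in software. This is only a sketch: the function names are made up for this example, and Python's conditional expressions stand in for the hardware's commit-or-squash step. Both the add and the sub are computed, and the predicate decides which result is committed.

```python
def branchy(x, y, z):
    """Source-level version with a branch."""
    if x > 0:
        y += z
    else:
        y -= z
    return y


def predicated(x, y, z):
    """Branch-free version that mirrors the predicated assembly."""
    p1 = x > 0                        # cgt p1,r1,#0
    add_result = y + z                # add is executed regardless of p1
    sub_result = y - z                # sub is executed regardless of p1
    y = add_result if p1 else y       # commit add only if p1, else squash
    y = sub_result if not p1 else y   # commit sub only if !p1, else squash
    return y
```

For any inputs the two functions return the same value; the predicated one simply does the work of both paths.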
VLIW challenge #2: registers
• Static scheduling window is limited by the number of architectural registers
• Just as OOO needs more rename registers (ROB) for a larger dynamic scheduling window

Scenario #1: Only r1-r3 available.
    load r1, A
    add  r2, r2, r1
    load r1, B        // anti-dependency with add instr. through r1
    add  r3, r3, r1

Scenario #2: r1-r4 available.
    load r1, A
    add  r2, r2, r1
    load r4, B        // no anti-dependency: this load can be scheduled above the first add
    add  r3, r3, r4
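The difference between the two scenarios can be checked mechanically. The sketch below is an illustration only: the `(dest, sources)` tuple encoding and the name `can_hoist` are hypothetical, and memory operands are ignored. It tests whether a later instruction may be moved above an earlier one by looking for register dependences between them.

```python
def can_hoist(later, earlier):
    """Return True if `later` may be scheduled above `earlier`.

    Each instruction is a (dest_register, source_registers) tuple.
    Only register dependences are checked; memory is ignored.
    """
    l_dest, l_srcs = later
    e_dest, e_srcs = earlier
    raw = e_dest in l_srcs   # true dependence: later reads what earlier writes
    war = l_dest in e_srcs   # anti-dependence: later overwrites what earlier reads
    waw = l_dest == e_dest   # output dependence: both write the same register
    return not (raw or war or waw)


add_r2 = ("r2", ("r2", "r1"))   # add r2, r2, r1
load_b_s1 = ("r1", ())          # Scenario #1: load r1, B
load_b_s2 = ("r4", ())          # Scenario #2: load r4, B
```

With these encodings, `can_hoist(load_b_s1, add_r2)` is false because of the anti-dependency through r1, while `can_hoist(load_b_s2, add_r2)` is true.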
VLIW challenge #3: cache misses
• Uncertain, disparate latencies are hard to schedule statically
• Hence OOO in general-purpose high-performance processors
What is an ISA?
• Instruction Set Architecture (ISA) is a specification of the hardware-software interface
• ISA defines:
  • STATE OF THE PROGRAM:
    • Registers
      • General-purpose registers, for example:
        • Integer registers
        • Floating-point registers
        • SIMD / Vector registers
      • Special-purpose registers, for example:
        • Program counter (PC)
        • Condition code registers
        • System control registers
        • TLB
        • Other…
    • Memory
  • WHAT INSTRUCTIONS DO: which state they use, which state they update, and how
  • HOW INSTRUCTIONS ARE REPRESENTED: instruction formats, bit encodings
Why is the ISA important?
• Choices affect performance, cost, and power, often in complex ways. We’ll use classic CISC vs. RISC distinctions to demonstrate how choices affect:
  • memory cost for instructions
  • performance factors: IC (instruction count), CPI (cycles per instruction), and CT (cycle time)
CISC vs. RISC
• CISC: Complex Instruction Set Computing
  • Many, diverse instructions
  • Individual instructions encapsulate a lot of work
    • E.g., complex addressing modes, exotic instructions, etc.
  • Variable-length instruction encodings
  • Memory-memory or register-memory architecture
    • Arithmetic instructions can have both register and memory operands
    • The implication is that an arithmetic instruction is more than an ALU operation: it has one or more implied load/store operations to load operands from memory or store results to memory
• RISC: Reduced Instruction Set Computing
  • Fewer, less diverse instructions
  • Instructions are primitive (each conveys a small piece of work)
  • Fixed-length instruction encodings
  • Load-store architecture
    • Arithmetic instructions can have only register operands
    • Memory is accessed only via explicit load and store instructions
CISC vs. RISC (cont.)
• Memory cost for instructions
  • CISC binaries are generally more compact than RISC binaries due to:
    • Variable-length instruction encoding
    • More work conveyed by a single instruction
• Performance factors (IC, CPI, CT)
  • CISC expresses the same amount of work with fewer dynamic instructions
    • ↓ IC
  • RISC lends itself to more efficient, higher-performance pipelines (CISC has workarounds, e.g., x86 micro-ops: see next slide)
    • Fixed-length instruction encodings => easier to decode multiple instructions in parallel, since you know in advance where each instruction in the decode bundle starts and ends
    • Simple instructions, uniformity (just a few major classes of instructions) => efficient pipeline
    • ↓ CPI, CT -or- lower cost and power for same CPI, CT
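The three performance factors combine multiplicatively: execution time = IC × CPI × CT. The sketch below just evaluates that product. The two sets of numbers are invented purely to show how a lower IC can be outweighed by a higher CPI and CT; they are not measurements of any real CISC or RISC machine, and the names are hypothetical.

```python
def execution_time(ic, cpi, ct):
    """Execution time in seconds = instruction count x cycles/instr x seconds/cycle."""
    return ic * cpi * ct


# Hypothetical design A: fewer dynamic instructions, but higher CPI and longer cycle.
time_a = execution_time(ic=1_000_000, cpi=3.0, ct=1.0e-9)

# Hypothetical design B: more dynamic instructions, but lower CPI and shorter cycle.
time_b = execution_time(ic=1_500_000, cpi=1.2, ct=0.8e-9)
```

With these made-up inputs design B finishes sooner despite executing 50% more instructions; different inputs can just as easily reverse the outcome.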
CISC vs. RISC (cont.)
• How to design efficient pipelines for CISC ISAs
  • Micro-ops
    • In the decode stage, crack CISC instructions into one or more RISC-like micro-operations
    • All subsequent pipeline stages are designed for an internal ISA that looks RISC
    • Examples: x86 micro-ops in Intel and AMD processors
  • Binary translation
    • The program binary is first translated from one ISA to another, and the translated binary is what is run
    • Static binary translation: translate once, use this binary over and over
    • Dynamic binary translation: translate each time the program is run, either all at once at the beginning or incrementally as the program runs
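To make the idea of cracking concrete, here is a toy decoder sketch. Everything in it is hypothetical: the two-operand instruction form, the bracket syntax for memory operands, the temporary registers `t0`/`t1`, and the micro-op tuples. It does not reflect the actual micro-op format of any Intel or AMD processor; it only shows a register-memory instruction turning into explicit load, ALU, and store steps.

```python
def is_mem(operand):
    """Memory operands are written in brackets, e.g. "[r2]" (made-up syntax)."""
    return operand.startswith("[")


def crack(opcode, dest, src):
    """Crack `opcode dest, src` (meaning dest = dest op src) into micro-ops."""
    uops = []
    if is_mem(src):
        uops.append(("load", "t0", src))         # implied load of the source operand
        src = "t0"
    if is_mem(dest):
        uops.append(("load", "t1", dest))        # implied load of the destination operand
        uops.append((opcode, "t1", "t1", src))   # register-only ALU micro-op
        uops.append(("store", dest, "t1"))       # implied store of the result
    else:
        uops.append((opcode, dest, dest, src))   # already register-only
    return uops
```

For example, `crack("add", "r1", "[r2]")` yields a load into `t0` followed by a register-register add, while a register-only instruction passes through as a single micro-op.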
Outline of Miscellaneous ISA Topics
• Small issues you need to be aware of
  • Alignment
  • Endian-ness
• Expressing parallelism in ISAs
Load and Store Instructions
• Different load and store instructions for different data sizes
  • load byte / store byte (1 byte)
  • load halfword / store halfword (2 bytes)
  • load word / store word (4 bytes)
  • load doubleword / store doubleword (8 bytes)
• Anything larger than a byte introduces the issue of “aligned” versus “unaligned” accesses

    if (load/store address is an integer multiple of the data size)
        access is “aligned”
    else
        access is “unaligned”
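The alignment rule above translates directly into a one-line check. The function name is chosen for this example only.

```python
def is_aligned(address, size):
    """An access is aligned if its address is an integer multiple of the data size."""
    return address % size == 0


# Classify a word (4-byte) access at each byte address 0..7.
word_alignment = {addr: is_aligned(addr, 4) for addr in range(8)}
```

Byte accesses (size 1) are always aligned under this definition, which matches the slide's remark that the issue only arises for anything larger than a byte.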
[Figure: byte-addressed memory (addresses 0-7). Four panels illustrate aligned halfword accesses; three panels illustrate unaligned halfword accesses.]
[Figure: byte-addressed memory (addresses 0-7). Two panels illustrate aligned word accesses; three panels illustrate unaligned word accesses.]
Impact of Unaligned Accesses on Hardware Complexity
[Figure: two 8-byte cache blocks (cache block #1: bytes 0-7, cache block #2: bytes 8-15) with word-aligned access boundaries marked, drawn once for an aligned word access and once for an unaligned word access.]
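One source of the added hardware complexity is that an unaligned access can straddle a cache-block boundary, whereas an aligned access no larger than a block never does. The sketch below assumes the 8-byte blocks shown in the figure; the function name and the 0-based block indices (block 0 corresponds to the figure's cache block #1) are choices made for this example.

```python
BLOCK_SIZE = 8  # bytes per cache block, as in the figure


def blocks_touched(address, size, block_size=BLOCK_SIZE):
    """Return the 0-based indices of the cache blocks an access touches."""
    first = address // block_size
    last = (address + size - 1) // block_size
    return list(range(first, last + 1))
```

An aligned word at address 4 stays inside one block, while an unaligned word at address 6 covers bytes 6-9 and therefore needs both blocks.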
How software can help
• Pad data structures
      struct { char c; int w; }
• Emulate unaligned access with multiple instructions
  • 2 word-aligned load instructions
    • The unaligned word spans two aligned words
  • 2 AND instructions to mask unused bytes in the two aligned words
  • 1 OR instruction to merge the useful bytes from the two aligned words
  • 1 rotate instruction to reorder the bytes
[Figure, left: memory layout of the struct without padding (c at byte 0, w starting at byte 1) and with padding (c at byte 0, pad in bytes 1-3, w in bytes 4-7).]
[Figure, right: emulating an unaligned word access. Load #1 fetches bytes 0 1 2 3; load #2 fetches bytes 4 5 6 7; 2 AND and 1 OR produce 4 5 2 3; the rotate produces 2 3 4 5.]
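The emulation sequence can be traced in software. This is a sketch under stated assumptions: memory is a Python `bytes` object, registers are 32-bit integers, bytes are placed in a register in big-endian order so that the intermediate values match the byte order drawn in the figure, and all function names are invented for this example.

```python
MASK32 = 0xFFFFFFFF


def load_word_aligned(mem, addr):
    """Aligned 4-byte load; byte at the lowest address lands in the top byte."""
    assert addr % 4 == 0, "address must be word-aligned"
    return int.from_bytes(mem[addr:addr + 4], "big")


def rotl32(x, bits):
    """Rotate a 32-bit value left by `bits`."""
    bits %= 32
    return ((x << bits) | (x >> (32 - bits))) & MASK32


def load_word_unaligned(mem, addr):
    """Emulate an unaligned word load: 2 aligned loads, 2 ANDs, 1 OR, 1 rotate."""
    k = addr % 4                          # byte offset within the first aligned word
    if k == 0:
        return load_word_aligned(mem, addr)
    base = addr - k
    w1 = load_word_aligned(mem, base)     # load #1
    w2 = load_word_aligned(mem, base + 4) # load #2
    keep_lo = MASK32 >> (8 * k)           # keeps the last 4-k bytes of w1
    keep_hi = ~keep_lo & MASK32           # keeps the first k bytes of w2
    merged = (w1 & keep_lo) | (w2 & keep_hi)  # 2 ANDs + 1 OR
    return rotl32(merged, 8 * k)          # rotate bytes into address order


memory = bytes(range(8))                  # byte at address i holds the value i
```

With `memory` as above and address 2, the merged value is 0x04050203 and the rotate turns it into 0x02030405, which corresponds to the figure's "4 5 2 3" and "2 3 4 5" steps.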
Alignment Policy: Example of ISA affecting hardware/software tradeoffs
• Hardware handles unaligned accesses: ↑ CPI, ↑ energy, ↑ h/w cost
• Software avoids or emulates unaligned accesses: ↑ IC, ↑ energy (emulation) -or- ↑ memory waste (padding)
Endian-ness
• Consider the integer value 0xDEADBEEF
  • 4-byte value
• When a 4-byte value is stored in memory, where is the most-significant byte (MSB)?
• Choice is arbitrary, and two camps evolved:
  • Big-endian (e.g., IBM PowerPC): byte address 0 is the “big end”: it contains the MSB
  • Little-endian (e.g., x86): byte address 0 is the “little end”: it contains the LSB

Memory (bytes):
    address   big-endian data   little-endian data
    0         DE                EF
    1         AD                BE
    2         BE                AD
    3         EF                DE
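The two memory images can be reproduced with Python's built-in `int.to_bytes`, which takes the byte order as an argument. The helper name `memory_image` is made up for this example.

```python
VALUE = 0xDEADBEEF


def memory_image(value, byteorder):
    """Return the 4 bytes of `value` as hex strings, in increasing address order."""
    return [f"{b:02X}" for b in value.to_bytes(4, byteorder)]


big_endian_image = memory_image(VALUE, "big")        # address 0 holds the MSB
little_endian_image = memory_image(VALUE, "little")  # address 0 holds the LSB
```

Reading the bytes back with the wrong byte order gives a different integer, which is why data exchanged between big-endian and little-endian machines has to be byte-swapped.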
Expressing Parallelism in the ISA
• Most commercial ISAs are “Von Neumann architecture”
  • Very sequential, doesn’t express parallelism
  • Superscalar processors expend significant effort uncovering inherent instruction-level parallelism
• Unconventional ISAs express parallelism explicitly
  • SIMD and Vector: express data-level parallelism
    • Most commercial ISAs (x86, ARM, MIPS, Power) now incorporate SIMD/vector as ISA extensions for multimedia and scientific computing
  • VLIW: express instruction-level parallelism
    • Commercial success stories: some general-purpose processors (e.g., Intel’s IA64), many digital signal processors (e.g., Texas Instruments DSPs)
  • Dataflow: eliminate the notion of a single program counter. Instruction sequencing is data-driven: producer instructions point to consumer instructions.