ECE 463/563 Fall `18

Presentation Transcript


  1. ECE 463/563 Fall `18. Additional ILP topic #5: VLIW. Also: ISA topics. Prof. Eric Rotenberg. ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg

  2. Static / Dynamic Scheduling
     • Dynamic: SUPERSCALAR PROCESSOR
       [Figure: compiler, moderately scheduled code, hardware that dynamically re-orders instructions, four FUs]
     • Static: VLIW PROCESSOR
       [Figure: compiler that statically re-orders instructions, moderately scheduled code, hardware, four FUs]

  3. VLIW challenge #1: branches
     • Branches limit the static scheduling window
     • “Basic block” (compiler and architecture term)
       • Formal definition: a code fragment with a single entry and a single exit
       • Practical definition (a little different): a code fragment ending in a branch
     • Local scheduling, limited to within a single basic block, does not expose enough ILP
     • How to form larger scheduling regions across multiple branches
       • Loop unrolling
         • Unroll loop body N times for larger loop body
         • Also reduces number of dynamic branches (lower IC)
       • Software pipelining (modulo scheduling)
         • Schedule dependent instructions from the same original iteration across sw pipelined iterations
         • This puts independent instructions from different original iterations in the same sw pipelined iteration
         • Like loop unrolling, without literal unrolling
       • Trace scheduling: speculate a path and branch to fix-up code if mispredicted
       • Predication: execute both paths of branch, commit one path
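
The loop-unrolling bullet can be illustrated with a small C sketch. The function names `sum_rolled` and `sum_unrolled4`, and the unroll factor of 4, are hypothetical choices for illustration and are not from the slides. The four separate accumulators are an extra assumption beyond plain unrolling: they make the four additions in the loop body independent of each other, which is the kind of ILP a static scheduler can pack into one wide instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one add and one loop branch per element. */
long sum_rolled(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Loop body unrolled 4 times: one loop branch per four elements,
 * and four independent adds per iteration. */
long sum_unrolled4(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    /* Clean-up loop for the leftover 0 to 3 elements. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions return the same sum; the unrolled version executes roughly a quarter as many loop branches, at the cost of a larger loop body and a clean-up loop.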

  4. Predication
     • ISA role
       • Provide predicate registers
       • Provide predicate-setting instructions (e.g., compare)
       • Some (partial predication) or all (full predication) opcodes can be guarded with predicates
     • Compiler role
       • Replace branch with predicate computation
       • Guard alternate paths with <predicate> and <!predicate>
     • Hardware role
       • Execute all predicated code, i.e., both paths after a branch
       • Do not commit either path until predicate is known
       • Conditionally commit or squash, depending on predicate
     Source code:
         if (x > 0) y += z; else y -= z;
     Assembly code:
         A: blte r1,#0,D
         B: add r2,r2,r3
         C: jump E
         D: sub r2,r2,r3
         E:
     Predicated assembly code (branch-free):
         A: cgt p1,r1,#0       // p1 = (r1 > 0)
         B: add r2,r2,r3,p1    // (p1 ? r2=r2+r3 : NOP)
         C: sub r2,r2,r3,!p1   // (!p1 ? r2=r2-r3 : NOP)
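
The predicated sequence above can be mimicked in C as a rough sketch. This is only a software emulation: real predication is performed by the hardware, and a C compiler is free to turn either function into branchy or branch-free machine code. The function names are hypothetical, and unsigned arithmetic is an assumption made so that computing the untaken path cannot trigger signed-overflow undefined behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy version, mirroring the source code on the slide. */
uint32_t update_branchy(int32_t x, uint32_t y, uint32_t z)
{
    if (x > 0)
        y += z;
    else
        y -= z;
    return y;
}

/* Branch-free version: compute the predicate, execute both paths,
 * then keep exactly one result (the "commit or squash" step). */
uint32_t update_predicated(int32_t x, uint32_t y, uint32_t z)
{
    uint32_t p1 = (uint32_t)(x > 0);   /* cgt p1,r1,#0            */
    uint32_t sum = y + z;              /* add path, always runs   */
    uint32_t diff = y - z;             /* sub path, always runs   */
    uint32_t mask = 0u - p1;           /* all ones if p1, else 0  */
    return (sum & mask) | (diff & ~mask);
}

/* Helper for spot-checking that both versions agree. */
void check_equivalent(int32_t x, uint32_t y, uint32_t z)
{
    assert(update_branchy(x, y, z) == update_predicated(x, y, z));
}
```

The branch-free version always does the work of both paths, which is the cost predication pays for removing the branch.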

  5. VLIW challenge #2: registers
     • Static scheduling window is limited by number of architectural registers
     • Just as OOO needs more register renaming registers (ROB) for a larger dynamic scheduling window
     Scenario #1: Only r1-r3 available.
         load r1, A
         add r2, r2, r1
         load r1, B        // anti-dependency with add instr. through r1
         add r3, r3, r1
     Scenario #2: r1-r4 available.
         load r1, A
         add r2, r2, r1
         load r4, B
         add r3, r3, r4

  6. VLIW challenge #3: cache misses
     • Uncertain, disparate latencies are hard to statically schedule
     • Hence OOO in general-purpose high-perf. processors

  7. What is an ISA?
     • Instruction Set Architecture (ISA) is a specification of the hardware-software interface
     • ISA defines:
       • STATE OF THE PROGRAM:
         • Registers
           • General-purpose registers, for example:
             • Integer registers
             • Floating-point registers
             • SIMD / Vector registers
           • Special-purpose registers, for example:
             • Program counter (PC)
             • Condition code registers
             • System control registers
             • TLB
             • Other…
         • Memory
       • WHAT INSTRUCTIONS DO: Which state they use, which state they update, and how
       • HOW INSTRUCTIONS ARE REPRESENTED: instr. formats, bit encodings

  8. Why is the ISA important?
     • Choices affect performance, cost, and power, often in complex ways. We’ll use classic CISC vs. RISC distinctions to demonstrate how choices affect:
       • memory cost for instructions
       • performance factors: IC, CPI, and CT

  9. CISC vs. RISC
     • CISC: Complex Instruction Set Computing
       • Many, diverse instructions
       • Individual instructions encapsulate a lot of work
         • E.g., complex addressing modes, exotic instructions, etc.
       • Variable-length instruction encodings
       • Memory-memory or register-memory architecture
         • Arithmetic instructions can have both register and memory operands
         • The implication is that an arithmetic instruction is more than an ALU operation. It has one or more implied load/store operations to load operands from memory / store operands to memory.
     • RISC: Reduced Instruction Set Computing
       • Fewer, less diverse instructions
       • Instructions are primitive (convey piecemeal amount of work)
       • Fixed-length instruction encodings
       • Load-store architecture
         • Arithmetic instructions can have only register operands
         • Memory only accessed via explicit load and store instructions

  10. CISC vs. RISC (cont.)
     • Memory cost for instructions
       • CISC binaries generally more compact than RISC binaries due to:
         • Variable-length instruction encoding
         • More work conveyed by a single instruction
     • Performance factors (IC, CPI, CT)
       • CISC expresses same amount of work with fewer dynamic instructions
         • ↓ IC
       • RISC lends itself to more efficient, higher performance pipelines (CISC has workarounds, e.g., x86 micro-ops: see next slide)
         • Fixed-length instruction encodings => Easier to decode multiple instructions in parallel since you know where each instruction in the decode bundle starts and ends, in advance.
         • Simple instructions, uniformity (just a few major classes of instructions) => Efficient pipeline.
         • ↓ CPI, CT -or- lower cost and power for same CPI, CT
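
The three performance factors named above combine multiplicatively. The slides do not spell out the abbreviations, so this assumes the usual definitions: IC is the dynamic instruction count, CPI is the average number of cycles per instruction, and CT is the clock cycle time.

```latex
\text{Execution time} = \text{IC} \times \text{CPI} \times \text{CT}
```

Read against this product, the CISC argument is a lower IC, while the RISC argument is a lower CPI and CT; which one wins depends on how the three factors trade off for a given design.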

  11. CISC vs. RISC (cont.)
     • How to design efficient pipelines for CISC ISAs
       • Micro-ops
         • In decode stage, crack CISC instructions into one or more RISC-like micro-operations
         • All subsequent pipeline stages are designed for an internal ISA that looks RISC
         • Examples: x86 micro-ops in Intel and AMD processors
       • Binary translation
         • Program binary is first translated from one ISA to another, and the translated binary is what is run
         • Static binary translation: translate once, use this binary over and over
         • Dynamic binary translation: translate each time the program is run, either all at once at the beginning or incrementally as the program runs

  12. Outline of Miscellaneous ISA Topics
     • Small issues you need to be aware of
       • Alignment
       • Endian-ness
     • Expressing parallelism in ISAs

  13. Load and Store Instructions
     • Different load and store instructions for different data sizes
       • load byte / store byte (1 byte)
       • load halfword / store halfword (2 bytes)
       • load word / store word (4 bytes)
       • load doubleword / store doubleword (8 bytes)
     • Anything larger than a byte introduces the issue of “aligned” versus “unaligned” accesses
         if (load/store address is an integer multiple of the data size)
             access is “aligned”
         else
             access is “unaligned”
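
The alignment rule above translates directly into C. This is a minimal sketch; the function name `is_aligned` and the use of 64-bit addresses are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An access is aligned when its address is an integer multiple
 * of the data size (1, 2, 4, or 8 bytes in the slide's list). */
bool is_aligned(uint64_t addr, uint64_t size)
{
    assert(size != 0);
    return addr % size == 0;
}
```

For example, a word (4-byte) access at address 4 is aligned, while the same access at address 2 is not. Because the data sizes are powers of two, the same check is often written as `(addr & (size - 1)) == 0`.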

  14. [Figure: byte-addressed memory, addresses 0-7. Four diagrams show aligned halfword accesses; three diagrams show unaligned halfword accesses.]

  15. [Figure: byte-addressed memory, addresses 0-7. Two diagrams show aligned word accesses; three diagrams show unaligned word accesses.]

  16. Impact of Unaligned Accesses on Hardware Complexity
     [Figure: two panels, “Aligned Word Access” and “Unaligned Word Access”. Each panel shows cache block #1 (bytes 0-7) and cache block #2 (bytes 8-15) with the word-aligned access boundaries marked.]

  17. How software can help
     • Pad data structures
         struct { char c; int w; }
       [Figure: two memory layouts, bytes 0-7. Without padding, c is at byte 0 and w starts at byte 1. With padding, c is at byte 0, pad starts at byte 1, and w starts at byte 4.]
     • Emulate unaligned access with multiple instructions
       • 2 word-aligned load instructions
         • The unaligned word spans two aligned words
       • 2 AND instructions to mask unused bytes in the two aligned words
       • 1 OR instruction to merge the useful bytes from the two aligned words
       • 1 rotate instruction to reorder the bytes
       [Figure: load #1 fetches bytes 0 1 2 3 and load #2 fetches bytes 4 5 6 7; the 2 ANDs and 1 OR produce 4 5 2 3; the rotate produces 2 3 4 5.]
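
The emulation sequence above (2 aligned loads, 2 ANDs, 1 OR, 1 rotate) can be sketched in C as follows. The function names are hypothetical. The sketch assumes 4-byte words and a register view in which the byte at the lowest address is the most significant, chosen so the intermediate values match the byte orders shown in the figure; the caller must ensure both aligned words lie inside `mem`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Aligned 4-byte load; the lowest-addressed byte lands in the
 * most-significant position (assumed register view). */
static uint32_t load_aligned_word(const uint8_t *mem, size_t addr)
{
    assert((addr & 3) == 0);
    return ((uint32_t)mem[addr] << 24) | ((uint32_t)mem[addr + 1] << 16) |
           ((uint32_t)mem[addr + 2] << 8) | (uint32_t)mem[addr + 3];
}

/* Rotate left by n bits, safe for n == 0. */
static uint32_t rotate_left(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Emulate a word load from any address using only aligned loads. */
uint32_t load_unaligned_word(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)3;         /* aligned word containing addr */
    unsigned k = (unsigned)(addr & 3);       /* byte offset within that word */

    uint32_t w1 = load_aligned_word(mem, base);       /* load #1 */
    if (k == 0)
        return w1;                           /* already aligned */
    uint32_t w2 = load_aligned_word(mem, base + 4);   /* load #2 */

    uint32_t mask1 = 0xFFFFFFFFu >> (8u * k); /* keeps the last 4-k bytes of w1 */
    uint32_t mask2 = ~mask1;                  /* keeps the first k bytes of w2  */
    uint32_t merged = (w1 & mask1) | (w2 & mask2);    /* 2 AND, 1 OR */

    return rotate_left(merged, 8u * k);       /* 1 rotate reorders the bytes */
}
```

With memory bytes holding the values 0 through 7, a load from address 2 gives `merged == 0x04050203` (bytes 4 5 2 3) and returns `0x02030405` (bytes 2 3 4 5), matching the figure.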

  18. Alignment Policy: Example of ISA affecting hardware/software tradeoffs
     • Hardware handles unaligned accesses: ↑ CPI, ↑ energy, ↑ h/w cost
     • Software handles alignment: ↑ IC, ↑ energy -or- ↑ memory waste (padding)

  19. Endian-ness
     • Consider the integer value 0xDEADBEEF
       • 4-byte value
     • When 4-byte value is stored in memory, where is the most-significant byte (MSB)?
     • Choice is arbitrary, and two camps evolved.
       • Big-endian (e.g., IBM PowerPC): byte address 0 is the “big end”: it contains the MSB
       • Little-endian (e.g., x86): byte address 0 is the “little end”: it contains the LSB
     Memory (bytes):
         address   big-endian data   little-endian data
         0         DE                EF
         1         AD                BE
         2         BE                AD
         3         EF                DE
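
The two byte orders can be reproduced with a short C sketch that writes a 4-byte value into a byte array one byte at a time. The function names are hypothetical; the shifts make the result independent of the endianness of the machine running the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian: the lowest address receives the most-significant byte. */
void store_big_endian(uint8_t *mem, size_t addr, uint32_t value)
{
    assert(mem != NULL);
    mem[addr]     = (uint8_t)(value >> 24);
    mem[addr + 1] = (uint8_t)(value >> 16);
    mem[addr + 2] = (uint8_t)(value >> 8);
    mem[addr + 3] = (uint8_t)value;
}

/* Little-endian: the lowest address receives the least-significant byte. */
void store_little_endian(uint8_t *mem, size_t addr, uint32_t value)
{
    assert(mem != NULL);
    mem[addr]     = (uint8_t)value;
    mem[addr + 1] = (uint8_t)(value >> 8);
    mem[addr + 2] = (uint8_t)(value >> 16);
    mem[addr + 3] = (uint8_t)(value >> 24);
}
```

Storing 0xDEADBEEF at address 0 leaves DE AD BE EF in bytes 0 to 3 with the big-endian store and EF BE AD DE with the little-endian store, as in the slide's memory diagram.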

  20. Expressing Parallelism in the ISA
     • Most commercial ISAs are “Von Neumann architecture”
       • Very sequential, doesn’t express parallelism
       • Superscalar processors expend significant effort uncovering inherent instruction-level parallelism
     • Unconventional ISAs express parallelism explicitly
       • SIMD and Vector: Express data-level parallelism
         • Most commercial ISAs (x86, ARM, MIPS, Power) now incorporate SIMD/vector as ISA extensions for multimedia and scientific computing
       • VLIW: Express instruction-level parallelism
         • Commercial success stories: some general-purpose processors (e.g., Intel’s IA64), many digital signal processors (e.g., Texas Instruments DSPs)
       • Dataflow: Eliminate notion of a single program counter. Instruction sequencing is data-driven: producer instructions point to consumer instructions.
