
VLIW Computing



Presentation Transcript


  1. VLIW Computing Serge Vaks Mike Roznik Aakrati Mehta

  2. Presentation Overview • VLIW Overview • Instruction Level Parallelism (most relevant) • Top cited articles • Latest Research

  3. VLIW Overview • A VLIW computer is based on an architecture that implements Instruction Level Parallelism (ILP) • meaning execution of multiple instructions at the same time • A Very Long Instruction Word (VLIW) specifies several primitive operations grouped together into one instruction • The operations read their operands from a register file and are executed in parallel by the functional units provided as part of the hardware
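To make the idea concrete, here is a minimal Python sketch of a VLIW word as a fixed set of slots, one per functional unit, with unused slots padded by NOPs. The slot names and field layout are illustrative assumptions, not any particular machine's format.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    # One primitive operation: an opcode plus register operands.
    @dataclass
    class Op:
        opcode: str            # e.g. "ADD", "LOAD", "FPADD"
        dest: Optional[str]    # destination register; None for stores
        srcs: Tuple[str, ...]  # source registers / immediates

    # Hypothetical slot layout: one fixed slot per functional unit.
    SLOTS = ("ALU0", "ALU1", "MEM", "FPU")

    def make_bundle(ops_by_slot):
        """Build one long instruction word; empty slots become NOPs,
        mirroring the fixed VLIW format."""
        nop = Op("NOP", None, ())
        return {slot: ops_by_slot.get(slot, nop) for slot in SLOTS}

    bundle = make_bundle({
        "MEM":  Op("LOAD",  "C1", ("23(R2)",)),
        "ALU0": Op("ADD",   "R3", ("R3", "1")),
        "FPU":  Op("FPADD", "C4", ("C4", "C3")),
    })
    # All non-NOP operations in `bundle` issue in the same cycle.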

  4. VLIW Overview

  5. Static Scheduling • Unlike superscalar architectures, in the VLIW architecture all scheduling is static • This means scheduling is not done at runtime by the hardware but is handled by the compiler • The compiler takes the instructions that can run in parallel, as identified by Instruction Level Parallelism analysis, and compiles them into object code • The object code is then issued directly to the functional units

  6. Static Scheduling • It is this object code that is referred to as the Very Long Instruction Word (VLIW) • The compiler prearranges the object code so the VLIW chip can quickly execute the instructions in parallel • This frees the microprocessor from having to perform the complex and continual runtime analysis that superscalar RISC and CISC chips must do • A toy sketch of this packing appears below
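The following is a toy Python sketch of the packing idea, not the actual algorithm of any real VLIW compiler: walk the operations in program order and start a new bundle whenever an operation depends on a result produced in the current one (or the bundle is full).

    def pack_bundles(ops, width=4):
        """Greedy static-scheduling sketch. `ops` is a list of
        (dest, srcs) pairs in program order; `width` is the number
        of slots per VLIW word. An op may join the current bundle
        only if it does not read or rewrite a register written by
        that bundle."""
        bundles, current, written = [], [], set()
        for dest, srcs in ops:
            conflict = dest in written or any(s in written for s in srcs)
            if conflict or len(current) == width:
                bundles.append(current)
                current, written = [], set()
            current.append((dest, srcs))
            written.add(dest)
        if current:
            bundles.append(current)
        return bundles

    # r2 depends on r1, so it starts a second bundle; r3 joins it.
    print(pack_bundles([("r1", ()), ("r2", ("r1",)), ("r3", ())]))
    # -> [[('r1', ())], [('r2', ('r1',)), ('r3', ())]]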

  7. VLIW vs Superscalar • Superscalar architectures, in contrast, use dynamic scheduling, which transfers all the ILP complexity to the hardware • This leads to greater hardware complexity that is not seen in VLIW hardware • VLIW chips don’t need most of the complex circuitry that superscalar chips must use to coordinate parallel execution at runtime

  8. VLIW vs Superscalar • Thus in VLIW, hardware complexity is greatly reduced • the executable instructions are generated directly by the compiler • they are then executed as “native code” by the functional units present in the hardware • VLIW chips can • cost less • burn less power • achieve significantly higher performance than comparable RISC and CISC chips

  9. Tradeoffs • VLIW architecture still has many problems it must overcome • code expansion • high power consumption • scalability

  10. Tradeoffs • Also, the VLIW compiler is architecture-specific • it is an integral part of the VLIW system • A poor VLIW compiler will have a much more negative impact on performance than would a poor RISC or CISC compiler

  11. History and Outlook • VLIW predates the existing superscalar technology, which has proved more useful up until now • Recent advances in computer technology, especially smarter compilers, are leading to a resurgence of VLIW architectures • So VLIW could still have a very promising future ahead of it

  12. Western Research Laboratory (WRL) Research Report 89/7 Available Instruction-level Parallelism for Superscalar and Superpipelined Machines By Norman P. Jouppi and David W. Wall

  13. Ways of Exploiting Instruction-level Parallelism (ILP) • Superscalar machines can issue several instructions per cycle • Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit

  14. Example code fragments for ILP

      Load  C1 <- 23(R2)
      Add   R3 <- R3 + 1
      FPAdd C4 <- C4 + C3
      Parallelism = 3

      Add   R3 <- R3 + 1
      Add   R4 <- R3 + R2
      Store 0[R4] <- R0
      Parallelism = 1
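The parallelism figures follow from the dependence chains: the first fragment's three operations are mutually independent, while in the second each operation consumes the previous result. A small Python sketch (the (dest, srcs) encoding is an assumption for illustration) computes this as operations divided by the longest dependence chain:

    def parallelism(ops):
        """Average ILP: number of operations divided by the length of
        the longest dependence chain. `ops` is a list of (dest, srcs)
        pairs in program order."""
        depth = {}      # register -> chain depth at which it is ready
        longest = 0
        for dest, srcs in ops:
            d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
            depth[dest] = d
            longest = max(longest, d)
        return len(ops) / longest

    # Fragment 1: all three ops are independent -> 3.0
    print(parallelism([("C1", ("R2",)), ("R3", ("R3",)), ("C4", ("C4", "C3"))]))
    # Fragment 2: each op feeds the next -> 1.0
    print(parallelism([("R3", ("R3",)), ("R4", ("R3", "R2")), ("M", ("R4", "R0"))]))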

  15. A Machine Taxonomy • Operation latency: the time (in cycles) until the result of an instruction is available for use as an operand in a subsequent instruction. • Simple operations: operations such as integer add, logical ops, loads, stores, branches, floating point addition, and multiplication. Divide and cache misses are not simple operations. • Instruction class: a group of instructions all issued to the same type of functional unit. • Issue latency: the time (in cycles) required between issuing two instructions.

  16. Machine Types • The base machine • instructions issued per cycle = 1 • simple operation latency measured in cycles = 1 • instruction-level parallelism required to fully utilize = 1 • Underpipelined machines • execute an operation and write back the result in the same pipestage • have a cycle time greater than the latency of a simple operation, or • issue fewer than one instruction per cycle • Superscalar machines • instructions issued per cycle = n • simple operation latency measured in cycles = 1 • instruction-level parallelism required to fully utilize = n

  17. Key Properties of VLIW Machines • VLIW machines have instructions hundreds of bits long. Each instruction can specify many operations, so each instruction exploits ILP. • VLIW instructions have a fixed format. The operations specifiable in one instruction do not exceed the resources of the machine, unlike in superscalar machines. • In effect, the selection of which operations to issue in a given cycle is performed at compile time in a VLIW machine and at run time in a superscalar machine. • The instruction decode logic for a VLIW machine is simpler. • The fixed VLIW format includes bits for unused operations. • VLIW machines able to exploit more parallelism would require larger instructions.

  18. VLIW vs Superscalar There are three differences between superscalar and VLIW instructions: • Decoding VLIW instructions is easier than decoding superscalar instructions. • When the available instruction-level parallelism is less than that exploitable by the VLIW machine, the code density of the superscalar machine is better. • A superscalar machine can be object-code compatible with a large family of non-parallel machines, but VLIW machines exploiting different amounts of parallelism would require different instruction sets.

  19. [Figure: execution in a VLIW machine, showing the pipeline stages IFetch, Decode, Execute, and WriteBack for successive instructions against time in base cycles]

  20. Class Conflicts • There are two ways to develop a superscalar machine of degree n from a base machine: • duplicate all functional units n times, including register ports, bypasses, busses, and instruction decode logic • duplicate only the register ports, bypasses, busses, and instruction decode logic • These two methods are extreme cases, and one could duplicate some units and not others. But if all functional units are not duplicated, potential class conflicts are created. A class conflict occurs when an instruction is followed too soon by another instruction for the same functional unit.

  21. Superpipelined Machines • Instructions issued per cycle = 1, but the cycle time is 1/m of the base machine • Simple operation latency measured in cycles = m • Instruction-level parallelism required to fully utilize = m [Figure: superpipelined execution (m = 3), showing the IFetch, Decode, Execute, and WriteBack stages for successive instructions against time in base cycles]

  22. Superpipelined Superscalar Machines • Instructions issued per cycle = n, but the cycle time is 1/m of the base machine • Simple operation latency measured in cycles = m • Instruction-level parallelism required to fully utilize = n * m • For example, with n = 3 and m = 3, nine independent instructions must be available to keep the machine fully utilized [Figure: superpipelined superscalar execution (n = 3, m = 3)]

  23. Vector Machines • Vector machines can also take advantage of ILP • Each of the machines above could have an attached vector unit, which executes vector instructions in parallel • Each vector instruction results in a string of operations, one for each element in the vector [Figure: execution of vector instructions]

  24. Supersymmetry • A superscalar machine of degree three can have three instructions executing at the same time by issuing three at the same time. • A superpipelined machine can have three instructions executing at the same time by having a cycle time 1/3 that of the superscalar machine and issuing three instructions in successive cycles. • So as far as supersymmetry is concerned, superscalar and superpipelined machines of equal degree have basically the same performance.

  25. Limits of Instruction-Level Parallelism (Wall, 1991) How much parallelism is there to exploit?

  26. Wall’s experimental framework of 18 test programs considers the following kinds of dependency: • Data dependency: the result of the first instruction is an operand of the second instruction. • Anti-dependency: the first instruction uses the old value in some location and the second sets that location to a new value. • Output dependency: both instructions assign a value to the same location. • Control dependency: holds between a branch and an instruction whose execution is conditional on it.

  27. Examples of the four kinds of dependency:

      (a) True data dependency    (b) Anti-dependency
          r1 := 20[r4]                r2 := r1 + r4
          ...                         ...
          r2 := r1 + 1                r1 := r17 - 1

      (c) Output dependency       (d) Control dependency
          r1 := r2 * r3               if r17 = 0 goto L
          ...                         ...
          r1 := 0[r7]                 r1 := r2 + r3
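The register-carried cases are mechanical to detect once each instruction's read and write sets are known. A hedged Python sketch (the (reads, writes) encoding is an assumption; control dependencies need the control-flow graph rather than register sets, so they are omitted):

    def classify(first, second):
        """Classify register dependencies between two instructions,
        each given as (reads, writes) sets of register names."""
        deps = []
        if first[1] & second[0]:
            deps.append("true data dependency")   # write then read
        if first[0] & second[1]:
            deps.append("anti-dependency")        # read then write
        if first[1] & second[1]:
            deps.append("output dependency")      # write then write
        return deps

    # (a) r1 := 20[r4] ... r2 := r1 + 1  -> true data dependency
    print(classify(({"r4"}, {"r1"}), ({"r1"}, {"r2"})))
    # (b) r2 := r1 + r4 ... r1 := r17 - 1 -> anti-dependency
    print(classify(({"r1", "r4"}, {"r2"}), ({"r17"}, {"r1"})))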

  28. Register Renaming • Anti-dependencies and output dependencies on registers are often accidents of the compiler’s register allocation technique. • Register renaming is a hardware method that imposes a level of indirection between the register number appearing in the instruction and the actual register used. • Perfect renaming (assume an infinite number of registers) • Finite renaming (assume a finite register set, dynamically allocated using an LRU policy) • None (use the register specified in the code)
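A minimal sketch of the "perfect renaming" case, assuming an unbounded register pool: every write gets a fresh name and reads are redirected to the latest name, which removes anti- and output dependencies while preserving true ones.

    import itertools

    def rename(ops):
        """Perfect register renaming sketch. `ops` is a list of
        (dest, srcs) pairs; returns the same ops with each write
        given a fresh name and each read redirected accordingly."""
        fresh = (f"v{i}" for i in itertools.count())
        latest = {}                     # architectural -> current name
        out = []
        for dest, srcs in ops:
            srcs = tuple(latest.get(s, s) for s in srcs)
            latest[dest] = next(fresh)  # fresh name breaks WAR and WAW
            out.append((latest[dest], srcs))
        return out

    # Output dependency (c): both instructions write r1 ...
    print(rename([("r1", ("r2", "r3")), ("r1", ("r7",))]))
    # ... after renaming they write v0 and v1 and can run in parallel.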

  29. Alias Analysis • Like registers, memory locations can also carry true and false dependencies, but... • memory is much larger than the register file • it is hard to tell when a memory-carried dependency exists • The registers used by an instruction are manifest in the instruction itself, while the memory location used is not manifest and may differ between executions of the instruction. This can force the machine to assume dependencies that do not actually exist: the aliasing problem. • Alias analysis types: • perfect alias analysis • no alias analysis • alias analysis by instruction inspection • alias analysis by compiler

  30. Branch Prediction • Speculative execution: parallelism within a basic block is usually limited, mainly because basic blocks are usually quite small. Speculative execution tries to mitigate this by scheduling instructions across branches. • Branch prediction: the hardware or the software predicts which way a given branch will most likely go, and speculatively schedules instructions from that path.
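As an illustration of hardware prediction, here is the classic 2-bit saturating-counter scheme in Python (a textbook predictor shown as a sketch; Wall's study evaluates its own set of predictor configurations):

    class TwoBitPredictor:
        """One 2-bit saturating counter per branch address:
        0-1 predict not taken, 2-3 predict taken."""
        def __init__(self):
            self.counters = {}                 # branch address -> 0..3

        def predict(self, pc):
            return self.counters.get(pc, 1) >= 2

        def update(self, pc, taken):
            c = self.counters.get(pc, 1)
            self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    p = TwoBitPredictor()
    for outcome in (True, True, False, True):  # one branch's history
        print("predicted taken:", p.predict(0x400), "actual:", outcome)
        p.update(0x400, outcome)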

  31. Example VLIW Processors • “Automatic Exploration of VLIW Processor Architectures from a Designer’s Experience Based Specification” by Auguin, Boeri, and Carriere • “VIPER: A VLIW Integer Microprocessor” by Gray, Naylor, Abnous, and Bagherzadeh

  32. Example VLIW Processors • RISC architecture exploits temporal parallelism, whereas VLIW architecture exploits spatial parallelism • Superscalar processors schedule the order of operations at run time, demanding more hardware; VLIW processors schedule at compile time, making for simpler hardware paths • The large instruction words can be used to contain either more complex instructions or more instructions • This requires more or larger registers to hold them

  33. Example VLIW Processors • Less hardware is needed, which leads to: • less power consumption • less heat • cheaper cost to manufacture • How do you achieve the full speed of a VLIW chip? • decoding of multiple instructions at once • more hardware • more complex compilers

  34. Example VLIW Processors

  35. Viper Processor • Executes four 32-bit operations concurrently • Up to 2 load/store operations at once • Less hardware on chip allows for up to 75% more cache • Greater cache performance means faster execution • To simplify the compiler problem, Viper uses only one ALU • Cheaper overall than a chip of similar speed • There is a greater cost of production due to new technology

  36. Viper Processor

  37. Viper Processor

  38. Current Research • The focus of this section is the current research taking place in relation to VLIW architectures • Roughly half of the latest research papers I examined had to do with some aspect of clustered VLIW architectures • Since this seems a very hot topic of research, I chose the two papers that I thought were most representative of it

  39. Current Research • “An effective software pipelining algorithm for clustered embedded VLIW processors” by C. Akturan, M. Jacome September 2002. • “Application-specific clustered VLIW datapaths: Early exploration on a parameterized design space” by V. Lapinskii, M. Jacome, G. de Veciana August 2002.

  40. Why clusters? • In order to take full advantage of the instruction-level parallelism extracted by software pipelining, Very Long Instruction Word (VLIW) processors with a large number of functional units (FUs) are typically required • Unfortunately, architectures with a centralized register file scale poorly as the number of FUs increases

  41. Why clusters? • centralized architectures quickly become prohibitively costly in terms of • clock rate • power dissipation • delay • area • overall design complexity

  42. Clusters • In order to control the penalties associated with an excessive number of register file (RF) ports • while still providing all the functional units necessary to exploit the available ILP • we restrict the connectivity between functional units and registers

  43. Clusters • We restructure a VLIW datapath into a set of clusters • Each cluster in the datapath contains a set of functional units connected to a local register file • The clock rate of a clustered VLIW machine is likely to be significantly faster than that of a centralized machine with the same number of FUs • The cost is that values crossing cluster boundaries need explicit inter-cluster copies, as sketched below
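A toy Python cost model (the encoding and the one-copy-per-crossing cost are simplifying assumptions, not the papers' actual models) that counts how many values a given cluster assignment forces across cluster boundaries:

    def copy_cost(assignment, ops):
        """Count inter-cluster register copies for a placement.
        `assignment` maps op index -> cluster id; `ops` is a list of
        (dest, srcs) pairs where srcs name earlier dests. One copy is
        charged whenever a value is consumed in another cluster."""
        producer = {}
        copies = 0
        for i, (dest, srcs) in enumerate(ops):
            for s in srcs:
                if s in producer and assignment[producer[s]] != assignment[i]:
                    copies += 1
            producer[dest] = i
        return copies

    ops = [("a", ()), ("b", ("a",)), ("c", ("a",)), ("d", ("b", "c"))]
    # "a" feeds "c" across clusters and "c" feeds "d" back -> 2 copies
    print(copy_cost({0: 0, 1: 0, 2: 1, 3: 0}, ops))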

  44. Clusters

  45. Pentium II

  46. Good Datapath Configurations • The first paper, by Lapinskii, tries to expose good datapath configurations within the large space of possible design choices • Break up the possible set of design decisions into design slices • Focus on parameters that have a first-order impact on key physical figures of merit: • clock rate • power dissipation

  47. Good Datapath Configurations • Each slice has the following properties (parameters): • cluster capacity • number of clusters • bus (interconnect) capacity • With their methodology, they explore the different design decisions by varying these parameters

  48. Software-Pipelining Algorithm • The next paper, by Akturan, presents a software-pipelining algorithm called CALiBeR • CALiBeR takes code, loop bodies in particular, and reschedules it in such a way as to take advantage of the inherent ILP • it then binds the instructions to a given clustered datapath configuration

  49. CALiBeR • Although CALiBeR is designed for compilers targeting embedded VLIW processors • it can be applied more generally • It can handle heterogeneous clustered datapath configurations • clusters with any number of FUs • clusters with any type of FUs • multi-cycle FUs • pipelined FUs • (A standard lower bound on the initiation interval of any software pipeline, CALiBeR’s included, is sketched below)
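For background on what any software pipeliner is up against: the initiation interval (II) of a modulo schedule is bounded below by the resource-constrained minimum II. A short sketch of that standard bound (this is general software-pipelining theory, not CALiBeR itself):

    import math

    def res_mii(op_counts, unit_counts):
        """Resource-constrained minimum initiation interval: for each
        functional-unit class, the number of loop-body operations of
        that class divided by the available units, rounded up; the
        bound is the maximum over all classes."""
        return max(math.ceil(op_counts[c] / unit_counts[c])
                   for c in op_counts)

    # e.g. 6 ALU ops on 2 ALUs and 3 memory ops on 1 load/store unit:
    # a new loop iteration can start at most every 3 cycles.
    print(res_mii({"alu": 6, "mem": 3}, {"alu": 2, "mem": 1}))  # -> 3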
