VLIW Computing Serge Vaks Mike Roznik Aakrati Mehta
Presentation Overview • VLIW Overview • Instruction Level Parallelism (most relevant) • Top cited articles • Latest Research
VLIW Overview • A VLIW computer is based on an architecture that implements Instruction Level Parallelism (ILP), meaning the execution of multiple instructions at the same time • A Very Long Instruction Word (VLIW) specifies multiple primitive operations that are grouped together • The operations are issued to the functional units provided by the hardware, which read their operands from and write their results to a shared register file
Static Scheduling • Unlike Super Scalar architectures, in the VLIW architecture all scheduling is static • This means that scheduling decisions are not made at runtime by the hardware but are handled by the compiler • The compiler analyzes the program for Instruction Level Parallelism and compiles the operations that can run together into object code • The object code is then issued directly to the functional units, which operate on the register file
Static Scheduling • It is this object code that is referred to as the Very Long Instruction Word (VLIW). • The compiler prearranges the object code so the VLIW chip can quickly execute the instructions in parallel • This frees up the microprocessor from having to perform the complex and continual runtime analysis that Super Scalar RISC and CISC chips must do.
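As a rough illustration of static scheduling, the sketch below packs operations into long instruction words ahead of time, the way a compiler might. The `(dest, sources)` operation format, the `schedule` function and the issue width are assumptions made for this example and do not come from any real VLIW toolchain. Only true (read-after-write) register dependencies are tracked, which is enough here because every register is written only once, and every operation is assumed to take one cycle.

```python
def schedule(ops, width):
    """Greedy static scheduler: place each op in the earliest word after its producers.

    ops   -- list of (dest, sources) tuples in program order (hypothetical format)
    width -- number of operation slots per long instruction word
    Returns a list of words; each word is a list of op indices issued together.
    """
    ready_at = {}   # register -> index of the first word in which its value can be read
    words = []
    for i, (dest, sources) in enumerate(ops):
        # Earliest word in which every source operand is available.
        w = max((ready_at.get(s, 0) for s in sources), default=0)
        # Skip words whose slots are already full.
        while w < len(words) and len(words[w]) >= width:
            w += 1
        while len(words) <= w:
            words.append([])
        words[w].append(i)
        ready_at[dest] = w + 1   # unit latency: result usable in the next word
    return words


ops = [
    ("r1", ["r0"]),        # op 0
    ("r2", ["r0"]),        # op 1
    ("r3", ["r1", "r2"]),  # op 2: needs the results of ops 0 and 1
    ("r4", ["r0"]),        # op 3
]
print(schedule(ops, width=2))   # [[0, 1], [2, 3]]
```

All of this work happens before the program runs; the hardware only has to issue each word as it is fetched.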
VLIW vs Super Scalar • Super Scalar architectures, in contrast, use dynamic scheduling that transfers all ILP complexity to the hardware • This leads to greater hardware complexity that is not seen in VLIW hardware • VLIW chips don’t need most of the complex circuitry that Super Scalar chips must use to coordinate parallel execution at runtime
VLIW vs Super Scalar • Thus in VLIW, hardware complexity is greatly reduced • the executable instructions are generated directly by the compiler • they are then executed as “native code” by the functional units present in the hardware • As a result, VLIW chips can potentially • cost less • burn less power • achieve higher performance than comparable RISC and CISC chips, provided the compiler finds enough parallelism
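Tradeoffs • VLIW architecture still has many problems it must overcome • code expansion (unused operation slots must still be encoded) • high power consumption as the number of functional units and register file ports grows • scalability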
Tradeoffs • Also, the VLIW compiler is specific to the target hardware • it is an integral part of the VLIW system • A poor VLIW compiler will have a much more negative impact on performance than would a poor RISC or CISC compiler
History and Outlook • VLIW predates the existing Super Scalar technology, which has proved more useful up until now • Recent advances in computer technology, especially smarter compilers, are leading to a resurgence of VLIW architectures • So VLIW could still have a promising future
Western Research Laboratory (WRL) Research Report 89/7 Available Instruction-level Parallelism for Superscalar and Superpipelined Machines By Norman P. Jouppi and David W. Wall
Ways of Exploiting Instruction-level Parallelism (ILP) • Superscalar machines can issue several instructions per cycle. • Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit.
Example code fragments for ILP • Parallelism = 3: Load C1 <- 23(R2); Add R3 <- R3 + 1; FPAdd C4 <- C4 + C3 • Parallelism = 1: Add R3 <- R3 + 1; Add R4 <- R3 + R2; Store 0[R4] <- R0
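The two parallelism figures can be checked with a small sketch. It assumes that parallelism means the number of instructions divided by the length of the longest chain of true register dependencies, with every operation taking one cycle. The `(dest, sources)` encoding and the `parallelism` function are made up for this example.

```python
def parallelism(ops):
    """Average parallelism = number of ops / longest true-dependency chain.

    ops -- list of (dest, sources) in program order; dest is None for a store.
    """
    depth_of = {}   # register -> chain depth of the op that last wrote it
    longest = 0
    for dest, sources in ops:
        # An op sits one step deeper than its deepest producer.
        depth = 1 + max((depth_of.get(s, 0) for s in sources), default=0)
        if dest is not None:
            depth_of[dest] = depth
        longest = max(longest, depth)
    return len(ops) / longest


# Load C1 <- 23(R2); Add R3 <- R3 + 1; FPAdd C4 <- C4 + C3
independent = [("C1", ["R2"]), ("R3", ["R3"]), ("C4", ["C4", "C3"])]
# Add R3 <- R3 + 1; Add R4 <- R3 + R2; Store 0[R4] <- R0
dependent = [("R3", ["R3"]), ("R4", ["R3", "R2"]), (None, ["R4", "R0"])]

print(parallelism(independent))  # 3.0
print(parallelism(dependent))    # 1.0
```

In the second fragment each instruction needs the result of the one before it, so the three instructions form a single chain and nothing can overlap.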
A Machine Taxonomy • Operation latency: the time (in cycles) until the result of an instruction is available for use as an operand in a subsequent instruction • Simple operations: operations such as integer add, logical ops, loads, stores, branches, and floating-point addition and multiplication; divide and cache misses are not simple operations • Instruction class: a group of instructions all issued to the same type of functional unit • Issue latency: the time (in cycles) required between issuing two instructions
Various Methods • The Base Machine: instructions issued per cycle = 1; simple operation latency measured in cycles = 1; Instruction-Level Parallelism required to fully utilize = 1 • Underpipelined Machines: execute an operation and write back the result in the same pipestage; an underpipelined machine either has a cycle time greater than the latency of a simple operation or issues fewer than one instruction per cycle • Superscalar Machines: instructions issued per cycle = n at all times; simple operation latency measured in cycles = 1; Instruction-Level Parallelism required to fully utilize = n
Key Properties of VLIW Machines • VLIW machines have instructions hundreds of bits long. Each instruction can specify many operations, so each instruction exploits ILP. • VLIW instructions have a fixed format. Unlike in superscalar machines, the operations specifiable in one instruction never exceed the resources of the machine. • In effect, the selection of which operations to issue in a given cycle is performed at compile time in a VLIW machine and at run time in a superscalar machine. • The instruction decode logic for a VLIW machine is simpler. • The fixed VLIW format includes bits for unused operations. • VLIW machines that are able to exploit more parallelism would require larger instructions.
VLIW Vs Superscalar • There are three differences between superscalar and VLIW instructions: • Decoding of VLIW instructions is easier than decoding of superscalar instructions. • When the available instruction-level parallelism is less than that exploitable by the VLIW machine, the code density of the superscalar machine will be better. • A superscalar machine could be object-code compatible with a large family of non-parallel machines, but VLIW machines exploiting different amounts of parallelism would require different instruction sets.
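The code density point can be made concrete with a little arithmetic. The sketch below assumes a fixed-format machine that encodes every cycle as one full-width word, filling unused slots with no-ops, and compares it with an encoding that stores only the useful operations. The 32-bit operation size, the four-slot word and the per-cycle operation counts are invented for this example.

```python
def fixed_format_bits(ops_per_cycle, width, op_bits=32):
    """Code size when every cycle is one full word; unused slots become no-ops."""
    return len(ops_per_cycle) * width * op_bits


def dense_bits(ops_per_cycle, op_bits=32):
    """Code size when only the useful operations are stored."""
    return sum(ops_per_cycle) * op_bits


useful_ops = [3, 1, 1, 2]   # useful operations found in each cycle (made-up numbers)
print(fixed_format_bits(useful_ops, width=4))   # 512
print(dense_bits(useful_ops))                   # 224
```

With only 7 of 16 slots doing useful work, the fixed-format code is more than twice as large. When every slot is filled the two sizes are equal.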
[Figure: Execution in a VLIW machine. Key: IFetch, Decode, Execute, WriteBack. Axes: successive instructions vs. time in base cycles.]
Class Conflicts • There are two ways to develop a superscalar machine of degree n from a base machine: • Duplicate all functional units n times, including register ports, bypasses, busses, and instruction decode logic. • Duplicate only the register ports, bypasses, busses, and instruction decode logic. • These two methods are extreme cases, and one could duplicate some units and not others. But if not all functional units are duplicated, then potential class conflicts will be created. • A class conflict occurs when some instruction is followed by another instruction for the same functional unit.
Superpipelined Machines • Instructions issued per cycle = 1, but cycle time is 1/m of the base machine • Simple operation latency measured in cycles = m • Instruction-level parallelism required to fully utilize = m • [Figure: Superpipelined execution (m=3). Key: IFetch, Decode, Execute, WriteBack. Axes: successive instructions vs. time in base cycles.]
Superpipelined Superscalar Machines • Instructions issued per cycle = n, but cycle time is 1/m of the base machine • Simple operation latency measured in cycles = m • Instruction-level parallelism required to fully utilize = n*m • [Figure: Superpipelined Superscalar execution (n=3, m=3). Key: IFetch, Decode, Execute, WriteBack. Axis: time in base cycles.]
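The taxonomy reduces to two parameters, the issue width n and the degree of superpipelining m. A minimal sketch, assuming a hypothetical `Machine` record, tabulates the parallelism each class needs in order to stay fully utilized:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    n: int = 1   # instructions issued per cycle
    m: int = 1   # cycle time is 1/m of the base machine's

    @property
    def required_ilp(self):
        """Independent instructions needed to keep the machine fully utilized."""
        return self.n * self.m


machines = [
    Machine("base"),
    Machine("superscalar", n=3),
    Machine("superpipelined", m=3),
    Machine("superpipelined superscalar", n=3, m=3),
]
for machine in machines:
    print(f"{machine.name}: required ILP = {machine.required_ilp}")
```

The base machine is the n = 1, m = 1 case, and a degree-3 superscalar machine needs the same parallelism as a degree-3 superpipelined one.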
Vector Machines • Vector machines can also take advantage of ILP • Each of these machines could have an attached vector unit. The figure shows parallel execution of vector instructions. • Each vector instruction results in a string of operations, one for each element in the vector. • [Figure: execution of vector instructions, captioned "Superpipelined Superscalar execution (n=3, m=3)". Axes: successive instructions vs. time in base cycles.]
Supersymmetry • A superscalar machine of degree three can have three instructions executing at the same time by issuing three at the same time. • The superpipelined machine can have three instructions executing at the same time by having a cycle time 1/3 that of the superscalar machine, and issuing three instructions in successive cycles. • So as far as supersymmetry is concerned, both superscalar and superpipelined machines of equal degree have basically the same performance.
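A small timing model shows what "basically the same" means. It is a sketch under simplifying assumptions: all k operations are independent simple operations, only the execute stage is counted, and the `finish_time` helper is invented for this example.

```python
from fractions import Fraction


def finish_time(k, n=1, m=1):
    """Base cycles until the last of k independent simple operations completes.

    n -- instructions issued per (minor) cycle
    m -- minor cycles per base cycle; a simple operation takes m minor cycles
    """
    last_issue = (k - 1) // n            # minor cycle in which the last op issues
    return Fraction(last_issue + m, m)   # add the latency, convert to base cycles


print(finish_time(3, n=3), finish_time(3, m=3))       # 1 5/3
print(finish_time(300, n=3), finish_time(300, m=3))   # 100 302/3
```

Under this model the superpipelined machine finishes two thirds of a base cycle later regardless of k, because it cannot issue its second and third instructions at time zero. Over a long run both machines sustain three operations per base cycle.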
Limits of Instruction Level Parallelism- Wall [91] How much parallelism is there to exploit?
Wall’s experimental framework of 18 test programs considers the following kinds of dependency • Data dependency: the result of the first instruction is an operand of the second instruction. • Anti-dependency: the first instruction uses the old value in some location and the second sets that location to a new value. • Output dependency: both instructions assign a value to the same location. • Control dependency: this is between a branch and an instruction whose execution is conditional on it.
Wall’s experimental framework of 18 test programs: dependency examples • (a) True data dependency: r1 := 20[r4]; ...; r2 := r1 + 1 • (b) Anti-dependency: r2 := r1 + r4; ...; r1 := r17 - 1 • (c) Output dependency: r1 := r2 * r3; ...; r1 := 0[r7] • (d) Control dependency: if r17 = 0 goto L; ...; r1 := r2 + r3
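The three register dependencies can be told apart mechanically by comparing what each instruction reads and writes. The sketch below assumes a made-up `(dest, sources)` encoding and a hypothetical `classify` helper. Control dependencies are left out because they depend on branch structure rather than on registers.

```python
def classify(first, second):
    """Return the register dependencies of `second` on an earlier `first`.

    Each instruction is a (dest, sources) pair; dest may be None.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    kinds = set()
    if dest1 is not None and dest1 in sources2:
        kinds.add("true")     # second reads what first wrote
    if dest2 is not None and dest2 in sources1:
        kinds.add("anti")     # second overwrites what first read
    if dest1 is not None and dest1 == dest2:
        kinds.add("output")   # both write the same register
    return kinds


# (a) r1 := 20[r4]    ...  r2 := r1 + 1
print(classify(("r1", ["r4"]), ("r2", ["r1"])))          # {'true'}
# (b) r2 := r1 + r4   ...  r1 := r17 - 1
print(classify(("r2", ["r1", "r4"]), ("r1", ["r17"])))   # {'anti'}
# (c) r1 := r2 * r3   ...  r1 := 0[r7]
print(classify(("r1", ["r2", "r3"]), ("r1", ["r7"])))    # {'output'}
```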
Register Renaming • Anti-dependencies and output dependencies on registers are often accidents of the compiler’s register allocation technique. • Register renaming is a hardware method which imposes a level of indirection between the register number appearing in the instruction and the actual register used. • Three renaming models are considered: • Perfect renaming (assume an infinite number of registers) • Finite renaming (assume a finite register set, dynamically allocated using an LRU discipline) • None (use the register specified in the code)
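Perfect renaming can be modeled in a few lines of software, even though the slide describes a hardware mechanism. In this sketch every write gets a fresh physical register from an unbounded pool; the `(dest, sources)` encoding, the `rename` function and the `p0, p1, ...` names are assumptions for illustration.

```python
from itertools import count


def rename(ops):
    """Give every write a fresh physical register (perfect renaming)."""
    fresh = count()
    current = {}   # architectural register -> physical register with its latest value
    renamed = []
    for dest, sources in ops:
        # Sources read the most recent mapping; unmapped registers are live-in values.
        new_sources = [current.get(s, s) for s in sources]
        phys = f"p{next(fresh)}"
        current[dest] = phys
        renamed.append((phys, new_sources))
    return renamed


ops = [
    ("r1", ["r2", "r3"]),   # r1 := r2 * r3
    ("r4", ["r1"]),         # r4 := r1 + 1   (true dependency on r1)
    ("r1", ["r7"]),         # r1 := 0[r7]    (output and anti-dependency on r1)
]
print(rename(ops))
# [('p0', ['r2', 'r3']), ('p1', ['p0']), ('p2', ['r7'])]
```

After renaming, the third instruction writes `p2` and no longer conflicts with the first two, while the true dependency of the second instruction on `p0` is preserved.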
Alias Analysis • Like registers, memory locations can also carry true and false dependencies, but: • Memory is much larger than the register file • It is hard to tell when a memory-carried dependency exists • The registers used by an instruction are manifest in the instruction itself, while the memory location used is not manifest and may be different for different executions of the instruction. This may force us to assume dependencies that do not actually exist, which is the aliasing problem. • Alias analysis types are: • Perfect alias analysis • No alias analysis • Alias by instruction inspection • Alias analysis by compiler
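To give a feel for alias analysis by instruction inspection, the sketch below applies one conservative rule: two references through the same base register cannot conflict if their byte ranges do not overlap, and anything else is assumed to conflict. The rule, the `(base, offset)` encoding, the 4-byte access size and the `may_alias` name are assumptions chosen for this example. The rule is only valid if the base register is not modified between the two references.

```python
def may_alias(ref_a, ref_b, size=4):
    """Conservative check for two memory references of the form offset[base].

    Each reference is a (base_register, offset) pair accessing `size` bytes.
    """
    (base_a, off_a), (base_b, off_b) = ref_a, ref_b
    if base_a == base_b:
        # Same base value: the accesses conflict only if their byte ranges overlap.
        return abs(off_a - off_b) < size
    # Different base registers: nothing is known, so a dependency must be assumed.
    return True


print(may_alias(("r4", 0), ("r4", 8)))   # False
print(may_alias(("r4", 0), ("r7", 8)))   # True
```

The second answer is where parallelism is lost: the two references may well be independent, but inspection alone cannot prove it.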
Branch Prediction • Speculative execution: parallelism within a basic block is usually limited, mainly because basic blocks are usually quite small. Speculative execution tries to mitigate this by scheduling instructions across branches. • Branch prediction: the hardware or the software predicts which way a given branch will most likely go, and speculatively schedules instructions from that path.
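One widely used hardware scheme is the two-bit saturating counter, sketched below. It is offered as a familiar example of branch prediction and is not necessarily the scheme the slides have in mind; the class name, the initial counter value and the test pattern are choices made for this illustration.

```python
class TwoBitPredictor:
    """One saturating counter (0..3) per branch; 2 or 3 means predict taken."""

    def __init__(self):
        self.counters = {}   # branch address -> counter value

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_correct(outcomes, pc=0):
    """Run one branch through the predictor and count correct predictions."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# A loop branch: taken nine times, then falls through; the loop runs twice.
loop_branch = ([True] * 9 + [False]) * 2
print(count_correct(loop_branch), "of", len(loop_branch))   # 17 of 20
```

The single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction for the next run of the loop.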
Example VLIW Processors • “Automatic Exploration of VLIW Processor Architectures from a Designer’s Experience Based Specification”, Drs. Auguin, Boeri & Carriere • “VIPER: A VLIW Integer Microprocessor”, Drs. Gray, Naylor, Abnous & Bagherzadeh
Example VLIW Processors • RISC architecture utilizes temporal parallelism whereas VLIW architecture utilizes spatial parallelism • Superscalar processors schedule the order of operations at run time, demanding more hardware; VLIW processors schedule at compile time, making for simpler hardware paths • These large instruction words can be used to contain either more complex instructions or more instructions • Requires more or larger registers to hold them
Example VLIW Processors • Less hardware is needed, which leads to: • less power consumption • less heat • lower manufacturing cost • How do you achieve the full speed of a VLIW chip? • Decoding of multiple instructions at once • More hardware • More complex compilers
Viper Processor • Executes four 32-bit operations concurrently • Up to 2 load/store operations at once • Less hardware on chip allows for up to 75% more cache • Greater cache performance means faster execution • To solve the compiler problem Viper uses only one ALU • Cheaper overall than a chip of similar speed • There is a greater cost of production due to new technology
Current Research • My section focuses on the current research taking place in relation to VLIW architectures • Roughly half of the latest research papers I examined had to do with some aspect of clustered VLIW architectures • Since this seems to be a very hot topic of research, I chose two papers that I thought were most representative of it
Current Research • “An effective software pipelining algorithm for clustered embedded VLIW processors” by C. Akturan, M. Jacome September 2002. • “Application-specific clustered VLIW datapaths: Early exploration on a parameterized design space” by V. Lapinskii, M. Jacome, G. de Veciana August 2002.
Why clusters? • In order to take full advantage of instruction level parallelism extracted by software pipelining, Very Long Instruction Word (VLIW) processors with a large number of functional units (FU’s) are typically required • Unfortunately, architectures with a centralized register file scale poorly as the number of FU’s increases
Why clusters? • centralized architectures quickly become prohibitively costly in terms of • clock rate • power dissipation • delay • area • overall design complexity
Clusters • In order to control the penalties associated with an excessive number of register file (RF) ports • While still providing all functional units necessary to exploit the available ILP • We restrict the connectivity between functional units and registers
Clusters • We restructure a VLIW datapath into a set of clusters • Each cluster in the datapath contains a set of functional units connected to a local register file • The clock rate of a clustered VLIW machine is likely to be significantly faster than that of a centralized machine with the same number of FU’s
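The port pressure that motivates clustering can be estimated with simple arithmetic. The sketch below assumes each functional unit needs two read ports and one write port on its register file, and that each cluster adds a couple of ports for the inter-cluster bus. These figures and the `ports_per_register_file` helper are illustrative assumptions, not numbers from the papers.

```python
def ports_per_register_file(num_fus, num_clusters=1, ports_per_fu=3, bus_ports=0):
    """Ports each register file needs when FUs are split evenly across clusters."""
    fus_per_cluster, remainder = divmod(num_fus, num_clusters)
    if remainder:
        raise ValueError("FUs must divide evenly across clusters")
    return fus_per_cluster * ports_per_fu + bus_ports


# 8 FUs on one centralized register file vs. 4 clusters of 2 FUs each.
print(ports_per_register_file(8))                               # 24
print(ports_per_register_file(8, num_clusters=4, bus_ports=2))  # 8
```

Each local register file stays small as the machine grows; the price is that values needed in another cluster must be moved over the bus.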
Good DataPath Configurations • The first paper is by Lapinskii et al. It tries to expose good datapath configurations among the many possible design choices • Break up the possible set of design decisions into design slices • Focus on parameters that have a first-order impact on key physical figures of merit • clock rate • power dissipation
Good DataPath Configurations • Each slice has the following properties (parameters) • cluster capacity • number of clusters • bus (interconnect) capacity • With their methodology they explore the different design decisions by varying these parameters
Software-Pipelining Algorithm • The next paper is by Akturan and Jacome. It presents a software-pipelining algorithm called CALiBeR • CALiBeR takes code, loop bodies in particular, and reschedules it so as to take advantage of the inherent ILP • it then binds the instructions to a given clustered datapath configuration
CALiBeR • Although CALiBeR is made for compilers targeting embedded VLIW processors • it can be applied more generally • It can handle heterogeneous clustered datapath configurations • clusters with any number of FU’s • clusters with any type of FU’s • multi-cycle FU’s • pipelined FU’s
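Software pipelining starts a new loop iteration every few cycles, and the datapath's resources put a floor under that interval. The sketch below computes the standard resource-constrained lower bound used in modulo scheduling. It is not CALiBeR itself, whose details are not given here. It assumes fully pipelined FU's that each accept one operation per cycle and ignores the cost of moving values between clusters; the FU type names and counts are invented.

```python
from collections import Counter
from math import ceil


def resource_min_interval(loop_ops, fu_counts):
    """Lower bound on cycles between successive loop iterations.

    loop_ops  -- FU type required by each operation in the loop body
    fu_counts -- FU type -> number of such units, summed over all clusters
    """
    demand = Counter(loop_ops)
    # The most heavily oversubscribed FU type sets the bound.
    return max(ceil(n / fu_counts[kind]) for kind, n in demand.items())


loop_body = ["alu"] * 5 + ["mul"] * 2 + ["mem"] * 3
datapath = {"alu": 2, "mul": 1, "mem": 2}   # heterogeneous FU mix (made up)
print(resource_min_interval(loop_body, datapath))   # 3
```

Here five ALU operations on two ALUs set the bound at three cycles per iteration; adding a third ALU would lower it to two.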