CS718: VLIW - Software Driven ILP
Example Architectures
6th Apr, 2006
Anshul Kumar, CSE IITD

Execution model - some issues
• Register access within an instruction
  • interaction between reads and writes within an instruction to the same register
• Operation completion under exception
  • which operations are completed when an exception occurs
• Exposing pipeline latencies
  • what latency information the compiler has

Register access in an instruction
• Read sees the original value of the register
  • allows swap of two registers in a single instruction
• Read sees the value written by the write
  • a pair of operations that read and write a pair of registers cannot be resolved
• Different operations that read and write the same register in an instruction are not allowed
  • parallel operations are not forced to execute in parallel
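
The first two options can be contrasted with a small sketch. It is an illustration only: the function names are hypothetical, and operations are reduced to register-to-register moves written as (destination, source) pairs.

```python
def execute_read_original(regs, ops):
    """Every read sees the value from before the instruction; writes commit together."""
    before = dict(regs)
    after = dict(regs)
    for dst, src in ops:
        after[dst] = before[src]
    return after


def execute_read_new(regs, ops):
    """A read of a register written in the same instruction sees the new value."""
    after = dict(regs)
    pending = list(ops)
    while pending:
        pending_dsts = {dst for dst, _ in pending}
        # An operation is ready once its source is not awaiting a write.
        ready = [(dst, src) for dst, src in pending if src not in pending_dsts]
        if not ready:
            raise ValueError("circular read/write dependence cannot be resolved")
        for dst, src in ready:
            after[dst] = after[src]
        pending = [op for op in pending if op not in ready]
    return after


swap = [("r1", "r2"), ("r2", "r1")]
swapped = execute_read_original({"r1": 1, "r2": 2}, swap)
chain = [("r2", "r1"), ("r3", "r2")]
chained_original = execute_read_original({"r1": 1, "r2": 2, "r3": 3}, chain)
chained_new = execute_read_new({"r1": 1, "r2": 2, "r3": 3}, chain)
```

Under the first semantics the swap completes in one instruction. Under the second, the same pair of operations raises an error because each operation waits for the other's write.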

Operation completion under exception
• None complete: simplest
• All that can complete, or all before the excepting operation, complete: complex (determine what remains to be fixed up)
• Free-for-all: no guarantees

Exposing pipeline latencies
• EQ model
  • the destination is written in a cycle which is known at compile time
• LEQ model
  • the destination may be written in any cycle up to the latency assumed at compile time
  • more permissive, allows some binary compatibility
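
The difference between the two models can be sketched by asking which values a read of the destination register may observe. This is a minimal illustration under assumed conventions that are not on the slide: the function name is hypothetical, a write landing in cycle issue + latency is visible to reads in that cycle, and under LEQ the actual latency is at least one cycle.

```python
def values_seen(model, issue_cycle, latency, read_cycle, old, new):
    """Return the set of values a read of the destination register may observe."""
    if model not in ("EQ", "LEQ"):
        raise ValueError(f"unknown model: {model}")
    if read_cycle >= issue_cycle + latency:
        return {new}            # the write has certainly landed
    if model == "EQ" or read_cycle <= issue_cycle:
        return {old}            # the write has certainly not landed yet
    return {old, new}           # LEQ: the write may or may not have landed


eq_window = values_seen("EQ", issue_cycle=0, latency=3, read_cycle=2, old="old", new="new")
leq_window = values_seen("LEQ", issue_cycle=0, latency=3, read_cycle=2, old="old", new="new")
```

Under EQ the compiler can still use the old value during the latency window. Under LEQ it has to treat the register as undefined in that window, which is what lets an implementation with a shorter latency run the same binary.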

VLIW Examples
• IA-64 and Itanium: HP and Intel
• Trimedia: Philips
• Transmeta Crusoe
• DSPs: Texas Instruments, Analog Devices

IA-64 Register Model
• 128 general purpose registers, 64 bit
• 128 floating point registers, 82 bit
• 64 predicate registers, 1 bit
• 8 branch registers (indirect branch), 64 bit
• Registers for system control, memory mapping, performance counters, communication with OS

Register Stack
• GPRs 0-31 always available
• GPRs 32-127 used as a stack
• GPRs and FPRs support register rotation for SW pipelining
[Figure: register stack frames i-1 and i, each with LOCAL and OUT regions]
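
The stacked registers can be sketched as a small model. This is an illustration under assumptions that the slide does not state: the class `RegisterStack` and its methods are hypothetical, a call is assumed to start the callee's frame at the caller's first OUT register (so the caller's outputs appear to the callee as r32 onward), and spilling to memory when the physical registers run out is not modeled.

```python
class RegisterStack:
    """Toy model of the stacked GPRs r32 and up (hypothetical helper, no spill/fill)."""

    def __init__(self):
        self.physical = {}   # physical slot -> value (unbounded here; hardware has 96)
        self.base = 0        # physical slot that the current frame sees as r32
        self.n_local = 0
        self.n_out = 0
        self.saved = []      # caller frames, restored on return

    def alloc(self, n_local, n_out):
        """Resize the current frame to n_local LOCAL and n_out OUT registers."""
        if n_local + n_out > 96:
            raise ValueError("frame does not fit in r32..r127")
        self.n_local, self.n_out = n_local, n_out

    def _slot(self, reg):
        if not 32 <= reg < 32 + self.n_local + self.n_out:
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + (reg - 32)

    def write(self, reg, value):
        self.physical[self._slot(reg)] = value

    def read(self, reg):
        return self.physical.get(self._slot(reg), 0)

    def call(self):
        """Enter a callee whose frame starts at the caller's first OUT register."""
        self.saved.append((self.base, self.n_local, self.n_out))
        self.base += self.n_local
        self.n_local = 0     # the callee sees only the caller's OUT registers

    def ret(self):
        """Restore the caller's frame."""
        self.base, self.n_local, self.n_out = self.saved.pop()


stack = RegisterStack()
stack.alloc(n_local=3, n_out=2)      # caller: r32-r34 LOCAL, r35-r36 OUT
stack.write(35, 42)                  # argument goes in the first OUT register
stack.call()
argument_seen_by_callee = stack.read(32)
stack.write(32, 7)                   # callee overwrites the shared register
stack.ret()
value_seen_by_caller = stack.read(35)
```

The point of the overlap is that arguments are passed without copying: the caller's r35 and the callee's r32 name the same physical register.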

IA-64 Execution Units
Execution Unit   Instruction Type   Description
I-unit           A                  Arithmetic (integer)
I-unit           I                  non-ALU int (shifts, tests, move)
M-unit           A                  Arithmetic (integer)
M-unit           M                  Memory (load/store)
F-unit           F                  Floating point
B-unit           B                  Branches, calls, loops
L+X              L+X                Extended immediates (executed by either B or I units)

Flexibility + explicit parallelism
• Compiler forms groups of instructions which can be executed in parallel if execution resources are available
• Instructions in a group may be scheduled in one or more cycles, depending upon resource availability
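
How one group can spread over several cycles can be sketched with a greedy, in-order issue model. This is a simplification for illustration: the function name is hypothetical, each instruction is represented only by the unit type it needs, and the unit counts are the ones listed for Itanium later in these slides (2 I, 2 M, 2 F, 3 B).

```python
def cycles_for_group(group, units):
    """Count cycles needed to issue one instruction group in order."""
    cycles = 0
    i = 0
    while i < len(group):
        free = dict(units)          # all units are available again each cycle
        issued = 0
        while i < len(group) and free.get(group[i], 0) > 0:
            free[group[i]] -= 1
            i += 1
            issued += 1
        if issued == 0:
            raise ValueError(f"no unit of type {group[i]!r} exists")
        cycles += 1
    return cycles


ITANIUM_UNITS = {"I": 2, "M": 2, "F": 2, "B": 3}
one_cycle = cycles_for_group(["M", "I", "F"], ITANIUM_UNITS)
two_cycles = cycles_for_group(["M", "M", "M", "I"], ITANIUM_UNITS)
```

The second group is legal as a single group, but it needs three M units, so a machine with two M units takes two cycles for it. The program itself does not change.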

Instruction Formats
• Instructions are encoded in 128 bit bundles
• Each bundle = 5 bit template + 3 × 41 bit instructions (5 + 3 × 41 = 128)
• 5 bit template field specifies execution unit types required for the 3 instructions and position of stops, if any
  • stops indicate the boundaries of instruction groups
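
The bundle arithmetic can be checked with a short packing sketch. The function names are hypothetical. The bit positions (template in the low 5 bits, followed by slots 0, 1 and 2) follow the IA-64 bundle layout and are not given on the slide.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5 bit template and three 41 bit instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instructions")
    word = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128 bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


bundle = pack_bundle(9, [0x1, 0x2, 0x3])
largest = pack_bundle(31, [SLOT_MASK] * 3)
```

The largest possible bundle sets all 128 bits, which confirms that the template and the three slots fill the bundle exactly.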

Template examples
Template   Slot 0   Slot 1   Slot 2   Stops
0          M        I        I        none
1          M        I        I        after slot 2
2          M        I        I        after slot 1
3          M        I        I        after slots 1 and 2
4          M        L        X        none
5          M        L        X        after slot 2
8          M        M        I        none
9          M        M        I        after slot 2
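
How the template's stops divide a bundle sequence into instruction groups can be sketched as follows. This is illustrative only: the function name and the operation labels are hypothetical, only a subset of templates is included, and the stop positions are taken from the IA-64 template encoding (odd-numbered templates end with a stop; templates 2 and 3 also have a stop after slot 1).

```python
# template number -> (unit type per slot, slot indices that are followed by a stop)
TEMPLATES = {
    0: ("MII", ()),
    1: ("MII", (2,)),
    2: ("MII", (1,)),
    3: ("MII", (1, 2)),
    8: ("MMI", ()),
    9: ("MMI", (2,)),
    14: ("MMF", ()),
    15: ("MMF", (2,)),
}


def instruction_groups(bundles):
    """Split a sequence of (template, [op0, op1, op2]) bundles at the stops."""
    groups = []
    current = []
    for template, ops in bundles:
        _, stops = TEMPLATES[template]
        for slot, op in enumerate(ops):
            current.append(op)
            if slot in stops:
                groups.append(current)
                current = []
    if current:                      # trailing operations with no closing stop
        groups.append(current)
    return groups


groups = instruction_groups([
    (9, ["ld_a", "ld_b", "nop"]),
    (14, ["ld_c", "ld_d", "add_a"]),
    (15, ["ld_e", "ld_f", "add_b"]),
])
```

A group can span several bundles (templates 14 then 15 above), and a single bundle can contain parts of two groups (template 2).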

Example Schedule 1
Template   Slot 0            Slot 1            Slot 2            Cycle
9: MMI     LD F0,0(R1)       LD F6,-8(R1)      -                 1
14: MMF    LD F10,-16(R1)    LD F14,-24(R1)    ADD F4,F0,F2      3
15: MMF    LD F18,-32(R1)    LD F22,-40(R1)    ADD F8,F6,F2      4
15: MMF    LD F26,-48(R1)    SD F4,0(R1)       ADD F12,F10,F2    6
15: MMF    SD F8,-8(R1)      SD F12,-16(R1)    ADD F16,F14,F2    9
15: MMF    SD F16,-24(R1)    -                 ADD F20,F18,F2    12
15: MMF    SD F20,-32(R1)    -                 ADD F24,F22,F2    15
15: MMF    SD F24,-40(R1)    -                 ADD F28,F26,F2    18
28: MFB    SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    21

Example Schedule 2
Template   Slot 0            Slot 1            Slot 2            Cycle
8: MMI     LD F0,0(R1)       LD F6,-8(R1)      -                 1
9: MMI     LD F10,-16(R1)    LD F14,-24(R1)    -                 2
14: MMF    LD F18,-32(R1)    LD F22,-40(R1)    ADD F4,F0,F2      3
14: MMF    LD F26,-48(R1)    -                 ADD F8,F6,F2      4
15: MMF    -                 -                 ADD F12,F10,F2    5
14: MMF    SD F4,0(R1)       -                 ADD F16,F14,F2    6
14: MMF    SD F8,-8(R1)      -                 ADD F20,F18,F2    7
15: MMF    SD F12,-16(R1)    -                 ADD F24,F22,F2    8
14: MMF    SD F16,-24(R1)    -                 ADD F28,F26,F2    9
9: MMI     SD F20,-32(R1)    SD F24,-40(R1)    -                 11
28: MFB    SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    12

Predication Support
• Almost all instructions predicated
• 6 bit field specifies predicate register
• Predicate registers are set by test instructions
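
Predicated execution can be sketched with a tiny interpreter. This is an illustration under assumptions: the tuple encoding and the operation names are hypothetical and are not IA-64 syntax, the compare writes a predicate and its complement, and p0 is taken as always true, as in IA-64.

```python
def execute(program, regs):
    """Run (qualifying predicate, op, dst, a, b) tuples; skip ops whose predicate is false."""
    preds = [False] * 64
    preds[0] = True                      # p0 is always true
    for qp, op, dst, a, b in program:
        if not preds[qp]:
            continue                     # a predicated-off operation has no effect
        if op == "cmp.lt":
            p_true, p_false = dst        # the test sets two complementary predicates
            preds[p_true] = regs[a] < regs[b]
            preds[p_false] = not preds[p_true]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        else:
            raise ValueError(f"unknown operation: {op}")
    return regs


# r3 = |r1 - r2| with no branch: both subtractions are present, exactly one takes effect.
ABS_DIFF = [
    (0, "cmp.lt", (1, 2), "r1", "r2"),
    (1, "sub", "r3", "r2", "r1"),
    (2, "sub", "r3", "r1", "r2"),
]
result_lt = execute(ABS_DIFF, {"r1": 3, "r2": 10, "r3": 0})
result_ge = execute(ABS_DIFF, {"r1": 10, "r2": 3, "r3": 0})
```

With 64 predicate registers, the qualifying predicate index fits in the 6 bit field mentioned on the slide.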

Speculation Support
• Control speculation using poison bit approach
  • One additional bit in GPRs - NaT (not a thing)
  • NaTVal in FPRs
  • Registers with NaT or NaTVal can’t be stored
  • special instructions to save and restore registers with poison bits/values
• Load/store speculation using advanced load instruction and ALAT (advanced load address table) with associative look up
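
Both mechanisms can be sketched in a few lines. This is a toy model with hypothetical names, not the IA-64 instructions themselves: a missing memory address stands in for a faulting access, the sentinel `NAT` stands in for the poison bit, and the ALAT is keyed by register and compares full addresses.

```python
NAT = object()   # stands in for the NaT poison bit / NaTVal


def speculative_load(memory, addr):
    """Load early; an access that would fault yields NAT instead of trapping."""
    return memory.get(addr, NAT)


def add(a, b):
    """Poison propagates through ordinary operations."""
    return NAT if a is NAT or b is NAT else a + b


def store(memory, addr, value):
    """An ordinary store of a poisoned register is not allowed."""
    if value is NAT:
        raise RuntimeError("attempt to store a poisoned value")
    memory[addr] = value


def check(value, recovery):
    """At the load's original position: run recovery code if the value is poisoned."""
    return recovery() if value is NAT else value


class ALAT:
    """Toy advanced load address table: register -> address of its advanced load."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, regs, reg, memory, addr):
        regs[reg] = memory[addr]
        self.entries[reg] = addr

    def store(self, memory, addr, value):
        memory[addr] = value
        # A store to a tracked address invalidates the matching entries.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, regs, reg, memory, addr):
        """Reload only if no valid entry for this register and address remains."""
        if self.entries.pop(reg, None) != addr:
            regs[reg] = memory[addr]


memory = {100: 5}
good = add(speculative_load(memory, 100), 1)
poisoned = add(speculative_load(memory, 999), 1)
recovered = check(poisoned, lambda: -1)

regs = {}
alat = ALAT()
alat.advanced_load(regs, "r4", memory, 100)   # load hoisted above the store
alat.store(memory, 100, 9)                    # the store turns out to alias
alat.check_load(regs, "r4", memory, 100)      # the check reloads the new value
```

In both cases the early load is cheap when speculation succeeds, and the check at the original position restores correctness when it does not.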

Itanium Processor
• Introduced in 2001 with 800 MHz clock
• 3 level cache: first level split, first 2 levels on-chip
• 2 I units, 2 M units, 3 B units, 2 F units
• 10 stage pipeline
• pre-fetch buffer with 8 bundles: 2 bundles pre-fetched per cycle
• up to 2 bundles issued at a time: up to 6 instructions distributed to 9 execution units, with register renaming (rotation and stacking)
• Good FP performance, but weaker integer performance

Trimedia TM32
• Designed for embedded applications
• Classic VLIW architecture, completely static scheduling
• 5 operation slots per instruction
  • each specifies an operation or immediate field
• no hazard detection hardware
• compressed code stored in memory and cache, decompressed during fetch
• each operation can be individually predicated
  • in an instruction with multiple branches, at most one predicate can be true
• no virtual memory

Trimedia Function Units
• 23 function units of 11 different types
  • min latency 0 (integer ALU)
  • max latency 16 (FP divide and square root)
• a function unit can be specified by only certain instruction slots
  • ALU (all), DMem (4, 5), Branch (2, 3, 4), DSPALU (1, 3), FALU (1, 4), FTough (2)
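
The slot restrictions turn instruction formation into a small assignment problem, sketched below. The function name is hypothetical. Only the slot table listed above is modeled; the number of function units of each type and their latencies are ignored.

```python
from itertools import permutations

# Slots (1 to 5) from which each function unit type can be specified, per the list above.
ALLOWED_SLOTS = {
    "ALU": {1, 2, 3, 4, 5},
    "DMem": {4, 5},
    "Branch": {2, 3, 4},
    "DSPALU": {1, 3},
    "FALU": {1, 4},
    "FTough": {2},
}


def assign_slots(units):
    """Return one slot per operation (in order), or None if they cannot share an instruction."""
    if len(units) > 5:
        return None
    for slots in permutations(range(1, 6), len(units)):
        if all(slot in ALLOWED_SLOTS[unit] for unit, slot in zip(units, slots)):
            return list(slots)
    return None


packed = assign_slots(["DMem", "DMem", "FTough", "DSPALU", "FALU"])
too_many_loads = assign_slots(["DMem", "DMem", "DMem"])
```

With no hazard detection hardware, a check of this kind has to happen in the compiler: three memory operations cannot be placed in one instruction because only slots 4 and 5 reach the DMem units.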

Transmeta Crusoe
• Designed for low power applications like mobile PC, mobile internet appliances
• compatibility with x86 through translating software
• 500 MHz to 1 GHz, 5 to 7 W power consumption
• 64 bit (2 operations) and 128 bit (4 operations) versions, 64 integer registers [new 256 bit Efficeon]
• Operation slot types: ALU, compute (int/fp/mm), Memory, Branch, Immediate
• Support for speculative re-ordering: shadow register file, program-controlled store buffer, memory alias detection, conditional move