Intro to the “c6x” VLIW processor

Texas Instruments TMS320C6000 series
TMS320C6700 subseries: includes floating point
VLIW = Very Long Instruction Word

Operations in Parallel
[Figure: registers feeding several function units, with bypassing]

Non-orthogonal
[Figure: two register files, A and B, each with its own bypass network and function units: L1, S1, M1, D1 with A; L2, S2, M2, D2 with B. See TI's picture.]

Specialized Function Units
• L units: arithmetic, compare, and logical ops
• S units: arithmetic, logical, branches, constant generation
• M units: multiplies
• D units: address generation / memory accesses
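Because there is only one instance of each unit, two operations that need the same unit cannot issue in the same cycle. The sketch below checks that one constraint only. The function name `packet_conflicts` and the idea of describing a packet as a list of unit names are assumptions made for illustration; real c6x packets are subject to further constraints that this does not model.

```python
# The eight function units named on the slides.
UNITS = {"L1", "S1", "M1", "D1", "L2", "S2", "M2", "D2"}


def packet_conflicts(units_used):
    """Return the units requested more than once in a single packet.

    units_used: list of unit names, one per instruction in the packet
    (hypothetical representation, chosen for this sketch).
    """
    seen = set()
    conflicts = []
    for unit in units_used:
        if unit not in UNITS:
            raise ValueError(f"unknown function unit: {unit}")
        if unit in seen and unit not in conflicts:
            conflicts.append(unit)
        seen.add(unit)
    return conflicts


# Four operations on four different units can share a packet.
print(packet_conflicts(["L1", "S1", "M1", "D2"]))  # prints []
# Two multiplies both asking for M1 cannot.
print(packet_conflicts(["M1", "M1", "S2"]))        # prints ['M1']
```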

Complicated hardware
[Figure: the two register files]

Explicit parallelism
[Figure: the two register files]

Simple VLIW encoding
• Slots that cannot be utilized are filled with no-ops
• Bad for code density, cache utilization, energy, ...
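To make the code-density cost concrete, here is a rough sketch that counts the no-op slots a fixed-width encoding would need. It assumes an 8-slot instruction word (one slot per function unit); the function name `nop_overhead` and the sample schedule are made up for illustration.

```python
ISSUE_WIDTH = 8  # assumed: one slot per function unit


def nop_overhead(ops_per_cycle):
    """Return (nop_slots, nop_fraction) for a fixed-width VLIW encoding.

    ops_per_cycle: number of useful operations scheduled in each cycle.
    Every cycle occupies ISSUE_WIDTH slots; unused slots hold no-ops.
    """
    for ops in ops_per_cycle:
        if not 0 <= ops <= ISSUE_WIDTH:
            raise ValueError(f"cannot issue {ops} ops in one cycle")
    total_slots = ISSUE_WIDTH * len(ops_per_cycle)
    useful_slots = sum(ops_per_cycle)
    nop_slots = total_slots - useful_slots
    return nop_slots, nop_slots / total_slots


# Hypothetical schedule: five cycles, mostly far from full width.
print(nop_overhead([3, 1, 2, 8, 1]))  # prints (25, 0.625)
```

In this made-up schedule, 25 of the 40 encoded slots are no-ops.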

C6X: Packets
• One bit of each instruction indicates whether the next instruction can be executed in parallel (0 = “EOP”, end of packet)
• Any slot can go to any function unit
• Packet cannot cross an 8-word boundary
• Resources constrain which instructions can be combined in the same packet
• You can branch into the middle of a packet!
[Figure: parallel bits of sixteen consecutive instruction words: 0 1 0 1 1 1 1 1 | 1 1 1 1 1 1 0 0]
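The grouping rule can be sketched as a short loop over the parallel bits. This is an illustration, not TI's dispatch logic: the function name `split_execute_packets` is made up, instruction words are identified by their index, and the sketch assumes that a packet is simply closed at every 8-word boundary even if the parallel bit there is 1.

```python
FETCH_PACKET_WORDS = 8  # packets may not cross an 8-word boundary


def split_execute_packets(p_bits):
    """Group instruction indices into packets of parallel instructions.

    p_bits[i] == 1 means instruction i+1 executes in parallel with
    instruction i; p_bits[i] == 0 marks the end of a packet.
    """
    packets = []
    current = []
    for index, p_bit in enumerate(p_bits):
        current.append(index)
        at_boundary = (index + 1) % FETCH_PACKET_WORDS == 0
        if p_bit == 0 or at_boundary:  # assumption: boundary forces a close
            packets.append(current)
            current = []
    if current:  # trailing instructions with no terminating 0 bit
        packets.append(current)
    return packets


# Hypothetical bit pattern for one 8-word group.
print(split_execute_packets([1, 1, 0, 0, 1, 1, 1, 0]))
# prints [[0, 1, 2], [3], [4, 5, 6, 7]]
```

Here eight instruction words take three cycles to issue, and no no-op words are stored.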

Explicit scheduling
Delay slots must be respected: no HW interlocks or scoreboarding
Multiply: 1 delay slot
Load: 4 delay slots
Branch: 5 delay slots
Right:                  Wrong:
B5 := B3 * B2           B5 := B3 * B2
<nop>                   B7 := B5 + B1
B7 := B5 + B1
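With no interlocks, an instruction that reads a register too early does not stall; it just sees the old value. The toy simulator below illustrates that for the multiply case. It is a sketch under stated assumptions: one instruction per cycle, only `mul` (1 delay slot) and `add` (0 delay slots) are modeled, and the names `run`, `DELAY_SLOTS` and the initial register values are invented for the example.

```python
DELAY_SLOTS = {"mul": 1, "add": 0}


def run(program, initial_regs):
    """Execute one instruction per cycle with no hardware interlocks.

    program: list of (op, dest, src1, src2) tuples, or None for a no-op.
    A result issued in cycle c becomes readable in cycle
    c + 1 + DELAY_SLOTS[op]; earlier readers see the old value.
    """
    regs = dict(initial_regs)
    pending = []  # (ready_cycle, dest, value)
    for cycle, instr in enumerate(program):
        still_pending = []
        for ready_cycle, dest, value in pending:
            if ready_cycle <= cycle:
                regs[dest] = value  # result lands in the register file
            else:
                still_pending.append((ready_cycle, dest, value))
        pending = still_pending
        if instr is None:
            continue
        op, dest, src1, src2 = instr
        if op == "mul":
            value = regs[src1] * regs[src2]
        elif op == "add":
            value = regs[src1] + regs[src2]
        else:
            raise ValueError(f"unsupported op: {op}")
        pending.append((cycle + 1 + DELAY_SLOTS[op], dest, value))
    # Drain results still in flight when the program ends.
    for _, dest, value in sorted(pending, key=lambda item: item[0]):
        regs[dest] = value
    return regs


start = {"B1": 1, "B2": 3, "B3": 2, "B5": 100, "B7": 0}
mul = ("mul", "B5", "B3", "B2")
add = ("add", "B7", "B5", "B1")

right = run([mul, None, add], start)
wrong = run([mul, add], start)
print(right["B7"])  # prints 7: the add sees B5 = 2 * 3
print(wrong["B7"])  # prints 101: the add sees the stale B5 = 100
```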

Predicated execution
Why? To get rid of branches (5 delay slots * 8 wide ...)
Basic idea: a comparison result is stored to a condition register; this register is then used as an operand of other instructions, and its value causes those operations to be selectively enabled or squashed.
[Condition registers: A1, A2, B0, B1, B2]
Example: if (B3 < B4) B3++ else B4++

Predicated execution
With branches:
      cmp B3, B4
      bge L2
      <nop>
      B3 := B3+1
      b DONE
      <nop>
L2:   B4 := B4+1
DONE:
With predicates:
      cmplt B3, B4, B0
      [B0]  B3 := B3+1
      [!B0] B4 := B4+1
...and the last two can be issued in parallel!
Control dependency has been converted to data dependency...
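The same transformation can be mimicked in a high-level language to check that both versions agree. This is only an analogy: the function names are made up, and Python's conditional expression stands in for the hardware squashing a predicated operation.

```python
def with_branches(b3, b4):
    """if (B3 < B4) B3++ else B4++, written with control flow."""
    if b3 < b4:
        b3 += 1
    else:
        b4 += 1
    return b3, b4


def with_predicates(b3, b4):
    """Same computation with the condition turned into a data value."""
    b0 = 1 if b3 < b4 else 0          # cmplt B3, B4, B0
    # Both updates are "issued"; each takes effect only if its
    # predicate holds, otherwise the old value is kept (squashed).
    new_b3 = b3 + 1 if b0 else b3     # [B0]  B3 := B3+1
    new_b4 = b4 + 1 if not b0 else b4  # [!B0] B4 := B4+1
    return new_b3, new_b4


print(with_predicates(4, 5))  # prints (5, 5)
print(with_predicates(5, 4))  # prints (5, 5)
```

Neither predicated update reads the other's result, which is why the two can share a cycle.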

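Assembly details
        .text
        .align 32
        .global proc
proc:
        mvk     4, b3           ; b3 = 4
        mvk     5, b4           ; b4 = 5
        cmpgt   b3, b4, b0      ; b0 = (b3 > b4)
   [ b0] mvk.S2 9, b5           ; if b0: b5 = 9
|| [!b0] mvk.S1 8, a5           ; in parallel, if !b0: a5 = 8
        stw     a5, *-a15[4]    ; store a5 four words below a15
        .....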

Fetch/execute pipeline
PG   generate program address
PS   program address send
PW   program memory access
PR   fetch reaches CPU boundary
DP   instruction dispatch
DC   instruction decode
E1   execute 1
E2   execute 2
E3   execute 3
E4   execute 4
E5   execute 5

Addressing Modes
Mode             C equivalent
*R               (*R)
*+R[ucst5]       (R[ucst5])
*-R[ucst5]       (R[-ucst5])
*+R[offsetR]     (R[offsetR])
*-R[offsetR]     (R[-offsetR])
Special case, 15-bit offsets: *+B15[ucst15], *+B14[ucst15]

Addressing Modes
Pre/post increment/decrement
*++R, *R++
*++R[ucst5], *R++[ucst5]
*--R[ucst5], *R--[ucst5]
*++R[offsetR], *R++[offsetR]
*--R[offsetR], *R--[offsetR]
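The difference between the plain-offset, pre-modify and post-modify forms comes down to which address is accessed and whether the base register changes. The sketch below models that, counting in element units in the same way as the C equivalents (R[k]) rather than in bytes. The function name `effective_index`, the mode strings and the register dictionary are all invented for illustration; an omitted offset is treated as 1.

```python
def effective_index(mode, regs, base, offset=1):
    """Return the element index accessed and update regs[base] if needed.

    mode:   one of "*+R", "*-R", "*++R", "*--R", "*R++", "*R--"
    regs:   dict mapping register names to element indices
    offset: the bracketed value (constant or offset-register contents)
    """
    r = regs[base]
    if mode == "*+R":    # offset applied, base unchanged
        return r + offset
    if mode == "*-R":
        return r - offset
    if mode == "*++R":   # pre-increment: update base, then access
        regs[base] = r + offset
        return regs[base]
    if mode == "*--R":   # pre-decrement
        regs[base] = r - offset
        return regs[base]
    if mode == "*R++":   # post-increment: access, then update base
        regs[base] = r + offset
        return r
    if mode == "*R--":   # post-decrement
        regs[base] = r - offset
        return r
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"A4": 10}
print(effective_index("*R++", regs, "A4", 2), regs["A4"])  # prints 10 12
print(effective_index("*++R", regs, "A4"), regs["A4"])     # prints 13 13
print(effective_index("*-R", regs, "A4", 4), regs["A4"])   # prints 9 13
```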

Resources http://www.cs.cmu.edu/~tcal/15745/
