An introduction to the Texas Instruments TMS320C6000 ("c6x") series of VLIW processors: operations in parallel, bypassing, the split register files, specialized function units, instruction packets, explicit scheduling, predicated execution, the fetch/execute pipeline, addressing modes, and assembly details.
Intro to the “c6x” VLIW processor
• Texas Instruments TMS320C6000 series
• TMS320C6700 subseries: includes floating point
• VLIW = Very Long Instruction Word
Operations in Parallel
[Figure: registers, bypassing, function units]
Non-orthogonal
[Figure: two register files, A and B, each with its own bypass and function units: L1, S1, M1, D1 and L2, S2, M2, D2. See TI's picture.]
Specialized Function Units
• L units: arithmetic, compare, and logical ops
• S units: arithmetic, logical, branches, constant generation
• M units: multiplies
• D units: address generation / memory accesses
Complicated hardware
[Figure: registers]
Explicit parallelism
[Figure: registers]
Simple VLIW encoding
• Slots that cannot be utilized are filled with no-ops
• Bad for code density, cache utilization, energy, ...
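To make the code-density cost concrete, here is a minimal sketch that counts the no-op slots a fixed-width encoding would need. The function name, the schedule contents, and the width of 8 slots are illustrative assumptions, not figures from the slides.

```python
def padded_encoding_stats(schedule, width=8):
    """Count no-op padding in a fixed-width VLIW encoding.

    schedule: list of cycles, each a list of the useful operations
              issued in that cycle (hypothetical mnemonics).
    width:    number of slots per instruction word (assumed to be 8).
    Returns (number of no-op slots, fraction of slots doing useful work).
    """
    for cycle in schedule:
        if len(cycle) > width:
            raise ValueError("more operations than slots in one cycle")
    total_slots = len(schedule) * width
    useful = sum(len(cycle) for cycle in schedule)
    return total_slots - useful, useful / total_slots


# Three cycles with 3, 1 and 4 useful operations.
schedule = [["add", "mpy", "ldw"], ["add"], ["sub", "shl", "stw", "b"]]
nops, utilization = padded_encoding_stats(schedule)
print(nops, round(utilization, 2))
```

In this made-up schedule two thirds of the encoded slots are no-ops, which is the overhead the packet scheme on the next slide is meant to avoid.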
C6X: Packets
• One bit of each instruction indicates whether the next instruction can be executed in parallel (0 = “EOP”, end of packet)
• Any slot can go to any function unit
• Example bit patterns from the figure: 0 1 0 1 1 1 1 1 and 1 1 1 1 1 1 0 0
• Packet cannot cross an 8-word boundary
• Resources constrain which instructions can be combined in the same packet
• You can branch into the middle of a packet!
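The packet rules can be sketched as a small decoder. This is an illustration under stated assumptions, not TI's dispatch logic: the (unit, p_bit) tuple format and the function name are hypothetical, instructions are listed in program order, and the only resource constraint modeled is "each function unit at most once per packet".

```python
FUNCTION_UNITS = ("L1", "S1", "M1", "D1", "L2", "S2", "M2", "D2")
FETCH_PACKET_WORDS = 8  # a packet cannot cross an 8-word boundary


def split_execute_packets(fetch_packet):
    """Split one 8-word fetch packet into groups that execute in parallel.

    fetch_packet: list of (unit, p_bit) pairs in program order.
    p_bit == 1 means the next instruction runs in parallel with this one;
    p_bit == 0 marks the end of the packet (EOP).
    """
    if len(fetch_packet) != FETCH_PACKET_WORDS:
        raise ValueError("a fetch packet holds exactly 8 instructions")
    packets, current = [], []
    for unit, p_bit in fetch_packet:
        if unit not in FUNCTION_UNITS:
            raise ValueError(f"unknown function unit {unit}")
        if unit in current:
            raise ValueError(f"{unit} used twice in one packet")
        current.append(unit)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:
        # The last p-bit was 1, so the packet would spill past the boundary.
        raise ValueError("packet crosses the 8-word boundary")
    return packets


fetch = [("L1", 1), ("S1", 1), ("M1", 0), ("D1", 0),
         ("L2", 1), ("S2", 1), ("M2", 1), ("D2", 0)]
for packet in split_execute_packets(fetch):
    print(" || ".join(packet))
```

Because the grouping is carried by one bit per instruction, no slot has to be padded with a no-op just to fill out a fixed-width word.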
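Explicit scheduling
Delay slots must be respected: there are no HW interlocks and no scoreboarding.
• Multiply: 1 delay slot
• Load: 4 delay slots
• Branch: 5 delay slots
Right:
    B5 := B3 * B2
    <nop>
    B7 := B5 + B1
Wrong:
    B5 := B3 * B2
    B7 := B5 + B1
In the wrong version the add issues in the multiply's delay slot, so it reads the old value of B5.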
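The effect of ignoring a delay slot can be shown with a toy simulator. This is a sketch under simplifying assumptions: one instruction issues per cycle, only multiply (1 delay slot) and add (0 delay slots) are modeled, and the tuple-based program format is made up for illustration.

```python
DELAY_SLOTS = {"mpy": 1, "add": 0}
OPERATIONS = {"mpy": lambda a, b: a * b, "add": lambda a, b: a + b}


def run(program, registers):
    """Execute one instruction per cycle with no interlocks.

    program: list of ("nop",) or (op, dst, src1, src2) tuples.
    A result becomes visible DELAY_SLOTS[op] + 1 cycles after issue;
    instructions issued earlier than that read the old register value.
    """
    regs = dict(registers)
    pending = []  # (cycle at which the value becomes visible, dst, value)
    for cycle, instr in enumerate(program):
        # Commit results whose delay slots have elapsed.
        for _, dst, value in [p for p in pending if p[0] <= cycle]:
            regs[dst] = value
        pending = [p for p in pending if p[0] > cycle]
        if instr[0] == "nop":
            continue
        op, dst, src1, src2 = instr
        value = OPERATIONS[op](regs[src1], regs[src2])
        pending.append((cycle + DELAY_SLOTS[op] + 1, dst, value))
    for _, dst, value in sorted(pending):  # drain results still in flight
        regs[dst] = value
    return regs


initial = {"B1": 1, "B2": 3, "B3": 4, "B5": 100, "B7": 0}
right = [("mpy", "B5", "B3", "B2"), ("nop",), ("add", "B7", "B5", "B1")]
wrong = [("mpy", "B5", "B3", "B2"), ("add", "B7", "B5", "B1")]
print(run(right, initial)["B7"])  # uses the product 4 * 3
print(run(wrong, initial)["B7"])  # uses the stale B5 = 100
```

Both programs finish without any error or stall; the wrong one simply computes with stale data, which is why the scheduling burden falls on the compiler or programmer.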
Predicated execution
Why? To get rid of branches (5 delay slots * 8 wide ....)
Basic idea: a comparison result is stored to a condition register; this register is then used as an operand of other instructions, and its value causes those operations to be selectively enabled or squashed.
[Condition registers: A1, A2, B0, B1, B2]
Example: if (B3 < B4) B3++ else B4++
Predicated execution
With branches:
        cmp B3, B4
        bge L2
        <nop>
        B3 := B3+1
        b DONE
        <nop>
    L2: B4 := B4+1
    DONE:
With predicates:
        cmplt B3, B4, B0
        [B0]  B3 := B3+1
        [!B0] B4 := B4+1
...and the last two can be issued in parallel!
Control dependency has been converted to data dependency...
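The equivalence of the two versions can be checked with a short sketch. The function names are hypothetical, and the registers are modeled as plain Python integers rather than 32-bit hardware registers.

```python
def with_branches(b3, b4):
    """Reference semantics: if (B3 < B4) B3++ else B4++."""
    if b3 < b4:
        b3 += 1
    else:
        b4 += 1
    return b3, b4


def with_predicates(b3, b4):
    """Same computation with no control flow between the two updates."""
    b0 = 1 if b3 < b4 else 0             # cmplt B3, B4, B0
    # Both guarded operations read the old register values, as they would
    # if issued in parallel; the predicate decides which result is kept.
    new_b3 = b3 + 1 if b0 != 0 else b3   # [B0]  B3 := B3+1
    new_b4 = b4 + 1 if b0 == 0 else b4   # [!B0] B4 := B4+1
    return new_b3, new_b4


for b3 in range(-3, 4):
    for b4 in range(-3, 4):
        assert with_predicates(b3, b4) == with_branches(b3, b4)
print(with_predicates(2, 5), with_predicates(5, 2))
```

Each guarded update depends only on the value of B0, so the ordering constraint between the two is a data dependency on the compare rather than a branch.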
Assembly details
            .text
            .align 32
            .global proc
    proc:
            mvk 4, b3
            mvk 5, b4
            cmpgt b3, b4, b0
     [ b0]  mvk.S2 9, b5
  || [!b0]  mvk.S1 8, a5
            stw a5, *-a15[4]
            .....
Fetch/execute pipeline
PG  generate program address
PS  program address send
PW  program memory access
PR  fetch reaches CPU boundary
DP  instruction dispatch
DC  instruction decode
E1  execute 1
E2  execute 2
E3  execute 3
E4  execute 4
E5  execute 5
Addressing Modes
Mode              C equivalent
*R                (*R)
*+R[ucst5]        (R[ucst5])
*-R[ucst5]        (R[-ucst5])
*+R[offsetR]      (R[offsetR])
*-R[offsetR]      (R[-offsetR])
Special case, 15-bit offsets: *+B15[ucst15], *+B14[ucst15]
Addressing Modes
Pre/post increment/decrement
*++R, *R++
*++R[ucst5], *R++[ucst5]
*--R[ucst5], *R--[ucst5]
*++R[offsetR], *R++[offsetR]
*--R[offsetR], *R--[offsetR]
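The address arithmetic behind these modes can be sketched as follows. The function name and mode strings are illustrative, the offset n is always passed explicitly, and the bracketed offset is assumed to be scaled by the element size, as the C array-indexing equivalents suggest (4 bytes for a word access).

```python
def access(mode, base, n, elem_size=4):
    """Return (address accessed, value of the base register afterwards).

    mode:      one of the keys below, written with R for the base register
               and n for the bracketed offset.
    base:      current value of the base register (a byte address).
    n:         offset in elements, either a constant or a register value.
    elem_size: bytes per element (assumed 4, i.e. a word access).
    """
    step = n * elem_size
    table = {
        "*+R[n]":  (base + step, base),          # offset, base unchanged
        "*-R[n]":  (base - step, base),
        "*++R[n]": (base + step, base + step),   # pre-increment
        "*--R[n]": (base - step, base - step),   # pre-decrement
        "*R++[n]": (base, base + step),          # post-increment
        "*R--[n]": (base, base - step),          # post-decrement
    }
    return table[mode]


# stw a5, *-a15[4] with a hypothetical a15 = 0x2000:
address, a15 = access("*-R[n]", 0x2000, 4)
print(hex(address), hex(a15))
```

With the made-up base value, the store from the assembly slide lands 16 bytes below a15 and leaves a15 itself unchanged; the increment and decrement forms are the ones that write the updated address back to the base register.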
Resources http://www.cs.cmu.edu/~tcal/15745/