TI C6701 VLIW MIMD

Presentation Outline
• Introduction / Overview
• Differentiating Features
• Assembly Syntax
• Instruction Flow
• Pipelining and Optimization
• Conclusion

Introduction
• TI’s C6000 family
• VLIW architectures
• Flexibility from software

Characteristics Chart

Characteristic                              Value
Architecture                                VLIW
FPU                                         Yes
MFLOPs (Peak)                               1000
16x16 MACs (MMAC/s)                         334
8x8 MACs (MMAC/s)                           334
MIPS (Peak)                                 1336
MOPS (Peak)                                 336
Memory Bus Bandwidth (MB/s)                 332
1K FP cfft (µsec)                           108
1K 16 bit cfft (µsec)                       108
1K FP dot product (µsec)                    3.07
1K 16 bit dot product (µsec)                3.07
512² x FP Conv3x3 (msec)                    7.11
512² x 8 bit Conv3x3 (msec)                 7.11
512² x 8 bit Erosion/Dilation (msec)        3.62

Figure 1: TI Data Sheet

Basic Overview
• Eight 32-bit instructions are fetched per clock cycle; together they form a fetch packet
• Two multipliers and six ALUs for execution
• Two general-purpose register files (A and B)
• Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2)
• Two load-from-memory data paths per register file (LD1a, LD1b, LD2a, LD2b)
• Two data address paths (DA1 and DA2)
• Two register file data cross paths (1X and 2X)
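The unit counts above can be checked against the peak figures in the characteristics chart with a little arithmetic. The sketch below assumes that peak MIPS means one fetch packet of eight instructions issued every cycle, and that peak MMAC/s means both multipliers busy every cycle. The clock rate it derives is an inference from those two assumptions, not a number stated on the slides.

```python
# Figures copied from the characteristics chart.
PEAK_MIPS = 1336
PEAK_MMACS = 334

# Unit counts copied from the Basic Overview slide.
INSTRUCTIONS_PER_CYCLE = 8
MULTIPLIERS = 2


def implied_clock_mhz(peak_mips, instructions_per_cycle):
    """Clock rate implied by issuing a full fetch packet every cycle (assumption)."""
    return peak_mips / instructions_per_cycle


clock_mhz = implied_clock_mhz(PEAK_MIPS, INSTRUCTIONS_PER_CYCLE)
peak_macs_from_clock = MULTIPLIERS * clock_mhz

print(clock_mhz)             # 167.0
print(peak_macs_from_clock)  # 334.0
```

Under these assumptions the two chart entries agree with each other: the clock rate implied by the MIPS figure, multiplied by the two multipliers, reproduces the MMAC/s figure.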

Architecture Overview

Differentiating Features
• The features that differentiate the TI part from other VLIW architectures are:
  • Execute packets of variable length (one to eight fixed-width 32-bit instructions)
  • Predication available on all instructions
  • Pipelined branches
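Predication can be pictured with a small model. The sketch below assumes the usual C6000 convention that an instruction guarded by `[reg]` takes effect when the register is nonzero and one guarded by `[!reg]` takes effect when it is zero. The function name `run_predicated` and the dictionary-based register file are hypothetical, introduced only for illustration.

```python
def run_predicated(regs, condition, dest, operation):
    """Apply `operation` to `regs` only if the guard condition holds.

    condition: None (unconditional), "B0" (run if B0 != 0) or "!B0" (run if B0 == 0).
    Returns True if the instruction took effect, False if it was nullified.
    """
    if condition is not None:
        negated = condition.startswith("!")
        nonzero = regs[condition.lstrip("!")] != 0
        if nonzero == negated:
            return False  # guard is false: the instruction has no effect
    regs[dest] = operation(regs)
    return True


regs = {"A1": 2, "A2": 3, "A3": 0, "B0": 1}
took_effect = run_predicated(regs, "B0", "A3", lambda r: r["A1"] + r["A2"])
skipped = run_predicated(regs, "!B0", "A3", lambda r: 0)
print(took_effect, skipped, regs["A3"])  # True False 5
```

A nullified instruction still occupies its slot in the execute packet; the model only captures that its result is discarded.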

Assembly Syntax
• Label
• Parallel Bars
• Conditions
• Instruction
• Functional Unit
• Operands
• Comments
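The seven fields listed above appear on a source line in that order. As a rough illustration, the sketch below splits a line into those fields. It assumes whitespace between fields, a label that starts in column 1, `||` for the parallel bars, square brackets around the condition, a functional unit that starts with a dot, and `;` to start a comment. The parser `parse_line` and the two sample lines are hypothetical and are not taken from the original slides.

```python
def parse_line(line):
    """Split one assembly source line into the seven fields of the syntax slide."""
    code, sep, comment = line.partition(";")
    tokens = code.split()
    fields = {
        "label": None, "parallel": False, "condition": None,
        "mnemonic": None, "unit": None, "operands": [],
        "comment": comment.strip() if sep else None,
    }
    # A label is only recognized when the text starts in column 1.
    if tokens and not code[0].isspace() and tokens[0] != "||":
        fields["label"] = tokens.pop(0).rstrip(":")
    if tokens and tokens[0] == "||":
        fields["parallel"] = True
        tokens.pop(0)
    if tokens and tokens[0].startswith("["):
        fields["condition"] = tokens.pop(0).strip("[]")
    if tokens:
        fields["mnemonic"] = tokens.pop(0)
    if tokens and tokens[0].startswith("."):
        fields["unit"] = tokens.pop(0)
    # Whatever remains is the comma-separated operand list.
    fields["operands"] = [op for op in "".join(tokens).split(",") if op]
    return fields


first = parse_line("LOOP:   [B0] ADD .L1 A1,A2,A3 ; accumulate")
second = parse_line("     || MPY .M2 B4,B5,B6")
print(first["label"], first["condition"], first["mnemonic"], first["unit"])
print(second["parallel"], second["mnemonic"], second["operands"])
```

The second sample line begins with the parallel bars, which marks it as executing in the same cycle as the line before it.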

Assembly Example

Instruction Flow
• Eight functional units, organized as two groups of four
• Each group has its own data path and its own half of the general-purpose registers
• The paired units are named .L1 and .L2, .M1 and .M2, .S1 and .S2, and .D1 and .D2
• The .L units are responsible for:
  • Logical operations
  • Data packing and unpacking
  • Some arithmetic

Instruction Flow
• 32 general-purpose registers
• 64-bit loads using the LDDW instruction
• LD1a carries the least-significant 32 bits and LD1b carries the most-significant 32 bits
• The .D unit data paths are cross-connected, so a load or store can use either register file for its data regardless of which side supplied the address
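The way a 64-bit LDDW result travels over two 32-bit load paths can be sketched as follows. The helper names `split_doubleword` and `join_doubleword` are hypothetical; the only behaviour taken from the slide is that one path carries the least-significant 32 bits and the other the most-significant 32 bits.

```python
MASK32 = 0xFFFFFFFF


def split_doubleword(value64):
    """Return (low, high): the halves carried on LD1a and LD1b respectively."""
    low = value64 & MASK32            # least-significant 32 bits (LD1a)
    high = (value64 >> 32) & MASK32   # most-significant 32 bits (LD1b)
    return low, high


def join_doubleword(low, high):
    """Rebuild the 64-bit value from its two 32-bit halves."""
    return (high << 32) | low


low, high = split_doubleword(0x0123456789ABCDEF)
print(hex(low), hex(high))  # 0x89abcdef 0x1234567
```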

Instruction Flow
• Fetch packets are aligned on 256-bit boundaries
• Important: an execute packet cannot cross a fetch packet boundary
• Execute packets are formed from the p-bit (bit 0, the least-significant bit) of each instruction: a set p-bit means the next instruction executes in parallel with this one
• At most eight instructions execute in parallel
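The grouping rule can be sketched in a few lines. The code below assumes that the p-bit is bit 0 (the least-significant bit) of each 32-bit instruction word and that a set p-bit chains an instruction to the one that follows it, as in TI's C6000 encoding. The function name `split_execute_packets` and the dummy instruction words are hypothetical; only the p-bit of each word is meaningful here.

```python
def split_execute_packets(fetch_packet):
    """Group the eight words of a fetch packet into execute packets by p-bit."""
    if len(fetch_packet) != 8:
        raise ValueError("a fetch packet holds exactly eight 32-bit instructions")
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p-bit clear: this instruction ends the packet
            packets.append(current)
            current = []
    if current:
        # The last p-bit was set, so the packet would spill into the next fetch packet.
        raise ValueError("execute packet would cross the fetch packet boundary")
    return packets


# Dummy words: upper bits are just an index, bit 0 is the p-bit.
p_bits = [1, 1, 0, 0, 1, 0, 0, 0]
words = [(i << 1) | p for i, p in enumerate(p_bits)]
sizes = [len(packet) for packet in split_execute_packets(words)]
print(sizes)  # [3, 1, 2, 1, 1]
```

With every p-bit clear the fetch packet runs as eight serial execute packets; with the first seven set it runs as a single eight-wide packet.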

Architecture Overview

Pipelining & Optimization
• The C6701 has no hardware to look ahead and schedule instructions; the schedule is fixed in the code
• Filling each execute packet with as many instructions as possible is the key to optimizing the code
• The number of extra cycles before an instruction's result becomes available is called its number of delay slots
• Multi-cycle instructions have nonzero delay slot counts, so their results cannot be used by the very next execute packet

Pipelining & Optimization
• An execute packet may contain NOPs
• Padding the delay slots of a multi-cycle instruction with NOPs ensures that the next execute packet can use that instruction's result
• A cross path used by a multi-cycle instruction cannot be used again until that instruction has finished
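The padding described above can be sketched as a tiny scheduler that issues one instruction per cycle and inserts NOPs whenever a source register is not ready yet. The delay-slot table is an assumption drawn from TI's published C6000 documentation (0 for a single-cycle ADD, 1 for a 16x16 MPY, 4 for an LDW load, 5 for a branch) and does not appear on the slides. The function `pad_with_nops` is hypothetical and ignores parallel issue and cross-path limits.

```python
# Assumed delay slots per mnemonic (not given on the slides).
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4, "B": 5}


def pad_with_nops(program):
    """Insert NOPs so every instruction reads its sources only once they are ready.

    program: list of (mnemonic, dest_register, [source_registers]),
    issued strictly one instruction per cycle.
    """
    ready = {}    # register -> first cycle in which its new value can be read
    cycle = 0
    scheduled = []
    for mnemonic, dest, sources in program:
        start = max([cycle] + [ready.get(src, 0) for src in sources])
        if start > cycle:
            scheduled.append(f"NOP {start - cycle}")
        scheduled.append(mnemonic)
        ready[dest] = start + 1 + DELAY_SLOTS[mnemonic]
        cycle = start + 1
    return scheduled


program = [
    ("LDW", "A5", ["A4"]),        # load a word, result usable 4 delay slots later
    ("MPY", "A6", ["A5", "A5"]),  # needs the loaded value
    ("ADD", "A7", ["A6", "A7"]),  # needs the product
]
print(pad_with_nops(program))  # ['LDW', 'NOP 4', 'MPY', 'NOP 1', 'ADD']
```

An optimized schedule would replace those NOP cycles with independent instructions, which is the point of filling execute packets as densely as possible.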

Execution Pipeline

AD vs. TI vs. Motorola

Conclusion
• The C6701 lets the programmer schedule instructions directly in the assembly code
• Unfortunately, scheduling them well still requires a good understanding of the hardware
• Thank you
