Implementing the Viterbi algorithm on programmable processors

Sridhar Rajagopal • Elec 696 • sridhar@rice.edu

Motivation • Viterbi decoding - one of the major bottlenecks in baseband processing [PHY] • Need for flexibility in the algorithm parameters because protocols differ (read: programmable) • No architecture developed yet to meet real-time requirements of 3G systems • 2 - 8 Mbps range for wideband CDMA • 100 Mbps range for wireless LAN

Today • Background • Advanced DSP architectures -- TI C6x [15] • Viterbi algorithm basics [10] • Viterbi on TI DSPs [10] • A programmable processor specifically designed for Viterbi [15]

TI C6x architecture • [Figure: 4-wide VLIW; instructions Inst 1 to Inst 4 issue to functional units FU 1 to FU 4] • VLIW [Very Long Instruction Word] arch. • Similar to a vector processor -- but • multiple instructions -> multiple Func. Units • FUs are not all the same • 32-bit architecture • 8 functional units

8 VelociTI principles • Parallel fetch, decode and execute • Pipelined enough to make ADD critical path • Instructions based on RISC • Load - Store architecture • Orthogonal - Instruction Set and Reg. File • Determinism • Conditional Instructions • Instruction Packing

2 * 4 = 8 Functional Units • .M Multiplication unit • 16 bit x 16 bit, signed/unsigned, packed/unpacked • .L Arithmetic Logic unit • Comparisons and logic operations • Saturation arithmetic and absolute value • .S Shifter unit • Bit manipulation (set, get, shift, rotate) • Branching, addition and packed addition • .D Data unit • Load/store to memory • Addition and pointer arithmetic

How powerful am I? • 8 instructions per cycle • Max: • 6 adds per cycle • 2 multiplies per cycle • 2 load/stores per cycle • 2 branches per cycle • Idea: code must use instructions in roughly these ratios to get full FU utilization.

C6x DSP Core

C6x Datapath

C6x Resource Constraints • Instructions using the same FU • 1 inst. / FU • Cross Paths • only 1 operand from other reg. file to (L,S,M) • Loads and stores • 2 loads and stores from 2 different reg. files • Reads and writes • max 4-reads from the same register • No 2 writes to the same register :)

Instruction Packing • [Figure: fetch packet holding instructions A to H, each with its p-bit] • Fetch Packet • Execute Packet • Avoid NOPs in the instruction code • Multi-cycle NOPs if absolutely necessary • LSB- “p” bit of instruction for packing • Example: A || B || C, D || E, F, G || H • 8 instructions instead of 32
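A minimal sketch of how p-bits split one fetch packet into execute packets, assuming the usual C6x convention that a p-bit of 1 means "the next instruction runs in parallel with this one". The function name and the p-bit values below are illustrative: the bits were chosen to reproduce the grouping A || B || C, D || E, F, G || H and are not read from the figure.

```python
def split_execute_packets(names, p_bits):
    """Group the instructions of a fetch packet into execute packets.

    p_bits[i] == 1 means instruction i+1 executes in parallel with instruction i;
    p_bits[i] == 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p in zip(names, p_bits):
        current.append(name)
        if p == 0:
            packets.append(current)
            current = []
    if current:  # a trailing p = 1 is kept as its own group in this sketch
        packets.append(current)
    return packets


names = list("ABCDEFGH")
p_bits = [1, 1, 0, 1, 0, 0, 1, 0]  # assumed values, chosen to match the slide's grouping
packets = split_execute_packets(names, p_bits)
print(" , ".join(" || ".join(p) for p in packets))
```

The four execute packets fit in a single 8-instruction fetch packet. A fixed-width 8-issue VLIW without packing would pad each of the four packets to 8 slots with NOPs, which is where the 32 comes from.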

Conditional Instructions • All instructions can be conditioned based on the value in registers A1,A2,B0,B1,B2 • Avoids branch latencies • If condition not met by end of first phase of execution, results not written back to reg. file • Conditional loads/stores squashed before data phase

C6x Pipeline • Fetch (if necessary) - 4 phases • Address Generate • Address Send • Access Ready Wait • Fetch Packet Receive • Decode - 2 phases • Instruction dispatch (if necessary) • Instruction decode • Execute - up to 10 phases • Most instructions need only 1 phase

Some interesting instructions • Saturation • Bit-counting -- Image coding • Integer-comparison • Bit-manipulation • Seed generation for reciprocal instructions

Other details • 64 KB internal program and data memory • DMA - peripherals to memory • Intrinsics in code for better programming • similar to using VIS in UltraSPARC • Software pipelining of loops • PERFORMANCE: • 5-10X • higher clock -- deeper pipeline (2-4X) • Additional ALUs

Additional features in C64x • SIMD support • Communication-specific instructions • interleaving, Galois field multiply • Bit count and rotate hardware • 64 32-bit registers • Fewer resource constraints • No more padding NOPs needed [execute packets may cross fetch packet boundaries]

C64x DSP Core

Viterbi Decoding • [Figure: k input bits -> Encoder -> n coded bits (n > k) -> Decoder -> k decoded bits] • Code rate = k/n • [Figure: rate 1/2 convolutional encoder]

Error Protection • Number of states = 2^(number of flip-flops) = 2^(Constraint Length - 1) • Not every state-to-state transition is possible
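A small sketch of a rate 1/2 convolutional encoder, to make the state count and the restricted transitions concrete. The constraint length K = 3 and the generator polynomials 7 and 5 (octal) are assumptions picked for illustration; the slides do not say which code is drawn.

```python
K = 3                     # constraint length (assumed)
G = (0b111, 0b101)        # generator polynomials 7, 5 octal (assumed)
N_STATES = 1 << (K - 1)   # 2^(K-1) states, one per flip-flop pattern


def parity(x):
    """Return 1 if x has an odd number of 1 bits, else 0."""
    return bin(x).count("1") & 1


def step(state, bit):
    """Shift one input bit in; return (next state, output symbol pair)."""
    reg = (bit << (K - 1)) | state          # newest bit on top of the K-1 stored bits
    out = tuple(parity(reg & g) for g in G)
    return reg >> 1, out


def encode(bits):
    """Encode a bit list starting from the all-zero state."""
    state, symbols = 0, []
    for b in bits:
        state, out = step(state, b)
        symbols.append(out)
    return symbols


# Each state reaches only 2 of the 4 states: one per possible input bit.
successors = {s: sorted(step(s, b)[0] for b in (0, 1)) for s in range(N_STATES)}
print(successors)
print(encode([1, 0, 1, 1]))
```

With one input bit per step, every state has exactly two outgoing branches, which is why the trellis is sparse rather than fully connected.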

Trellis for decoding

Trellis for an input sequence

Error detection • Branch metric = “Distance” between received symbol pair and possible symbol pairs • Path metric = Accumulated error metric

Error-correction

Stages in Viterbi Decoding • Calculate Branch metrics for all states every stage • Update Path metrics for all states every stage • At the end, Traceback the trellis to get the decoded bits
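The three stages can be sketched end to end as a hard-decision decoder. This is an illustrative sketch, not the implementation used on any of the processors discussed: it assumes a K = 3, rate 1/2 code with generators 7 and 5 (octal), a known all-zero start state, and traceback only after the whole block has been received.

```python
K = 3                     # constraint length (assumed)
G = (0b111, 0b101)        # generator polynomials 7, 5 octal (assumed)
N_STATES = 1 << (K - 1)
INF = float("inf")


def parity(x):
    return bin(x).count("1") & 1


def step(state, bit):
    """Encoder transition: return (next state, expected symbol pair)."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        state, sym = step(state, b)
        out.append(sym)
    return out


def viterbi_decode(symbols):
    path = [0] + [INF] * (N_STATES - 1)   # encoder is assumed to start in state 0
    history = []                          # per stage: (previous state, input bit) per state
    for rx in symbols:
        new_path = [INF] * N_STATES
        back = [None] * N_STATES
        for s in range(N_STATES):
            if path[s] == INF:            # state not reachable yet
                continue
            for b in (0, 1):
                nxt, expected = step(s, b)
                # Stage 1: branch metric = Hamming distance to the received pair
                bm = sum(r != e for r, e in zip(rx, expected))
                # Stage 2: add, compare, select the smaller path metric
                cand = path[s] + bm
                if cand < new_path[nxt]:
                    new_path[nxt] = cand
                    back[nxt] = (s, b)
        path = new_path
        history.append(back)
    # Stage 3: trace back from the state with the smallest path metric
    state = min(range(N_STATES), key=lambda s: path[s])
    bits = []
    for back in reversed(history):
        state, b = back[state]
        bits.append(b)
    return bits[::-1]


msg = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0]
rx = encode(msg)
rx[3] = (rx[3][0] ^ 1, rx[3][1])          # inject a single bit error
print(viterbi_decode(rx) == msg)
```

The two inner loops are the part that maps onto ACS hardware; the final loop is the sequential, memory-bound traceback.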

Computations • Branch metrics: • Hamming distance: (XOR) and Count 1’s • Euclidean distance: squared distance • Path metrics: • Add Branch metrics to existing path metrics • Compare for minimum and Select minimum • Survivor Traceback: • Linked list /Pointer chasing • Memory Intensive / Sequential Operations
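The individual computations are small. A sketch of each, with hypothetical function names: the Hamming metric assumes symbols packed into small integers, and the Euclidean metric assumes soft values given as sequences of numbers.

```python
def hamming_branch_metric(received, expected):
    """XOR the two packed symbols and count the 1 bits."""
    return bin(received ^ expected).count("1")


def euclidean_branch_metric(received, expected):
    """Squared Euclidean distance between soft received values and ideal values."""
    return sum((r - e) ** 2 for r, e in zip(received, expected))


def acs(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state with two incoming branches.

    Returns (new path metric, decision); decision 0 keeps branch a, 1 keeps branch b.
    The decision bit is what the survivor traceback later reads.
    """
    cand_a = pm_a + bm_a          # add
    cand_b = pm_b + bm_b
    if cand_b < cand_a:           # compare
        return cand_b, 1          # select
    return cand_a, 0


print(hamming_branch_metric(0b11, 0b10))
print(acs(3, 1, 2, 0))
```

A butterfly is two such ACS operations that share the same pair of predecessor path metrics.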

Viterbi support in different processors • C54x • Special hardware accelerator • ACS unit with 2 ACC and split ALU • Viterbi butterfly (2 ACS) in 4 cycles • C62x • nothing special • C6416 • Viterbi coprocessor • K = 5-9,Rate = 1/2,1/3,1/4
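A rough worked estimate of what "butterfly in 4 cycles" implies. Each butterfly updates 2 states, so one trellis stage needs 2^(K-1) / 2 butterflies. The sketch counts ACS cycles only (branch metrics and traceback are ignored), assumes one decoded bit per stage, and uses a purely hypothetical clock rate as a parameter.

```python
def acs_cycles_per_bit(constraint_length, cycles_per_butterfly=4):
    """ACS cycles for one trellis stage (one decoded bit for a rate 1/n code)."""
    states = 1 << (constraint_length - 1)
    butterflies = states // 2             # each butterfly updates 2 states
    return butterflies * cycles_per_butterfly


def max_bit_rate(clock_hz, constraint_length):
    """Upper bound on decoded bits per second from ACS work alone."""
    return clock_hz / acs_cycles_per_bit(constraint_length)


for k in (5, 7, 9):
    print(k, acs_cycles_per_bit(k))

# clock_hz = 100e6 is a hypothetical figure, not a C54x specification
print(max_bit_rate(100e6, 9))
```

The cost doubles with every increment of the constraint length, which is why K = 9 codes are hard to decode in software at Mbps rates.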

Viterbi Coprocessor in C6416

Viterbi Coprocessor in C6416 • SM, SD and HD memory not accessible to DSP

Need for VSP architecture • Large amount of memory access • Traceback decoding • Not efficient on a GPP • The program instruction count on a GPP is of a higher order than the complexity of the algorithm

VSP architecture

Branch Metric Calculation

Path Metric Calculation

Traceback Unit

Traceback with survivor updates • Start filling the trellis • After 5 * Constraint Length stages, start traceback • Update survivor path for most recent symbol • Symbol decoded
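The flow above can be sketched as a sliding-window decoder: fill the trellis for 5 * constraint length stages, then for every new symbol store the newest survivor decisions, trace back through the window, and release the oldest bit. This is an illustration in software, not the VSP traceback unit; the code (K = 3, generators 7 and 5 octal) and all function names are assumptions.

```python
from collections import deque

K = 3                     # constraint length (assumed)
G = (0b111, 0b101)        # generator polynomials 7, 5 octal (assumed)
N_STATES = 1 << (K - 1)
DEPTH = 5 * K             # traceback depth: 5 * constraint length
INF = float("inf")


def parity(x):
    return bin(x).count("1") & 1


def step(state, bit):
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        state, sym = step(state, b)
        out.append(sym)
    return out


def acs_stage(path, rx):
    """One trellis stage: new path metrics plus (previous state, bit) per state."""
    new_path = [INF] * N_STATES
    back = [None] * N_STATES
    for s in range(N_STATES):
        if path[s] == INF:
            continue
        for b in (0, 1):
            nxt, expected = step(s, b)
            cand = path[s] + sum(r != e for r, e in zip(rx, expected))
            if cand < new_path[nxt]:
                new_path[nxt], back[nxt] = cand, (s, b)
    return new_path, back


def trace(window, path):
    """Follow stored decisions back from the best current state; bits oldest first."""
    state = min(range(N_STATES), key=lambda s: path[s])
    bits = []
    for back in reversed(window):
        state, bit = back[state]
        bits.append(bit)
    return bits[::-1]


def windowed_decode(symbols):
    path = [0] + [INF] * (N_STATES - 1)
    window, decoded = deque(), []
    for rx in symbols:
        path, back = acs_stage(path, rx)
        window.append(back)                          # survivor update for newest symbol
        if len(window) == DEPTH:
            decoded.append(trace(window, path)[0])   # oldest bit is now decided
            window.popleft()
    decoded.extend(trace(window, path))              # flush what is left at the end
    return decoded


msg = [1, 0, 1, 1, 0, 0, 1, 0] * 5 + [0, 0]
rx = encode(msg)
rx[20] = (rx[20][0] ^ 1, rx[20][1])                  # inject a single bit error
print(windowed_decode(rx) == msg)
```

Only DEPTH stages of decisions are ever stored, so memory stays constant regardless of block length, at the price of one full traceback per decoded bit in this naive form.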

Survivor Path Updates

Circular updates

Software Programming • Small but specialized instruction set • LOAD, ACS • Shorter execution time • All 3 subprocessors programmed independently • 10 ns cycle time (100 MHz) in 1990 to get 1.5 Mbps

Conclusions • The Viterbi algorithm is an important part of a programmable communication receiver • Approaches so far: co-processor support for DSPs, or specialized processors • We have yet to design programmable processors that meet real-time requirements for 100 Mbps applications.
