Lecture 10b: Implementing DSP Functionality: Alternatives

Lecture 10b: Implementing DSP Functionality:Alternatives Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Prof. Heinrich Meyr, University of Aachen Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang Kurt Keutzer

System Implementation Choices System Functionality DSP Program ROM Program ROM ASIP Core DSP Core ASIC OFF-THE SHELF µP/ DSP Coefficient ROM Control Coefficient ROM Control EMBEDDED CORE µP/DSP APPLICATION SPECIFIC µP (ASIP) Kurt Keutzer

Making a Successful Comparison - 1 • Find an interesting application kernel • viterbi decoding for speech processing (not a full modem!) • Find realistic constraints native to the application • n=2, K=7, QPSK, 100KBS, BER= 10^-4 • Find architectures/implementations that are promising for the application • TI TMS320C54, Tensilica Xtensa • What are the relevant features of this architecture that support this application? • Fix application constraints across all implementations (above) • Fix key parameters for implementation comparison • performance (constraint) • area • power Kurt Keutzer

Making a Successful Comparison - 2 • Identify how key parameters will be measured • performance - instruction set simulator, eval board • area - data sheets, gate estimates • power - eval board, TI application note • Implement your application kernel • Examine different algorithms • Start with code downloaded from the web - multimedia benchmarks etc. • Build your software development/evaluation environment: • http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm Kurt Keutzer

Making a Successful Comparison - 3 • Implement your application kernel (cont) • Phase 0: Research • Find application notes, research reports for your own or comparable architectures • Phase 1: Estimation • Develop a quick estimate based on initial code • Integrate research findings • Do a quick back-of-envelope reality check • Phase 2: Real implementation/Tuning • Tailor algorithm, implementation to architecture • Do your very best! Have a contest with your partner • Phase 3: Evaluation • Apply evaluation tools to key parameters • Evaluate and compare results - return to 2 • If your life depended on choosing the right part - what would you do? Kurt Keutzer

Making a Successful Comparison - 4 • Final evaluation and comparison - compare all implementations • To evaluate for a product - everything is fair game • To evaluate principally the architectures - need to consider: • Fab differences - TSMC vs. IBM (10-20% faster) • process differences - .35 micron vs. .25 (50% faster) • power supply differences 3.0V vs. 1.5V • asic vs. custom implementations - (2x faster) • Now evaluate - if I was the architect of this processor/implementor of this system on a chip, what would I do differently? • cache sizes • register availability • additional instructions • on chip memory Kurt Keutzer

Making a Successful Comparison - 5 • Just for fun … • In addition to primary constraints (speed, cost, power) • final real world considerations • business relationships (joint partnership with Lucent) • Time-to-market issues • time to configure? • software development environment • library/application software support • application engineering support Kurt Keutzer

Viterbi Algorithm Prof. Heinrich Meyr University of Aachen Kurt Keutzer

Viterbi Decoders in digital communication systems Kurt Keutzer

Convolutional Coder and Trellis diagram Kurt Keutzer

ACS recursion for M = 2 Kurt Keutzer

Viterbi Decoder block diagram Kurt Keutzer

Characteristic of a 2-bit step-at-zero quantizer Kurt Keutzer

Architecture Kurt Keutzer

Node parallel ACS architecture Kurt Keutzer

Alternative Implementations Kurt Keutzer

Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code Kurt Keutzer

Survivor Memory Unit Kurt Keutzer

REA hardware architecture Kurt Keutzer

Decoded Sequence: 0 0 ... 0 1 0 Kurt Keutzer

uncoded word length = 1 coded word length (n) = 2 this means that it is rate 1/2 constraint length (K aka. L) = 7 this means that the number of states in trellis is 2^(K-1) or 64 states branch metric calculation is QPSK soft decision wordlength (q) = 6 chain-backing depth (D) = 96 generator polynomials: p0 = 171, p1= 133 (octal) this means that p0=1111001, p1=1011011 data rate 100 kbs goal: bit error rate (BER) = 10^-4 signal to noise ratio (SNR) degradation 0.05dB Viterbi Project Constraints Kurt Keutzer

Viterbi Decoder Implementation on an ARM EE 290S Final Project May 4, 1999 Phillip Chong Kurt Keutzer

ARM Overview • 32-bit RISC microprocessor • Five stage pipeline • Features fast ALU operations (barrel shifter) • Scalar integer unit, no FPU Kurt Keutzer

Algorithm Tweaking • Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots) • Parity computation (Viterbi code) can also be done through table lookup Kurt Keutzer

Reducing Memory Footprint • Cache misses can be very costly due to pipeline stalls • We are willing to give up some algorithmic efficiency to eliminate cache misses • To minimize the memory footprint, we pack 32 bits of traceback into single word; we can easily unpack this data due to the barrel shifter (1 cycle operation) • For 128 level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes Kurt Keutzer

Simulation Results • Simulated decoding of 4096 bits on a 125 MHz 3.3V model • Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate • Power consumption was estimated at 52.47 mW • Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW Kurt Keutzer

Summary • Clock speed: 275 MHz • Execution Performance: 96kb/s • Power Dissipation: 42.40 mW (5.68 mW/mm2) • Area: 7.47mm2 in 0.25 m • Design Effort: 4 days • Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking Kurt Keutzer

Conclusion/Thanks • One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR • Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available • Many thanks to Marlene Wan for providing power estimation Kurt Keutzer

Viterbi Decoder Implementation on a TI C54x EE 290S Final Project May 4, 1999 Paul Husted Kurt Keutzer

Introduction • Implemented Viterbi Decoder on a TI TMS320VC5402 DSP • Examine: • Performance (bits/sec) • Power (mW/bit) • Cost ($/unit,area) • Design effort (engineer-months) Kurt Keutzer

Viterbi Decoder Specifications • Implementation Specifications: • Constraint Length (K aka. L) = 7 • Branch Metric Calculation is QPSK • Soft Decision Wordlength (q) = 6 • Chain-backing Depth (D) = 96 • Gen. Polynomials: p0 = 171, p1= 133 (octal) • Data Rate 100 kbs • Goal: Bit Error Rate (BER) = 10^-4 Kurt Keutzer

C54x Capabilities • Capabilities of all C54x DSP Cores: • Three 16-bit Data, One 16-bit program bus • 40 bit ACC with 40 bit barrel shifter • Two independent accumulators • A single cycle non-pipelined MAC • Single-instruction repeat and block-repeat • Six channel DMA controller • Arithmetic instructions with parallel store and parallel load Kurt Keutzer

Helpful Instructions for the Viterbi Decoder • The C54x Has Specialized Instruction Set • Dual Add/Subtract in 1 Cycle • Compare, Select, and Store Unit (CSSU) • Compare Branch Metrics • Store Larger Value, Store Decision Bit • Increment Address Registers in Circular Buffer • 1 Cycle • Allows Butterfly (2 States) in 5 cycles Kurt Keutzer

Butterfly Implementation T Register = Local Distance Old(2*j) New(j) DADST CMPS DSADT CMPS Old(2*j+1) New(j+2(K-2)) Kurt Keutzer

TI TMS320VC5402 DSP • Specific Chip Characteristics: • Operates at 100 MIPS • Core Voltage of 1.8V • I/O Pins Operate at 3.3V • 16K Word x 16 Bits of Dual-Access RAM • 4K Word x 16 Bits of ROM • Internal DMA • Created in 0.18 Micron Technology Kurt Keutzer

Dataflow • Data I/O • Input Values Assumed to be Placed at Specified Memory Location by Internal DMA • Output Values Assumed to be removed from another Memory Location by Internal DMA • Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing Kurt Keutzer

Implementation Analysis • Viterbi Decoder Code Created in Assembly • Linked to Processor Specific Memory Map • Simulated on Cycle-Accurate Simulator • Used Correct Memory Model for VC5402 Kurt Keutzer

Implementation Results Kurt Keutzer

Power Calculation • Compared with TI Figures: • TI uses 1/2 MACs, 1/2 NOPs For Power Figure • .25 Micron Estimate is .45 mA/MIPS • Fully Static Design can be Clocked at Any Rate • Viterbi Code Uses 1.08 Times More Current than TI Estimate • At 22 MIPS, 19.25 mW are Consumed in the Core Kurt Keutzer

Area Estimate • TI Will Not Release Die Sizes • .25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA • Maximum Die Size is thus 10.24 mm2 Kurt Keutzer

Development Cost • Engineering Time • Estimate - 3 days • Assumes Engineer Has Experience with Assembly Language and TI Tools • Tool Cost - $13262.45 • Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger • Cost of Chip - $8.52 Kurt Keutzer

Conclusion • Optimized Instructions Make Algorithm Efficient • Static Design Allows Clock Rate to be Set As Needed to Reduce Power • Flexibility Exists to Perform Other Processing of Data • Very Little Development Time/Cost Kurt Keutzer

ACS TIE Extension with State (ACS) Rt Rs 1:0 0:1 11 17 27 31 31 27 17 11 31 24:23 16:15 8:7 0 pm- pm- bm3 bm2 bm1 bm0 pm- pm- + + msb msb + - =1? =1? - + Control 0:1 11 16:17 27 31 instruction decision bit pm pm decision bit Rr Kurt Keutzer

Tensilica Viterbi Implementation Niraj Shah Scott Weber 290A Final Presentation Kurt Keutzer

Tensilica Flow .c .c .c TIE uArch xt-gcc gen Designer Tensilica Processor Generator gen xt-run .o Kurt Keutzer

Xtensa Architecture • TIE Extensions: • single cycle • state free • no new exceptions • no stalls • typeless data • Rs, Rt, Rr are 32 bit regs • I is the instruction controlling the TIE unit • Xtensa Core is a 32 bit configurable RISC processor Xtensa Core Rs Rt I Rr TIE Kurt Keutzer

Viterbi Architecture ADC I/0 Device Init RAM TraceBack ACS Measured Performance Here Kurt Keutzer

TIE SetupBMreg (ACS) Rs Rt 31 8:7 0 31 8:7 0 I Q 0x7F + - + - - Control instruction bm0 bm1 bm2 bm3 0 7:8 15:16 23:24 31 Rr Kurt Keutzer

ACS TIE Extension (ACS) Rs Rt 31 27 17 11 1:0 31 24:23 16:15 8:7 0 pm- pm- bm3 bm2 bm1 bm0 + msb =1? - + ACS03 || ACS12 || ACS30 || ACS21 instruction 0:1 11:12 31 0’s pm decision bit Rr Kurt Keutzer

ACS TIE Extension with State (ACS) Rt Rs 1:0 0:1 11 17 27 31 31 27 17 11 31 24:23 16:15 8:7 0 pm- pm- bm3 bm2 bm1 bm0 pm- pm- + + msb msb + - =1? =1? - + Control 0:1 11 16:17 27 31 instruction decision bit pm pm decision bit Rr Kurt Keutzer

Lecture 10b: Implementing DSP Functionality: Alternatives