LPC Speech Coder on the TI C6x DSP

Mark Anderson, Jeff Burke • EE213A / EE298-2 • Prof. Ingrid Verbauwhede

Summary • Implementation platform • Texas Instruments TMS320C6000 • Low-quantity cost US $35 (’C6211) • Architecture clock frequency • 150 MHz (’C6211) • Throughput • 75-80 channels @ 8000 samples/sec

Summary • Total energy per sample • 1.8 uJ/sample • ‘Area’ • 1.2% of cycle budget per channel per frame • 8.5% of unified memory per channel • 25% of unified memory for algorithm

Summary • Flexibility of implementation • High; programmable processor with C compiler, GUI debugger & simulator • SegSNR_A: • ? • SegSNR_Q: • 26 dB (voiced segments)

Architecture overview • 256-bit VLIW • Two “clustered” data paths • Four functional units in each data path • 16x16 multiplier • Two ALUs • Data addressing unit • 32-bit instruction for each functional unit • (256-bit “instruction” for 8 functional units)

Data path diagram

Architecture overview • Split register file • Only two cross-paths exist • Each cluster is limited to one source read from the opposite register file per cycle • Data types • 8, 16, 32-bit with 40-bit accumulate • 40-bit = register pair

Memory architecture • ’C6211 (US$35) has a cache! • 4kB L1 Instruction cache (L1P) • 4kB L1 Data cache (L1D) • 64kB L2 Unified memory and/or cache • Extra DMA channels

Memory architecture

Design Tools • Command-line • Compiler, debugger, simulator • Code Composer Studio • Same tools • Windows NT GUI • 30-day “evaluation” license • Draconian copy protection, pulls out the rug from under you

Design Flow • Consolidate Matlab reference into a single function • Matlab rewritten C-style • Verified C-style Matlab • C prototype created • Imported into Code Composer, optimized & simulated

Fixed-point quantization • Input samples • 16-bit, normalized to [-1,1) • <1.15> format used • Coefficient quantization • Hamming window, pre-emphasis, FIR • <1.15> format used • No noticeable change in characteristics
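The <1.15> format above (one sign bit, 15 fractional bits) can be illustrated with a minimal C sketch. The names `q15_t`, `q15_from_double` and `q15_mul` are hypothetical and not from the project; truncation toward zero is chosen to match the magnitude truncation mentioned in the SNR slide, and saturation at the top of the range is an added assumption.

```c
#include <assert.h>
#include <stdint.h>

/* <1.15>: 1 sign bit, 15 fractional bits, range [-1, 1). */
typedef int16_t q15_t;

/* Quantize a value in [-1, 1] to <1.15> by truncating toward zero.
 * +1.0 is not representable, so it saturates to 0x7FFF (assumption). */
static q15_t q15_from_double(double x)
{
    assert(x >= -1.0 && x <= 1.0);
    double scaled = x * 32768.0;
    if (scaled >= 32767.0)
        return INT16_MAX;
    return (q15_t)scaled;            /* C conversion truncates toward zero */
}

/* 16x16 multiply with a 32-bit product, rescaled back to <1.15>. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* product has 30 fractional bits */
    p /= 32768;                            /* back to 15 bits, toward zero */
    if (p > INT16_MAX)
        p = INT16_MAX;                     /* only reached for -1 * -1 */
    return (q15_t)p;
}
```

For example, 0.5 quantizes to 16384, and multiplying two such values yields 8192, which is 0.25 in <1.15>.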

Fixed-point quantization • Most values 16 bit • Take advantage of 16x16 fast multipliers • Remain close to other class implementations • Add metric for overpowered LPC engine • Use # of channels as performance metric

Fixed-point quantization • Energy stored in <5.27> • Prevent overflow, provide precision for low energy segments • Temporary values stored in <10.30> • Take advantage of extended precision • Modified autocorrelation used <16.0> • All whole numbers
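As a rough sketch of how the <10.30> temporary and the <5.27> energy format could fit together: squaring <1.15> samples gives products with 30 fractional bits, which can be summed in a wide accumulator and then shifted down by 3 bits. The function name `frame_energy_q27` is hypothetical, `int64_t` stands in for the 40-bit register pair, and the final saturation is an assumption rather than something the slides state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of squares of <1.15> samples, returned in <5.27>. */
static int32_t frame_energy_q27(const int16_t *x, size_t n)
{
    assert(x != NULL);
    int64_t acc = 0;                      /* models the <10.30> temporary */
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)x[i] * x[i];      /* <1.15> * <1.15> has 30 fractional bits */

    int64_t e = acc >> 3;                 /* 30 -> 27 fractional bits; acc >= 0 */
    if (e > INT32_MAX)
        e = INT32_MAX;                    /* saturate just below 16.0 (assumption) */
    return (int32_t)e;
}
```

A 20 ms frame at 8000 samples/sec holds 160 samples, so the worst-case sum of squares (160 full-scale samples) still fits in 40 bits, while the <5.27> result saturates once the frame energy reaches 16.0.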

Fixed-Point SNR • Matlab simulation of magnitude truncation • Tools again. • SegSNR_A = ? • SegSNR_Q = 26 dB • Voiced segments only • Sent_female test data

Performance results • Initial version: 80,000 CPU cycles/frame • Optimization • Take advantage of VLIW, pipelining • Observe generated assembly, modify C loops • Use TI’s DSP Library • Assembly advantage without assembly • Optimized version: 30,182 cycles/frame • Had to stop early, still at least 5K cycles wasted
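To illustrate the kind of C loop that a VLIW compiler can usually schedule well, here is a sketch of an autocorrelation written as plain counted loops with `restrict`-qualified pointers and 16x16 multiplies. This is not the project's code: the function name `autocorr_q15` is hypothetical, `int64_t` stands in for the 40-bit accumulator, and how well a given compiler pipelines it is an assumption that would need to be checked against the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r[k] = sum over i = k..n-1 of x[i] * x[i-k], for k = 0..order. */
static void autocorr_q15(const int16_t *restrict x, size_t n,
                         int64_t *restrict r, size_t order)
{
    assert(order < n);
    for (size_t k = 0; k <= order; k++) {
        int64_t acc = 0;                       /* wide accumulator per lag */
        for (size_t i = k; i < n; i++)
            acc += (int32_t)x[i] * x[i - k];   /* 16x16 multiply, 32-bit product */
        r[k] = acc;
    }
}
```

The inner loop has no branches or function calls and its trip count is known on entry, which is the general shape that software pipelining favors.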

Performance • Then, the tool license expired. • The tool would not install on other machines. • TI responded, but wasn’t too helpful. • Moral #1: Avoid the evaluation version. • Moral #2: Give tools away to sell hardware

Cycle count details

Additional optimizations • Use more DSPLIB routines • Autocorrelation • Assembly-level optimization • Code size reduction? • Reduce number of buffers to reduce L1D usage per frame

Energy per sample • ’C6211 consumes 1.24 W • 75% high activity / 25% low activity • 1.24 W / 80 channels = 15.5 mW/channel • 15.5 mJ/sec/channel * 1/8000 = 1.8 uJ / sample
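The energy estimate above is power divided by channel count and sample rate, which a small helper makes explicit. The function name `energy_per_sample_uj` is hypothetical. Evaluated literally with 80 channels the formula gives about 1.94 uJ per sample; the quoted 1.8 uJ is what the same formula yields for roughly 86 channels.

```c
#include <assert.h>

/* Energy per sample in microjoules: chip power shared equally by all
 * channels, each producing sample_rate samples per second. */
static double energy_per_sample_uj(double chip_watts, int channels,
                                   int sample_rate)
{
    assert(channels > 0 && sample_rate > 0);
    double watts_per_channel = chip_watts / channels;
    return watts_per_channel / sample_rate * 1e6;   /* J -> uJ */
}
```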

Number of channels • 150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame • 3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels

Memory • ’C6211 cache complicates estimates • Performance is 85-99% of optimal for typical applications • 30,182 cycles becomes 35,508 cycles/frame at 85% efficiency => now supports only 86 channels
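The channel estimates can be reproduced with integer arithmetic: the cycle budget per 20 ms frame divided by the per-channel cycle count, with the cycle count inflated by the cache efficiency. The function names below are hypothetical. With these inputs the sketch gives 99 channels at 100% efficiency and 84 at exactly 85%, slightly below the 86 quoted above, so the slide's figure presumably reflects rounding or a slightly different efficiency.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available to process one frame across all channels. */
static int64_t frame_cycle_budget(int64_t clock_hz, int64_t frame_ms)
{
    return clock_hz * frame_ms / 1000;
}

/* Whole channels that fit in the budget when the cache delivers
 * efficiency_pct percent of the ideal (all-hits) speed. */
static int64_t channels_supported(int64_t budget, int64_t ideal_cycles,
                                  int efficiency_pct)
{
    assert(efficiency_pct > 0 && efficiency_pct <= 100);
    int64_t effective = ideal_cycles * 100 / efficiency_pct;
    return budget / effective;
}
```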

Memory • Try to account for off-chip memory transfers • ~220,000 cycles for 150 ns fetches for 80 channels => support 75-80 channels • Unable to verify/simulate because of unexpected tool expiration
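One way to arrive at a figure close to 220,000 cycles is sketched below. It assumes that each channel's 480 bytes of state is fetched from off-chip memory once per frame as 32-bit words; the slides do not say this is how the estimate was made, and the function name `fetch_stall_cycles` is hypothetical. Under that assumption the cost is 216,000 cycles, and subtracting it from the 3.0 x 10^6 cycle budget leaves room for about 78 channels at 35,508 cycles each.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent waiting on off-chip fetches for all channels in one frame.
 * fetch_ns * clock_mhz / 1000 converts one fetch latency to CPU cycles. */
static int64_t fetch_stall_cycles(int64_t channels, int64_t bytes_per_channel,
                                  int64_t bytes_per_fetch, int64_t fetch_ns,
                                  int64_t clock_mhz)
{
    assert(bytes_per_fetch > 0);
    int64_t fetches = channels * (bytes_per_channel / bytes_per_fetch);
    return fetches * fetch_ns * clock_mhz / 1000;
}
```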

Memory • L2 usage • ~16kB code size thanks to VLIW • 512 32-byte instruction clusters • More suited for ’C6201 & larger processors • Remaining used by data for channels • 480 bytes each (8.5% of remaining memory) • L1 usage • L1P: Can’t tell because of cache • L1D: 2.2kB (~56%)
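The L2 split can be checked with a few lines of arithmetic: 512 clusters of 32 bytes is 16 kB of code, leaving 48 kB of the 64 kB L2 for per-channel data. The function names are hypothetical, and the sketch assumes the whole remainder is available for channel state, which ignores any shared buffers or tables.

```c
#include <assert.h>
#include <stdint.h>

/* Code footprint in bytes from the number of 32-byte instruction clusters. */
static int32_t code_bytes(int32_t clusters, int32_t bytes_per_cluster)
{
    return clusters * bytes_per_cluster;
}

/* Channels whose state fits in what is left of L2 after the code. */
static int32_t channels_that_fit(int32_t l2_bytes, int32_t code,
                                 int32_t bytes_per_channel)
{
    assert(bytes_per_channel > 0 && code <= l2_bytes);
    return (l2_bytes - code) / bytes_per_channel;
}
```

With 480 bytes per channel this leaves room for about 102 channels of state, so under these assumptions memory alone would not be the limiting factor.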

Tool comments • Powerful, easy to use IDE… • When it worked. • Licensing problems for eval version • Debugging support a bit odd • puts/printf

C6x Conclusions • Easily supports 75-80 channels of coding • 26 dB fixed-point SNR, 16-bit types • VLIW = Large code size • Cache on a low-end DSP! • Good tools, but draconian copy protection
