LPC Speech Coder on the TI C6x DSP Mark Anderson, Jeff Burke EE213A / EE298-2 Prof. Ingrid Verbauwhede
Summary • Implementation platform • Texas Instruments TMS320C6000 • Low-quantity cost US $35 (‘C6211) • Architecture clock frequency • 150 MHz (‘C6211) • Throughput • 75-80 channels @ 8000 samples/sec
Summary • Total energy per sample • 1.8 µJ/sample • ‘Area’ • 1.2% of cycle budget per channel per frame • 8.5% of unified memory per channel • 25% of unified memory for algorithm
Summary • Flexibility of implementation • High; programmable processor with C compiler, GUI debugger & simulator • SegSNR_A: • ? • SegSNR_Q: • 26 dB (voiced segments)
Architecture overview • 256-bit VLIW • Two “clustered” data paths • Four functional units in each data path • 16x16 multiplier • Two ALUs • Data addressing unit • 32-bit instruction for each functional unit • (256-bit “instruction” for 8 functional units)
Architecture overview • Split register file • Only two cross-paths exist • Each cluster is limited to one source read from the opposite register file per cycle • Data types • 8, 16, 32-bit with 40-bit accumulate • 40-bit = register pair
Memory architecture • ‘C6211 (US$35) has a cache! • 4kB L1 Instruction cache (L1P) • 4kB L1 Data cache (L1D) • 64kB L2 Unified memory and/or cache • Extra DMA channels
Design Tools • Command-line • Compiler, debugger, simulator • Code Composer Studio • Same tools • Windows NT GUI • 30-day “evaluation” license • Draconian copy protection, pulls out the rug from under you
Design Flow • Consolidate Matlab reference into a single function • Matlab rewritten C-style • Verified C-style Matlab • C prototype created • Imported into Code Composer, optimized & simulated
Fixed-point quantization • Input samples • 16-bit, normalized to [-1,1) • <1.15> format used • Coefficient quantization • Hamming window, pre-emphasis, FIR • <1.15> format used • No noticeable change in characteristics
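The <1.15> format above can be illustrated with a minimal sketch. It assumes round-to-nearest with saturation and a 160-sample Hamming window (20 ms at 8000 samples/sec); the slides do not state the rounding mode or the window length used in the project, and the names `to_q15`, `from_q15` and `hamming_q15` are hypothetical.

```python
import math

Q15_SCALE = 1 << 15        # 32768: one sign bit, 15 fractional bits
Q15_MAX = Q15_SCALE - 1    # 32767, largest representable code
Q15_MIN = -Q15_SCALE       # -32768, represents exactly -1.0


def to_q15(x):
    """Quantize a value in [-1, 1) to a <1.15> integer code (assumed: round, then saturate)."""
    code = int(round(x * Q15_SCALE))
    return max(Q15_MIN, min(Q15_MAX, code))


def from_q15(code):
    """Convert a <1.15> integer code back to a float."""
    return code / Q15_SCALE


def hamming(i, n):
    """Floating-point Hamming window coefficient i of n."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))


def hamming_q15(n):
    """Hamming window quantized to <1.15>; the length n is an assumption."""
    return [to_q15(hamming(i, n)) for i in range(n)]


N = 160  # assumed frame length: 0.02 sec/frame x 8000 samples/sec
window = hamming_q15(N)
# Largest coefficient error introduced by the quantization.
worst_error = max(abs(from_q15(c) - hamming(i, N)) for i, c in enumerate(window))
```

With rounding, the worst-case coefficient error is at most half an LSB (2^-16), which is in line with the "no noticeable change in characteristics" observation.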
Fixed-point quantization • Most values 16 bit • Take advantage of 16x16 fast multipliers • Remain close to other class implementations • Add metric for overpowered LPC engine • Use # of channels as performance metric
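Keeping most values at 16 bits means each product maps onto one 16x16 multiply. A minimal sketch of a <1.15> multiply follows; the truncating shift and the saturation are assumptions, and `q15_mul` is a hypothetical helper rather than a routine from the project or from TI's libraries.

```python
Q15_MAX = 32767
Q15_MIN = -32768


def q15_mul(a, b):
    """Multiply two <1.15> integer codes and return a saturated <1.15> code."""
    product_q30 = a * b           # 16x16 product carries 30 fractional bits
    result = product_q30 >> 15    # drop 15 fractional bits (shift floors the value)
    return max(Q15_MIN, min(Q15_MAX, result))


quarter = q15_mul(16384, 16384)       # 0.5 * 0.5
corner = q15_mul(-32768, -32768)      # (-1) * (-1) does not fit, so it saturates
```

The only input pair that needs saturation is -1.0 times -1.0, whose true product (+1.0) is one LSB beyond the largest <1.15> code.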
Fixed-point quantization • Energy stored in <5.27> • Prevent overflow, provide precision for low energy segments • Temporary values stored in <10.30> • Take advantage of extended precision • Modified autocorrelation used <16.0> • All whole numbers
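One way the formats above could fit together is sketched below: products of <1.15> samples have 30 fractional bits and are summed in a 40-bit accumulator (<10.30>), then shifted down to <5.27> for storage. This data flow is an assumption made for illustration; the slides list the formats but not the exact conversion steps, and `frame_energy_q5_27` is a hypothetical name.

```python
ACC_LIMIT = 1 << 39            # magnitude limit of a signed 40-bit accumulator
Q5_27_MAX = (1 << 31) - 1      # largest signed 32-bit <5.27> code


def frame_energy_q5_27(samples_q15):
    """Sum of squares of <1.15> samples, returned as a <5.27> integer code."""
    acc = 0                                # models the <10.30> accumulator
    for s in samples_q15:
        acc += s * s                       # Q15 * Q15 gives a Q30 product
        if acc >= ACC_LIMIT:
            raise OverflowError("40-bit accumulator exceeded")
    energy = acc >> 3                      # Q30 -> Q27: drop 3 fractional bits
    return min(energy, Q5_27_MAX)          # saturate as a safeguard (assumption)


# Four samples of 0.5 have total energy 4 * 0.25 = 1.0.
unit_energy = frame_energy_q5_27([16384] * 4)
```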
Fixed-Point SNR • Matlab simulation of magnitude truncation • Tools again. • SegSNR_A = ? • SegSNR_Q = 26 dB • Voiced segments only • Sent_female test data
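For reference, segmental SNR is the average of per-frame SNR values in dB. The sketch below uses the common textbook definition with 160-sample frames (0.02 sec at 8000 samples/sec) and models magnitude truncation to 15 fractional bits. It does not select voiced segments as the slide's measurement did, and the 440 Hz test tone is made up for illustration; `seg_snr_db` and `truncate_q15` are hypothetical names.

```python
import math


def truncate_q15(x):
    """Magnitude truncation to 15 fractional bits: int() rounds toward zero."""
    return int(x * 32768) / 32768


def seg_snr_db(reference, test, frame_len=160):
    """Mean of per-frame SNR in dB; frames with zero signal or zero noise are skipped."""
    snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        tst = test[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - t) ** 2 for r, t in zip(ref, tst))
        if signal == 0 or noise == 0:
            continue
        snrs.append(10 * math.log10(signal / noise))
    return sum(snrs) / len(snrs) if snrs else float("nan")


# Illustrative input only: one second of a 440 Hz tone at half scale.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
tone_snr = seg_snr_db(tone, [truncate_q15(v) for v in tone])
```

This measures input quantization alone, so it is not comparable to the 26 dB figure, which covers the whole fixed-point coder.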
Performance results • Initial version: 80,000 CPU cycles/frame • Optimization • Take advantage of VLIW, pipelining • Inspect generated assembly, modify C loops • Use TI’s DSP Library • Assembly advantage without assembly • Optimized version: 30,182 cycles/frame • Had to stop early, still at least 5K cycles wasted
Performance • Then, the tool license expired. • The tool would not install on other machines. • TI responded, but wasn’t too helpful. • Moral #1: Avoid the evaluation version. • Moral #2: Give tools away to sell hardware
Additional optimizations • Use more DSPLIB routines • Autocorrelation • Assembly-level optimization • Code size reduction? • Reduce number of buffers to reduce L1D usage per frame
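As a reminder of what the autocorrelation step computes, here is a plain reference version. It is not TI's DSPLIB routine and does not mirror its interface; `autocorr` is a hypothetical name, and the number of lags needed depends on the LPC order, which the slides do not state.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[lag] = sum over i of x[i] * x[i + lag], for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


# With integer inputs the result is exact, which helps when checking an optimized routine.
r = autocorr([1, 2, 3], 2)
```

A bit-exact reference like this is useful for verifying a library or assembly replacement before measuring its cycle count.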
Energy per sample • ‘C6211 consumes 1.24 W • 75% high activity / 25% low activity • 1.24 W / 80 channels = 15.5 mW/channel • 15.5 mJ/sec/channel × 1/8000 sec/sample ≈ 1.9 µJ/sample (summary figure: 1.8 µJ/sample)
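The energy-per-sample arithmetic can be checked with a short calculation using only the figures on the slide. The function name `energy_per_sample_uj` is hypothetical.

```python
POWER_W = 1.24          # 'C6211 power at 75% high / 25% low activity
SAMPLE_RATE_HZ = 8000   # samples per second per channel


def energy_per_sample_uj(power_w, channels, sample_rate_hz):
    """Energy per coded sample in microjoules: power shared across channels and samples."""
    return power_w / channels / sample_rate_hz * 1e6


at_80_channels = energy_per_sample_uj(POWER_W, 80, SAMPLE_RATE_HZ)
at_85_channels = energy_per_sample_uj(POWER_W, 85, SAMPLE_RATE_HZ)
```

With 80 channels this evaluates to about 1.94 µJ/sample; a result near 1.8 µJ/sample corresponds to a channel count in the mid-80s.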
Number of channels • 150 × 10^6 cycles/sec × 0.02 sec/frame = 3.0 × 10^6 cycles/frame • 3.0 × 10^6 cycles/frame / 30,182 cycles per channel = 99 channels
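The channel count follows directly from the cycle budget per frame. A minimal sketch using the slide's numbers is below; `channels_supported` is a hypothetical name, and integer arithmetic is used so the result is exact.

```python
CLOCK_HZ = 150_000_000      # 'C6211 clock frequency
FRAME_MS = 20               # 0.02 sec/frame
CYCLES_PER_FRAME = 30_182   # optimized coder, one channel


def channels_supported(clock_hz, frame_ms, cycles_per_frame):
    """Whole channels that fit in one frame period, ignoring memory stalls."""
    budget = clock_hz * frame_ms // 1000      # cycles available per frame
    return budget // cycles_per_frame


ideal_channels = channels_supported(CLOCK_HZ, FRAME_MS, CYCLES_PER_FRAME)  # 99
```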
Memory • ‘C6211 cache complicates estimates • Performance is 85-99% of optimal for typical applications • 30,182 cycles/frame becomes 35,508 cycles/frame at 85% efficiency => now supports only about 84 channels (3.0 × 10^6 / 35,508)
Memory • Try to account for off-chip memory transfers • ~220,000 cycles for 150 ns fetches for 80 channels => support 75-80 channels • Unable to verify/simulate because of unexpected tool expiration
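The two derating steps on the memory slides can be combined into one estimate. The sketch below assumes the ~220,000 off-chip fetch cycles scale linearly with the number of channels, which the slides do not state; `channels_with_overheads` is a hypothetical name.

```python
CLOCK_HZ = 150_000_000
FRAME_MS = 20
CYCLES_PER_FRAME = 30_182        # optimized coder, one channel, ideal memory
CACHE_EFFICIENCY = 0.85          # low end of the 85-99% range
OFFCHIP_CYCLES_80CH = 220_000    # estimated fetch cycles for 80 channels


def channels_with_overheads(cache_efficiency, offchip_cycles_per_channel):
    """Whole channels per frame after cache derating and per-channel fetch cost."""
    budget = CLOCK_HZ * FRAME_MS // 1000
    per_channel = CYCLES_PER_FRAME / cache_efficiency + offchip_cycles_per_channel
    return int(budget // per_channel)


cache_only = channels_with_overheads(CACHE_EFFICIENCY, 0)
# Assumption: off-chip cost is split evenly, 220,000 / 80 cycles per channel.
with_offchip = channels_with_overheads(CACHE_EFFICIENCY, OFFCHIP_CYCLES_80CH / 80)
```

Under these assumptions the second figure lands inside the 75-80 channel range quoted on the slide.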
Memory • L2 usage • ~16kB Code size thanks to VLIW • 512 32-byte instruction clusters • More suited for ‘C6201 & larger processors • Remaining used by data for channels • 480 bytes each (8.5% of remaining memory) • L1 usage • L1P: Can’t tell because of cache • L1D: 2.2kB (~56%)
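The L2 figures can be cross-checked with a few lines of arithmetic. This sketch assumes 1 kB = 1024 bytes and uses only the code size, per-channel data size and L2 size given on the slides; the names are hypothetical.

```python
L2_BYTES = 64 * 1024          # unified L2 memory
FETCH_PACKET_BYTES = 32       # 8 instructions x 32 bits = 256 bits
CODE_PACKETS = 512            # instruction clusters in the coder
BYTES_PER_CHANNEL = 480       # per-channel data

code_bytes = CODE_PACKETS * FETCH_PACKET_BYTES     # 16 kB of code
code_fraction = code_bytes / L2_BYTES              # share of L2 taken by code
data_bytes_available = L2_BYTES - code_bytes       # what is left for channel data


def channel_data_fits(channels):
    """True if the per-channel data for this many channels fits in the remaining L2."""
    return channels * BYTES_PER_CHANNEL <= data_bytes_available
```

The code share works out to the 25% quoted in the summary, and the data for 80 channels fits in the remaining L2, so the cycle budget rather than L2 capacity limits the channel count.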
Tool comments • Powerful, easy to use IDE… • When it worked. • Licensing problems for eval version • Debugging support a bit odd • puts/printf
C6x Conclusions • Easily support 75-80 channels of coding • 26 dB fixed-point SNR, 16-bit types • VLIW = Large code size • Cache on a low-end DSP! • Good tools, but draconian copy protection