LPC Speech Coder on the TI C6x DSP


  1. LPC Speech Coder on the TI C6x DSP Mark Anderson, Jeff Burke EE213A / EE298-2 Prof. Ingrid Verbauwhede

  2. Summary • Implementation platform • Texas Instruments TMS320C6000 • Low-quantity cost US $35 (‘C6211) • Architecture clock frequency • 150 MHz (‘C6211) • Throughput • 75-80 channels @ 8000 samples/sec

  3. Summary • Total energy per sample • 1.8 uJ/sample • ‘Area’ • 1.2% of cycle budget per chan. per frame • 8.5% of unified memory per channel • 25% of unified memory for algorithm

  4. Summary • Flexibility of implementation • High; programmable processor with C compiler, GUI debugger & simulator • SegSNR_A: • ? • SegSNR_Q: • 26 dB (voiced segments)

  5. Architecture overview • 256-bit VLIW • Two “clustered” data paths • Four functional units in each data path • 16x16 multiply • Two ALUs • Data addressing unit • 32-bit instruction for each functional unit • (256-bit “instruction” for 8 functional units)

  6. Data path diagram

  7. Architecture overview • Split register file • Only two cross-paths exist • Each cluster is limited to one source read from the opposite register file per cycle • Data types • 8, 16, 32-bit with 40-bit accumulate • 40-bit = register pair

  8. Memory architecture • ‘C6211 (US$35) has a cache! • 4kB L1 Instruction cache (L1P) • 4kB L1 Data cache (L1D) • 64kB L2 Unified memory and/or cache • Extra DMA channels

  9. Memory architecture

  10. Design Tools • Command-line • Compiler, debugger, simulator • Code Composer Studio • Same tools • Windows NT GUI • 30-day “evaluation” license • Draconian copy protection, pulls out the rug from under you

  11. Design Flow • Consolidate Matlab reference into a single function • Matlab rewritten C-style • Verified C-style Matlab • C prototype created • Imported into Code Composer, optimized & simulated

  12. Fixed-point quantization • Input samples • 16-bit, normalized to [-1,1) • <1.15> format used • Coefficient quantization • Hamming window, pre-emphasis, FIR • <1.15> format used • No noticeable change in characteristics
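The <1.15> format above can be illustrated with a small sketch in portable C. The helper names `float_to_q15` and `q15_mul` are hypothetical, and the choice to truncate toward zero and to saturate at the range limits is an assumption; the slides do not show the actual conversion code.

```c
#include <stdint.h>

/* Convert a real value in [-1, 1) to <1.15>.
   Assumption: truncate toward zero (the C float-to-int cast does this)
   and saturate values that fall outside the representable range. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;            /* 2^15 steps per unit */
    if (scaled >= 32767.0f)  return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)scaled;
}

/* Multiply two <1.15> values.
   The 16x16 product is a 32-bit <2.30> value; dividing by 2^15
   (truncating toward zero) brings it back to <1.15>. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;    /* <2.30> */
    p /= 32768;                             /* back to 15 fractional bits */
    if (p > INT16_MAX) p = INT16_MAX;       /* only reached for -1 * -1 */
    return (int16_t)p;
}
```

For example, `q15_mul(float_to_q15(0.5f), float_to_q15(0.5f))` yields 8192, which is 0.25 in <1.15>.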

  13. Fixed-point quantization • Most values 16 bit • Take advantage of 16x16 fast multipliers • Remain close to other class implementations • Add metric for overpowered LPC engine • Use # of channels as performance metric

  14. Fixed-point quantization • Energy stored in <5.27> • Prevent overflow, provide precision for low energy segments • Temporary values stored in <10.30> • Take advantage of extended precision • Modified autocorrelation used <16.0> • All whole numbers
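A rough sketch of how a frame energy might land in <5.27> from <1.15> samples. This is an illustration only: the function name `frame_energy_q27`, the use of a mean-square (rather than summed) energy, and the 64-bit accumulator standing in for the 40-bit register pair are all assumptions, not the authors' code.

```c
#include <stdint.h>

/* Mean-square energy of n <1.15> samples, returned in <5.27>.
   Assumption: energy is averaged over the frame so the result
   always fits the 32-bit <5.27> word. */
int32_t frame_energy_q27(const int16_t *x, int n)
{
    if (n <= 0) return 0;

    int64_t acc = 0;                        /* stand-in for the 40-bit accumulator */
    for (int i = 0; i < n; i++) {
        int32_t p = (int32_t)x[i] * (int32_t)x[i];  /* <2.30>, never negative */
        acc += p >> 3;                      /* drop 3 bits: 30 -> 27 fractional bits */
    }
    return (int32_t)(acc / n);              /* mean square, still 27 fractional bits */
}
```

With four samples of 0.5 (16384 in <1.15>) the result is 33554432, which is 0.25 with 27 fractional bits.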

  15. Fixed-Point SNR • Matlab simulation of magnitude truncation • Tools again. • SegSNR_A = ? • SegSNR_Q = 26 dB • Voiced segments only • Sent_female test data

  16. Performance results • Initial version: 80,000 CPU cycles/frame • Optimization • Take advantage of VLIW, pipelining • observe assembly, modify C loops • Use TI’s DSP Library • Assembly advantage without assembly • Optimized version: 30,182 cycles/frame • Had to stop early, still at least 5K cycles wasted

  17. Performance • Then, the tool license expired. • The tool would not install on other machines. • TI responded, but wasn’t too helpful. • Moral #1: Avoid the evaluation version. • Moral #2: Give tools away to sell hardware

  18. Cycle count details

  19. Additional optimizations • Use more DSPLIB routines • Autocorrelation • Assembly-level optimization • Code size reduction? • Reduce number of buffers to reduce L1D usage per frame
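For reference, the autocorrelation named above as a candidate for a library routine can be written in plain C as follows. This is a generic, unoptimized sketch and not the DSPLIB routine or the authors' implementation; the name `autocorr_ref` and the 64-bit accumulator are assumptions made for portability.

```c
#include <stdint.h>

/* Reference autocorrelation of n 16-bit samples for lags 0..order.
   r must have room for order + 1 results.
   r[k] = sum over i of x[i] * x[i - k]. */
void autocorr_ref(const int16_t *x, int n, int order, int64_t *r)
{
    for (int k = 0; k <= order; k++) {
        int64_t acc = 0;                    /* wide accumulator avoids overflow */
        for (int i = k; i < n; i++)
            acc += (int32_t)x[i] * (int32_t)x[i - k];
        r[k] = acc;
    }
}
```

The inner loop is a simple multiply-accumulate with no data-dependent branches, which is the kind of loop a VLIW compiler can software-pipeline.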

  20. Energy per sample • ‘C6211 consumes 1.24W • 75% high activity / 25% low activity • 1.24W / 80 channels = 15.5mW/channel • 15.5 mJ/sec/channel * 1/8000 = 1.8 uJ / sample
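The energy estimate above is a two-step division, sketched here so the inputs can be varied. The function name `energy_per_sample_uj` is hypothetical.

```c
/* Energy per sample in microjoules:
   total power is split evenly across channels, then divided by the
   sample rate (J/s per channel -> J per sample). */
double energy_per_sample_uj(double power_w, int channels, double sample_rate_hz)
{
    double watts_per_channel = power_w / channels;
    return watts_per_channel / sample_rate_hz * 1e6;
}
```

With 1.24 W, 80 channels and 8000 samples/sec this evaluates to about 1.94 uJ/sample; the 1.8 uJ figure is what the same formula gives at roughly 86 channels.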

  21. Number of channels • 150 x 10^6 cycles/sec x 0.02 sec/frame = 3.0 x 10^6 cycles/frame • 3.0 x 10^6 cycles/frame / 30,182 cycles = 99 channels
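The channel-count arithmetic above, as a small C sketch using integer math. The function names are hypothetical; the frame length is passed in milliseconds (20 ms for the 0.02 sec frame).

```c
#include <stdint.h>

/* CPU cycles available during one frame. */
int64_t cycles_per_frame_budget(int64_t clock_hz, int frame_ms)
{
    return clock_hz * frame_ms / 1000;
}

/* Whole channels that fit in the per-frame cycle budget,
   ignoring memory stalls and other overhead. */
int channels_supported(int64_t clock_hz, int frame_ms, int64_t cycles_per_channel_frame)
{
    return (int)(cycles_per_frame_budget(clock_hz, frame_ms) / cycles_per_channel_frame);
}
```

Calling `channels_supported(150000000, 20, 30182)` gives 99, matching the figure on the slide.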

  22. Memory • ‘C6211 cache complicates estimates • Performance is 85-99% of optimal for typical applications • 30,182 cycles becomes 35,508 cycles/frame at 85% efficiency => now support only 86 channels

  23. Memory • Try to account for off-chip memory transfers • ~220,000 cycles for 150 ns fetches for 80 channels => support 75-80 channels • Unable to verify/simulate because of unexpected tool expiration

  24. Memory • L2 usage • ~16kB Code size thanks to VLIW • 512 32-byte instruction clusters • More suited for ‘C6201 & larger processors • Remaining used by data for channels • 480 bytes each (8.5% of remaining memory) • L1 usage • L1P: Can’t tell because of cache • L1D: 2.2kB (~56%)

  25. Tool comments • Powerful, easy to use IDE… • When it worked. • Licensing problems for eval version • Debugging support a bit odd • puts/printf

  26. C6x Conclusions • Easily support 75-80 channels of coding • 26 dB fixed-point SNR, 16-bit types • VLIW = Large code size • Cache on a low-end DSP! • Good tools, but draconian copy protection
