
A DSP with Caches: A Study of the GSM-EFR Codec on the TI C6211




  1. A DSP with Caches: A Study of the GSM-EFR Codec on the TI C6211 • Tor Jeremiassen • Bell Labs, Lucent Technologies • tor@research.bell-labs.com

  2. Outline • DSPs • TI C6211 DSP • GSM-EFR Speech Transcoder • Methodology • Results • Conclusion

  3. Digital Signal Processors • Low price • embedded in cheap devices: cell phones, disk drives, ABS brakes, modems, etc. • Low power • must be able to run off batteries • High (Enough) Performance • on the right applications • special hardware support to accelerate particular functions • saturating arithmetic, Viterbi decode, Manhattan distance • Deterministic running time; must satisfy hard real-time constraints • Strong bias against caches
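To make two of the accelerated functions named on this slide concrete, here is a minimal sketch of what they compute. It is a plain software model written for illustration, not TI's hardware behavior or the ETSI reference code; the function names `sat_add16`, `wrap_add16`, and `manhattan_distance` are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def sat_add16(a: int, b: int) -> int:
    """Add two signed 16-bit values, clamping to the range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))

def wrap_add16(a: int, b: int) -> int:
    """Ordinary two's-complement 16-bit add, shown for contrast with saturation."""
    return ((a + b + 0x8000) & 0xFFFF) - 0x8000

def manhattan_distance(x, y) -> int:
    """Sum of absolute differences between two equal-length sample vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

With saturation, an overflowing sum sticks at the largest representable value rather than flipping sign, which is why speech codecs prefer it: a clipped sample is far less audible than a wrapped one.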

  4. TI C62xx Series DSP Features • 8-wide VLIW-like architecture • two clusters • Statically scheduled • “dependence” bit in instruction word controls parallel issue • Fixed-size 32-bit instruction set - small number of inst. formats • Predicated execution - 5 predicate registers • 5-cycle branch delay, 4-cycle load delay • Compiler is not just an afterthought • Aimed at communication infrastructure • High performance, higher power, higher price (relatively speaking)

  5. TI C6211 • Economy model of C62xx series • $25 vs. $80-$150 • Less memory on chip (cache vs. memory): 72 KB vs. 128 KB to 896 KB • Slower clock: 150 MHz vs. 200 MHz to 300 MHz • Cache organization: • I: 4 KB, 64 B, direct-mapped • D: 4 KB, 32 B, 2-way, no write-allocate • 4-entry write buffer • stall if full • empty before servicing read miss • L2: 0-64 KB, 128 B, 0-4 way • [Figure: memory hierarchy diagram with blocks labeled Instruction, Data, and Level 2, and bus widths of 256 bits, 128 bits, and 32 bits]
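The cache parameters on this slide determine how many sets each level-one cache has and how an address is split into tag, index, and offset. The sketch below works that out, assuming the second number in each entry (64 B, 32 B) is the line size; the helper names `cache_sets` and `split_address` are hypothetical.

```python
def cache_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets = capacity / (line size * associativity)."""
    return size_bytes // (line_bytes * ways)

def split_address(addr: int, line_bytes: int, sets: int):
    """Decompose a byte address into (tag, set index, byte offset within the line)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset

# Level-one caches as listed on the slide (line sizes are the assumed reading).
L1I_SETS = cache_sets(4 * 1024, 64, 1)   # direct-mapped instruction cache
L1D_SETS = cache_sets(4 * 1024, 32, 2)   # 2-way data cache
```

Under that reading both caches come out to 64 sets: the data cache has lines half as long but twice the associativity.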

  6. GSM-EFR Speech Transcoder • Global System for Mobile Communications - Enhanced Full Rate • European Telecommunications Standards Institute • Used in digital cellular telephony • Encodes 64 Kb/s input speech into a 12.2 Kb/s parameter stream • Size: encoder 13,000 lines of C, decoder 9,500 lines of C • Encoder roughly 10x complexity of decoder • [Figure: codec diagram with labels Speech (64 Kb/s) and Coded speech (12.2 Kb/s)]

  7. Methodology • GSM-EFR Codec (encoder/decoder) • based on reference code supplied by ETSI • some loop optimizations applied by hand, simplified low-level I/O • compiled using v2.10 of TI C6x compiler (-o3 + whole program opt.) • input: 1080 frames = 21.6 seconds of speech • Concatenation of GSM-EFR test vectors test10, test7 and test13 • Simulator • cycle-accurate, instruction-level simulator • cycles (excluding cache effects): encoder ~300M, decoder ~35M • Architecture • All second level memory used as cache except where noted • EMIF has priority over level one cache misses
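The figures on this slide imply a per-frame real-time budget. The sketch below is back-of-the-envelope arithmetic, not a result from the talk: it combines the 1080 frames, 21.6 seconds, and approximate cycle counts given here with the 150 MHz clock quoted for the C6211, and it ignores cache effects just as the quoted cycle counts do. The name `per_frame_budget` is hypothetical.

```python
FRAMES = 1080            # input length in frames
DURATION_S = 21.6        # input length in seconds
CLOCK_HZ = 150e6         # C6211 clock rate quoted earlier in the talk

def per_frame_budget(total_cycles: float):
    """Return (cycles used per frame, fraction of the real-time budget consumed)."""
    frame_seconds = DURATION_S / FRAMES          # 20 ms per frame
    cycles_per_frame = total_cycles / FRAMES
    cycles_available = CLOCK_HZ * frame_seconds  # cycles the DSP has per frame
    return cycles_per_frame, cycles_per_frame / cycles_available

encoder_cycles, encoder_load = per_frame_budget(300e6)  # encoder, approx. count
decoder_cycles, decoder_load = per_frame_budget(35e6)   # decoder, approx. count
```

On these numbers the encoder uses roughly 9% of the processor for one channel and the decoder roughly 1%, before any memory stalls are counted, which leaves room for the multi-channel use the talk discusses later.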

  8. Overall Performance

  9. Overall Cycle Breakdown

  10. Memory Stall Cycles Breakdown

  11. Changing Level 2 Memory Organization • There is 64 KB of second level memory divided into four 16 KB blocks • 0 to 4 of the blocks can be configured as cache • set associativity is equal to the number of blocks configured as cache • cache block size is 128 B • Memory not configured as cache is local memory in a unique part of the address space • Experiments: • vary total second level memory available to application • vary the amount allocated as cache • use level 1 cache miss profiles to decide which blocks are put in local memory
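The configuration rules on this slide can be tabulated directly. The following sketch enumerates the cache size, associativity, set count, and remaining local memory for a given number of blocks configured as cache; it only restates the slide's parameters, and the function name `l2_config` and its dictionary keys are hypothetical.

```python
BLOCK_BYTES = 16 * 1024   # each of the four second-level blocks
TOTAL_BLOCKS = 4
LINE_BYTES = 128          # L2 cache block size

def l2_config(cache_blocks: int) -> dict:
    """Describe the L2 split when `cache_blocks` blocks are configured as cache."""
    if not 0 <= cache_blocks <= TOTAL_BLOCKS:
        raise ValueError("cache_blocks must be between 0 and 4")
    cache_bytes = cache_blocks * BLOCK_BYTES
    ways = cache_blocks  # associativity equals the number of cache blocks
    sets = cache_bytes // (LINE_BYTES * ways) if ways else 0
    local_bytes = (TOTAL_BLOCKS - cache_blocks) * BLOCK_BYTES
    return {
        "cache_kb": cache_bytes // 1024,
        "ways": ways,
        "sets": sets,
        "local_kb": local_bytes // 1024,
    }
```

Because each added block adds one way, the number of sets stays at 128 for every nonzero configuration; only the associativity and the amount of local memory change.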

  12. Changing L2 Organization - Encoder

  13. Changing L2 Organization - Decoder

  14. Performance Effect of Cache Pollution • The TI C6x series DSPs are aimed at multi-channel applications • e.g., cellular base stations • Switching between multiple applications pollutes the cache w.r.t. any single application • interrupt handlers also contribute • Need to understand how periodic cache pollution affects the performance of a single application • Pessimistic experiments: • Invalidate both level one caches at periodic intervals • Invalidate entire cache hierarchy at periodic intervals
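The idea behind the pessimistic experiments can be illustrated with a toy model. The sketch below is not the simulator used in the study: it is a hypothetical direct-mapped cache that is flushed every fixed number of accesses, run on a made-up trace that loops over a 1 KB buffer. The function `count_misses`, the trace, and the cache parameters are all assumptions chosen for illustration.

```python
def count_misses(trace, line_bytes, sets, invalidate_every=None):
    """Count misses in a direct-mapped cache, optionally flushing it periodically."""
    tags = [None] * sets
    misses = 0
    for i, addr in enumerate(trace):
        # Model pollution pessimistically: drop every line at fixed intervals.
        if invalidate_every and i > 0 and i % invalidate_every == 0:
            tags = [None] * sets
        line = addr // line_bytes
        index = line % sets
        tag = line // sets
        if tags[index] != tag:
            misses += 1
            tags[index] = tag
    return misses

# Hypothetical trace: 10 passes over a 1 KB buffer, one 4-byte word at a time.
trace = [addr for _ in range(10) for addr in range(0, 1024, 4)]

baseline = count_misses(trace, line_bytes=32, sets=128)
rare_flush = count_misses(trace, line_bytes=32, sets=128, invalidate_every=1024)
frequent_flush = count_misses(trace, line_bytes=32, sets=128, invalidate_every=256)
```

In this toy setup each flush costs one refill of the working set, so the miss count grows in proportion to the invalidation frequency rather than jumping abruptly.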

  15. L1 Invalidations - Encoder

  16. L1 Invalidations - Decoder

  17. L1 & L2 Invalidations - Encoder

  18. L1 & L2 Invalidations - Decoder

  19. Conclusions • Caches work for DSPs! (at least for the GSM-EFR) • miss rates are low; a similar number of cycles is lost to NOPs on the C6211 • Worst-case assumption about cache miss impact on running time is unwarranted • performance degrades gracefully with frequency of invalidations • But don’t DSPs work on streaming data (locality breakdown)? • yes, but the ratio of computation to the bandwidth of the data stream is low in most applications, and the locality is captured well in small caches • Trend is towards bigger and more complex applications on DSPs
