
Performance/Area Efficiency in Embedded Chip Multiprocessors with Micro-caches


Presentation Transcript


  1. Performance/Area Efficiency in Embedded Chip Multiprocessors with Micro-caches
  Michela Becchi, Mark Franklin, Patrick Crowley
  ACM International Conference on Computing Frontiers 2007

  2. Context
  • Throughput-oriented, parallel embedded applications, such as networking
  • System performance goals:
     • Throughput
     • Area efficiency
     • Power efficiency
  • Area-efficient chip multiprocessors (CMPs):
     • Intel's 16-core IXP 2800
     • Cisco's Silicon Packet Processor

  3. Motivation and Contribution
  • Observation: many throughput-oriented applications (networking, communication) have small working sets
  • Idea: performance/area efficiency can be increased by replacing traditional I-caches with very small instruction caches (64-256 bytes), called μ-caches
  • The increased efficiency can be exploited in two ways:
     • to add computational power, i.e., processor cores
     • to reduce overall area
  • Contribution: an experimental study on the Tensilica Xtensa platform

  4. System Model
  [Figure: traditional design vs. proposed design. In the traditional design each core Pi has a private I-cache I$i; in the proposed design each core Pi has a tiny μ-cache μ$i backed by a single shared I$. In both designs each core keeps a private D-cache D$i, and the instruction and data memories sit off-chip beyond the chip boundary. A minimal model of the proposed fetch path is sketched below.]
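To make the fetch path concrete, here is a minimal Python sketch of the proposed design: a per-core direct-mapped μ-cache backed by a shared I-cache. This is an illustration, not the authors' Tensilica-based simulator; the sizes, the fixed 4-byte instruction width, and the toy loop are assumptions chosen to match the ranges discussed in the talk.

class DirectMappedCache:
    def __init__(self, size_bytes, line_bytes):
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // line_bytes
        self.tags = [None] * self.num_sets   # one tag per set (direct mapped)
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag               # fill the line on a miss
        self.misses += 1
        return False

def fetch(pc, ucache, shared_icache):
    """One instruction fetch: try the per-core u-cache first, then the
    shared I-cache; a miss in both goes out to instruction memory."""
    if ucache.access(pc):
        return "ucache"
    return "shared" if shared_icache.access(pc) else "memory"

# Toy usage: a 256B u-cache with 16B lines in front of a 4KB shared I$.
# A tight 64B loop fits entirely in the u-cache after the first iteration.
ucache = DirectMappedCache(256, 16)
icache = DirectMappedCache(4096, 16)
for _ in range(100):
    for pc in range(0x1000, 0x1040, 4):      # 16 four-byte instructions
        fetch(pc, ucache, icache)
print(f"u-cache hit rate: {ucache.hits / (ucache.hits + ucache.misses):.2%}")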

  5. Xtensa Tensilica Environment
  • Cycle-accurate system simulator
  • Built-in components:
     • Configurable and extensible processor cores
     • L1 I- and D-caches (from 1KB to 32KB)
        • No built-in cache hierarchy allowed
        • No cache sharing possible
     • Local and system memory
     • On-chip device-to-device connectors
     • Hardware-supported lock objects
  • Interface for defining and integrating custom components (external devices)
  • Several interconnection options:
     • Generic processor interface ports (PIF)
     • Processor local ports

  6. Our Setup
  • Processor: Xtensa LX microprocessor
     • 32-bit RISC, 5-stage scalar, single-threaded
     • Arranged in clusters that share a memory hierarchy
     • Organized to resemble Cisco's Silicon Packet Processor
  • Custom Xtensa components: the cache hierarchy
     • μ-caches:
        • freely sizable
        • connected through local ports (the PIF cannot sustain the required fetch request rates)
        • cache misses modeled through processor stalls
     • Shared I-cache:
        • single-ported
        • supports hit-under-miss through a configurable number of miss-status holding registers (MSHRs)

  7. Design Space
  [Table of design-space parameters not recoverable from the transcript]
  (*) μ-caches sized to ensure a 99% hit rate on the benchmarks of interest
  (**) a clock rate of 300 MHz (Tensilica Xtensa's estimate) is assumed

  8. Benchmarks
  • T. Wolf and M. Franklin, "CommBench – A telecommunication benchmark for network processors", Proc. ISPASS 2000
  • G. Memik et al., "NetBench: A benchmarking suite for network processors", Proc. ICCAD 2001
  • Implementation of the Viterbi algorithm: http://viterbi.wustl.edu

  9. Application Scenarios
  [Figure: task-to-core mappings for Cases A, B, and C]
  • Case A: the motivating case. Code sharing on a small working set; exploits packet-level parallelism (every core runs the same task)
  • Case B: no code sharing in the I-cache (a different task on each core)
  • Case C: code sharing with increased code size and working set (each core executes all tasks)

  10. Base Case: 1-core Cluster
  [Figure: a single core with a μ$ vs. a single core with a traditional I$]
  • Most configurations stay within a 20% performance penalty
  • Potential for improvement with bigger clusters
  • No substantial difference when varying the μ-cache size

  11. Base Case: 1-core Cluster (cont'd)
  • Effect of μ-cache associativity:
     • Direct-mapped μ-caches are generally better
     • Instructions that fit in small μ-caches tend to belong to simple loops residing in adjacent memory words
     • Set-associative caches occupy more area than direct-mapped ones
  • Effect of μ-cache line size:
     • Performance decreases when going from 16B to 8B cache lines (see the sketch below)
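A small back-of-the-envelope calculation (assumed numbers, not from the paper) showing why halving the line size hurts straight-line fetch: assuming fixed 4-byte instructions, a 16B fill covers four instructions but an 8B fill only two, so sequential code takes twice as many μ-cache fills.

def fill_misses(code_bytes, line_bytes):
    # Cold misses needed to stream code_bytes of straight-line code.
    return code_bytes // line_bytes

INSTR_BYTES = 4                              # assumed fixed instruction width
for line_bytes in (8, 16):
    fills = fill_misses(1024, line_bytes)    # 1KB of sequential code
    fetches = 1024 // INSTR_BYTES
    print(f"{line_bytes:>2}B lines: {fills} fills, "
          f"{fills / fetches:.0%} of fetches miss")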

  12. Effect of Cluster Size
  • High μ-cache hit rate: speed-up is proportional to cluster size
  • Lower μ-cache hit rate: contention for the shared cache limits speed-up
     • This effect worsens as the μ-cache size decreases
  • A back-of-the-envelope contention model is sketched below
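The sketch below is my own bandwidth argument, not the paper's model: each core fetches one instruction per cycle when unstalled, a fraction (1 - h) of fetches misses the μ-cache, and the single-ported shared I-cache serves at most one request per cycle, so aggregate speed-up is capped at 1/(1 - h) cores' worth.

def cluster_speedup(n_cores, ucache_hit_rate):
    miss_rate = 1.0 - ucache_hit_rate
    if miss_rate == 0.0:
        return float(n_cores)            # never touches the shared cache
    bandwidth_cap = 1.0 / miss_rate      # cores' worth the shared port sustains
    return min(float(n_cores), bandwidth_cap)

for h in (0.99, 0.95, 0.90):
    scaling = [cluster_speedup(n, h) for n in (1, 4, 8, 16)]
    print(f"hit rate {h:.0%}: " + ", ".join(f"{s:.1f}x" for s in scaling))

With a 99% hit rate all cluster sizes up to 16 scale linearly; at 90% the shared port saturates at ten cores' worth of throughput, matching the slide's observation that contention grows as the μ-cache shrinks.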

  13. Non-blocking Shared Caches
  [Figure: gzip and cast with 64B μ-caches]
  • Contention at the shared cache is hidden through miss-status holding registers (MSHRs)
  • No more MSHRs than processors per cluster are required (see the sketch below)
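A sketch of the MSHR mechanics implied by the slide (the exact bookkeeping is an assumption): a primary miss allocates an entry, a later miss to the same line merges into it instead of issuing a second request, and a core stalls only when all MSHRs are busy. Since each single-threaded core has at most one outstanding fetch miss, more MSHRs than cores per cluster can never be used.

class MSHRFile:
    def __init__(self, num_entries):
        self.capacity = num_entries
        self.entries = {}                # line address -> waiting core ids

    def handle_miss(self, core_id, line_addr):
        if line_addr in self.entries:
            self.entries[line_addr].append(core_id)   # merge secondary miss
            return "merged"
        if len(self.entries) < self.capacity:
            self.entries[line_addr] = [core_id]       # track primary miss
            return "allocated"
        return "stall"                   # every MSHR busy: the core waits

    def fill(self, line_addr):
        # The line came back from memory: wake every core waiting on it.
        return self.entries.pop(line_addr, [])

mshrs = MSHRFile(num_entries=4)
print(mshrs.handle_miss(0, 0x40))        # allocated
print(mshrs.handle_miss(1, 0x40))        # merged: no extra memory request
print(mshrs.fill(0x40))                  # [0, 1] resume together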

  14. Performance/Area Analysis
  • The Tensilica design environment estimates the area of processors and their components
     • A 4KB, direct-mapped I-cache accounts for 26% of core area in their 130nm process
  • We can use these area estimates, along with our performance results, to study performance/area efficiency (a worked example follows)
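A worked example of the bookkeeping. The 26% I-cache area share and the 15% μ-cache/I-cache area ratio come from the slides; the 95% relative per-core throughput is an assumption picked to show how a roughly 22% efficiency gain, the average later reported for 16-core clusters, can arise.

ICACHE_AREA_SHARE = 0.26   # 4KB direct-mapped I$ share of core area (slide 14)
UCACHE_TO_ICACHE = 0.15    # u-cache area relative to the I$ (slide 16)
REL_PERFORMANCE = 0.95     # assumed per-core throughput with u-caches

area_per_core = 1.0 - ICACHE_AREA_SHARE * (1.0 - UCACHE_TO_ICACHE)
cores_per_unit_area = 1.0 / area_per_core
perf_per_area_gain = cores_per_unit_area * REL_PERFORMANCE - 1.0

print(f"area per core: {area_per_core:.3f} of baseline")       # 0.779
print(f"cores per unit area: {cores_per_unit_area:.3f}x")      # 1.284x
print(f"performance/area gain: {perf_per_area_gain:.1%}")      # 22.0%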

  15. Performance/Area Analysis (cont'd)
  [Figure: performance/area efficiency across configurations; 16-core clusters average a 22% improvement]

  16. Accounting for Area Estimate Uncertainty
  • Value previously used: a 26% I-cache share of core area
  • Speed-up increases exponentially with the relative I-cache area
  • A conservative 15% area ratio between μ-cache and I-cache is assumed; the design remains beneficial for ratios up to 55% (a break-even sweep is sketched below)
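Using the same bookkeeping as the earlier worked example, one can sweep the assumed relative per-core throughput to see where the break-even μ-cache/I-cache area ratio lands. This sweep is illustrative only; the 55% figure on the slide comes from the paper's measured results.

ICACHE_AREA_SHARE = 0.26

def break_even_ratio(rel_perf):
    # Gain is positive while rel_perf / (1 - share * (1 - r)) > 1;
    # solving for the area ratio r at equality:
    return 1.0 - (1.0 - rel_perf) / ICACHE_AREA_SHARE

for p in (0.80, 0.88, 0.95):
    print(f"rel. throughput {p:.0%}: u-caches win up to "
          f"{break_even_ratio(p):.0%} of the I$ area")

Under this simple model, a relative throughput near 88% puts the break-even close to the 55% the slide reports.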

  17. Case B: Different Task on Each Core
  [Figure: dual-core configurations with CommBench programs]
  • cast is not well supported
  • Speed-up in every deployment except those using cast

  18. Case C: Each Core Executes All Tasks
  [Figure: different combinations of CommBench programs; 16-core clusters average a 25% improvement]
  • Despite the larger code base, the use of μ-caches still helps
  • Each task can be considered a program phase
     • Compulsory misses at a phase change account for a small fraction of the overall memory accesses (rough arithmetic below)
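Rough arithmetic (all numbers assumed, not from the paper) for why phase-change misses barely matter: moving to a new task cold-misses once per μ-cache line of that task's working set, which is tiny next to the number of fetches executed within the phase.

WORKING_SET_BYTES = 256          # code touched per task (assumption)
LINE_BYTES = 16                  # u-cache line size
FETCHES_PER_PHASE = 1_000_000    # instructions executed per phase (assumption)

cold_misses = WORKING_SET_BYTES // LINE_BYTES    # one fill per new line
fraction = cold_misses / FETCHES_PER_PHASE
print(f"{cold_misses} cold misses per phase change = "
      f"{fraction:.4%} of that phase's fetches")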

  19. Summary
  • Observations:
     • Traditional I-caches are over-provisioned for networking applications, which often have small working sets
     • Optimal use of chip area is a central issue in CMP design
  • Idea: trade cache area for processing power by taking advantage of 64B-256B μ-caches
  • Results (on the Tensilica Xtensa platform):
     • Programs run in isolation: a 16-core cluster with 256B μ-caches has on average 22% greater performance/area efficiency than a traditional cluster with 4KB I-caches
     • Aggregate application consisting of a sequence of tasks: the improvement is about 25%
     • Area-estimate uncertainty: the μ-cache design remains beneficial for μ-caches occupying up to 55% of the area of an I-cache

  20. Thanks
  • SBS group (Mark Franklin, Roger Chamberlain, Jim Buckley, Jeremy Buhler, Eric Tyson, Justin Brown, Saurabh Gayen, Patrick Crowley)
  • National Science Foundation (Grants CCF-0430012 and CCF-0427794)
