Performance/Area Efficiency in Embedded Chip Multiprocessors with Micro-caches
Michela Becchi, Mark Franklin, Patrick Crowley
ACM International Conference on Computing Frontiers 2007
Context • Throughput-oriented, parallel embedded applications, such as networking • System performance goals • Throughput • Area efficiency • Power efficiency • Area-efficient Chip Multiprocessors (CMP) • Intel’s 16-core IXP 2800 • Cisco’s Silicon Packet Processor
Motivation and Contribution • Observation: • Many throughput-oriented applications (networking, communication) have small working sets • Idea: • Performance/area efficiency can be increased by replacing traditional I-caches with very small instruction caches (64-256 bytes), called μ-caches • Increased efficiency can be exploited in two ways • to add computational power, i.e., processor cores • to reduce overall area • Contribution: • Experimental study with the Tensilica Xtensa platform
System Model

[Diagram: Traditional design: each core P1…Pn pairs with a private I-cache I$1…I$n and D-cache D$1…D$n, with instruction and data memory beyond the chip boundary. Proposed design: each core instead has a tiny μ-cache μ$1…μ$n backed by a single shared on-chip I-cache I$.]
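To make the fetch path concrete, here is a minimal Python sketch of the proposed two-level hierarchy: a per-core μ-cache backed by a single shared I-cache. This is our illustration, not the paper's simulator; the class name, hit latencies, and the memory latency are assumptions.

# Minimal sketch of the proposed fetch path (not the paper's simulator).
# Latencies below are illustrative assumptions.

class DirectMappedCache:
    def __init__(self, size_bytes, line_bytes, hit_latency, backing=None):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.hit_latency = hit_latency
        self.backing = backing                  # next level; None = memory below
        self.tags = [None] * self.num_lines

    def fetch(self, addr):
        """Return the number of cycles to fetch the instruction at addr."""
        block = addr // self.line_bytes
        idx, tag = block % self.num_lines, block // self.num_lines
        if self.tags[idx] == tag:
            return self.hit_latency             # hit at this level
        self.tags[idx] = tag                    # fill the line on a miss
        below = self.backing.fetch(addr) if self.backing else 50  # assumed memory latency
        return self.hit_latency + below

# Proposed design: a 256B μ-cache per core over a shared 4KB I-cache.
shared_icache = DirectMappedCache(4096, 16, hit_latency=2)
ucache = DirectMappedCache(256, 16, hit_latency=1, backing=shared_icache)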
Xtensa Tensilica Environment • Cycle-accurate system simulator • Built-in components • Configurable and extensible processor cores • L1 I- and D-caches (from 1KB to 32KB) • No built-in cache hierarchy allowed • No cache sharing possible • Local and system memory • On-chip device-to-device connectors • Hardware-supported lock objects • Interface for definition and integration of custom components (external devices) • Several interconnection options: • Generic processor interface ports (PIF) • Processor local ports
Our Setup • Processor: Xtensa LX Microprocessor • 32-bit RISC, 5-stage scalar processor • Arranged in clusters that share a memory hierarchy • Single-threaded • Organized to resemble Cisco’s Silicon Packet Processor • Custom Xtensa components: cache hierarchy • μ-caches • freely sizeable • connected through local ports • the PIF cannot sustain the required fetch request rates • cache misses modeled through processor stalls • Shared I-cache • Single-ported • Supports hit-under-miss through a configurable number of miss status holding registers (MSHRs)
Design Space

[Table of evaluated configurations not preserved in this transcript.]
(*) μ-caches sized to ensure a 99% hit rate on the benchmarks of interest
(**) a clock rate of 300 MHz (Tensilica Xtensa’s estimate) is assumed
Benchmarks • T. Wolf and M. Franklin, “CommBench – A Telecommunication Benchmark for Network Processors”, Proc. ISPASS 2000 • G. Memik et al., “NetBench: A Benchmarking Suite for Network Processors”, Proc. ICCAD 2001 • Implementation of the Viterbi algorithm: http://viterbi.wustl.edu
Application Scenarios

[Diagram: task-to-core mappings for Cases A, B, and C.]
• Case A • Motivating case: code sharing on a small working set • Exploitation of packet-level parallelism • Case B • No code sharing in the I-cache • Case C • Code sharing • Increased code size and working set
Base Case: 1-core Cluster (μ$ vs. I$) • Most configurations within a 20% performance penalty • Potential for improvement with bigger clusters • No substantial difference when varying the μ-cache size
Base Case: 1-core Cluster (cont’d) • Effect of μ-cache associativity • Direct-mapped μ-caches are generally better • Instructions that fit small μ-caches tend to belong to simple loops residing in adjacent memory words • Set-associative caches occupy more area than direct-mapped ones • Effect of μ-cache line size • Performance decreases when going from 16B to 8B cache lines
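The adjacency argument is easy to check numerically: a contiguous loop body smaller than the μ-cache maps every line to a distinct direct-mapped index, so after one warm-up pass every fetch hits. A small illustration (our numbers, not the paper's):

# 256B direct-mapped μ-cache with 16B lines (illustrative sizes).
LINE_BYTES = 16
NUM_LINES = 256 // LINE_BYTES

def ucache_index(addr):
    return (addr // LINE_BYTES) % NUM_LINES

# A 128-byte loop body starting at 0x400 touches 8 consecutive lines,
# each landing on a distinct index -> no conflict misses in steady state.
loop_lines = [ucache_index(a) for a in range(0x400, 0x400 + 128, LINE_BYTES)]
assert len(set(loop_lines)) == 8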
Effect of Cluster Size • High μ-cache hit rate: speed-up proportional to cluster size • Lower μ-cache hit rate: contention for the shared cache limits speed-up • This effect worsens as the μ-cache size decreases
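A back-of-envelope throughput model shows both regimes. This is our simplification (single-cycle shared-cache port, no miss overlap), not the paper's analysis, and the miss penalty and base CPI are assumed values.

def cluster_throughput(n_cores, hit_rate, miss_penalty=3.0, cpi_base=1.0):
    """Cluster instructions per cycle under a single-ported shared cache."""
    miss_rate = 1.0 - hit_rate
    cpi = cpi_base + miss_rate * miss_penalty    # one core in isolation
    # The shared cache serves at most one miss per cycle.
    miss_demand = n_cores * miss_rate / cpi
    if miss_demand > 1.0:
        return 1.0 / miss_rate                   # port-bound regime
    return n_cores / cpi                         # scales with cluster size

print(cluster_throughput(16, 0.99))   # ~15.5 IPC: near-linear scaling
print(cluster_throughput(16, 0.90))   # 10.0 IPC: contention-limited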
Non-blocking Shared Caches

[Plots: gzip and cast with 64B μ-caches.]
• Contention for the shared cache is hidden through miss-status holding registers (MSHRs) • No more MSHRs than processors per cluster are required
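The MSHR mechanism can be sketched as a small bookkeeping structure: hits proceed under an outstanding miss, requests to an already-pending line merge, and a new miss stalls only when every MSHR is busy. A schematic Python sketch (names and return values are our assumptions):

class SharedCacheMSHRs:
    """Schematic hit-under-miss bookkeeping for the shared I-cache."""

    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.pending = set()           # line addresses with in-flight misses

    def request(self, line_addr, is_hit):
        if is_hit:
            return "hit"               # serviced even while misses are pending
        if line_addr in self.pending:
            return "merged"            # piggy-backs on the in-flight miss
        if len(self.pending) < self.num_mshrs:
            self.pending.add(line_addr)
            return "miss-issued"       # a free MSHR tracks the new miss
        return "stall"                 # all MSHRs busy: requester waits

    def fill(self, line_addr):
        self.pending.discard(line_addr)   # memory returned the line

Since each single-threaded core can have at most one fetch miss outstanding, one MSHR per processor already covers the worst case, consistent with the observation above.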
Performance/area Analysis • The Tensilica design environment estimates the area of processors and their components • A 4KB, direct-mapped I-cache accounts for 26% of core area in their 130nm process • We can use these area estimates, along with our performance results, to study performance/area efficiency
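The arithmetic behind the trade-off, using the 26% figure above and the conservative 15% μ-cache/I-cache area ratio from a later slide. This worked example is ours; it ignores the shared I-cache's own area (amortized across the cluster) and the throughput lost to μ-cache misses, which is why it overestimates the measured gain.

icache_frac = 0.26            # I-cache share of a traditional core's area
ucache_vs_icache = 0.15       # assumed μ-cache area / I-cache area

area_traditional = 1.0                                    # core + 4KB I-cache
area_proposed = (1.0 - icache_frac) + icache_frac * ucache_vs_icache

extra_cores = area_traditional / area_proposed - 1.0
print(f"~{extra_cores:.0%} more cores in the same area")  # ~28% more cores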
Performance/area Analysis (cont’d) • 16-core cluster with 256B μ-caches: 22% average performance/area improvement
Accounting for Area Estimate Uncertainty • Value previously used: 26% relative I-cache area • Speed-up increases exponentially with the relative I-cache area • A conservative 15% ratio between μ-cache and I-cache area is assumed (the design remains beneficial up to 55%)
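The same arithmetic extends to a sensitivity sweep over both uncertain quantities; the grid values below are illustrative.

def cores_gained(icache_frac, ucache_ratio):
    """Fractional core-count gain from shrinking the I-cache to a μ-cache."""
    per_core_area = (1.0 - icache_frac) + icache_frac * ucache_ratio
    return 1.0 / per_core_area - 1.0

for icache_frac in (0.15, 0.26, 0.40):        # relative I-cache area
    for ratio in (0.15, 0.55):                # μ-cache / I-cache area ratio
        print(icache_frac, ratio, f"{cores_gained(icache_frac, ratio):+.1%}")

The gain grows quickly with the relative I-cache area and stays positive even at the 55% ratio, matching the slide's observation.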
Case B: Different Task on Each Core • Dual-core configurations with CommBench programs • The cast benchmark is not well supported • Speed-up in all deployments except those using cast
Case C: Each Core Executes All Tasks • Different combinations of CommBench programs • Despite the larger code base, the use of μ-caches helps • Each task can be considered a program phase • Mandatory misses at phase changes correspond to a small fraction of the overall memory accesses • 16-core cluster: 25% average improvement
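Why phase-change misses stay a small fraction is simple arithmetic: each task switch refills at most the whole shared I-cache, amortized over all the instructions executed within a phase. The phase length below is our assumption for illustration.

icache_bytes, line_bytes = 4096, 16
insts_per_phase = 1_000_000          # assumed instructions between task switches

cold_misses = icache_bytes // line_bytes            # at most 256 line fills
miss_fraction = cold_misses / insts_per_phase
print(f"{miss_fraction:.4%} of fetches are phase-change misses")   # 0.0256%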
Summary • Observations: • Traditional I-caches are over-provisioned for networking applications, which often have small working sets • Optimal use of chip area is a central issue in CMP design • Idea: • Trade cache area for processing power by taking advantage of 64B-256B μ-caches • Results (on the Tensilica Xtensa platform): • Programs run in isolation: a 16-core cluster with 256B μ-caches has on average 22% greater performance/area efficiency than a traditional cluster with 4KB I-caches • Aggregate application consisting of a sequence of tasks: the improvement is about 25% • Uncertainty in area estimates: the μ-cache design remains beneficial for μ-caches occupying up to 55% of the area of an I-cache
Thanks • SBS group (Mark Franklin, Roger Chamberlain, Jim Buckley, Jeremy Buhler, Eric Tyson, Justin Brown, Saurabh Gayen, Patrick Crowley) • National Science Foundation (Grants CCF-0430012 and CCF-0427794)