Miss Reduction in Embedded Processors Through Dynamic, Power-Friendly Cache Design Garo Bournoutian and Alex Orailoglu Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC’08) June 2008
Abstract • Today, embedded processors are expected to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate. • This paper explores a new data cache design for use in modern high-performance embedded processors that dynamically improves execution time, power efficiency, and determinism within the system. The simulation results show a significant improvement in cache miss ratios and a reduction in power consumption of approximately 30% and 15%, respectively.
What’s the Problem? • Primary (L1) caches in embedded processors are direct-mapped for power efficiency • However, direct-mapped caches are predisposed to thrashing • Hence, a cache design is required that will • Improve performance, power efficiency, and determinism • Minimize the area cost
Related Works • Cache optimization techniques for embedded processors • Reduce cache conflicts and cache pollution • Increase power efficiency • Provide extended associativity • Improve cache utilization • Prior approaches • Retain data evicted from the cache in a small associative victim cache [2] • Pseudo-associative caches: place blocks in a second associated line [5] • Dual data cache scheme that distinguishes spatial, temporal, and single-use memory references [3] • Application-specific cache partitioning [4] • Filter caches [6] • Shut down cache ways, adapting to the application [7] • This paper: • Expandable cache lookup, performed only when necessary • Dynamically detect thrashing behavior and expand the affected sets of the data cache
Motivating Example • Illustrates why the select sets need to be expanded dynamically • Shows the insufficiency of the victim cache • Example thrashing code • B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R • (Diagram: successive cache thrashing across Set-R, Set-S, Set-Q)
Motivating Example (Cont.) • Cache trace of the example thrashing code • B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R • (Diagram: in the main cache, Set-S thrashes between B[i] and E[i], Set-Q between C[i] and F[i], Set-R between A[i] and D[i]; a 2-entry victim cache cycles through B[i], A[i], F[i], C[i], E[i], D[i]) • Uncorrelated evicted data pollutes the victim cache
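The thrashing pattern above can be sketched in a small simulation. This is a hypothetical setup (the set count and base addresses are assumptions, chosen so that A/D, B/E, and C/F collide): in a direct-mapped cache, every access in the loop evicts the line the next iteration needs, so the miss rate stays at 100%.

```python
# Hypothetical sketch of the slide's thrashing example: a tiny
# direct-mapped cache where A/D, B/E, C/F conflict-map to one set each.
NUM_SETS = 4          # assumption, kept small for illustration
LINE = 32             # 32-byte line size, as in the experimental setup

# Base addresses (assumptions) spaced by a multiple of the cache size
# (NUM_SETS * LINE = 128 bytes) so each pair shares a set but not a tag.
bases = {"A": 0x000, "B": 0x020, "C": 0x040,
         "D": 0x100, "E": 0x120, "F": 0x140}

cache = [None] * NUM_SETS     # one tag per set (direct-mapped)
accesses = misses = 0
for i in range(16):           # loop touching A[i] .. F[i], 4-byte elements
    for name in "ABCDEF":
        addr = bases[name] + 4 * i
        idx = (addr // LINE) % NUM_SETS
        tag = addr // (LINE * NUM_SETS)
        accesses += 1
        if cache[idx] != tag:
            misses += 1
            cache[idx] = tag  # evict whatever was there
print(f"miss rate: {misses / accesses:.0%}")  # -> miss rate: 100%
```

Every access evicts the line its conflict partner will need, so no access after the cold misses ever hits; this is exactly the successive-thrashing behavior a victim cache fails to fix once the evictions become uncorrelated.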
The Dynamically Expandable L1 Cache Architecture • 1st: Circular recently-evicted-set list • 2nd: Expandable cache lookup
(1) Circular Recently-Evicted-Set List • A small circular list • Keeps track of the indices of the most recently evicted sets • Goal: detect a probable thrashing set • Operation • Look up the circular list only on a cache miss • If the missed set is present in the list, conclude that the set is in a thrashing state and should dynamically be expanded: enable the expand bit for that set • The circular list is accessed and updated only during a cache miss, so hit timing is not affected
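A minimal sketch of this mechanism (the list length and the dictionary of per-set expand bits are illustrative assumptions); a `deque` with `maxlen` naturally behaves like a circular list whose oldest entry drops off:

```python
from collections import deque

LIST_SIZE = 5                               # assumed list length
recently_evicted = deque(maxlen=LIST_SIZE)  # circular: oldest entry drops off
expand_bit = {}                             # per-set expand flags (default off)

def on_cache_miss(set_index):
    # Consulted only on a miss, so cache-hit timing is unaffected.
    if set_index in recently_evicted:
        # The set missed again shortly after an eviction: probable
        # thrashing, so mark it for expansion.
        expand_bit[set_index] = 1
    recently_evicted.append(set_index)

on_cache_miss(7)             # first miss on set 7: just recorded
on_cache_miss(7)             # second miss while still in the list: expand
print(expand_bit.get(7, 0))  # -> 1
```

A set that misses only once simply ages out of the list; only sets that miss repeatedly within the list's window get their expand bit enabled.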
(2) Expandable Cache Lookup • Goal: allow a set to re-lookup into a predefined secondary set (virtually doubling the associativity of a given set) • Operation • The secondary set is determined by a fixed mapping function: flip the most significant bit of the set index • Besides the expand bit, each cache set has a toggle bit that selects whether lookup starts on the primary or the secondary set • Enabled when a cache hit occurs on the secondary set; disabled when a cache hit occurs on the primary set • Once the first mechanism has detected a probable thrashing set: • 1st lookup: cache miss with expand bit = 1 • 2nd lookup in the predefined secondary set on the next cycle • Found: cache hit with a one-cycle penalty • Not found: full cache miss
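The fixed secondary-set mapping reduces to a single XOR with the index MSB (256 sets assumed here, matching the experimental setup later in the deck):

```python
NUM_SETS = 256
MSB = NUM_SETS >> 1        # bit 7 (value 128) for a 256-set cache

def secondary_set(index):
    # Flip the most significant bit of the set index, so sets in the
    # lower half of the cache expand into the upper half and vice versa.
    return index ^ MSB

print(secondary_set(0b00000001))   # set 1   -> 129
print(secondary_set(0b10000001))   # set 129 -> 1
```

Because XOR is its own inverse, the mapping pairs sets symmetrically: each set's secondary set has the original as its own secondary, so no extra mapping table is needed in hardware.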
A Demonstrative Example • Cache trace of the proposed cache architecture • (Diagram: the expand bits of Set-S, Set-Q, and Set-R are enabled after the circular list is updated with each set; B[i]/E[i], C[i]/F[i], and A[i]/D[i] then coexist by spilling into the secondary sets Set-S’, Set-Q’, and Set-R’)
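Putting both mechanisms together on the thrashing trace from the motivating example gives a sketch like the following. The sizes, addresses, and especially the fill policy (once a set is expanded, a conflicting line fills the secondary set) are assumptions for illustration; the slides do not specify the exact replacement policy:

```python
from collections import deque

NUM_SETS, LINE, LIST_SIZE = 8, 32, 5   # all sizes are assumptions
MSB = NUM_SETS >> 1

tags = [None] * NUM_SETS
expand = [0] * NUM_SETS
evicted = deque(maxlen=LIST_SIZE)

def access(addr):
    """Return True on a hit (primary or expanded secondary set)."""
    idx = (addr // LINE) % NUM_SETS
    tag = addr // (LINE * NUM_SETS)
    if tags[idx] == tag:                        # 1st lookup: primary set
        return True
    if expand[idx] and tags[idx ^ MSB] == tag:  # 2nd lookup (+1 cycle)
        return True
    if idx in evicted:                          # miss: consult circular list
        expand[idx] = 1
    evicted.append(idx)
    # Assumed fill policy: an expanded set spills into its secondary set.
    victim = idx ^ MSB if (expand[idx] and tags[idx] is not None) else idx
    tags[victim] = tag
    return False

# A/D, B/E, C/F conflict (bases spaced by the cache size, 256 bytes).
bases = {"A": 0x000, "B": 0x020, "C": 0x040,
         "D": 0x100, "E": 0x120, "F": 0x140}
misses = sum(not access(bases[n] + 4 * i)
             for i in range(8) for n in "ABCDEF")
print(misses, "misses out of 48 accesses")   # -> 6 misses out of 48 accesses
```

After the six initial misses expand the three conflicting sets, both lines of each pair coexist (one in the primary set, one in the secondary set), and every later iteration hits; the same trace on the plain direct-mapped cache missed on every access.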
Experimental Setup • Use the SimpleScalar toolset [8] for performance evaluation • Two baseline configurations • 256-set, direct-mapped L1 data cache with a 32-byte line size • 256-set, 4-way set-associative L1 data cache with a 32-byte line size • Use CACTI [10] to evaluate power efficiency • Assume an L1/L2 power ratio of 20 • Accessing data in L2 costs 20 times the power of accessing it in L1 • Benchmarks • 7 representative programs from the SPEC CPU2000 suite [9]
Performance Improvement: Direct-Mapped Cache • Criterion: miss rate reduction over the baseline • (Chart compares the proposed design with a 5-entry recently-evicted-set list against an 8-entry victim cache) • The miss rate improvement of the proposed implementation: arithmetic mean of 30.75%
Performance Improvement: 4-Way Set-Associative Cache • Criterion: miss rate reduction over the baseline • (Chart compares the proposed design with an 8-entry recently-evicted-set list against a 64-entry victim cache) • The miss rate improvement of the proposed implementation: arithmetic mean of 26.74% • Significant miss rate reduction for both direct-mapped and set-associative caches
Power Improvement: Direct-Mapped Cache • The power reduction of the proposed implementation averages 15.73% • Consistently provides a power reduction across the benchmarks
Power Improvement: 4-Way Set-Associative Cache • However, the power reduction varies across the benchmarks, with some exceptions showing higher power costs • The average was still an improvement of 4.19%
Conclusions • This paper proposed a dynamically expandable data cache architecture • Composed of two main mechanisms • Circular recently-evicted-set list: detects a probable thrashing set • Expandable cache lookup: virtually increases the associativity of a given set • Experimental results show that the proposed technique achieves • Significant reductions in cache misses and power consumption • For both direct-mapped and set-associative caches
Comments on This Paper • The related works are not strongly connected • The results for the power improvement are too coarse • They do not show the extra power consumption of the support circuitry • Results for different lengths of the circular list are not shown