Power Savings in Embedded Processors through Decode Filter Cache Weiyu Tang, Rajesh Gupta, Alex Nicolau
Overview • Introduction • Related Work • Decode Filter Cache • Results and Conclusion
Introduction • Instruction delivery is a major power consumer in embedded systems • Instruction fetch • 27% of processor power in StrongARM • Instruction decode • 18% of processor power in StrongARM • Together, instruction delivery accounts for roughly 45% of processor power • Goal • Reduce instruction delivery power with minimal performance penalty
Related Work • Architectural approaches to reduce instruction fetch power • Store instructions in small, power-efficient structures • Examples: • Line buffers • Loop cache • Filter cache
Related Work • Architectural approaches to reduce instruction decode power • Avoid unnecessary decoding by saving decoded instructions in a separate cache • Trace cache • Stores decoded instructions in execution order • Fixed cache access order: the instruction cache is accessed only on trace cache misses • Targeted at high-performance processors • Increases fetch bandwidth • Requires sophisticated branch prediction mechanisms • Drawback • Not power efficient, as the cache size is large
Related Work • Micro-op cache • Stores decoded instructions in program order • Fixed cache access order: the instruction cache and the micro-op cache are accessed in parallel to minimize the micro-op cache miss penalty • Drawbacks • Needs an extra pipeline stage, which increases the misprediction penalty • Requires a branch predictor • Per-access power is large • The micro-op cache itself is large • Power is consumed by both the micro-op cache and the instruction cache
Decode Filter Cache • Targeted processors • Single-issue, in-order execution • Research goals • Use a small (and power-efficient) cache to save decoded instructions • Reduce instruction fetch power and decode power simultaneously • Reduce power without sacrificing performance • Problems to address • Which cache organization to use • Where to fetch instructions from, as instructions can be provided from multiple sources • How to minimize the decode filter cache miss latency
[Figure: Processor pipeline (fetch, decode, execute, mem, writeback). The fetch address drives a predictor that selects the instruction source among the line buffer, the decode filter cache, and the I-cache.]
Decode Filter Cache • Decode filter cache organization • Problems with a traditional cache organization • The decoded instruction width varies • Saving all decoded instructions wastes cache space • Our approach • Instruction classification • Classify instructions as cacheable or uncacheable based on the decoded instruction width distribution • Use a "cacheable ratio" to balance cache utilization against the number of instructions that can be cached • Sectored cache organization • Each instruction can be cached independently of its neighbors • Neighboring lines share a tag to reduce the tag store cost (a data-structure sketch follows below)
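As a concrete picture of the sectored organization, here is a minimal data-structure sketch in C. It is illustrative only: the sector size of 4 instructions, the 64-bit decoded slot width, and all identifiers are assumptions, not values from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define INSTRS_PER_LINE    4    /* assumed number of instructions sharing one tag */
#define DECODED_SLOT_BITS 64    /* assumed fixed width reserved per decoded entry */

/* One sectored decode-filter-cache line: neighboring instruction slots share a
 * single tag (cheap tag store), while per-slot valid bits let each instruction
 * be cached or left uncached independently of its neighbors. */
typedef struct {
    uint32_t tag;                       /* shared by all slots in the line       */
    bool     valid[INSTRS_PER_LINE];    /* one valid bit per decoded instruction */
    uint64_t decoded[INSTRS_PER_LINE];  /* decoded forms of cacheable instrs     */
} dfc_line;

/* Instruction classification: only instructions whose decoded form fits the
 * fixed slot width are cacheable; the width threshold sets the "cacheable
 * ratio" that trades cache utilization against coverage. */
static bool is_cacheable(unsigned decoded_width_bits)
{
    return decoded_width_bits <= DECODED_SLOT_BITS;
}
```

The shared tag keeps the tag store small, while the per-slot valid bits allow uncacheable (too-wide) instructions to simply be skipped without invalidating their neighbors.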
Decode Filter Cache • Where to fetch instructions • Instructions can be provided from one of the following sources • Line buffer • Decode filter cache • Instruction cache • Predictive order for instruction fetch • For power efficiency, the decode filter cache or the line buffer is accessed first when the instruction is likely to hit there • To minimize the decode filter cache miss penalty, the instruction cache is accessed directly when the decode filter cache is likely to miss • (The selection logic is sketched in the code example after the prediction mechanism below)
Decode Filter Cache • Prediction mechanism • When the next fetch address and the current address map to the same cache line • If the current fetch source is the line buffer, the next fetch source remains the same • If the current fetch source is the decode filter cache and the corresponding instruction is valid, the next fetch source remains the same • Otherwise, the next fetch source is the instruction cache • When the next fetch address and the current address map to different cache lines • Predict based on the next fetch prediction table, which exploits control flow predictability • If the tag of the current fetch address and the tag of the predicted next fetch address are the same, the next fetch source is the decode filter cache • Otherwise, the next fetch source is the instruction cache
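The last two slides can be condensed into a small decision routine. This is a minimal sketch under stated assumptions, not the paper's implementation: the `fetch_source` names, the 16-byte line / 512-byte direct-mapped address split behind `line_of` and `tag_of`, and the `predicted_next_addr` argument (the output of the next fetch prediction table) are all illustrative choices.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    SRC_LINE_BUFFER,          /* most recently fetched I-cache line             */
    SRC_DECODE_FILTER_CACHE,  /* small cache of already-decoded instructions    */
    SRC_INSTRUCTION_CACHE     /* fallback; fetched instructions must be decoded */
} fetch_source;

/* Assumed address split: 16-byte lines, 512B direct-mapped decode filter cache. */
static uint32_t line_of(uint32_t addr) { return addr >> 4; }
static uint32_t tag_of(uint32_t addr)  { return addr >> 9; }

/* Decide where the next instruction will be fetched from.
 *  - Same line as the current fetch: keep using the current (cheap) source as
 *    long as it can still supply the instruction.
 *  - New line: consult the next-fetch prediction table; predict the decode
 *    filter cache only when the predicted next address falls under the current
 *    tag, otherwise go straight to the instruction cache to avoid a miss cycle. */
fetch_source predict_next_source(uint32_t cur_addr, uint32_t next_addr,
                                 fetch_source cur_src,
                                 bool next_slot_valid_in_dfc,
                                 uint32_t predicted_next_addr)
{
    if (line_of(next_addr) == line_of(cur_addr)) {
        if (cur_src == SRC_LINE_BUFFER)
            return SRC_LINE_BUFFER;
        if (cur_src == SRC_DECODE_FILTER_CACHE && next_slot_valid_in_dfc)
            return SRC_DECODE_FILTER_CACHE;
        return SRC_INSTRUCTION_CACHE;
    }
    /* Crossing a line boundary: rely on control-flow predictability. */
    if (tag_of(predicted_next_addr) == tag_of(cur_addr))
        return SRC_DECODE_FILTER_CACHE;
    return SRC_INSTRUCTION_CACHE;
}
```

Keeping the decision to simple tag and line-index comparisons is what lets only the predicted source be probed on each fetch, so a correct prediction never touches the larger instruction cache.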
Results • Simulation setup • Media benchmarks • Cache sizes • 512B decode filter cache, 16KB instruction cache, 8KB data cache • Configurations investigated
Conclusion • There is a basic tradeoff between • the number of instructions that can be cached (as in instruction caches), and • greater power savings from reduced decode and fetch work (as in decoded caches) • We tip this balance in favor of the decode cache through the coordinated use of • instruction classification / selective decoding (into smaller widths) • a sectored cache built around this classification • The results show • An average 34% reduction in processor power • 50% more effective power savings than an instruction filter cache • Less than 1% performance degradation, thanks to the effective prediction mechanism