Recent Results at UCR with Configurable Cache and Hw/Sw Partitioning
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
Frank Vahid, UC Riverside
Trend Towards Pre-Fabricated Platforms: ASSPs
• ASSP: application-specific standard product
  • Domain-specific pre-fabricated IC (e.g., digital camera IC)
• ASIC: application-specific IC
  • Unique IC design; ignores quantity of the same IC
• ASSP revenue > ASIC; ASSP design starts > ASIC
• ASIC design starts decreasing, due to the strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01
Will High-End ICs Still be Made?
• YES, but they are becoming out of reach of mainstream designers
  • The point is that mainstream designers likely won't be making them
• Limited to very high volume or very high cost products
• Platforms are one such product: high volume
• They need to be highly configurable to adapt to different applications and constraints
UCR Focus
• Configurable Cache
• Hardware/Software Partitioning
Configurable Cache: Why
• ARM920T: caches consume half of total power (Segars '01)
• M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends '99)
[Figure: pre-fabricated platform (a pre-designed system-level architecture): one IC containing a uP and DSP with L1 caches, FPGA, JPEG decoder, and peripherals]
Best Cache for Embedded Systems?
• Not clear; there is huge variety among popular embedded processors
• What's the best associativity, line size, total size?
Cache Associativity
• Direct-mapped cache (1-way set associative)
  • Certain address bits "index" into the cache
  • The remaining "tag" bits are compared
• Set-associative cache
  • Multiple "ways"
  • Fewer index bits, more tag bits, simultaneous comparisons
  • More expensive, but better hit rate
[Figure: a direct-mapped cache where two addresses conflict at the same index, vs. a 2-way set-associative cache holding both]
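The tag/index split on this slide can be sketched in a few lines. The model below is illustrative (the 8 KB size, 32-byte lines, and 32-bit addresses are assumptions for the example, not parameters from the talk): raising associativity shrinks the number of sets, so index bits shrink and tag bits grow.

```python
# Illustrative sketch of address decomposition for a cache lookup.
# Assumed parameters: 8 KB cache, 32-byte lines, 32-bit addresses.
CACHE_BYTES = 8 * 1024
LINE_BYTES = 32

def split_address(addr, ways):
    """Return (tag, index, offset) for a cache with the given associativity."""
    sets = CACHE_BYTES // (LINE_BYTES * ways)
    offset_bits = LINE_BYTES.bit_length() - 1   # log2(32) = 5
    index_bits = sets.bit_length() - 1
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Direct mapped: 256 sets -> 8 index bits, 19 tag bits.
# 4-way:          64 sets -> 6 index bits, 21 tag bits
# (fewer index bits, more tag bits, as the slide says).
```

Note how the two extra tag bits in the 4-way case are exactly the two index bits given up; this is the observation the way-concatenation scheme later exploits.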
Cache Associativity
• Reduces miss rate, thus improving performance
• Impact on power and energy? (Energy = Power * Time)
Associativity is Costly
• Associativity improves hit rate, but at the cost of more power per access
• Are the power savings from reduced misses outweighed by the increased power per hit?
[Figures: energy per access for an 8 Kbyte cache; energy-per-access breakdown for an 8 Kbyte, 4-way set-associative cache (considering dynamic power only)]
Associativity and Energy
• The best-performing cache is not always the lowest-energy cache
• Some configurations show significantly poorer energy
Associativity Dilemma
• Direct-mapped cache
  • Good hit rate on most examples, low power per access
  • But poor hit rate on some examples, causing high power due to many misses
• Four-way set-associative cache
  • Good hit rate on nearly all examples
  • But high power per access; overkill for most examples, thus wasting energy
• Dilemma: design for the average case or the worst case?
Associativity Dilemma
• Obviously not a clear choice
• Previous work
  • Albonesi proposed a configurable cache with way-shutdown ability to save dynamic power
  • Motorola's M*CORE did so also
Our Solution: Way-Concatenatable Cache
• Can be configured as 4, 2, or 1 way
• Ways can be concatenated
[Figure: in concatenated modes, an address bit selects the way]
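A behavioral sketch of the way-concatenation idea follows. This is my own simplified model, not the exact UCR circuit: configuration bits decide whether the four physical way arrays behave as 4 ways, 2 ways, or 1 way, and in the fewer-way modes, address bits that were previously tag bits (a11 and a12 on the next slide) instead select which physical array forms part of the single larger way.

```python
# Behavioral sketch (an assumption of mine, not the actual config circuit):
# which physical way arrays are enabled for a given access.
def active_ways(ways_config, a11, a12):
    """Return the set of physical way arrays enabled for this access."""
    if ways_config == 4:      # conventional: all four ways read and compared
        return {0, 1, 2, 3}
    if ways_config == 2:      # a11 selects one concatenated pair
        return {0, 1} if a11 == 0 else {2, 3}
    if ways_config == 1:      # a11 and a12 select exactly one array
        return {a12 * 2 + a11}
    raise ValueError("ways_config must be 4, 2, or 1")
```

The energy intuition: in 1-way mode only one array is read per access (direct-mapped power), yet the full capacity is retained, unlike way shutdown, which loses the capacity of disabled ways.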
Configurable Cache Design: Way Concatenation (4, 2, or 1 way)
[Figure: cache datapath. The 32-bit address splits into tag (a31–a13), index (a12–a5), and line offset (a4–a0). A small configuration circuit (registers reg0/reg1 combined with address bits a11/a12) produces control signals c0–c3 that enable the four 6x64 data arrays and the tag part; column mux, sense amps, and mux driver sit on the critical path to the data output]
• Small area and performance overhead
Way-Concatenate Experiments
• Experiment: Motorola PowerStone benchmark g3fax
• Considering dynamic power only: L1 access energy, CPU stall energy, memory access energy
• Way concatenate outperforms both 4-way and direct-mapped
• Just as good as way shutdown
Way-Concatenate Experiments (100% = 4-way conventional cache)
• Considered 23 programs (PowerStone, MediaBench, and SPEC2000)
• Dynamic power only (L1 access energy, CPU stall energy, memory access energy)
• Way concatenate
  • Better than way shutdown (due to less performance penalty)
  • Saves over the conventional 4-way
  • Also avoids the big penalties of 1-way on some programs
Way-Concatenate Experiments
• The best configuration varies
• Need to tune the configuration to a given program
Normalized Execution Times
• Way shutdown suffers a performance penalty, as does direct-mapped
• Way concatenate has almost no performance penalty
  • Though its critical path is 3% longer than the conventional 4-way's
Way Shutdown for Static Power Savings
• Albonesi and Motorola used logic to gate the clock
  • Reduced dynamic power, but not static (leakage) power
• Way concatenate is clearly superior for reducing dynamic power
• Shutting down ways is still useful for saving static power
  • But we'll use another method (Agarwal's DRG-cache)
[Figure: SRAM cell with bitlines and a gated-Vdd control transistor between the cell and ground]
Way Concatenate Plus Way Shutdown
• We set static power = 30% of dynamic power
• Way shutdown is now preferred in many examples
• But way concatenate is still very helpful
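Why shutdown pulls ahead once leakage counts can be seen in a tiny model. The numbers and the proportionality assumption below are mine, for illustration only; the slide's only stated premise is static power = 30% of dynamic power. Concatenated ways stay powered (and leaking) even when a program needs only one way's capacity, while shut-down ways stop leaking.

```python
# Hedged illustrative model: leakage assumed proportional to the
# fraction of way arrays left powered on. Values are invented.
def total_energy(dynamic, ways_powered, total_ways=4, static_frac=0.30):
    # static_frac = static power as a fraction of dynamic power
    # when all arrays are on (the slide's 30% assumption).
    static = static_frac * dynamic * (ways_powered / total_ways)
    return dynamic + static

concat = total_energy(dynamic=1.0,  ways_powered=4)  # all arrays leak
shut   = total_energy(dynamic=1.05, ways_powered=1)  # small dyn./perf. penalty
# With leakage counted, shutdown can come out ahead even with
# a 5% dynamic-energy penalty, matching the slide's observation.
```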
Configurable Line Size Too (100% = 4-way conventional cache; csb = concatenate-plus-shutdown cache)
• The best line size also differs per example
• Our cache can be configured for a line size of 16, 32, or 64 bytes
• 64 is usually best, but 16 is much better in a couple of cases
Configurable Cache
• A configurable cache with way concatenation, way shutdown, and variable line size can save a lot of energy
• Well suited for configurable devices like Triscend's
UCR Focus
• Configurable Cache
• Hardware/Software Partitioning
Using On-Chip FPGA to Reduce Sw Energy
• Hennessy/Patterson: "The best way to save power is to have less hardware" (p. 392)
• Actually, the best way is to have less ACTIVE hardware
• Paradoxically, MORE hardware can actually REDUCE power, as long as overall activity is reduced
• How?
Using On-Chip FPGA to Reduce Sw Energy
• Move critical sw loops to FPGA
• The loop executes in 1/10th the time
• Use this time to power down the system longer during the task period
• Alternatively, slow down the microprocessor using voltage scaling
[Figure: pre-fabricated platform (uP, DSP, FPGA, L1 cache, JPEG decoder, peripherals on one IC); timeline comparing the uP active then idle vs. uP plus FPGA finishing earlier and idling longer within the same task period]
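A back-of-envelope version of this slide's argument: with the task period fixed, running the loop 10x faster on the FPGA lets the processor idle for most of the loop's old time slot. The power numbers below are assumptions for illustration, not measurements from the talk.

```python
# Hedged sketch: energy per task period, software-only vs. partitioned.
# Power values (200 mW active, 20 mW idle, 50 mW FPGA) are invented.
def task_energy(loop_time, other_time, p_active=200e-3, p_idle=20e-3,
                p_fpga=50e-3, hw_speedup=10):
    sw_only = p_active * (loop_time + other_time)
    hw_loop = loop_time / hw_speedup
    partitioned = (p_active * other_time   # CPU runs the rest of the task
                   + p_idle * loop_time    # CPU idles the whole loop window
                   + p_fpga * hw_loop)     # FPGA computes the loop, 10x faster
    return sw_only, partitioned

# A loop taking 90% of a 1-second task period:
sw, hw = task_energy(loop_time=0.9, other_time=0.1)
```

More total hardware, but less total activity, so lower energy: the paradox the previous slide raises.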
The 90-10 Rule (or 80-20 Rule)
• Most software time is spent in a few small loops
  • e.g., MediaBench and NetBench benchmarks
• Known as the 90-10 rule: 10% of the code accounts for 90% of the execution time
• Move those loops to FPGA
Hardware/Software Partitioning Results (simulation based)
• Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
Analysis of Ideal Speedup
• Each loop is 10x faster in hw (average based on observations)
• Notice the leveling off after the first couple of loops (due to the 90-10 rule)
• Thus, most speedup comes from the first few loops
• Good for us: a moderate amount of FPGA gives most of the speedup
• How much FPGA?
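The leveling off is just Amdahl's law applied loop by loop. The loop time fractions below are invented in the spirit of the 90-10 rule (they are not the paper's measured profiles); each moved loop is assumed 10x faster in hardware, per the slide.

```python
# Amdahl-style cumulative speedup after moving the first k loops to hw.
def cumulative_speedup(loop_fractions, hw_speedup=10):
    """loop_fractions: each loop's share of original sw execution time."""
    results = []
    moved = 0.0
    for f in loop_fractions:
        moved += f
        # Remaining sw time + accelerated time for all moved loops
        results.append(1 / ((1 - moved) + moved / hw_speedup))
    return results

# Assumed example: loops taking 50%, 25%, 10%, 5% of runtime.
# Speedups ~ [1.82, 3.08, 4.26, 5.26]: the per-loop gain shrinks,
# so most of the benefit comes from the first couple of loops.
speedups = cumulative_speedup([0.50, 0.25, 0.10, 0.05])
```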
Speedup Gained with Relatively Few Gates
• Manually created several partitioned versions of each benchmark
• Most speedup gained with the first 20,000 gates: surprisingly few gates
• Stitt, Grattan, and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002
• Stitt and Vahid, IEEE Design and Test, Dec. 2002
• J. Villarreal, D. Suresh, G. Stitt, F. Vahid, and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)
Impact of Microprocessor/FPGA Clock Ratio
• Previous data assumed equal clock frequencies
• A faster microprocessor has a significant impact
• Analyzed 1:1, 2:1, 3:1, 4:1, and 5:1 ratios
• Planning additional such analyses: memory bandwidth, power ratios, and more
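The clock-ratio effect can be folded into the same Amdahl-style model. The formula is my assumption of the obvious first-order relationship, not the paper's analysis: if the processor clock is r times faster than the FPGA's, each FPGA cycle costs r processor cycles, shrinking the hardware loop's advantage.

```python
# Hedged first-order model of microprocessor:FPGA clock ratio impact.
def speedup_with_ratio(loop_frac, hw_cycle_speedup=10, ratio=1):
    """loop_frac: the loop's share of sw runtime; ratio: uP clock / FPGA clock.
    The hw loop needs 1/hw_cycle_speedup of the sw cycle count, but each
    FPGA cycle takes `ratio` times longer than a processor cycle."""
    hw_time = loop_frac * ratio / hw_cycle_speedup
    return 1 / ((1 - loop_frac) + hw_time)

# For a loop that is 90% of runtime, going from a 1:1 to a 5:1 ratio
# drops the overall speedup from ~5.3x to ~1.8x in this model.
```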
Software Improvements Using On-Chip Configurable Logic, Verified through Physical Measurement
• Performed physical measurements on Triscend A7 and E5 devices
• Similar results (even a bit better)
[Photo: Triscend A7 development board with the A7 IC]
Other Research Directions: Tiny Caches
• Impact of tiny caches on instruction fetch power
  • Filter caches, dynamic loop cache, preloaded loop cache
• Gordon-Ross, Cotterell, Vahid, Comp. Arch. Letters, 2002
• Gordon-Ross, Vahid, ICCD 2002
• Cotterell, Vahid, ISSS 2002 and ICCAD 2002
• Gordon-Ross, Cotterell, Vahid, IEEE TECS, 2002
[Figure: a loop cache multiplexed with the L1 cache or I-mem, feeding the processor]
Other Research Directions: Platform-Based CAD
• Use the physical platform to aid search of the configuration space
  • Configure the cache; partition hw/sw
  • Configure, execute, and measure
• Goal: define the best cooperation between desktop CAD and the platform
• NSF grant 2002–2005 (with N. Dutt at UC Irvine)
Other Research Directions: Dynamic Hw/Sw Partitioning
• My favorite
• Add an on-chip component that:
  • Detects the most frequent sw loops
  • Decompiles a loop
  • Performs compiler optimizations
  • Synthesizes it to a netlist
  • Places and routes the netlist onto the FPGA
  • Updates the sw to call the FPGA
• A self-improving IC
  • Can be invisible to the designer; appears as an efficient processor
• Can also dynamically tune the cache configuration
[Figure: processor with I$ and D$, plus on-chip profiler, configurable logic, DMA, and memories]
Current Researchers Working in Embedded Systems at UCR
• Prof. Frank Vahid: 5 Ph.D. students, 2 M.S.
• Prof. Walid Najjar: 3 Ph.D. students, 1 M.S., working on hw/sw partitioning and on compiling C to FPGAs
• Prof. Tom Payne: 1 Ph.D. student, working on compiling C to FPGAs
• Prof. Jun Yang (new hire): working on low-power architectures (frequent value detection)
• Prof. Harry Hsieh: 2 Ph.D. students, working on formal verification of system models
• Prof. Sheldon Tan (new hire): 1 Ph.D. student, working on physical design and analog synthesis
Conclusions
• Highly configurable platforms have a bright future
  • Cost equations just don't justify ASIC production as much as before
  • Triscend parts are well situated; close collaboration desired
• A configurable cache improves memory energy
  • Tuning to a particular program is CRUCIAL for low energy
  • Way concatenation is effective at reducing dynamic power
  • Way shutdown saves static power
  • Variable line size reduces traffic
  • All must be tuned to a particular program
• Configurable logic improves software energy
  • Without requiring excessive amounts of hardware
• Many exciting avenues to investigate!