Reconfigurable Caches and their Application to Media Processing
Parthasarathy (Partha) Ranganathan, Dept. of Electrical and Computer Engineering, Rice University, Houston, Texas
Sarita Adve, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
Norman P. Jouppi, Western Research Laboratory, Compaq Computer Corporation, Palo Alto, California
Motivation (1 of 2)
• Different workloads on general-purpose processors
  • Scientific/engineering, databases, media processing, …
  • Widely different characteristics
• Challenge for future general-purpose systems
  • Use most transistors effectively for all workloads
Motivation (2 of 2)
• Challenge for future general-purpose systems
  • Use most transistors effectively for all workloads
• 50% to 80% of processor transistors devoted to cache
  • Very effective for engineering and database workloads
  • BUT large caches often ineffective for media workloads
    • Streaming data and large working sets [ISCA 1999]
• Can we reuse cache transistors for other useful work?
Contributions
• Reconfigurable Caches
  • Flexibility to reuse cache SRAM for other activities
  • Several applications possible
  • Simple organization and design changes
  • Small impact on cache access time
• Application for media processing
  • e.g., instruction reuse – reuse memory for computation
  • 1.04X to 1.20X performance improvement
Outline for Talk
• Motivation
• Reconfigurable caches
  • Key idea
  • Organization
  • Implementation and timing analysis
• Application for media processing
• Summary and future work
Reconfigurable Caches: Key Idea
Key idea: reuse cache transistors!
• Dynamically divide SRAM into multiple partitions
• Use partitions for other useful activities
[Figure: current use of on-chip SRAM (single cache) vs. proposed use (Partition A as cache, Partition B as lookup table)]
Cache SRAM useful for both conventional and media workloads
Reconfigurable Cache Uses
• Number of different uses for reconfigurable caches
  • Optimizations using lookup tables to store patterns
    • Instruction reuse, value prediction, address prediction, …
  • Hardware and software prefetching
    • Caching of prefetched lines
  • Software-controlled memory
    • QoS guarantees, scratch memory area
Cache SRAM useful for both conventional and media workloads
Key Challenges
• How to partition SRAM?
• How to address the different partitions as they change?
• Minimize impact on cache access (clock cycle) time
• Associativity-based partitioning
Conventional Cache Organization
[Diagram: two-way set-associative cache - the address splits into tag, index, and block offset; the index selects a set in Way 1 and Way 2, the stored tags are compared against the address tag, and the compare result drives hit/miss and selects the data out]
Associativity-Based Partitioning
• Partition at granularity of “ways”
• Multiple data paths and additional state/logic
[Diagram: the two ways now serve as Partition 1 and Partition 2, each with its own tag/index/block address path; choose logic steers each access to its partition before the compare/select producing hit/miss and data out]
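The way-granularity idea above can be sketched in software. This is a hedged toy model, not the paper's hardware: a two-way set-associative array whose ways are statically assigned to two partitions, so a lookup only probes the ways owned by the requesting partition. Sizes and the `partition_ways` mapping are illustrative.

```python
NUM_SETS = 4          # toy size; a real L1 would have hundreds of sets
WAYS = 2              # one way per partition in this example

# cache[set_index][way] = (tag, data) or None
cache = [[None] * WAYS for _ in range(NUM_SETS)]

# Assumed mapping: partition 0 owns way 0 (conventional cache use),
# partition 1 owns way 1 (e.g., a lookup table).
partition_ways = {0: [0], 1: [1]}

def lookup(partition, address):
    """Probe only the ways belonging to `partition`."""
    index = address % NUM_SETS
    tag = address // NUM_SETS
    for way in partition_ways[partition]:
        entry = cache[index][way]
        if entry is not None and entry[0] == tag:
            return entry[1]          # hit within the partition
    return None                      # miss

def fill(partition, address, data):
    """Install a line in the partition's (single) way."""
    index = address % NUM_SETS
    tag = address // NUM_SETS
    way = partition_ways[partition][0]   # trivial replacement policy
    cache[index][way] = (tag, data)

fill(0, 13, "cache line")
fill(1, 13, "table entry")
assert lookup(0, 13) == "cache line"   # partitions never see each other's data
assert lookup(1, 13) == "table entry"
```

The model shows why this scheme is simple (the index/tag path is unchanged; only way selection differs) and why the number of partitions is bounded by the associativity.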
Reconfigurable Cache Organization
• Associativity-based partitioning
  • Simple - small changes to conventional caches
  • But # and granularity of partitions depend on associativity
• Alternative approach: overlapped-wide-tag partitioning
  • More general, but slightly more complex
  • Details in paper
Other Organizational Choices (1 of 2)
• Ensuring consistency of data at repartitioning
  • Cache scrubbing: flush data at repartitioning intervals
  • Lazy transitioning: augment state with partition information
• Addressing of partitions - software (ISA) vs. hardware
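The two consistency options above can be contrasted with a hedged sketch over a toy list of line-state records. Field names (`valid`, `dirty`, `partition`) and the function names are illustrative, not the paper's implementation.

```python
entries = [
    {"valid": True, "dirty": True,  "partition": 0, "data": "a"},
    {"valid": True, "dirty": False, "partition": 1, "data": "b"},
]

def scrub(entries):
    """Cache scrubbing: at the repartitioning point, write back dirty
    lines and invalidate everything so each partition starts clean."""
    for e in entries:
        if e["valid"] and e["dirty"]:
            pass  # write-back to the next memory level would happen here
        e["valid"] = False

def lazy_hit(entry, current_partition):
    """Lazy transitioning: each line's state records which partition it
    belongs to; a hit only counts if that matches the current partition,
    so stale lines fade out without an up-front flush."""
    return entry["valid"] and entry["partition"] == current_partition

assert lazy_hit(entries[1], 1)        # line owned by partition 1
assert not lazy_hit(entries[1], 0)    # invisible after switching partitions
scrub(entries)
assert all(not e["valid"] for e in entries)
```

Scrubbing pays the flush cost eagerly at each repartitioning; lazy transitioning spreads the cost over later accesses at the price of extra state bits per line.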
Other Organizational Choices (2 of 2)
• Method of partitioning - hardware vs. software control
• Frequency of partitioning - frequent vs. infrequent
• Level of partitioning - L1, L2, or lower levels
• Tradeoffs based on application requirements
Outline for Talk
• Motivation
• Reconfigurable caches
  • Key idea
  • Organization
  • Implementation and timing analysis
• Application for media processing
• Summary and future work
Conventional Cache Implementation
• Tag and data arrays split into multiple sub-arrays
  • to reduce/balance length of word lines and bit lines
[Diagram: address feeding the tag and data arrays through decoders, word lines, and bit lines, then column muxes, sense amps, comparators, and mux drivers to the output drivers producing the valid output and data output]
Changes for Reconfigurable Cache
• Associate sub-arrays with partitions
• Constraint on minimum number of sub-arrays
• Additional multiplexors, drivers, and wiring
[Diagram: same structure as the conventional cache, but with the address, mux drivers, and output drivers replicated per partition (1:NP)]
Impact on Cache Access Time
• Sub-array-based partitioning
  • Multiple simultaneous accesses to SRAM array
  • No additional data ports
• Timing analysis methodology
  • CACTI analytical timing model for cache access time (Compaq WRL)
  • Extended to model reconfigurable caches
  • Experiments varying cache sizes, partitions, technology, …
Impact on Cache Access Time
• Cache access time
  • Comparable to base (within 1-4%) for few partitions (2)
  • Higher for more partitions, especially with small caches
  • But still within 6% for large caches
• Impact on clock frequency likely to be even lower
Outline for Talk
• Motivation
• Reconfigurable caches
• Application for media processing
  • Instruction reuse with media processing
  • Simulation results
• Summary and future work
Application for Media Processing
• Instruction reuse/memoization [Sodani and Sohi, ISCA 1997]
  • Exploits value redundancy in programs
  • Store instruction operands and result in reuse buffer
  • If later instruction and operands match in reuse buffer, skip execution; read answer from reuse buffer
• Few changes for implementation with reconfigurable caches
[Diagram: the reuse buffer mapped onto a cache partition]
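The reuse mechanism above can be illustrated with a minimal memoization sketch. This is a hedged toy model of the Sodani-and-Sohi idea, not their hardware: a reuse buffer keyed by (PC, operand values) lets a later dynamic instance of the same instruction skip execution and read its stored result; capacity limits and replacement are omitted.

```python
reuse_buffer = {}   # (pc, operand1, operand2) -> result

def execute(pc, op, a, b):
    """Execute one instruction, consulting the reuse buffer first."""
    key = (pc, a, b)
    if key in reuse_buffer:
        return reuse_buffer[key]   # reuse hit: skip the functional unit
    # Reuse miss: actually compute, then record the result.
    result = {"add": a + b, "mul": a * b}[op]
    reuse_buffer[key] = result
    return result

assert execute(0x40, "mul", 6, 7) == 42   # first instance: computed
assert execute(0x40, "mul", 6, 7) == 42   # second instance: read from buffer
```

In the reconfigurable-cache setting, the table backing `reuse_buffer` is exactly the kind of lookup structure that a repurposed cache partition can hold.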
Simulation Methodology
• Detailed simulation using RSIM (Rice)
  • User-level execution-driven simulator
• Media processing benchmarks
  • JPEG image encoding/decoding
  • MPEG video encoding/decoding
  • GSM speech decoding and MPEG audio decoding
  • Speech recognition and synthesis
System Parameters
• Modern general-purpose processor with ILP + media extensions
  • 1 GHz, 8-way issue, OOO, VIS, prefetching
• Multi-level memory hierarchy
  • 128KB 4-way associative 2-cycle L1 data cache
  • 1MB 4-way associative 20-cycle L2 cache
• Simple reconfigurable cache organization
  • 2 partitions at L1 data cache
    • 64KB data cache, 64KB instruction reuse buffer
  • Partitioning at start of application in software
Impact of Instruction Reuse
• Performance improvements for all applications (1.04X to 1.20X)
• Use memory to reduce compute bottleneck
• Greater potential with aggressive design [details in paper]
[Chart: normalized execution time, base = 100, for JPEG decode, MPEG decode, and speech synthesis - 92, 89, and 84 with instruction reuse]
Summary
• Goal: use cache transistors effectively for all workloads
• Reconfigurable Caches: flexibility to reuse cache SRAM
  • Simple organization and design changes
  • Small impact on cache access time
  • Several applications possible
• Instruction reuse - reuse memory for computation
  • 1.04X to 1.20X performance improvement
• More aggressive reconfiguration currently under investigation
More information available at
• http://www.ece.rice.edu/~parthas
• parthas@rice.edu