Configurable Cache Subsetting for Fast Cache Tuning • Pablo Viana (1), Ann Gordon-Ross (2), Eamonn Keogh (2), Edna Barros (1), Frank Vahid (2) • (1) Computer Science Institute – CIN, Federal University of Pernambuco – UFPE, Brazil • (2) Department of Computer Science and Engineering, University of California, Riverside • This work was supported in part by the Capes Foundation, BEX 1366/04-1
Outline • Introduction • Configurable cache tuning • Cache configuration space • Configuration space subsetting problem • An exhaustive analysis • A heuristic to subset the configuration space • Applying the subsetting method to a 2-level cache • Future work on other configurable platforms
Introduction • Caches can improve system performance. However, caches are power hungry. • Thus, the cache is a good candidate for optimization. [Figure: power analysis of ARM970TS, with a share labeled 44%. Segars (ISSCC, 2001)]
Tuning Motivation • Tuning cache parameters (total size, line size, associativity) to an application can reduce energy by 60% on average. [Figure: microprocessor, cache, and main memory; energy plotted over the possible cache configurations, annotated "Choose lowest energy configuration"]
Cache Configuration Tuning • Each application has different cache requirements. • Configurable cache optimization. [Figure: Application 1, Application 2, ..., Application N each mapped to a point in the cache configuration space; example points: 2kB, 1-way, 16B/line; 2kB, 1-way, 64B/line; 4kB, 2-way, 16B/line; 4kB, 4-way, 32B/line; 8kB, 2-way, 16B/line; 8kB, 4-way, 32B/line]
Related Work • Configurable caches • Soft-core processors • ARM; • MIPS; • Tensilica • etc. • Hard-core processors • Motorola M*Core (Malik, ISLPED’00); • Albonesi (MICRO’00); • Zhang (ISCA’03) • Configurable cache tuning • Mostly done manually in practice • Sub-optimal • Time-consuming
Related Work • Automated Methods for Cache Tuning • Methods and heuristics for Design Space Exploration • Single-level caches (tens of configurations) • Platune (Givargis TCAD’02, Palesi CODES’02) • Zhang (RSP’03) • Two-level caches (hundreds to thousands of configurations) • Tcat (Gordon-Ross, DATE’04) [Figure: Level 1 (total size, line size, associativity; say 50 configs) × Level 2 (total size, line size, associativity; say 50 configs) = 2500 configs]
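To make the growth of the two-level space concrete, here is a minimal Python sketch that enumerates a configuration space as a Cartesian product. The parameter values are hypothetical and chosen only for illustration; a real configurable cache may not support every combination, so the counts differ from the "say 50 configs" per level on the slide.

```python
from itertools import product

# Hypothetical parameter values for one cache level (illustration only).
TOTAL_SIZES_KB = [2, 4, 8]
LINE_SIZES_B = [16, 32, 64]
ASSOCIATIVITIES = [1, 2, 4]

def level_space():
    """All (total size, line size, associativity) triples for one level."""
    return list(product(TOTAL_SIZES_KB, LINE_SIZES_B, ASSOCIATIVITIES))

l1_space = level_space()
l2_space = level_space()

# A two-level configuration pairs one L1 choice with one L2 choice,
# so the sizes of the two spaces multiply.
two_level_space = list(product(l1_space, l2_space))

print(len(l1_space), len(two_level_space))  # 27 729
```

The multiplication is what turns tens of single-level configurations into hundreds or thousands of two-level ones.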
Problem Context • Do we really need such a large number of cache configurations? • Could just a few carefully chosen configurations adequately cover the large configuration space? • A smaller configuration space would be easier to explore (via simulation-based or dynamic tuning)
Problem Context • Potential scenario: • A configurable microprocessor vendor pre-selects a subset of configurations for each particular application domain. • The user selects the most appropriate domain for a target application and examines only the pre-selected subset. [Figure: application domains (automotive control, network protocol, image filtering), each mapped to a configuration subset within the configuration space]
Cache Configuration Tuning • Energy to run a given application on different configurations. [Figure: energy consumption over the possible cache configurations; labels: "Lower energy", "Energy increase", "Not that bad."]
Cache Configuration Subsetting • Near-optimal cache tuning: average energy increases. [Figure: Application 1, Application 2, ..., Application N mapped onto a subset of the cache configuration space]
Cache Configuration Subsetting • Problem definition: • Identify the subset of configurations that adequately covers the configuration space for the application domain. • We state that p configurations are necessary to cover a space of m points. • Exhaustive approach: • Select the subset of p configurations from the space of m configurations. • Selection criterion: choose the subset that keeps the average energy of the tuned cache nearest to the optimal.
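The exhaustive selection can be sketched in Python as follows. This is a minimal illustration under assumptions not stated on the slides: `energy[c][a]` is a hypothetical table holding the energy of application `a` on configuration `c`, each application is tuned to the best configuration available in the subset, and "nearest to the optimal" is measured as the average relative energy increase over the per-application optimum. The numbers are made up.

```python
from itertools import combinations

# Hypothetical energy table: rows are configurations, columns are applications.
ENERGY = [
    [1.0, 2.0, 3.0],  # c0
    [2.0, 1.0, 3.0],  # c1
    [1.5, 1.5, 1.0],  # c2
    [1.1, 1.2, 2.0],  # c3
]

def avg_energy_increase(energy, subset):
    """Average relative energy increase when tuning is limited to `subset`."""
    n_apps = len(energy[0])
    total = 0.0
    for a in range(n_apps):
        optimal = min(row[a] for row in energy)     # best over the full space
        tuned = min(energy[c][a] for c in subset)   # best over the subset only
        total += tuned / optimal - 1.0
    return total / n_apps

def best_subset(energy, p):
    """Try every subset of p configurations and keep the one with least increase."""
    configs = range(len(energy))
    return min(combinations(configs, p),
               key=lambda s: avg_energy_increase(energy, s))

print(best_subset(ENERGY, 1))  # (2,)
print(best_subset(ENERGY, 2))  # (2, 3)
```

With this toy table the best pair is (c2, c3), even though c3 is not optimal for any single application.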
Experimental Setup [Figure: processor with instruction cache and data cache, connected to memory] • Configurable cache architecture for initial experiments: • Our target configurable cache architecture is based on Zhang/Vahid/Najjar’s “Highly-Configurable Cache Architecture for Embedded Systems,” ISCA 2003 • We set the base cache to support the following 18 configurations: • Energy model for estimation:
Exhaustive Approach • Subsets of size p=18 down to 1 were chosen through exhaustive evaluation of the average energy increase. [Figure annotation: "Energy increase is still under 5%."]
Exhaustive Approach • Good results, but exhaustively determining the subsets requires too much computation: • For m=18 and 1 ≤ p ≤ 18 there are 2^18 − 1 = 262,143 possible subsets. That’s too expensive! • We need a more efficient way to find the subset of configurations...
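The count of 262,143 can be checked with a few lines of Python: it is the number of non-empty subsets of 18 configurations, that is, the sum of the binomial coefficients C(18, p) over all subset sizes p from 1 to 18.

```python
from math import comb

m = 18  # number of configurations in the single-level space

# Number of ways to pick p configurations out of m, summed over every size p.
total = sum(comb(m, p) for p in range(1, m + 1))

# Same value in closed form: all subsets of m items, minus the empty one.
assert total == 2**m - 1

print(total)  # 262143
```

Each added configuration doubles this count, which is why the exhaustive approach does not scale to the two-level space.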
Looking for a Heuristic • First attempt: • Choose the p configurations that offered optimal energy for the largest number of applications. [Figure: energy increase]
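A minimal Python sketch of this first attempt, using a hypothetical energy table `energy[c][a]` (configuration `c`, application `a`) with made-up numbers: count how many applications each configuration is optimal for, then keep the p configurations with the highest counts. Breaking ties by configuration index is an assumption of the sketch.

```python
from collections import Counter

# Hypothetical energy table: rows are configurations, columns are applications.
ENERGY = [
    [1.0, 1.0, 5.0, 5.0],  # c0: optimal for applications 0 and 1
    [5.0, 5.0, 1.0, 4.0],  # c1: optimal for application 2
    [5.0, 5.0, 4.0, 1.0],  # c2: optimal for application 3
    [1.1, 1.1, 1.1, 1.1],  # c3: never optimal, but close to optimal everywhere
]

def most_often_optimal(energy, p):
    """Keep the p configurations that are optimal for the most applications."""
    n_cfg, n_apps = len(energy), len(energy[0])
    wins = Counter()
    for a in range(n_apps):
        best = min(range(n_cfg), key=lambda c: energy[c][a])
        wins[best] += 1
    # Sort by descending win count; ties go to the lower index (an assumption).
    ranked = sorted(range(n_cfg), key=lambda c: (-wins[c], c))
    return ranked[:p]

print(most_often_optimal(ENERGY, 2))  # [0, 1]
```

In this toy table the rule never selects c3, although c3 alone is within 10% of the optimum for every application, which hints at why counting wins can be a weak selection criterion.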
Looking for a Heuristic • Second attempt: • Hierarchical clustering according to the similarity between configurations on their average energy savings. [Figure: energy increase]
Similar Problem • We found a problem similar to our “subsetting puzzle”: the problem of segmenting time series. • The error (%) depends on the average accuracy of the data set representation. [Figure: a time series data set and its data set representation]
Similar Problem • The method proposed by Keogh (IEEE Int. Conf. on Data Mining, 2001) can be applied to reduce the number of primary colors used to display an image. • Colors/nuances of the image are merged according to the associated error/difference in the final image. • Merging and measuring error: the error of replacing neighboring colors by merging them; among the candidate merges, which gives the lower average error? [Figure: example image reduced from 36 colors to 8 colors]
Adapting to the Subsetting Problem • Similarly, we may iteratively discard configurations by merging them. • Merging two configs cj and ck into ck means that all applications that were tuned to cj now use ck. [Figure: application ai moved from cj to ck by merging; the energy increase is the change from e(cj,ai) to e(ck,ai)]
Keogh’s Heuristic • All possible merges of two adjacent (neighboring) configurations in the space are evaluated to find the best merge. [Figure: configurations c1, c2, c7, and c13 in the configuration space, with candidate merges]
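The merging loop can be sketched in Python as below. This is an illustrative greedy variant, not the authors' implementation, and it rests on several assumptions: `energy[c][a]` is a hypothetical energy table with made-up numbers, adjacency is a caller-supplied predicate `neighbors` (the example treats every pair as adjacent because the slides do not define adjacency), the cost of a merge is the resulting average relative energy increase over the per-application optimum, and merging stops once the best available merge would exceed `max_increase`.

```python
# Hypothetical energy table: rows are configurations, columns are applications.
ENERGY = [
    [1.0, 1.2, 3.0, 3.0],   # c0
    [1.1, 1.0, 3.0, 3.0],   # c1
    [3.0, 3.0, 1.0, 1.3],   # c2
    [3.0, 3.0, 1.05, 1.0],  # c3
]

def avg_increase(energy, assign):
    """Average relative energy increase for a given app -> config assignment."""
    total = 0.0
    for a, c in enumerate(assign):
        optimal = min(row[a] for row in energy)
        total += energy[c][a] / optimal - 1.0
    return total / len(assign)

def greedy_merge(energy, neighbors, max_increase):
    """Repeatedly apply the cheapest merge cj -> ck while the cost stays allowed."""
    n_cfg, n_apps = len(energy), len(energy[0])
    # Start with every application tuned to its optimal configuration.
    assign = [min(range(n_cfg), key=lambda c: energy[c][a]) for a in range(n_apps)]
    active = set(range(n_cfg))
    while len(active) > 1:
        best = None
        for cj in sorted(active):
            for ck in sorted(active):
                if cj == ck or not neighbors(cj, ck):
                    continue
                # Merging cj into ck: applications tuned to cj now use ck.
                trial = [ck if c == cj else c for c in assign]
                cost = avg_increase(energy, trial)
                if best is None or cost < best[0]:
                    best = (cost, cj, trial)
        if best is None or best[0] > max_increase:
            break
        _, cj, assign = best
        active.remove(cj)
    return sorted(active), assign

kept, assign = greedy_merge(ENERGY, lambda cj, ck: True, max_increase=0.05)
print(kept, assign)  # [1, 3] [1, 1, 3, 3]
```

Each pass evaluates only pairwise merges, so the work grows polynomially with the number of configurations instead of with the number of subsets, at the price of a result that is not guaranteed to be the optimal subset.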
Keogh’s Heuristic • Accuracy of the results [Figure: energy increase]
Comparing the Results • We kept merging configurations while the energy increase remained under 5% (4 configs).
Two-level Configurable Cache • Average energy increase: 3.36% [Figure: bar chart on a 0.00 to 1.00 scale comparing the optimal cache configuration with tuning with 4 configs, for benchmarks blit, bilv, jpeg, brev, bcnt, g3fax, iirflt01, matmul, binary, aifftr01, tblook01, matrix01, a2time01, rspeed01, canrdr01, puwmod01, cacheb01, and AVERAGE; inset: processor, instruction cache, data cache, 2nd level cache, memory]
Where Else Could This Method Be Applied? • Other parameterized IP cores • Buses, I/O devices (word size, bandwidth, etc.) • Parameterized platforms • Processor’s functional units, cache, bus, multiplier, IPs • Optimization for low energy, area, performance, and others.
Conclusions • The configurable cache subsetting problem was presented; • A data mining algorithm for segmenting time series was adapted to tackle the cache subsetting problem; • Subsetting the two-level cache configuration space while keeping the average energy increase under 5% takes: • Keogh’s heuristic: less than 1 minute • Exhaustively: around 53 hours. • Thanks to the versatility of the proposed method, our next step is to apply it to platform customization. Thank you!