Design Space Exploration with SimpleScalar. The SimpleScalar Toolset. Simulation Suite.
SimpleScalar ISA • Clean and simple instruction set architecture: • MIPS/DLX plus more addressing modes, minus delay slots • 64-bit instruction encoding facilitates instruction set research • 16-bit space for hints, new instructions, and annotations • Four-operand instruction format, up to 256 registers
Out-of-order simulator • Configurable set of functional units (FUs)
Configurable Memory Hierarchy • All cache and TLB configurations are specified in the same format: <nsets>:<bsize>:<assoc>:<repl> • Block replacement policy: l for LRU, f for FIFO, r for random
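As a rough illustration of the configuration format above, the following Python sketch parses a four-field string and derives the cache capacity. The names `CacheConfig` and `parse_cache_config` are hypothetical helpers, not part of SimpleScalar, and the capacity formula assumes that the block size is given in bytes.

```python
from dataclasses import dataclass

# Policy letters as listed on the slide.
REPLACEMENT = {"l": "LRU", "f": "FIFO", "r": "random"}


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical container for one <nsets>:<bsize>:<assoc>:<repl> string."""
    nsets: int   # number of sets
    bsize: int   # block size (assumed to be in bytes)
    assoc: int   # associativity (ways per set)
    repl: str    # replacement policy name

    @property
    def capacity_bytes(self) -> int:
        # Total capacity = sets * block size * ways.
        return self.nsets * self.bsize * self.assoc


def parse_cache_config(spec: str) -> CacheConfig:
    """Parse a string such as '128:32:4:l' into a CacheConfig."""
    fields = spec.strip().split(":")
    if len(fields) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {spec!r}")
    nsets, bsize, assoc, repl = fields
    repl = repl.lower()
    if repl not in REPLACEMENT:
        raise ValueError(f"unknown replacement policy {repl!r}")
    return CacheConfig(int(nsets), int(bsize), int(assoc), REPLACEMENT[repl])


if __name__ == "__main__":
    cfg = parse_cache_config("128:32:4:l")
    print(cfg, cfg.capacity_bytes)
```

Under these assumptions, 128 sets of 32-byte blocks with 4 ways give a 16 KB cache.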
Design Space Exploration • Metric definition • Energy*Delay • Area*Delay • Design space definition • L1 and L2 caches, number of ALUs, ... • Embedded application definition • Metric minimization • Exhaustive search • Greedy search • Gradient search • Simulated annealing, and so on
Design Space Exploration: A Case Study • Metric defined: Price over Performance (PoP) = area*CPI • Design space: • Sets, block size, associativity and replacement policy for each cache; • number of integer ALUs; • number of integer multipliers; • number of floating-point ALUs; • number of floating-point multipliers. Design space exploration performed by F. Cassoli and A. Ferrante @ ALARI
Design Space Definition • Ranges for each parameter • DL1:128:{32, 64}:4:L • IL1:{256, 512}:32:1:L • UL2:{1024, 2048}:{64, 128}:4:{L, F} • IALU:{2, 4} • IMULT:{1, 2, 4} • FPALU:{1, 4} • FPMULT:{1, 2} • 768 different cases
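The count of 768 cases follows from multiplying the number of choices per parameter. A minimal Python sketch of that enumeration is shown here; the dictionary keys (such as `DL1_bsize`) are made-up labels for the parameters that vary, and the fixed fields of each cache string are left out.

```python
from itertools import product

# Only the parameters that vary on the slide; the key names are hypothetical.
SPACE = {
    "DL1_bsize": [32, 64],
    "IL1_nsets": [256, 512],
    "UL2_nsets": [1024, 2048],
    "UL2_bsize": [64, 128],
    "UL2_repl": ["L", "F"],
    "IALU": [2, 4],
    "IMULT": [1, 2, 4],
    "FPALU": [1, 4],
    "FPMULT": [1, 2],
}


def enumerate_space(space):
    """Yield every configuration as a dict, one per point of the Cartesian product."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


if __name__ == "__main__":
    configs = list(enumerate_space(SPACE))
    print(len(configs))  # 2*2*2*2*2*2*3*2*2 = 768
```

An exhaustive search would run one simulation for each of these dictionaries.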
Embedded Application • EPIC decoder (Efficient Pyramid Image deCoder) • Image data compression utility written in C. • Free Mediabench Source • Based on wavelet decomposition and a Huffman entropy (de)coder.
Cost Function F(x)= A(x)*D(x) • Area of x (sum of equivalent gates of each module). Models found in the literature. • Delay of x (computed through simulation of EPIC on architecture x).
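Written out, the cost function might look as follows. The symbols $M(x)$ (the set of modules in configuration $x$) and $G_m$ (the equivalent-gate count of module $m$) are notation introduced here for illustration, and taking the delay to be the CPI is an assumption based on the case-study metric PoP = area*CPI.

```latex
\[
  F(x) = A(x)\, D(x), \qquad
  A(x) = \sum_{m \in M(x)} G_m, \qquad
  D(x) = \mathrm{CPI}_{\mathrm{EPIC}}(x)
\]
```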
Optimal Configuration • The lowest value of the PoP is 998’732.31, obtained with: DL1: 128:32:4:L IL1: 256:32:1:L UL2: 1024:64:4:F IALU: 4 IMULT: 2 FPALU: 4 FPMULT: 2
Cost Function Properties • The difference between the PoPs for DL1 block sizes of 32 and 64 is very small. • The difference between the PoPs for an IL1 cache of 256 and of 512 sets is very small.
Cost Function Properties • Increasing the number of sets of UL2 increases the PoP (on average). • Increasing the block size of the UL2 cache always leads to an abrupt growth of the PoP. • The L2 cache grows so much that it becomes significantly larger than the rest of the system.
Conclusions • Doubling the number of integer ALUs reduces the PoP: a large benefit for a small area increase. • The optimal configuration has IMULT = 2 (not 1 or 4), because EPIC does not expose much parallelism. • However, FPALU = 4 leads to better results than FPALU = 1. • The FIFO policy for L2 outperforms LRU. • The same benefits are obtained when adding an FPMULT.
Conclusions • A greedy algorithm has also been applied to minimize the cost function. • Starting from different points: • average number of simulations required = 49 • minimum number of simulations required = 11 • maximum number of simulations required = 83 • The full-search optimum is always reached • Since an exhaustive search needs 768 simulations, the greedy search cuts exploration time by about 93.6% on average (49 simulations instead of 768).
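The slides do not say which greedy variant was used. The sketch below shows one plausible form, a coordinate-wise greedy descent that changes one parameter at a time and keeps any change that lowers the cost. The function `greedy_search`, the reduced `TOY_SPACE` and the `toy_cost` function are all hypothetical stand-ins: in the case study each cost evaluation would be a full simulation of EPIC. The last lines reproduce the 93.6% figure from the reported average.

```python
def greedy_search(space, cost, start):
    """Coordinate-wise greedy descent; returns (best config, best cost, evaluations)."""
    current = dict(start)
    cache = {}  # each distinct configuration is "simulated" only once

    def evaluate(cfg):
        key = tuple(sorted(cfg.items()))
        if key not in cache:
            cache[key] = cost(cfg)
        return cache[key]

    best = evaluate(current)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            for value in values:
                if value == current[name]:
                    continue
                candidate = {**current, name: value}
                c = evaluate(candidate)
                if c < best:  # keep any single-parameter change that helps
                    current, best = candidate, c
                    improved = True
    return current, best, len(cache)


# Hypothetical reduced space and made-up cost, used only to exercise the code.
TOY_SPACE = {"IALU": [2, 4], "IMULT": [1, 2, 4], "FPALU": [1, 4]}


def toy_cost(cfg):
    return ((cfg["IALU"] - 4) ** 2 + (cfg["IMULT"] - 2) ** 2
            + (cfg["FPALU"] - 4) ** 2 + 1)


if __name__ == "__main__":
    start = {"IALU": 2, "IMULT": 1, "FPALU": 1}
    best_cfg, best_cost, n_sims = greedy_search(TOY_SPACE, toy_cost, start)
    print(best_cfg, best_cost, n_sims)

    # Savings reported on the slide: 49 simulations on average instead of 768.
    savings = 1 - 49 / 768
    print(f"{savings:.1%}")
```

A descent of this kind can stop in a local minimum in general; the slides report that, for this cost function, the full-search optimum was reached from every starting point tried.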