
Design Space Exploration with SimpleScalar



Presentation Transcript


  1. Design Space Exploration with SimpleScalar

  2. The SimpleScalar Toolset

  3. The SimpleScalar Toolset

  4. Simulation Suite

  5. SimpleScalar ISA • Clean and simple instruction set architecture: • MIPS/DLX, plus more addressing modes, minus delay slots • 64-bit instruction encoding facilitates instruction set research • 16-bit space for hints, new instructions, and annotations • Four-operand instruction format, up to 256 registers

  6. SimpleScalar Architected State

  7. Out of order simulator Configurable set of FUs

  8. Configurable Memory Hierarchy • All cache and TLB configurations are specified with the same format: <nsets>:<bsize>:<assoc>:<repl> • Block replacement policy: • l for LRU • f for FIFO • r for random
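
The four-field format above can be illustrated with a small parser. This is only a sketch and not part of the SimpleScalar toolset: the names `CacheConfig` and `parse_cache_config` are hypothetical, and the capacity calculation assumes that the block size is given in bytes.

```python
from dataclasses import dataclass

# Replacement-policy letters as listed on the slide.
REPLACEMENT_POLICIES = {"l": "LRU", "f": "FIFO", "r": "random"}


@dataclass(frozen=True)
class CacheConfig:
    nsets: int  # number of sets
    bsize: int  # block size (assumed to be in bytes)
    assoc: int  # associativity (blocks per set)
    repl: str   # replacement policy letter: l, f or r

    @property
    def capacity_bytes(self) -> int:
        # Total data capacity = sets * blocks per set * bytes per block.
        return self.nsets * self.bsize * self.assoc


def parse_cache_config(spec: str) -> CacheConfig:
    """Parse a '<nsets>:<bsize>:<assoc>:<repl>' string (hypothetical helper)."""
    fields = spec.split(":")
    if len(fields) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(fields)}")
    nsets, bsize, assoc = (int(f) for f in fields[:3])
    repl = fields[3].lower()  # accept both 'l' and 'L'
    if repl not in REPLACEMENT_POLICIES:
        raise ValueError(f"unknown replacement policy: {fields[3]!r}")
    return CacheConfig(nsets, bsize, assoc, repl)


cfg = parse_cache_config("128:32:4:l")
print(cfg.capacity_bytes)                # 16384
print(REPLACEMENT_POLICIES[cfg.repl])    # LRU
```

Under the bytes assumption, a 128-set, 4-way cache with 32-byte blocks holds 16 KiB of data.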

  9. Configurable Memory Hierarchy

  10. Design Space Exploration • Metric definition • Energy*Delay • Area*Delay • Design space definition • L1 and L2 caches, number of ALUs, ... • Embedded application definition • Metric minimization • Exhaustive search • Greedy search • Gradient search • Simulated annealing, and so on

  11. Design Space Exploration: A Case Study • Metric defined: Price over Performance (PoP) = area*CPI • Design space: • sets, block size, associativity and replacement policy for each cache; • number of integer ALUs; • number of integer multipliers; • number of floating-point ALUs; • number of floating-point multipliers. • Design space exploration performed by F. Cassoli and A. Ferrante @ ALARI

  12. Design Space Definition • Ranges for each parameter • DL1:128:{32, 64}:4:L • IL1:{256, 512}:32:1:L • UL2:{1024, 2048}:{64, 128}:4:{L, F} • IALU:{2, 4} • IMULT:{1, 2, 4} • FPALU:{1, 4} • FPMULT:{1, 2} • 768 different cases
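
The count of 768 cases follows from multiplying the number of choices per parameter. The sketch below enumerates the space with `itertools.product`; the dictionary keys are illustrative names chosen here, and only the parameters that actually vary on the slide are listed.

```python
from itertools import product

# Varying parameters taken from the slide; key names are hypothetical.
DESIGN_SPACE = {
    "dl1_field2": [32, 64],       # second field of DL1:128:{32, 64}:4:L
    "il1_nsets": [256, 512],      # first field of IL1:{256, 512}:32:1:L
    "ul2_nsets": [1024, 2048],
    "ul2_bsize": [64, 128],
    "ul2_repl": ["L", "F"],
    "ialu": [2, 4],
    "imult": [1, 2, 4],
    "fpalu": [1, 4],
    "fpmult": [1, 2],
}


def enumerate_configs(space):
    """Yield every combination of parameter values as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configs(DESIGN_SPACE))
print(len(configs))  # 768
```

Eight parameters have two choices and one has three, so the total is 2^8 * 3 = 768.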

  13. Embedded Application • EPIC decoder (Efficient Pyramid Image deCoder) • Image data compression utility written in C • Freely available source from MediaBench • Based on wavelet decomposition and a Huffman entropy (de)coder

  14. Cost Function F(x)= A(x)*D(x) • Area of x (sum of equivalent gates of each module). Models found in the literature. • Delay of x (computed through simulation of EPIC on architecture x).

  15. Result of the exhaustive search

  16. Optimal Configuration • The lowest value of the PoP is 998’732.31, obtained with: • DL1: 128:32:4:L • IL1: 256:32:1:L • UL2: 1024:64:4:F • IALU: 4 • IMULT: 2 • FPALU: 4 • FPMULT: 2

  17. Cost Function Properties • The difference between the PoPs for a DL1 cache of 32 and of 64 sets is very small. • The difference between the PoPs for an IL1 cache of 256 and of 512 sets is very small.

  18. Cost Function Properties • Increasing the number of sets of UL2 increases the PoP (on average). • Increasing the block size of the UL2 cache always leads to an abrupt growth of the PoP. • The L2 cache grows so much that it becomes significantly larger than the rest of the system.

  19. Cost Function Properties

  20. Cost Function Properties

  21. Cost Function Properties

  22. Area – CPI scatter plot

  23. Conclusions • Doubling the number of integer ALUs reduces the PoP: a large benefit for a small area increase. • The optimal configuration has IMULT = 2 (not 1 or 4, because EPIC does not expose much parallelism). • However, FPALU = 4 leads to better results than FPALU = 1. • The L2 FIFO policy outperforms LRU. • Adding an FPMULT brings the same benefits.

  24. Conclusions • A greedy algorithm has also been applied to minimize the cost function. • Starting from different points: • average number of simulations required = 49 • minimum number of simulations required = 11 • maximum number of simulations required = 83 • The full-search optimum was always reached. • Considering that an exhaustive search needs 768 simulations, the greedy search reduces the time by about 93.6% on average.
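
The slides do not say which greedy strategy was used. As one plausible reading, the sketch below changes one parameter at a time and keeps any change that lowers the cost, until no single change helps. The function `toy_cost` is a made-up stand-in for area*CPI (the study obtained CPI by simulating EPIC), so the evaluation counts it produces are not the study's numbers, and this variant is not guaranteed to reach the global optimum in general.

```python
# Varying parameters from the design space slide; key names are hypothetical.
DESIGN_SPACE = {
    "dl1_field2": [32, 64],
    "il1_nsets": [256, 512],
    "ul2_nsets": [1024, 2048],
    "ul2_bsize": [64, 128],
    "ul2_repl": ["L", "F"],
    "ialu": [2, 4],
    "imult": [1, 2, 4],
    "fpalu": [1, 4],
    "fpmult": [1, 2],
}


def toy_cost(cfg):
    """Made-up cost standing in for area * CPI; NOT the study's model."""
    numeric = [v for v in cfg.values() if isinstance(v, int)]
    area = 1000 + sum(numeric)                 # bigger structures cost area
    cpi = 1.0 + sum(1.0 / v for v in numeric)  # bigger structures lower CPI
    if cfg["ul2_repl"] == "F":
        cpi *= 0.99
    return area * cpi


def greedy_search(space, cost, start):
    """One-parameter-at-a-time descent; returns (config, cost, evaluations)."""
    cache = {}  # each distinct configuration is "simulated" only once

    def evaluate(cfg):
        key = tuple(cfg[name] for name in space)
        if key not in cache:
            cache[key] = cost(cfg)
        return cache[key]

    current = dict(start)
    best = evaluate(current)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            for value in values:
                if value == current[name]:
                    continue
                candidate = {**current, name: value}
                candidate_cost = evaluate(candidate)
                if candidate_cost < best:
                    current, best, improved = candidate, candidate_cost, True
    return current, best, len(cache)


start = {name: values[0] for name, values in DESIGN_SPACE.items()}
best_cfg, best_cost, n_evals = greedy_search(DESIGN_SPACE, toy_cost, start)
print(best_cfg, n_evals)

# Savings quoted on the slide: 49 simulations on average instead of 768.
print(f"{1 - 49 / 768:.1%}")  # 93.6%
```

Caching evaluated configurations matters here because each evaluation corresponds to a full simulation run, which is the cost the greedy search is trying to avoid.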
