[Figure: Embedded systems design flow: Concept → Specification → HW/SW Partitioning → Hardware Components / Software Components → Estimation-Exploration → Hardware Design (Synthesis, Layout, …) / Software Design (Compilation, …) → Validation and Evaluation (area, power, performance, …)]
1. Hardware Design Space Exploration
Sudeep Pasricha
Colorado State University
CS/ECE 561 Fall 2011
2. Embedded Systems Design Flow
3. Design Space Exploration
ADL Driven Processor Memory Exploration
Communication Architecture Exploration
4. Traditional HW/SW Co-Design Flow
5. ADL-Driven SOC Design Flow
6. ADL-driven Design Space Exploration
7. Exploration Methodology
Four steps:
Architecture Specification
Use of Architecture Description Language (ADL)
Software Toolkit Generation
Compiler, simulator, assembler, debugger
Generation of Hardware Models (Prototypes)
Design space exploration
Find the best possible architecture for the given set of application programs under area, power, and performance constraints
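The four-step loop above can be sketched in Python. This is a hypothetical illustration: the architecture and application dictionaries and the `meets_constraints`/`simulate` helpers are stand-ins for the ADL description and the generated hardware model and simulator.

```python
# Hypothetical sketch of the four-step ADL-driven exploration loop.
# Architectures and applications are stand-in dictionaries; a real flow
# would generate a compiler/simulator from the ADL description instead.

def meets_constraints(arch, constraints):
    # Step 3: area/power estimates from the generated hardware model.
    return (arch["area"] <= constraints["area"]
            and arch["power"] <= constraints["power"])

def simulate(arch, app):
    # Steps 2/4: stand-in for the ADL-generated cycle-accurate simulator.
    return app["ops"] * arch["cpi"]

def explore(candidate_archs, apps, constraints):
    """Step 4: pick the best architecture for the given application set."""
    best, best_cycles = None, float("inf")
    for arch in candidate_archs:          # Step 1: each ADL specification
        if not meets_constraints(arch, constraints):
            continue
        cycles = sum(simulate(arch, app) for app in apps)
        if cycles < best_cycles:
            best, best_cycles = arch, cycles
    return best

archs = [
    {"name": "A", "cpi": 1.2, "area": 5, "power": 3},
    {"name": "B", "cpi": 0.9, "area": 9, "power": 6},  # faster, but over budget
]
apps = [{"ops": 1000}, {"ops": 500}]
best = explore(archs, apps, {"area": 6, "power": 4})
print(best["name"])  # "A": the only candidate within the area/power budget
```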
8. Architecture Description Languages
Behavior-Centric ADLs
ISPS, nML, ISDL, SCP/ValenC, ...
primarily capture Instruction Set (IS)
good for regular architectures, provides programmer’s view
tedious for irregular architectures, hard to specify pipelining
Structure-Centric ADLs
MIMOLA, ...
primarily capture architectural structure
specify pipelining; drive code generation, arch. synthesis
hard to extract instruction-set view
Mixed-Level ADLs
LISA, RADL, FLEXWARE, MDes, EXPRESSION, …
combine benefits of both
generate simulator and/or compiler
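A mixed-level description can be pictured as two linked views, one behavioral and one structural. The fragment below models this as Python data; the field names and opcodes are illustrative only and do not follow the actual grammar of EXPRESSION or LISA.

```python
# Hypothetical mixed-level ADL fragment modeled as Python data: the
# behavior view (instruction set) and the structure view (pipeline) are
# kept together, as in EXPRESSION/LISA-style languages. The syntax is
# illustrative, not a real ADL grammar.
spec = {
    "behavior": {  # programmer's view: opcodes and their semantics
        "add": {"operands": ("rd", "rs1", "rs2"), "unit": "alu"},
        "lw":  {"operands": ("rd", "addr"),       "unit": "mem"},
    },
    "structure": {  # pipeline stages, and which unit lives in which stage
        "pipeline": ["fetch", "decode", "execute", "mem", "writeback"],
        "units": {"alu": "execute", "mem": "mem"},
    },
}

def stage_of(op):
    """Cross the two views: map an instruction to its execution stage."""
    unit = spec["behavior"][op]["unit"]
    return spec["structure"]["units"][unit]

print(stage_of("add"))  # execute
print(stage_of("lw"))   # mem
```

Having both views in one specification is what lets a toolkit generator emit a compiler (from the behavior view) and a cycle-accurate simulator (from the structure view) out of the same source.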
11. Example: Specification of the DLX Processor
12. Specification of the DLX Processor
13. Specification of the DLX Processor
14. Specification of the DLX Processor
15. Specification of the DLX Processor
16. Specification of the DLX Processor
17. Specification of the DLX Processor
18. Specification of the DLX Processor
19. V-SAT: Visual Specification and Analysis Tool
20. Illustrative example: TI C6x
- I'll use the TI C6x architecture as an illustrative example to show how we capture the memory subsystem in an ADL.
- I will say a few words about the C6x architecture.
21. Illustrative example: TI C6x
- It has four fetch stages.
22. Illustrative example: C6x Architecture
- Two decode stages.
23. Illustrative example: C6x Architecture
- Two D unit stages. Other functional units are not shown.
24. Illustrative example: C6x Architecture
- Two memory controller stages.
25. Illustrative example: C6x Architecture
- Finally, the novel memory subsystem, consisting of:
* a partitioned register file
* reconfigurable scratchpad SRAM
* a two-level cache hierarchy containing L1 and L2 caches
* off-chip DRAM with efficient access modes such as normal read, page read, and burst read
- So far I have explained the TI C6x processor and memory architecture; in the following slides I'll show how we capture the memory architecture and integrate it with the processor description.
26. C6x Processor-Memory Pipeline Path
- One of the important aspects of describing the memory architecture is integrating the memory pipeline with the processor pipeline.
- Here I show an example: the pipeline path traversed by a load operation that hits in the L1 cache.
- The pipeline consists of:
* four fetch stages
* two decode stages
* two load/store stages
* memory controller stage 1
* L1 controller stage 1, which determines it is a hit
* which takes it to the L1_S2 stage, and finally
* memory controller stage 2, which writes to the register file.
- Even though this example pipeline path is flattened, we describe pipeline paths in a hierarchical manner.
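One way to picture the hierarchical description of such a path is as named groups of sub-stages that flatten into the stage sequence above. The stage names follow the slide; the nesting scheme itself is illustrative, not the ADL's actual syntax.

```python
# Sketch of a hierarchical pipeline-path description for the L1-hit load.
# Stage names follow the slide; the (name, [sub-stages]) nesting scheme
# is illustrative only, not the ADL's actual syntax.
c6x_load_hit_path = (
    "pipeline",
    [("fetch", ["F1", "F2", "F3", "F4"]),   # four fetch stages
     ("decode", ["D1", "D2"]),              # two decode stages
     ("ldst", ["LS1", "LS2"]),              # two load/store stages
     "MemCtrl_S1",                          # memory controller stage 1
     ("l1_hit", ["L1_S1", "L1_S2"]),        # L1 controller: the hit path
     "MemCtrl_S2"],                         # writes back to the register file
)

def flatten(path):
    """Expand the hierarchical description into the flat stage sequence."""
    _name, parts = path
    stages = []
    for p in parts:
        stages.extend(p[1] if isinstance(p, tuple) else [p])
    return stages

print(flatten(c6x_load_hit_path))  # the 12-stage flattened path
```

Keeping the hierarchy in the specification means alternative paths (e.g. an L1 miss continuing into the L2 stages) can reuse the shared prefix instead of repeating it.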
27. TI C6x L1 Cache Characteristics
- We capture the characteristics of each unit of the memory architecture. For instance, for the L1 cache we capture:
* type of cache
* number of lines
* line size
* word size
* associativity
* read/write latency
* replacement policy
* write policy
etc.
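The parameter list above can be collected into a small record type, as sketched below. The field names mirror the slide; the concrete values are hypothetical, chosen only to make the example runnable.

```python
from dataclasses import dataclass

# The L1 parameters captured in the ADL, gathered into a dataclass for
# illustration. Field names mirror the slide; the values are hypothetical.
@dataclass
class CacheSpec:
    cache_type: str
    num_lines: int
    line_size: int          # words per line
    word_size: int          # bytes per word
    associativity: int
    read_latency: int       # cycles
    write_latency: int      # cycles
    replacement_policy: str
    write_policy: str

    def size_bytes(self):
        """Total data capacity implied by the captured parameters."""
        return self.num_lines * self.line_size * self.word_size

l1 = CacheSpec("data", 128, 4, 4, 2, 1, 1, "LRU", "write-back")
print(l1.size_bytes())  # 128 lines x 4 words x 4 bytes = 2048 bytes
```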
28. DRAM “Normal Read” Pipeline
- Similarly, we capture the characteristics of access modes. For instance, here we show what to capture for DRAM in “normal read” access mode.
- For this mode, the pipeline consists of three stages: row_dec and col_dec, followed by precharge.
- We also capture explicit timing for each pipeline stage: for instance, row_dec takes 6 cycles, col_dec takes 1 cycle, and precharge takes 6 cycles.
- Once the memory subsystem is captured, we are ready to generate a memory-aware software toolkit to perform design space exploration and to give designers feedback showing bottlenecks.
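The per-stage timings just described can be summed to get the latency of one access in this mode. The cycle counts below come from the slide; representing the mode as a dictionary is just an illustration of what the ADL captures.

```python
# Per-stage timings captured for the DRAM "normal read" access mode.
# Cycle counts are from the slide; the dictionary form is illustrative.
normal_read = {"row_dec": 6, "col_dec": 1, "precharge": 6}

def access_latency(mode):
    """Latency of one non-overlapped access in the given mode."""
    return sum(mode.values())

print(access_latency(normal_read))  # 6 + 1 + 6 = 13 cycles
```

Capturing the stages explicitly (rather than a single lump latency) is what lets the generated simulator model modes like page read, which can skip the row-decode stage on a same-row access.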
29. Experimental Setup
- I present here a set of exploration experiments to demonstrate the usefulness of our approach.
- The experimental setup is shown here.
- We use the TI C6x VLIW DSP.
- We considered memory configurations with a variety of modules.
- We chose application programs from the multimedia and DSP domains.
- We performed design space exploration with the goal of studying the trade-off between memory cost and processor performance.
30. Memory Subsystem Configurations
- The table shows the memory subsystem configurations we used for design space exploration.
- Each row corresponds to a memory configuration; each column corresponds to one memory module used in the configuration.
- For instance, configuration 1 contains a small 128-word L1 cache and a small 256-word stream buffer, along with off-chip DRAM.
- Configuration 2, instead of the stream buffer, has a large 2K on-chip SRAM.
- The third configuration replaces the SRAM with an L2 cache.
- The fourth configuration uses the same L1 cache with a different replacement policy: LRU instead of FIFO.
- The fifth configuration splits the on-chip space between a 1K L2 cache and a 1K software-controlled SRAM.
- The last configuration contains a very large SRAM.
- Configurations 1 through 6 are shown in increasing order of cost in terms of memory area and control logic: configuration 1 is the lowest-cost and configuration 6 the highest-cost configuration.
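The configurations described above can be summarized as data. The entries below record only what the slide text states for each configuration (all six also include the off-chip DRAM); where the original table gives more detail than the narration, the entry here is necessarily a partial paraphrase.

```python
# The six memory configurations, paraphrased from the slide narration.
# Entries record only what the text states; all configurations also
# include the off-chip DRAM. Keys are ordered by increasing cost.
configs = {
    1: "128-word L1 cache + 256-word stream buffer",
    2: "128-word L1 cache + 2K on-chip SRAM",
    3: "L1 cache + L2 cache (SRAM replaced by L2)",
    4: "L1 cache with LRU (instead of FIFO) + L2 cache",
    5: "1K L2 cache + 1K software-controlled SRAM",
    6: "very large SRAM",
}

for cost_rank, modules in configs.items():
    print(f"config {cost_rank}: {modules}")
```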
31. Design Space Exploration Results
- For each benchmark we present the dynamic cycle count for the six memory configurations presented earlier.
- The x-axis is the benchmark we ran; the y-axis is the cycle count.
- For each benchmark, the bars show the cycle count for the six configurations, in increasing order of cost.
- The blue bar is configuration 1, the lowest-cost memory configuration; the red bar is configuration 6, the highest-cost configuration.
32. Design Space Exploration Results
- As expected, in some benchmarks higher memory cost produced better performance: for instance, the ICCG, Integrate, and Lowpass benchmarks.
33. Design Space Exploration Results
- However, this is not always true. Other benchmarks show the unintuitive reverse trend: a lower-cost memory configuration produced better performance for the Hydro, Tridiag, and Stateeq benchmarks.
- It is interesting to note that configuration 6 has the worst performance for these three benchmarks.
34. Cycle Accounting: Configuration 6
- Cycle accounting clearly explains the reason for the poor performance of configuration 6 on Hydro, Tridiag, and Stateeq.
- Not all the arrays fit in the SRAM, and the lack of an L1 cache to compensate for the large latency of the DRAM takes its toll on performance.
- This type of behavior is very hard to predict through analysis alone, without simulation. Having such an explicit way to capture the memory architecture and drive DSE is crucial for giving accurate feedback to the designer and for evaluating processor-memory configurations.
35. Experimental Setup: DLX Case Study
The Architecture
capture the DLX architecture using EXPRESSION
generate VHDL description
synthesize using Synopsys Design Compiler
The Application
FFT benchmark.
Explorations based on area, power and speed.
Addition of Functional Units (Pipeline Paths)
Addition of Pipeline Stages
Addition of Operations
36. Addition of Functional Units
Schedule length = number of application cycles (relative)
37. Addition of Pipeline Stages
38. Addition of Operations
39. EXPRESSION
Documentation and the EXPRESSION toolkit are available for download at
http://www.ics.uci.edu/~express/
Tested and runs fine with Visual C++ 6
40. Design Space Exploration
ADL Driven Processor Memory Exploration
Communication Architecture Exploration
41. Communication Architectures
The MPSoC communication architecture (CA) fabric has to cope with the entire inter-component traffic
considerable impact on performance
leading cause of performance bottlenecks in MPSoC design
consumes anywhere between 20-50% of overall system power
critical for battery-driven mobile applications
has thermal implications, affecting packaging cost and device failure rate
takes up a significant chunk of the design cycle
increasing time to market
Exploration of the communication design space is absolutely essential for embedded systems to meet
performance requirements
time-to-market deadlines
42. Exploration Challenge
43. 1. Component Mapping Exploration
44. Component Mapping Exploration
45. Component Mapping Exploration
46. Component Mapping Exploration
47. Component Mapping Exploration
48. 2. Automated Synthesis of Hierarchical Shared Bus Architectures
49. Automated Synthesis of Hierarchical Shared Bus Architectures
50. Framework Overview
51. Graph Partitioning Approach
52. Experiments: Case Study 1
53. Experiments: Case Study 1
54. Experiments: Case Study 1
55. Experiments: Case Study 1
56. Experiments: Case Study 1
57. Experiments: Case Study 2
58. Experiments: Case Study 2
59. Experiments: Case Study 2
60. Experiments: Case Study 2
61. Experiments: Case Study 2
62. Experiments: Case Study 2
63. Comparison with Other Schemes
64. 3. Automated Bus Matrix Synthesis
65. 3. Automated Bus Matrix Synthesis
66. 3. Automated Bus Matrix Synthesis
67. Problem Formulation
68. Synthesis Framework Overview
69. Branch and Bound Clustering
70. Branch and Bound Clustering
71. Branch and Bound Clustering
72. Branch and Bound Clustering
73. Synthesis Framework Overview
74. Experiment Setup
75. Example: Sirius Application
76. Experiment: Cost Reduction
77. Experiment: Power Performance Trade-offs
78. Experiment: Power-Area Tradeoffs
79. Impact of PVT Variations on Power
In sub-100nm DSM technologies, Process, Voltage, and Temperature (PVT) variability is being observed, due to
Increasing leakage power
Use of power-aware design methodologies (voltage islands, DVS/DFS)
PVT variability makes it hard to achieve safe designs
Results in significant fluctuations in power (and timing)
Power (and timing) estimates from early in the design flow are no longer valid
Considerable effort is required later in the design flow to account for variability-induced fluctuations
80. Impact of PVT Variations on Power
Impact of PVT variation on the power dissipation of an AMBA AHB bus matrix communication architecture (90nm)
Significant (~10x) variation in power dissipation under different PVT conditions!
Important to incorporate PVT variability at system-level during power exploration of on-chip communication architectures
For more accurate and realistic power characterization