[Figure: Embedded systems design flow: Concept → Specification → HW/SW Partitioning → Hardware Components / Software Components → Estimation-Exploration → Hardware Design (Synthesis, Layout, …) / Software Design (Compilation, …) → Validation and Evaluation (area, power, performance, …)]
1. Hardware Design Space Exploration
Sudeep Pasricha
Colorado State University
CS/ECE 561 Fall 2011
2. Embedded Systems Design Flow
3. Design Space Exploration
ADL Driven Processor Memory Exploration
Communication Architecture Exploration
4. Traditional HW/SW Co-Design Flow
5. ADL-Driven SOC Design Flow
6. ADL-driven Design Space Exploration
7. Exploration Methodology
Four steps:
Architecture Specification
Use of Architecture Description Language (ADL)
Software Toolkit Generation
Compiler, simulator, assembler, debugger
Generation of Hardware Models (Prototypes)
Design space exploration
Find the best possible architecture for the given set of application programs under area, power, and performance constraints
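The four-step loop above can be sketched in Python. This is a hypothetical illustration: the architecture and application dictionaries and the `meets_constraints`/`simulate` helpers are stand-ins for the ADL description and the generated hardware model and simulator.

```python
# Hypothetical sketch of the four-step ADL-driven exploration loop.
# Architectures and applications are stand-in dictionaries; a real flow
# would generate a compiler/simulator from the ADL description instead.

def meets_constraints(arch, constraints):
    # Step 3: area/power estimates from the generated hardware model.
    return (arch["area"] <= constraints["area"]
            and arch["power"] <= constraints["power"])

def simulate(arch, app):
    # Steps 2/4: stand-in for the ADL-generated cycle-accurate simulator.
    return app["ops"] * arch["cpi"]

def explore(candidate_archs, apps, constraints):
    """Step 4: pick the best architecture for the given application set."""
    best, best_cycles = None, float("inf")
    for arch in candidate_archs:          # Step 1: each ADL specification
        if not meets_constraints(arch, constraints):
            continue
        cycles = sum(simulate(arch, app) for app in apps)
        if cycles < best_cycles:
            best, best_cycles = arch, cycles
    return best

archs = [
    {"name": "A", "cpi": 1.2, "area": 5, "power": 3},
    {"name": "B", "cpi": 0.9, "area": 9, "power": 6},  # faster, but over budget
]
apps = [{"ops": 1000}, {"ops": 500}]
best = explore(archs, apps, {"area": 6, "power": 4})
print(best["name"])  # "A": the only candidate within the area/power budget
```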
8. Architecture Description Languages
Behavior-Centric ADLs
ISPS, nML, ISDL, SCP/ValenC, ...
primarily capture Instruction Set (IS)
good for regular architectures, provides programmer’s view
tedious for irregular architectures, hard to specify pipelining
Structure-Centric ADLs
MIMOLA, ...
primarily capture architectural structure
specify pipelining; drive code generation, arch. synthesis
hard to extract instruction-set view
Mixed-Level ADLs
LISA, RADL, FLEXWARE, MDes, EXPRESSION, …
combine benefits of both
generate simulator and/or compiler
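A mixed-level description can be pictured as two linked views, one behavioral and one structural. The fragment below models this as Python data; the field names and opcodes are illustrative only and do not follow the actual grammar of EXPRESSION or LISA.

```python
# Hypothetical mixed-level ADL fragment modeled as Python data: the
# behavior view (instruction set) and the structure view (pipeline) are
# kept together, as in EXPRESSION/LISA-style languages. The syntax is
# illustrative, not a real ADL grammar.
spec = {
    "behavior": {  # programmer's view: opcodes and their semantics
        "add": {"operands": ("rd", "rs1", "rs2"), "unit": "alu"},
        "lw":  {"operands": ("rd", "addr"),       "unit": "mem"},
    },
    "structure": {  # pipeline stages, and which unit lives in which stage
        "pipeline": ["fetch", "decode", "execute", "mem", "writeback"],
        "units": {"alu": "execute", "mem": "mem"},
    },
}

def stage_of(op):
    """Cross the two views: map an instruction to its execution stage."""
    unit = spec["behavior"][op]["unit"]
    return spec["structure"]["units"][unit]

print(stage_of("add"))  # execute
print(stage_of("lw"))   # mem
```

Having both views in one specification is what lets a toolkit generator emit a compiler (from the behavior view) and a cycle-accurate simulator (from the structure view) out of the same source.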
11. Example: Specification of the DLX Processor
12. Specification of the DLX Processor
13. Specification of the DLX Processor
14. Specification of the DLX Processor
15. Specification of the DLX Processor
16. Specification of the DLX Processor
17. Specification of the DLX Processor
18. Specification of the DLX Processor
19. V-SAT: Visual Specification and Analysis Tool
20. Illustrative example: TI C6x
- I'll use the TI C6x architecture as an illustrative example to show how we capture the memory subsystem in an ADL.
- I will say a few words about the C6x architecture.
21. Illustrative example: TI C6x
- It has four fetch stages.
22. Illustrative example: C6x Architecture
- Two decode stages.
23. Illustrative example: C6x Architecture
- Two D unit stages. Other functional units are not shown.
24. Illustrative example: C6x Architecture
- Two memory controller stages.
25. Illustrative example: C6x Architecture
- Finally, the novel memory subsystem, consisting of:
* a partitioned register file
* reconfigurable scratchpad SRAM
* a two-level cache hierarchy containing L1 and L2 caches
* off-chip DRAM with efficient access modes such as normal read, page read, and burst read
- So far I have explained the TI C6x processor and memory architecture; in the following slides I'll show how we capture the memory architecture and integrate it with the processor description.
26. C6x Processor-Memory Pipeline Path
- One of the important aspects of describing the memory architecture is integrating the memory pipeline with the processor pipeline.
- Here I show an example: the pipeline path traversed by a load operation that hits in the L1 cache.
- The pipeline consists of:
* four fetch stages
* two decode stages
* two load/store stages
* memory controller stage 1
* L1 controller stage 1, which determines it is a hit
* which takes it to the L1_S2 stage, and finally
* memory controller stage 2, which writes to the register file.
- Even though this example pipeline path is flattened, we describe pipeline paths in a hierarchical manner.
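One way to picture the hierarchical description of such a path is as named groups of sub-stages that flatten into the stage sequence above. The stage names follow the slide; the nesting scheme itself is illustrative, not the ADL's actual syntax.

```python
# Sketch of a hierarchical pipeline-path description for the L1-hit load.
# Stage names follow the slide; the (name, [sub-stages]) nesting scheme
# is illustrative only, not the ADL's actual syntax.
c6x_load_hit_path = (
    "pipeline",
    [("fetch", ["F1", "F2", "F3", "F4"]),   # four fetch stages
     ("decode", ["D1", "D2"]),              # two decode stages
     ("ldst", ["LS1", "LS2"]),              # two load/store stages
     "MemCtrl_S1",                          # memory controller stage 1
     ("l1_hit", ["L1_S1", "L1_S2"]),        # L1 controller: the hit path
     "MemCtrl_S2"],                         # writes back to the register file
)

def flatten(path):
    """Expand the hierarchical description into the flat stage sequence."""
    _name, parts = path
    stages = []
    for p in parts:
        stages.extend(p[1] if isinstance(p, tuple) else [p])
    return stages

print(flatten(c6x_load_hit_path))  # the 12-stage flattened path
```

Keeping the hierarchy in the specification means alternative paths (e.g. an L1 miss continuing into the L2 stages) can reuse the shared prefix instead of repeating it.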
27. TI C6x L1 Cache Characteristics
- We capture the characteristics of each unit of the memory architecture. For instance, for the L1 cache we capture:
* type of cache
* number of lines
* line size
* word size
* associativity
* read/write latency
* replacement policy
* write policy
etc.
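The parameter list above can be collected into a small record type, as sketched below. The field names mirror the slide; the concrete values are hypothetical, chosen only to make the example runnable.

```python
from dataclasses import dataclass

# The L1 parameters captured in the ADL, gathered into a dataclass for
# illustration. Field names mirror the slide; the values are hypothetical.
@dataclass
class CacheSpec:
    cache_type: str
    num_lines: int
    line_size: int          # words per line
    word_size: int          # bytes per word
    associativity: int
    read_latency: int       # cycles
    write_latency: int      # cycles
    replacement_policy: str
    write_policy: str

    def size_bytes(self):
        """Total data capacity implied by the captured parameters."""
        return self.num_lines * self.line_size * self.word_size

l1 = CacheSpec("data", 128, 4, 4, 2, 1, 1, "LRU", "write-back")
print(l1.size_bytes())  # 128 lines x 4 words x 4 bytes = 2048 bytes
```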
28. DRAM “Normal Read” Pipeline
- Similarly, we capture the characteristics of access modes. For instance, here we show what to capture for DRAM in “normal read” access mode.
- For this mode, the pipeline consists of three stages: row_dec and col_dec, followed by precharge.
- We also capture explicit timing for each pipeline stage: for instance, row_dec takes 6 cycles, col_dec takes 1 cycle, and precharge takes 6 cycles.
- Once the memory subsystem is captured, we are ready to generate a memory-aware software toolkit to perform design space exploration and to give designers feedback showing bottlenecks.
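The per-stage timings just described can be summed to get the latency of one access in this mode. The cycle counts below come from the slide; representing the mode as a dictionary is just an illustration of what the ADL captures.

```python
# Per-stage timings captured for the DRAM "normal read" access mode.
# Cycle counts are from the slide; the dictionary form is illustrative.
normal_read = {"row_dec": 6, "col_dec": 1, "precharge": 6}

def access_latency(mode):
    """Latency of one non-overlapped access in the given mode."""
    return sum(mode.values())

print(access_latency(normal_read))  # 6 + 1 + 6 = 13 cycles
```

Capturing the stages explicitly (rather than a single lump latency) is what lets the generated simulator model modes like page read, which can skip the row-decode stage on a same-row access.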
29. Experimental Setup
- I present here a set of exploration experiments to demonstrate the usefulness of our approach.
- The experimental setup is shown here.
- We use the TI C6x VLIW DSP.
- We considered memory configurations with a variety of modules.
- We chose application programs from the multimedia and DSP domains.
- We performed design space exploration with the goal of studying the trade-off between memory cost and processor performance.
30. Memory Subsystem Configurations
- The table shows the memory subsystem configurations we used for design space exploration.
- Each row corresponds to a memory configuration; each column corresponds to one memory module used in the configuration.
- For instance, configuration 1 contains a small 128-word L1 cache and a small 256-word stream buffer, along with off-chip DRAM.
- Configuration 2, instead of the stream buffer, has a large 2K on-chip SRAM.
- The third configuration replaces the SRAM with an L2 cache.
- The fourth configuration uses the same L1 cache with a different replacement policy: LRU instead of FIFO.
- The fifth configuration splits the on-chip space between a 1K L2 cache and a 1K software-controlled SRAM.
- The last configuration contains a very large SRAM.
- Configurations 1 through 6 are shown in increasing order of cost in terms of memory area and control logic: configuration 1 is the lowest-cost and configuration 6 the highest-cost configuration.
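The configurations described above can be summarized as data. The entries below record only what the slide text states for each configuration (all six also include the off-chip DRAM); where the original table gives more detail than the narration, the entry here is necessarily a partial paraphrase.

```python
# The six memory configurations, paraphrased from the slide narration.
# Entries record only what the text states; all configurations also
# include the off-chip DRAM. Keys are ordered by increasing cost.
configs = {
    1: "128-word L1 cache + 256-word stream buffer",
    2: "128-word L1 cache + 2K on-chip SRAM",
    3: "L1 cache + L2 cache (SRAM replaced by L2)",
    4: "L1 cache with LRU (instead of FIFO) + L2 cache",
    5: "1K L2 cache + 1K software-controlled SRAM",
    6: "very large SRAM",
}

for cost_rank, modules in configs.items():
    print(f"config {cost_rank}: {modules}")
```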
31. Design Space Exploration Results
- For each benchmark we present the dynamic cycle count for the six memory configurations presented earlier.
- The x-axis is the benchmark we ran; the y-axis is the cycle count.
- For each benchmark, the bars show the cycle count for the six configurations, in increasing order of cost.
- The blue bar is configuration 1, the lowest-cost memory configuration; the red bar is configuration 6, the highest-cost configuration.
32. Design Space Exploration Results
- As expected, in some benchmarks higher memory cost produced better performance: for instance, the ICCG, Integrate, and Lowpass benchmarks.
33. Design Space Exploration Results
- However, this is not always true. Other benchmarks show the unintuitive reverse trend: a lower-cost memory configuration produced better performance for the Hydro, Tridiag, and Stateeq benchmarks.
- It is interesting to note that configuration 6 has the worst performance for these three benchmarks.
34. Cycle Accounting: Configuration 6
- Cycle accounting clearly explains the reason for the poor performance of configuration 6 on Hydro, Tridiag, and Stateeq.
- Not all the arrays fit in the SRAM, and the lack of an L1 cache to compensate for the large latency of the DRAM takes its toll on performance.
- This type of behavior is very hard to predict through analysis alone, without simulation. Having such an explicit way to capture the memory architecture and drive DSE is crucial for giving accurate feedback to the designer and for evaluating processor-memory configurations.
35. Experimental Setup: DLX Case Study
The Architecture
capture the DLX architecture using EXPRESSION
generate VHDL description
synthesize using Synopsys Design Compiler
The Application
FFT benchmark.
Explorations based on area, power and speed.
Addition of Functional Units (Pipeline Paths)
Addition of Pipeline Stages
Addition of Operations
36. Addition of Functional Units
Schedule length = number of application cycles (relative)
37. Addition of Pipeline Stages
38. Addition of Operations
39. EXPRESSION
Documentation and the EXPRESSION toolkit are available for download at
http://www.ics.uci.edu/~express/
Tested and runs fine with Visual C++ 6
40. Design Space Exploration
ADL Driven Processor Memory Exploration
Communication Architecture Exploration
41. Communication Architectures
The MPSoC communication architecture (CA) fabric has to cope with the entire inter-component traffic
considerable impact on performance
leading cause of performance bottlenecks in MPSoC design
consumes anywhere between 20-50% of overall system power
critical for battery-driven mobile applications
has thermal implications, affecting packaging cost and device failure rate
takes up a significant chunk of the design cycle
increasing time to market
Exploration of the communication design space is absolutely essential for embedded systems to meet
performance requirements
time-to-market deadlines
42. Exploration Challenge
43. 1. Component Mapping Exploration
44. Component Mapping Exploration
45. Component Mapping Exploration
46. Component Mapping Exploration
47. Component Mapping Exploration
48. 2. Automated Synthesis of Hierarchical Shared Bus Architectures
49. Automated Synthesis of Hierarchical Shared Bus Architectures
50. Framework Overview
51. Graph Partitioning Approach
52. Experiments: Case Study 1
53. Experiments: Case Study 1
54. Experiments: Case Study 1
55. Experiments: Case Study 1
56. Experiments: Case Study 1
57. Experiments: Case Study 2
58. Experiments: Case Study 2
59. Experiments: Case Study 2
60. Experiments: Case Study 2
61. Experiments: Case Study 2
62. Experiments: Case Study 2
63. Comparison with Other Schemes
64. 3. Automated Bus Matrix Synthesis
65. 3. Automated Bus Matrix Synthesis
66. 3. Automated Bus Matrix Synthesis
67. Problem Formulation
68. Synthesis Framework Overview
69. Branch and Bound Clustering
70. Branch and Bound Clustering
71. Branch and Bound Clustering
72. Branch and Bound Clustering
73. Synthesis Framework Overview
74. Experiment Setup
75. Example: Sirius Application
76. Experiment: Cost Reduction
77. Experiment: Power Performance Trade-offs
78. Experiment: Power-Area Tradeoffs
79. Impact of PVT Variations on Power
In sub-100nm DSM technologies, Process, Voltage, and Temperature (PVT) variability is being observed, due to
Increasing leakage power
Use of power-aware design methodologies (voltage islands, DVS/DFS)
PVT variability makes it hard to achieve safe designs
Results in significant fluctuations in power (and timing)
Power (and timing) estimates from early in the design flow are no longer valid
Considerable effort is required later in the design flow to account for variability-induced fluctuations
80. Impact of PVT Variations on Power
Impact of PVT variation on the power dissipation of an AMBA AHB bus matrix communication architecture (90nm)
Significant (~10x) variation in power dissipation under different PVT conditions!
Important to incorporate PVT variability at system-level during power exploration of on-chip communication architectures
For more accurate and realistic power characterization