Hardware-Software Cosynthesis for Microcontrollers
Cosyma – a software-oriented cosynthesis approach. Software-oriented: initially everything is implemented in software; external hardware is generated only when timing constraints are violated (except for basic I/O).
COSYMA = COSYnthesis for eMbedded Architecture
Partitioning of software
• Partitioning is based on the software, so the software language must have constructs supporting partitioning
• C is often used for embedded systems, but has no support for partitioning
• A superset of C, called CX, was defined
• These extensions to C include:
  • timing: minimum and maximum delays and durations between C labels of a task
  • a task concept
  • task intercommunication
Partitioning problems
• The goal of partitioning is to identify where constraints are violated.
• Partitioning is done at different levels of granularity:
  • Coarse: task- and function-level partitioning
  • Fine: basic-block and statement level
• Using a finer granularity is more difficult because of:
  • Communication time overhead
  • Communication area overhead
  • Interlocks – waiting time
  • Compiler effects
• The paper focuses on fine-grain partitioning, with the basic block as the smallest partitioning unit.
The Cosyma system
• For partitioning purposes, the system description (in CX) is translated into an internal graph representation.
• Requirements on this graph representation:
  • A complete representation of all input constructs shall be possible
  • The user's influence on the syntactic structure shall be preserved
  • The representation shall support partitioning and generation of a hardware description for parts moved to hardware
  • Estimation techniques shall be possible
The Extended Syntax Graph
• A typical control/dataflow graph cannot handle the first two requirements
• ES graph: a syntax graph extended with:
  • a symbol table
  • local data and control dependencies
The Extended Syntax Graph
• Each identifier has a pointer to its definition
• Each definition has pointers to all its instances
• The symbol table can thus be used to determine an upper bound on the cost of communication
The Extended Syntax Graph
(Figure: ES graph for the loop below – a def node for int i with inst pointers to each use of i, and the timing annotation between the labels.)

from lab1 to lab2: max 5 us
int i;
f = 1;
lab1: for (i = 1; i < n; i++)
    f *= i;
lab2: ;
Data Dependency int x; int y; int z; x = a + b y = b * 3 z = x + y
(Figure: ES graph for the three assignments – def nodes for int x, y, z with inst pointers into the expression trees.)
The symbol table contains no information about data dependencies between operations, which is required to perform scheduling on the graph in order to estimate HW execution times.
Basic Scheduling Blocks
(Figure: the ES graph overlaid with a dataflow graph – inputs a, b and 3 feeding a + and a * node, whose results x and y feed a second +, producing z.)
The ESG is overlaid with a second graph consisting of cross-linked blocks, called Basic Scheduling Blocks (BSBs).
HW/SW partitioning on the ES graph
• Mark the nodes that are to be moved to HW
• Generate C code by reconstructing the CX code preserved in the ES graph
• Insert the HW/SW communication protocol; analyze the dataflow in the graph
• Generate object code for SW and check constraints by simulation
• Generate HW using the Olympus synthesis system
Iterative approach for partitioning
Since it would take too much effort to evaluate the whole system (synthesis, compilation and runtime analysis) for every candidate partitioning, an "inner loop" estimation based on a cost function is used instead: ES graph → partitioning → cost estimation, repeated.
Initially, the chosen solution is infeasible with regard to the time constraints, so the cost function puts a high penalty on solutions whose runtimes exceed the time constraint.
Hardware Extraction Process
• A cost function that favors implementations in hardware is used. This cost function takes into consideration:
  • knowledge of synthesis
  • compilers
  • pre-defined HW libraries
• Several specialized cost functions can work in parallel to extract different types of hardware.
Hardware Extraction Process
• A cost function for extracting computationally intensive parts
• Simulation and profiling are used to identify these parts, determining:
  • the number of times a node has been executed
  • the potential speedup through HW synthesis
  • the communication time penalty
Hardware Extraction Process
• Speedup is estimated using:
  • an operator table holding the execution times of the functional units used in synthesis
  • scheduling of the operations in the ES graph to estimate potential concurrency
• The communication time penalty is estimated by:
  • dataflow analysis (number of variables to be transferred)
  • the number of clock cycles per variable transfer
The Cost Function
Uses data from the pre-processing stages (for example, the number of clock cycles for variable transfer to the co-processor). When a basic block B is moved to hardware, the cost increment dc is defined as:

dc(B) = a(TC,TS) * [tneff(B) + tcom(B) – tHW-SW(B) – tSW(B)] * It(B)

a(TC,TS) = sign(TC – TS) * exp[(TC – TS)/T]
(exponential weighting of runtime above the given constraint)

where
TC = time constraint
TS = resulting time needed by the hardware-software system
T = constant factor
tneff = effective hardware timing
tcom = communication overhead
tHW-SW = time overlap between hardware and software in case of parallel execution (equals zero in the paper)
tSW = runtime when B is implemented in software
It(B) = number of times B was executed during profiling
(Figure: timing diagram comparing pure-SW execution with co-processor execution of B (y = b * 3) for x = a + b; y = b * 3; z = x + y, with Tcomm = 3, TC = 5, THW-SW = 0, It(B) = 1, TSW = 10; plus a plot of the cost function as a function of tneff.)
Selection of BSBs
• There is a large number of possible HW partitions
• Costs are estimated for adjacent BSBs in the control flow, to reduce the preprocessing effort for communication costs
• Several BSBs can be moved to HW; to avoid redundant variable exchange, the communication between BSBs must be considered
• The finer the granularity, the larger the impact of communication
Results from experiments
• In the examples, the speedup was 1.3 – 3
• Communication overhead was a more important factor than the number of iterations
• It is important to consider compiler optimization
About this paper
• When written, only one other partitioning system supported automatic partitioning
• CX: an extension to C for system design
• Extended syntax graph for system analysis and partitioning
• Hardware extraction process using a cost function
• Examples showing the importance of communication overhead