200 likes | 353 Views
Memory Efficient Software Synthesis from Dataflow Graph. Wonyong Sung, Junedong Kim, Soonhoi Ha Codesign and Parallel Processing Lab. Seoul National University. Contents. Introduction Code Generation from Block Diagram Specification Synchronous Data Flow and Single Appearance Schedule
E N D
Memory Efficient Software Synthesis from Dataflow Graph Wonyong Sung, Junedong Kim, Soonhoi Ha Codesign and Parallel Processing Lab. Seoul National University
Contents • Introduction • Code Generation from Block Diagram Specification • Synchronous Data Flow and Single Appearance Schedule • Proposed Strategies • Optimization 1 : code sharing optimization • Optimization 2 : minimize buffer requirement • Experiments • Conclusions
Introduction • Motivations • Embedded system has limited amount of memory • large program = memory cost, performance penalty, power consumption • New trend of software development : high level design methodology • growing complexity, fast design turn-around time, limited budget, etc. • Goal of Research • Reduce the code and data size of automatically generated software • In an automatic software synthesis environment • Specification = Dataflow graph with SDF(Synchronous DataFlow) semantics
Software Synthesis from SDF graph main(){ for(i=0;i<6;i++){A} for(i=0;i<4;i++){B} for(i=0;i<3;i++){C} for(i=0;i<2;i++){D} } main(){ for(i=0;i<2;i++){ for(j=0;j<3;j++){A} for(j=0;j<2;j++){B} } for(i=0;i<3;i++){C} for(i=0;i<2;i++){D} } B 3 1 2 2 A D 1 3 2 2 C Possible Schedules : = AABCABACDABABCD = (6A)(4B)(3C)(2D) = (2(3A2B))(3C)(2D) Single Appearance Schedule (SAS)
Previous Efforts • Single Appearance Schedule (SAS): APGAN,RPMC • [by Battacharyya et. al.] in Ptolemy Group • SAS guarantees the minimum code size (without code sharing) • APGAN,RPMC : heuristics to find data minimized SAS schedule • ILP formulation for data memory minimization • [by Ritz et. al.] in Meyr Group • flat single appearance schedule + sharing of data buffer • Rate optimal compile time schedule • [by Govindarajan et. al.] in Gao Group • tried to minimize the buffer requirement using linear programming • An algorithm to compute the smallest data buffer size • [by Ade et. al.] in GRAPE group
Proposed Strategies • Coding style • not stuck to one coding style, hybrid approach • generated code is a mixture of inlines and functions • Optimization 1: Code Sharing • Multiple instances of a same kernel treated as different node in SAS • Code sharing optimization has gain(block size) and cost(context size) • Optimization 2: Schedule Adjustment • give up single appearance schedule to reduce the data size • (1) represents schedule information with BTLC data structure • (2) find possible location for adjustment • (3) schedule adjustment
BTLC Flowchart of Optimization Procedure Get SAS schedule [RPMC,APGAN] code-block size context size Code sharing optimization Schedule Adjustment C code generation
Example of Code Sharing (CD2DAT) ramp sine fir1 fir2 fir3 fir4 xgraph ramp’ sine’ xgraph Code before sharing for(int i=0;i<2;i++) { { /* code for fir1 */ ……………… out = tap*input[i]’ ……………… } } /* code for fir 2 */ …………….. Code after sharing for(int i=0;i<2;i++) fir(1); for(int i=0;i<3;i++) fir(2); …………… void fir(int context){ ……………… context_FIR[context].out... ……………… } context definition typedef struct{ double *out; int output_ofs; int output_bs; int output_nx; …………. double decimation; double tap; }context_FIR;
Code Size Overhead (in Sparc/Solaris) without context with context ….. = value; ….. = *(context_CGCRamp[context].value); ldd [%fp + -336],%o0 sethi %hi(0x20800),%o1 ld [%o1+0x3c8], %o0 mov %o0, %o2 sll %o2, 2, %o1 add %o1, %o0, %o1 sll %01, 3, %o0 add %fp, -424, %o1 add %o1, %o0, %o2 ld [%o2 + 0x1c], %o0 ldd [%o0], %o2 4 bytes 40 bytes Reference Overhead = 36 bytes!
Optimization 1 : Code sharing • Multiple instances of a same kernel have their own contexts • Kernel code should be transformed into shared version function • Shared Version • references are only through context variable • Gain and cost of sharing • Gain = (# instances -1) (code block size) • Cost = (#instances) (context variable size) + (code block overhead) • Code sharing is performed only when the gain is larger than the cost
1 > (-1) (-1) > > + > context + reference + > context + shared Decision Formula (1) = code sharing overhead = context + reference (2) context = pi(pi), pi ports where, (x) = 3*sizeof(int) + sizeof(pointer) (3) reference = t {S,C,AS,AP}((t)(t)) (t) = reference count (t) = unit overhead t = type of reference (4) = code block size (5) = number of instances
C A B 3 7 5 6 A B C Optimization 2 : Adjusting SAS • Adjusting Single Appearance Schedule • 2(7A3B)5C ==> 51 • 2(7A3B2C)C ==> 39 • give up single appearance schedule • BTLC (Binary Tree with Leaf Chain) G 5 2 [6,0,0] = [input, inside, output] 3 [0,0,21] 7 [21,0,15] [7,0,5] [0,0,3]
G [0,30,0] 5 2 [30,0,0] [0,21,30] 3 [0,0,21] 7 [21,0,15] C In general [7,0,5] [0,0,3] 30 21 W A B W = |OL IR| I = | IL IR - W | O = | OL OR -W | L R [I,W,O] Computation of Buffer Requirements 2 7 3 7 5 3 A B 21 30
BTLC Flowchart of Schedule Adjustment SAS schedule Construct BTLC Compute buffer requirement Find candidate for adjustment no found yes Adjust schedule (split a chain) Done code generation
G 2 5 7 3 C A B Splitting A Chain • Finding split candidate • a chain which has the largest number • in this example BC is selected • Schedule after splitting • 2(7A3B2C)C • In general, for a schedule that has two clusters aCabCb(a and b are loop counts) new schedule is defined as • a(Ca(b/a)Cb)(b%a)Cb) , if a<b • (a%b)Ca b((b/a)CaCb ), otherwise [0,30,0] [30,0,0] [0,21,30] [21,0,15] [0,0,21] [6,0,0] 30 21 Split point [0,0,3] [7,0,5] Schedule = 2(7A3B)5C
Decision Formula G [0,6,0] [0,12,6] [6,0,0] 2 1 [12,0,0] [0,21,15] 1 2 C |Cluster| = |W| value of the cluster [6,0,0] [21,0,15] 6 • New Schedule • 2(7A3B2C)C • Gain = 12 [0,0,21] 7 3 C [6,0,0] 12 21 A B [7,0,5] [0,0,3]
[0,280,0] G [0,56,280] [280,4,0] 7 40 [56,0,40] [0,6,56] [7,0,4] [4,0,0] F4 4 7 8 [6,0,8] [0,1,6] 280 4 3 2 F3 X1 [1,0,0] [1,0,2] [7,0,5] [0,1,1] 56 1 F1 [0,35,0] 6 1 [0,1,2] G [3,0,4] F2 1 X2 [1,0,0] [0,35,35] 1 [0,2,1] [35,4,0] 7 5 1 fork [1,0,2] [0,56,40] [35,4,0] 1 [0,1,2] 1 5 [7,0,4] [4,0,0] F4 4 1 M [2,0,1] [4,0,0] [56,35,40] 2 [0,6,56] [0,0,2] 4 [7,0,4] F4 4 7 8 1 S2 [1,0,1] X1 [1,0,0] [6,0,8] 35 1 35 [0,1,1] 4 2 1 R2 [0,0,1] F3 X1 [7,0,5] [1,0,0] 0 [0,0,1] 56 1 R1 S1 [1,0,1] [3,0,4] F2 Experiment : CD2DAT
Experimental Result Program size after each optimization CD2DAT Filter Bank SAS 13672 28512 Code Sharing 12768 22024 Schedule Adjustment 12296 22024 Memory behavior of CD2DAT in ARM7 Fetches Miss SAS 17098177 57189 Code Sharing 17573923 52867 Schedule Adjustment 17499386 54331
Conclusion • Our Environment • PeaCE : Ptolemy extension as Codesign Environment • Optimization Techniques in Software Synthesis • For automatic code generation from dataflow graph • Joint minimization of code and data size • Selective application code sharing and schedule adjustment to SAS • Future works • Clustering : multiple fine grain nodes into a large one • increase chance of code sharing • Buffer sharing • further reduce the buffer size and increase the cache effect