Generating Hardware Designs by Source Code Transformation
Ashley Brown, Wayne Luk, Paul Kelly
STS ’06
What would we like to do?
• Take an algorithm written in C.
• Generate an efficient hardware design and run it on an FPGA.
• Fast design cycle, easy-to-maintain code.
• C programmers should be able to create fast hardware!
21st June 2005 | Ashley Brown
Background: Handel-C
• C-based programming language for digital system design.
• One clock cycle per statement.
• Explicit parallelism.
• Compiler generates a hardware design from Handel-C source.
Handel-C code example:
    while (j != 3) {
        par {
            t0 = aa[0] * bb[0];
            t1 = aa[1] * bb[1];
        }
        par {
            cc[i][j] = t0 + t1;
            j++;
        }
    }
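The cycle cost of the example loop can be modelled in plain C. This is only a sketch under stated assumptions: each assignment costs one cycle, every statement in a par block shares a single cycle and reads the values from the start of that cycle, and the loop test costs nothing. The function names and the cycle counter are illustrative and are not part of Handel-C.

```c
#include <assert.h>

enum { COLS = 3 };

/* Sequential model: each of the four assignments costs one cycle. */
unsigned row_cycles_seq(const int aa[2], const int bb[2], int cc_row[COLS])
{
    assert(aa && bb && cc_row);
    unsigned cycles = 0;
    int t0, t1, j = 0;
    while (j != COLS) {
        t0 = aa[0] * bb[0];   cycles++;
        t1 = aa[1] * bb[1];   cycles++;
        cc_row[j] = t0 + t1;  cycles++;
        j++;                  cycles++;
    }
    return cycles;
}

/* Parallel model: each par block costs one cycle in total. */
unsigned row_cycles_par(const int aa[2], const int bb[2], int cc_row[COLS])
{
    assert(aa && bb && cc_row);
    unsigned cycles = 0;
    int t0, t1, j = 0;
    while (j != COLS) {
        /* par block 1: both multiplies in the same cycle */
        t0 = aa[0] * bb[0];
        t1 = aa[1] * bb[1];
        cycles++;
        /* par block 2: the store sees the old j; j is updated at the cycle end */
        int j_old = j;
        cc_row[j_old] = t0 + t1;
        j = j_old + 1;
        cycles++;
    }
    return cycles;
}
```

Under this model the par version needs half the cycles of the sequential version for the same result.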
Problems
• Software programmers: bad Handel-C, poor hardware.
  • No exploitation of statement-level parallelism.
  • Long expressions.
  • Lots of for loops!
• Experienced Handel-C designers: good hardware, hard-to-read code.
  • Trickery to reduce clock cycles, increase clock rate.
• Finding the “optimal” solution is not easy.
  • Optimisation effectiveness depends on the target architecture (see the results later!)
Solutions
• Restructure Handel-C code to optimise.
  • Can parallelise if desired.
  • Duplicate hardware if necessary.
• Apply transformations to the original source, leaving it intact.
  • The original readable description is still available.
  • A more efficient version is used for hardware generation.
• Allow the user to define custom transformations with a transformation language.
• Generate a whole design space of solutions, with different optimisations.
What’s New?
• Previous work with user-specified transformations has been:
  • For software-based C.
  • Aimed at parallelising/optimising for microprocessors.
  • Can’t duplicate microprocessor hardware on the fly – it’s either there or not.
  We can duplicate hardware and pipeline – FASTER DESIGN!
• Previous work on hardware language transformations does not allow the user to describe transformations (Haydn-C).
  We do – the user can target their code explicitly.
• Exploring an entire design space is usually done at the hardware level, not in a high-level language (although not always, e.g. ASC).
  We generate a full design space – find *the* best solution.
Basic Components
CML transformations are defined within transform blocks.
The optional always keyword indicates that this transformation should always be applied where it can.
Each transformation can have a name to identify it for reporting.
The pattern section describes the format of the code to match for this transformation.
The generate section describes the code that should replace the pattern.
Wildcards, such as cmlexpr, allow a pattern to be matched and substituted into the new tree.
• Wildcard matching:
  • cmlexpr - matches any expression
  • cmlstmt - matches any statement
  • cmlstmtlist - matches a list of statements
    // 1 * x = x
    always transform std_times1_elim {
        pattern {
            cmlexpr 1 * ( operand )
        }
        generate {
            cmlexpr ( operand )
        }
    }
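The effect of the 1 * x = x rule can be sketched in C on a small expression tree. This is not how the CML engine is implemented. The Expr type, its field names and times1_elim are hypothetical names introduced only to show a pattern match followed by substitution of the matched operand.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_MUL, EXPR_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    int value;               /* used by EXPR_CONST */
    char name;               /* used by EXPR_VAR */
    struct Expr *lhs, *rhs;  /* used by EXPR_MUL and EXPR_ADD */
} Expr;

/* Rewrites the tree bottom-up and returns the (possibly different) root.
 * Pattern:  1 * operand      Generate:  operand
 * Nothing is allocated or freed; replaced nodes are simply unlinked. */
Expr *times1_elim(Expr *e)
{
    if (e == NULL)
        return NULL;
    if (e->kind == EXPR_MUL || e->kind == EXPR_ADD) {
        e->lhs = times1_elim(e->lhs);
        e->rhs = times1_elim(e->rhs);
        assert(e->lhs != NULL && e->rhs != NULL);
    }
    if (e->kind == EXPR_MUL &&
        e->lhs->kind == EXPR_CONST && e->lhs->value == 1)
        return e->rhs;       /* the wildcard match replaces the whole node */
    return e;
}
```

Applied to the tree for (1 * x) + y, the call returns the same root with its left child replaced by x.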
Ensuring Data Integrity
• Three types of condition are defined to ensure data integrity:
  • Data-flow sets.
  • Expression evaluation.
  • Constant validation.
• Transformations have a conditions section to define these.
Hand-coded vs Automated
Sequential:
    do {
        if (A >= B) {
            A -= B;
            C = (C << 1) | 1;
        } else {
            C <<= 1;
        }
        B >>= 1;
        Bits--;
    } while (Bits != 0);
Hand-coded:
    do {
        par {
            if (A >= B) {
                par {
                    A -= B;
                    C = (C << 1) | 1;
                }
            } else {
                C <<= 1;
            }
            B >>= 1;
            Bits--;
        }
    } while (Bits != 0);
Automated:
    do {
        par {
            if (A >= B) {
                par {
                    A -= B;
                    C = (C << 1) | 1;
                }
            } else {
                C = (C << 1);
            }
            B = (B >> 1);
            Bits = (Bits - 1);
        }
    } while (Bits != 0);
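The loop on this slide reads as a shift-and-subtract division, and the sequential form can be run as ordinary C. The sketch below rests on assumptions the slide does not state: A starts as the dividend, B as the divisor shifted left by Bits - 1, and C as zero. It also reads the else branch as C <<= 1, as the automated column implies. DivResult and shift_subtract_divide are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} DivResult;

/* Produces `bits` quotient bits, most significant first.
 * The quotient is exact only if dividend / divisor fits in `bits` bits. */
DivResult shift_subtract_divide(uint32_t dividend, uint32_t divisor,
                                unsigned bits)
{
    assert(divisor != 0 && divisor < (1u << 16));
    assert(bits >= 1 && bits <= 16);   /* keeps divisor << (bits - 1) in range */

    uint32_t A = dividend;
    uint32_t B = divisor << (bits - 1);
    uint32_t C = 0;
    unsigned Bits = bits;

    do {
        if (A >= B) {
            A -= B;
            C = (C << 1) | 1;          /* shift a 1 into the quotient */
        } else {
            C <<= 1;                   /* shift a 0 into the quotient */
        }
        B >>= 1;
        Bits--;
    } while (Bits != 0);

    return (DivResult){ C, A };
}
```

For example, dividing 13 by 3 with 4 quotient bits leaves C = 4 and A = 1.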
Test Transformations
• Generic – applicable to all programs:
  • autopar – parallelise sequential statements with no dependencies.
  • fortowhile – convert for loops into corresponding while loops.
  • lttoeq – convert for loops with < in the loop condition to ==.
• Application specific – targeted at the test programs:
  • matrixpar – parallelisation of an inner loop.
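As a rough illustration of fortowhile followed by lttoeq, the two C functions below compute the same sum. The sketch assumes that the rewritten loop stops when the counter equals the bound (written here as a != continuation test), and that the counter starts at or below the bound and steps by one. Without those conditions the rewrite would not be safe. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: for loop with a less-than comparison. */
unsigned sum_for_lt(const unsigned *v, unsigned n)
{
    assert(v != NULL || n == 0);
    unsigned s = 0;
    for (unsigned i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* After fortowhile and lttoeq: while loop that ends when i == n.
 * Valid here because i starts at 0 <= n and increases by exactly 1. */
unsigned sum_while_ne(const unsigned *v, unsigned n)
{
    assert(v != NULL || n == 0);
    unsigned s = 0;
    unsigned i = 0;
    while (i != n) {
        s += v[i];
        i++;
    }
    return s;
}
```

An equality test needs only a comparator for equal values rather than a magnitude comparison, which is one plausible reason for the change to affect clock rate.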
More Transformations
• Various mathematical rearrangements:
  • Factorise to reduce multiplies.
  • Remove *1, *0, +0 etc.
• More interesting:
  • Dead-code elimination (remember data conditions!)
  • Variable replacement: remove dependencies in code by replacing variables with the expressions last assigned to them (again, remember data conditions!)
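Two of these rewrites are easy to show in C. The functions below are illustrative and not taken from the test programs. The first pair factorises a*b + a*c into a*(b + c), saving one multiply. The second pair replaces t with the expression last assigned to it, so the two assignments no longer depend on each other. That is only valid because x is not modified in between, which is the kind of fact a data condition would have to establish.

```c
#include <assert.h>

/* Two multiplies and one add. */
unsigned unfactored(unsigned a, unsigned b, unsigned c)
{
    return a * b + a * c;
}

/* One multiply and one add; equal for unsigned (modular) arithmetic. */
unsigned factored(unsigned a, unsigned b, unsigned c)
{
    return a * (b + c);
}

/* y depends on t, so the two assignments must run in order. */
unsigned with_dependency(unsigned x)
{
    unsigned t = x + 1;
    unsigned y = t * 2;
    return t + y;
}

/* t replaced by x + 1 in the second assignment: no dependency remains. */
unsigned dependency_removed(unsigned x)
{
    unsigned t = x + 1;
    unsigned y = (x + 1) * 2;
    return t + y;
}

/* Checks that both rewrites preserve the result for the given inputs. */
void check_equivalent(unsigned a, unsigned b, unsigned c)
{
    assert(unfactored(a, b, c) == factored(a, b, c));
    assert(with_dependency(a) == dependency_removed(a));
}
```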
Execution Time Improvement
• lttoeq increases fmax on Altera, but decreases it on Xilinx.
[Chart: execution time (s) against optimisation applied; optimisations are cumulative.]
Design-Space Exploration
• Difficult to decide which transformation is best.
• Don’t guess, produce several solutions.
• Branch the AST whenever a transformation is applied.
• In-place branches: small AST.
• Propagate branches when no more transformations can be applied.
• Repeat transformation process on each new solution.
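The branching idea can be sketched in C with a strong simplification: the transformation sites are assumed to be independent, so each solution is fully described by the set of sites where a transformation was applied, stored as a bitmask. The real engine branches an AST and re-runs matching on each new solution, which this sketch does not attempt. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Branches at each site: one child leaves the site untouched,
 * the other applies the transformation there. */
static size_t explore(unsigned site, unsigned n_sites, unsigned applied,
                      unsigned *out, size_t count)
{
    if (site == n_sites) {
        out[count] = applied;      /* no more sites: record this solution */
        return count + 1;
    }
    count = explore(site + 1, n_sites, applied, out, count);
    return explore(site + 1, n_sites, applied | (1u << site), out, count);
}

/* Writes one bitmask per distinct solution into out; returns how many. */
size_t enumerate_solutions(unsigned n_sites, unsigned *out, size_t capacity)
{
    assert(out != NULL && n_sites < 16);
    assert(capacity >= ((size_t)1 << n_sites));
    return explore(0, n_sites, 0u, out, 0);
}
```

With three independent sites this produces eight solutions, which shows how quickly the design space grows as more branch points are added.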
Design Space Exploration
Design Space Exploration
• Assume a design with an fmax of 104 MHz; solutions must match that.
• Many solutions match, so we should consider other factors such as area, power or number of cycles.
• Being brief: look at solutions 139 and 232.
  • Only partially parallelised. The solution with the most parallelism (239) does not meet the fmax requirement.
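Selecting from the generated solutions can be sketched as a filter followed by a ranking. The code below assumes each solution has been characterised by fmax, cycle count and area, keeps only those that meet the required fmax, and prefers fewer cycles and then smaller area. The DesignPoint fields and the ranking order are assumptions made for illustration, not the selection method used in this work.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;            /* solution number */
    double fmax_mhz;   /* maximum clock frequency */
    unsigned cycles;   /* cycle count for the test program */
    unsigned area;     /* resource usage, in arbitrary units */
} DesignPoint;

/* Returns the index of the best feasible point, or -1 if none qualifies. */
int pick_design(const DesignPoint *pts, size_t n, double required_fmax_mhz)
{
    assert(pts != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (pts[i].fmax_mhz < required_fmax_mhz)
            continue;              /* fails the clock-rate requirement */
        if (best < 0 ||
            pts[i].cycles < pts[best].cycles ||
            (pts[i].cycles == pts[best].cycles &&
             pts[i].area < pts[best].area))
            best = (int)i;
    }
    return best;
}
```

With this ranking, a heavily parallelised solution that has the lowest cycle count is still rejected if its fmax falls below the requirement.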
Future Work
• Extensions to the language to allow additional matching.
  • expr replicator, complex expression matching.
• Preservation of structure – e.g. a++; does not become a = a + 1;
• Heuristics for selecting transformations to apply.
• Genetic algorithms for transformation selection? “Breed” good transformation solutions.
Future Applications
• Aspect-oriented concepts: automatically inserting debugging signals.
• Power-signature-masking code to avoid attacks in cryptographic applications.
Conclusion
• Matching method can achieve good results on naïve C code.
• Targeting domain- or application-specific constructs can provide large performance gains at the expense of resources.
• Scope to produce a much more powerful system with changes to the transformation language, heuristics and more efficient algorithms.
Contributions
• The first transformation language for parallelising hardware languages with data integrity conditions.
• A prototype transformation engine for implementing the language.
• Automatic transformations capable of achieving a 35-70% reduction in execution time.
• An insight into the interaction of transformations, both with each other and with the platform their output runs on.
Cycle Count Improvements
Design-Space Exploration
[Figure: transform, creating a branch point.]
Design-Space Exploration
[Figure: propagate branches to root, creating several distinct solutions.]
Conventional Array Access
[Figure: congestion.]
Rotational Shift Array Access
[Figure: distributed accesses.]