This chapter explores the application of dependence analysis in C and C++ for hardware design, focusing on pointers, loops, scoping, problematic C dialects, and miscellaneous issues such as volatile variables. It delves into the challenges and solutions related to optimizing software development while considering hardware operations.
Other Applications of Dependence Allen and Kennedy, Chapter 12
Overview • So far, we’ve discussed dependence analysis in Fortran • Dependence analysis can be applied to any language and translation context where arrays and loops are useful • Application to C and C++ • Application to hardware design
Problems of C • C as a “typed assembly language” versus Fortran as a “high performance language” • C focuses more on ease of use and on hardware operations • Post-increments, pre-increments, register variables • Fortran's focus is on ease of optimization
Problems of C • In many cases, optimization is not desired: while (!(t = *p)); • An optimizer would move the load of *p outside the loop (sketched below) • C++ as well as other newer languages focus more on simplified software development, at the expense of optimizability • Use of these languages has expanded into areas where optimization is required
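A minimal C sketch of the busy-wait above, assuming a memory-mapped device register at a made-up address; volatile is what prevents the load from being hoisted:

  #include <stdint.h>

  /* hypothetical device status register; the address is invented */
  #define STATUS_REG ((volatile uint32_t *)0xFFFF0000u)

  void wait_until_ready(void)
  {
      volatile uint32_t *p = STATUS_REG;
      uint32_t t;
      /* volatile forces a fresh load of *p on every iteration;
         without it, an optimizer may hoist the load and spin forever */
      while (!(t = *p))
          ;
      (void)t;
  }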
Problems of C • Pointers • The memory locations accessed through a pointer are not clear • Aliasing • C does not guarantee that arrays passed into a subroutine do not overlap (see the sketch below) • Side-effect operators • Operators such as pre- and post-increment encourage a style in which array operations are strength-reduced by the programmer
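A minimal sketch of the aliasing problem; the function and call are hypothetical:

  /* If the caller passes overlapping arrays, e.g. shift_add(x + 1, x, n - 1),
     then a[i] = b[i] + 1.0f is really the recurrence a[i] = a[i-1] + 1.0f,
     so the compiler must assume a loop-carried dependence. */
  void shift_add(float *a, float *b, int n)
  {
      for (int i = 0; i < n; i++)
          a[i] = b[i] + 1.0f;
  }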
Problems of C • Loops • Fortran loops provide known values and restrictions that simplify optimization
Pointers • Two fundamental problems • A pointer variable can point to different memory locations during its use • A memory location can be accessed by more than one pointer variable at any given time, producing aliases for the location • The result is much more difficult and expensive dependence testing
Pointers • Without knowledge of all possible references to an array, compilers must assume a dependence • Analyzing the entire program to find dependences is possible, but still unsatisfactory • This leads to the use of compiler options / pragmas • Safe parameters • All pointer parameters to a function point to independent storage • Safe pointers • All pointer variables (parameter, local, global) point to independent storage (see the sketch below)
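One way to state the “safe parameters” assumption directly in the source is ISO C99's restrict qualifier, shown here as an illustration (not part of the chapter):

  /* restrict promises the compiler that a and b do not overlap,
     so the loop can be analyzed much like a Fortran loop */
  void vadd(float * restrict a, const float * restrict b, int n)
  {
      for (int i = 0; i < n; i++)
          a[i] = a[i] + b[i];
  }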
Naming and Structures • In Fortran, a block of storage can be uniquely identified by a single name • Consider these constructs: p; *p; **p; *(p+4); *(&p+4);
Naming and Structures • Structures can be troublesome, unions especially • Naming problem: what is the name of ‘a.b’? • Unions allow different-sized objects to overlap the same storage • One approach: reduce references to the common unit of smallest storage possible (see the sketch below)
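A minimal sketch of the overlap problem, with invented names:

  #include <stdint.h>

  /* u.w and u.b name the same storage, so a store through u.b may change
     u.w; a conservative analyzer must treat references to either member
     as references to the whole union */
  union overlay {
      uint32_t w;
      uint8_t  b[4];
  };

  uint32_t clear_low_byte(union overlay u)
  {
      u.b[0] = 0;      /* also modifies part of u.w */
      return u.w;
  }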
Loops • Lack of constraints in C • Jumping into the loop body is permitted • The induction variable (if there is one) can be modified in the body of the loop • The loop increment value may also be changed • The conditions controlling the initiation, increment, and termination of the loop have no constraints on their form
Loops • Conditions for rewriting a C loop as a DO loop (see the sketch below) • It must have one induction variable • That variable must be initialized with the same value on all paths into the loop • The variable must have one and only one increment within the loop • The increment must be executed on every iteration • The termination condition must fit the DO-loop form, a test of the induction variable against a loop-invariant bound • No jumps into the loop body from outside
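A minimal sketch of such a rewrite, using hypothetical functions:

  /* while loop that meets the conditions above ... */
  void saxpy_while(float *x, float *y, float alpha, int n)
  {
      int i = 0;                      /* one induction variable, one initialization */
      while (i < n) {                 /* termination tests the induction variable   */
          y[i] = y[i] + alpha * x[i];
          i = i + 1;                  /* exactly one increment, on every iteration  */
      }
  }

  /* ... and its canonical, DO-like form */
  void saxpy_for(float *x, float *y, float alpha, int n)
  {
      for (int i = 0; i < n; i++)
          y[i] = y[i] + alpha * x[i];
  }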
Scoping and Statics • Create unique symbols for variables with the same name but different scopes • Static variables • Which procedures have access to the variable can be determined from the scope information • If it contains an address, then the contents of that address can be modified by any other procedure
Problematic C Dialects • Use of pointers rather than arrays • Use of side-effect operators • These complicate the work of optimizers • They need to be removed • Use of address and dereference operators
Problematic C Dialects • Requires enhancements to some transformations • Constant propagation • Treat address operators as constants and propagate them where essential • Replace a generic pointer inside a dereference with the actual address • Expression simplification and recognition • Needs stronger recognition of which variable within an expression is actually the ‘base variable’
Problematic C Dialects • Conversion into array references • It is useful to convert pointers into array references • Induction variable substitution • Problem with strength reduction of array references • Expanding side-effect operators also requires changes (see the sketch below)
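A minimal sketch, with hypothetical functions: the side-effect operators are expanded, the pointers are recognized as induction variables with stride 1, and the dereferences become array references:

  /* pointer/side-effect style as written by the programmer */
  void copy_ptr(int *a, int *b, int n)
  {
      while (n-- > 0)
          *a++ = *b++;
  }

  /* after expanding ++ and substituting the induction variables */
  void copy_idx(int *a, int *b, int n)
  {
      for (int i = 0; i < n; i++)
          a[i] = b[i];
  }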
C Miscellaneous • Volatile variables • Functions containing these variables are best left unoptimized • setjmp and longjmp • Commonly used for error handling • Store and restore the current state of the computation, which is complex when optimizations are performed and variables are allocated to registers • Left unoptimized (see the sketch below)
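A minimal sketch of why setjmp/longjmp constrains register allocation; the error path and names are invented:

  #include <setjmp.h>
  #include <stdio.h>

  static jmp_buf env;

  static void fail(void)
  {
      longjmp(env, 1);               /* error path: jump back to setjmp */
  }

  int main(void)
  {
      /* automatic variables modified between setjmp and longjmp have
         indeterminate values after the jump unless declared volatile,
         so keeping them in registers across this region is unsafe */
      volatile int attempts = 0;

      if (setjmp(env) != 0) {
          printf("recovered after %d attempt(s)\n", attempts);
          return 0;
      }
      attempts = attempts + 1;
      fail();
      return 1;                      /* not reached */
  }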
C Miscellaneous • varargs and stdargs • Functions taking a variable number of arguments • Left unoptimized
Hardware Design: Overview • Today, most hardware design is language-based • Textual description of hardware in languages similar to those used to develop software • The level of abstraction is moving from low-level detailed implementation to high-level behavioral specification • Key factor: compiler technology
Hardware Design: Overview • Four levels of abstraction • Circuit / physical level • Diagrams of electronic components • Logic level • Boolean equations • Register transfer level (RTL) • Control state transitions and data transfers, with timing • Synthesis: conversion from RTL to its implementation • System level • Concentrates on behavior • Behavioral synthesis
Hardware Design • Behavioral synthesis is really a compilation problem • Two fundamental tasks • Verification • Implementation • Simulation of hardware is slow
Hardware Description Languages • Verilog and VHDL • Extensions in Verilog • Multi-valued logic: 0, 1, x, z • x = unknown state, z = high impedance • E.g. division by zero produces the x state • Operations on x produce x, so they cannot be executed directly on the host machine • Reactivity • Changes are propagated automatically • “always” statement -> continuous execution • “@” operator -> blocks execution until one of the operands changes in value
Verilog • Reactivity: always @(b or c) a = b + c; • Objects • Each corresponds to a specific area of silicon • Completely separate areas on the chip • Connectivity • Continuous passing of information • Input ports and output ports
Verilog • Connectivity
  module add(a, b, c);
    output a;
    input  b, c;
    integer a, b, c;
    always @(b or c)
      a = b + c;
  endmodule
Verilog • Instantiation • Verilog only allows static instantiation: integer x, y, z; add adder1(x, y, z); • Vector operations • Viewing other data structures as vectors of scalars
Verilog • Advantages • No aliasing • Restrictions on the form of subscripts • The entire hardware design is given to the compiler at one time
Verilog • Disadvantages • Non-procedural continuation semantics • Lack of loops • Loops are implicitly represented by always blocks and the scheduler • Sheer size of designs
Optimizing simulation • Philosophy • Increase the level of abstraction • Opt for fewer details • Inlining modules • HDLs have two properties that make module inlining simpler • The whole design is reachable at one time • Recursion is not permitted
Optimizing simulation • Execution ordering • The order in which statements are executed can have a dramatic effect on efficiency • What is fast in hardware is not necessarily fast in software • Grouping increases performance • Execute blocks in topological order based on the dependence graph of individual array elements • No memory overhead
Dynamic versus Static Scheduling • Dynamic scheduling • Dynamically track changes in values and propagate them • Mimics hardware • Overhead of change checks • Static scheduling • Blindly sweeps through all values for all objects regardless of any changes • No need for change checks
Dynamic versus Static Scheduling • If the circuit is highly active, static scheduling is more suitable • In general, dynamic scheduling guided by static analysis provides the best results (see the sketch below)
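A minimal C sketch of the two policies on a made-up two-block design; this illustrates only the scheduling idea, not Verilog semantics:

  #include <stdbool.h>
  #include <stdio.h>

  /* invented two-block "design":  c = a & b;  d = !c */
  static bool a = true, b = true, c = false, d = false;
  static bool ab_changed = true, c_changed = true;

  static void block_and(void)
  {
      bool n = a && b;
      if (n != c) { c = n; c_changed = true; }   /* record the change */
  }

  static void block_not(void) { d = !c; }

  /* static scheduling: evaluate every block each cycle,
     in topological order, with no change checks */
  static void static_cycle(void)
  {
      block_and();
      block_not();
  }

  /* dynamic scheduling: evaluate a block only if an input changed */
  static void dynamic_cycle(void)
  {
      if (ab_changed) { ab_changed = false; block_and(); }
      if (c_changed)  { c_changed  = false; block_not(); }
  }

  int main(void)
  {
      static_cycle();
      printf("c=%d d=%d\n", c, d);
      a = false; ab_changed = true;     /* an input event */
      dynamic_cycle();
      printf("c=%d d=%d\n", c, d);
      return 0;
  }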
Fusing always blocks • The high cost of change checks motivates fusing always blocks • Must be done carefully: fusing may change the outputs of a design
Vectorizing always blocks • Regrouping low-level operations to recover higher-level abstractions • Vectorizing the bit operations
Two state versus four state • Extra overhead in simulating four-state logic • Few people like hardware that enters unknown states • Two-state logic can be 3-5x faster • Use two-valued logic wherever possible • Finding the parts that can be executed in two-state logic is difficult • Requires interprocedural analysis
Two state versus four state • The test for detecting an unknown is cheap, 2-3 instructions • Check for unknowns, but default quickly to two-state execution (see the sketch below)
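A minimal C sketch of one possible four-state encoding (invented here, not a particular simulator's): each signal carries a value word and a mask word whose bits are set where the signal is x or z, so the unknown test costs a couple of instructions and the common all-known case falls through to two-state code:

  #include <stdint.h>

  typedef struct {
      uint32_t val;   /* 0/1 bits, meaningful where the mask bit is 0 */
      uint32_t mask;  /* 1 where the corresponding bit is x or z      */
  } logic4;

  logic4 and4(logic4 a, logic4 b)
  {
      if ((a.mask | b.mask) == 0) {            /* cheap "any unknowns?" test */
          logic4 r = { a.val & b.val, 0 };     /* fast two-state path        */
          return r;
      }
      /* conservative four-state path: mark a result bit unknown wherever
         either operand bit is unknown (a real simulator refines this,
         e.g. 0 & x is 0; kept simple here) */
      logic4 r = { a.val & b.val, a.mask | b.mask };
      return r;
  }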
Rewriting block conditions
  // before: all of the work is done on the clock edge
  always @(posedge clk) begin
    sum   = op1 ^ op2 ^ c_in;
    c_out = (op1 & op2) | (op2 & c_in) | (c_in & op1);
  end

  // after: the combinational work moves to a block sensitive to its inputs
  always @(op1 or op2 or c_in) begin
    t_sum   = op1 ^ op2 ^ c_in;
    t_c_out = (op1 & op2) | …;
  end
  always @(posedge clk) begin
    sum   = t_sum;
    c_out = t_c_out;
  end
Basic Optimizations • Raise level of abstraction • Constant propagation and dead code elimination • Common subexpression elimination
Synthesis Optimization • The goal is to insert the details • Analogous to standard compilation • Harder than standard compilation • Not targeted towards a single fixed machine • No single goal: minimize cycle time, area, and power consumption
Basic Framework • Selection outweighs scheduling • Analogous to instruction selection for CISC machines • A large body of tree-matching algorithms applies • Constraints are needed
Loop Transformations
  for (i = 0; i < 100; i++) {
    t[i] = 0;
    for (j = 0; j < 3; j++)
      t[i] = t[i] + (a[i-j] >> 2);
  }
  for (i = 0; i < 100; i++) {
    o[i] = 0;
    for (j = 0; j < 100; j++)
      o[i] = o[i] + m[i][j] * t[j];
  }
Loop Transformations • After loop distribution
  for (i = 0; i < 100; i++)
    t[i] = 0;
  for (i = 0; i < 100; i++)
    o[i] = 0;
  for (i = 0; i < 100; i++)
    for (j = 0; j < 3; j++)
      t[i] = t[i] + (a[i-j] >> 2);
  for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
      o[i] = o[i] + m[i][j] * t[j];
Loop Transformations • After loop interchange and fusion
  for (i = 0; i < 100; i++)
    o[i] = 0;
  for (i = 0; i < 100; i++) {
    t[i] = 0;
    for (j = 0; j < 3; j++)
      t[i] = t[i] + (a[i-j] >> 2);
    for (j = 0; j < 100; j++)
      o[j] = o[j] + m[j][i] * t[i];
  }
Loop Transformations • After scalar replacement
  for (i = 0; i < 100; i++)
    o[i] = 0;
  a0 = a[0]; a1 = a[-1]; a2 = a[-2]; a3 = a[-3];
  for (i = 0; i < 100; i++) {
    t = 0;
    t = t + (a0 >> 2) + (a1 >> 2) + (a2 >> 2) + (a3 >> 2);
    a3 = a2; a2 = a1; a1 = a0; a0 = a[i+1];
    for (j = 0; j < 100; j++)
      o[j] = o[j] + m[j][i] * t;
  }
Control and Data Flow • Von Neumann architecture • Data movement among memory and registers • Control flow is encapsulated in the program counter and effected with branches • Synthesized hardware • Data movement among functional units • Control flow determines which functional unit should be active, on what data, at which time step
Control and Data Flow • Wires • Immediate transfer • Latches • Values held throughout one clock cycle • Registers • Like static variables in C • Values held for one or more clock cycles • Memories
Memory Reduction • Memory accesses are slow compared to functional-unit operations • Applicable techniques • Loop interchange • Loop fusion • Scalar replacement • Strip mining • Unroll-and-jam • Prefetching
Summary • Dependence analysis is not limited to Fortran • It has other applications, such as C/C++ compilation and hardware design • Many of these applications are still at an early stage of research