140 likes | 346 Views
Development of an efficient DSP Compiler based on Open64. Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan Brownhill, Sergei Larin, Taylor Simpson Qualcomm Incorporated, San Diego & Austin, USA. Email:
E N D
Development of an efficient DSP Compiler based on Open64 Subrato K. De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan Brownhill, Sergei Larin, Taylor Simpson Qualcomm Incorporated, San Diego & Austin, USA. Email: {sde, adasgupt, sundeepk, tlinth, yzhu, slarin, tsimpson}@qualcomm.com
Introduction • DSP specific extensions to the C/C++ language and support in Open64 for efficient usage of DSP hardware features. • Enhancement of WOPT with impact on DSPs for mobile devices. • Modification of the hyperblock scheduler for DSP with limited predication. • Register pair allocation. • Results at a glance.
DSP specific extensions to C/C++: use of circular load/store intrinsics int *coeffPointer; for(i=0; i < noOfInputSamples; ++i){ sum = 0; coeffPointer = &coeff[0]; for(j =0; j < 4; ++j){ sum += inputSample[i+j] * *coeffPointer++; } outputSample[i] = sum; } int *coeffPointer; int value; for(i=0; i < noOfInputSamples; ++i){ sum = 0; for(j =0; j < 4; ++j){ LOAD_CIRC_INT(value, coeffPointer, 1, 4, &coeff[0]); sum += inputSample[i+j] * value; } outputSample[i] = sum; }
DSP specific extensions to C/C++: use of bit-reversed load/store intrinsics unsigned int ReverseBits (unsigned int index, unsigned int NumBits ){ unsigned int i, rev; for ( i=rev=0; i < NumBits; i++ ){ rev = (rev << 1) | (index & 1); index >>= 1; } return rev; } for ( i=0; i < NumSamples; i++ ){ j = ReverseBits ( i, NumBits ); RealOut[j] = RealIn[i]; ImagOut[j] = ImagIn[i]; } int *realOutPointer; int *imagOutPointer; for ( i=0; i < NumSamples; i++ ){ STORE_BREV_INT(RealIn[i], realOutPointer, 1, NumSamples,&RealOut[0] ). STORE_BREV_INT(ImagIn[i], imagOutPointer, 1, NumSamples, &ImagOut[0] ). }
Design and implementation of circular and bit-reverse load/store intrinsics • Macros expansions. #define LOAD_CIRC_INT(v, p, s, l, a) \ ( (v) = *( p); (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) ) #define STORE_CIRC_INT(v, p, s, l, a) \ ( *( p) = (v); (p) = (int *) circ_update((void *) (p), (s), (l), (void *) (a)) ) • The WHIRL IR: • <result> = load_indirect(pointer); OR store_indirect(pointer) = <source>; • <pointer> = circ_update (pointer, step, CR). Where, CR is the configuration register that defines the circular/bit-reverse buffer based on buffer size and the start address. • The load/store and circular/bit-reverse update combined in the CG phase: • <result, pointer> =load_circular_update(pointer, step, CR), OR • <pointer> =store_circular_update(source, pointer, step, CR) • CRs are considered dedicated temporary names (TNs): • Algorithm for the hoisting of the loop-invariant CR assignment statements, and • Algorithm for CR allocation, when multiple target supports multiple CRs.
DSP specific extensions to the C/C++ : use and optimizations through loop pragmas • Loop pragmas: • #pragma LOOP_TRIP_COUNT_MIN (min) • #pragma LOOP_TRIP_COUNT_MAX(max) • #pragma LOOP_TRIP_COUNT_MODULO(modulo) • #pragma LOOP_UNROLL(N) • loop guard condition can be removed. • loop trip count computation: • can be simplified (e.g., divide replaced by right shifts). • can be replaced with a constant count value. • alternate loop during software pipelining may not be generated. • remainder loops during unrolling (LNO & GC) may not be needed. • Unrolling: the standard benefits • Less loop overhead by reducing the branch/jump/comparison. • Scheduling can be better due to increased number of operations in the loop body.
Register promotion of small structs (structures and unions <= 8 bytes that could fit in register) THE WHIRL IR USING DISTINCT AUXILIARY SYMBOLS FOR EACH FIELD IN THE SMALL STRUCT: e.g., h[1] == st 9; w[0] == st8 typedef union { long long d; int w[2]; short h[4]; char b[8]; } PAIR; PAIR var, *Pi, *Po; var.h[1] = 3; x = var.w[0]; I4INTCONST 3 (0x3) I2STID 2 <st 9> I4I4LDID 0 <st 8> I4STID 0 <st 5> I8I8LDID 0 <st 4> I4INTCONST 3 (0x3) I8CVTL 16 I8COMPOSE_BITS <bofst:16 bsize:16> I8STID 0 <st 4> I8I8LDID 0 <st 4> I8EXTRACT_BITS <bofst:0 bsize:32> I4STID 0 <st 5> EXTRACT_BITS & COMPOSE_BITS TRANSFORMATION REPLACES AUXILIARY SYMBOLS FOR EACH FIELD OF THE SMALL STRUCT MEMBER BY THE UNIQUE FULL-SIZED AUXILIARY REPRESENTING THE FULL STRUCTURE TYPE: e.g., PAIR == st 4
Hyperblock Formation • Hyperblock scheduling can be important on DSPs • Many DSP architectures support predicated instructions • Especially profitable on control code • Restricted form of predication on DSP targets • Not every instruction is predicatable • Architecture decisions mainly due to resource and power savings • Default hyperblock algorithm rejected too many basic blocks due to instructions that aren’t predicatable in the DSP • Most potential hyperblocks on our target contained a few basic blocks • Small hyperblocks exacerbated the problem • Needed a more permissive version of hyperblock formation
B1 B2 B3 B4 B5 Changes to Hyperblock Formation Algorithm • Allowed a basic block (e.g. B3) to contain non-predicatable instructions if it post-dominates the entry block (e.g. B1) of the hyperblock B1: Hyperblock entry blockB3:Block contains non-predicatable instructionsB2, B4, B5: large completely predicatable basic blocks. • Default hyperblock algorithm would reject blocks B1-B5 • Successfully formed hyperblock after our modifications • Modified algorithm led to more basic blocks considered and more hyperblocks formed. Details in paper.
Efficient Register Pair Allocation • Many DSPs support 64-bit register by combining two adjacent 32-bit registers. • Two choices: • Use single 8-byte TN • Handle all complexities in the register allocator. • Direct mapping of data-types from whirl to CGIR. • Simple semantics – 1 result, 2 or more operands. • Use two 4-byte TNs • Need to handle complexities in GRA/LRA to assign adjacent registers. • Need to keep a mapping between the 8-byte WN to its two 4-byte TNs. • Difficult to handle ops producing 8-byte result (two 4-byte TNs). • Difficult to handle ops using 8-byte source (each input is two 4-byte TNs) • Accessing lower / higher part of register pair is trivial.
Motivating example #include <stdio.h> extern long long bar(int a, long long b); extern int baz(long long); int foo(long long a, int c) { long long tmp = c + bar(1, a); int retval = (tmp>>32); if (c > 0) tmp++; return retval + baz(tmp); } • C Example • After Expansion • After EBO [ 11] TN249:8 :- asr_i_p TN242:8 (0x20) ; [ 11] TN252:4 :- tfr TN249:8 ; [ 11] TN1(r0):4 :- add TN248:4 TN252:4 ; [ 11] TN252:4 :- pseudo_pair_high GTN242:8 ; [ 11] GTN1(r0):4 :- add GTN1(r0):4<defopnd> TN252:4 ;
TN252:4 :- pseudo_pair_high GTN242:8 • Pseudo pair instructions are expanded after register allocation • NOP if high of GTN242:8 is allocated the same register as TN252:4. • Copy otherwise. • We want to avoid copy • Make sure both high of GTN242:8 and TN252:4 share the same register. • Problem • LRA and GRA phases work independently. • Live ranges of source TN and result TN can interfere in many ways. • GLOBAL GLOBAL Both handled in GRA • GLOBAL LOCAL Needs GRA LRA interaction • LOCAL GLOBAL Needs GRA LRA interaction • LOCAL LOCAL Both handled in LRA • LRA does not build the interference graph.
The solution strategy • Globalize any local TN if the other TN is global TN252:4 :- pseudo_pair_high GTN242:8 • Need to solve only 2 cases GLOBAL GLOBAL Both handled in GRA LOCAL LOCAL Both handled in LRA • Minor changes in GRA • GRA considers pseudo pair instructions as copy • Attempts to remove copy by preferencing • Some changes in LRA • Do very simple liveness analysis to identify if pseudo pair op TNs can preference i.e. share same register • Maintain a list of preference TN in each TN’s live range • When choosing color for a TN, check if any TN in its preference list is already allocated and get the same register if it is available
Results at a glance • Enhanced Open64 v.s. baseline Open64 improvements • WOPT enhancement: e.g., on modem applications • Cycle count reduced 3% to 40%, • stack reduced as much as 50%, • code size reduced 1% to 2% • Register-pair allocation: • Telecommunication: cycle count reduced 3.91% on average. • kernel codes: cycle count reduced 1.77% on average. • Hyperblock enhancement: (illustrated in terms of the ratio of HBs formed or BBs considered by the “modified algorithm” when compared to the “default algorithm”) • Networking: HBs formed 1.13; BBs considered 1.72; • Telecommunication: HBs formed 1.00; BBs considered 1.49;