Compiler optimizations based on call-graph flattening
Carlo Alberto Ferraris
Professor Silvano Rivoira
Master of Science in Telecommunication Engineering
Third School of Engineering: Information Technology
Politecnico di Torino
July 6th, 2011
Increasing complexities. Everyday objects are becoming multi-purpose, networked, interoperable, customizable, reusable, upgradeable.
Increasing complexities. Everyday objects are becoming more and more complex.
Increasing complexities. The software that runs smart objects is becoming more and more complex.
Diminishing resources. Systems have to be resource-efficient.
Diminishing resources. Systems have to be resource-efficient. Resources come in many different flavours.
Diminishing resources. Systems have to be resource-efficient. Resources come in many different flavours. Power: especially valuable in battery-powered scenarios such as mobile, sensor, and third-world applications.
Diminishing resources. Systems have to be resource-efficient. Resources come in many different flavours. Power, density: a critical factor in data-center and product design.
Diminishing resources. Systems have to be resource-efficient. Resources come in many different flavours. Power, density, computational: CPU, RAM, storage, etc. often grow more slowly than the potential applications demand.
Diminishing resources. Systems have to be resource-efficient. Resources come in many different flavours. Power, density, computational, development: development time and costs should be as low as possible for short time-to-market (TTM) and profitability.
Diminishing resources. Systems have to be resource-efficient. Resources come in many non-orthogonal flavours: power, density, computational, development.
Abstractions. We need to modularize and hide the complexity: operating systems, frameworks, libraries, managed languages, virtual machines, …
Abstractions. We need to modularize and hide the complexity: operating systems, frameworks, libraries, managed languages, virtual machines, … All of this comes with a cost: generic solutions are generally less efficient than ad-hoc ones.
Abstractions. We need to modularize and hide the complexity. Palm webOS: a user interface running on HTML+CSS+JavaScript.
Abstractions. We need to modularize and hide the complexity. JavaScript PC emulator: running Linux inside a browser.
Optimizations. We need to modularize and hide the complexity without sacrificing performance.
Optimizations. We need to modularize and hide the complexity without sacrificing performance. Compiler optimizations trade off compilation time against development and execution time.
Vestigial abstractions. The natural subdivision of code into functions is maintained in the compiler and all the way down to the processor. Each function is self-contained, with strict conventions regulating how it relates to other functions.
Vestigial abstractions. Processors don't care about functions; respecting the conventions is just additional work: push the contents of the registers and the return address on the stack, jump to the callee; execute the callee, jump to the return address; restore the registers from the stack.
Vestigial abstractions. Many optimizations are simply not feasible when functions are present:

int replace(int *ptr, int value) {
  int tmp = *ptr;
  *ptr = value;
  return tmp;
}
int A(int *ptr, int value) {
  return replace(ptr, value);
}
int B(int *ptr, int value) {
  replace(ptr, value);
  return value;
}

void *malloc(size_t size) {
  void *ret;
  // [various checks]
  ret = imalloc(size);
  if (ret == NULL)
    errno = ENOMEM;
  return ret;
}
// ...
type *ptr = malloc(size);
if (ptr == NULL)
  return NOT_ENOUGH_MEMORY;
// ...
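For instance (a hand-written sketch of the missed opportunity, not compiler output): B ignores the return value of replace(), so once the call boundary disappears the load of the old *ptr is dead and can be deleted:

// What a contextual optimization could reduce B to after the
// boundary with replace() is dissolved (illustrative only):
int B_optimized(int *ptr, int value) {
  *ptr = value;   // the load "tmp = *ptr" was dead: B never used it
  return value;
}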
Vestigial abstractions. Many optimizations are simply not feasible when functions are present:

interpreter_setup();
while (opcode = get_next_instruction())
  interpreter_step(opcode);
interpreter_shutdown();

function interpreter_step(opcode) {
  switch (opcode) {
    case opcode_instruction_A:
      execute_instruction_A();
      break;
    case opcode_instruction_B:
      execute_instruction_B();
      break;
    // ...
    default:
      abort("illegal opcode!");
  }
}
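This is the overhead a threaded interpreter removes by hand. A minimal sketch of where flattening the call and the switch together can take the loop above (uses the GCC/Clang computed-goto extension; the opcode encoding and handler names are my assumptions):

enum { OP_A, OP_B, OP_HALT };   /* hypothetical opcode encoding */

void interpreter_run(const unsigned char *code) {
    /* one label per opcode: dispatch becomes a single indirect jump */
    static void *const handlers[] = { &&op_A, &&op_B, &&op_halt };
    const unsigned char *pc = code;
    goto *handlers[*pc++];       /* dispatch the first opcode */
op_A:
    /* ... execute instruction A ... */
    goto *handlers[*pc++];       /* jump straight to the next handler */
op_B:
    /* ... execute instruction B ... */
    goto *handlers[*pc++];
op_halt:
    return;
}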
Vestigial abstractions. Many optimization efforts are directed at working around the overhead caused by functions. Inlining clones the body of the callee in the caller; it is the optimal solution w.r.t. calling overhead but causes code size increase and cache pollution, so it is useful only on small, hot functions.
Call-graph flattening. What if we dismiss functions during early compilation…
Call-graph flattening. What if we dismiss functions during early compilation and track the control flow explicitly instead?
Call-graph flattening. We get most benefits of inlining, including the ability to perform contextual code optimizations, without the code size issues.
Call-graph flattening. We get most benefits of inlining, including the ability to perform contextual code optimizations, without the code size issues. Where's the catch?
Call-graph flattening. The load on the compiler increases greatly, both directly due to CGF itself and indirectly due to subsequent optimizations. Worst-case complexity (number of CFG edges) is quadratic w.r.t. the number of call sites being transformed, since each indirect jump can potentially target any call-site continuation (heuristics may help).
Call-graph flattening. During CGF we need to statically keep track of all live values across all call sites in all functions. A value is alive if it will be needed in subsequent instructions:

A = 5, B = 9, C = 0; // live: A, B
C = sqrt(B); // live: A, C
return A + C;
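Expanded into a complete function (my reconstruction of the fragment above, using sqrt from <math.h>):

#include <math.h>

int f(void) {
    int A = 5, B = 9, C = 0;  /* live after this line: A, B        */
                              /* (C is dead: overwritten below)    */
    C = sqrt(B);              /* B dies here; live across the call */
                              /* site: A; live afterwards: A, C    */
    return A + C;             /* 5 + 3 = 8                         */
}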
Call-graph flattening. Basically, the compiler has to statically emulate ahead-of-time all the possible stack usages of the program. This has already been done on microcontrollers and resulted in a 23% decrease of stack usage (and a 5% performance increase).
Call-graph flattening. The indirect cause of increased compiler load comes from standard optimizations that are run after CGF. CGF does not create new branches (each call and return instruction is turned into exactly one jump), but other optimizations can.
Call-graph flattening. The indirect cause of increased compiler load comes from standard optimizations that are run after CGF. Most optimizations are designed to operate on small functions with a limited number of branches.
Call-graph flattening. Many possible application scenarios besides inlining.
Call-graph flattening. Many possible application scenarios besides inlining. Code motion: move instructions across function boundaries; avoid unneeded computations, alleviate register pressure, improve cache locality.
Call-graph flattening. Many possible application scenarios besides inlining. Code motion, macro compression: find similar code sequences in different parts of the code and merge them; reduces code size and cache pollution.
Call-graph flattening. Many possible application scenarios besides inlining. Code motion, macro compression, nonlinear CF: CGF natively supports nonlinear control flows; almost-zero-cost exception handling (EH) and coroutines.
Call-graph flattening. Many possible application scenarios besides inlining. Code motion, macro compression, nonlinear CF, stackless execution: no runtime stack is needed in fully-flattened programs (see the coroutine sketch after this list).
Call-graph flattening. Many possible application scenarios besides inlining. Code motion, macro compression, nonlinear CF, stackless execution, stack protection: effective stack poisoning attacks become much harder or even impossible.
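To make the nonlinear-control-flow and stackless-execution points concrete: a minimal stackless coroutine sketch in the protothreads style (my own example, not part of the pilot implementation). The resume point lives in an explicit state variable instead of a stack frame, which is exactly the kind of control flow a flattened program expresses natively:

/* Yields 0, 1, 2, then finishes; its "program counter" is the
   explicit state variable, not a return address on a stack. */
int counter_coroutine(int *state, int *value) {
    switch (*state) {
    case 0:
        for (*value = 0; *value < 3; (*value)++) {
            *state = 1;
            return 1;        /* yield a value to the caller  */
    case 1:;                 /* next call resumes right here */
        }
    }
    *state = 0;
    return 0;                /* finished */
}

A caller drives it with an ordinary loop: int s = 0, v; while (counter_coroutine(&s, &v)) { /* consume v */ } — no stack beyond the caller's own is ever needed.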
Implementation. To test whether CGF is applicable also to complex architectures and to validate some of the ideas presented in the thesis, a pilot implementation was written against the open-source LLVM compiler framework.
Implementation. It operates on LLVM-IR and is host- and target-architecture agnostic; roughly 800 lines of C++ code in 4 classes. The pilot implementation cannot flatten recursive, indirect, or variadic call sites; they can be used anyway.
Implementation.
1. Enumerate suitable functions
2. Enumerate suitable call sites (and their live values)
3. Create the dispatch function and populate it with code
4. Transform call sites
5. Propagate live values
6. Remove original functions or create wrappers
Examples.

int a(int n) {
  return n + 1;
}
int b(int n) {
  int i;
  for (i = 0; i < 10000; i++)
    n = a(n);
  return n;
}
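Sketched in C, the flattened form of this example could look as follows (hand-written with the GCC/Clang computed-goto extension to mirror what the pass emits as LLVM-IR; not actual output):

int b_flat(int n) {
    void *ret_addr;              /* explicit "return address" for a()   */
    int i = 0;
loop:
    if (i >= 10000)
        return n;
    ret_addr = &&after_call;     /* remember where a() must resume      */
    goto a_body;                 /* the call becomes a plain jump       */
a_body:
    n = n + 1;                   /* body of a()                         */
    goto *ret_addr;              /* the return becomes an indirect jump */
after_call:
    i++;
    goto loop;
}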
Examples.

int a(int n) {
  return n + 1;
}
int b(int n) {
  n = a(n);
  n = a(n);
  n = a(n);
  n = a(n);
  return n;
}
.type .Ldispatch,@function
.Ldispatch:
  movl $.Ltmp4, %eax    # store the return dispatcher of a in rax
  jmpq *%rdi            # jump to the requested outer dispatcher
.Ltmp2:                 # outer dispatcher of b
  movl $.LBB2_4, %eax   # store the address of %10
.Ltmp0:                 # outer dispatcher of a
  movl (%rsi), %ecx     # load the argument n in ecx
  jmp .LBB2_4
.Ltmp8:                 # block %17
  movl $.Ltmp6, %eax
  jmp .LBB2_4
.Ltmp6:                 # block %18
  movl $.Ltmp7, %eax
.LBB2_4:                # block %10
  movq %rax, %rsi
  incl %ecx             # n = n + 1
  movl $.Ltmp8, %eax
  jmpq *%rsi            # indirectbr
.Ltmp4:                 # return dispatcher of a
  movl %ecx, (%rdx)     # store the return value in ecx through
  ret                   # pointer rdx and return to the wrapper
.Ltmp7:                 # return dispatcher of b
  movl %ecx, (%rdx)
  ret