IC-Parc, William Penney Laboratory
Towards Dynamic Instrumentation for Performance Optimisation
Andy Cheadle
Overview
- Motivation
- Profiling Techniques
- The Aspect Oriented Paradigm
- Instrumentation Profiling in ECLiPSe
- New ECLiPSe Tools: instrument, instprofile, modeanalyser, papi
- Ongoing Work
Motivation
ECLiPSe platform developer: how does ECLiPSe perform on stock hardware?
- Cache utilisation (instruction and data)
- Emulator instruction profiling
- Evaluation of the implementation of core runtime-system services
ECLiPSe application developer: how is my application performing?
- Where is the majority of the runtime spent?
- Why does my program spend so much time garbage collecting?
- More generally, what is the pattern of resource usage over program execution?
Optimising at the search level is not enough: ease of bottleneck identification is key to program optimisation!
Profiling Techniques
Sample-based profiling: a measurement is recorded at a (fixed) interval.
- Low overhead, non-intrusive
- Relatively coarse-grained
- Indicates the trend of resource usage over time
- Misses spikes in resource usage that occur between samples
- Hard to ascribe a measurement accurately to a code location or exact instruction
Instrumentation-based profiling: measurement code is inserted around mutator code fragments.
- Greater overhead: code insertion can be highly intrusive
- Granularity is determined by the size of the code fragment
- Accurate profiling of the code fragment: captures all events occurring within it
- Measurements are ascribed to the exact mutator code location by a callsite identifier
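The trade-off above can be sketched in a few lines of Python. This is a toy model, not ECLiPSe code: `sample_profile`, `EventCounter`, `instrument` and the callsite strings are hypothetical names. Sampling a gauge every few ticks misses a short spike, whereas reading a monotonic counter before and after each fragment charges every event to the callsite that caused it.

```python
from collections import defaultdict

def sample_profile(trace, period):
    """Sample-based profiling: read a gauge (e.g. stack usage) once every `period` ticks."""
    return [trace[t] for t in range(0, len(trace), period)]

class EventCounter:
    """A monotonic event counter, e.g. the number of garbage collections so far."""
    def __init__(self):
        self.count = 0

def instrument(counter, profile, site_id, fragment):
    """Instrumentation-based profiling: read the counter before and after `fragment`
    and ascribe the difference to `site_id`, so no event inside the fragment is missed."""
    before = counter.count
    result = fragment()
    profile[site_id] += counter.count - before
    return result

# Sampling: a spike at tick 4 falls between samples taken at ticks 0, 3, 6, 9.
usage = [10, 10, 10, 10, 90, 10, 10, 10, 10, 10]
samples = sample_profile(usage, 3)
print(max(samples), max(usage))   # 10 90

# Instrumentation: every event is charged to the callsite that caused it.
counter = EventCounter()
profile = defaultdict(int)

def two_gcs():
    counter.count += 2

def one_gc():
    counter.count += 1

instrument(counter, profile, "p/2 site 1", two_gcs)
instrument(counter, profile, "q/1 site 2", one_gc)
instrument(counter, profile, "p/2 site 1", two_gcs)
print(dict(profile))   # {'p/2 site 1': 4, 'q/1 site 2': 1}
```

The price of the second approach is the extra measurement code executed on every call, which is exactly the intrusiveness noted above.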
The Aspect Oriented Paradigm
Aspect Oriented Programming (AOP): software engineering to achieve separation of concerns.
- Targets cross-cutting concerns
- Composition filters and aspects
Languages
- HyperJ, AspectJ
- Domain-specific languages: Template Haskell, MetaOCaml
- Old hat to logic programmers: metaprogramming!
Applications
- Policies of distributed computing (security, deployment)
- Logging, tracing and error reporting
- Legacy OO code migration
- Instrumentation-based code profiling
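To make "separation of concerns" concrete, here is a minimal Python sketch of an aspect weaver, not AspectJ syntax and not how ECLiPSe does it. The tracing concern lives entirely in the `before`/`after` advice; a name-based `pointcut` (an assumption made for brevity) selects which functions in a namespace get wrapped, and the base functions are never edited.

```python
import functools

def weave(namespace, pointcut, before, after):
    """Replace every function in `namespace` whose name matches `pointcut`
    with a wrapper that runs `before`/`after` advice around it."""
    for name, fn in list(namespace.items()):
        if callable(fn) and pointcut(name):
            namespace[name] = advise(name, fn, before, after)

def advise(name, fn, before, after):
    """Build the wrapper that executes the advice around one function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        before(name, args)
        result = fn(*args, **kwargs)
        after(name, result)
        return result
    return wrapper

# The "base" code knows nothing about tracing.
ops = {"add": lambda a, b: a + b, "neg": lambda a: -a, "helper": lambda: None}

log = []
weave(ops, pointcut=lambda name: name != "helper",
      before=lambda name, args: log.append(("call", name, args)),
      after=lambda name, result: log.append(("exit", name, result)))

ops["add"](1, 2)
ops["neg"](5)
ops["helper"]()      # not matched by the pointcut: no advice runs
print(log)
```

In a logic language the same effect is obtained by rewriting clauses as terms at compile time, which is why the slide calls it old hat for metaprogrammers.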
Instrumentation Profiling in ECLiPSe
The challenges of the imperative world, plus:
- Meta-called arguments, i.e. meta-predicates
- Resatisfiability
- Cut (!!!!)
Box model of execution (figure): four ports, with Call and Redo entering the box and Exit and Fail leaving it.
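The box model, and why resatisfiability and cut complicate instrumentation, can be modelled with Python generators standing in for backtracking (an illustration only; `traced` and `member` are hypothetical names). Each solution passes through Exit, asking for another passes through Redo, and exhaustion produces Fail. If the caller abandons the generator early, which is the analogue of a cut, the box is discarded without ever reaching Redo or Fail, so "end" instrumentation placed there never runs.

```python
def traced(name, goal, trace):
    """Wrap a generator-based 'predicate' so each port of its box is logged."""
    def box(*args):
        trace.append(("call", name))                # Call: first entry into the box
        for solution in goal(*args):
            trace.append(("exit", name, solution))  # Exit: a solution is produced
            yield solution
            trace.append(("redo", name))            # Redo: the caller backtracks into the box
        trace.append(("fail", name))                # Fail: no (more) solutions
    return box

def member(xs):
    """member/2 with the list as input: one solution per element."""
    yield from xs

# Full backtracking: the caller asks for every solution.
full = []
print(list(traced("member", member, full)([1, 2])))  # [1, 2]
print(full)  # call, exit 1, redo, exit 2, redo, fail

# A cut after the first solution: the caller discards the generator early.
cut = []
g = traced("member", member, cut)([1, 2])
next(g)
g.close()    # no redo or fail port is ever reached
print(cut)   # [('call', 'member'), ('exit', 'member', 1)]
```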
Kernel Enhancements
Fail events
Anonymous events: event_create(+Goal, -EventHandle)
    [eclipse 1]: event_create(writeln('Goodbye cruel world!'), Event),
                 writeln('Hello world!'),
                 event(Event).
    Hello world!
    Goodbye cruel world!
    Event = 'EVENT'(16'503f0238)
    Yes (0.00s cpu)
- Garbage collection of embedded handles
Timeout library
- Supports nested timeouts (time-aware search)
- timeout/3, timeout/7, call_timeout_safe/1
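One way to picture a fail event is as an entry on the trail that runs code when backtracking unwinds past it. The Python sketch below is a conceptual model under that assumption, not ECLiPSe's kernel implementation: a choicepoint records the trail height, and backtracking pops entries down to that height, firing each event in last-in, first-out order.

```python
class Trail:
    """Toy trail: entries are callbacks ('fail events') run when unwound."""
    def __init__(self):
        self.entries = []

    def mark(self):
        """A choicepoint remembers the current trail height."""
        return len(self.entries)

    def push_fail_event(self, callback):
        self.entries.append(callback)

    def backtrack_to(self, mark):
        """Unwind to a choicepoint, running each fail event as it is popped."""
        while len(self.entries) > mark:
            self.entries.pop()()

log = []
trail = Trail()
outer = trail.mark()
trail.push_fail_event(lambda: log.append("fail event: clause of p/1"))
inner = trail.mark()
trail.push_fail_event(lambda: log.append("fail event: subgoal q/0"))
trail.backtrack_to(inner)   # retrying inside p/1: only q/0's event fires
trail.backtrack_to(outer)   # p/1 itself fails
print(log)  # ['fail event: subgoal q/0', 'fail event: clause of p/1']
```

This also shows why each instrumented fragment costs a trail frame, a point taken up again under Ongoing Work.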
:- lib(instrument)
A tool for instrumenting predicate definitions with user-defined predicates. Similar in concept to AO instrumentation, but aspects are specified as templates rather than with language constructs:
    module:foo/n = itemplate with [..]
- The itemplate structure has arity 25 (!), 19 fields of which define instrumentation points:
  - clause, block, subgoal and call, each with *_start, *_end, *_fail and *_redo points
  - fact and inbetween, each instrumented by a single predicate
- Example:
    itemplate with [clause_start:(moda:clstart/2), clause_end:(modb:clend/2)]
    clstart(SiteId, AuxVar) :- …
- every_module is used as a wildcard module qualifier
- Fields may be specified as inherited from a global template
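A Python model of template-driven instrumentation, including inheritance from a global template, follows. Everything here is hypothetical (the `INHERIT` marker, `resolve`, `run_clause`); it only mirrors the slide's idea that a template maps instrumentation points such as clause_start and clause_end to handler predicates taking a SiteId and an auxiliary variable. Treating AuxVar as state shared between the start and end points is an assumption.

```python
INHERIT = object()   # hypothetical marker: "use the global template's value for this field"

def resolve(template, global_template):
    """Fill inherited fields of a per-predicate template from the global one."""
    return {field: (global_template.get(field) if handler is INHERIT else handler)
            for field, handler in template.items()}

def run_clause(site_id, body, template):
    """Run one clause body with the template's clause_start/clause_end points.
    `aux` plays the role of AuxVar: state passed from the start to the end point."""
    aux = {}
    start, end = template.get("clause_start"), template.get("clause_end")
    if start:
        start(site_id, aux)
    result = body()
    if end:
        end(site_id, aux)
    return result

log = []
global_template = {"clause_end": lambda site, aux: log.append(("end", site, aux["n"]))}
foo_template = resolve(
    {"clause_start": lambda site, aux: aux.update(n=len(log)),
     "clause_end": INHERIT},
    global_template)

result = run_clause(7, lambda: 6 * 7, foo_template)
print(result)  # 42
print(log)     # [('end', 7, 0)]
```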
:- lib(instrument)
- Meta-predicates can have templates specified for their arguments:
    findall/3 = itemplate with […, meta_args:[_, ITemplateArg2, _], …]
- The exclude field prevents instrumentation from being applied to calls/subgoals within a specific predicate, or globally via the global template
- See also the instrument_recursive option of instrument/3
- The result field names a predicate that is called during pretty-printing to insert results into the HTML output:
    itemplate with […, result:(mod:iresult/5), …]
- Instrumentation may be enabled and disabled at runtime (this facilitates bottleneck search):
  - the assert field specifies whether instrumentation is dynamic
  - predicate calls are made via an extra level of indirection
  - the body of disabled instrumentation is replaced by true
  - there is a compile_term/1 invocation overhead at runtime
So far the instrumentation has been passive!
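The runtime enable/disable mechanism described above can be sketched in Python as a probe called through one level of indirection. This is an illustrative model with hypothetical names (`Probe`, `clause_body`), not the library's code: disabling swaps the target for a no-op, the analogue of replacing the instrumentation body with true, while the instrumented clause itself is never recompiled.

```python
class Probe:
    """Instrumentation point reached through one level of indirection."""
    def __init__(self, body):
        self._body = body      # the real instrumentation
        self.target = body     # what a call currently dispatches to

    def enable(self):
        self.target = self._body

    def disable(self):
        self.target = lambda *args: None   # stands in for a body of `true`

    def __call__(self, *args):
        return self.target(*args)

hits = []
probe = Probe(lambda site: hits.append(site))

def clause_body(x):
    probe("p/1 clause 1")   # the call stays in the code; only its target changes
    return x + 1

clause_body(1)
probe.disable()
clause_body(2)
probe.enable()
clause_body(3)
print(hits)  # ['p/1 clause 1', 'p/1 clause 1']
```

Even when disabled, each call still pays for the indirection, which corresponds to the residual runtime overhead listed above.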
:- lib(instrument)
The tool is also a compile-time code weaver!
    itemplate with […, code_weaver:(mod:iweaver/6), …]
During compilation, iweaver is invoked and passed the block of code undergoing compilation. Its arguments are:
- File: the file undergoing compilation
- Code: the block of code being processed
- Type: clause, head, body, fact, variable, conjunction, disjunction, conditional or goal
- WeavedCode: the code produced by iweaver for insertion
- Mode: compile or print (pretty-printing)
- Module
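A Python model of a code-weaver callback is shown below, with clause bodies represented as nested tuples where `(',', A, B)` is a conjunction. The names are hypothetical, `count_call` is an invented predicate, and `kind` is used instead of `Type` to avoid shadowing Python's `type`. As in the slide, the weaver receives the file, the block, its kind, the mode and the module; here the woven code is the return value rather than an output argument.

```python
def iweaver(file, code, kind, mode, module):
    """Hypothetical weaver: in compile mode, prefix every goal with a call to
    count_call/1 (an invented counting predicate); other block kinds pass through.
    The return value plays the role of the WeavedCode argument."""
    if kind == "goal" and mode == "compile":
        return (",", ("count_call", code[0]), code)
    return code

def weave_body(file, body, weaver, mode="compile", module="user"):
    """Walk a clause body (nested tuples; (',', A, B) is a conjunction) and
    offer each block to the weaver, innermost blocks first."""
    if body[0] == ",":
        inner = (",", weave_body(file, body[1], weaver, mode, module),
                      weave_body(file, body[2], weaver, mode, module))
        return weaver(file, inner, "conjunction", mode, module)
    return weaver(file, body, "goal", mode, module)

body = (",", ("foo", "X"), ("bar", "X"))     # foo(X), bar(X)
print(weave_body("demo.pl", body, iweaver))
# (',', (',', ('count_call', 'foo'), ('foo', 'X')), (',', ('count_call', 'bar'), ('bar', 'X')))
print(weave_body("demo.pl", body, iweaver, mode="print") == body)   # True
```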
:- lib(instprofile)
An instrumentation- and sampling-based statistics profiler.
- The two complementary mechanisms create traces that are currently graphed and analysed offline
- The metrics currently available are those of statistics/2; to be extended to user-defined statistics and IC's ic_stat_get/1
Sampling profiler
- Low-overhead sampling profiles indicate resource-usage trends over time
- Multiple enabled profiles with different time periods are supported
Example usage (AndyE's Capacitated Shortest Path):
    ?- statsample("MemoryProfile", 5, [global_stack_used, trail_stack_used, gc_number, gc_collected], 'memory.dat').
    ?- statsample_control("MemoryProfile", on).
    ?- go(cut, d_70_7, 'solution.ecl').
    ?- statsample_control("MemoryProfile", off).
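The sampling side can be modelled in Python with several named profiles, each with its own period and on/off switch, driven by a simulated clock so the behaviour is deterministic. `SampleProfiler` and its methods are hypothetical analogues of statsample/4 and statsample_control/2, not their implementation; rows are kept in memory instead of being written to a file such as 'memory.dat', and periods are in abstract ticks.

```python
class SampleProfiler:
    """Toy sampling profiler driven by a simulated clock."""
    def __init__(self, read_metric):
        self.read_metric = read_metric   # metric name -> current value
        self.profiles = {}

    def statsample(self, name, period, metrics):
        """Register a named profile (rows kept in memory, not written to a file)."""
        self.profiles[name] = {"period": period, "metrics": metrics,
                               "on": False, "due": 0, "rows": []}

    def control(self, name, on):
        """Switch a profile on or off."""
        self.profiles[name]["on"] = on

    def tick(self, now):
        """Called by the clock: sample every enabled profile that is due."""
        for p in self.profiles.values():
            if p["on"] and now >= p["due"]:
                p["rows"].append((now, [self.read_metric(m) for m in p["metrics"]]))
                p["due"] = now + p["period"]

state = {"global_stack_used": 0, "gc_number": 0}
prof = SampleProfiler(state.get)
prof.statsample("MemoryProfile", 5, ["global_stack_used", "gc_number"])
prof.statsample("GCProfile", 2, ["gc_number"])
prof.control("MemoryProfile", True)
prof.control("GCProfile", True)

for t in range(11):                        # the "program" runs for 11 ticks
    state["global_stack_used"] = 100 * t
    state["gc_number"] = t // 3
    prof.tick(t)

print(prof.profiles["MemoryProfile"]["rows"])
# [(0, [0, 0]), (5, [500, 1]), (10, [1000, 3])]
```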
:- lib(instprofile)
Instrumentation-based profiler
- Accurate profiling of code fragments, tied to a callsite identifier
- Higher overhead, more intrusive
- clause, block, subgoal and call instrumentation points
    ?- statprofile('queens_gfc.pl', [global_stack_used, trail_stack_used]).
    ?- my_query(X, Y, Z).
    ?- result@my_program_module.
- Delta values for the metrics across each code fragment can be recorded to file (open_delta_file/1, close_delta_file/1, delta_results:on)
- Aggregate results can be dumped to a trace file using aggregate_result/1
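The distinction between per-execution deltas and aggregate results can be sketched in Python with a context manager placed around each code fragment. `StatProfile`, `site` and the callsite strings are hypothetical; the `deltas` list stands in for a delta file and `aggregate` for what aggregate_result/1 would dump. In this sketch an exception escaping the fragment skips the recording step, a crude analogue of the fragment failing.

```python
from collections import defaultdict
from contextlib import contextmanager

class StatProfile:
    """Toy instrumentation profiler: per-callsite deltas of several metrics."""
    def __init__(self, read_metric, metrics):
        self.read_metric = read_metric
        self.metrics = metrics
        self.aggregate = defaultdict(lambda: dict.fromkeys(["count", *metrics], 0))
        self.deltas = []   # one record per execution, like a delta file

    @contextmanager
    def site(self, site_id):
        """Measurement code around one fragment, keyed by its callsite id."""
        before = {m: self.read_metric(m) for m in self.metrics}
        yield
        delta = {m: self.read_metric(m) - before[m] for m in self.metrics}
        totals = self.aggregate[site_id]
        totals["count"] += 1
        for m, d in delta.items():
            totals[m] += d
        self.deltas.append((site_id, delta))

stacks = {"global_stack_used": 0, "trail_stack_used": 0}
prof = StatProfile(stacks.get, ["global_stack_used", "trail_stack_used"])

for _ in range(3):
    with prof.site("queens/2 clause 1"):
        stacks["global_stack_used"] += 40
        stacks["trail_stack_used"] += 8
with prof.site("safe/1 clause 2"):
    stacks["global_stack_used"] += 16

print(dict(prof.aggregate))
print(len(prof.deltas))   # 4
```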
:- lib(modeanalyser)
Instrumentation-based mode analyser
- Suggests mode/1 directives for predicate definitions: '++' ground, '+' nonvar, '-' fresh variable, '?' unknown
- With mode declarations the compiler generates more compact and/or faster code
- Static (compile-time) analyses are slower and less capable in a constraint (coroutined) system
Note: an incorrect mode specifier can result in incorrect or undefined behaviour. For the '-' mode specifier the analyser cannot detect aliased variables, so check these manually.
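The dynamic approach can be modelled in Python: record how instantiated each argument is at every observed call, then merge the observations per argument. The term representation (`Var`, tuples for compound terms) and the merge rule ('++' with '+' gives '+', any other disagreement gives '?') are assumptions for illustration, not the analyser's actual algorithm. The last example reproduces the caveat above: two occurrences of the same fresh variable are both reported as '-'.

```python
class Var:
    """An unbound logic variable (hypothetical term representation)."""

def is_ground(term):
    """Ground = contains no unbound variables. Compound terms are tuples (functor, arg1, ...)."""
    if isinstance(term, Var):
        return False
    if isinstance(term, tuple):
        return all(is_ground(arg) for arg in term[1:])
    return True

def classify(term):
    """Instantiation of one argument at call time."""
    if isinstance(term, Var):
        return "-"          # fresh variable; aliasing with other arguments is NOT detected
    return "++" if is_ground(term) else "+"

def combine(a, b):
    """Merge two observations of the same argument into the most specific consistent mode."""
    if a == b:
        return a
    if {a, b} == {"++", "+"}:
        return "+"          # ground is a special case of nonvar
    return "?"

def suggest_modes(calls):
    """Suggest one mode per argument from the argument tuples seen at each call."""
    modes = None
    for args in calls:
        observed = [classify(arg) for arg in args]
        modes = observed if modes is None else [combine(m, o) for m, o in zip(modes, observed)]
    return modes

print(suggest_modes([(8, Var()), (4, Var())]))             # ['++', '-']
print(suggest_modes([(("[|]", 1, Var()),),
                     (("[|]", 2, ("[|]", 5, "[]")),)]))    # ['+']
print(suggest_modes([(1, Var()), (Var(), Var())]))         # ['?', '-']
X = Var()
print(suggest_modes([(X, X)]))   # ['-', '-'] although both arguments are the same variable
```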
:- lib(modeanalyser)
    [eclipse 1]: mode_analyser:analyse('queens_gfc.pl').
    queens_gfc.pl compiled traceable 13920 bytes in 0.10 seconds
    [eclipse 2]: nqueens(8, Qs).
    Qs = [1, 5, 8, 6, 3, 7, 2, 4]
    Yes (0.00s cpu, solution 1, maybe more) ?
    ...
    [eclipse 5]: mode_analyser:result.
    nqueens(++, -)
    noattack(?, ?)
    safe(+)
    noattack(?, +, ++)
mode_analyser:result([verbose:on]) is very useful!
:- lib(papi)
PAPI is a specification of a cross-platform interface to the hardware performance counters of modern microprocessors.
- Standard set of events for application performance analysis
- Both high- and low-level sets of routines for accessing the counters
- L<n> instruction and data cache statistics
- Instruction and cycle counts (loads, stores, FPU, branches, etc.)
- Microsecond timers
- Per-process counters (from processor-wide registers)
Example use of the high-level interface (L1 data cache):
    papi_start_counters([papi_l1_dch, papi_l1_dca], 2),
    garbage_collect,
    papi_stop_counters([L1DCH, L1DCA], 2)
Also available: papi_read_counters([L1DCH, L1DCA], 2) and papi_accum_counters([L1DCH, L1DCA], 2).
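The two counters in the example are usually combined into an L1 data-cache hit ratio, hits divided by accesses. The Python sketch below simulates the classic PAPI high-level semantics, where reading returns the current counts and resets them and accumulating adds them into a caller's array and resets them. It is not lib(papi): `HighLevelCounters`, `simulate` and the counts are made-up stand-ins for real hardware events.

```python
class HighLevelCounters:
    """Simulated counters with the classic PAPI high-level read/accum semantics."""
    def __init__(self, n_events):
        self.live = [0] * n_events

    def simulate(self, *increments):
        """Stand-in for the hardware: add event counts as code runs."""
        for i, inc in enumerate(increments):
            self.live[i] += inc

    def read(self):
        """Like PAPI_read_counters: return the current counts and reset them."""
        values, self.live = self.live, [0] * len(self.live)
        return values

    def accum(self, totals):
        """Like PAPI_accum_counters: add the current counts into `totals` and reset."""
        for i, value in enumerate(self.live):
            totals[i] += value
        self.live = [0] * len(self.live)

def l1_data_hit_ratio(dch, dca):
    """L1 data-cache hit ratio: hits (PAPI_L1_DCH) / accesses (PAPI_L1_DCA)."""
    if dca == 0:
        raise ValueError("no L1 data-cache accesses recorded")
    return dch / dca

counters = HighLevelCounters(2)                     # [L1 DCH, L1 DCA]
totals = [0, 0]
for hits, accesses in [(450, 500), (1350, 1500)]:   # made-up counts for two runs
    counters.simulate(hits, accesses)               # e.g. one garbage_collect each
    counters.accum(totals)
print(totals)                       # [1800, 2000]
print(l1_data_hit_ratio(*totals))   # 0.9

counters.simulate(10, 20)
print(counters.read())              # [10, 20]
```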
Ongoing Work
- Instrumentation of included files and modules
- Accurate cost model for instprofile and papi
- Reduction in book-keeping overhead:
  - avoid boxing/unboxing values during aggregation of results
  - reduce stack usage of fail-event trail frames (via merging?)
- Is the profiling strategy for recursive predicates sufficient? Tail recursion and last-call optimisation must be preserved
- Dynamic instrumentation engine:
  - enable/disable instrumentation at a specific callsite
  - drive instrumentation through the call graph to locate bottlenecks
- Visualisation / graphing support