Static Identification of Delinquent Loads V.M. Panait Sasturkar W.-F. Fong
Agenda • Introduction • Related Work • Delinquent Loads • Framework • Address Patterns, Decision Criteria • The heuristic: types of classes, computing the weights, final classes • Results
Introduction • Cache misses – one of the major current bottlenecks in performance • One approach: prefetch; but prefetch what? Can’t prefetch everything… • Few loads are really “bad”: these are the “delinquent loads” • This paper: classification of address patterns in the load instructions
Introduction • Done after code generation, but before runtime • Singled out 10% of all loads causing over 90% of the misses in 18 SPEC benchmarks • Gets even better combined with basic block profiling: 1.3% of loads covering over 80% of the misses
Related Work • BDH method: classify loads based on the following criteria: • Region of memory accessed by the load: S (stack), H (heap) or G (global). • Kind of reference: loading a scalar (S), element of an array (A) or field of a structure (F) • Type of reference: (P)ointer or (N)ot.
Related Work • Some classes account for most misses: GAN, HSN, HFN, HAN, HFP, HAP. • The OKN method: 3 simple heuristics • Use of a pointer dereference • Use of a strided reference • None of the above • This paper is much more precise than both of the above methods
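The BDH labelling described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names and the string keys are hypothetical, and it assumes a structure field is encoded as F, as the labels HFN and HFP suggest.

```python
# Class labels the slides list as accounting for most misses.
MISS_HEAVY_CLASSES = {"GAN", "HSN", "HFN", "HAN", "HFP", "HAP"}

# Hypothetical string keys mapped to the one-letter codes from the slides.
REGIONS = {"stack": "S", "heap": "H", "global": "G"}
KINDS = {"scalar": "S", "array": "A", "field": "F"}


def bdh_label(region: str, kind: str, is_pointer: bool) -> str:
    """Build the three-letter BDH label: region, kind, pointer-or-not."""
    return REGIONS[region] + KINDS[kind] + ("P" if is_pointer else "N")


def is_miss_heavy(region: str, kind: str, is_pointer: bool) -> bool:
    """True if the load falls in one of the classes listed as miss-heavy."""
    return bdh_label(region, kind, is_pointer) in MISS_HEAVY_CLASSES
```

For example, a pointer-typed field of a heap-allocated structure gets the label HFP, which is in the miss-heavy set, while a non-pointer stack scalar (SSN) is not.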
Delinquent Loads • Why not stores too? Write buffers are apparently good enough • Why not do it in hardware? They do, but: • Need additional specialized hardware • Complex decisions (fast) <-> complex hardware • Memory profiling: not always practical
Framework • Assembly code -> address patterns for each load instruction -> placement of the load instruction in a class • Classes + weights -> heuristic function • If the value of the heuristic is greater than a delinquency threshold, the instruction is classified as possibly delinquent
Address Patterns • Address Pattern = summary of how the source address of the load instruction is computed • Uses CFG and DF analysis (reaching definitions) (one address pattern for each control path reaching the load) • Only uses basic registers (BR): gp, sp, regparam, regret
The Decision Criteria • Classes are derived from these criteria • H1: Register usage in an address pattern (usage of BR’s) • H2: Type of operations used in address computation (arithmetic, logic) • H3: Maximum level of dereferencing
The Decision Criteria • H4: Recurrence (iterative walk through memory) • H5: Execution frequency – based on BB profiling; classifies loads as: • Rarely executed (used here as negative) • Seldom executed (idem) • Fairly often executed (not used here) • In a program hotspot
Decision Criteria and Classes • Each criterion results in a set of classes • Class = set of address patterns with a certain property • There are too many classes that can result; only some are considered, and some of those are also aggregated into one class
Decision Criteria and Classes • H1 – based classes: enumerations of the number of occurrences of each of the 4 BR’s in an address pattern • H2 – based classes: address patterns with multiplications and shift operations • H3 – based classes: as many as there are levels of dereferencing in the address patterns
Decision Criteria and Classes • H4 – based classes: two classes (address pattern involves recurrence or not) • H5 – based classes: three classes: rarely, seldom and program hotspot
Experimental Setup • SimpleScalar toolkit: cache simulator (for cache hits & misses), compiler, objdump • Procedure: Fortran -> C code (via f2c) -> MIPS executable (via C2MIPS compiler) -> disassembled code (via objdump) • Reconstruction of CFG and DF analysis
Experimental Setup • 2 stages: learning/training and experimental (actual) • Stage 1: get full memory profiling data on a subset of SPEC benchmarks, use it to compute weights for each class • Use the heuristic thus obtained on a new subset of benchmarks
The Heuristic: Types of Classes • Three types of classes: • Positive (loads in it are likely delinquent) • Negative (… not …) • Neutral • Positive classes have positive weights, negative ones have negative weights, neutral classes have a weight of zero
The Heuristic: Terminology • The miss probability of class F in benchmark j: • The amount of misses accounted for by members of class F in benchmark j:
The Heuristic: Terminology • mj(F,C) = likelihood of an instruction of class F in benchmark j to be a cache miss • However, if that instruction is only executed once, it won’t be a delinquent load • nj(F,C) = proportion out of total number of misses that members of F account for
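The two quantities described above can be written as ratios. The following is a sketch in assumed notation, since the slide formulas are not preserved here: $M_j(F,C)$ is taken to be the number of cache misses caused by loads of class $F$ in benchmark $j$, $E_j(F)$ the number of times those loads execute, and $M_j(C)$ the total number of load misses in benchmark $j$. Reading $C$ as the cache configuration is also an assumption.

```latex
m_j(F,C) = \frac{M_j(F,C)}{E_j(F)},
\qquad
n_j(F,C) = \frac{M_j(F,C)}{M_j(C)}
```

Under this reading, a class can have a high $m_j$ but a negligible $n_j$ when its loads miss often but execute rarely, which is the case the second bullet warns about.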
The Heuristic: Terminology • Strength index: r = mj / nj • A benchmark j is irrelevant to a class F if both indices mj and nj are below certain thresholds. Otherwise it is relevant. • Positive class: r > 5% for all benchmarks • Negative class: nj < 0.5% for all benchmarks • Neutral class: r < 5% for one or more benchmarks
Computing the Weights • Form classes according to the five decision criteria • Compute mj, nj for each class • Weight of class Fk
Computing the Weights • This is the formula for positive classes only • Only relevant benchmarks are included in the formula • |.| is the cardinality of that set, i.e. the number of benchmarks relevant to that class
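The averaging over relevant benchmarks described above might look like the sketch below. The weight formula itself is not preserved here, so the per-benchmark quantity being averaged (`score`) is left as an input, and the thresholds `m_min` and `n_min` are hypothetical parameters. Only the relevance rule and the division by the number of relevant benchmarks come from the slides.

```python
from statistics import mean


def relevant_benchmarks(m, n, m_min, n_min):
    """Benchmarks relevant to a class.

    A benchmark is irrelevant only if BOTH m_j and n_j are below their
    thresholds, so it is relevant if at least one reaches its threshold.
    m and n map benchmark name -> m_j and n_j for this class.
    """
    return {j for j in m if m[j] >= m_min or n[j] >= n_min}


def positive_class_weight(score, m, n, m_min, n_min):
    """Average a per-benchmark score over the relevant benchmarks only.

    `score` maps benchmark name -> the quantity being averaged; which
    quantity the paper uses is an assumption left to the caller.
    """
    relevant = relevant_benchmarks(m, n, m_min, n_min)
    if not relevant:
        return 0.0
    return mean(score[j] for j in relevant)
```

Returning 0.0 when no benchmark is relevant is a design choice of this sketch: it treats such a class like a neutral one.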
Aggregate Classes • AG1: both gp and sp are used 1+ each (comes from H1) • AG2: only sp used 2+ (H1) • AG3: either * or shifts are used (H2) • AG4: one level dereferencing (H3) • AG5: two level dereferencing (H3) • AG6: three level dereferencing (H3)
Aggregate Classes • AG7: address patterns containing a recurrence (H4) • AG8: loads with low frequency of execution (100 < f < 1000) (H5) • AG9: loads with fairly low frequency of execution (f < 100 times) (H5) • Weight formula for negative classes: negated mean of positive weights
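The nine aggregate classes can be expressed as simple predicates over a summary of an address pattern. The sketch below is illustrative only: the `AddressPattern` record and its field names are hypothetical, "only sp" in AG2 is read as "gp does not appear" (regparam and regret are not modelled), and the frequency bounds are taken exactly as written on the slides.

```python
from dataclasses import dataclass


@dataclass
class AddressPattern:
    gp_uses: int            # occurrences of gp in the pattern
    sp_uses: int            # occurrences of sp in the pattern
    has_mul_or_shift: bool  # multiplication or shift in the address computation
    deref_level: int        # maximum level of dereferencing
    has_recurrence: bool    # pattern contains a recurrence
    exec_count: int         # execution frequency f from basic block profiling


def aggregate_classes(p: AddressPattern) -> set[str]:
    """Return the names of the aggregate classes (AG1..AG9) that p falls in."""
    classes = set()
    if p.gp_uses >= 1 and p.sp_uses >= 1:
        classes.add("AG1")
    if p.sp_uses >= 2 and p.gp_uses == 0:  # assumed reading of "only sp"
        classes.add("AG2")
    if p.has_mul_or_shift:
        classes.add("AG3")
    if 1 <= p.deref_level <= 3:
        classes.add(f"AG{3 + p.deref_level}")  # levels 1, 2, 3 -> AG4, AG5, AG6
    if p.has_recurrence:
        classes.add("AG7")
    if 100 < p.exec_count < 1000:
        classes.add("AG8")
    if p.exec_count < 100:
        classes.add("AG9")
    return classes
```

A pattern can belong to several classes at once, which is why the function returns a set rather than a single label.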
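The Heuristic Function • For each load, sum the weights of the classes it belongs to: each class contributes its weight times 1 if the load is in the class, 0 otherwise • If the sum exceeds the delinquency threshold, the load is delinquent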
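A minimal sketch of such a weighted-sum heuristic is shown below. The function names are hypothetical, the weights and threshold would come from the training stage, and treating a class with no recorded weight as contributing zero (like a neutral class) is an assumption. How scores are combined when a load has several address patterns is not modelled.

```python
def heuristic_score(load_classes, weights):
    """Sum the weights of the classes a load belongs to.

    load_classes: iterable of class names the load falls in.
    weights: dict mapping class name -> weight (positive, negative or zero).
    """
    return sum(weights.get(c, 0.0) for c in load_classes)


def is_possibly_delinquent(load_classes, weights, threshold):
    """Flag the load if its score exceeds the delinquency threshold."""
    return heuristic_score(load_classes, weights) > threshold
```

With made-up weights such as `{"AG4": 0.5, "AG7": 0.25, "AG9": -0.375}` and a threshold of 0.5, a load in AG4 and AG7 is flagged, while the same load also falling in the negative class AG9 drops below the threshold.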
Precision and Coverage • Precision of a heuristic scheme H: the percentage of loads that scheme H identifies as delinquent (the lower, i.e., the closer to the real number of delinquent loads, the better) • Coverage of a heuristic scheme H: the percentage of cache misses caused by loads identified as delinquent by scheme H (the closer to 100%, the better)
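Both measures are straightforward to compute from simulator output. The sketch below assumes hypothetical inputs: a set of flagged load identifiers, the set of all static loads, and a per-load miss count. It also assumes precision is expressed as a fraction of all static loads, in line with the 10% figure quoted elsewhere in these slides.

```python
def precision(flagged, all_loads):
    """Fraction of all static loads that the scheme flags as delinquent."""
    return len(flagged) / len(all_loads)


def coverage(flagged, misses_per_load):
    """Fraction of all cache misses caused by the flagged loads.

    misses_per_load maps load identifier -> number of misses it caused.
    """
    total = sum(misses_per_load.values())
    covered = sum(misses_per_load.get(load, 0) for load in flagged)
    return covered / total
```

Note that, as defined here, a lower precision value is better: it means fewer loads need to be flagged to reach a given coverage.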
Combination with BB profiling • Use the heuristic to sharpen the set returned by BB profiling • Also add loads that are not in the hotspots • A parameter sets the percentage of the highest-scoring loads detected by our method but not by profiling that we consider to be delinquent
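One way to read the combination above is sketched below. This is an interpretation, not the paper's procedure: "sharpen" is assumed to mean keeping only the hotspot loads that the heuristic also flags, and `extra_fraction` is a hypothetical name for the percentage of non-hotspot loads to add.

```python
import math


def combine_with_profiling(scores, hotspot_loads, threshold, extra_fraction):
    """Combine heuristic scores with basic block profiling.

    scores: dict mapping load identifier -> heuristic score.
    hotspot_loads: set of loads that profiling places in program hotspots.
    threshold: delinquency threshold for the heuristic.
    extra_fraction: share (0..1) of flagged non-hotspot loads to keep.
    """
    # Hotspot loads that the heuristic also flags (assumed meaning of "sharpen").
    sharpened = {l for l in hotspot_loads if scores.get(l, 0.0) > threshold}

    # Loads flagged by the heuristic but missed by profiling, best score first.
    outside = sorted(
        (l for l in scores if l not in hotspot_loads and scores[l] > threshold),
        key=lambda l: scores[l],
        reverse=True,
    )
    keep = math.ceil(extra_fraction * len(outside))
    return sharpened | set(outside[:keep])
```

Rounding up with `math.ceil` is a choice of this sketch, so that any non-zero fraction keeps at least one non-hotspot load when one exists.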
Conclusions • The static scheme for identifying delinquent loads has a precision of 10% and coverage of over 90% over 18 benchmarks • More precise than related work, similar coverage • Immune to variation of framework parameters (e.g. cache size, assoc., input)