Load-Reuse Analysis design and evaluation

Rastislav Bodík, Rajiv Gupta, Mary Lou Soffa

Partial Redundancy Elimination (PRE)
• Partially redundant = computed on some (but not all) incoming paths
• [Figure: flow graph containing x := a+b, y := a+b, two further occurrences of a+b, and a := ..]
• Steps:
  • find “reuse” paths,
  • remove redundancy from “reuse” paths.

Register promotion = PRE of loads
• Three steps:
  • load-reuse analysis: find loads that can reuse prior loads/stores
  • alias analysis: which stores may kill reuse?
  • transformation: remove redundancy with PRE [PLDI ‘98]
• [Figure: code sequence store a1, x; load a2; store a3; load a4]

Load-reuse analysis
• Design goal: completeness (find all reuse)
• To approach completeness, the analysis is
  • uniform: analyze scalar, array, and pointer loads
  • path-sensitive: different source of reuse on each path
• Evaluation goal: how complete?
  • compare with ideal analysis
  • detecting all reuse is undecidable: no ideal algorithm exists
  • instead, use simulation

Experimental framework
• [Figure: experimental pipeline. Inputs: program, input. Numbered stages: 1. load-reuse analysis, 2. simulator, 3. estimator, 4. comparison. Other labels: data-flow solution, profile, weighted solution, reuse level, transformation [PLDI ‘98]]

1. Load-reuse analysis
• It’s a data-flow analysis
  • on a reuse-aware representation: Value Name Graph (VNG) [POPL ’98]
• What’s new?
  • Sparse version of the VNG: up to 30 times smaller than non-sparse
  • Analyzing indirect loads/stores: also, model killing stores

Naming the value
    y := b+c
    a := c-1
    x := a+b+1
• Names for the value in ‘x’: x, a+b+1, b+c
• [Figure: GEN bit vector with a 1 for each of the names x, b+c, a+b+1]
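The three names can be reproduced by substituting assignments backward through the straight-line code. The Python below is a sketch under stated assumptions: expressions are restricted to sums of variables and integer constants, and the helpers `parse_linear`, `substitute`, and `value_names` are hypothetical names introduced here, not part of the VNG construction itself.

```python
from collections import Counter


def parse_linear(text):
    """Parse a sum such as 'a+b+1' into {symbol: coefficient}; constants use key ''."""
    terms = Counter()
    sign, token = 1, ""
    for ch in text.replace(" ", "") + "+":
        if ch in "+-":
            if token:
                if token.isdigit():
                    terms[""] += sign * int(token)
                else:
                    terms[token] += sign
            sign, token = (1 if ch == "+" else -1), ""
        else:
            token += ch
    return {k: v for k, v in terms.items() if v != 0}


def substitute(expr, var, replacement):
    """Replace `var` in `expr` by the linear expression `replacement` and simplify."""
    coeff = expr.get(var, 0)
    result = Counter({k: v for k, v in expr.items() if k != var})
    for sym, c in replacement.items():
        result[sym] += coeff * c
    return {k: v for k, v in result.items() if v != 0}


def value_names(assignments, target):
    """Walk the code backward; each assignment to a used variable yields a new name."""
    current = {target: 1}
    names = [current]
    for lhs, rhs in reversed(assignments):
        if lhs in current:
            current = substitute(current, lhs, parse_linear(rhs))
            names.append(current)
    return names


code = [("y", "b+c"), ("a", "c-1"), ("x", "a+b+1")]
names = value_names(code, "x")
```

For the code above, `names` holds the normalized forms of x, a+b+1, and b+c. The last one equals the expression computed by `y := b+c`, which is why that statement is a source of reuse for `x`.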

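Naming the value across loads
• Field offsets: f at offset 0, next at offset 4
• [Figure: flow graph with .. := p->f, .. := p->next->f, *r := ..., and p := p->next; GEN bits mark the names *p and **(p+4)]
• Across p := p->next, the name *p becomes **(p+4)
• KILL: the store *r := ... kills **(p+4) if r = p+4 or r = *(p+4)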
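The renaming across `p := p->next` and the kill condition can be sketched with a tiny symbolic representation. The tuple encoding (`("deref", e)`, `("add", base, offset)`) and the helper names below are assumptions made for this illustration; the only facts taken from the slide are the field offsets and the resulting names.

```python
def translate(name, var, replacement):
    """Rewrite `name` as it reads above the assignment `var := replacement`."""
    if name == var:
        return replacement
    if isinstance(name, tuple):
        return (name[0],) + tuple(translate(x, var, replacement) for x in name[1:])
    return name


def addresses_read(name):
    """Collect the address expression of every dereference inside a value name."""
    if isinstance(name, tuple) and name[0] == "deref":
        return [name[1]] + addresses_read(name[1])
    if isinstance(name, tuple):
        found = []
        for operand in name[1:]:
            found += addresses_read(operand)
        return found
    return []


def render(expr):
    """Pretty-print a symbolic expression in the slide's notation."""
    if isinstance(expr, tuple) and expr[0] == "deref":
        inner = render(expr[1])
        simple = not isinstance(expr[1], tuple) or expr[1][0] == "deref"
        return "*" + (inner if simple else "(" + inner + ")")
    if isinstance(expr, tuple) and expr[0] == "add":
        return f"{render(expr[1])}+{render(expr[2])}"
    return str(expr)


def kill_conditions(store_address, name):
    """A store through `store_address` kills `name` if it hits any address the name reads."""
    return [f"{store_address} = {render(a)}" for a in addresses_read(name)]


load_f = ("deref", "p")                    # p->f, with f at offset 0
p_next = ("deref", ("add", "p", 4))        # p->next, with next at offset 4
name_above = translate(load_f, "p", p_next)
```

Here `render(name_above)` gives `**(p+4)`, and `kill_conditions("r", name_above)` lists the two equalities from the slide. Whether either equality can actually hold is the question the alias analysis has to answer.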

Sparse representation
    for I = 1, N { .. := A[I] + A[I-1] }
  lowered to:
    a1 := A+I
    load a1
    a2 := A+I-1
    load a2
    I := I+1
• [Figure: GEN bits for load a1 and load a2 in the sparse VNG; Ø marks empty entries]

2. The simulator algorithm
    for I = 1, N { .. := A[I] + A[I-1] }
• Memory access history (history length = 1 to 4; initially Ø):
    A[I]    load a1    103 102 101 100
    A[I-1]  load a2    102 101 100  99
• Simulator detects all PRE-exploitable reuse (up to given history length),
• but also some “noise”, e.g. due to hash table accesses
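A simplified sketch of a history-based reuse simulator, in Python. This is not the simulator used in the evaluation: the trace format, the function names, and the exact meaning of "history length" (here: the number of most recent addresses kept per static instruction) are assumptions made for illustration. A load counts as having reuse when its address is still present in some instruction's history; a store removes the address from all histories and then records it as a new source of reuse.

```python
from collections import defaultdict, deque


def simulate(trace, history_length):
    """trace: iterable of (instruction_id, kind, address), kind is 'load' or 'store'.

    Returns (dynamic_loads_with_reuse, total_dynamic_loads).
    """
    histories = defaultdict(lambda: deque(maxlen=history_length))
    reused = total = 0
    for instr, kind, addr in trace:
        if kind == "load":
            total += 1
            if any(addr in h for h in histories.values()):
                reused += 1
        else:
            # A store overwrites the location: older values are no longer reusable.
            for h in histories.values():
                while addr in h:
                    h.remove(addr)
        # Both loads and stores make the accessed value available for later reuse.
        histories[instr].append(addr)
    return reused, total


def loop_trace(n, base=100):
    """Address trace of: for I = 1, N { .. := A[I] + A[I-1] }."""
    trace = []
    for i in range(1, n + 1):
        trace.append(("a1", "load", base + i))       # A[I]
        trace.append(("a2", "load", base + i - 1))   # A[I-1]
    return trace
```

Under these assumptions, `simulate(loop_trace(4), 2)` reports that three of the eight dynamic loads have reuse: from the second iteration on, `load a2` finds its address in the history of `load a1`.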

Ideal amount of load reuse
• [Chart: reuse as % of all dynamic loads, for history lengths 1 and 4; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, vortex, tomcatv, swim, su2cor, hydro]
• 65% of executed loads have reuse exploitable by PRE (intra-procedural reuse, history = 1)

3. How frequent is the reuse?
• Edge profile:
  + cheap and available
  - cannot reconstruct frequencies of reuse paths
• Path profile:
  + precise
  - more expensive
• Use edge profile, but bound its inherent error: compute lower & upper bound on reuse
• [Figure: flow graph with load x and kill x nodes, annotated with edge frequencies]

Hierarchy of estimators
• Estimator: data-flow solution + edge profile → weighted data-flow solution
• PRE, CMP1, CMPc, CMPr, CMPf: each step has smaller error (but is more complex)
• Hierarchy: a practical approach
• A simple estimator not precise enough? Use the next better one!

The algorithms
• 1. The bounds:
  • generators: points generating reuse
  • stealers: points with no reuse
  • upper bound: all reuse consumed
  • lower bound: all reuse stolen
  • [Figure: flow graph with load x and kill x nodes, annotated with edge frequencies]
• 2. Separating uncertainty:
  • using the CMP region
  • defined for PRE [PLDI ‘98]
  • CMP = code-motion preventing
  • all error is contained in the CMP region!
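One way to read the bounds in step 1 is sketched below in Python. This is an interpretation, not the estimator from the talk: it assumes a single region into which `generated` executions enter carrying reuse and `stolen` executions enter without it, and from which `consumed` executions reach the load. The function name, the parameters, and the numbers in the usage line are hypothetical and are not taken from the figure.

```python
def reuse_bounds(generated, stolen, consumed):
    """Bound how many of `consumed` load executions see reuse, given only edge counts.

    An edge profile does not say which entering executions reach the load,
    so the two extreme pairings give the bounds.
    """
    # Upper bound: as much generated reuse as possible flows to the load.
    upper = min(generated, consumed)
    # Lower bound: every reuse-free execution is routed to the load first.
    lower = max(0, consumed - stolen)
    return lower, upper


lower, upper = reuse_bounds(generated=65, stolen=35, consumed=75)
```

With these hypothetical counts the load sees reuse on between 40 and 65 of its 75 executions. A path profile would give the exact number; the width of the interval is the error that the edge profile cannot remove.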

Improving precision
• “one” region
• connected regions
• control flow reachability
• network flow reachability

Estimators: precision
• [Chart: error of the estimators PRE, CMP1, CMPc, CMPr, CMPf (smaller error toward CMPf), for FP and INT benchmarks]

4. Analysis: how close to ideal?
• [Chart: 100% = reuse seen by simulator. Labels: *p, **p. Reuse killed by: ideal alias info; calls; array & pointer stores + calls; all stores + calls]

Related Work
• Load-Reuse Analysis
  • makes value numbering path-sensitive
  • Steffen, Knoop, Rüthing: Value Flow Graph [ESOP ‘90]
  • we show how to analyze indirect loads, via symbolic evaluation
• Simulation-based analysis evaluation
  • Diwan, McKinley, Moss [PLDI ’98]: Type-based alias analysis: how powerful does it need to be?
• Estimators
  • Ramalingam: “Frequency Analysis” [PLDI ’96] returns a single estimate, not its bounds

Summary
• Load-reuse analysis:
  • reuse across indirect memory references
  • sparse representation
• Estimators: three principles
  • confidence: bound the edge-profile error
  • separation of uncertainty: inside/outside the CMP region
  • hierarchy: increasing precision and complexity
• Evaluation:
  • about 65% of loads are amenable to PRE
  • our analysis can find about 80% of those

Combine three removal methods [PLDI ‘98]
• S: control speculation
• M: code motion
• R: restructuring

Example:
• [Figure: flow graph with three occurrences of a+b and edge frequencies 10 and 50, marked S, M, R]

Relative removal power
• [Chart: loads removed (dynamic count, normalized) for INT and FP benchmarks; labels: S, M, R, Global CSE, path-insensitive]
