CMPUT680 - Winter 2001 Register Minimization X Register Saturation José Nelson Amaral http://www.cs.ualberta.ca/~amaral/courses/680 CMPUT 680 - Compiler Design and Optimization
Reading List
Touati, S.-A.-A., “Register Saturation in Superscalar and VLIW Codes,” 10th International Conference on Compiler Construction, Genova, Italy, April 2001, pp. 213-228.
Touati, S.-A.-A., Thomasset, F., “Register Saturation in Data Dependence Graphs,” Research Report RR-3978, INRIA, July 2000.
Touati, S.-A.-A., “Optimal Register Saturation in Acyclic Superscalar and VLIW Codes,” Research Report, INRIA, Nov. 2000.
Minimum Register Instruction Sequence (MRIS) Problem
Given the data dependence graph G for a basic block, derive an instruction sequence S for G that is optimal in the sense that its register requirement is minimum.
Intuition for Our Solution
Our intuition is to find subsets of nodes that can definitely share a register, and to use them to inform the instruction sequencing algorithm.
[Figure: Data Dependence Graph with nodes a through i]
Instruction Lineages
An instruction lineage is a sequence of instructions in which a single register is passed from instruction to instruction (except for the last).
L1 = [a, b, f, h, i)
How can we ensure that instructions a, b, f, and h will be able to share the same register?
[Figure: Data Dependence Graph]
Sequencing Edges
L1 = [a, b, f, h, i)
The lineage formation imposes a scheduling restriction on the DDG: the selected heir of a node must be the last of its siblings to be scheduled. For example, b, the heir of a, must follow c, d, and e. Thus the lineage formation inserts sequencing edges in the DDG.
[Figure: Augmented Data Dependence Graph]
Node Height
L1 = [a, b, f, h, i)
If the introduction of sequencing edges were to produce a cycle in the DDG, it would be impossible to find a legal instruction sequence. Thus we use the height of the nodes, recomputed after each lineage formation, to select the heir. Ties are broken arbitrarily.
[Figure: Augmented Data Dependence Graph]
Lineage Formation
L1 = [a, b, f, h, i)
For the next lineage, the highest nodes not yet in a lineage are c, d, and e, all with a height of 5.
L2 = [c, f), L3 = [e, g, h), L4 = [d, g)
[Figure: Augmented Data Dependence Graph]
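The height-based heir selection can be illustrated with a small sketch. It assumes that height means the number of nodes on the longest path from a node to a sink (an assumption that reproduces the height of 5 quoted above) and that the DDG is the nine-node example used throughout the deck. The names `DDG`, `heights`, and `add_sequencing_edges` are illustrative and do not come from the slides.

```python
from functools import lru_cache

# Example DDG: each node maps to the nodes that consume its value.
DDG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f"],
    "d": ["g"], "e": ["g"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}

def heights(succ):
    """Height of a node = number of nodes on its longest path to a sink (assumed)."""
    @lru_cache(maxsize=None)
    def h(n):
        return 1 + max((h(s) for s in succ[n]), default=0)
    return {n: h(n) for n in succ}

def add_sequencing_edges(succ, node, heir):
    """Force every other successor of `node` to be scheduled before the heir."""
    aug = {n: list(s) for n, s in succ.items()}
    for sibling in succ[node]:
        if sibling != heir and heir not in aug[sibling]:
            aug[sibling].append(heir)
    return aug

before = heights(DDG)                      # b, c, d, e all start at height 4
aug = add_sequencing_edges(DDG, "a", "b")  # b is the heir of a, as in L1
after = heights(aug)                       # c, d, e rise to height 5
```

Recomputing the heights after each lineage is what makes c, d, and e the next candidates: the new edges into b lengthen their longest paths by one node.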
Lineage Interference
L1 = [a, b, f, h, i), L2 = [c, f), L3 = [e, g, h), L4 = [d, g)
Two lineages $L_u = [u_1, u_2, \dots, u_m)$ and $L_v = [v_1, v_2, \dots, v_n)$ definitely overlap if: (i) $u_1$ reaches $v_n$, and (ii) $v_1$ reaches $u_m$.
[Figure: Augmented Data Dependence Graph]
Lineage Interference Graph
L1 = [a, b, f, h, i), L2 = [c, f), L3 = [e, g, h), L4 = [d, g)
Which lineages does lineage L1 definitely overlap with? How about lineages L2 and L4?
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph with nodes L1, L2, L3, L4]
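The overlap test can be tried on the example with a short sketch. It assumes the augmented DDG is the original DDG plus the sequencing edges c→b, d→b, and e→b implied by lineage L1, and it treats every node as reaching itself. `AUG`, `LINEAGES`, `reaches`, and `definitely_overlap` are illustrative names.

```python
from itertools import combinations

# Augmented DDG (assumed): the DDG plus sequencing edges c->b, d->b, e->b.
AUG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f", "b"],
    "d": ["g", "b"], "e": ["g", "b"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}
LINEAGES = {
    "L1": ["a", "b", "f", "h", "i"],
    "L2": ["c", "f"],
    "L3": ["e", "g", "h"],
    "L4": ["d", "g"],
}

def reaches(succ, src, dst):
    """True if there is a path from src to dst (a node reaches itself)."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(succ[n])
    return False

def definitely_overlap(succ, lu, lv):
    """Overlap condition: u1 reaches vn and v1 reaches um."""
    return reaches(succ, lu[0], lv[-1]) and reaches(succ, lv[0], lu[-1])

# Edges of the lineage interference graph, as unordered name pairs.
LIG_EDGES = {
    (x, y)
    for x, y in combinations(sorted(LINEAGES), 2)
    if definitely_overlap(AUG, LINEAGES[x], LINEAGES[y])
}
```

Under these assumptions L1 overlaps with every other lineage, while L2 and L4 are the only pair without an interference edge.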
Lineage Fusion Condition
L1 = [a, b, f, h, i), L2 = [c, f), L3 = [e, g, h), L4 = [d, g)
Two lineages $L_u = [u_1, u_2, \dots, u_m)$ and $L_v = [v_1, v_2, \dots, v_n)$ can be fused into a single lineage if: (i) $u_1$ reaches $v_n$, and (ii) $v_1$ does not reach $u_m$.
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph]
Lineage Fusion Condition
L1 = [a, b, f, h, i), L2 = [c, f), L3 = [e, g, h), L4 = [d, g)
Which lineages can be fused in the example? d reaches f, and c does not reach g. Thus L4 can be fused with L2 to form L5 = [d, g) ∪ [c, f).
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph]
Lineage Fusion
L1 = [a, b, f, h, i), L2 = [c, f), L3 = [e, g, h), L4 = [d, g)
When $L_u = [u_1, u_2, \dots, u_m)$ and $L_v = [v_1, v_2, \dots, v_n)$ are fused:
(1) a scheduling edge from $u_m$ to $v_1$ is introduced in the augmented DDG
(2) $L_u$ and $L_v$ are removed from the LIG
(3) a new lineage $L_w = L_u \cup L_v$ is inserted in the LIG
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph]
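The fusion condition and its effect on the graph can be sketched as follows. The augmented DDG is again assumed to be the DDG plus c→b, d→b, e→b, and the fused lineage is represented simply as the concatenation of the two node lists, which is a representation choice of this sketch rather than something the slides prescribe. `can_fuse` and `fuse` are illustrative names.

```python
# Augmented DDG (assumed): the DDG plus sequencing edges c->b, d->b, e->b.
AUG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f", "b"],
    "d": ["g", "b"], "e": ["g", "b"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}

def reaches(succ, src, dst):
    """True if there is a path from src to dst (a node reaches itself)."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(succ[n])
    return False

def can_fuse(succ, lu, lv):
    """Fusion condition: u1 reaches vn and v1 does not reach um."""
    return reaches(succ, lu[0], lv[-1]) and not reaches(succ, lv[0], lu[-1])

def fuse(succ, lu, lv):
    """Add the scheduling edge um -> v1 and return (new graph, fused lineage)."""
    aug = {n: list(s) for n, s in succ.items()}
    if lv[0] not in aug[lu[-1]]:
        aug[lu[-1]].append(lv[0])
    return aug, lu + lv

L2, L4 = ["c", "f"], ["d", "g"]
FUSED_GRAPH, L5 = fuse(AUG, L4, L2)  # adds the edge g -> c
```

The condition is not symmetric: `can_fuse(AUG, L4, L2)` holds, whereas `can_fuse(AUG, L2, L4)` does not, because c does not reach g.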
Lineage Fusion Condition
L1 = [a, b, f, h, i), L3 = [e, g, h), L5 = [d, g) ∪ [c, f)
Thus the fusion of L4 with L2 forms L5 = [d, g) ∪ [c, f). How many colors do we need to color the LIG?
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph with nodes L1, L3, L5]
Lineage Fusion Condition
L1 = [a, b, f, h, i), L3 = [e, g, h), L5 = [d, g) ∪ [c, f)
We need three colors. Can we find an instruction sequence?
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph with nodes L1, L3, L5]
Sequencing by List Scheduling
Lineages: L1 = [a, b, f, h, i), L3 = [e, g, h), L5 = [d, g) ∪ [c, f)
Registers: RA, RB, RC
Sequence (built one instruction per step): a, d, e, g, c, b, f, h, i
[Figure: Augmented Data Dependence Graph and Lineage Interference Graph]
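The final sequence can be checked with a small sketch that verifies it respects every edge of the augmented DDG and counts how many values are live at once. Two assumptions are made here: the sequencing edges are c→b, d→b, e→b (from L1) and g→c (from the fusion), and an operand's register is released before the result of its last consumer is written, which is the register-passing idea behind lineages. `is_legal` and `register_need` are illustrative names.

```python
# Example DDG: each node maps to the nodes that consume its value.
DDG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f"],
    "d": ["g"], "e": ["g"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}
# Sequencing edges (assumed): c, d, e before b; g before c.
SEQ_EDGES = [("c", "b"), ("d", "b"), ("e", "b"), ("g", "c")]
SEQUENCE = ["a", "d", "e", "g", "c", "b", "f", "h", "i"]

def is_legal(succ, extra_edges, order):
    """True if every data and sequencing edge goes forward in the order."""
    pos = {n: i for i, n in enumerate(order)}
    edges = [(u, v) for u in succ for v in succ[u]] + list(extra_edges)
    return all(pos[u] < pos[v] for u, v in edges)

def register_need(succ, order, no_value=("i",)):
    """Peak number of live values; the store i defines no value."""
    pos = {n: i for i, n in enumerate(order)}
    last_use = {n: max((pos[s] for s in succ[n]), default=pos[n]) for n in succ}
    live, peak = set(), 0
    for i, n in enumerate(order):
        # Values whose last consumer is n are released before n's result is written.
        live = {v for v in live if last_use[v] > i}
        if n not in no_value:
            live.add(n)
        peak = max(peak, len(live))
    return peak

LEGAL = is_legal(DDG, SEQ_EDGES, SEQUENCE)
NEED = register_need(DDG, SEQUENCE)
```

Under these assumptions the sequence is legal and never holds more than three live values, which agrees with the three colors needed for the LIG. The alphabetical order a, b, c, ..., i needs four by the same count.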
Summary of Our Solution Method
• A “good” construction algorithm for the LIG (dynamic)
• An effective heuristic method to calculate the HRB (heuristic register bound)
• An efficient scheduling method (no backtracking)
DDG → Form Lineage Interference Graph (LIG) → Derive HRB → Extended list scheduling guided by HRB → A good instruction sequence
Register Saturation (Touati)
Given a data dependence graph G, the register saturation (RS) of G is the maximal register need over all schedules of G.
Touati’s strategy is to compute the RS of G and, if the RS exceeds the number of available registers, to reduce the RS by introducing new arcs in G.
The intuition is that by using either (1) all available registers or (2) the maximal number of registers that G can use, instruction-level parallelism is maximized.
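For a graph as small as the running example, the register saturation can be found by brute force: enumerate every legal schedule and take the largest register need. The sketch below does this for the nine-node example DDG, assuming that an operand is released before the result of its last consumer is written and that the store i defines no value. It is exponential and only meant to make the definition concrete; it is not Touati's algorithm. All names are illustrative.

```python
# Example DDG: each node maps to the nodes that consume its value.
DDG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f"],
    "d": ["g"], "e": ["g"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}

def all_schedules(succ):
    """Yield every topological order of the graph (exponential in general)."""
    indeg = {n: 0 for n in succ}
    for n in succ:
        for s in succ[n]:
            indeg[s] += 1
    order = []

    def rec():
        if len(order) == len(succ):
            yield list(order)
            return
        for n in succ:
            if indeg[n] == 0 and n not in order:
                for s in succ[n]:
                    indeg[s] -= 1
                order.append(n)
                yield from rec()
                order.pop()
                for s in succ[n]:
                    indeg[s] += 1

    yield from rec()

def register_need(succ, order, no_value=("i",)):
    """Peak number of live values under the assumed liveness convention."""
    pos = {n: i for i, n in enumerate(order)}
    last_use = {n: max((pos[s] for s in succ[n]), default=pos[n]) for n in succ}
    live, peak = set(), 0
    for i, n in enumerate(order):
        live = {v for v in live if last_use[v] > i}  # dying operands go first
        if n not in no_value:
            live.add(n)
        peak = max(peak, len(live))
    return peak

NEEDS = [register_need(DDG, s) for s in all_schedules(DDG)]
REGISTER_SATURATION = max(NEEDS)  # worst case over all schedules
MINIMUM_NEED = min(NEEDS)         # what the MRIS problem asks for
```

The two extremes bracket the discussion: the maximum is the quantity Touati bounds, and the minimum is the target of the lineage-based method.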
The HRB and the RS
Govind, Gao, Yang, Amaral, and Zhang had earlier proposed an alternative method: to find a heuristic register bound (HRB) to be used as guidance in a modified list scheduling. Their goal is to find a schedule that uses a minimum number of registers.
To compare both methods we will apply Touati’s method to Govind et al.’s example, and Govind’s method to Touati’s example.
Potential Killers
To find RS(G), we need to know which operation must kill each value generated. Touati defines the set of operations that are potential killers of the value generated by an operation $u \in G$:
$pkill_G(u) = \{ v \in Cons(u) \mid {\downarrow}v \cap Cons(u) = \{v\} \}$
where ${\downarrow}v$ is the set of all descendants of v, including v, and $w \in Cons(u)$ iff $(u, w) \in G$.
Thus a node v is a potential killer of the value generated by a node u if and only if v consumes u and no other descendant of v consumes u.
Potential Killing Graph
The edges of the potential killing graph of a DDG G, $PK(G) = (V, E_{PK})$, are defined as follows:
$E_{PK} = \{(u, v) \mid u \in V_R \wedge v \in pkill_G(u)\}$
$V_R$ is the set of operations that define a value, i.e., operations that need a register.
Govind’s Example: Data Dependency Graph
(a) t1 := ld(x);
(b) t2 := t1 + 4;
(c) t3 := t1 * 8;
(d) t4 := t1 - 4;
(e) t5 := t1 / 2;
(f) t6 := t2 * t3;
(g) t7 := t4 - t5;
(h) t8 := t6 * t7;
(i) st(y,t8);
[Figure: DDG G with nodes a through i]
Govind’s Example: Potential Kill Graph
pkillG(a) = {b, c, d, e}
pkillG(b) = {f}
pkillG(c) = {f}
pkillG(d) = {g}
pkillG(e) = {g}
pkillG(f) = {h}
pkillG(g) = {h}
pkillG(h) = {i}
[Figure: DDG G]
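The potential-killer sets above follow mechanically from the definition. A minimal sketch, assuming the DDG is stored as a dictionary from each node to its consumers (`descendants` and `pkill` are illustrative names):

```python
# Example DDG: each node maps to Cons(node), the consumers of its value.
DDG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f"],
    "d": ["g"], "e": ["g"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}

def descendants(succ, v):
    """All nodes reachable from v, including v itself."""
    seen, stack = set(), [v]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succ[n])
    return seen

def pkill(succ, u):
    """Consumers of u with no other consumer of u among their descendants."""
    cons = set(succ[u])
    return {v for v in cons if descendants(succ, v) & cons == {v}}

# Potential killers of every node that has at least one consumer.
PKILL = {u: pkill(DDG, u) for u in DDG if DDG[u]}

# Hypothetical chain x -> y -> z with an extra edge x -> z:
# y is not a potential killer of x because its descendant z also consumes x.
CHAIN = {"x": ["y", "z"], "y": ["z"], "z": []}
```

In the example every consumer is a potential killer, which is why PK(G) coincides with the DDG; the hypothetical chain shows a case where the two differ.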
Govind’s Example: Potential Kill Graph
In this example the DDG G and the potential kill graph PK(G) are identical. In general that is not the case.
[Figure: DDG G and PK(G) side by side]
Choosing the Killer
If a node u has more than one potential killer, Touati defines a killing function, k(u), that specifies which one among the potential killers of u will actually kill u.
A killing function imposes a scheduling order on the DDG: all other consumers of u, Cons(u), must be scheduled before k(u) is scheduled.
To represent these scheduling constraints, Touati defines an extended DAG, $G_k$, induced by the killing function k.
Govind’s Example: Killing Function
In this example, node a is the only node with multiple potential killers: pkillG(a) = {b, c, d, e}. Every other value has a single potential killer.
[Figure: PK(G)]
Govind’s Example: Killing Function
If we choose k(a) = b, we obtain the extended DAG $G_k$, in which c, d, and e must be scheduled before b.
[Figure: Gk]
Selecting a Good Set of Killers...
If the killing function for multiple nodes with multiple potential killers is chosen arbitrarily, it might induce cycles in $G_k$.
A valid killing function is one that does not induce cycles in $G_k$.
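Validity of a killing function can be checked by building the extended DAG and testing it for cycles. The sketch below adds an edge from every other consumer of u to k(u), following the wording of the slides, and uses Kahn's algorithm for the cycle test. The graph `TOY` and the killing function `BAD_K` are hypothetical and serve only to show an invalid choice; `extended_dag` and `is_acyclic` are illustrative names.

```python
# Example DDG and the killing function with k(a) = b.
DDG = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f"],
    "d": ["g"], "e": ["g"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}
K = {"a": "b", "b": "f", "c": "f", "d": "g", "e": "g", "f": "h", "g": "h", "h": "i"}

def extended_dag(succ, k):
    """G_k: add an edge v -> k(u) for every other consumer v of u."""
    gk = {n: list(s) for n, s in succ.items()}
    for u, killer in k.items():
        for v in succ[u]:
            if v != killer and killer not in gk[v]:
                gk[v].append(killer)
    return gk

def is_acyclic(succ):
    """Kahn's algorithm: acyclic iff every node can eventually be removed."""
    indeg = {n: 0 for n in succ}
    for n in succ:
        for s in succ[n]:
            indeg[s] += 1
    ready = [n for n in succ if indeg[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for s in succ[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return removed == len(succ)

GK = extended_dag(DDG, K)

# Hypothetical graph: u1 and u2 are both consumed by x and y.
TOY = {"u1": ["x", "y"], "u2": ["x", "y"], "x": [], "y": []}
BAD_K = {"u1": "x", "u2": "y"}  # forces y before x and x before y
```

With k(a) = b the extended DAG stays acyclic, so the killing function is valid. In the hypothetical graph the two choices contradict each other and produce a cycle.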
Avoiding Vengeance...
A killer must kill before it has children. Thus the descendants of k(u) cannot be simultaneously alive with u.
Touati defines the Disjoint Value Graph, $DV_k(G) = (V_R, E_{DV})$, by:
$E_{DV} = \{(u, v) \mid u, v \in V_R \wedge v \in {\downarrow}_R\, k(u)\}$
where ${\downarrow}_R\, k(u)$ denotes the descendants of k(u), including k(u), that belong to $V_R$.
An edge (u, v) in $DV_k(G)$ means that the live interval of u always comes before the live interval of v in any schedule of $G_k$.
Govind’s Example: Disjoint Value Graph
k(a) = b, k(b) = f, k(c) = f, k(d) = g, k(e) = g, k(f) = h, k(g) = h, k(h) = i
[Figure: Gk and DVk(G); DVk(G) is shown simplified by transitive reduction]
Register Need and Maximal Antichains
An antichain in a graph G = (V, E) is a set of nodes $A \subseteq V$ such that there are no paths between the nodes in A:
$\forall u, v \in A,\ u \neq v:\ (u, v) \notin E^c \wedge (v, u) \notin E^c$
where $E^c$ is the transitive closure of E: $(u, v) \in E^c$ iff there is a path $p = (u, \dots, v)$ in G.
The register need of any schedule of $G_k$ is always less than or equal to the size of a maximal antichain in $DV_k(G)$.
Govind’s Example: Maximal Antichain
The maximal antichain in this example is $AM_k = \{a, c, d, e\}$.
Thus this graph, with this killing function, can use at most 4 registers.
[Figure: DVk(G)]
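The antichain of size 4 can be reproduced by brute force. The sketch builds the disjoint value relation from the extended DAG with k(a) = b, assuming that the edges of $DV_k(G)$ go from u to every value-defining descendant of k(u), and then searches subsets from largest to smallest. The search is exponential and only suitable for tiny graphs. All names are illustrative.

```python
from itertools import combinations

# Extended DAG G_k for k(a) = b: the DDG plus c->b, d->b, e->b.
GK = {
    "a": ["b", "c", "d", "e"],
    "b": ["f"], "c": ["f", "b"],
    "d": ["g", "b"], "e": ["g", "b"],
    "f": ["h"], "g": ["h"],
    "h": ["i"], "i": [],
}
K = {"a": "b", "b": "f", "c": "f", "d": "g", "e": "g", "f": "h", "g": "h", "h": "i"}
VALUES = set(K)  # V_R: every node except the store i defines a value

def descendants(succ, v):
    """All nodes reachable from v, including v itself."""
    seen, stack = set(), [v]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succ[n])
    return seen

# AFTER[u]: values whose live interval can only start once u is dead.
AFTER = {u: descendants(GK, K[u]) & VALUES for u in VALUES}

def is_antichain(nodes):
    """No two nodes are ordered by the disjoint value relation."""
    return all(
        v not in AFTER[u] and u not in AFTER[v]
        for u, v in combinations(nodes, 2)
    )

def maximal_antichain():
    """Largest antichain, found by trying subsets from big to small."""
    for size in range(len(VALUES), 0, -1):
        for cand in combinations(sorted(VALUES), size):
            if is_antichain(cand):
                return set(cand)
    return set()

AM = maximal_antichain()
```

The pairwise test is enough because the relation is already transitive: anything that comes after a descendant of k(u) is itself a descendant of k(u). The set {b, c, d, e} is another antichain of the same size.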
Register Saturating Scheduling
Touati proves that, for every valid killing function k, there is always a schedule that makes all the values in the maximal antichain of the disjoint value DAG $DV_k(G)$ simultaneously alive.
Saturating Killing Function
To find the register saturation of a DDG, we need to find a killing function that maximizes the size of the maximal antichain in $DV_k(G)$. In other words, we need to find a killing function that maximizes the number of nodes that are not connected by a path in $DV_k(G)$.
Touati calls this the maximizing maximal antichain (MMA) problem. A solution to the MMA problem is a saturating killing function. MMA is NP-complete.
Heuristic to Compute Register Saturation
To compute the register saturation, Touati starts by decomposing the potential kill graph PK(G) into connected bipartite components.
A bipartite component, $cb = (S_{cb}, T_{cb}, E_{cb})$, is a graph with a set of source nodes $S_{cb}$, a set of target nodes $T_{cb}$, and a set of edges $E_{cb}$. cb must obey the following conditions:
(1) $\nexists\, e, e' \in E_{cb}$ such that $target(e) = source(e')$
(2) if $e \in E_{PK}$, $e' \in E_{cb}$, and e and e' share an endpoint, then $e \in E_{cb}$
Bipartite Decomposition of PK(G)
A bipartite decomposition of the potential killing graph PK(G) is a set of bipartite components such that for every edge $e \in E_{PK}$, there is a bipartite component cb in the decomposition such that $e \in E_{cb}$.
Touati proves that, given a DDG G, there is only one bipartite decomposition of PK(G).
Govind’s Example: Bipartite Decomposition
[Figure: PK(G) and its bipartite decomposition]
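One way to compute the decomposition for the example is sketched below. It reads "share an endpoint" as "have the same source or the same target", so a node may appear as a target in one component and as a source in another; this reading is an assumption of the sketch. Each node is split into a source copy and a target copy, and the components are the connected components of the resulting bipartite graph. `PK_EDGES` and `bipartite_decomposition` are illustrative names.

```python
# Edges of PK(G) for the example, taken from the pkill sets.
PK_EDGES = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
    ("b", "f"), ("c", "f"), ("d", "g"), ("e", "g"),
    ("f", "h"), ("g", "h"), ("h", "i"),
]

def bipartite_decomposition(pk_edges):
    """Return a list of (sources, targets) pairs, one per component."""
    # Split every node into a source copy ("S", n) and a target copy ("T", n).
    adj = {}
    for u, v in pk_edges:
        adj.setdefault(("S", u), set()).add(("T", v))
        adj.setdefault(("T", v), set()).add(("S", u))
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # depth-first search over the split graph
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        sources = sorted(name for role, name in comp if role == "S")
        targets = sorted(name for role, name in comp if role == "T")
        components.append((sources, targets))
    return components

COMPONENTS = bipartite_decomposition(PK_EDGES)
```

For the example this yields five components, with sources {a}, {b, c}, {d, e}, {f, g}, and {h}; only the first has more than one target.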
Saturating Killing Set
Touati defines the saturating killing set of a connected bipartite component cb, SKS(cb), as a subset of the target nodes, $T'_{cb} \subseteq T_{cb}$, such that:
(1) All the source nodes, $S_{cb}$, are contained in the union of the predecessors of the nodes in $T'_{cb}$.
(2) $T'_{cb}$ contains a minimum number of nodes.
Computing the SKS is an NP-complete problem.
Govind’s Example: Saturating Killing Set
In this example the computation of the SKS is trivial. The only component whose target set has more than one node is the top one. Selecting any single node in the set $T_{cb} = \{b, c, d, e\}$ covers the set $S_{cb} = \{a\}$. Thus the selection can be arbitrary.
[Figure: Bipartite Decomposition]
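Since the SKS is a minimum cover of the sources by predecessors of the chosen targets, a tiny component can be solved by exhaustive search. The sketch below tries subsets of targets in increasing size and is not Touati's heuristic. The second component is hypothetical, added only to show a case where the choice is not arbitrary. `saturating_killing_set` is an illustrative name.

```python
from itertools import combinations

def saturating_killing_set(sources, targets, edges):
    """Smallest subset of targets whose predecessors cover every source."""
    for size in range(1, len(targets) + 1):
        for cand in combinations(sorted(targets), size):
            covered = {u for u, v in edges if v in cand}
            if covered >= set(sources):
                return set(cand)
    return None  # some source has no edge into the targets

# Top component of the example: S = {a}, T = {b, c, d, e}.
TOP_EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")]
TOP_SKS = saturating_killing_set(["a"], ["b", "c", "d", "e"], TOP_EDGES)

# Hypothetical component: only r is fed by both sources x and y.
TOY_EDGES = [("x", "p"), ("y", "q"), ("x", "r"), ("y", "r")]
TOY_SKS = saturating_killing_set(["x", "y"], ["p", "q", "r"], TOY_EDGES)
```

For the top component any single target works; the search returns b only because it tries candidates in alphabetical order, which happens to match the choice k(a) = b used earlier.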
Govind’s Example
As we saw earlier, with k(a) = b the register saturation in Govind’s example is 4, and a schedule that has four values alive at the same time can be found.
Using the lineage method, Govind et al. found a schedule for their example that uses three registers.
What does Touati’s method do if only three registers are available?
Reducing RS
Touati proposes an algorithm to reduce the register saturation while trying not to increase the length of the critical path.
The algorithm starts by computing the maximal antichain $AM_k$. Then it starts an iterative process in which the first step is to construct the set $U_k$ of all admissible serializations between the saturating values in $AM_k$, together with their costs.