Automatic task generation for DAGuE. http://icl.utk.edu/dague. George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra. The DAGuE system. DAGuE Compiler. Serial Code to Dataflow Representation. Example: QR Factorization.
Automatic task generation for DAGuE
http://icl.utk.edu/dague
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
DAGuE Compiler Serial Code to Dataflow Representation
Input Format – Quark (PLASMA)

    for (k = 0; k < MT; k++) {
        Insert_Task( zgeqrt, A[k][k], INOUT,
                             T[k][k], OUTPUT);
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U,
                                 A[m][k], INOUT | LOCALITY,
                                 T[m][k], OUTPUT);
        }
        for (n = k+1; n < NT; n++) {
            Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                                 T[k][k], INPUT,
                                 A[k][n], INOUT);
            for (m = k+1; m < MT; m++) {
                Insert_Task( ztsmqr, A[k][n], INOUT,
                                     A[m][n], INOUT | LOCALITY,
                                     A[m][k], INPUT,
                                     T[m][k], INPUT);
            }
        }
    }

• Sequential C code
• Annotated through QUARK-specific syntax
    • Insert_Task
    • INOUT, OUTPUT, INPUT
    • REGION_L, REGION_U, REGION_D, …
    • LOCALITY
• StarPU syntax in progress
DAGuE Compiler analysis steps

    for (k = 0; k < MT; k++) {
        Insert_Task( zgeqrt, A[k][k], INOUT,
                             T[k][k], OUTPUT);
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U,
                                 A[m][k], INOUT | LOCALITY,
                                 T[m][k], OUTPUT);
        }
        for (n = k+1; n < NT; n++) {
            Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                                 T[k][k], INPUT,
                                 A[k][n], INOUT);
            for (m = k+1; m < MT; m++) {
                Insert_Task( ztsmqr, A[k][n], INOUT,
                                     A[m][n], INOUT | LOCALITY,
                                     A[m][k], INPUT,
                                     T[m][k], INPUT);
            }
        }
    }

• Record all USEs
• Record all DEFinitions
• Formulate as Omega Relations:
    • all true (flow) dependencies
    • all output dependencies
• Compute the differences
• Formulate all anti-dependencies
• Finalize synchronization edges
Traditional Compiler (Control Flow)

    for (k = 0; k < MT; k++) {
        Insert_Task( zgeqrt, A[k][k], INOUT,
                             T[k][k], OUTPUT);
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U,
                                 A[m][k], INOUT | LOCALITY,
                                 T[m][k], OUTPUT);
        }
        for (n = k+1; n < NT; n++) {
            Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                                 T[k][k], INPUT,
                                 A[k][n], INOUT);
            for (m = k+1; m < MT; m++) {
                Insert_Task( ztsmqr, A[k][n], INOUT,
                                     A[m][n], INOUT | LOCALITY,
                                     A[m][k], INPUT,
                                     T[m][k], INPUT);
            }
        }
    }

[Figure: Control Flow Graph with one node per loop: for(k…, for(m…, for(n…, for(m…]
Data Flow imposes ordering

    for (k = 0; k < MT; k++) {
        Insert_Task( zgeqrt, A[k][k], INOUT,
                             T[k][k], OUTPUT);
        for (m = k+1; m < MT; m++) {
            Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U,
                                 A[m][k], INOUT | LOCALITY,
                                 T[m][k], OUTPUT);
        }
        for (n = k+1; n < NT; n++) {
            Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                                 T[k][k], INPUT,
                                 A[k][n], INOUT);
            for (m = k+1; m < MT; m++) {
                Insert_Task( ztsmqr, A[k][n], INOUT,
                                     A[m][n], INOUT | LOCALITY,
                                     A[m][k], INPUT,
                                     T[m][k], INPUT);
            }
        }
    }
Dataflow Analysis
• Example on task DGEQRT of QR
• Polyhedral Analysis through Omega
• Compute algebraic expressions for:
    • Source and destination tasks
    • Necessary conditions for that data flow to exist

    for k = 0 .. N-1
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        for n = k+1 .. N-1
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

[Figure: incoming and outgoing data of GEQRT. Labels: MEM, Incoming Data, Outgoing Data, k=0, k=SIZE-1, n=k+1, m=k+1, U, L]
Intermediate Representation: Job Data Flow

    GEQRT(k)
      /* Execution space */
      k = 0 .. ((MT < NT) ? MT-1 : NT-1)
      /* Locality */
      : A(k, k)

      RW    A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
              -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1)    [type = LOWER]
              -> (k < MT-1) ? A1 TSQRT(k, k+1)           [type = UPPER]
              -> (k == MT-1) ? A(k, k)                   [type = UPPER]
      WRITE T <- T(k, k)
              -> T(k, k)
              -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

      /* Priority */
      ; (NT-k)*(NT-k)*(NT-k)
    BODY
      zgeqrt( A, T )
    END

Control flow is eliminated, therefore maximum parallelism is possible
Dataflow Analysis

    for k = 0 .. N-1 {
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1 {
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        }
        for n = k+1 .. N-1 {
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1 {
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
            }
        }
    }

USE (read by GEQRT):     A[k'][k'] : 0 <= k' <= N-1
DEF (written by TSMQR):  A[m][n]   : k+1 <= m <= N-1, k+1 <= n <= N-1, 0 <= k <= N-1
Ctrl Flow:               k' > k
Dataflow Analysis

    for k = 0 .. N-1 {
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1 {
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        }
        for n = k+1 .. N-1 {
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1 {
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
            }
        }
    }

Flow Dependency (RAW)
Relation [k,m,n] -> [k'] :
    k+1 <= m <= N-1
    k+1 <= n <= N-1
    0 <= k <= N-1
    0 <= k' <= N-1
    k < k'
    m = k'
    n = k'
Dataflow Analysis

    for k = 0 .. N-1 {
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1 {
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        }
        for n = k+1 .. N-1 {
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1 {
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
            }
        }
    }

Omega Simplified
    {[k,m,m] -> [m] : k+1, 0 <= m < N}
Output Dependency (WAW)
Dataflow Analysis

    for k = 0 .. N-1 {
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1 {
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        }
        for n = k+1 .. N-1 {
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1 {
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
            }
        }
    }

Real Edge: Flow - Output
    {[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}    (TSMQR with n=k+1, m=k+1)
GEQRT's incoming edge
    {[k-1,k,k] -> [k] : 0 < k <= N-1}
    (k>0) ? TSMQR(k-1,k,k)
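The real edge into GEQRT can be cross-checked by brute force for small N. The following is a minimal sketch, not the compiler's method (which derives the relation symbolically through Omega): it walks the serial loop nest of the slide, records the last task that wrote each tile of A, and reports who produced A[k][k] when GEQRT(k) reads it. The names `Task`, `MAX_N` and `geqrt_producer` are hypothetical and exist only for this illustration.

```c
#include <assert.h>

enum { MEMORY = 0, GEQRT, TSQRT, UNMQR, TSMQR };

/* Hypothetical record of a task instance; unused indices are -1. */
typedef struct { int kind, k, m, n; } Task;

#define MAX_N 16

/* Walk the serial loop nest and return the last writer of A[kq][kq]
 * at the moment GEQRT(kq) reads it. */
Task geqrt_producer(int N, int kq)
{
    Task last[MAX_N][MAX_N];
    assert(N > 0 && N <= MAX_N && kq >= 0 && kq < N);

    /* Initially every tile comes from memory. */
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            last[r][c] = (Task){ MEMORY, -1, -1, -1 };

    for (int k = 0; k < N; k++) {
        if (k == kq)
            return last[k][k];               /* GEQRT(kq) reads A[kq][kq] here */
        last[k][k] = (Task){ GEQRT, k, -1, -1 };
        for (int m = k + 1; m < N; m++) {    /* TSQRT writes A[k][k] | U and A[m][k] */
            last[k][k] = (Task){ TSQRT, k, m, -1 };
            last[m][k] = (Task){ TSQRT, k, m, -1 };
        }
        for (int n = k + 1; n < N; n++) {
            last[k][n] = (Task){ UNMQR, k, -1, n };
            for (int m = k + 1; m < N; m++) { /* TSMQR writes A[k][n] and A[m][n] */
                last[k][n] = (Task){ TSMQR, k, m, n };
                last[m][n] = (Task){ TSMQR, k, m, n };
            }
        }
    }
    return (Task){ MEMORY, -1, -1, -1 };     /* not reached for a valid kq */
}
```

For k = 0 the producer is memory; for k > 0 it is TSMQR(k-1, k, k), matching `(k>0) ? TSMQR(k-1,k,k)`. Earlier TSMQR(k'', k, k) instances with k'' < k-1 also write the tile, but their values are overwritten first, which is what subtracting the output dependencies expresses.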
Dataflow Analysis

    for k = 0 .. N-1 {
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1 {
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        }
        for n = k+1 .. N-1 {
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1 {
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
            }
        }
    }

GEQRT -> UNMQR
    {[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <= N-1}
    -> (k<N-1) ? UNMQR(k, k+1..N-1)
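To make the step from the Omega relation to the guarded range concrete, the relation can be expanded for a single GEQRT(k). This is only an illustrative sketch; the helper name `geqrt_to_unmqr` and the output buffer `ns` are assumptions, not part of DAGuE.

```c
#include <assert.h>

/* Expand {[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <= N-1} for one GEQRT(k).
 * Stores the n index of every consuming UNMQR(k, n) in ns and returns how
 * many there are. The caller must provide room for at least N-1 entries. */
int geqrt_to_unmqr(int N, int k, int *ns)
{
    assert(N > 0 && k >= 0 && k < N);
    int count = 0;
    if (k < N - 1) {                          /* guard of the edge: (k<N-1) ? ... */
        for (int n = k + 1; n <= N - 1; n++)  /* range k+1..N-1 */
            ns[count++] = n;
    }
    return count;
}
```

With N = 4, GEQRT(0) feeds UNMQR(0,1), UNMQR(0,2) and UNMQR(0,3), while GEQRT(3) has no UNMQR consumer because the guard k < N-1 fails.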
Anti-dependencies
• In theory, anti-deps do not matter in distributed memory
• But real machines are distributed/shared memory hybrids
• Anti-deps must create synchronization edges
• Overestimating anti-deps is safe (albeit slow)
• Output deps should be treated the same … in theory
Anti-dependencies in QR?

    for k = 0 .. N-1 {
        A[k][k], T[k][k] <- GEQRT( A[k][k] )
        for m = k+1 .. N-1 {
            A[k][k] | U, A[m][k], T[m][k] <- TSQRT( A[k][k] | U, A[m][k], T[m][k] )
        }
        for n = k+1 .. N-1 {
            A[k][n] <- UNMQR( A[k][k] | L, T[k][k], A[k][n] )
            for m = k+1 .. N-1 {
                A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
            }
        }
    }

TSMQR -> GEQRT
    {[k, m, n] -> [k'] : n=m=k'}    (n=k+1, m=k+1)
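As printed, the anti-dependency relation relates every TSMQR(k, k', k') to GEQRT(k'), not only the one with k = k'-1. A small enumeration, sketched below under the assumption that the serial order restricts sources to k < k', counts those sources. The function name `tsmqr_before_geqrt` is hypothetical.

```c
#include <assert.h>

/* Count the TSMQR(k, m, n) instances with m == n == kp that run before
 * GEQRT(kp) overwrites tile A[kp][kp] in the serial order. */
int tsmqr_before_geqrt(int N, int kp)
{
    assert(N > 0 && kp >= 0 && kp < N);
    int count = 0;
    for (int k = 0; k < N; k++) {
        if (k == kp)
            return count;                 /* GEQRT(kp) is the first task of iteration kp */
        for (int n = k + 1; n < N; n++)
            for (int m = k + 1; m < N; m++)
                if (m == kp && n == kp)
                    count++;              /* TSMQR(k, kp, kp) reads A[kp][kp] */
    }
    return count;                         /* not reached for a valid kp */
}
```

The count equals k', one source per earlier value of k. Keeping all of them as synchronization edges is the kind of overestimate that the earlier slide calls safe, albeit slow.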
Task execution spaces
    GEQRT(k)      k = 0..((mt < nt) ? mt-1 : nt-1)
    TSQRT(k,m)    k = 0..((mt < nt) ? mt-1 : nt-1), m = k+1..mt-1
    UNMQR(k,n)    k = 0..((mt < nt) ? mt-1 : nt-1), n = k+1..nt-1
    TSMQR(k,m,n)  k = 0..((mt < nt) ? mt-1 : nt-1), m = k+1..mt-1, n = k+1..nt-1

Edges between tasks (Omega relations)
    GEQRT -> TSQRT   {[k]->[k,k+1] : mt >= (k+2)}
    GEQRT -> UNMQR   {[k]->[k,n] : k < n < nt && k < nt-1}
    TSQRT -> TSQRT   {[k,m]->[k,m+1] : m<mt-1}
    TSQRT -> TSMQR   {[k,m]->[k,m,n] : k<nt-1 && k<n<nt}
    UNMQR -> TSMQR   {[k,n]->[k,k+1,n] : k < mt-1}
    TSMQR -> GEQRT   {[k,m,n]->[n] : k+1==n && k+1==m}
    TSMQR -> TSQRT   {[k,m,n]->[n,m] : n==(k+1) && m>n}
    TSMQR -> UNMQR   {[k,m,n]->[k+1,n] : m==k+1 && n>m}
    TSMQR -> TSMQR   {[k,m,n]->[k+1,m,n] : n>1+k && m>1+k}
    TSMQR -> TSMQR   {[k,m,n]->[k,m+1,n] : m<mt-1}

Anti-dependency
    TSMQR -> GEQRT   anti-dep: {[k,m,n] -> [k'] : n=m=k'}
Finalizing Anti-deps
Transitive Closure is undecidable!
Current/Future Work Can we address non-affine codes?
Example: Reduction Operation
• Reduction: apply a user-defined operator to each data element and store the result in a single location. (Suppose the operator is associative and commutative.)

    for (s = 1; s < N/2; s = 2*s)
        for (i = 0; i < N-s; i += 2*s)
            V[i] = op(V[i], V[i+s])

Issue: non-affine loops lead to non-polyhedral array accesses
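The strided loop can be tried out directly. The sketch below is illustrative only: `tree_reduce`, `binop` and `add_op` are hypothetical names, and the outer bound is written as `s < n`, an assumption that differs from the slide's `s < N/2`, so that the last level also runs and V[0] ends up holding the full result.

```c
#include <assert.h>

typedef int (*binop)(int, int);

/* Example operator: integer addition is associative and commutative. */
int add_op(int a, int b) { return a + b; }

/* In-place tree reduction over V[0..n-1]; the result is left in V[0].
 * The stride s doubles at every level, which is why neither the loop bounds
 * nor the accesses V[i], V[i+s] are affine functions of a unit-step counter. */
void tree_reduce(int *V, int n, binop op)
{
    assert(n > 0);
    for (int s = 1; s < n; s *= 2)              /* assumed bound, see lead-in */
        for (int i = 0; i < n - s; i += 2 * s)
            V[i] = op(V[i], V[i + s]);
}
```

Calling `tree_reduce(v, 7, add_op)` on the values 1 to 7 leaves their sum, 28, in `v[0]`.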
Example: Reduction Operation

    reduce(l, p)
      : V(p)
      l = 1 .. depth+1
      p = 0 .. (MT / (1<<l))

      RW   A <- (1 == l) ? V(2*p) : A reduce( l-1, 2*p )
             -> ((depth+1) == l) ? V(0)
             -> (0 == (p%2)) ? A reduce(l+1, p/2) : B reduce(l+1, p/2)
      READ B <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0)
             <- (1 == l) ? V(2*p+1)
             <- (1 != l) ? A reduce( l-1, p*2+1 )
    BODY
      operator(A, B);
    END

[Figure: diagram labeled 0, 1, 2, 3]

Current Solution: hand-writing the data dependencies using the intermediate Data Flow representation
Handling Reduction

    for (k = 0; k < NT; k++) {
        for (i = 1; i < NT/2; i *= 2) {
            for (j = 0; j < NT-i; j += 2*i) {
                Task_Rdc: A[k][j] += A[k][j+i];
            }
        }
    }

Loop Canonicalization

    for (k = 0; k < NT; k++) {
        for (ii = log(1); ii < log(NT/2); ii++) {
            i = 2**ii
            for (jj = 0; jj < (NT-i)/(2*i); jj++) {
                j = jj*2*i;
                Task_Rdc: A[k][j] += A[k][j+i];
            }
        }
    }
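The canonicalized bounds on the slide are real-valued expressions. As a hedged sketch, the code below assumes they are rounded up (base-2 logarithm and division both taken as ceilings) and checks that the unit-step loops then visit the same (i, j) pairs in the same order as the original loops. The outer k loop is omitted because canonicalization leaves it unchanged. All helper names (`ceil_log2`, `ceil_div`, `Pair`, `same_iteration_space`) are hypothetical.

```c
#include <assert.h>

#define MAX_PAIRS 4096

typedef struct { int i, j; } Pair;

/* Smallest c such that 2**c >= h, for h >= 1. */
static int ceil_log2(int h) { int c = 0; while ((1 << c) < h) c++; return c; }

/* Ceiling of a / b for a >= 0 and b > 0. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Iterations of the original, non-affine loops. */
int original_pairs(int NT, Pair *out)
{
    int count = 0;
    for (int i = 1; i < NT / 2; i *= 2)
        for (int j = 0; j < NT - i; j += 2 * i)
            out[count++] = (Pair){ i, j };
    return count;
}

/* Iterations of the canonicalized loops: unit-step counters ii and jj. */
int canonical_pairs(int NT, Pair *out)
{
    int count = 0;
    for (int ii = 0; ii < ceil_log2(NT / 2); ii++) {          /* log(1) == 0 */
        int i = 1 << ii;                                       /* i = 2**ii */
        for (int jj = 0; jj < ceil_div(NT - i, 2 * i); jj++) {
            int j = jj * 2 * i;
            out[count++] = (Pair){ i, j };
        }
    }
    return count;
}

/* Returns 1 when both loop nests produce the same sequence of (i, j). */
int same_iteration_space(int NT)
{
    static Pair a[MAX_PAIRS], b[MAX_PAIRS];
    assert(NT >= 2 && NT <= 1024);
    int na = original_pairs(NT, a);
    int nb = canonical_pairs(NT, b);
    if (na != nb) return 0;
    for (int p = 0; p < na; p++)
        if (a[p].i != b[p].i || a[p].j != b[p].j) return 0;
    return 1;
}
```

The rounding matters: with truncating integer division, NT = 8 and i = 1 would give (NT-i)/(2*i) = 3 and skip the iteration j = 6 that the original loop executes.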
Handling Reduction

    for (k = 0; k < NT; k++) {
        for (ii = log(1); ii < log(NT/2); ii++) {
            i = 2**ii
            for (jj = 0; jj < (NT-i)/(2*i); jj++) {
                j = jj*2*i;
                Task_Rdc: A[k][j] = A[k][j] + A[k][j+i];
            }
        }
    }

For an edge to exist, one of the following must hold:
    j = j'+i'  =>  jj' = (jj * 2**ii) / (2**ii') - 1/2
    OR
    j = j'     =>  jj' = (jj * 2**ii) / (2**ii')

Hard to solve statically.

But a given (source) task has fixed {ii, jj}, so jj * 2**ii is a constant C:
    jj' = C / (2**ii') - 1/2
    jj' = C / (2**ii')
Handling Reduction

Finding a destination task means finding integers that satisfy either equation, where C = jj * 2**ii is fixed by the source task:
    jj' = C / (2**ii') - 1/2
    jj' = C / (2**ii')

Run-time upper bound for cost: log(NT/2)
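One way such a run-time search could look is sketched below; it is an assumption-laden illustration, not DAGuE's implementation. For a source task (ii, jj) it scans the later levels ii' and tests the two equations in integer form, multiplied through by 2 * 2**ii' to avoid fractions. Loop bounds are assumed to round up, and `find_destination`, `ceil_log2` and `ceil_div` are hypothetical names.

```c
#include <assert.h>

/* Smallest c such that 2**c >= h, for h >= 1. */
static int ceil_log2(int h) { int c = 0; while ((1 << c) < h) c++; return c; }

/* Ceiling of a / b for a >= 0 and b > 0. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Search for the task (ii', jj') that reads the element written by the
 * source task (ii, jj). Returns 1 and fills the outputs when a destination
 * exists, 0 otherwise. At most one iteration per remaining level. */
int find_destination(int NT, int ii, int jj, int *iip_out, int *jjp_out)
{
    assert(NT >= 2 && ii >= 0 && jj >= 0);
    int levels = ceil_log2(NT / 2);       /* assumed upper bound of the ii loop */
    int j = jj * (2 << ii);               /* j = jj * 2 * 2**ii, i.e. 2*C */

    for (int iip = ii + 1; iip < levels; iip++) {
        int ip = 1 << iip;                /* i' = 2**ii' */
        int jjp = -1;
        if (j % (2 * ip) == 0)            /* j = j'      <=> jj' = C / 2**ii' */
            jjp = j / (2 * ip);
        else if (j >= ip && (j - ip) % (2 * ip) == 0)
            jjp = (j - ip) / (2 * ip);    /* j = j' + i' <=> jj' = C / 2**ii' - 1/2 */

        /* jj' must also lie inside the iteration space of level ii'. */
        if (jjp >= 0 && jjp < ceil_div(NT - ip, 2 * ip)) {
            *iip_out = iip;
            *jjp_out = jjp;
            return 1;
        }
    }
    return 0;
}
```

For NT = 8 there are two levels. The source (ii=0, jj=1) writes element j = 2, which task (ii'=1, jj'=0) reads as its second operand, while a source on the last level, such as (ii=1, jj=0), has no destination task.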