
Automatic task generation for DAGuE

Automatic task generation for DAGuE. http://icl.utk.edu/dague. George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra. The DAGuE system. DAGuE Compiler: Serial Code to Dataflow Representation. Example: QR Factorization.

Presentation Transcript


  1. Automatic task generation for DAGuE
     http://icl.utk.edu/dague
     George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra

  2. The DAGuE system

  3. DAGuE Compiler: Serial Code to Dataflow Representation

  4. Example: QR Factorization

  5. Input Format – Quark (PLASMA)

     for (k = 0; k < MT; k++) {
         Insert_Task( zgeqrt, A[k][k], INOUT,
                              T[k][k], OUTPUT);
         for (m = k+1; m < MT; m++) {
             Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D | REGION_U,
                                  A[m][k], INOUT | LOCALITY,
                                  T[m][k], OUTPUT);
         }
         for (n = k+1; n < NT; n++) {
             Insert_Task( zunmqr, A[k][k], INPUT | REGION_L,
                                  T[k][k], INPUT,
                                  A[k][n], INOUT);
             for (m = k+1; m < MT; m++) {
                 Insert_Task( ztsmqr, A[k][n], INOUT,
                                      A[m][n], INOUT | LOCALITY,
                                      A[m][k], INPUT,
                                      T[m][k], INPUT);
             }
         }
     }

     • Sequential C code
     • Annotated through QUARK-specific syntax:
         • Insert_Task
         • INOUT, OUTPUT, INPUT
         • REGION_L, REGION_U, REGION_D, …
         • LOCALITY
     • StarPU syntax in progress
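To make the insertion order concrete, here is a minimal sketch in plain C that replays the loop nest above and counts how many instances of each kernel would be inserted. It rests on a few assumptions:

- `Insert_Task` is replaced by hypothetical counters, and no QUARK or PLASMA headers are used.
- The outer loop is bounded by min(MT, NT), as in the GEQRT execution space of the Job Data Flow shown later. This keeps the diagonal tile A[k][k] in range.

```c
#include <assert.h>

/* Hypothetical per-kernel counters standing in for QUARK's Insert_Task. */
typedef struct {
    int geqrt, tsqrt, unmqr, tsmqr;
} qr_task_counts;

/* Replay the sequential tiled-QR loop nest on an MT x NT grid of tiles and
 * count the task instances that would be inserted. The outer loop stops at
 * min(MT, NT) so that the diagonal tile A[k][k] always exists. */
qr_task_counts count_qr_tasks(int MT, int NT)
{
    assert(MT >= 0 && NT >= 0);
    qr_task_counts c = {0, 0, 0, 0};
    int KT = (MT < NT) ? MT : NT;
    for (int k = 0; k < KT; k++) {
        c.geqrt++;                          /* zgeqrt on A[k][k]           */
        for (int m = k + 1; m < MT; m++)
            c.tsqrt++;                      /* ztsqrt on A[k][k], A[m][k]  */
        for (int n = k + 1; n < NT; n++) {
            c.unmqr++;                      /* zunmqr on A[k][n]           */
            for (int m = k + 1; m < MT; m++)
                c.tsmqr++;                  /* ztsmqr on A[k][n], A[m][n]  */
        }
    }
    return c;
}
```

For a 3 x 3 tile grid this yields 3 GEQRT, 3 TSQRT, 3 UNMQR and 5 TSMQR instances. These are the nodes whose dependencies the DAGuE compiler has to derive.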

  6. DAGuE Compiler analysis steps (applied to the QR loop nest of slide 5)
     • Record all USEs
     • Record all DEFinitions
     • Formulate as Omega Relations:
         • all true (flow) dependencies
         • all output dependencies
     • Compute the differences
     • Formulate all anti-dependencies
     • Finalize synchronization edges
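The classification behind these steps can be illustrated at the level of a single pair of accesses to the same tile. The minimal sketch below takes the access mode of an earlier task and of a later task and reports which kinds of dependence exist. The modes are modeled on QUARK's INPUT/OUTPUT/INOUT flags but redefined here as a hypothetical enum. The Omega-based analysis computes the same information symbolically for whole loop nests rather than pair by pair.

```c
#include <assert.h>

/* Hypothetical access modes mirroring QUARK's INPUT / OUTPUT / INOUT flags. */
enum access_mode { MODE_INPUT = 1, MODE_OUTPUT = 2, MODE_INOUT = 3 };

/* Dependence kinds, combinable as a bit set. */
enum dep_kind { DEP_NONE = 0, DEP_FLOW = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

static int reads(enum access_mode m)  { return (m & MODE_INPUT) != 0; }
static int writes(enum access_mode m) { return (m & MODE_OUTPUT) != 0; }

/* Classify the dependences from an earlier access to a later access of the
 * same data:
 *   flow   (RAW): earlier writes, later reads
 *   anti   (WAR): earlier reads,  later writes
 *   output (WAW): earlier writes, later writes */
int classify(enum access_mode earlier, enum access_mode later)
{
    assert(earlier >= MODE_INPUT && earlier <= MODE_INOUT);
    assert(later >= MODE_INPUT && later <= MODE_INOUT);
    int deps = DEP_NONE;
    if (writes(earlier) && reads(later))  deps |= DEP_FLOW;
    if (reads(earlier)  && writes(later)) deps |= DEP_ANTI;
    if (writes(earlier) && writes(later)) deps |= DEP_OUTPUT;
    return deps;
}
```

Consider two consecutive ztsmqr calls on the same A[m][n], which is INOUT followed by INOUT. That pair carries all three kinds of dependence at once. This is why flow, output and anti-dependencies are computed separately before being merged into synchronization edges.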

  7. Traditional Compiler (Control Flow)
     [Figure: the QR loop nest of slide 5 beside its Control Flow Graph, with nodes for(k…, for(m…, for(n… and for(m…]

  8. Data Flow imposes ordering (on the QR loop nest of slide 5)

  9. Dataflow Analysis
     • Example on task DGEQRT of QR

     for k = 0 .. N-1
         A[k][k], T[k][k] <- GEQRT( A[k][k] )
         for m = k+1 .. N-1
             A[k][k]|U, A[m][k], T[m][k] <- TSQRT( A[k][k]|U, A[m][k], T[m][k] )
         for n = k+1 .. N-1
             A[k][n] <- UNMQR( A[k][k]|L, T[k][k], A[k][n] )
             for m = k+1 .. N-1
                 A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )

     [Figure: incoming data (from MEM) and outgoing data of GEQRT, labeled k=0, n=k+1, m=k+1]
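For a small N, the question asked here about GEQRT (where its input A[k][k] comes from) can be answered by brute force. The approach is to replay the loop nest and remember the last task that wrote each tile. The sketch below does exactly that. The helper, the `task_id` encoding and the `MAX_N` limit are hypothetical and not part of DAGuE. The polyhedral analysis on the following slides derives the same answer symbolically for every N.

```c
#include <assert.h>

/* Identifies a task instance: 'G' GEQRT, 'Q' TSQRT, 'U' UNMQR, 'S' TSMQR,
 * 'M' = initial value read from memory. Unused indices are -1. */
typedef struct { char kind; int k, m, n; } task_id;

#define MAX_N 16

/* Replay the QR loop nest for an N x N tile grid and return the task that
 * last wrote A[kk][kk] before GEQRT(kk) runs. Only writers of the A tiles
 * are tracked; T tiles are ignored. */
task_id geqrt_input_source(int N, int kk)
{
    assert(N <= MAX_N && 0 <= kk && kk < N);
    task_id last[MAX_N][MAX_N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            last[i][j] = (task_id){ 'M', -1, i, j };      /* initial data */

    for (int k = 0; k < N; k++) {
        if (k == kk)
            return last[k][k];                  /* GEQRT(k) reads A[k][k] */
        last[k][k] = (task_id){ 'G', k, -1, -1 };
        for (int m = k + 1; m < N; m++) {
            last[k][k] = (task_id){ 'Q', k, m, -1 };    /* TSQRT: A[k][k]|U */
            last[m][k] = (task_id){ 'Q', k, m, -1 };    /* and A[m][k]      */
        }
        for (int n = k + 1; n < N; n++) {
            last[k][n] = (task_id){ 'U', k, -1, n };    /* UNMQR: A[k][n]   */
            for (int m = k + 1; m < N; m++) {
                last[k][n] = (task_id){ 'S', k, m, n }; /* TSMQR: A[k][n]   */
                last[m][n] = (task_id){ 'S', k, m, n }; /* and A[m][n]      */
            }
        }
    }
    return (task_id){ 'M', -1, -1, -1 };   /* not reached for a valid kk */
}
```

With N = 4, GEQRT(0) reads from memory, and GEQRT(k) for k > 0 reads the tile last written by TSMQR(k-1, k, k).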

  10. Dataflow Analysis (loop nest of slide 9)
     • Example on task DGEQRT of QR
     • Polyhedral Analysis through Omega
     • Compute algebraic expressions for:
         • Source and destination tasks
         • Necessary conditions for that data flow to exist
     [Figure: incoming and outgoing data of GEQRT, labeled MEM, k=0, k=SIZE-1, U, L, n=k+1, m=k+1]

  11. Intermediate Representation: Job Data Flow

     GEQRT(k)
       /* Execution space */
       k = 0 .. ((MT < NT) ? MT-1 : NT-1)
       /* Locality */
       : A(k, k)
       RW    A <- (k == 0)    ? A(k, k) : A1 TSMQR(k-1, k, k)
               -> (k < NT-1)  ? A UNMQR(k, k+1 .. NT-1)  [type = LOWER]
               -> (k < MT-1)  ? A1 TSQRT(k, k+1)         [type = UPPER]
               -> (k == MT-1) ? A(k, k)                  [type = UPPER]
       WRITE T <- T(k, k)
               -> T(k, k)
               -> (k < NT-1)  ? T UNMQR(k, k+1 .. NT-1)
       /* Priority */
       ;(NT-k)*(NT-k)*(NT-k)
     BODY
       zgeqrt( A, T )
     END

     Control flow is eliminated: only the data flow constrains execution order, therefore maximum parallelism is possible.
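The JDF can be read as an executable rule. Once GEQRT(k) completes, the successors of its A flow are obtained by evaluating the guards above for that particular k. The sketch below hard-codes the three outgoing A edges of GEQRT(k). The helper name and the `succ` encoding are hypothetical, and the caller's `out` array must hold at least NT entries.

```c
#include <assert.h>

/* A successor reference: kind 'U' = UNMQR(k, x), 'Q' = TSQRT(k, x),
 * 'W' = write back to the memory tile A(k, k). */
typedef struct { char kind; int k, x; } succ;

/* Evaluate the outgoing A flows of GEQRT(k) from the JDF:
 *   -> (k <  NT-1) ? A UNMQR(k, k+1 .. NT-1)   [LOWER part]
 *   -> (k <  MT-1) ? A1 TSQRT(k, k+1)          [UPPER part]
 *   -> (k == MT-1) ? A(k, k)                   [UPPER part]
 * Returns the number of successors written into out[]. */
int geqrt_a_successors(int k, int MT, int NT, succ *out)
{
    int kmax = (MT < NT) ? MT - 1 : NT - 1;
    assert(0 <= k && k <= kmax);                 /* execution space */
    int cnt = 0;
    if (k < NT - 1)
        for (int n = k + 1; n <= NT - 1; n++)
            out[cnt++] = (succ){ 'U', k, n };
    if (k < MT - 1)
        out[cnt++] = (succ){ 'Q', k, k + 1 };
    if (k == MT - 1)
        out[cnt++] = (succ){ 'W', k, k };
    return cnt;
}
```

On a 3 x 3 tile grid, GEQRT(0) feeds UNMQR(0,1), UNMQR(0,2) and TSQRT(0,1). The last GEQRT(2) only writes its tile back to memory.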

  12. Intermediate Representation: Job Data Flow (same GEQRT(k) JDF as slide 11)

  13. Dataflow Analysis (loop nest of slide 9)
     USE (GEQRT):  A[k'][k'] : 0 <= k' <= N-1
     DEF (TSMQR):  A[m][n]   : k+1 <= m <= N-1,  k+1 <= n <= N-1,  0 <= k < N-1

  14. Dataflow Analysis (loop nest of slide 9)
     USE (GEQRT):  A[k'][k'] : 0 <= k' <= N-1
     DEF (TSMQR):  A[m][n]   : k+1 <= m <= N-1,  k+1 <= n <= N-1,  0 <= k < N-1
     Ctrl Flow:    k' > k

  15. Dataflow Analysis (loop nest of slide 9)
     Flow Dependency (RAW) Relation:
     {[k,m,n] -> [k'] : k+1 <= m <= N-1 && k+1 <= n <= N-1 && 0 <= k < N-1
                        && 0 <= k' <= N-1 && k < k' && m = k' && n = k'}
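For a fixed N the relation above is a finite set of (TSMQR, GEQRT) pairs, so it can be enumerated directly. The minimal sketch below (a hypothetical helper) counts the tuples [k,m,n] -> [k'] that satisfy every constraint of the relation.

```c
#include <assert.h>

/* Count the tuples of the flow (RAW) relation
 *   {[k,m,n] -> [k'] : k+1 <= m <= N-1 && k+1 <= n <= N-1 && 0 <= k < N-1
 *                      && 0 <= k' <= N-1 && k < k' && m = k' && n = k'}
 * by brute force over the iteration spaces. */
int count_flow_pairs(int N)
{
    assert(N >= 0);
    int count = 0;
    for (int k = 0; k < N - 1; k++)
        for (int m = k + 1; m <= N - 1; m++)
            for (int n = k + 1; n <= N - 1; n++)
                for (int kp = 0; kp <= N - 1; kp++)
                    if (k < kp && m == kp && n == kp)
                        count++;
    return count;
}
```

The count grows as N(N-1)/2 because GEQRT(k') is paired with every earlier TSMQR that touched A[k'][k'], not only the last one. Removing the pairs that are overwritten in between is the role of the output dependencies in the next steps.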

  16. Dataflow Analysis (loop nest of slide 9)
     Omega Simplified: {[k,m,m] -> [m] : k+1, 0 <= m < N}

  17. Dataflow Analysis (loop nest of slide 9)
     Omega Simplified: {[k,m,m] -> [m] : k+1, 0 <= m < N}
     Output Dependency (WAW)

  18. Dataflow Analysis (loop nest of slide 9)
     Real Edge: Flow - Output = {[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}
     [Figure labels: n=k+1, m=k+1]
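The subtraction can be reproduced on a finite instance. In this loop nest, only TSMQR writes A[k'][k'] before GEQRT(k') reads it. A flow pair TSMQR(k,k',k') -> GEQRT(k') is therefore dropped whenever a later TSMQR(k2,k',k') with k < k2 < k' overwrites the tile in between, which is an output dependence. The sketch below (a hypothetical helper; `src` and `dst` must hold N entries) keeps only the surviving pairs. They can then be compared with the closed form {[k,k+1,k+1] -> [k+1]}.

```c
#include <assert.h>

/* Enumerate Flow - Output for the GEQRT input A[k'][k']: a flow pair
 * TSMQR(k,k',k') -> GEQRT(k') survives only if no TSMQR(k2,k',k') with
 * k < k2 < k' rewrites A[k'][k'] in between. Surviving pairs are stored
 * as k in src[] and k' in dst[]; the function returns how many. */
int real_edges(int N, int *src, int *dst)
{
    assert(N >= 0);
    int cnt = 0;
    for (int kp = 0; kp <= N - 1; kp++)
        for (int k = 0; k < kp; k++) {
            int killed = 0;
            for (int k2 = k + 1; k2 < kp && !killed; k2++)
                /* TSMQR(k2, kp, kp) exists when k2+1 <= kp <= N-1 */
                if (k2 + 1 <= kp && kp <= N - 1)
                    killed = 1;
            if (!killed) {
                src[cnt] = k;
                dst[cnt] = kp;
                cnt++;
            }
        }
    return cnt;
}
```

For N = 4 three edges survive, and each has the form TSMQR(k, k+1, k+1) -> GEQRT(k+1).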

  19. Dataflow Analysis (loop nest of slide 9)
     Real Edge: Flow - Output = {[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}
     GEQRT's incoming edge: {[k-1,k,k] -> [k] : 0 < k <= N-1}
     [Figure labels: n=k+1, m=k+1]

  20. Dataflow Analysis (loop nest of slide 9)
     Real Edge: Flow - Output = {[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}
     GEQRT's incoming edge: {[k-1,k,k] -> [k] : 0 < k <= N-1}
     (k > 0) ? TSMQR(k-1, k, k)
     [Figure labels: n=k+1, m=k+1]

  21. Dataflow Analysis (loop nest of slide 9)
     GEQRT -> UNMQR: {[k] -> [k,n] : 0 <= k <= N-2 && k+1 <= n <= N-1}

  22. Dataflow Analysis (loop nest of slide 9)
     GEQRT -> UNMQR: {[k] -> [k,n] : 0 <= k <= N-2 && k+1 <= n <= N-1}
     -> (k < N-1) ? UNMQR(k, k+1..N-1)

  23. Anti-dependencies
     • In theory, anti-deps do not matter in distributed memory
     • But real machines are distributed/shared memory hybrids
     • Anti-deps must create synchronization edges
     • Overestimating anti-deps is safe (albeit slow)
     • Output deps should be treated the same … in theory

  24. Anti-dependencies in QR? (loop nest of slide 9)

  25. Anti-dependencies in QR? (loop nest of slide 9)
     [Figure labels: n=k+1, m=k+1]

  26. Anti-dependencies in QR? (loop nest of slide 9)
     TSMQR -> GEQRT: {[k,m,n] -> [k'] : n = m = k'}

  27. Anti-dependencies in QR? (loop nest of slide 9)
     TSMQR -> GEQRT: {[k,m,n] -> [k'] : n = m = k'}
     [Figure labels: n=k+1, m=k+1]

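  28. [Figure: QR task-class graph. Nodes are the task classes with their execution spaces; edges carry the dependence relations.]
     Task classes:
     • GEQRT(k):     k = 0..((mt < nt) ? mt-1 : nt-1)
     • TSQRT(k,m):   k = 0..((mt < nt) ? mt-1 : nt-1),  m = k+1..mt-1
     • UNMQR(k,n):   k = 0..((mt < nt) ? mt-1 : nt-1),  n = k+1..nt-1
     • TSMQR(k,m,n): k = 0..((mt < nt) ? mt-1 : nt-1),  m = k+1..mt-1,  n = k+1..nt-1
     Edges:
     • GEQRT -> TSQRT: {[k]->[k,k+1] : mt >= (k+2)}
     • GEQRT -> UNMQR: {[k]->[k,n] : k < n < nt && k < nt-1}
     • TSQRT -> TSQRT: {[k,m]->[k,m+1] : m < mt-1}
     • TSQRT -> TSMQR: {[k,m]->[k,m,n] : k < nt-1 && k < n < nt}
     • UNMQR -> TSMQR: {[k,n]->[k,k+1,n] : k < mt-1}
     • TSMQR -> GEQRT: {[k,m,n]->[n] : k+1 == n && k+1 == m}
     • TSMQR -> TSQRT: {[k,m,n]->[n,m] : n == (k+1) && m > n}
     • TSMQR -> UNMQR: {[k,m,n]->[k+1,n] : m == k+1 && n > m}
     • TSMQR -> TSMQR: {[k,m,n]->[k+1,m,n] : n > 1+k && m > 1+k}
     • TSMQR -> TSMQR: {[k,m,n]->[k,m+1,n] : m < mt-1}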
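Each edge relation can be evaluated for a concrete task instance to list its successors. The sketch below applies the five relations that leave TSMQR(k,m,n) in the graph above. These are identified by the arity of their source tuple. The helper name and the `task` encoding are hypothetical.

```c
#include <assert.h>

/* A task instance: kind 'G' GEQRT(a), 'Q' TSQRT(a,b), 'U' UNMQR(a,b),
 * 'S' TSMQR(a,b,c). Unused indices are -1. */
typedef struct { char kind; int a, b, c; } task;

/* Successors of TSMQR(k,m,n) according to the edge relations:
 *   {[k,m,n]->[n]      : k+1==n && k+1==m}   TSMQR -> GEQRT
 *   {[k,m,n]->[n,m]    : n==k+1 && m>n}      TSMQR -> TSQRT
 *   {[k,m,n]->[k+1,n]  : m==k+1 && n>m}      TSMQR -> UNMQR
 *   {[k,m,n]->[k+1,m,n]: n>1+k && m>1+k}     TSMQR -> TSMQR (next k)
 *   {[k,m,n]->[k,m+1,n]: m<mt-1}             TSMQR -> TSMQR (next m)
 * Returns the number of successors written into out[] (at most 5). */
int tsmqr_successors(int k, int m, int n, int mt, int nt, task *out)
{
    assert(k + 1 <= m && m <= mt - 1 && k + 1 <= n && n <= nt - 1);
    int cnt = 0;
    if (k + 1 == n && k + 1 == m) out[cnt++] = (task){ 'G', n, -1, -1 };
    if (n == k + 1 && m > n)      out[cnt++] = (task){ 'Q', n, m, -1 };
    if (m == k + 1 && n > m)      out[cnt++] = (task){ 'U', k + 1, n, -1 };
    if (n > 1 + k && m > 1 + k)   out[cnt++] = (task){ 'S', k + 1, m, n };
    if (m < mt - 1)               out[cnt++] = (task){ 'S', k, m + 1, n };
    return cnt;
}
```

On a 3 x 3 tile grid, TSMQR(0,1,1) feeds GEQRT(1) and TSMQR(0,2,1). TSMQR(0,2,1) feeds TSQRT(1,2).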

  29. [Figure: same task-class graph as slide 28, with one more edge from TSMQR to GEQRT]
     anti-dep: {[k,m,n] -> [k'] : n = m = k'}

  30. [Figure: same task-class graph as slide 28, with one more edge from TSMQR to GEQRT]
     anti-dep: {[k,m,n] -> [k'] : n = m = k'}

  31. [Figure: same task-class graph as slide 28]

  32. [Figure: same task-class graph as slide 28]

  33. Finalizing Anti-deps

  34. Finalizing Anti-deps

  35. Finalizing Anti-deps
     Transitive Closure is undecidable!

  36. IsDescendantOf()
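The name IsDescendantOf() suggests a descendant (reachability) test. On a finite, fully enumerated instance of the DAG, deciding whether an anti-dependence is already enforced reduces to graph reachability: if the writer is already a descendant of the reader through existing edges, no extra synchronization edge is needed. The sketch below is a plain iterative depth-first search over a hypothetical adjacency matrix. It is not DAGuE's implementation, which has to reason about symbolic task classes rather than an enumerated graph.

```c
#include <assert.h>

#define MAX_TASKS 32

/* Return 1 if node `target` is reachable from node `src` by following edges
 * of the DAG stored in adj (adj[u][v] != 0 means an edge u -> v), i.e. if
 * target is a descendant of src. Iterative DFS with an explicit stack; each
 * node is pushed at most once. */
int is_descendant_of(int target, int src, int ntasks,
                     unsigned char adj[MAX_TASKS][MAX_TASKS])
{
    assert(0 < ntasks && ntasks <= MAX_TASKS);
    unsigned char seen[MAX_TASKS] = {0};
    int stack[MAX_TASKS], top = 0;
    stack[top++] = src;
    seen[src] = 1;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < ntasks; v++) {
            if (adj[u][v] && !seen[v]) {
                if (v == target)
                    return 1;
                seen[v] = 1;
                stack[top++] = v;
            }
        }
    }
    return 0;
}
```

Consider the QR anti-dependence TSMQR(k,k',k') -> GEQRT(k'). The flow edges of slide 28 already form the chain TSMQR(k,k',k') -> TSMQR(k+1,k',k') -> … -> TSMQR(k'-1,k',k') -> GEQRT(k'). A test of this kind would therefore find GEQRT(k') to be a descendant and report the anti-dependence as already covered.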

  37. Current/Future Work
     Can we address non-affine codes?

  38. Example: Reduction Operation
     • Reduction: apply a user-defined operator to all data elements and store the result in a single location (assuming the operator is associative and commutative).

  39. Example: Reduction Operation (as defined in slide 38)

     for (s = 1; s < N; s = 2*s)
         for (i = 0; i < N-s; i += 2*s)
             V[i] = op(V[i], V[i+s]);

     Issue: non-affine loops (the stride s doubles at every step) lead to non-polyhedral array accesses
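Below is a runnable version of the reduction loop. Integer addition stands in for the user-defined `op`; any associative and commutative operator would do, and the function name is hypothetical. It shows that after the last pass V[0] holds the combination of all N elements, including when N is not a power of two.

```c
#include <assert.h>

/* Stand-in for the user-defined associative, commutative operator. */
static long op(long a, long b) { return a + b; }

/* In-place tree reduction of V[0..N-1]; the result ends up in V[0].
 * At stride s, element i absorbs element i+s for every i that is a multiple
 * of 2*s; s doubles each pass, which is what makes the loop non-affine. */
long tree_reduce(long *V, int N)
{
    assert(N >= 1);
    for (int s = 1; s < N; s = 2 * s)
        for (int i = 0; i < N - s; i += 2 * s)
            V[i] = op(V[i], V[i + s]);
    return V[0];
}
```

For V = {1, …, 8} the passes use s = 1, 2, 4, and V[0] ends as 36.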

  40. Example: Reduction Operation

     reduce(l, p) : V(p)
       l = 1 .. depth+1
       p = 0 .. (MT / (1<<l))
       RW   A <- (1 == l) ? V(2*p) : A reduce( l-1, 2*p )
              -> ((depth+1) == l) ? V(0)
              -> (0 == (p%2)) ? A reduce(l+1, p/2) : B reduce(l+1, p/2)
       READ B <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0)
              <- (1 == l) ? V(2*p+1)
              <- (1 != l) ? A reduce( l-1, p*2+1 )
     BODY
       operator(A, B);
     END

     [Figure: reduction tree with levels 0, 1, 2, 3]
     Current Solution: hand-writing the data dependencies in the intermediate Data Flow representation

  41. Handling Reduction

     for (k = 0; k < NT; k++) {
         for (i = 1; i < NT; i *= 2) {
             for (j = 0; j < NT-i; j += 2*i) {
                 Task_Rdc: A[k][j] += A[k][j+i];
             }
         }
     }

  42. Handling Reduction (original loop nest as in slide 41)

     Loop Canonicalization:
     for (k = 0; k < NT; k++) {
         for (ii = log(1); ii < log(NT); ii++) {
             i = 2**ii;
             for (jj = 0; jj < (NT-i)/(2*i); jj++) {
                 j = jj*2*i;
                 Task_Rdc: A[k][j] += A[k][j+i];
             }
         }
     }
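The canonicalized nest must visit exactly the same (i, j) pairs as the original one. In integer C arithmetic, the pseudocode's log(NT) and (NT-i)/(2*i) need care. The sketch below therefore makes two choices of its own, which are assumptions and not part of the slides:

- ii is iterated while 2**ii < NT.
- The jj bound is rounded up.

Both nests are then compared through an iteration count and a simple checksum (hypothetical helpers).

```c
#include <assert.h>

/* Original non-affine nest: i doubles, j steps by 2*i. Returns a checksum of
 * the visited (i, j) pairs and stores the iteration count in *iters. */
long visit_original(int NT, int *iters)
{
    assert(NT >= 0);
    long sum = 0;
    *iters = 0;
    for (int i = 1; i < NT; i *= 2)
        for (int j = 0; j < NT - i; j += 2 * i) {
            sum += (long)i * 1000 + j;
            (*iters)++;
        }
    return sum;
}

/* Canonical nest: unit-stride counters ii and jj, with i = 2**ii and
 * j = jj*2*i recomputed from them. The jj bound is ceil((NT-i) / (2*i)). */
long visit_canonical(int NT, int *iters)
{
    assert(NT >= 0);
    long sum = 0;
    *iters = 0;
    for (int ii = 0; (1 << ii) < NT; ii++) {
        int i = 1 << ii;
        int jj_end = (NT - i + 2 * i - 1) / (2 * i);   /* ceiling division */
        for (int jj = 0; jj < jj_end; jj++) {
            int j = jj * 2 * i;
            sum += (long)i * 1000 + j;
            (*iters)++;
        }
    }
    return sum;
}
```

Both loop nests now have unit-stride counters in the canonical form, so ii and jj are the natural task coordinates for the dependence equations that follow.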

  43. Handling Reduction (same as slide 42)

  44. Handling Reduction

     for (k = 0; k < NT; k++) {
         for (ii = log(1); ii < log(NT); ii++) {
             i = 2**ii;
             for (jj = 0; jj < (NT-i)/(2*i); jj++) {
                 j = jj*2*i;
                 Task_Rdc: A[k][j] = A[k][j] + A[k][j+i];
             }
         }
     }

  45. Handling Reduction (canonical loop nest of slide 44)
     For an edge to exist it must be:
         j = j' + i'  =>  jj' = (jj * 2**ii) / (2**ii') - 1/2
     OR
         j = j'       =>  jj' = (jj * 2**ii) / (2**ii')
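Both conditions follow from the canonical indices. A source task (ii, jj) writes A[k][j] with j = jj · 2 · 2^{ii}. A later task (ii', jj') reads A[k][j'] and A[k][j'+i'], where i' = 2^{ii'} and j' = jj' · 2 · 2^{ii'}. Equating the written index with either read index and solving for jj' gives:

```latex
\begin{aligned}
j &= jj \cdot 2^{ii+1}, \qquad j' = jj' \cdot 2^{ii'+1}, \qquad i' = 2^{ii'} \\[4pt]
j = j' + i' &\;\Longrightarrow\; jj \cdot 2^{ii+1} = jj' \cdot 2^{ii'+1} + 2^{ii'}
  \;\Longrightarrow\; jj' = \frac{jj \cdot 2^{ii}}{2^{ii'}} - \frac{1}{2} \\[4pt]
j = j' &\;\Longrightarrow\; jj \cdot 2^{ii+1} = jj' \cdot 2^{ii'+1}
  \;\Longrightarrow\; jj' = \frac{jj \cdot 2^{ii}}{2^{ii'}}
\end{aligned}
```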

  46. Handling Reduction (equations of slide 45)
     Hard to solve statically

  47. Handling Reduction (equations of slide 45)
     But a given (source) task has fixed {ii, jj}

  48. Handling Reduction
     A given (source) task has fixed {ii, jj}, so jj * 2**ii is a constant in both equations:
         jj' = (jj * 2**ii) / (2**ii') - 1/2
         jj' = (jj * 2**ii) / (2**ii')

  49. Handling Reduction
     With C = jj * 2**ii fixed by the source task:
         jj' = C / (2**ii') - 1/2
         jj' = C / (2**ii')

  50. Handling Reduction
     Finding a destination task means finding integers ii', jj' that satisfy either equation:
         jj' = C / (2**ii') - 1/2
         jj' = C / (2**ii')
     Run-time upper bound for cost: log(NT)
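Below is a sketch of that run-time search. It assumes the levels ii' range over 2**ii' < NT, as in the canonical nest. For a source task (ii, jj), the search works as follows:

- It tries each later level ii' in turn.
- At each level it tests which of the two equations gives a valid integer jj'.
- It returns the first match, which is the next task to touch A[k][j].

Working with j = 2C instead of C avoids the fraction 1/2. The function name and result encoding are hypothetical.

```c
#include <assert.h>

/* Destination of the value written by reduction task (ii, jj) on a row of
 * NT elements: the first later task (ii', jj') that reads A[k][j].
 * Returns 1 and fills *dii, *djj, *as_right if found, 0 if (ii, jj) is the
 * final task (its output is the result). *as_right is 1 when the value is
 * read as the right operand A[k][j'+i'], 0 when read as A[k][j']. */
int reduce_destination(int NT, int ii, int jj,
                       int *dii, int *djj, int *as_right)
{
    long j = (long)jj << (ii + 1);        /* j = jj*2*i = 2*C, C = jj*2**ii */
    assert(j < NT - (1L << ii));           /* (ii, jj) is a valid task */
    for (int iip = ii + 1; (1L << iip) < NT; iip++) {  /* at most log(NT) levels */
        long ip = 1L << iip;               /* i' = 2**ii'      */
        long step = 2 * ip;                /* j' = jj' * step  */
        /* j = j'      =>  jj' = C / 2**ii'         */
        if (j % step == 0 && j < NT - ip) {
            *dii = iip; *djj = (int)(j / step); *as_right = 0;
            return 1;
        }
        /* j = j' + i' =>  jj' = C / 2**ii' - 1/2   */
        if (j >= ip && (j - ip) % step == 0 && j - ip < NT - ip) {
            *dii = iip; *djj = (int)((j - ip) / step); *as_right = 1;
            return 1;
        }
    }
    return 0;
}
```

With NT = 8, task (0, 1) computes A[2] += A[3], and its result is consumed as the right operand of task (1, 0), which computes A[0] += A[2]. Task (2, 0) has no destination because it produces the final value. Each source task therefore costs at most one test per remaining level.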
