The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. Lawrence Rauchwerger and David A. Padua, PLDI 1995. Presented by Seung-Jai Min.
Introduction
• Motivation: current parallelizing compilers cannot handle access patterns that are complex or cannot be fully determined statically (input-dependent or run-time-dependent conditions, subscripted subscripts, etc.).
• LRPD Test
  - speculatively executes the loop as a doall
  - applies a fully parallel test for cross-iteration data dependences
  - if the test fails, the loop is re-executed serially
Inspector-Executor Method
• Inspector/Executor
  - the inspector extracts and analyzes the memory access pattern
  - the executor transforms the loop if necessary and executes it
• Disadvantages
  - cost and side effects, when the address computation of the array under test depends on the actual data computation
  - parallel execution of the inspector loop is not always possible
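The inspector/executor split can be illustrated with a minimal Python sketch. It is not the method of any particular system: the loop body `A[L[i]] = A[K[i]] + C[i]`, the function names `inspector`, `executor`, and `serial`, the wavefront schedule, and the data are all assumptions made for illustration. The subscript arrays K and L are assumed to be known before the loop runs, which is the case where the inspector has no side effects. Wavefronts are executed one after another; the iterations inside one wavefront are the ones that could run in parallel.

```python
def inspector(K, L):
    """Assign each iteration a wavefront number from its access pattern alone."""
    last_write = {}   # element -> wavefront of its latest writer
    last_access = {}  # element -> latest wavefront that read or wrote it
    levels = []
    for k, l in zip(K, L):            # the iteration reads A[k] and writes A[l]
        # Run after the last writer of k (flow dependence) and after every
        # earlier access to l (anti and output dependences).
        w = max(last_write.get(k, -1), last_access.get(l, -1)) + 1
        levels.append(w)
        last_write[l] = w
        last_access[l] = w
        last_access[k] = max(last_access.get(k, -1), w)
    return levels


def executor(A, C, K, L, levels):
    """Run the loop wavefront by wavefront; each wavefront is a doall."""
    for w in range(max(levels) + 1):
        members = [i for i, lv in enumerate(levels) if lv == w]
        # Reversed order inside a wavefront shows that the order is irrelevant.
        for i in reversed(members):
            A[L[i]] = A[K[i]] + C[i]
    return A


def serial(A, C, K, L):
    """Reference result: the original sequential loop."""
    for k, l, c in zip(K, L, C):
        A[l] = A[k] + c
    return A


K = [0, 1, 2, 3, 0]            # hypothetical 0-based read subscripts
L = [1, 1, 3, 3, 1]            # hypothetical 0-based write subscripts
C = [10, 20, 30, 40, 50]
levels = inspector(K, L)
result = executor([1, 2, 3, 4], C, K, L, levels)
print(levels, result)
```

For this data the inspector places iterations 1 and 3 in the first wavefront, 2 and 4 in the second, and 5 in the third, and the executor reproduces the result of the serial loop.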
[Figure: speculative run-time parallelization. Compile time: Polaris, static analysis. Run time: run-time transformations, checkpoint, speculative parallel execution, test. Edge labels: pass; fail, leading to restore and sequential execution. The diagram also contains "reorder" and "heuristic" labels.]
Hazards (during the speculative execution)
• Exceptions
  - invalidate the parallel execution
  - clear the exception flag, restore the values of any altered variables, and execute the loop serially
• Cross-iteration dependences in the loop
  - detected by the LRPD Test
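The checkpoint, speculate, and fall-back sequence can be sketched as a small driver. This is a hedged illustration only: `run_speculatively`, the `state` dictionary, and the three callbacks are hypothetical names, a Python exception stands in for a hardware exception flag, and the "parallel" loop is just a function call.

```python
import copy


def run_speculatively(state, speculative_loop, serial_loop, test_passed):
    """Try the speculative version; fall back to serial execution on any hazard."""
    checkpoint = copy.deepcopy(state)     # save the variables the loop may alter
    try:
        speculative_loop(state)
        if test_passed(state):            # stands in for the run-time dependence test
            return "parallel"
    except ArithmeticError:
        pass                              # an exception invalidates the speculation
    state.clear()
    state.update(checkpoint)              # restore the altered variables
    serial_loop(state)
    return "serial"


def faulty_speculation(state):
    state["A"][0] = 99                    # partial update made before the hazard
    raise ZeroDivisionError("raised during speculation")


def increment_all(state):
    state["A"] = [a + 1 for a in state["A"]]


s1 = {"A": [1, 2, 3]}
mode1 = run_speculatively(s1, faulty_speculation, increment_all, lambda s: True)

s2 = {"A": [1, 2, 3]}
mode2 = run_speculatively(s2, increment_all, increment_all, lambda s: False)

s3 = {"A": [1, 2, 3]}
mode3 = run_speculatively(s3, increment_all, increment_all, lambda s: True)
print(mode1, s1, mode2, s2, mode3, s3)
```

In the first two calls the speculative result is discarded (exception, then failed test) and the serial loop runs on the restored state, so the partial update `99` never survives. In the third call the speculative result is kept.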
LPD Test (The Lazy Privatizing doall Test)
1. Marking Phase
• For each shared array A[1:s], use read, write, and not-privatizable shadow arrays Ar[1:s], Aw[1:s], and Anp[1:s].
  (a) Uses: if this array element has not been modified in this iteration, set the corresponding elements in Ar and Anp.
  (b) Defs: set the corresponding element in Aw, and clear the corresponding element in Ar if it is set.
  (c) twi(A): count the write accesses to A that are marked in this iteration (i is the iteration number).
LPD Test (The Lazy Privatizing doall Test)
2. Analysis Phase (performed after the speculative execution)
  (a) Compute
      (i) tw(A) = Σ_i twi(A), the sum of twi(A) over all iterations i
      (ii) tm(A) = sum(Aw[1:s]), the number of marked elements of Aw
      (iii) if tm(A) != tw(A), there is a cross-iteration output dependence
  (b) If any(Aw[:] & Ar[:]), the loop is not a doall and the phase ends: a def and a use of the same location occurred in different iterations (flow or anti dependence).
LPD Test (The Lazy Privatizing doall Test)
2. Analysis Phase (performed after the speculative execution), continued
  (c) Else if tw(A) == tm(A), the loop is a doall (without privatizing the array A).
  (d) Else if any(Aw[:] & Anp[:]), the array A is not privatizable: there is at least one iteration in which some element of A was used before it was modified.
  (e) Otherwise, the loop was made into a doall by privatizing the shared array A.
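The marking and analysis phases can be walked through on recorded access traces. The sketch below is a sequential Python simulation under stated assumptions, not the paper's parallel implementation: the function name `lpd_test`, the trace format, and the returned strings are hypothetical, indices are 0-based, and "clear in Ar" is interpreted as clearing the mark set by the current iteration.

```python
def lpd_test(size, traces):
    """Classify a loop from per-iteration access traces to one shared array.

    traces[i] lists the accesses of iteration i in program order as
    ("use", index) or ("def", index) pairs.
    """
    Aw = [False] * size    # element defined in some iteration
    Ar = [False] * size    # element used in an iteration that does not define it
    Anp = [False] * size   # element used before being defined in an iteration
    tw = 0                 # marked writes, summed over all iterations

    # Marking phase (done sequentially here instead of in parallel).
    for trace in traces:
        defined, exposed = set(), set()
        for kind, e in trace:
            if kind == "use":
                if e not in defined:       # (a) not yet modified in this iteration
                    exposed.add(e)
                    Anp[e] = True
            else:                          # (b) definition
                defined.add(e)
                exposed.discard(e)         # clear this iteration's read mark
                Aw[e] = True
        tw += len(defined)                 # (c) twi(A)
        for e in exposed:
            Ar[e] = True

    # Analysis phase.
    tm = sum(Aw)
    if any(w and r for w, r in zip(Aw, Ar)):
        return "not a doall: flow/anti dependence"
    if tw == tm:
        return "doall"
    if any(w and p for w, p in zip(Aw, Anp)):
        return "not a doall: A is not privatizable"
    return "doall after privatizing A"


# Each iteration writes a scratch element 0, reads it back, then writes its own element.
scratch = [[("def", 0), ("use", 0), ("def", i)] for i in (1, 2, 3)]
# Iteration 2 reads what iteration 1 wrote.
chain = [[("use", 0), ("def", 1)], [("use", 1), ("def", 2)]]
# Every iteration reads element 0 before overwriting it.
read_first = [[("use", 0), ("def", 0)], [("use", 0), ("def", 0)]]

print(lpd_test(4, scratch))
print(lpd_test(3, chain))
print(lpd_test(1, read_first))
print(lpd_test(2, [[("def", 0)], [("def", 1)]]))
```

The four hypothetical traces exercise the four outcomes of steps (b) through (e), in the order (e), (b), (d), (c).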
Dynamic dead reference elimination
• To avoid introducing false dependences, the marking of the read and not-privatizable shadow arrays, Ar and Anp, can be postponed until the value of the shared variable is actually used.
• Definition: a dynamic dead read reference in a loop is a read access of a shared variable that does not contribute to the computation of any other shared variable that is live at loop end.
• The "lazy" marking employed by the LPD test, i.e., the dynamic dead reference elimination technique, allows it to qualify more loops than the PD test.
Original loop:
    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        A(L(i)) = z + C(i)
      endif
    enddo
PD Test:
    do i = 1, 5
      markread(K(i))
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        markwrite(L(i))
        A(L(i)) = z + C(i)
      endif
    enddo
Input data:
    B1(1:5) = (1 0 1 0 1)
    K(1:5) = (1 2 3 4 1)
    L(1:5) = (2 2 4 4 2)
LPD Test:
    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        markread(K(i))
        markwrite(L(i))
        A(L(i)) = z + C(i)
      endif
    enddo
Original loop:
    do i = 1, 5
      z = A(K(i))
      if (B1(i) .eq. .true.) then
        A(L(i)) = z + C(i)
      endif
    enddo
Input data:
    B1(1:5) = (1 0 1 0 1)
    K(1:5) = (1 2 3 4 1)
    L(1:5) = (2 2 4 4 2)
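The effect of lazy marking on this example can be checked by computing the shadow arrays for both placements of `markread`. The following is a small Python simulation under assumptions: the array size of 4 is inferred from the largest subscript (the slides do not state it), `shadow_marks` and `verdict` are hypothetical helper names, and the marking is done sequentially.

```python
def shadow_marks(B1, K, L, size, lazy):
    """Return (Aw, Ar, Anp, tw) for the example loop; K and L are 1-based."""
    Aw, Ar, Anp = [0] * size, [0] * size, [0] * size
    tw = 0
    for b, k, l in zip(B1, K, L):
        k, l = k - 1, l - 1            # convert to 0-based Python indices
        if b or not lazy:              # PD marks every read; LPD only reads whose value is used
            Anp[k] = 1                 # the read precedes the write in the loop body
            if not (b and k == l):
                Ar[k] = 1              # not overwritten within the same iteration
        if b:
            Aw[l] = 1
            tw += 1
    return Aw, Ar, Anp, tw


def verdict(Aw, Ar, Anp, tw):
    """Analysis phase, steps (b) through (e)."""
    tm = sum(Aw)
    if any(w & r for w, r in zip(Aw, Ar)):
        return "fail"
    if tw == tm:
        return "doall"
    if any(w & p for w, p in zip(Aw, Anp)):
        return "fail"
    return "doall with privatization"


B1 = [1, 0, 1, 0, 1]
K = [1, 2, 3, 4, 1]
L = [2, 2, 4, 4, 2]
pd = shadow_marks(B1, K, L, 4, lazy=False)
lpd = shadow_marks(B1, K, L, 4, lazy=True)
print("PD :", pd, verdict(*pd))
print("LPD:", lpd, verdict(*lpd))
```

With eager marking, the dead reads in iterations 2 and 4 set Ar(2) and Ar(4), which collide with Aw and make the test fail. With lazy marking those reads are never marked, Ar becomes (1 0 1 0), and the loop qualifies as a doall once A is privatized (tw = 3, tm = 2).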
Run-time Reduction Parallelization
• Two parts: recognition of the reduction variable, and parallelization of the reduction
• Pattern matching identification
  - The data dependence (DD) test that qualifies a statement as a reduction statement cannot be performed statically in the presence of input-dependent access patterns.
  - Syntactic pattern matching cannot identify all potential reduction variables (e.g., with subscripted subscripts).
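To make the second part concrete, here is a minimal Python sketch of one common way to parallelize a sum reduction with subscripted subscripts: each processor accumulates into a private copy, and the copies are merged afterwards. The slides do not say which scheme is used, so this scheme, the names `serial_reduction` and `parallel_reduction`, the `workers` parameter, and the data are all assumptions. Processors are modeled by sequential passes, and integers are used so that reassociating the additions cannot change the result.

```python
def serial_reduction(A, R, vals):
    """Reference: A(R(i)) = A(R(i)) + vals(i), executed in order."""
    A = list(A)
    for r, v in zip(R, vals):
        A[r] = A[r] + v
    return A


def parallel_reduction(A, R, vals, workers=2):
    """Accumulate into private partial arrays, then merge them into A."""
    n = len(R)
    chunk = (n + workers - 1) // workers
    partials = []
    for p in range(workers):                    # each pass models one processor
        private = [0] * len(A)                  # initialized to the identity of +
        for i in range(p * chunk, min((p + 1) * chunk, n)):
            private[R[i]] += vals[i]
        partials.append(private)
    result = list(A)
    for private in partials:                    # merge step
        for e, v in enumerate(private):
            result[e] += v
    return result


R = [0, 2, 0, 1, 0]          # hypothetical subscript array (0-based)
vals = [1, 2, 3, 4, 5]       # hypothetical per-iteration contributions
print(serial_reduction([0, 0, 0], R, vals))
print(parallel_reduction([0, 0, 0], R, vals))
```

The transformation is only valid if the elements of A are touched by nothing except the reduction statement, which is exactly the property that cannot be established statically when R is input dependent.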
The LRPD Test: Extending the LPD Test for Reduction Validation
(a) Source program
    do i = 1, n
S1:   A(K(i)) = ………
S2:   ……… = A(L(i))
S3:   A(R(i)) = A(R(i)) + exp()
    enddo
(b) Transformed program
    doall i = 1, n
      markwrite(K(i))
      markredux(K(i))
S1:   A(K(i)) = ………
      markread(L(i))
      markredux(L(i))
S2:   ……… = A(L(i))
      markwrite(R(i))
S3:   A(R(i)) = A(R(i)) + exp()
    enddo
• Anx: shadow array used to check only that the reduction variable is not accessed outside the single reduction statement.
• The markredux operation sets the corresponding element of the shadow array Anx to true.
LRPD Test
• Modified Analysis Pass
  - 2(d') Else if any(Aw[:] & Anp[:] & Anx[:]), then some element of A written in the loop is neither a reduction variable nor privatizable. Thus, the loop is not a doall and the phase ends.
  - 2(e') Otherwise, the loop was made into a doall by parallelizing reductions and privatizing the shared array A.
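The modified analysis pass can be tried out on access traces that distinguish the reduction statement from all other accesses. The sketch below is a sequential Python simulation that follows the rules as stated on these slides, under assumptions: the name `lrpd_test`, the trace format, and the result strings are hypothetical; a "redux" entry stands for the access made by the reduction statement and, as in the transformed program, triggers only `markwrite`; "use" and "def" entries additionally set Anx.

```python
def lrpd_test(size, traces):
    """traces[i]: ordered ("use" | "def" | "redux", index) accesses of iteration i."""
    Aw, Ar = [False] * size, [False] * size
    Anp, Anx = [False] * size, [False] * size
    tw = 0
    for trace in traces:                       # marking phase, done sequentially
        defined, exposed = set(), set()
        for kind, e in trace:
            if kind == "use":
                Anx[e] = True                  # markredux: accessed outside the reduction
                if e not in defined:
                    exposed.add(e)
                    Anp[e] = True
            else:
                if kind == "def":
                    Anx[e] = True              # markredux
                defined.add(e)                 # markwrite, also issued for "redux"
                exposed.discard(e)
                Aw[e] = True
        tw += len(defined)
        for e in exposed:
            Ar[e] = True

    tm = sum(Aw)                               # analysis phase
    if any(w and r for w, r in zip(Aw, Ar)):
        return "not a doall: flow/anti dependence"
    if tw == tm:
        return "doall"
    if any(w and p and x for w, p, x in zip(Aw, Anp, Anx)):       # 2(d')
        return "not a doall: neither reduction nor privatizable"
    return "doall with privatization and reduction parallelization"  # 2(e')


# Element 0 is updated by the reduction statement in two iterations.
pure_reduction = [[("redux", 0)], [("redux", 1)], [("redux", 0)]]
# Element 0 is reduced in one iteration and read by another statement later.
read_outside = [[("redux", 0)], [("use", 0)]]
# Element 0 is read and then overwritten by ordinary statements in every iteration.
read_then_write = [[("use", 0), ("def", 0)], [("use", 0), ("def", 0)]]

print(lrpd_test(2, pure_reduction))
print(lrpd_test(1, read_outside))
print(lrpd_test(1, read_then_write))
```

The first trace passes through 2(e'), the second is rejected by the flow/anti check, and the third is rejected by 2(d').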
Other Run-time Parallelization Papers
• "Techniques for Speculative Run-Time Parallelization of Loops", Manish Gupta and Rahul Nim, SC'98.
  - more efficient run-time array privatization
  - no rollback of the entire loop computation; the loop is completed by generating synchronization
  - early hazard detection
Other Run-time Parallelization Papers
• "Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors", Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas, HPCA 1998.
  - Run-time parallelization techniques are often computationally expensive and not general enough.
  - Idea: execute the code in parallel speculatively and let extended cache coherence protocol hardware detect any dependence violations.
  - Performance: speedup of 7.3 on 16 processors, and 50% faster than the software-only approach.