
Paragon: Collaborative Speculative Loop Execution on GPU and CPU


Presentation Transcript


  1. Paragon: Collaborative Speculative Loop Execution on GPU and CPU • Mehrzad Samadi¹, Amir Hormati², Janghaeng Lee¹, and Scott Mahlke¹ • ¹University of Michigan - Ann Arbor • ²Microsoft Research

  2. Amdahl's Law • GPGPU may deliver ~100x speedups on the code it can run, but overall speedup is bounded by the rest • Example from the slide's bar chart: 50% of execution time is GPU-executable, 50% has NO GPU utilization • Even a 1000x speedup on the GPU-executable half does NOT bring more than 2x in overall execution time
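Amdahl's law makes the 2x bound concrete (a worked instance of the slide's numbers, with GPU-executable fraction f = 0.5 and GPU speedup s = 1000 on that fraction):

  \[ S_{\text{overall}} = \frac{1}{(1-f) + f/s} = \frac{1}{0.5 + 0.5/1000} \approx 1.998 < 2 \]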

  3. General Purpose Computing on GPU • Limitations of GPGPU • Requires massive data parallelism • Linear array accesses only • NO indirect array accesses • NO pointers • These restrictions leave GPUs underutilized: GPUs are not all that general-purpose • How can GPUs be made more GENERAL?

  4. Motivation - More Generalization • Goal: reduce the NO-GPU-utilization sections by handling • Non-linear array access • Indirect array access • Array access through pointers • These patterns may hide loop-carried dependencies and are difficult for programmers to verify as safe

Non-linear access:

  for (y = 0; y < ny; y++)
    for (x = 0; x < nx; x++) {
      xr = x % squaresize[XUP];
      yr = y % squaresize[YUP];
      i = xr + yr;
      lattice[i].x = x;
      lattice[i].y = y;
    }

Indirect access:

  for (i = 1; i < m; i++)
    for (j = iaL[i]; j < iaL[i+1]-1; j++)
      x[i] = x[i] - aL[j] * x[jaL[j]];

Pointer access:

  for (int i = 0; i < n; i++) {
    *c = *a + *b;
    a++; b++; c++;
  }

  5. Motivation - More Generalization • Because these access patterns cannot be proven dependence-free statically, such loops are treated as sequential today and get NO GPU utilization

  6. Paragon Execution • [Execution timeline] Example program: sequential code, Loop 1 (DO-ALL), sequential code, Loop 2 (possibly-parallel), sequential code, Loop 3 (DO-ALL), sequential code • Sequential regions run on the CPU; the DO-ALL loops L1 and L3 run on the GPU • The possibly-parallel loop L2 is launched speculatively on the GPU while the CPU executes it sequentially at the same time • A conflict check validates L2's speculative GPU results; a sketch of this scheme follows
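A minimal sketch of this CPU/GPU speculation scheme, assuming hypothetical helpers (Data, loop2_sequential, loop2_kernel, check_conflicts, and commit_gpu_results are illustrative names, not Paragon's actual API; host/device memory management is elided):

  #include <pthread.h>
  #include <cuda_runtime.h>

  struct Data;                             // hypothetical: loop state, logs, GPU copies
  void loop2_sequential(Data *d);          // hypothetical CPU version of Loop 2
  __global__ void loop2_kernel(Data *d);   // hypothetical instrumented GPU version
  bool check_conflicts(Data *d);           // hypothetical post-kernel conflict check
  void commit_gpu_results(Data *d);        // hypothetical: keep the GPU's results

  // CPU thread: the non-speculative sequential run of Loop 2.
  void *run_sequential(void *arg) {
      loop2_sequential((Data *)arg);
      return NULL;
  }

  void execute_possibly_parallel(Data *d) {
      pthread_t cpu;
      pthread_create(&cpu, NULL, run_sequential, d);  // CPU starts sequential L2

      loop2_kernel<<<256, 256>>>(d);                  // GPU runs L2 speculatively
      cudaDeviceSynchronize();

      if (!check_conflicts(d)) {
          pthread_cancel(cpu);         // speculation succeeded: stop the CPU run
          commit_gpu_results(d);       // and keep the GPU's results
      } else {
          pthread_join(cpu, NULL);     // mis-speculation: discard GPU results and
      }                                // wait for the correct sequential results
  }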

  7. Paragon Execution with Conflict • Same program as before, but this time a conflict is detected in L2's speculative GPU execution • The GPU results for L2 are discarded and the CPU's sequential execution of L2 provides the correct results • Execution then continues (L3 on the GPU), so mis-speculation costs no more than running L2 sequentially

  8. Paragon Process Flow • Input: sequential code • Offline compilation: loop classification, instrumentation • Runtime: kernel management, profiling, conflict management unit, execution without profiling • Output: CUDA + pThread code

  9. Offline Compilation • Loop classification • Sequential loops: dependences determined at compile time; assigned to the CPU statically • DO-ALL loops: assigned to the GPU statically • Possibly-parallel (possible DO-ALL) loops: dependences can only be determined at RUNTIME

  10. Runtime Profiling • Paragon spawns two threads on the CPU • A sequential execution thread • A monitoring thread that keeps track of the loop's memory footprint • Based on the observed conflicts, the loop is marked • Sequential, if there are many conflicts • Parallelizable, if there are no or few conflicts; such loops are then assigned to both the CPU and the GPU (see the sketch after this list)
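A sketch of how such a profile could drive the classification, assuming a hypothetical per-iteration Footprint record filled in by the monitoring thread and a tunable CONFLICT_THRESHOLD (both illustrative; the slides do not specify Paragon's internal data structures):

  #define CONFLICT_THRESHOLD 4     // illustrative tuning knob

  struct Footprint;                // hypothetical per-iteration read/write sets
  bool footprints_overlap(const Footprint *a,
                          const Footprint *b);  // hypothetical RAW/WAR/WAW test

  enum LoopClass { LOOP_SEQUENTIAL, LOOP_PARALLELIZABLE };

  // Count cross-iteration dependences observed during the profiled run
  // and classify the loop accordingly.
  enum LoopClass classify_loop(const Footprint *iters, int n_iters) {
      int conflicts = 0;
      for (int i = 0; i < n_iters; i++)
          for (int j = i + 1; j < n_iters; j++)
              if (footprints_overlap(&iters[i], &iters[j]))
                  conflicts++;
      return (conflicts > CONFLICT_THRESHOLD) ? LOOP_SEQUENTIAL
                                              : LOOP_PARALLELIZABLE;
  }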

  11. Conflict Detection - Logging • Lazy conflict detection: logs are checked only after the kernel finishes • Log memory is allocated when the kernel executes: a "write-set" counter per store target and a "read-set" flag per load target

  int  C_wr_log[sizeof_C];
  bool C_rd_log[sizeof_C];

Original loop:

  for (i = 0; i < N; i++) {
    idx = I[i];
    C[idx] = A[idx] + B[idx];
  }

Instrumented GPU version, where each of ThreadCnt threads strides over the iterations and logs its store:

  for (i = tid; i < N; i += ThreadCnt) {
    idx = I[i];
    C[idx] = A[idx] + B[idx];
    AtomicInc(C_wr_log[idx]);
  }
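The same instrumentation written as a runnable CUDA kernel (a sketch; the grid-stride pattern and parameter layout are assumptions, and the read log is included for completeness even though this particular loop never loads from C):

  __global__ void loop_logged(const float *A, const float *B, float *C,
                              const int *I, int N,
                              int *C_wr_log, bool *C_rd_log) {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = gridDim.x * blockDim.x;
      for (int i = tid; i < N; i += stride) {
          int idx = I[i];
          C[idx] = A[idx] + B[idx];      // original loop body
          atomicAdd(&C_wr_log[idx], 1);  // write-set: count stores to C[idx]
          // A load of C[idx] would additionally set the read-set flag:
          // C_rd_log[idx] = true;
      }
  }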

  12. Conflict Detection - Checking • Done in parallel on the GPU following the kernel, one thread per logged address • An address is a conflict if • it was written more than once (C_wr_log > 1), or • it was read and written at least once (C_rd_log set and C_wr_log >= 1) • Addresses that were never written, or written exactly once and never read, are OK
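A checking-kernel sketch under the same assumed log layout (check_logs and conflict_flag are illustrative names):

  __global__ void check_logs(const int *C_wr_log, const bool *C_rd_log,
                             int n, int *conflict_flag) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n) {
          // Conflict: written more than once, or read and written at least once.
          if (C_wr_log[idx] > 1 || (C_rd_log[idx] && C_wr_log[idx] >= 1))
              atomicOr(conflict_flag, 1);
      }
  }

The host then copies conflict_flag back; nonzero means the speculative GPU results must be discarded in favor of the CPU's sequential run.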

  13. Experimental Setup • CPU: Intel Core i7 • GPU: NVIDIA GTX 560 with 2GB GDDR5 • Benchmarks • Loops with pointers: FDTD, Seidel, Jacobi2D, GEMM, TMV • Indirect/non-linear access: Saxpy, House, Ipvec, Ger, Gemver, SOR, FWD

  14. Results for Pointer Access • [Speedup chart; callout: 36x]

  15. Results for Indirect Access • [Speedup chart; callout: 3.8x]

  16. Conclusion • Paragon improves performance by increasing GPU utilization • Possibly-parallel loops run speculatively on the GPU • No performance penalty on mis-speculation, because the CPU runs the loop sequentially at the same time • Conflict checking is done on the GPU

  17. Q & A

  18. Overhead Breakdown

  19. Overhead Breakdown
