
All-Pairs-Shortest-Paths for Large Graphs on the GPU




Presentation Transcript


    1. All-Pairs-Shortest-Paths for Large Graphs on the GPU Gary J. Katz1,2, Joe Kider1 (1University of Pennsylvania, 2Lockheed Martin IS&GS)

    2. 1 What Will We Cover? Quick overview of Transitive Closure and All-Pairs Shortest Path; uses for Transitive Closure and All-Pairs; GPUs: what are they and why do we care?; the GPU problem with performing Transitive Closure and All-Pairs; the solution: the Block Processing Method; memory formatting in global and shared memory; results.

    3. 2 Previous Work “A Blocked All-Pairs Shortest-Paths Algorithm”, Venkataraman et al. “Parallel FPGA-based All-Pairs Shortest-Paths in a Directed Graph”, Bondhugula et al. “Accelerating large graph algorithms on the GPU using CUDA”, Harish et al.

    4. 3 NVIDIA GPU Architecture Processors are grouped into blocks. Each block has access to a shared memory. Blocks cannot communicate with each other. All blocks have access to global GPU memory. The GPU does not have access to CPU memory except for explicit memcpys made by the CPU. Mention CUDA, an extension of C that allows you to access the L1 cache (shared memory) explicitly.

    5. 4 Quick Overview of Transitive Closure The Transitive Closure of G is defined as the graph G* = (V, E*), where E* = {(i, j) : there is a path from vertex i to vertex j in G} -Introduction to Algorithms, T. Cormen Also explain what all-pairs shortest path is here…

    6. 5 Quick Overview of All-Pairs Shortest-Path The All-Pairs Shortest-Path of G is defined, for every pair of vertices u, v ∈ V, as the shortest (least weight) path from u to v, where the weight of a path is the sum of the weights of its constituent edges. -Introduction to Algorithms, T. Cormen

    7. 6 Uses for Transitive Closure and All-Pairs Social Network Analysis, Communications Networks, Travel Networks, Medical

    8. 7 Floyd-Warshall Algorithm K tells you which node we are connecting through.
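For reference, the Floyd-Warshall algorithm shown on this slide can be sketched in plain C as below (the size N, the INF sentinel, and the function name are illustrative choices, not from the slides). Replacing the (min, +) update with (OR, AND) on a boolean adjacency matrix yields transitive closure instead of shortest paths.

```c
#define N 4
#define INF 1000000  /* large sentinel standing in for "no edge" */

/* Standard O(V^3) Floyd-Warshall. After iteration k of the outer loop,
   dist[i][j] is the shortest i -> j path using only vertices 0..k as
   intermediates -- this is what "K" on the slide refers to. */
void floyd_warshall(int dist[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}
```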

    9. 8 Parallel Floyd-Warshall

    10. 9 The Question How do we calculate the transitive closure on the GPU so as to take advantage of shared memory and accommodate data sizes that do not fit in memory? What were the issues with the previous parallel implementation of transitive closure?

    11. 10 Block Processing of Floyd-Warshall We want to perform partial processing of a sub-block of the matrix, using a method that allows a hierarchy of processing to occur. What we want to determine is what memory needs to be loaded to process a block of the data matrix.

    12. 11 Block Processing of Floyd-Warshall Start out with the simple case.

    13. 12 Block Processing of Floyd-Warshall We can compute the transitive closure for up to N = 4. We want to eventually compute the transitive closure for the total N, not the partial N. Let's start by trying to compute the transitive closure for the entire graph up to K = 4.
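The "closure up to K = 4" idea on this slide amounts to truncating the outer k loop of Floyd-Warshall. A minimal C sketch (the size N, the INF sentinel, and the function name are mine, not from the slides):

```c
#define N 8
#define INF 1000000  /* sentinel standing in for "no edge" */

/* Floyd-Warshall with the k loop cut short: afterwards, dist[i][j] holds
   the shortest i -> j path that uses only vertices 0..kmax-1 as
   intermediates -- the "entire graph up to K" from the slide. */
void floyd_warshall_partial(int dist[N][N], int kmax)
{
    for (int k = 0; k < kmax; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}
```

Paths that would need a disallowed intermediate vertex are simply not discovered yet; later passes with larger kmax complete them.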

    14. 13 Block Processing of Floyd-Warshall Let's look at the corners.

    15. 14 Block Processing of Floyd-Warshall Let's look at the corners.

    16. 15 Block Processing of Floyd-Warshall

    17. 16 Block Processing of Floyd-Warshall Let's look at the corners.

    18. 17 Increasing the Number of Blocks

    19. 18 Increasing the Number of Blocks

    20. 19 Increasing the Number of Blocks

    21. 20 Increasing the Number of Blocks

    22. 21 Increasing the Number of Blocks

    23. 22 Increasing the Number of Blocks

    24. 23 Increasing the Number of Blocks

    25. 24 Increasing the Number of Blocks
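The build-up across slides 12-25 follows the three-phase blocked Floyd-Warshall pattern of Venkataraman et al.: process the diagonal tile for the current block of k values, then the tiles sharing its row or column, then all remaining tiles. A serial C sketch of that structure, assuming N is a multiple of the tile width B (all names and sizes here are illustrative, not the authors' code):

```c
#define N 8
#define B 4              /* tile width; N must be a multiple of B */
#define INF 1000000      /* sentinel standing in for "no edge" */

/* Relax tile (ti, tj) of the distance matrix, using as intermediates
   only the B vertices belonging to tile block tk. */
static void relax_tile(int d[N][N], int ti, int tj, int tk)
{
    for (int k = tk * B; k < (tk + 1) * B; k++)
        for (int i = ti * B; i < (ti + 1) * B; i++)
            for (int j = tj * B; j < (tj + 1) * B; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}

/* Three-phase blocked Floyd-Warshall over a T x T grid of tiles. */
void blocked_floyd_warshall(int d[N][N])
{
    int T = N / B;
    for (int b = 0; b < T; b++) {
        relax_tile(d, b, b, b);              /* phase 1: diagonal tile   */
        for (int t = 0; t < T; t++) {
            if (t == b) continue;
            relax_tile(d, b, t, b);          /* phase 2: tiles in row b  */
            relax_tile(d, t, b, b);          /*          and in column b */
        }
        for (int i = 0; i < T; i++)          /* phase 3: all other tiles */
            for (int j = 0; j < T; j++)
                if (i != b && j != b)
                    relax_tile(d, i, j, b);
    }
}
```

Within each phase, every tile depends only on tiles finished in an earlier phase, which is what lets the GPU process all tiles of a phase in parallel, one tile per thread block.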

    26. 25 Running it on the GPU Using CUDA Written by NVIDIA to access the GPU as a parallel processor; no need to use a graphics API. Explain that each thread will load one memory location from global memory into shared memory; then we'll sync the threads and run the partial Floyd-Warshall.
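The per-thread-block behavior described here (stage a tile into shared memory, synchronize, run the partial Floyd-Warshall) can be sketched serially in C. In this sketch the `shared` array stands in for CUDA `__shared__` memory and the copy loops stand in for the B x B threads each loading one element before a `__syncthreads()`; it is an illustration of the idea, not the authors' kernel:

```c
#define B 4              /* tile width, i.e. the thread-block dimensions */
#define INF 1000000      /* sentinel standing in for "no edge" */

/* Serial sketch of the diagonal-tile step for one thread block:
   stage a B x B tile from "global" memory into a fast local buffer,
   run Floyd-Warshall restricted to that tile's vertices, write back. */
void process_diagonal_tile(int *global, int n, int tile)
{
    int shared[B][B];    /* stands in for __shared__ memory  */
    int base = tile * B;

    /* In CUDA, each thread copies one element, then __syncthreads(). */
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            shared[i][j] = global[(base + i) * n + (base + j)];

    /* Partial Floyd-Warshall: k ranges only over this tile's vertices. */
    for (int k = 0; k < B; k++)
        for (int i = 0; i < B; i++)
            for (int j = 0; j < B; j++)
                if (shared[i][k] + shared[k][j] < shared[i][j])
                    shared[i][j] = shared[i][k] + shared[k][j];

    /* Write the finished tile back to global memory. */
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            global[(base + i) * n + (base + j)] = shared[i][j];
}
```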

    27. 26 Partial Memory Indexing SP1

    28. 27 Memory Format for All-Pairs Solution All-Pairs requires twice the memory footprint of Transitive Closure

    29. 28 Results

    30. 29 Results

    31. 30 Results

    32. 31 Conclusion Advantages of the algorithm: relatively easy to implement, cheap hardware, much faster than the standard CPU version, can work for any data size.

    33. 32 Backup

    34. 33 CUDA Compute Unified Device Architecture Extension of C Automatically creates thousands of threads to run on a graphics card Used to create non-graphical applications Pros: allows the user to design algorithms that will run in parallel; easy to learn, an extension of C; has a CPU version, implemented by kicking off threads. Cons: low-level, C-like language; requires understanding of GPU architecture to fully exploit. Slide based upon David Kirk lecture http://courses.ece.uiuc.edu/ece498/al/lectures/lecture6-cuda-fixed.ppt
