Hadamard Product: Solving Thousands of Small Problems in CUDA or Squeezing Water out of a Rock

1. Hadamard Product: Solving Thousands of Small Problems in CUDA or Squeezing Water out of a Rock

2. Outline
- Review Problem
- Serial Algorithm
- Parallel Algorithm
- Results
- Conclusion

3. The Problem
- Let C be a sparse N x N matrix
- Let X be a dense N x L matrix
- Let N >> L
- Task: compute Z = C ∘ (X X^T), where ∘ is the Hadamard (element-wise) product
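Because C is sparse, only the entries of X X^T that line up with a nonzero of C are ever needed, so the task reduces to one length-L inner product per nonzero. The following is a minimal serial sketch of that idea, not the code from the slides. The function name `masked_gram` and the dictionary storage for C are assumptions made for illustration.

```python
def masked_gram(c_entries, x):
    """Compute the nonzero entries of Z = C ∘ (X X^T).

    c_entries: dict mapping (i, j) -> C[i][j] for the nonzeros of C
               (hypothetical storage format, chosen for brevity).
    x:         list of N rows, each a list of L floats.
    """
    z = {}
    for (i, j), c_ij in c_entries.items():
        # One "small problem": the inner product of rows i and j of X.
        dot = sum(a * b for a, b in zip(x[i], x[j]))
        z[(i, j)] = c_ij * dot
    return z


# Tiny example: N = 3, L = 2, three nonzeros in C.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c = {(0, 0): 1.0, (0, 1): 0.5, (2, 1): 2.0}
z = masked_gram(c, x)
```

The dense N x N product X X^T is never formed. The cost scales with the number of nonzeros of C times L.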

4. The Problem: Current Implementation
For N = 128^2, L = 256:
- Initialization: 2.5 s
- Measurement Update:
  - Compute mean: 37.4 s
  - Compute Z_i and H_i Z_i: 1825.5 s
  - Compute K_i: 218.3 s
  - Update the ensemble: 729.8 s
- Time Update:
  - Add mean: 34.4 s
  - Update the ensemble (F_i = I): 215.3 s

5. Serial Algorithm

6. Parallel Decomposition
- One block computes multiple inner products
- One block computes one inner product
- Multiple blocks compute one inner product
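To make the middle option concrete, here is a serial Python simulation of one thread block computing one inner product: each simulated thread accumulates a strided partial sum, then the partial sums are combined by a tree reduction in a shared array. This is a common CUDA pattern and only a sketch. The slides do not show the actual kernel, and `block_inner_product` and its power-of-two `block_size` are assumptions.

```python
def block_inner_product(a, b, block_size=8):
    """Serially simulate one thread block computing dot(a, b)."""
    # The tree reduction below assumes a power-of-two block size.
    assert block_size > 0 and block_size & (block_size - 1) == 0

    # Phase 1: thread t accumulates elements t, t + block_size, ...
    shared = [0.0] * block_size  # stands in for shared memory
    for t in range(block_size):
        for k in range(t, len(a), block_size):
            shared[t] += a[k] * b[k]

    # Phase 2: tree reduction; each pass halves the active threads.
    stride = block_size // 2
    while stride > 0:
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2
    return shared[0]


a = [float(k) for k in range(1, 21)]  # 1.0 .. 20.0
b = [1.0] * 20
result = block_inner_product(a, b)
```

In this picture the other two options differ only in how work is assigned: several inner products share one block's threads, or one inner product is split across blocks and the per-block partial sums are combined afterwards.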

7. General Parallel Structure

8. Best Parallel Kernel

9. Best Parallel Kernel

10. Best Parallel Kernel

11. Varying the Problem
Three variables:
- Stencil: changing the stencil varies the number of inner products (Work Units)
- Ensemble Size: changing the ensemble size varies the length of the inner products (amount of work done per Work Unit)
- Reconstruction Size: only varies the number of Work Units; keep this fixed
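A rough way to see how these variables interact is to count the work directly. The sketch below assumes each of the N grid points contributes one nonzero of C per stencil point (boundary effects ignored) and that an inner product of length L costs about 2L floating-point operations. Both are simplifying assumptions, and the grid size and 9-point stencil in the example are hypothetical values rather than figures from the slides.

```python
def workload(n, stencil_points, ensemble_size):
    """Estimate work units and flops for Z = C ∘ (X X^T).

    Assumes one nonzero of C per (grid point, stencil point) pair.
    """
    work_units = n * stencil_points        # one inner product per nonzero
    flops_per_unit = 2 * ensemble_size     # about L multiplies and L adds
    return work_units, flops_per_unit, work_units * flops_per_unit


# Hypothetical sizes: 10,000 grid points, 9-point stencil, L = 256.
units, per_unit, total = workload(n=10_000, stencil_points=9, ensemble_size=256)
```

Under these assumptions, a larger stencil changes only `work_units`, and a larger ensemble changes only `flops_per_unit`.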

12. CPU Performance

13. CPU Performance

14. GPU Memory Overhead

15. GPU Performance




19. Useful New Knowledge
- Generally float2 > float > float4
  - Using texture, generally the same
  - In one case though, float4 wins
- 33% occupancy is often better than higher occupancy
- Warp-level synchronicity
- Extra address calculations can actually help
- Be careful with loop unrolling (preserve coalescing)
  - Unroll by block size, not by unroll factor
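One reading of the last point is that, inside an unrolled loop, a thread's successive accesses should be spaced by the block size so that neighbouring threads still touch neighbouring elements at every step. The Python sketch below only enumerates the element indices each simulated thread would touch under the two schemes. The helper names are hypothetical, and this interpretation of the slide is an assumption.

```python
def addresses_per_step(block_size, unroll, by_block_size):
    """For each unrolled step k, list the index touched by every thread."""
    steps = []
    for k in range(unroll):
        if by_block_size:
            # Thread t reads t, t + block_size, t + 2 * block_size, ...
            row = [t + k * block_size for t in range(block_size)]
        else:
            # Thread t reads its own chunk: t * unroll, t * unroll + 1, ...
            row = [t * unroll + k for t in range(block_size)]
        steps.append(row)
    return steps


def is_contiguous(row):
    """True when consecutive threads touch consecutive elements."""
    return all(b - a == 1 for a, b in zip(row, row[1:]))


good = addresses_per_step(block_size=4, unroll=2, by_block_size=True)
bad = addresses_per_step(block_size=4, unroll=2, by_block_size=False)
```

Both schemes cover the same eight elements, but only the first keeps each step's accesses contiguous across threads, which is what coalesced global-memory loads require.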

20. Conclusion
- 17x speedup over the current CPU implementation
  - 30 minutes down to a few minutes
- A system like the G80 can really help embarrassingly parallel problems
- Significant speedup even though the computation-to-memory-load ratio is nearly 1

21. Future Work and Ideas
- CudaArray + 2D texture fetch
  - Fewer address calculations
  - But is this a good thing?
- C matrix into constant memory
  - Maybe for faster reads
  - Not much reuse, so cache use is limited
- Try to fold more computation into kernels
