1. Hadamard Product: Solving Thousands of Small Problems in CUDA, or Squeezing Water out of a Rock
2. Outline
Review Problem
Serial Algorithm
Parallel Algorithm
Results
Conclusion
3. The Problem
Let C be a sparse N x N matrix
Let X be a dense N x L matrix
Let N >> L
Task:
Compute Z = C ∘ (X X^T)
where ∘ is the Hadamard, or element-wise, product
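Written element by element, the task above becomes the following. This is only a restatement of the definition; the index names i, j and k are introduced here for illustration.

```latex
Z_{ij} \;=\; C_{ij}\,\bigl(X X^{T}\bigr)_{ij}
       \;=\; C_{ij} \sum_{k=1}^{L} X_{ik}\, X_{jk},
\qquad 1 \le i, j \le N .
```

Since Z_ij = 0 wherever C_ij = 0, only one length-L inner product (row i of X with row j of X) is needed per nonzero of C, and the dense N x N product X X^T never has to be formed.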
4. The Problem: Current Implementation
For N = 128^2, L = 256
Initialization: 2.5s
Measurement Update:
Compute mean: 37.4s
Compute Z_i and H_i Z_i: 1825.5s
Compute K_i: 218.3s
Update the ensemble: 729.8s
Time Update:
Add mean: 34.4s
Update the ensemble (F_i = I): 215.3s
5. Serial Algorithm
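The slide body did not survive extraction, so the following is only a minimal sketch of what a serial version could look like, not the authors' code. It assumes C is stored in coordinate (COO) form and X is row-major; the `SparseCoo` struct and the name `hadamard_serial` are hypothetical. It is plain host code, so it compiles with nvcc or any C++ compiler.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical COO storage for the nonzeros of C (an assumption; the slides
// do not say which sparse format is used).
struct SparseCoo {
    std::vector<int> row;    // row index i of each nonzero
    std::vector<int> col;    // column index j of each nonzero
    std::vector<float> val;  // value C_ij of each nonzero
};

// Z has the sparsity pattern of C, so one value is returned per nonzero,
// in the same order as the entries of C.
std::vector<float> hadamard_serial(const SparseCoo& C,
                                   const std::vector<float>& X,  // N x L, row-major
                                   int L) {
    std::vector<float> z(C.val.size());
    for (std::size_t e = 0; e < C.val.size(); ++e) {
        const float* xi = &X[static_cast<std::size_t>(C.row[e]) * L];
        const float* xj = &X[static_cast<std::size_t>(C.col[e]) * L];
        float dot = 0.0f;
        for (int k = 0; k < L; ++k) {
            dot += xi[k] * xj[k];  // inner product of rows i and j of X
        }
        z[e] = C.val[e] * dot;     // scale by the matching entry of C
    }
    return z;
}
```

Each iteration of the outer loop is independent of the others, which is what makes the problem a collection of many small inner products.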
6. Parallel Decomposition
One block computes multiple inner products
One block computes one inner product
Multiple blocks compute one inner product
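As an illustration of the middle option (one block computes one inner product), here is a hedged sketch rather than the kernel from the talk. It assumes COO index arrays for the nonzeros of C, a row-major X, and a block size of 128; the names `hadamard_block_per_entry`, `to_device` and `hadamard_gpu` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // assumed block size; the reduction needs a power of two

// One thread block per nonzero of C, i.e. one block per inner product.
__global__ void hadamard_block_per_entry(const int* row, const int* col,
                                         const float* val, const float* X,
                                         int L, float* z) {
    __shared__ float partial[kThreads];
    const int e = blockIdx.x;  // index of the nonzero handled by this block
    const float* xi = X + static_cast<size_t>(row[e]) * L;
    const float* xj = X + static_cast<size_t>(col[e]) * L;

    // Each thread accumulates a strided slice of the length-L inner product.
    float sum = 0.0f;
    for (int k = threadIdx.x; k < L; k += kThreads) sum += xi[k] * xj[k];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = kThreads / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) z[e] = val[e] * partial[0];
}

// Copies a host vector into newly allocated device memory (no error checks).
template <typename T>
T* to_device(const std::vector<T>& h) {
    T* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(T));
    cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Host wrapper: returns one value of Z per nonzero of C.
std::vector<float> hadamard_gpu(const std::vector<int>& row,
                                const std::vector<int>& col,
                                const std::vector<float>& val,
                                const std::vector<float>& X, int L) {
    std::vector<float> z(val.size());
    if (z.empty()) return z;  // a launch with zero blocks would be an error
    int* d_row = to_device(row);
    int* d_col = to_device(col);
    float* d_val = to_device(val);
    float* d_X = to_device(X);
    float* d_z = nullptr;
    cudaMalloc(&d_z, z.size() * sizeof(float));

    hadamard_block_per_entry<<<static_cast<unsigned>(z.size()), kThreads>>>(
        d_row, d_col, d_val, d_X, L, d_z);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(z.data(), d_z, z.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_row); cudaFree(d_col); cudaFree(d_val); cudaFree(d_X); cudaFree(d_z);
    return z;
}
```

The other two decompositions change only the mapping: several nonzeros per block (for example one per warp), or several blocks contributing partial sums to the same nonzero, which then needs a second pass or atomic adds to combine them.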
7. General Parallel Structure
8. Best Parallel Kernel
9. Best Parallel Kernel
10. Best Parallel Kernel
11. Varying the Problem
Three Variables
Stencil
Changing stencil varies number of inner products (Work Units)
Ensemble Size
Changing ensemble size varies length of inner products (amount of work done per Work Unit)
Reconstruction Size
Only varies the number of Work Units
Keep this fixed
12. CPU Performance
13. CPU Performance
14. GPU Memory Overhead
15. GPU Performance
16. GPU Performance
17. GPU Performance
18. GPU Performance
19. Useful New Knowledge
Generally float2 > float > float4
With texture fetches, the ordering is generally the same
In one case, though, float4 wins
33% occupancy often better than higher occupancy
Warp level synchronicity
Extra address calculations can actually help
Be careful with loop unrolling (preserve coalescing)
Unroll by block size, not by unroll factor
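Several of the points above can be combined in one sketch; this is one possible reading of the bullets, not the kernel from the talk. It uses float2 loads, strides the loop by the number of cooperating threads so that neighbouring threads touch neighbouring addresses, and reduces within a warp. The reduction uses `__shfl_down_sync`, which exists only on hardware much newer than the G80; the slides more likely relied on implicit warp synchrony in shared memory. COO storage for C, an even L, and the names `hadamard_warp_float2` and `hadamard_gpu_float2` are assumptions, and error checking is omitted.

```cuda
#include <cstddef>
#include <stdexcept>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kWarpsPerBlock = 4;  // assumed: four inner products per block

// One warp per nonzero of C, several warps per block.
__global__ void hadamard_warp_float2(const int* row, const int* col,
                                     const float* val, const float* X,
                                     int L, int nnz, float* z) {
    const int lane = threadIdx.x % kWarpSize;
    const int e = blockIdx.x * kWarpsPerBlock + threadIdx.x / kWarpSize;
    if (e >= nnz) return;  // e is uniform within a warp, so whole warps exit

    // 64-bit loads: aligned because L is even and X comes from cudaMalloc.
    const float2* xi = reinterpret_cast<const float2*>(X + static_cast<size_t>(row[e]) * L);
    const float2* xj = reinterpret_cast<const float2*>(X + static_cast<size_t>(col[e]) * L);

    // Stride by the warp width: lane t reads element t, t + 32, t + 64, ...
    float sum = 0.0f;
    for (int k = lane; k < L / 2; k += kWarpSize) {
        const float2 a = xi[k];
        const float2 b = xj[k];
        sum += a.x * b.x + a.y * b.y;
    }

    // Warp-level reduction without shared memory or __syncthreads().
    for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    }
    if (lane == 0) z[e] = val[e] * sum;
}

// Copies a host vector into newly allocated device memory (no error checks).
template <typename T>
T* to_device(const std::vector<T>& h) {
    T* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(T));
    cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Host wrapper: returns one value of Z per nonzero of C. L must be even.
std::vector<float> hadamard_gpu_float2(const std::vector<int>& row,
                                       const std::vector<int>& col,
                                       const std::vector<float>& val,
                                       const std::vector<float>& X, int L) {
    if (L % 2 != 0) throw std::invalid_argument("L must be even for float2 loads");
    std::vector<float> z(val.size());
    if (z.empty()) return z;
    const int nnz = static_cast<int>(z.size());
    int* d_row = to_device(row);
    int* d_col = to_device(col);
    float* d_val = to_device(val);
    float* d_X = to_device(X);
    float* d_z = nullptr;
    cudaMalloc(&d_z, z.size() * sizeof(float));

    const int blocks = (nnz + kWarpsPerBlock - 1) / kWarpsPerBlock;
    hadamard_warp_float2<<<blocks, kWarpSize * kWarpsPerBlock>>>(
        d_row, d_col, d_val, d_X, L, nnz, d_z);
    cudaMemcpy(z.data(), d_z, z.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_row); cudaFree(d_col); cudaFree(d_val); cudaFree(d_X); cudaFree(d_z);
    return z;
}
```

Whether float2 still beats float or float4 on current hardware would have to be measured; the ordering reported above was observed on the hardware used for the talk.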
20. Conclusion
17x speedup over the current CPU implementation
30 minutes down to a few minutes
A system like the G80 can really help embarrassingly parallel problems
Significant speedup even though the ratio of computation to memory loads is nearly 1
21. Future Work and Ideas
CudaArray + 2D Texture Fetch
Fewer address calculations
But is this a good thing?
C matrix into constant memory
Maybe for faster reads
Not much reuse so cache use limited
Try to fold more computation into kernels