Paraprox: Pattern-Based Approximation for Data Parallel Applications. Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. University of Michigan, Electrical Engineering and Computer Science. March 2014.
Paraprox: Pattern-Based Approximation for Data Parallel Applications Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke University of Michigan March 2014 University of Michigan Electrical Engineering and Computer Science Compilers Creating Custom Processors
Approximate Computing • 100% accuracy is not always necessary • Less work • Better performance • Lower power consumption • There are many domains where approximate output is acceptable
Data Parallelism Is Everywhere • Domains: Financial Modeling, Medical Imaging, Physics Simulation, Audio Processing, Machine Learning, Games, Image Processing, Statistics, Video Processing • Mostly regular applications • Works on large data sets • Exact output is not required for operation • Good opportunity for automatic approximation
Approximating KMeans • Approximating alone is not enough: we need a way to control the output quality
Approximate Computing • Ask the programmer to do it • Not easy / practical • Hard to debug • Automatic Approximation • One solution does not fit all • Paraprox: Pattern-based Approximation • Pattern-specific approximation methods • Provide knobs to control the output quality
Common Patterns • Map: Signal Processing, Physics, … • Partitioning: Image Processing, Finance, … • Reduction: Machine Learning, Physics, … • Scatter/Gather: Machine Learning, Search, … • Stencil: Image Processing, Physics, … • Scan: Statistics, … M. McCool et al. “Structured Parallel Programming: Patterns for Efficient Computation.” Morgan Kaufmann, 2012.
Paraprox [Diagram: Parallel Program (OpenCL/CUDA) → Paraprox (Pattern Detection, Approximation Methods) → Approximate Kernels and Tuning Parameters → Runtime System]
Approximate Memoization BlackScholes
Approximate Memoization • Identify candidate functions • Find the table size • Check the quality • Determine q_i for each input • Fill the table • Execution
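The table-based idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Paraprox implementation: the helper names (`quantize`, `dequantize`, `address`, `fill_table`, `lookup`), the uniform quantization, and the stand-in pure function `toy_func` are all hypothetical. Each input is quantized to a small number of bits, the bit fields are concatenated into a table address, and the function call is replaced by a lookup.

```python
import math
from itertools import product

def quantize(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer level in [0, 2**bits - 1] (bits >= 1)."""
    top = (1 << bits) - 1
    x = min(max(x, lo), hi)  # clamp out-of-range inputs
    return round((x - lo) / (hi - lo) * top)

def dequantize(q, lo, hi, bits):
    """Representative input value for quantization level q."""
    return lo + q / ((1 << bits) - 1) * (hi - lo)

def address(levels, bits):
    """Concatenate the per-input levels into a single table address."""
    addr = 0
    for q, b in zip(levels, bits):
        addr = (addr << b) | q
    return addr

def fill_table(func, ranges, bits):
    """Precompute func once for every combination of quantized inputs."""
    table = [0.0] * (1 << sum(bits))
    for levels in product(*(range(1 << b) for b in bits)):
        args = [dequantize(q, lo, hi, b)
                for q, (lo, hi), b in zip(levels, ranges, bits)]
        table[address(levels, bits)] = func(*args)
    return table

def lookup(table, ranges, bits, *args):
    """Approximate func(*args) by reading the nearest precomputed entry."""
    levels = [quantize(x, lo, hi, b)
              for x, (lo, hi), b in zip(args, ranges, bits)]
    return table[address(levels, bits)]

def toy_func(a, b):
    """Hypothetical pure function standing in for a real candidate."""
    return math.exp(-a) * math.sqrt(b)
```

With `ranges = [(0.0, 1.0), (1.0, 4.0)]` and `bits = (5, 5)`, `fill_table(toy_func, ranges, bits)` builds a 1024-entry table, and `lookup(table, ranges, bits, a, b)` returns an approximation of `toy_func(a, b)`. More bits per input give a larger table and a smaller error.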
Candidate Functions • Pure functions do not: • read or write any global or static mutable state. • call an impure function. • perform I/O. • In CUDA/OpenCL: • No global/shared memory access • No thread ID dependent computation
Table Size [Figure: output quality and speedup for table sizes of 16K, 32K, and 64K]
How Many Bits per Input? • Table size = 32KB, 15-bit address • Inputs that do not need high precision get fewer bits • Bit allocations (A, B, C) and output quality, as listed on the slide: • Start: (5, 5, 5): 95.2% • First step: (6, 4, 5): 96.5%, (4, 6, 5): 91.3%, (5, 6, 4): 95.4%, (5, 4, 6): 91.2% • Second step: (6, 5, 4): 95.1%, (4, 7, 4): 95.4%, (5, 7, 3): 95.8%
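One way to picture the search over bit allocations is a simple hill climb: start from an even split of the address bits, try moving one bit from one input to another, and keep the move that improves quality the most. The sketch below is an assumption-laden illustration, not the search Paraprox actually performs. The function names are hypothetical, and `toy_quality` is a made-up quality model (error per input proportional to a sensitivity divided by the number of quantization levels) used only so the example runs on its own.

```python
def neighbors(bits):
    """All allocations reachable by moving one bit between two inputs."""
    for src in range(len(bits)):
        for dst in range(len(bits)):
            if src != dst and bits[src] > 1:  # keep at least 1 bit per input
                cand = list(bits)
                cand[src] -= 1
                cand[dst] += 1
                yield tuple(cand)

def search_bits(quality, start):
    """Hill-climb over bit allocations; the total bit count stays fixed."""
    best = tuple(start)
    best_q = quality(best)
    while True:
        cand = max(neighbors(best), key=quality)
        cand_q = quality(cand)
        if cand_q <= best_q:  # no neighbor improves quality: stop
            return best, best_q
        best, best_q = cand, cand_q

# Hypothetical sensitivities: input B needs the most precision, C the least.
SENSITIVITY = (40.0, 160.0, 10.0)

def toy_quality(bits):
    """Made-up quality model in percent; a real system would measure it."""
    return 100.0 - sum(s / (1 << b) for s, b in zip(SENSITIVITY, bits))
```

Calling `search_bits(toy_quality, (5, 5, 5))` keeps the 15-bit address width and shifts bits toward the input with the highest assumed sensitivity. In a real system `quality` would run the approximate kernel on sample inputs and compare against the exact output.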
Tile Approximation [Figure: difference with neighbors]
Stencil/Partitioning • C = Input[i][j] • W = Input[i][j-1] • E = Input[i][j+1] • NW = Input[i-1][j-1] • N = Input[i-1][j] • NE = Input[i-1][j+1] • SW = Input[i+1][j-1] • S = Input[i+1][j] • SE = Input[i+1][j+1] • Tile layout (rows): NW N NE / W C E / SW S SE • Paraprox looks for global/texture/shared load accesses to arrays with affine addresses • Control the output quality by changing the number of accesses per tile
Stencil/Partitioning (approximating the bottom row) • SW = Input[i+1][j-1] is replaced by W • S = Input[i+1][j] is replaced by C • SE = Input[i+1][j+1] is replaced by E
Stencil/Partitioning (approximating the top and bottom rows) • NW and SW are replaced by W • N and S are replaced by C • NE and SE are replaced by E
Stencil/Partitioning (approximating the whole tile) • All nine accesses (NW, N, NE, W, C, E, SW, S, SE) are replaced by C
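The effect of reducing the number of accesses per tile can be illustrated with a 3x3 mean filter. The sketch below is a plain-Python illustration under assumptions, not Paraprox-generated code. The function names are hypothetical, and a mean filter is used only because it is the simplest stencil. The exact version performs nine loads, the row version reuses W, C, and E for the rows above and below (three loads), and the center version reuses C for the whole tile (one load).

```python
def stencil_exact(img, i, j):
    """Exact 3x3 mean: nine loads per output element."""
    total = sum(img[i + di][j + dj]
                for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return total / 9.0

def stencil_row(img, i, j):
    """Row approximation: NW/SW -> W, N/S -> C, NE/SE -> E (three loads)."""
    w, c, e = img[i][j - 1], img[i][j], img[i][j + 1]
    return 3 * (w + c + e) / 9.0

def stencil_center(img, i, j):
    """Center approximation: every neighbor -> C (one load)."""
    c = img[i][j]
    return 9 * c / 9.0
```

On smooth inputs, where neighboring elements have similar values, the three versions give nearly the same result. On inputs with sharp changes between rows the approximations diverge from the exact value, which is why the number of accesses per tile works as a quality knob.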
Scan / Prefix Sum • Prefix Sum • Cumulative histogram, list ranking, … • Data parallel implementation: • Divide the input into smaller subarrays • Compute the prefix sum of each subarray in parallel
Data Parallel Scan • Input: 1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 1 1 1 • Phase I: scan each subarray in parallel: 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4, with subarray sums 4 4 4 4 • Phase II: scan the subarray sums: 4 8 12 16 • Phase III: add the scanned sums to the following subarrays: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
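The three phases can be written out sequentially to check the arithmetic on this slide. This is a minimal sketch of the exact (non-approximate) scheme. The function name `three_phase_scan` and the `chunk` parameter are hypothetical, and the per-subarray scans that would run in parallel on a GPU are executed one after another here.

```python
from itertools import accumulate

def three_phase_scan(data, chunk):
    """Inclusive prefix sum computed in three phases over fixed-size subarrays."""
    # Phase I: scan each subarray independently (parallel on a GPU).
    parts = [list(accumulate(data[k:k + chunk]))
             for k in range(0, len(data), chunk)]
    # Phase II: scan the last element (the sum) of every subarray.
    sums = list(accumulate(p[-1] for p in parts))
    # Phase III: add the sum of all preceding subarrays to each later subarray.
    out = list(parts[0]) if parts else []
    for k in range(1, len(parts)):
        out.extend(x + sums[k - 1] for x in parts[k])
    return out
```

For sixteen ones and `chunk = 4`, Phase I yields four copies of 1 2 3 4, Phase II yields 4 8 12 16, and Phase III yields 1 through 16, matching the slide.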
Scan Approximation [Figure: output elements, indexed from 0 to N]
Experimental Setup • Clang 3.3 • GPU: NVIDIA GTX 560 • CPU: Intel Core i7 • Benchmarks: NVIDIA SDK, Rodinia, … [Diagram components: CUDA, AST Visitor, Pattern Detection, Action Generator, Rewrite Driver, Approximate Kernels]
Runtime System • Quality checking against a target quality [Figure: quality and speedup] • Green [PLDI 2010] • SAGE [MICRO 2013]
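A quality-checking step of this kind can be sketched as a loop over tuning-parameter values. The slide does not describe the runtime's actual policy, so everything here is an assumption for illustration: the function names, the relative-error quality metric, the ordering of knobs from most to least aggressive, and the toy approximation `approx_sqrt` are all hypothetical.

```python
import math

def quality(exact, approx):
    """Hypothetical metric: 100% minus the relative error, floored at 0."""
    if exact == 0:
        return 100.0 if approx == 0 else 0.0
    return max(0.0, 100.0 * (1.0 - abs(exact - approx) / abs(exact)))

def pick_knob(exact_fn, approx_fn, knobs, samples, target):
    """Return the first knob (most aggressive first) that meets the target.

    Returns None if no knob qualifies, meaning the exact kernel should run.
    """
    for knob in knobs:
        scores = [quality(exact_fn(x), approx_fn(x, knob)) for x in samples]
        if sum(scores) / len(scores) >= target:
            return knob
    return None

def approx_sqrt(x, step):
    """Toy approximate kernel: round the input to a multiple of step."""
    return math.sqrt(round(x / step) * step)
```

For example, `pick_knob(math.sqrt, approx_sqrt, [8.0, 2.0, 0.5], [10, 21, 30, 41], 95.0)` checks the coarsest step first and falls back to finer steps until the mean quality on the samples reaches the target.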
Speedups for Both CPU and GPU • Target quality = 90% [Figure: geometric mean speedup on CPU and GPU; one bar is labeled 7.9]
One Solution Does Not Fit All! [Figure: Paraprox compared with loop perforation]
Conclusion • Manual approximation is not easy/practical. • We need tools for approximation. • One approximation method does not fit all applications. • By using pattern-based optimization, we achieved a 2.6x speedup while maintaining 90% of the output quality.