Paraprox: Pattern-Based Approximation for Data Parallel Applications

Mehrzad Samadi, D. Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke
University of Michigan, Electrical Engineering and Computer Science
Compilers Creating Custom Processors
March 2014

Approximate Computing
• 100% accuracy is not always necessary
• Less work
• Better performance
• Lower power consumption
• There are many domains where approximate output is acceptable

Data Parallelism is everywhere
Financial Modeling, Medical Imaging, Physics Simulation, Audio Processing, Machine Learning, Games, Image Processing, Statistics, Video Processing
• Mostly regular applications
• Work on large data sets
• Exact output is not required for operation
Good opportunity for automatic approximation

Approximating KMeans

Approximating KMeans
Approximation alone is not enough: we need a way to control the output quality.

Approximate Computing
• Ask the programmer to do it
  • Not easy / practical
  • Hard to debug
• Automatic approximation
  • One solution does not fit all
• Paraprox: pattern-based approximation
  • Pattern-specific approximation methods
  • Provide knobs to control the output quality

Common Patterns
Pattern          Example domains
Map              Signal Processing, Physics, …
Partitioning     Image Processing, Finance, …
Reduction        Machine Learning, Physics, …
Scatter/Gather   Machine Learning, Search, …
Stencil          Image Processing, Physics, …
Scan             Statistics, …
M. McCool et al. “Structured Parallel Programming: Patterns for Efficient Computation.” Morgan Kaufmann, 2012.

Paraprox
[Diagram: a parallel program (OpenCL/CUDA) is the input to Paraprox, which applies pattern detection and approximation methods; the resulting approximate kernels and tuning parameters go to the runtime system.]


Approximate Memoization BlackScholes

Approximate Memoization
Flowchart steps: identify candidate functions; find the table size; check the quality; determine q_i for each input; fill the table; execution
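The table-based idea can be sketched in a few lines. This is a minimal illustration, not Paraprox's generated code: the function `f`, the input ranges, the bit widths, and all helper names are assumptions made for the example. Each input is quantized to a fixed number of bits (at least 1), the quantized levels are concatenated into a table address, and the pure function is replaced by a single lookup.

```python
import math

def quantize(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer level in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    x = min(max(x, lo), hi)  # clamp out-of-range inputs
    return round((x - lo) / (hi - lo) * levels)

def dequantize(level, lo, hi, bits):
    """Return the input value that a quantization level stands for."""
    levels = (1 << bits) - 1
    return lo + (hi - lo) * level / levels

def build_table(func, ranges, bits):
    """Precompute func for every combination of quantized inputs."""
    table = [0.0] * (1 << sum(bits))

    def fill(i, address, args):
        if i == len(bits):
            table[address] = func(*args)
            return
        lo, hi = ranges[i]
        for level in range(1 << bits[i]):
            value = dequantize(level, lo, hi, bits[i])
            fill(i + 1, (address << bits[i]) | level, args + [value])

    fill(0, 0, [])
    return table

def lookup(table, ranges, bits, *args):
    """Approximate func(*args) by concatenating quantized inputs into an address."""
    address = 0
    for x, (lo, hi), b in zip(args, ranges, bits):
        address = (address << b) | quantize(x, lo, hi, b)
    return table[address]

# Hypothetical pure function standing in for a candidate kernel function.
def f(x, y):
    return math.sin(x) * math.exp(-y)

ranges = [(0.0, math.pi), (0.0, 2.0)]  # assumed input ranges
bits = (6, 6)                          # assumed bits per input: 12 address bits
table = build_table(f, ranges, bits)
approx = lookup(table, ranges, bits, 1.0, 0.5)
exact = f(1.0, 0.5)
```

Giving an input more bits shrinks its quantization step and its share of the error, at the cost of a larger table.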

Candidate Functions
• Pure functions do not:
  • read or write any global or static mutable state
  • call an impure function
  • perform I/O
• In CUDA/OpenCL:
  • No global/shared memory access
  • No thread ID dependent computation

Table Size
[Figure: output quality and speedup for table sizes of 64K, 32K, and 16K]

How Many Bits per Input?
Table size = 32KB, 15 address bits
Inputs that do not need high precision get fewer bits.

A  B  C   Output quality
5  5  5   95.2%

6  4  5   96.5%
4  6  5   91.3%
5  6  4   95.4%
5  4  6   91.2%

6  5  4   95.1%
4  7  4   95.4%
5  7  3   95.8%
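One way to search such allocations is a greedy hill climb that moves one address bit at a time from one input to another and keeps the best-scoring neighbor. The sketch below assumes that procedure; the slide does not spell out the search, and `toy_quality` with its sensitivity weights is a made-up stand-in for running the kernel and measuring output quality.

```python
from itertools import permutations

def neighbors(bits, min_bits=1):
    """Allocations reachable by moving one bit from one input to another."""
    for src, dst in permutations(range(len(bits)), 2):
        if bits[src] > min_bits:
            moved = list(bits)
            moved[src] -= 1
            moved[dst] += 1
            yield tuple(moved)

def tune_bits(quality, total_bits, num_inputs):
    """Greedy search: start from an even split, stop when no neighbor improves."""
    base, extra = divmod(total_bits, num_inputs)
    best = tuple(base + (1 if i < extra else 0) for i in range(num_inputs))
    best_q = quality(best)
    while True:
        scored = [(quality(c), c) for c in neighbors(best)]
        if not scored:
            return best, best_q
        cand_q, cand = max(scored, key=lambda s: s[0])
        if cand_q <= best_q:
            return best, best_q
        best, best_q = cand, cand_q

# Hypothetical quality model: input A is 8x more sensitive than B and C,
# and each extra bit halves an input's contribution to the error.
def toy_quality(bits, sensitivity=(8.0, 1.0, 1.0)):
    return -sum(s / (1 << b) for s, b in zip(sensitivity, bits))

allocation, score = tune_bits(toy_quality, total_bits=15, num_inputs=3)
```

With this toy model the search starts at (5, 5, 5) and settles on (7, 4, 4): the sensitive input ends up with the most bits, which is the behavior the slide describes.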

Tile Approximation
[Figure: difference with neighbors]

Stencil/Partitioning
  C  = Input[i][j]
  W  = Input[i][j-1]
  E  = Input[i][j+1]
  NW = Input[i-1][j-1]
  N  = Input[i-1][j]
  NE = Input[i-1][j+1]
  SW = Input[i+1][j-1]
  S  = Input[i+1][j]
  SE = Input[i+1][j+1]

  NW  N   NE
  W   C   E
  SW  S   SE

• Paraprox looks for global/texture/shared load accesses to arrays with affine addresses
• Control the output quality by changing the number of accesses per tile

Approximation steps shown on the tile:
• Bottom row replaced by the center row: SW → W, S → C, SE → E
• Top row also replaced by the center row: NW → W, N → C, NE → E
• Every access replaced by the center element: all nine accesses → C
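The effect of replacing accesses can be tried on a small 3x3 mean filter. This is a plain-Python sketch under assumed names (`mean_filter`, the `scheme` values), not the kernels Paraprox emits: "row" reads the center row in place of the rows above and below, and "center" reads only the center element.

```python
def mean_filter(img, scheme="exact"):
    """3x3 mean filter; 'row' and 'center' reuse fewer distinct input elements."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # border elements are copied unchanged
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if scheme == "row":
                        total += img[i][j + dj]       # NW/SW -> W, N/S -> C, NE/SE -> E
                    elif scheme == "center":
                        total += img[i][j]            # every access -> C
                    else:
                        total += img[i + di][j + dj]  # all nine distinct accesses
            out[i][j] = total / 9.0
    return out

# Smooth ramp: neighbors are similar, so the approximations lose nothing here.
ramp = [[float(i + j) for j in range(5)] for i in range(5)]
# Quadratic in i: rows differ more, so reusing the center row introduces error.
curved = [[float(i * i) for j in range(5)] for i in range(5)]

exact_curved = mean_filter(curved, "exact")
row_curved = mean_filter(curved, "row")
```

The approximation is cheap when neighboring elements are close in value, which is why the difference with neighbors matters for output quality.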

Scan/Prefix Sum
• Prefix sum
  • Cumulative histogram, list ranking, …
• Data parallel implementation:
  • Divide the input into smaller subarrays
  • Compute the prefix sum of each subarray in parallel

Data Parallel Scan
Input:                                    1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 1 1 1
Phase I (scan each subarray):             1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4
Subarray sums:                            4 4 4 4
Phase II (scan the subarray sums):        4 8 12 16
Phase III (add sums to later subarrays):  1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 13 14 15 16
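The three phases can be written out sequentially to check the numbers on the slide. This is a sketch with an assumed function name and chunk size; on a GPU the per-subarray scans of Phase I and the additions of Phase III would run in parallel.

```python
from itertools import accumulate

def three_phase_scan(data, chunk):
    """Inclusive prefix sum computed the way a data parallel scan is structured."""
    # Phase I: scan each subarray independently.
    parts = [list(accumulate(data[s:s + chunk])) for s in range(0, len(data), chunk)]
    if not parts:
        return []
    # Phase II: scan the subarray sums (the last element of each partial scan).
    totals = list(accumulate(p[-1] for p in parts))
    # Phase III: add the running total of all earlier subarrays to each later one.
    out = list(parts[0])
    for k in range(1, len(parts)):
        out.extend(x + totals[k - 1] for x in parts[k])
    return out

result = three_phase_scan([1] * 16, chunk=4)
```

For sixteen ones and subarrays of four elements, `result` is the sequence 1 through 16, matching the Phase III row.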

Scan Approximation
[Figure: output elements, from 0 to N]

Evaluation

Experimental Setup
• Clang 3.3
• GPU: NVIDIA GTX 560
• CPU: Intel Core i7
• Benchmarks: NVIDIA SDK, Rodinia, …
[Diagram labels: CUDA, AST Visitor, Pattern Detection, Action Generator, Rewrite Driver, Approximate Kernels]

Runtime System
[Figure labels: Quality Checking, Quality, Target Quality, Speedup]
Green [PLDI 2010], SAGE [MICRO 2013]
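A quality check against a target can be illustrated with a toy calibration loop. Everything here is an assumption for illustration and not the Paraprox runtime: the "kernel" is a strided mean, the knob is the stride, and quality is defined as one minus the relative error against the exact result (which must be nonzero). The loop tries the most aggressive knob first and falls back to the exact setting if nothing meets the target.

```python
def strided_mean(data, stride):
    """Toy approximate kernel: average every stride-th element (stride 1 is exact)."""
    sample = data[::stride]
    return sum(sample) / len(sample)

def quality(exact, approx):
    """Output quality as 1 - relative error; assumes exact is nonzero."""
    return 1.0 - abs(approx - exact) / abs(exact)

def calibrate(data, strides, target):
    """Return the largest stride whose measured quality meets the target, else 1."""
    exact = strided_mean(data, 1)
    for stride in sorted(strides, reverse=True):
        if quality(exact, strided_mean(data, stride)) >= target:
            return stride
    return 1

# Periodic data on which some strides sample badly.
data = [float(i % 10) for i in range(1000)]
chosen = calibrate(data, strides=[2, 3, 4, 8, 10], target=0.90)
```

On this data the even strides and stride 10 miss the 90% target, so the check selects stride 3; without the check, the most aggressive setting would return 0 instead of 4.5.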

Speedups for Both CPU and GPU
[Figure: speedups on CPU and GPU for target quality = 90%, with the geometric mean speedup; one value is labeled 7.9]

One Solution Does Not Fit All!
[Figure: Paraprox compared with loop perforation]

We Have Control on Output Quality

Distribution of Errors

Conclusion
• Manual approximation is not easy or practical.
• We need tools for approximation.
• One approximation method does not fit all applications.
• By using pattern-based optimization, we achieved a 2.6x speedup while maintaining 90% of the output quality.

