KFusion: Simple Annotations for Optimized Data Flow
Liam Kiemele, Celina Berg, Aaron Gulliver, Yvonne Coady
University of Victoria
with thanks to Tim Mattson, Andrew Brownsword (Intel)
Road Map
• KFusion at work
• Motivation
• KFusion
• Costs and benefits
    • annotations, lines of code
    • modularity, performance
• Future work and conclusion
    • explicit composition of computation around data flow
IWOCL 2013 Kiemele
Parallel Hardware
Good News and Bad News…
• Parallelism
    • Added complexity
    • Optimization
    • Memory and bandwidth
• Modularity: let's talk libraries
    • Details behind an API
    • Optimize data access (prefetching, caching…)
    • Better separation of concerns
OpenCL Libraries
• OpenCL (Open Computing Language), for CPUs and GPUs
• At the heart of any given library will be kernels
• Suppose we build an OpenCL linear algebra library

    __kernel void add_vectors(__global float* sum, __global float* v1, __global float* v2) {
        int i = get_global_id(0);   /* one work-item per vector element */
        sum[i] = v1[i] + v2[i];
    }
What you get…
    c = sqrt(add(square(x), square(y)));
(Diagram: kernels square, square, add, sqrt)
What you WANT!
    c = sqrt(add(square(x), square(y)));
(Diagram: x, y, add, sqrt)
Two Choices
• Modular Implementation
    • Reusable
    • Easy to maintain and develop
    • Individual kernel optimization
• Monolithic Implementation
    • Performance
    • Allows optimizations across what would otherwise be module boundaries
• Can we do both?
Introducing KFusion
Application File: square(…), square(…), add(…), sqrt(…)
Library File: float* square, float* add …, float* sqrt …
Kernel File: kernel square, kernel add …, kernel sqrt …
After KFusion…
Application File: square(…), square(…), add(…), sqrt(…)
Library File: void square …, void add …, void sqrt …
Kernel File: kernel square, kernel add …, kernel sqrt …
New Call: c = fu(…);
New Function: float* fu(…)
New Kernel: kernel fu(…)
It works!
Road Map
• KFusion at work
    • what and how
    • …why!
• Costs and benefits
    • annotations, lines of code
    • modularity, performance
• Future work and conclusion
    • explicit composition of computation around data flow
Costs
• Annotations
    • application hints
    • library synchronization
    • kernel data flow for compositions
• Preprocessor
    • build dependency graph
    • source-to-source transformation
    • loop fusion
    • deforestation
Annotations
Application:
    #pragma start fuse
    square(x,x)
    square(y,y)
    add(c,x,y)
    sqrt(c, c)
    c = sqrt(add(square(x), square(y)));
    #pragma end fuse
Library:
    #pragma sync out
    public void dot_product(double result, vector x);
    #pragma sync in
    public void matrix_vector_mult(vector b, Matrix A, vector x)
Annotations
Kernel (add):
    __kernel void add_vectors(__global float* sum, __global float* v1, __global float* v2) {
    #pragma kload
        {
            int i = get_global_id(0);
            float arg1 = v1[i];
            float arg2 = v2[i];
            float s;
        }
        s = arg1 + arg2;
    #pragma kstore
        {
            sum[i] = s;
        }
    }
Dependency Graph
(Diagram: x and y feed square(x) and square(y), which feed add(c,x,y), then sqrt(c), producing c)
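One way to derive such a graph from the annotated call list is to link each call to the most recent earlier call that wrote one of its inputs. The sketch below is plain C under stated assumptions: the `Call` struct, the single-character buffer names, and `build_deps` are hypothetical and do not reflect KFusion's actual preprocessor.

```c
#include <stddef.h>

#define MAX_INPUTS 2

/* A library call in the fuse region: one output buffer, up to two inputs. */
typedef struct {
    const char *name;
    char out;               /* buffer written, e.g. 'c' */
    char in[MAX_INPUTS];    /* buffers read; 0 marks an unused slot */
} Call;

/* deps[j][k] receives the index of the earlier call that produced input k
 * of call j, or -1 if that input comes from outside the region
 * (or the slot is unused). */
void build_deps(const Call *calls, size_t n, int deps[][MAX_INPUTS]) {
    for (size_t j = 0; j < n; ++j) {
        for (size_t k = 0; k < MAX_INPUTS; ++k) {
            deps[j][k] = -1;
            if (calls[j].in[k] == 0) continue;
            /* Scan backwards so the most recent writer wins. */
            for (size_t p = j; p-- > 0;) {
                if (calls[p].out == calls[j].in[k]) {
                    deps[j][k] = (int)p;
                    break;
                }
            }
        }
    }
}
```

For the four calls in the fuse region, this links add to both square calls and sqrt to add, while the square calls depend only on external inputs.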
Transformation…
(Diagram: x and y feed square(x) and square(y), which feed add_sqrt(c,x,y), producing c)
Replacement Kernel!
(Diagram: x and y feed fu(c,x,y), producing c)
Annotations AOSD 2013 Kiemele
Benefits
Performance
Roofline Analysis of Performance
• Peak Actual GFlops = minimum(Bandwidth × flops/byte, Peak Performance)
• Three Linear Algebra Scenarios
    • c = sqrt(a² + b²)
    • d = sqrt((x1 - x2)² + (y1 - y2)²)
    • Start of conjugate gradient
        • r = Ax - b
        • p = r
        • R2 = r*r
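The roofline bound can be worked through for the first scenario, c = sqrt(a² + b²). The sketch below is plain C with simplifying assumptions that are not from the talk: 4-byte floats, each square, add, and sqrt counted as one flop, and no caching between kernels. The names `roofline_gflops`, `Cost`, `cost_unfused`, and `cost_fused` are illustrative.

```c
/* Roofline bound: min(bandwidth * flops/byte, peak performance). */
double roofline_gflops(double bandwidth_gbs, double flops_per_byte,
                       double peak_gflops) {
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Per-element cost of c = sqrt(a^2 + b^2). */
typedef struct { double flops; double bytes; } Cost;

/* Four separate kernels: square, square, add, sqrt. */
Cost cost_unfused(void) {
    Cost c = { 4.0, 0.0 };
    c.bytes += 2 * (4 + 4);   /* two squares: each reads 1 float, writes 1 */
    c.bytes += 8 + 4;         /* add: reads 2 floats, writes 1 */
    c.bytes += 4 + 4;         /* sqrt: reads 1 float, writes 1 */
    return c;                 /* 36 bytes in total */
}

/* One fused kernel: reads a and b, writes c. */
Cost cost_fused(void) {
    Cost c = { 4.0, 8.0 + 4.0 };
    return c;                 /* 12 bytes in total */
}
```

With an assumed 100 GB/s of bandwidth and a 1000 GFlops peak (illustrative numbers, not measurements), the bound is about 11 GFlops unfused and about 33 GFlops fused. Both cases are memory-bound, so cutting bytes moved by a factor of three raises the bound by the same factor.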
c = sqrt(a² + b²)
d = sqrt((x1 - x2)² + (y1 - y2)²)
Conjugate Gradient
Road Map
• KFusion at work
    • what and how
    • …why!
• Costs and benefits
    • annotations, lines of code
    • modularity, performance
• Future work and conclusion
    • explicit composition of computation around data flow
Future Work
    kfuse { calls }
    __kernel void k(…) {
        kload { … }
        computation
        kstore { … }
    }
• Tools
    • comprehension and visualization
    • emulation
    • performance testing
• Combine with other approaches
    • Optimizing compilers
    • Code generators
Conclusion
• KFusion is a first step towards
    • explicit, flexible control
    • allowing optimizations between modules
    • separation of concerns
• github.com/4Liamk/KFusion/wiki