Efficiency Programming for the (Productive) Masses
Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab / UPCRC

Make productivity programmers efficient, and efficiency programmers productive?

• Productivity-level language (PLL): Python, Ruby
  • high-level abstractions well matched to the application domain => 5x faster development and 3-10x fewer lines of code
  • used by >90% of programmers
• Efficiency-level language (ELL): C/C++, CUDA, OpenCL
  • >5x longer development time
  • potential 10x-100x performance by exposing the hardware model
  • used by <10% of programmers, yet their work is poorly reused
• 5x the development time for 10x-100x the performance! Can we raise the level of abstraction and still get performance?

Capture patterns instead of “domains”?

• Efficiency programmers know how to target computation patterns to hardware:
  • stencil/SIMD codes => GPUs
  • sparse matrix => communication-avoiding algorithms on multicore
  • “big finance” Monte Carlo simulation => MapReduce
• Libraries? Useful, but they don’t raise the abstraction level
• How do we make ELL work accessible to more PLL programmers?

“Stovepipes”: Connect Pattern to Platform

[Figure: two layer diagrams. Traditional layers: app domains (robotics, data viz., virtual worlds, music) map onto computation domains (probabilistic, rendering, physics, linear algebra), then onto a common language substrate, a thick runtime & OS, and hardware (cloud, FPGA, SIMD, GPU, OOO). The “stovepipes” alternative: the same applications map onto motifs/patterns (dense matrix, sparse matrix, stencil), and humans produce stovepipes connecting each pattern to a platform (stencil to SIMD, dense to OoO, dense to GPU, stencil to FPGA) over a thin runtime & OS and the same hardware.]

SEJITS: Selective, Embedded Just-in-Time Specialization

• Productivity programmers write in a general-purpose, modern, high-level PLL
• The SEJITS infrastructure selectively specializes computation patterns at runtime
• Specialization uses runtime information to generate and JIT-compile ELL code targeted to the hardware
• Embedded because the PLL’s own machinery enables it (vs. extending the PLL interpreter)

Specifically...

• When a “specializable” function is called:
  • determine whether a specializer is available for the current platform
  • if not: continue executing normally in the PLL
• If a specializer is found, it can:
  • manipulate/traverse the AST of the function
  • emit & JIT-compile ELL source code
  • dynamically link the compiled code into the PLL interpreter
• Specializers are themselves written in the PLL (a minimal dispatch sketch follows)
• The necessary features are present in modern PLLs, but absent from older widely used PLLs
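
As a hedged illustration of that dispatch logic, here is a minimal Python sketch; the decorator, the registry, and the Specializer.compile interface are hypothetical names for this example, not the actual ParLab API:

    import ast
    import inspect
    import platform
    import textwrap

    _specializers = {}   # (function name, platform) -> specializer object

    def specializable(fn):
        """Use a platform-specific specializer if one exists;
        otherwise fall back to executing the plain Python body."""
        compiled = None
        def wrapper(*args, **kwargs):
            nonlocal compiled
            if compiled is not None:
                return compiled(*args, **kwargs)      # reuse the JIT-ed code
            spec = _specializers.get((fn.__name__, platform.machine()))
            if spec is None:
                return fn(*args, **kwargs)            # no specializer: stay in the PLL
            source = textwrap.dedent(inspect.getsource(fn))
            tree = ast.parse(source)                  # traverse/manipulate the AST
            compiled = spec.compile(tree, args)       # emit + JIT-compile ELL code
            return compiled(*args, **kwargs)          # call the dynamically linked code
        return wrapper

Because the fallback path simply calls the original function, an app decorated this way degrades gracefully on platforms with no specializer installed.
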
SEJITS makes tuning decisions per-function (not per-app)

[Figure: a productivity app (.py) with functions f(), @g(), @h() runs in the PLL interpreter; for a decorated function, SEJITS hands the code to a specializer, which emits .c source, compiles and links it (cc/ld) into a .so, and links that back into the running interpreter, all on top of the OS/HW. A second build of the slide overlays the words Selective, Embedded, JIT, and Specialization on the corresponding stages of the pipeline.]

Example: Stencil Computation in Ruby

Use introspection to grab parameters and inspect the AST of the computation:

    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          point.neighbors(1).each do |x|
            out_grid[point] += 0.2*x.val
          end
        end
      end
    end

The specializer emits OpenMP code, 1000x-2000x faster than the pure Ruby:

    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      /* unpack Ruby arrays into in_grid and out_grid */
      #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
      for (t_8=1; t_8<256-1; t_8++) {
        for (t_7=1; t_7<256-1; t_7++) {
          for (t_6=1; t_6<256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = (out_grid[center]
                + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
            ...
            out_grid[center] = (out_grid[center]
                + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
      }}}
      return Qtrue;
    }

Example: Sparse Matrix-Vector Multiply in Python

    # “Gather nonzero entries,
    #  multiply them by vector,
    #  do for each column”

• The specializer outputs CUDA for nvcc
• SEJITS leverages downstream toolchains

B. Catanzaro et al., joint work with NVIDIA Research
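
The slide shows only the comment above; as a hedged sketch of what the productivity-level code might look like, assuming a compressed-sparse-column layout (the format and all names here are illustrative, not from the original):

    def spmv_csc(colptr, rowidx, vals, x, n_rows):
        """y = A*x for a sparse matrix A stored in CSC form."""
        y = [0.0] * n_rows
        for j, xj in enumerate(x):                     # do for each column
            for k in range(colptr[j], colptr[j + 1]):  # gather nonzero entries
                y[rowidx[k]] += vals[k] * xj           # multiply them by the vector
        return y

A specializer can map this loop nest onto a CUDA kernel, which is the role of the CUDA/nvcc path mentioned above.
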
SEJITS in the Cloud: Spark & Nexus

[Figure: the same per-function pipeline, but the productivity app (.py) is specialized to .scala source, compiled with scalac, and executed by Spark workers running on Nexus over Eucalyptus or EC2.]

• Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
• It relies on the Scala runtime and its data-parallel abstractions
• It relies on the Nexus cloud resource-management layer

Example: Logistic regression using Spark/Scala (in progress)

M. Zaharia et al., “Spark: Cluster Computing with Working Sets,” HotCloud ’09
B. Hindman et al., “Nexus: A Common Substrate for Cluster Computing,” HotCloud ’09
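
The slide itself shows no code; as a rough Python sketch of the kind of map/reduce loop such a specializer would target (the real example is in Spark/Scala, and these names are illustrative):

    import math
    import random

    def point_gradient(w, x, y):
        """Gradient of the logistic loss at one labeled point (y in {-1, +1})."""
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        return [scale * xi for xi in x]

    def logistic_regression(points, dims, iters=10, lr=0.1):
        """Batch gradient descent written as a per-point map followed by a
        sum reduce -- the loop shape Spark's data-parallel primitives cover."""
        w = [random.random() for _ in range(dims)]
        for _ in range(iters):
            total = [0.0] * dims
            for x, y in points:                               # map: per-point gradient
                g = point_gradient(w, x, y)
                total = [t + gi for t, gi in zip(total, g)]   # reduce: sum
            w = [wi - lr * ti for wi, ti in zip(w, total)]
        return w
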
SEJITS in the Cloud: Hadoop

[Figure: the same pipeline targeting Hadoop: the app is specialized to .java source, compiled with javac, and submitted by the specializer to a Hadoop master, with Nexus on the cloud underneath.]

SEJITS for Cloud Computing

Idea: the same Python app runs on the desktop, on manycore, and in the cloud
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
  • Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  • Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...
• Combine abstractions in one app (see the sketch below)
• Remember... you can always fall back to the PLL
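
A hypothetical sketch of how one app targets several platforms, continuing the registry idea from the earlier sketch (none of these names are the actual API):

    _specializers = {}   # (function name, platform) -> specializer, as before

    class OpenMPSpecializer:
        def compile(self, tree, args):
            ...   # emit C with OpenMP pragmas, run cc/ld, dlopen the .so

    class SparkSpecializer:
        def compile(self, tree, args):
            ...   # emit Scala, run scalac, hand the job to Spark workers

    # One function, several targets: the tuning decision is per function,
    # per platform. A platform with no entry keeps running the Python body.
    _specializers[("stencil_kernel", "x86_64")] = OpenMPSpecializer()
    _specializers[("stencil_kernel", "spark-cluster")] = SparkSpecializer()
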
Questions

• Won’t we need lots & lots of specializers?
  • If the ParLab “motifs” bet is correct, a few tens of specializers will go a long way
• What about libraries, frameworks, etc.?
  • SEJITS is complementary to frameworks
  • Most libraries are written for ELLs, and ELLs lack features that promote code reuse and don’t raise the abstraction level
• Why isn’t this just as hard as a “magic compiler”?
  • Specializers are written by human experts; SEJITS lets us “crowdsource” them
• Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?

Conclusion

• SEJITS enables a code-generation strategy per function, not per app
• A uniform approach to productive programming:
  • the same app runs on the cloud, on multicore, or against autotuned libraries
  • multiple frameworks/abstractions can be combined in the same app
• A research enabler:
  • incrementally develop specializers for different motifs or prototype hardware
  • no need to build a full compiler & toolchain just to get started