Efficiency Programming for the (Productive) Masses
Armando Fox, Bryan Catanzaro, Shoaib Kamil, Yunsup Lee, Ben Carpenter, Erin Carson, Krste Asanovic, Dave Patterson, Kurt Keutzer
UC Berkeley Parallel Computing Lab / UPCRC

Make productivity programmers efficient, and efficiency programmers productive?

• Productivity-level language (PLL): Python, Ruby
  • high-level abstractions well matched to the application domain => 5x faster development and 3-10x fewer lines of code
  • used by >90% of programmers
• Efficiency-level language (ELL): C/C++, CUDA, OpenCL
  • >5x longer development time
  • potential 10x-100x performance by exposing the hardware model
  • used by <10% of programmers, yet their work is poorly reused
• 5x the development time for 10x-100x the performance! Can we raise the level of abstraction and still get performance?

Capture patterns instead of “domains”?

• Efficiency programmers know how to target computation patterns to hardware:
  • stencil/SIMD codes => GPUs
  • sparse matrix => communication-avoiding algorithms on multicore
  • “big finance” Monte Carlo simulation => MapReduce
• Libraries? Useful, but they don’t raise the abstraction level
• How do we make ELL work accessible to more PLL programmers?

“Stovepipes”: Connect Pattern to Platform

[Figure: two layer diagrams. Traditional layers: app domains (robotics, data viz., virtual worlds, music) map onto computation domains (probabilistic, rendering, physics, linear algebra), then onto a common language substrate, a thick runtime & OS, and hardware (cloud, FPGA, SIMD, GPU, OOO). The “stovepipes” alternative: the same applications map onto motifs/patterns (dense matrix, sparse matrix, stencil), and humans produce stovepipes connecting each pattern to a platform (stencil to SIMD, dense to OoO, dense to GPU, stencil to FPGA) over a thin runtime & OS and the same hardware.]

SEJITS: Selective, Embedded Just-in-Time Specialization

• Productivity programmers write in a general-purpose, modern, high-level PLL
• The SEJITS infrastructure selectively specializes computation patterns at runtime
• Specialization uses runtime information to generate and JIT-compile ELL code targeted to the hardware
• Embedded because the PLL’s own machinery enables it (vs. extending the PLL interpreter)

Specifically...

• When a “specializable” function is called:
  • determine whether a specializer is available for the current platform
  • if not: continue executing normally in the PLL
• If a specializer is found, it can:
  • manipulate/traverse the AST of the function
  • emit & JIT-compile ELL source code
  • dynamically link the compiled code into the PLL interpreter
• Specializers are themselves written in the PLL (a minimal dispatch sketch follows)
• The necessary features are present in modern PLLs, but absent from older widely used PLLs
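
As a hedged illustration of that dispatch logic, here is a minimal Python sketch; the decorator, the registry, and the Specializer.compile interface are hypothetical names for this example, not the actual ParLab API:

    import ast
    import inspect
    import platform
    import textwrap

    _specializers = {}   # (function name, platform) -> specializer object

    def specializable(fn):
        """Use a platform-specific specializer if one exists;
        otherwise fall back to executing the plain Python body."""
        compiled = None
        def wrapper(*args, **kwargs):
            nonlocal compiled
            if compiled is not None:
                return compiled(*args, **kwargs)      # reuse the JIT-ed code
            spec = _specializers.get((fn.__name__, platform.machine()))
            if spec is None:
                return fn(*args, **kwargs)            # no specializer: stay in the PLL
            source = textwrap.dedent(inspect.getsource(fn))
            tree = ast.parse(source)                  # traverse/manipulate the AST
            compiled = spec.compile(tree, args)       # emit + JIT-compile ELL code
            return compiled(*args, **kwargs)          # call the dynamically linked code
        return wrapper

Because the fallback path simply calls the original function, an app decorated this way degrades gracefully on platforms with no specializer installed.
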
SEJITS makes tuning decisions per-function (not per-app)

[Figure: a productivity app (.py) with functions f(), @g(), @h() runs in the PLL interpreter; for a decorated function, SEJITS hands the code to a specializer, which emits .c source, compiles and links it (cc/ld) into a .so, and links that back into the running interpreter, all on top of the OS/HW. A second build of the slide overlays the words Selective, Embedded, JIT, and Specialization on the corresponding stages of the pipeline.]

Example: Stencil Computation in Ruby

Use introspection to grab parameters and inspect the AST of the computation:

    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          point.neighbors(1).each do |x|
            out_grid[point] += 0.2*x.val
          end
        end
      end
    end

The specializer emits OpenMP code, 1000x-2000x faster than the pure Ruby:

    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      /* unpack Ruby arrays into in_grid and out_grid */
      #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
      for (t_8=1; t_8<256-1; t_8++) {
        for (t_7=1; t_7<256-1; t_7++) {
          for (t_6=1; t_6<256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = (out_grid[center]
                + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
            ...
            out_grid[center] = (out_grid[center]
                + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));
      }}}
      return Qtrue;
    }

Example: Sparse Matrix-Vector Multiply in Python

    # “Gather nonzero entries,
    #  multiply them by vector,
    #  do for each column”

• The specializer outputs CUDA for nvcc
• SEJITS leverages downstream toolchains

B. Catanzaro et al., joint work with NVIDIA Research
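
The slide shows only the comment above; as a hedged sketch of what the productivity-level code might look like, assuming a compressed-sparse-column layout (the format and all names here are illustrative, not from the original):

    def spmv_csc(colptr, rowidx, vals, x, n_rows):
        """y = A*x for a sparse matrix A stored in CSC form."""
        y = [0.0] * n_rows
        for j, xj in enumerate(x):                     # do for each column
            for k in range(colptr[j], colptr[j + 1]):  # gather nonzero entries
                y[rowidx[k]] += vals[k] * xj           # multiply them by the vector
        return y

A specializer can map this loop nest onto a CUDA kernel, which is the role of the CUDA/nvcc path mentioned above.
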
SEJITS in the Cloud: Spark & Nexus

[Figure: the same per-function pipeline, but the productivity app (.py) is specialized to .scala source, compiled with scalac, and executed by Spark workers running on Nexus over Eucalyptus or EC2.]

• Spark enables cloud-distributed, persistent, fault-tolerant shared parallel data structures
• It relies on the Scala runtime and its data-parallel abstractions
• It relies on the Nexus cloud resource-management layer

Example: Logistic regression using Spark/Scala (in progress)

M. Zaharia et al., “Spark: Cluster Computing with Working Sets,” HotCloud ’09
B. Hindman et al., “Nexus: A Common Substrate for Cluster Computing,” HotCloud ’09
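
The slide itself shows no code; as a rough Python sketch of the kind of map/reduce loop such a specializer would target (the real example is in Spark/Scala, and these names are illustrative):

    import math
    import random

    def point_gradient(w, x, y):
        """Gradient of the logistic loss at one labeled point (y in {-1, +1})."""
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        return [scale * xi for xi in x]

    def logistic_regression(points, dims, iters=10, lr=0.1):
        """Batch gradient descent written as a per-point map followed by a
        sum reduce -- the loop shape Spark's data-parallel primitives cover."""
        w = [random.random() for _ in range(dims)]
        for _ in range(iters):
            total = [0.0] * dims
            for x, y in points:                               # map: per-point gradient
                g = point_gradient(w, x, y)
                total = [t + gi for t, gi in zip(total, g)]   # reduce: sum
            w = [wi - lr * ti for wi, ti in zip(w, total)]
        return w
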
SEJITS in the Cloud: Hadoop

[Figure: the same pipeline targeting Hadoop: the app is specialized to .java source, compiled with javac, and submitted by the specializer to a Hadoop master, with Nexus on the cloud underneath.]

SEJITS for Cloud Computing

Idea: the same Python app runs on the desktop, on manycore, and in the cloud
• Cloud/multicore synergy: specialize intra-node as well as generate cloud code
  • Cloud: emit JIT-able code for Spark (Scala), Hadoop (Java), MPI (C), ...
  • Single node: emit JIT-able code for OpenCL, CUDA, OpenMP, ...
• Combine abstractions in one app (see the sketch below)
• Remember... you can always fall back to the PLL
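
A hypothetical sketch of how one app targets several platforms, continuing the registry idea from the earlier sketch (none of these names are the actual API):

    _specializers = {}   # (function name, platform) -> specializer, as before

    class OpenMPSpecializer:
        def compile(self, tree, args):
            ...   # emit C with OpenMP pragmas, run cc/ld, dlopen the .so

    class SparkSpecializer:
        def compile(self, tree, args):
            ...   # emit Scala, run scalac, hand the job to Spark workers

    # One function, several targets: the tuning decision is per function,
    # per platform. A platform with no entry keeps running the Python body.
    _specializers[("stencil_kernel", "x86_64")] = OpenMPSpecializer()
    _specializers[("stencil_kernel", "spark-cluster")] = SparkSpecializer()
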
Questions

• Won’t we need lots & lots of specializers?
  • If the ParLab “motifs” bet is correct, a few tens of specializers will go a long way
• What about libraries, frameworks, etc.?
  • SEJITS is complementary to frameworks
  • Most libraries are written for ELLs, and ELLs lack features that promote code reuse and don’t raise the abstraction level
• Why isn’t this just as hard as a “magic compiler”?
  • Specializers are written by human experts; SEJITS lets us “crowdsource” them
• Will programmers accustomed to Matlab/Fortran learn functional style, list comprehensions, etc.?

Conclusion

• SEJITS enables a code-generation strategy per function, not per app
• A uniform approach to productive programming:
  • the same app runs on the cloud, on multicore, or against autotuned libraries
  • multiple frameworks/abstractions can be combined in the same app
• A research enabler:
  • incrementally develop specializers for different motifs or prototype hardware
  • no need to build a full compiler & toolchain just to get started