Make HPC Easy with Domain-Specific Languages and High-Level Frameworks Biagio Cosenza, Ph.D. DPS Group, Institut für Informatik Universität Innsbruck, Austria
Outline • Complexity in HPC • Parallel hardware • Optimizations • Programming models • Harnessing complexity • Automatic tuning • Automatic parallelization • DSLs • Abstractions for HPC • Related work in Insieme
Complexity in Hardware • The need for parallel computing • Parallelism in hardware • Three walls • Power wall • Memory wall • ILP wall
The Power Wall Power is expensive, but transistors are free • We can put more transistors on a chip than we have the power to turn on • The power-efficiency challenge • Performance per watt is the new metric – systems are often constrained by power and cooling • This forces us to concede the battle for maximum performance of individual processing elements in order to win the war for application efficiency through optimizing total system performance • Example • Intel Pentium 4 HT 670 (released in May 2005): clock rate 3.8 GHz • Intel Core i7 3930K Sandy Bridge (released in Nov. 2011): clock rate 3.2 GHz
The Memory Wall The growing disparity of speed between the CPU and memory outside the CPU chip has become an overwhelming bottleneck • It changes the way we optimize programs • Optimizing for memory vs. optimizing computation • E.g., a multiply is no longer considered a painfully slow operation when compared to a load or a store
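To make the load/store point concrete, here is a minimal C sketch (ours, not from the talk): both functions perform exactly the same arithmetic on an N×N matrix, but the second walks memory with stride N and stalls on cache misses, typically running several times slower on large matrices.

    /* memory wall in practice: same arithmetic, different access order */
    #include <stddef.h>
    #define N 4096

    double sum_row_major(const double *a) {   /* contiguous, cache-friendly */
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i * N + j];
        return s;
    }

    double sum_col_major(const double *a) {   /* stride N, cache-hostile */
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i * N + j];            /* ~one cache miss per load */
        return s;
    }

The multiplications in the index expressions are identical in both versions; the performance gap comes entirely from the memory system.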
The ILP Wall There are diminishing returns on finding more ILP • Instruction Level Parallelism: the potential overlap among instructions • Many ILP techniques • Instruction pipelining • Superscalar execution • Out-of-order execution • Register renaming • Branch prediction • The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible • It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy
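A small C illustration (ours) of ILP made visible at the source level: the two partial sums below do not depend on each other, so a superscalar, pipelined core can issue their updates in the same cycle — exactly the kind of parallelism that is getting ever harder to extract from arbitrary sequential code.

    /* two independent dependence chains expose instruction-level parallelism */
    double dot(const double *a, const double *b, int n) {
        double s0 = 0.0, s1 = 0.0;
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += a[i]     * b[i];        /* independent of the next update */
            s1 += a[i + 1] * b[i + 1];    /* can issue in the same cycle */
        }
        if (n & 1) s0 += a[n - 1] * b[n - 1];   /* odd-length remainder */
        return s0 + s1;
    }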
The “Many-core” Challenge Tilera TILE-Gx • Many-core vs. multi-core • Multi-core architectures and programming models suitable for 2 to 32 processors will not easily and incrementally evolve to serve many-core systems of 1000s of processors • Many-core is the future
What does it mean? • Hardware is evolving • The number of cores is the new Megahertz • We need • New programming models • New system software • New supporting architectures that are naturally parallel
New Challenges • Make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip • Maximize productivity • Programming models should • be independent of the number of processors • support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism • “Autotuners” should play a larger role than conventional compilers in translating parallel programs
Parallel Programming Models • Real-Time Workshop (MathWorks) • Binary Modular Data Flow Machine (TU Munich and AS Nuremberg) • Pthreads • Erlang • Charm++ (Illinois) • MPI • Cilk (MIT) • HMPP • OpenMP • OpenACC • MapReduce (Google) • OpenCL (Khronos Group) • Brook (Stanford) • DataCutter (Maryland) • CUDA (NVIDIA) • NESL (CMU) • StreamIt (MIT & Microsoft) • Borealis (Brown) • HPCS Chapel (Cray) • HPCS Fortress (Sun) • Threading Building Blocks (Intel) • HPCS X10 (IBM) • Sequoia (Stanford)
Reconsidering… • Applications • What are common parallel kernel applications? • Parallel patterns • Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns • A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication • E.g., dense linear algebra, sparse linear algebra, spectral methods, … • Metrics • Scalability • An old belief was that anything less than linear scaling for a multi-processor application was a failure • With the new hardware trend, this is no longer true • Any speedup is OK!
Harnessing Complexity • Compiler approaches • DSL, automatic parallelization, … • Library-based approaches
What can a compiler do for us? • Optimize code • Automatic tuning • Automatic code generation • e.g., in order to support different hardware • Automatically parallelize code
Automatic Parallelization • There are critical opinions on parallel programming models; the other way is auto-parallelizing compilers: sequential code => parallel code • Wen-mei Hwu, University of Illinois at Urbana-Champaign: “Why sequential programming models could be the best way to program many-core systems” http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf
Automatic Parallelization

    for (int i = 0; i < 100; i++) {
        A[i] = A[i+1];
    }

• Nowadays compilers have new “tools” for analysis, e.g. the polyhedral model • …but performance is still far from that of a manual parallelization approach • Polyhedral extraction: SCoP detection, then translation from the IR into the polyhedral model • For the loop above: D: { i in N : 0 <= i < 100 }, W: A[i] for each i in D, R: A[i+1] for each i in D • Analyses & transformations are applied on the model • Code generation: IR code is generated back from the model
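For contrast, a hedged sketch (ours, not Insieme's actual output): the loop above carries an anti-dependence, since iteration i must read A[i+1] before iteration i+1 overwrites it, so it cannot be parallelized as written. A loop whose iterations touch disjoint data can be proven independent in the polyhedral model, after which a compiler may in effect emit:

    /* iteration i touches only A[i] and B[i]: no loop-carried dependence */
    void scale(float *restrict A, const float *restrict B, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            A[i] = 2.0f * B[i];
    }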
Autotuners vs. Traditional Compilers • The performance of future parallel applications will crucially depend on the quality of the code generated by the compiler • The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel • The resulting optimization space is large • A programming model may simplify the problem • but it does not solve it
Optimizations’ Complexity: An Example • Input • OpenMP code • Simple parallel codes • matrix multiplication, Jacobi, stencil3d, … • Few optimizations and tuning parameters • Tiling 2D/3D • # of threads • Goal: optimize for performance and efficiency
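A hedged sketch (function and parameter names are ours, not the framework's) of the kind of code region such a tuner explores: tile size and thread count are left as free parameters, and the auto-tuner compiles and measures many (TILE, nthreads) configurations.

    #include <omp.h>

    /* tiled matrix multiply; TILE and nthreads are the tuning knobs */
    void matmul_tiled(const double *A, const double *B, double *C,
                      int n, int TILE, int nthreads) {
        omp_set_num_threads(nthreads);          /* tuning parameter 1 */
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += TILE)    /* tuning parameter 2 */
            for (int jj = 0; jj < n; jj += TILE)
                for (int kk = 0; kk < n; kk += TILE)
                    for (int i = ii; i < ii + TILE && i < n; i++)
                        for (int j = jj; j < jj + TILE && j < n; j++) {
                            double s = C[i * n + j];
                            for (int k = kk; k < kk + TILE && k < n; k++)
                                s += A[i * n + k] * B[k * n + j];
                            C[i * n + j] = s;   /* tile (ii,jj) owned by one thread */
                        }
    }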
Optimizations’ Complexity: An Example • Problem • Big search space • brute force takes years of computation • Analytical models fail to find the best configuration • Solution • Multi-objective search • Offline search for Pareto-front solutions • Runtime selection according to the objective • Multi-versioning • H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing, 2012
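A minimal sketch of the runtime-selection step (all names and numbers are illustrative, not the framework's API): the tuner ships a table of Pareto-optimal versions with their measured costs, and the runtime picks the version minimizing the user's current weighted objective.

    /* pick among multi-versioned code according to a time/energy trade-off */
    typedef void (*kernel_fn)(void);
    struct version { kernel_fn run; double time; double energy; };

    kernel_fn select_version(const struct version *v, int n, double w_time) {
        int best = 0;
        for (int i = 1; i < n; i++) {
            double ci = w_time * v[i].time    + (1.0 - w_time) * v[i].energy;
            double cb = w_time * v[best].time + (1.0 - w_time) * v[best].energy;
            if (ci < cb) best = i;   /* lower weighted cost wins */
        }
        return v[best].run;
    }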
Optimizations’ Complexity • [Diagram: the multi-objective auto-tuning pipeline] At compile time, the Analyzer extracts code regions from the input code; the Optimizer explores configurations, driven by measurements on the parallel target platform; the best solutions are handed to the Backend, which emits multi-versioned code. At run time, the Runtime System dynamically selects among the versions. • H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing, 2012
Domain Specific Languages • Ease of programming • Use of domain-specific concepts • E.g., “color”, “pixel”, “particle”, “atom” • Simple interface • Hide complexity • Data structures • Parallelization issues • Optimizations’ tuning • Address a specific parallelization pattern
Domain Specific Languages • DSLs may help parallelization • Focus on domain concepts and abstractions • Language constraints may help automatic parallelization by compilers • Three major benefits • Productivity • Performance • Portability and forward scalability
Domain Specific Languages: GLSL Shaders (OpenGL) • [Diagram: OpenGL 4.3 pipeline] Vertex Data → Vertex Shader → Tessellation Control Shader → Tessellation Evaluation Shader → Geometry Shader → Primitive Setup and Rasterization → Fragment Shader → Blending → Pixel Data; shader stages read from the Texture Store
    // vertex shader
    attribute vec3 vertex;
    attribute vec3 normal;
    attribute vec2 uv1;
    uniform mat4 _mvProj;
    uniform mat3 _norm;
    varying vec2 vUv;
    varying vec3 vNormal;

    void main(void) {
        // compute position
        gl_Position = _mvProj * vec4(vertex, 1.0);
        vUv = uv1;
        // compute light info
        vNormal = _norm * normal;
    }

    // fragment shader
    varying vec2 vUv;
    varying vec3 vNormal;
    uniform vec3 mainColor;
    uniform float specularExp;
    uniform vec3 specularColor;
    uniform sampler2D mainTexture;
    uniform mat3 _dLight;
    uniform vec3 _ambient;

    void getDirectionalLight(vec3 normal, mat3 dLight, float specularExp,
                             out vec3 diffuse, out float specular) {
        vec3 ecLightDir = dLight[0];   // light direction in eye coordinates
        vec3 colorIntensity = dLight[1];
        vec3 halfVector = dLight[2];
        float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
        float specularContribution = max(dot(normal, halfVector), 0.0);
        specular = pow(specularContribution, specularExp);
        diffuse = colorIntensity * diffuseContribution;
    }

    void main(void) {
        vec3 diffuse;
        float spec;
        getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
        vec3 color = max(diffuse, _ambient.xyz) * mainColor;
        gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0)
                     + vec4(spec * specularColor, 0.0);
    }
DSL Examples • Matlab, DLA DSLs (dense linear algebra), Python, shell scripts, SQL, XML, CSS, BPEL, … • Interesting recent research work: • Leo A. Meyerovich, Matthew E. Torok, Eric Atkinson, Rastislav Bodik. Superconductor: A Language for Big Data Visualization. LASH-C 2013 • Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, Nick Seltzer. Diderot: A Parallel DSL for Image Analysis and Visualization. ACM PLDI 2012 • A. S. Green, P. L. Lumsdaine, N. J. Ross, B. Valiron. Quipper: A Scalable Quantum Programming Language. ACM PLDI 2013
Harnessing Complexity • Compilers can do • Automatic parallelization • Optimization of (parallel) code • DSLs and code generation • But well-written, hand-optimized parallel code still outperforms a compiler-based approach
Harnessing Complexity • Compiler approaches • DSL, automatic parallelization, … • Library-based approaches
Some Examples • Pattern oriented • MapReduce (Google) • Problem specific • FLASH, adaptive-mesh refinement (AMR) code • GROMACS, molecular dynamics • Hardware/programming-model specific • Cactus • libWater (*) • (*) best performance
Insieme Compiler and Research • Compiler infrastructure • Runtime support
Insieme Research: Automatic Task Partitioning for Heterogeneous Hardware • Heterogeneous platforms • E.g., CPU + 2 GPUs • Input: OpenCL for a single device • Output: OpenCL code for multiple devices • Automatic partitioning of work-items between multiple devices • Based on hardware, program, and input size • Machine-learning approach • K. Kofler, I. Grasso, B. Cosenza, T. Fahringer. An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning. ACM International Conference on Supercomputing, 2013
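A much simplified sketch of the underlying mechanism (the paper's partitioning is machine-learned and input-sensitive; the fixed 0.7 ratio and all names here are illustrative). Both command queues are assumed to belong to the same OpenCL context, so one kernel object may be enqueued on both; each device receives a contiguous slice of the one-dimensional work-item range via the global work offset.

    #include <CL/cl.h>

    void launch_partitioned(cl_command_queue q_cpu, cl_command_queue q_gpu,
                            cl_kernel kernel, size_t global_size) {
        size_t split   = (size_t)(0.7 * (double)global_size);  /* GPU share */
        size_t sz_gpu  = split;
        size_t sz_cpu  = global_size - split;
        size_t off_cpu = split;

        /* GPU runs work-items [0, split); CPU runs [split, global_size) */
        clEnqueueNDRangeKernel(q_gpu, kernel, 1, NULL,     &sz_gpu, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q_cpu, kernel, 1, &off_cpu, &sz_cpu, NULL, 0, NULL, NULL);

        clFinish(q_gpu);   /* the two devices compute concurrently until here */
        clFinish(q_cpu);
    }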
Insieme Research: OpenCL on Clusters of Heterogeneous Nodes • libWater • OpenCL extensions for clusters • Event-based, extending OpenCL events • Supporting intra-device synchronization • DQL • A DSL for device query, management, and discovery • I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer. libWater: Heterogeneous Distributed Computing Made Easy. ACM International Conference on Supercomputing, 2013
libWater • Runtime • OpenCL • pthreads, OpenMP • MPI • DAG command-event representation
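Plain OpenCL events already express a dependence DAG between commands; libWater extends this idea across devices and cluster nodes (its actual API is not reproduced here). A standard OpenCL sketch of such a chain, write → kernel → read, with each node waiting on its predecessor's event:

    #include <CL/cl.h>

    void enqueue_chain(cl_command_queue q, cl_kernel k, cl_mem buf,
                       const float *in, float *out, size_t n) {
        cl_event e_write, e_kernel;
        size_t bytes = n * sizeof(float);

        /* node 1: host-to-device copy */
        clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, in, 0, NULL, &e_write);
        /* node 2: kernel, depends on the copy */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 1, &e_write, &e_kernel);
        /* node 3: blocking read-back, depends on the kernel */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, out, 1, &e_kernel, NULL);
    }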
libWater: DAG Optimizations • Dynamic Collective communication pattern Replacement (DCR) • Latency hiding • Intra-node copy optimizations
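To illustrate the DCR idea with a minimal MPI example of our own (not libWater code): a root rank sending the same buffer to every other rank point-to-point is semantically a broadcast, and replacing the send/receive loop with the collective lets the MPI library use an efficient broadcast tree.

    #include <mpi.h>

    void distribute(double *buf, int n, int root, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* naive point-to-point pattern a runtime could detect ... */
        if (rank == root) {
            for (int r = 0; r < size; r++)
                if (r != root) MPI_Send(buf, n, MPI_DOUBLE, r, 0, comm);
        } else {
            MPI_Recv(buf, n, MPI_DOUBLE, root, 0, comm, MPI_STATUS_IGNORE);
        }
        /* ... and the collective it would be replaced with:
           MPI_Bcast(buf, n, MPI_DOUBLE, root, comm); */
    }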
Insieme (Ongoing) Research: Support for DSLs • [Diagram: DSL compilation flow] Input codes and DSL programs enter the Frontend and are lowered to the Intermediate Representation; the Transformation Framework (polyhedral model, parallel optimizations, stencil computation, automatic tuning support) operates on the IR; the Backend emits output codes (pthreads, OpenCL, MPI) for the target hardware: GPUs, CPUs, heterogeneous platforms, compute clusters • Library support: rendering algorithm implementations, geometry loader, … • Runtime system
About Insieme • Insieme compiler • Research framework • OpenMP, Cilk, MPI, OpenCL • Runtime, IR • Support for the polyhedral model • Multi-objective optimization • Machine learning • Extensible • Insieme (GPL) and libWater (LGPL) soon available on GitHub