
Make HPC Easy with Domain-Specific Languages and High-Level Frameworks


Presentation Transcript


  1. Make HPC Easy with Domain-Specific Languages and High-Level Frameworks • Biagio Cosenza, Ph.D. • DPS Group, Institut für Informatik, Universität Innsbruck, Austria

  2. Outline • Complexity in HPC • Parallel hardware • Optimizations • Programming models • Harnessing complexity • Automatic tuning • Automatic parallelization • DSLs • Abstractions for HPC • Related work in Insieme

  3. Complexity in HPC

  4. Complexity in Hardware • The need for parallel computing • Parallelism in hardware • Three walls • Power wall • Memory wall • Instruction-level parallelism (ILP) wall

  5. The Power Wall • Power is expensive, but transistors are free • We can put more transistors on a chip than we have the power to turn on • Power efficiency challenge • Performance per watt is the new metric – systems are often constrained by power & cooling • This forces us to concede the battle for maximum performance of individual processing elements, in order to win the war for application efficiency through optimizing total system performance • Example • Intel Pentium 4 HT 670 (released in May 2005): clock rate 3.8 GHz • Intel Core i7 3930K Sandy Bridge (released in November 2011): clock rate 3.2 GHz

  6. The Memory Wall • The growing disparity in speed between the CPU and memory outside the CPU chip becomes an overwhelming bottleneck • It changes the way we optimize programs • Optimizing for memory vs optimizing computation • E.g., a multiply is no longer considered a prohibitively slow operation when compared to a load or a store

  7. The ILP Wall • There are diminishing returns on finding more ILP • Instruction-level parallelism: the potential overlap among instructions • Many ILP techniques • Instruction pipelining • Superscalar execution • Out-of-order execution • Register renaming • Branch prediction • The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible • It is increasingly difficult to find enough parallelism in a single instruction stream to keep a high-performance single-core processor busy

  8. Parallelism in Hardware

  9. The “Many-core” Challenges (pictured: Tilera TILE-Gx807) • Many-core vs multi-core • Multi-core architectures and programming models suitable for 2 to 32 processors will not easily evolve incrementally to serve many-core systems with 1000s of processors • Many-core is the future

  10. What does it mean? • Hardware is evolving • The number of cores is the new Megahertz • We need • New programming models • New system software • New supporting architectures that are naturally parallel

  11. New Challenges • Make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip • Maximize productivity • Programming models should • be independent of the number of processors • support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism • “Autotuners” should play a larger role than conventional compilers in translating parallel programs

  12. Parallel Programming Models • Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg), Pthreads, Erlang, Charm (Illinois), MPI, Cilk (MIT), HMPP, OpenMP, OpenACC, MapReduce (Google), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), CUDA (NVidia), NESL (CMU), StreamIt (MIT & Microsoft), Borealis (Brown), HPCS Chapel (Cray), HPCS Fortress (Sun), Threading Building Blocks (Intel), HPCS X10 (IBM), Sequoia (Stanford)

  13. Parallel Programming Models (same list as the previous slide)

  14. Reconsidering… • Applications • What are common parallel kernel applications? • Parallel patterns • Instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns • A parallel pattern (“dwarf”) is an algorithmic method that captures a pattern of computation and communication • E.g. dense linear algebra, sparse linear algebra, spectral methods, … • Metrics • Scalability • An old belief was that anything less than linear scaling for a multi-processor application is a failure • With the new hardware trend, this is no longer true • Any speedup is OK!

  15. Harnessing Complexity

  16. Harnessing Complexity • Compiler approaches • DSL, automatic parallelization, … • Library-based approaches

  17. What can a compiler do for us? • Optimize code • Automatic tuning • Automatic code generation • e.g. in order to support different hardware • Automatically parallelize code

  18. Automatic Parallelization • Critical opinions on parallel programming models suggest the other way around: auto-parallelizing compilers that turn sequential code into parallel code • Wen-mei Hwu, University of Illinois at Urbana-Champaign: Why sequential programming models could be the best way to program many-core systems, http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf

  19. Automatic Parallelization • Example loop with a loop-carried dependence: for (int i = 0; i < 100; i++) { A[i] = A[i+1]; } • Nowadays compilers have new “tools” for analysis, such as the polyhedral model • Polyhedral extraction: SCoP detection and translation from the IR to the polyhedral model • In the model: iteration domain D = { i in N : 0 <= i < 100 }, a read of A[i+1] for each i in D, a write of A[i] for each i in D • Analyses & transformations are applied on the model • Code generation: IR code is generated back from the model • …but performance is still far from a manual parallelization approach
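
To make the dependence argument concrete, here is a small, self-contained C sketch (an illustration, not code from the slides or from Insieme): the first loop carries a dependence across iterations and cannot be naively parallelized, while the second is dependence-free and an auto-parallelizer could legally emit the OpenMP pragma shown.

#include <stdio.h>
#define N 100

int main(void) {
    int A[N + 1], B[N + 1];
    for (int i = 0; i <= N; i++) { A[i] = i; B[i] = i; }

    /* Loop 1: iteration i reads A[i+1], which iteration i+1 overwrites,
       i.e. a loop-carried dependence; the iterations cannot safely run
       in parallel as written. */
    for (int i = 0; i < N; i++)
        A[i] = A[i + 1];

    /* Loop 2: every iteration touches only element i, so there is no
       carried dependence; an auto-parallelizer could legally emit: */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        B[i] = 2 * B[i];

    printf("A[0] = %d, B[0] = %d\n", A[0], B[0]);  /* expect 1 and 0 */
    return 0;
}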

  20. Autotuners vs Traditional Compilers • Performance of future parallel applications will crucially depend on the quality of the code generated by the compiler • The compiler selects which optimizations to perform, chooses parameters for these optimizations, and selects from among alternative implementations of a library kernel • The resulting optimization space is large • The programming model may simplify the problem, but it does not solve it
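
As a toy illustration of the search problem (a sketch with a hypothetical kernel and hand-picked candidate values, not the framework of the following slides), an autotuner in its simplest form just times a parameterized code version for each candidate parameter value and keeps the fastest; real autotuners explore far larger spaces with smarter strategies than brute force.

#include <stdio.h>
#include <string.h>
#include <omp.h>

#define N 1024

static double a[N][N], b[N][N];

/* candidate kernel: tiled transpose-copy, parameterized by tile size */
static void copy_tiled(int tile) {
    for (int ii = 0; ii < N; ii += tile)
        for (int jj = 0; jj < N; jj += tile)
            for (int i = ii; i < ii + tile; i++)
                for (int j = jj; j < jj + tile; j++)
                    b[j][i] = a[i][j];
}

int main(void) {
    int candidates[] = {8, 16, 32, 64, 128};   /* parameter space to explore */
    int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    int best_tile = candidates[0];
    double best_time = 1e30;

    memset(a, 0, sizeof a);
    for (int c = 0; c < ncand; c++) {
        double t0 = omp_get_wtime();
        copy_tiled(candidates[c]);
        double t = omp_get_wtime() - t0;
        printf("tile %3d: %.4fs\n", candidates[c], t);
        if (t < best_time) { best_time = t; best_tile = candidates[c]; }
    }
    printf("best tile size: %d\n", best_tile);
    return 0;
}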

  21. Optimizations’ Complexity: An Example • Input • OpenMP code • Simple parallel codes: matrix multiplication, Jacobi, stencil3d, … • Few optimizations and tuning parameters • Tiling 2D/3D • # of threads • Goal: optimize for performance and efficiency
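
A sketch of the kind of input code meant here, assuming a plain OpenMP matrix multiplication in which the 2D tile size and the thread count are the exposed tuning parameters (the values below are arbitrary placeholders, not the tuned settings from the paper).

#include <stdio.h>
#include <omp.h>

#define N 512
#define TILE 32          /* tuning parameter: tile size (must divide N) */
#define NUM_THREADS 4    /* tuning parameter: thread count */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0; }

    omp_set_num_threads(NUM_THREADS);

    /* 2D-tiled matrix multiplication; each (ii, jj) tile of C is owned by
       exactly one iteration of the collapsed parallel loop, so there is no race. */
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        for (int k = kk; k < kk + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %f\n", C[0][0]);   /* expect 2.0 * N = 1024.0 */
    return 0;
}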

  22. Optimizations’ Complexity: An Example • Problem • Big search space: brute force would take years of computation • Analytical models fail to find the best configuration • Solution • Multi-objective search • Offline search of Pareto-front solutions • Runtime selection according to the objective • Multi-versioning • H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing, 2012
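
A minimal sketch of the multi-objective idea (the numbers are illustrative, not measurements from the paper): keep only the configurations that are Pareto-optimal in (time, energy); at run time a selector would then pick from this front according to the current objective and dispatch the matching multi-versioned code region.

#include <stdio.h>

typedef struct { int id; double time; double energy; } Config;

/* c1 dominates c2 if it is no worse in both objectives and better in one */
static int dominates(Config c1, Config c2) {
    return c1.time <= c2.time && c1.energy <= c2.energy &&
           (c1.time < c2.time || c1.energy < c2.energy);
}

int main(void) {
    Config measured[] = {   /* offline measurements (hypothetical) */
        {0, 1.0, 9.0}, {1, 2.0, 4.0}, {2, 1.5, 9.5}, {3, 4.0, 3.0}, {4, 3.0, 7.0}
    };
    int n = (int)(sizeof measured / sizeof measured[0]);

    /* Build the Pareto front: a configuration survives if no other dominates it. */
    printf("Pareto-optimal configurations:\n");
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(measured[j], measured[i])) dominated = 1;
        if (!dominated)
            printf("  config %d: time=%.1f energy=%.1f\n",
                   measured[i].id, measured[i].time, measured[i].energy);
    }
    /* A runtime selector would choose from this front, e.g. the fastest
       version within a given energy budget. */
    return 0;
}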

  23. Optimizations’ Complexity • Framework overview (diagram): at compile time, the analyzer identifies code regions in the input code, the optimizer explores configurations and measures them on the parallel target platform, and the best solutions are compiled by the backend into multi-versioned code; at run time, the runtime system dynamically selects among the versions • H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, H. Moritsch. A Multi-Objective Auto-Tuning Framework for Parallel Codes. ACM Supercomputing, 2012

  24. Domain Specific Languages • Ease of programming • Use of domain-specific concepts • E.g. “color”, “pixel”, “particle”, “atom” • Simple interface • Hide complexity • Data structures • Parallelization issues • Optimizations’ tuning • Address specific parallelization patterns

  25. Domain Specific Languages • DSLs may help parallelization • Focus on domain concepts and abstractions • Language constraints may help automatic parallelization by compilers • 3 major benefits • Productivity • Performance • Portability and forward scalability

  26. Domain Specific Languages: GLSL Shaders (OpenGL) • OpenGL 4.3 pipeline: Vertex Data → Vertex Shader → Tessellation Control Shader → Tessellation Evaluation Shader → Geometry Shader → Primitive Setup and Rasterization → Fragment Shader → Blending → Pixel Data, with the Texture Store accessible from the shader stages

  27. GLSL shader example: vertex and fragment shaders

  // vertex shader
  attribute vec3 vertex;
  attribute vec3 normal;
  attribute vec2 uv1;
  uniform mat4 _mvProj;
  uniform mat3 _norm;
  varying vec2 vUv;
  varying vec3 vNormal;

  void main(void) {
    // compute position
    gl_Position = _mvProj * vec4(vertex, 1.0);
    vUv = uv1;
    // compute light info
    vNormal = _norm * normal;
  }

  // fragment shader
  varying vec2 vUv;
  varying vec3 vNormal;
  uniform vec3 mainColor;
  uniform float specularExp;
  uniform vec3 specularColor;
  uniform sampler2D mainTexture;
  uniform mat3 _dLight;
  uniform vec3 _ambient;

  void getDirectionalLight(vec3 normal, mat3 dLight, float specularExponent,
                           out vec3 diffuse, out float specular) {
    vec3 ecLightDir = dLight[0];      // light direction in eye coordinates
    vec3 colorIntensity = dLight[1];
    vec3 halfVector = dLight[2];
    float diffuseContribution = max(dot(normal, ecLightDir), 0.0);
    float specularContribution = max(dot(normal, halfVector), 0.0);
    specular = pow(specularContribution, specularExponent);
    diffuse = colorIntensity * diffuseContribution;
  }

  void main(void) {
    vec3 diffuse;
    float spec;
    getDirectionalLight(normalize(vNormal), _dLight, specularExp, diffuse, spec);
    vec3 color = max(diffuse, _ambient.xyz) * mainColor;
    gl_FragColor = texture2D(mainTexture, vUv) * vec4(color, 1.0) + vec4(spec * specularColor, 0.0);
  }

  28. DSL Examples • Matlab, DLA DSLs (dense linear algebra), Python, shell scripts, SQL, XML, CSS, BPEL, … • Interesting recent research work: • Leo A. Meyerovich, Matthew E. Torok, Eric Atkinson, Rastislav Bodík. Superconductor: A Language for Big Data Visualization. LASH-C 2013 • Charisee Chiw, Gordon Kindlmann, John Reppy, Lamont Samuels, Nick Seltzer. Diderot: A Parallel DSL for Image Analysis and Visualization. ACM PLDI 2012 • A. S. Green, P. L. Lumsdaine, N. J. Ross, and B. Valiron. Quipper: A Scalable Quantum Programming Language. ACM PLDI 2013

  29. Harnessing Complexity • Compilers can do • Automatic parallelization • Optimization of (parallel) code • DSLs and code generation • But well-written, hand-optimized parallel code still outperforms compiler-based approaches

  30. Harnessing Complexity • Compiler approaches • DSL, automatic parallelization, … • Library-based approaches

  31. Some Examples • Pattern oriented • MapReduce (Google) — see the sketch after this slide • Problem specific • FLASH, an adaptive-mesh refinement (AMR) code • GROMACS, molecular dynamics • Hardware/programming model specific • Cactus • libWater — best performance
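
For the pattern-oriented case, the essence of the map/reduce pattern that such libraries expose can be sketched in a few lines of C with OpenMP (a hand-rolled illustration, not Google's MapReduce API):

#include <stdio.h>
#define N 1000

/* user-supplied map function */
static int map_square(int x) { return x * x; }

int main(void) {
    long sum = 0;                                  /* reduce: sum of mapped values */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= N; i++)
        sum += map_square(i);
    printf("sum of squares 1..%d = %ld\n", N, sum); /* expect 333833500 */
    return 0;
}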

  32. Insieme Compiler and Research • Compiler infrastructure • Runtime support

  33. Insieme Research: Automatic Task Partitioning for Heterogeneous HW • Heterogeneous platforms • E.g. CPU + 2 GPUs • Input: OpenCL code for a single device • Output: OpenCL code for multiple devices • Automatic partitioning of work-items between multiple devices • Based on hardware, program, and input size • Machine-learning approach • K. Kofler, I. Grasso, B. Cosenza, T. Fahringer. An Automatic Input-Sensitive Approach for Heterogeneous Task Partitioning. ACM International Conference on Supercomputing, 2013
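
A rough sketch of the partitioning step itself (the 0.75 ratio and the work-group size are made-up values standing in for the model's prediction, and no actual OpenCL calls are issued): split a 1D ND-range into per-device chunks aligned to the work-group size, which is what would then be passed as global_work_offset / global_work_size when enqueuing the kernel on each device.

#include <stdio.h>

int main(void) {
    size_t global_size = 1 << 20;      /* total work-items */
    size_t local_size  = 256;          /* work-group size (assumed) */
    double gpu_fraction = 0.75;        /* predicted share for the GPU (assumed) */

    /* Round the split point down to a multiple of the work-group size. */
    size_t gpu_items = (size_t)(global_size * gpu_fraction);
    gpu_items -= gpu_items % local_size;
    size_t cpu_items = global_size - gpu_items;

    printf("GPU: offset=0       size=%zu\n", gpu_items);
    printf("CPU: offset=%zu size=%zu\n", gpu_items, cpu_items);
    return 0;
}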

  34. Results – Architecture 1

  35. Results – Architecture 2

  36. Insieme Research: OpenCL on Clusters of Heterogeneous Nodes • libWater • OpenCL extensions for clusters • Event based, an extension of OpenCL events • Supporting intra-device synchronization • DQL • A DSL for device query, management, and discovery • I. Grasso, S. Pellegrini, B. Cosenza, T. Fahringer. libWater: Heterogeneous Distributed Computing Made Easy. ACM International Conference on Supercomputing, 2013

  37. libWater • Runtime • OpenCL • pthreads, OpenMP • MPI • DAG-based command/event representation
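
A schematic sketch of the idea (hypothetical C types, not libWater's actual API): each enqueued command becomes a node in a DAG whose edges are event dependencies, which is what lets the runtime reorder, overlap, and optimize commands before executing them.

#include <stdio.h>

#define MAX_DEPS 4

typedef enum { CMD_WRITE_BUF, CMD_KERNEL, CMD_READ_BUF } CmdKind;

typedef struct Command {
    int id;
    CmdKind kind;
    int device;                        /* which node/device runs it */
    struct Command *deps[MAX_DEPS];    /* edges: commands that must finish first */
    int ndeps;
} Command;

static void add_dep(Command *c, Command *dep) { c->deps[c->ndeps++] = dep; }

int main(void) {
    Command h2d    = {0, CMD_WRITE_BUF, 0, {NULL}, 0};   /* host -> device copy */
    Command kernel = {1, CMD_KERNEL,    0, {NULL}, 0};   /* compute */
    Command d2h    = {2, CMD_READ_BUF,  0, {NULL}, 0};   /* device -> host copy */
    add_dep(&kernel, &h2d);
    add_dep(&d2h, &kernel);

    /* A runtime walking this DAG can, for example, overlap independent
       transfers with computation (latency hiding) or merge groups of
       point-to-point transfers into collective ones. */
    Command *cmds[] = { &h2d, &kernel, &d2h };
    for (int i = 0; i < 3; i++)
        printf("cmd %d (kind %d) waits on %d command(s)\n",
               cmds[i]->id, (int)cmds[i]->kind, cmds[i]->ndeps);
    return 0;
}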

  38. libWater: DAG Optimizations • Dynamic Collective communication pattern Replacement (DCR) • Latency hiding • Intra-node copy optimizations

  39. Insieme (Ongoing) Research: Support for DSLs • Compiler architecture (diagram): the frontend takes input codes and DSLs and lowers them to the intermediate representation; a transformation framework operates on the IR (polyhedral model, parallel optimizations, stencil computation, automatic tuning support); the backend emits output codes (pthreads, OpenCL, MPI) for the target hardware: GPU, CPU, heterogeneous platforms, compute clusters • Runtime system • Library support: rendering algorithm implementations, geometry loaders, …

  40. About Insieme • Insieme compiler • Research framework • OpenMP, Cilk, MPI, OpenCL • Runtime, IR • Support for the polyhedral model • Multi-objective optimization • Machine learning • Extensible • Insieme (GPL) and libWater (LGPL) soon available on GitHub
