
CS 395T: Program Synthesis for Heterogeneous Parallel Computers



  1. CS 395T: Program Synthesis for Heterogeneous Parallel Computers

  2. Administration
  • Instructor: Keshav Pingali
    • Professor (CS, ICES)
    • Office: POB 4.126A
    • Email: pingali@cs.utexas.edu
  • TA: Michael He
    • Graduate student (CS)
    • Office: POB 4.104
    • Email: hejy@cs.utexas.edu
  • Website for course
    • http://www.cs.utexas.edu/users/pingali/CS395T-2017/index.html

  3. Meeting times
  • Lecture:
    • TTh 12:30-2:00 PM, GDC 2.210
  • Office hours:
    • Keshav Pingali: Tuesday 3-4 PM, POB 4.126

  4. Prerequisites
  • Compilers and architecture
    • Graduate-level knowledge
    • Compilers: CS 380C; Architecture: H&P book
  • Software and math maturity
    • Able to implement large programs
    • Familiarity with concepts like SAT solvers, or the ability to learn that material on your own
  • Research maturity
    • Ability to read papers on your own and understand the key ideas

  5. Course organization
  • This is a paper-reading course
  • Papers:
    • In every class, we will discuss one or two papers
    • One student will give a 30-45 minute presentation of the papers at the beginning of each class
    • The rest of class time will be devoted to a discussion of the paper(s)
    • The website will have the papers and discussion order

  6. Coursework
  • Writing assignments (20% of grade)
    • Everyone is expected to read all papers
    • Reading reports: submit as a private note to Michael on Piazza. See Michael's sample on the website.
    • Deadline: Sunday 11:59 PM for papers that we will discuss on the following Tuesday and Thursday.
  • Class participation (20% of grade)
  • Term project (60% of grade)
    • Substantial implementation project
    • Based on our ideas or yours
    • Work alone or in pairs

  7. Course concerns
  • Two main concerns in programming
    • Performance
    • Productivity
  • These concerns are often in tension
    • Performance
      • Modern processors are very complex
      • Getting performance may require low-level programming
    • Productivity
      • Improving productivity requires higher levels of abstraction in application programs
  • How can we get both productivity and performance?
    • Need powerful approaches to convert high-level abstract programs to low-level efficient programs

  8. Performance

  9. Complexity of modern processors: Intel KNL

  10. Getting performance
  • Exploit parallelism at multiple levels
    • Find thread-level parallelism
      • Keep the 72 cores busy
      • Hyper-threading: you actually have 288 threads
    • Find SIMD parallelism
      • Cores have vector units
    • Find instruction-level parallelism (ILP)
      • Cores are pipelined
    • Load-balancing
      • Assign work to cores to keep them all busy
  • Exploit locality at multiple levels
    • L1 and L2 caches on each tile (pair of cores)
    • Network locality between tiles
    • Distributed-memory machine (Stampede II, 10 Pflops)
      • Network locality between hosts
  • Getting performance usually requires low-level programming using pThreads/OpenMP with vector intrinsics, and MPI for distributed memory (see the sketch after this slide)
    • Tension with productivity
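
  A minimal sketch of what this low-level programming looks like, using OpenMP for both thread-level and SIMD parallelism; the function name saxpy and the choice of OpenMP pragmas (rather than raw pThreads and vector intrinsics) are illustrative assumptions, not from the slides. MPI would be layered on top of this for the distributed-memory case.

    /* Minimal sketch: thread-level + SIMD parallelism with OpenMP.
       Compile with something like: gcc -O3 -fopenmp saxpy.c */
    #include <stddef.h>

    void saxpy(size_t n, float a, const float *x, float *y) {
        /* Thread-level parallelism: iterations are split across cores.
           SIMD parallelism: the simd clause asks the compiler to use the
           cores' vector units for the loop body. */
        #pragma omp parallel for simd schedule(static)
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }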

  11. Productivity

  12. Productivity
  • Most important advances in PL have introduced new abstractions for productivity
  • Examples:
    • Procedures (1950?)
      • Abstraction: parameterized code module (λ-abstraction)
      • Abstracted away: implementation code
    • Processor architecture (IBM 360)
      • Abstraction: machine language
      • Abstracted away: implementation of machine language
    • FORTRAN I (1957)
      • Abstraction: high-level programming language
      • Abstracted away: machine language

  13. Abstractions in PL (contd.)
  • Examples (contd.):
    • Structured programming (1967)
      • Abstraction: structured control-flow constructs like if-then-else, while-loops, for-loops, etc.
      • Abstracted away: conditional jumps (machine language relic)
    • Object-oriented programming (1970-)
      • Abstraction: abstract data type
      • Abstracted away: representation of data type
    • Automatic storage management (1960-)
      • Abstraction: objects
      • Abstracted away: machine addresses (pointers)

  14. Abstractions for parallelism?
  • What are the right abstractions for parallel programming?
    • Very difficult problem: roughly 50 years of work but no agreement
  • Lots of proposals:
    • Functional languages, dataflow languages: Dennis, Arvind
    • Logic programming languages: Warren, Ueda
    • Communicating Sequential Processes (CSP): Hoare
    • Bulk-synchronous parallel (BSP) programming: Valiant
    • Unity: Chandy/Misra

  15. What we will study
  • Domain-independent abstraction
    • Operator formulation of algorithms (my group)
    • Motto: Parallel program = Operator + Schedule + Parallel data structures
  • Domain-specific abstractions
    • Most problem domains have some underlying computational algebra (e.g. databases and relational algebra)
    • Programs are expressions in that algebra and can be optimized using algebraic identities

  16. Operator formulation: example
  [Figure: Delaunay mesh refinement. Red triangle: badly shaped triangle; blue triangles: cavity of the bad triangle]
  • Parallelism:
    • Bad triangles whose cavities do not overlap can be processed in parallel
    • Parallelism must be found at runtime
  • Data-centric view of algorithm
    • Active elements: bad triangles
    • Local view: operator applied to a bad triangle: {Find cavity of bad triangle (blue); Remove triangles in cavity; Retriangulate cavity and update mesh;} (sketched in code after this slide)
    • Global view: schedule
  • Algorithm = Operator + Schedule
  • Parallel data structures
    • Graph
    • Worklist of bad triangles
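
  A rough C-style sketch of the operator formulation for mesh refinement; Mesh, Triangle, Cavity, Worklist, and all helper functions are hypothetical placeholders used only to make the local view concrete, not an actual API from the course.

    /* Hypothetical sketch of Delaunay mesh refinement in operator form.
       All types and helpers (Mesh, Worklist, find_cavity, ...) are
       placeholders. */
    void refine(Mesh *mesh, Worklist *bad_triangles) {
        while (!worklist_empty(bad_triangles)) {
            Triangle *t = worklist_pop(bad_triangles);      /* active element */
            Cavity c = find_cavity(mesh, t);                /* neighborhood   */
            remove_cavity_triangles(mesh, &c);
            retriangulate(mesh, &c);
            /* Retriangulation may create new badly shaped triangles. */
            push_new_bad_triangles(bad_triangles, mesh, &c);
        }
    }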

  17. Example: Graph analytics
  [Figure: example weighted graph with nodes A-H; distance labels initialized to ∞, with 0 at the source]
  • Single-source shortest-path problem
  • Many algorithms
    • Dijkstra (1959)
    • Bellman-Ford (1957)
    • Chaotic relaxation (1969)
    • Delta-stepping (1998)
  • Common structure:
    • Each node has a distance label d
    • Operator: relax-edge(u,v): if d[v] > d[u] + length(u,v) then d[v] ← d[u] + length(u,v) (see the sketch after this slide)
    • Active node: unprocessed node whose distance field has been lowered
    • Different algorithms use different schedules
    • Schedules differ in parallelism, locality, work efficiency
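
  A sketch of the relax-edge operator driven by a worklist of active nodes (a chaotic-relaxation style schedule); the Graph and Worklist types and helper functions are hypothetical placeholders, only the operator itself follows the slide. Different schedules (FIFO, priority by distance, delta-stepping buckets) correspond to different worklist implementations.

    /* Sketch: SSSP as operator (relax-edge) + schedule (worklist order). */
    #include <limits.h>

    void sssp(Graph *g, int src) {
        for (int v = 0; v < g->num_nodes; v++)
            g->dist[v] = INT_MAX;
        g->dist[src] = 0;
        Worklist *wl = worklist_create();
        worklist_push(wl, src);
        while (!worklist_empty(wl)) {
            int u = worklist_pop(wl);                       /* active node */
            for (Edge *e = first_edge(g, u); e != NULL; e = next_edge(e)) {
                /* Operator: relax-edge(u, v). */
                if (g->dist[e->dst] > g->dist[u] + e->wt) {
                    g->dist[e->dst] = g->dist[u] + e->wt;
                    worklist_push(wl, e->dst);              /* now active */
                }
            }
        }
    }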

  18. Example: Stencil computation
  [Figure: grids A_t and A_{t+1}; Jacobi iteration, 5-point stencil]
  • Finite-difference computation
  • Algorithm
    • Active nodes: nodes in A_{t+1}
    • Operator: five-point stencil (see the sketch after this slide)
    • Different schedules have different locality
  • Regular application
    • Grid structure and active nodes known statically
    • Application can be parallelized at compile-time
  • "Data-centric multilevel blocking," Kodukula et al., PLDI 1997
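
  A minimal sketch of one Jacobi sweep with a 5-point stencil; the grid size and the particular averaging coefficients are illustrative assumptions, not taken from the slide.

    /* One Jacobi sweep, 5-point stencil, interior points of an N x N grid.
       A_t is read, A_{t+1} is written. */
    #define N 1024

    void jacobi_step(const double At[N][N], double At1[N][N]) {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                At1[i][j] = 0.2 * (At[i][j]
                                   + At[i - 1][j] + At[i + 1][j]
                                   + At[i][j - 1] + At[i][j + 1]);
    }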

  19. Operator formulation of algorithms
  [Figure legend: active node; neighborhood]
  • Active element
    • Node/edge where computation is needed
  • Local view: operator
    • Update at active element
    • Activity: application of operator to active element
    • Neighborhood: set of nodes/edges read/written by activity
  • Global view: schedule
    • Unordered algorithms: no semantic constraints, but performance may depend on schedule
    • Ordered algorithms: problem-dependent order
  • Amorphous data-parallelism
    • Multiple active nodes can be processed in parallel, subject to neighborhood and ordering constraints
  • Parallel program = Operator + Schedule + Parallel data structure

  20. SSSP in Elixir
  • Graph type:
      Graph [ nodes(node : Node, dist : int)
              edges(src : Node, dst : Node, wt : int) ]
  • Operator:
      relax = [ nodes(node a, dist ad)
                nodes(node b, dist bd)
                edges(src a, dst b, wt w)
                bd > ad + w ] ➔ [ bd = ad + w ]
  • Statement:
      sssp = iterate relax ≫ schedule
  • "Synthesizing parallel graph programs via automated planning," Dimitris Prountzos, Roman Manevich, Keshav Pingali, PLDI 2015

  21. Domain-specific parallel programming abstractions
  • Spiral: Jose Moura et al. (CMU)
    • Specification: linear transforms like DFT
    • Implementation: optimized divide-and-conquer implementations for heterogeneous hardware
  • Stencil compilers: Steele (Thinking Machines), Leiserson (MIT), ...
    • Specification: finite-difference stencil
    • Implementation: space- and time-tiled program for Jacobi iteration
  • Tensor-contraction engine: Sadayappan (OSU), Ramanujam (LSU)
    • Specification: tensor computations
    • Implementation: algebraic simplifications, tiled programs
  • SQL: Codd (IBM)
    • Specification: relational algebra expressions
    • Implementation: algebraic simplifications, storage optimizations

  22. From productivity to performance

  23. Example: matrix multiplication
    for I = 1, N    // assume arrays stored in row-major order
      for J = 1, N
        for K = 1, N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • All six loop permutations are computationally equivalent (even with respect to round-off error).
  • However, execution times of the six versions can be very different if the machine has a cache (see the C sketch after this slide).
  • All six versions perform poorly compared to blocked algorithms.
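
  For concreteness, here is the same computation in C (0-based, row-major) in two of the six loop orders; this is an illustrative sketch, the slide itself uses Fortran-style pseudocode. Only the memory access pattern of the innermost loop differs, which is exactly what the cache sees.

    /* ijk order: the innermost loop walks B column-wise (stride n). */
    void mmm_ijk(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    /* ikj order: the innermost loop walks B and C row-wise (stride 1),
       which is much friendlier to the cache for row-major storage. */
    void mmm_ikj(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }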

  24. Performance of MMM code produced by Intel's Itanium compiler (-O3)
  [Figure: performance graph; the compiled code reaches 92% of peak performance]
  • Goto BLAS obtains close to 99% of peak, so the compiler is pretty good.
  • What are the key steps in getting performance?

  25. Loop tiling/blocking
  [Figure: matrices A, B, C partitioned into t x t blocks indexed by It, Jt, Kt]
    for It = 1, N, t
      for Jt = 1, N, t
        for Kt = 1, N, t
          for I = It, It+t-1
            for J = Jt, Jt+t-1
              for K = Kt, Kt+t-1
                C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • Break the big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t x t.
  • What if the matrix size is not a multiple of the tile size t? (handled in the C sketch after this slide)
  • Parameter t (tile size) must be chosen carefully
    • As large as possible
    • Working set of the small matrix multiplication must fit in cache
  • Must block for multiple levels of cache and registers
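
  A C sketch of the tiled loop nest, with min() guards as one way to handle the case where n is not a multiple of t; blocking for registers and multiple cache levels would nest this further. The function and helper names are illustrative.

    /* Tiled MMM in C (0-based, row-major). */
    static inline int min_int(int a, int b) { return a < b ? a : b; }

    void mmm_tiled(int n, int t, const double *A, const double *B, double *C) {
        for (int it = 0; it < n; it += t)
            for (int jt = 0; jt < n; jt += t)
                for (int kt = 0; kt < n; kt += t)
                    /* Multiply one pair of t x t sub-matrices; their working
                       set should fit in the targeted level of cache. */
                    for (int i = it; i < min_int(it + t, n); i++)
                        for (int j = jt; j < min_int(jt + t, n); j++)
                            for (int k = kt; k < min_int(kt + t, n); k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }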

  26. Choosing block/tile sizes
  • Two problems
    • Dependences may prevent some kinds of tiling
    • Optimal tile size depends on cache capacity and line size (a rough capacity bound is sketched after this slide)
  • Abstraction: constrained optimization problem
    • Constraint: tiling must not violate program dependences
    • Optimization: find the best-performing one from among these
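
  A back-of-the-envelope bound on t from cache capacity alone; it ignores line size, associativity, and dependences, which the slide notes also matter, and the "three blocks of doubles" working-set model is an illustrative assumption.

    /* Crude capacity bound: the t x t blocks of A, B, and C should all fit,
       so 3 * t * t * sizeof(double) <= cache_bytes. */
    #include <math.h>
    #include <stddef.h>

    int max_tile_size(size_t cache_bytes) {
        return (int)sqrt((double)cache_bytes / (3.0 * sizeof(double)));
    }

  For example, a 32 KB L1 cache gives t ≈ 36 under this model; empirical search would then refine it.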

  27. What we will study
  • Approaches to solving these kinds of constrained optimization problems
    • Auto-tuning: generate program variants and run them on the machine to find the best one (see the sketch after this slide)
      • Useful for library generators
    • Modeling: build models of the machine using analytical or machine-learning techniques and use the model to find the best program variant
  • Both approaches require heuristic search over the space of program variants
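
  A minimal auto-tuning sketch along these lines: time a handful of candidate tile sizes on the actual machine and keep the fastest. It reuses the mmm_tiled sketch from the earlier slide; the candidate list and the clock()-based timing are illustrative assumptions, and real auto-tuners search much larger, structured variant spaces.

    /* Empirical search over tile sizes; assumes mmm_tiled from above. */
    #include <string.h>
    #include <time.h>

    int autotune_tile_size(int n, const double *A, const double *B, double *C) {
        int candidates[] = {16, 32, 64, 128, 256};
        int num = sizeof(candidates) / sizeof(candidates[0]);
        int best_t = candidates[0];
        double best_time = 1e300;
        for (int i = 0; i < num; i++) {
            memset(C, 0, (size_t)n * (size_t)n * sizeof(double));
            clock_t start = clock();
            mmm_tiled(n, candidates[i], A, B, C);
            double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (elapsed < best_time) {
                best_time = elapsed;
                best_t = candidates[i];
            }
        }
        return best_t;
    }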

  28. Deductive synthesis
  • Systems we have discussed so far perform deductive synthesis
    • Input is a complete specification of the computation
    • Knowledge base of domain properties and machine architecture in the system
    • Use the knowledge base to lower the level of the program and optimize it
  • Difference from classical compilers
    • Classical compilers use some fixed sequence of transformations to generate code
    • Simple analytical models for performance
    • No notion of searching over program variants

  29. Inductive synthesis
  • Starting point is an incomplete specification of what is to be computed
  • We will study several approaches
    • Programming by examples: Gulwani (MSR) and others
    • Sketching: Solar-Lezama (MIT), Bodik (UW)
    • English language specifications: Gulwani (MSR), Dillig (UT), ...

  30. Programming by examples
  • Given a set of input-output values, guess the function
    • Think of regression in machine learning (toy illustration after this slide)
  • Checking correctness of the function
    • Ask the user
    • SAT/SMT solver for some problem domains
  • Success story: FlashFill (Gulwani)
    • Automatically generates Excel spreadsheet macros from a small number of input-output tuples
    • User decides whether to accept the synthesized macro or not
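
  A toy illustration of guessing a function from input-output examples: brute-force search over a tiny hypothesis space of affine functions f(x) = a*x + b. The hypothesis space and examples are invented for illustration; FlashFill searches a far richer space of string-transformation programs.

    /* Toy programming-by-examples: find a, b such that a*x + b matches
       every given input-output pair. */
    #include <stdio.h>

    int main(void) {
        int xs[] = {1, 2, 5};
        int ys[] = {5, 7, 13};          /* hidden function: f(x) = 2x + 3 */
        for (int a = -10; a <= 10; a++) {
            for (int b = -10; b <= 10; b++) {
                int consistent = 1;
                for (int i = 0; i < 3; i++)
                    if (a * xs[i] + b != ys[i]) { consistent = 0; break; }
                if (consistent) {
                    printf("guess: f(x) = %d*x + %d\n", a, b);
                    return 0;
                }
            }
        }
        printf("no consistent function in the hypothesis space\n");
        return 0;
    }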

  31. Sketching
  • Given:
    • Specification
    • Program with "holes"
      • Holes can be filled in with values from some finite set of choices
      • Determines the space of candidate programs
  • Counterexample-guided inductive synthesis (CEGIS)
    • Guess values for holes
    • Check against specification
    • If check fails, refine program and repeat

  32. Sketching example
  • Write a function to find the rightmost 0-bit in a word with W bits
  • Specification: 1010 0111 -> 0000 1000
  • Sketch: [shown as a figure in the slide]
  • Solution: [shown as a figure in the slide]
  • (See the reconstruction after this slide)
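
  Since the sketch and solution appear only as figures in the slide, here is a hedged reconstruction in C: the sketch fixes the shape ~x & (x + ??) with an integer hole ??, and a brute-force check over all 8-bit inputs stands in for the solver. Actual CEGIS would use a SAT/SMT solver and counterexamples rather than exhaustive checking.

    /* Sketch-style reconstruction of the rightmost-0-bit example. */
    #include <stdint.h>
    #include <stdio.h>

    /* Specification: isolate the rightmost 0-bit, e.g. 1010 0111 -> 0000 1000. */
    static uint8_t spec(uint8_t x) {
        for (int i = 0; i < 8; i++)
            if (!(x & (1u << i)))
                return (uint8_t)(1u << i);
        return 0;                       /* word has no 0-bit */
    }

    int main(void) {
        /* Sketch: candidate = ~x & (x + ??); the search fills the hole ??. */
        for (int hole = 0; hole < 8; hole++) {
            int ok = 1;
            for (int x = 0; x < 256; x++)
                if ((uint8_t)(~x & (x + hole)) != spec((uint8_t)x)) { ok = 0; break; }
            if (ok) {
                printf("solution: ~x & (x + %d)\n", hole);   /* finds hole = 1 */
                return 0;
            }
        }
        return 1;
    }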

  33. English language specifications
  • Informal specifications
    • Programming assignments in introductory CS courses
    • Pseudocode for algorithms in books or papers
  • If you have the right prerequisites, you usually have no problem producing an implementation from such an informal specification
  • Challenge
    • Can you write a program that is at least as smart as a CS freshman?
    • "Turing test": read the CS 1 assignments at UT, write all the programs, do the exams, and get a grade of B or better
    • Small number of papers in this area

  34. To-do items for you
  • Class website has papers and discussion order
    • Subject to change
    • I will assign presentation assignments in the next few days
    • Use the papers as a starting point for your study
    • If you find other papers that you would like us to read before your presentation, let Michael know and he will add links
  • Take a look at the papers to get an idea of what we will talk about in the course
  • If you are not yet registered for the course, send mail to Michael so we know who you are
