PL/B: Programming for locality and large scale parallelism
George Almási, Luiz A. DeRose, José E. Moreira, David A. Padua
Overview • Concepts • Examples • Thoughts about implementation • Conclusions
PL/B at a glance
• A programming system for distributed-memory machines
  • Focus: numerical computing
• Convenient to use
  • Flat learning curve
  • Short development cycle
  • Easy debugging, maintenance
• Not too difficult to implement
  • No "heroic programming" for the compiler
• Language extension
  • General nature
  • 1st implementation using MATLAB™
• Programming model
  • Single thread of execution
  • Explicit data layout, distribution
  • Recursive tiling
  • Data distribution primitives
• Implementation
  • Master-slave model
Things we didn't want to do
• Not another programming language!
• Avoid SPMD
  • Difficult to reason about: the global view of communication and computation is not explicit in the SPMD model ("4D spaghetti code")
  • MPI is cumbersome: no compiler support, the "assembly language of parallel computing"
• Avoid complex compilers
  • HPF
• Avoid OpenMP
  • Wrong abstraction for distributed-memory machines
  • Could be implemented on top of TreadMarks™-like systems, but
    • Hard to get efficiency
    • Requires compiler support
    • Untested, experimental
The Convergence of PL/B and MATLAB™
• Technical simplicity: no compiler work needed for a prototype
• Popularity: programmers of parallel machines are familiar with the MATLAB environment
• Government interest: parallel MATLAB is part of the PERCS project
• Evaluation strategy: IBM's BlueGene/L is an ideal testbed for scalability
• Novelty: MATLAB™ is an excellent language for prototyping conventional algorithms; there is nothing equivalent for parallel algorithms
Hierarchically Tiled Arrays
• Constructing HTAs
  • Bottom-up: imposing HTA shape onto a flat array; always homogeneous
  • Top-down: structure first, contents later; maybe non-homogeneous
• MATLAB™ notation: similar to cell arrays { }
  • n-dimensional tiled arrays, d-dimensional tiles, d ≤ n
  • Tiling is recursive
• Homogeneity of HTAs
  • Adjacent tiles are "compatible" along dimensions of adjacency
  • Not all tiles have to have the same shape
• Tiles can be distributed across modules of a parallel system
  • Distribution is always block cyclic
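To get a feel for the two construction styles, one can emulate them with ordinary MATLAB cell arrays. This is only an illustration of the concept, not the HTA toolbox syntax; mat2cell and the chosen tile sizes are my own choices:

  % Bottom-up: impose a 3x3 tiling onto a flat 6x6 matrix (always homogeneous)
  M  = magic(6);
  Hb = mat2cell(M, [2 2 2], [2 2 2]);    % Hb{i,j} is a 2x2 tile of M

  % Top-down: fix the tile structure first, fill in the contents later
  % (tiles need not all have the same shape, only compatible adjacent sizes)
  Ht = cell(2,2);
  Ht{1,1} = zeros(3,3);   Ht{1,2} = zeros(3,5);
  Ht{2,1} = zeros(4,3);   Ht{2,2} = zeros(4,5);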
Creating and Accessing HTAs
  A = hta {1:2}{1:4,1:3}(1:3)    % constructor: two-level tiling
  x1 = A{2}{4,3}(3)
  x2 = A{:}{2:4,3}(1:2)
  x3 = A{1}(1:4,1:6)             % "flattened" access
  x4 = A(2,9:11)                 % "flattened" access
Distributing HTAs across processors
• Example: 3x3 mesh of processors, 15x12 array
• Blocked: HTA shape {1:3,1:3}(1:5,1:4)
• Block-cyclic in 2nd dimension: HTA shape {1:3,1:6}(1:5,1:2)
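As a small illustration of the block-cyclic rule behind these shapes (my own sketch in plain MATLAB, not PL/B syntax), the owner of each tile can be computed by wrapping the tile coordinates cyclically around the processor mesh:

  pr = 3; pc = 3;     % 3x3 processor mesh
  tr = 3; tc = 6;     % tile grid of the block-cyclic shape {1:3,1:6}(1:5,1:2)
  owner = zeros(tr, tc);
  for ti = 1:tr
    for tj = 1:tc
      % tiles are dealt out round-robin in each dimension
      owner(ti,tj) = sub2ind([pr pc], mod(ti-1,pr)+1, mod(tj-1,pc)+1);
    end
  end
  disp(owner)         % linear index of the processor owning each tile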
Summary: Parallel Communication and Computation in PL/B
• PL/B programs are single-threaded and contain array operations on HTAs
• The host running PL/B is a front end for a distributed machine
  • Processors are arranged in hierarchical meshes
  • Top levels of HTAs are distributed onto a subset of the existing nodes
• Computation statements: all HTA indices refer to the same (local) physical processor
  • In particular, when all HTA indices are identical, computations are guaranteed to be local
• Communication: all other statements
  • Some functions and operators encode both communication and computation
  • Typically, MPI-like collective operations
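The distinction can be made concrete with a few statements in the deck's own HTA notation (pseudo-code, not plain MATLAB; the labels follow the rules above):

  c{i,j} = a{i,j} + b{i,j};                  % identical indices: guaranteed local computation
  a{1}   = b{2};                             % different tiles: point-to-point communication
  a{:,:} = cshift(a{:,:}, dim=2, shift=1);   % collective communication, no computation
  x      = sum(a{:});                        % collective reduction: communication and computation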
Overview • Concepts • Examples • Thoughts about 1st implementation • Conclusions
Tiled Matrix Multiplication
  for I=1:q:n
    for J=1:q:n
      for K=1:q:n
        for i=I:I+q-1
          for j=J:J+q-1
            for k=K:K+q-1
              c(i,j) = c(i,j) + a(i,k)*b(k,j);
            end
          end
        end
      end
    end
  end
Tiled Matrix Multiplication (PL/B)
  for i=1:m
    for j=1:m
      T = 0;
      for k=1:m
        T = T + a{i,k}*b{k,j};
      end
      c{i,j} = T;
    end
  end
c{i,j}, a{i,k}, b{k,j}, and T represent HTA tiles (submatrices). The * operator represents matrix multiplication on HTA tiles.
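The same loop nest can be checked without the HTA toolbox by running it over ordinary cell arrays. This is a serial emulation only (mat2cell, the tile size p, and the even tiling are my assumptions):

  p = 4; m = 3;                                    % tile size, tiles per dimension
  A = rand(m*p); B = rand(m*p);
  a = mat2cell(A, repmat(p,1,m), repmat(p,1,m));   % a{i,k} are p-by-p tiles
  b = mat2cell(B, repmat(p,1,m), repmat(p,1,m));
  c = cell(m,m);
  for i = 1:m
    for j = 1:m
      T = zeros(p);
      for k = 1:m
        T = T + a{i,k} * b{k,j};                   % tile-level multiply-accumulate
      end
      c{i,j} = T;
    end
  end
  norm(cell2mat(c) - A*B)                          % should be close to zero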
Cannon's Algorithm written down in PL/B
  function [c] = cannon(a,b)
    % a, b are assumed to be distributed on an n*n grid.
    % create an n*n distributed hta for matrix c.
    c{1:n,1:n} = zeros(p,p);                        % communication
    % "parallelogram shift" rows of a, columns of b
    for i=2:n
      a{i:n,:} = cshift(a{i:n,:},dim=2,shift=1);    % communication
      b{:,i:n} = cshift(b{:,i:n},dim=1,shift=1);    % communication
    end
    % main loop: parallel multiplications, column shift a, row shift b
    for k=1:n
      c{:,:} = c{:,:} + a{:,:}*b{:,:};              % computation
      a{:,:} = cshift(a{:,:},dim=2,shift=1);        % communication
      b{:,:} = cshift(b{:,:},dim=1,shift=1);        % communication
    end
  end
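For intuition about what the shifts do, here is a serial emulation of the same algorithm using plain cell arrays and MATLAB's circshift in place of the cshift communication primitive (my own sketch; it assumes square matrices that tile evenly and exercises only the arithmetic, not the distribution):

  n = 3; p = 4;                              % n*n tile grid, p-by-p tiles
  A = rand(n*p); B = rand(n*p);
  a = mat2cell(A, repmat(p,1,n), repmat(p,1,n));
  b = mat2cell(B, repmat(p,1,n), repmat(p,1,n));
  c = repmat({zeros(p)}, n, n);
  % initial skew: shift row i of a left by i-1, column i of b up by i-1
  for i = 2:n
    a(i,:) = circshift(a(i,:), [0 -(i-1)]);
    b(:,i) = circshift(b(:,i), [-(i-1) 0]);
  end
  % main loop: tile-wise multiply-accumulate, then shift a left and b up
  for k = 1:n
    for i = 1:n
      for j = 1:n
        c{i,j} = c{i,j} + a{i,j} * b{i,j};
      end
    end
    a = circshift(a, [0 -1]);
    b = circshift(b, [-1 0]);
  end
  norm(cell2mat(c) - A*B)                    % should be close to zero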
Sparse Parallel Matrix-Vector Multiply with vector copy
[Figure: A × b on four processors P1-P4; A is distributed by rows, b is copied to every processor]
Sparse Parallel MVM with vector copy
  % Distribute a
  forall i=1:n, c{i} = a(DIST(i):DIST(i+1)-1,:); end
  % Broadcast vector b
  v{1:n} = b;
  % Local multiply (sparse)
  t{:} = c{:} * v{:};
  % Everybody gets a copy of the result
  forall i=1:n
    v{i} = t(:);   % flattened t
  end
Important observation: in MATLAB, sparse computations can be represented as dense computations; the interpreter only performs the necessary operations.
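A serial emulation of the same pattern in stock MATLAB, with cell arrays standing in for the distributed tiles (DIST is the row-partition vector from the slide; the even split and the problem sizes are my assumptions):

  n = 4;                                          % number of "processors" / tiles
  A = sprand(1000, 1000, 0.01);                   % sparse matrix
  b = rand(1000, 1);
  DIST = round(linspace(1, size(A,1)+1, n+1));    % row partition boundaries
  c = cell(n,1);  t = cell(n,1);
  for i = 1:n
    c{i} = A(DIST(i):DIST(i+1)-1, :);             % "distribute" a block of rows
  end
  for i = 1:n
    t{i} = c{i} * b;                              % local sparse multiply
  end
  y = vertcat(t{:});                              % everybody's copy of the result
  norm(y - A*b)                                   % should be close to zero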
Overview • Concepts • Examples • Thoughts about implementation • Conclusions
Implementation • A "Distributed Array Virtual Machine" (DAVM) implemented on the backend nodes • Multiple types of memory (local, shared, co-arrays, etc.) • Similar to UPC, OpenMP runtimes • DAVM instruction set (bytecode?) • A MATLAB™-based frontend • The MATLAB interpreter runs the show • HTA code can be compiled into DAVM code and distributed to the backend • A MATLAB "toolbox" contains the new data types • Possible changes to MATLAB syntax: as few as we can get away with • forall
Implementation: MATLAB™-based frontend
MATLAB @hta directory:
• constructors: hta, tile
• operators: *, /, \
• indexing: subsref, subsasgn
• collectives: sum, spread, cshift
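As a sketch of what one method file in the @hta directory might look like (purely hypothetical: the field name tiles and the tile-wise reading of * over whole HTAs are my assumptions, guided by the Cannon example; a real implementation would also issue the needed communication):

  % @hta/mtimes.m -- hypothetical sketch of the overloaded * operator
  function c = mtimes(a, b)
    c = a;                                     % reuse shape/distribution metadata
    for i = 1:numel(a.tiles)
      c.tiles{i} = a.tiles{i} * b.tiles{i};    % local tile-level matrix multiply
    end
  end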
Anticipating questions
• Q: Is PL/B a toy language?
  • A: It is as expressive as SPMD
  • Subsumes a large part of MPI:
    • a{1} = b{2} is a message sent from rank 2 to rank 1
    • x = sum(a{:}) is an MPI_Reduce
    • x{:} = a is an MPI_Bcast
  • Many important algorithms can be formulated "better"
• Q: Is PL/B still MATLAB™ or a new beast?
  • PL/B defines a new data type and operators
  • MATLAB is a polymorphic language:
    • The new data type is compatible (a drop-in replacement) with existing data types
    • New data types bring new functionality
    • Think "toolbox": MATLAB users are familiar with the concept
  • Porting code to PL/B:
    • Changes are going to be fairly localized
    • The code will keep working during the transition
More questions
• Q: Debugging and profiling PL/B?
  • A: Debugging PL/B should not be different from debugging a regular MATLAB program.
• Q: Performance?
  • A: PL/B has a better chance of scaling than a regular MPI program
    • Most communication primitives are high-level and are going to be optimized
    • Writing low-level communication code in PL/B is possible, but not a natural thing to do
    • Implementation is easy for most primitives (MPI)
Conclusion • New and exciting paradigm: • HTA arrays and operators express communication and computation. • Single-threaded code • Master-slave execution model • Anticipate scalability • Minimal to no compiler work needed • About to embark on 1st implementation • Runtime (Distributed Array Virtual Machine) • Interpreted front-end using (unchanged) Matlab™