Component Frameworks: Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign http://charm.cs.uiuc.edu PPL-Dept of Computer Science, UIUC
Motivation • Parallel Computing in Science and Engineering • Competitive advantage • Pain in the neck • Necessary evil • It is not so difficult • But tedious, and error-prone • New issues: race conditions, load imbalances, modularity in the presence of concurrency, ... • Just have to bite the bullet, right?
But wait… • Parallel computation structures • The set of parallel applications is diverse and complex • Yet the underlying parallel data structures and communication structures are small in number • Structured and unstructured grids, trees (AMR, ...), particles, interactions between these, space-time • One should be able to reuse those • Avoid doing the same parallel programming again and again • Domain-specific frameworks
A Unique Twist • Many apps require dynamic load balancing • Reuse load re-balancing strategies • It should be possible to separate load balancing code from application code • This strategy is embodied in Charm++ • Express the program as a collection of interacting entities (objects). • Let the system control mapping to processors
Multi-partition decomposition • Idea: divide the computation into a large number of pieces • Independent of the number of processors • Typically larger than the number of processors • Let the system map entities to processors
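The idea on this slide can be illustrated with a minimal sketch. It is written in Python for brevity and is not Charm++ code: `decompose` and `map_round_robin` are hypothetical helper names, and the round-robin mapping is only a stand-in for whatever mapping the runtime system would actually choose. The point it shows is that the number of pieces is fixed by the problem, while the processor count can change without touching the decomposition.

```python
def decompose(num_cells, num_pieces):
    """Split range(num_cells) into num_pieces contiguous, near-equal pieces."""
    base, extra = divmod(num_cells, num_pieces)
    pieces = []
    start = 0
    for p in range(num_pieces):
        size = base + (1 if p < extra else 0)  # first `extra` pieces get one more cell
        pieces.append(range(start, start + size))
        start += size
    return pieces


def map_round_robin(num_pieces, num_procs):
    """Stand-in for the runtime's mapping: piece p goes to processor p % num_procs."""
    return {p: p % num_procs for p in range(num_pieces)}


if __name__ == "__main__":
    pieces = decompose(num_cells=1000, num_pieces=64)  # chosen without knowing num_procs
    for procs in (4, 16):
        mapping = map_round_robin(len(pieces), procs)
        per_proc = [sum(1 for q in mapping.values() if q == r) for r in range(procs)]
        print(procs, "processors, pieces per processor:", per_proc)
```

The same 64 pieces are reused for 4 and for 16 processors; only the mapping changes.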
Object-based Parallelization • The user is only concerned with the interaction between objects • [Figure: user view vs. system implementation]
Charm Component Frameworks • Object-based decomposition • Reuse of specialized parallel structures • Load balancing • Automatic checkpointing • Flexible use of clusters • Out-of-core execution • [Diagram: these features feed into the component frameworks]
Goals for Our Frameworks • Ease of use: • C++ and Fortran versions • Retain “look-and-feel” of sequential programs • Provide commonly needed features • Application-driven development • Portability • Performance: • Low overhead • Dynamic load balancing via Charm++ • Cache performance
Current Set of Component Frameworks • FEM / unstructured meshes: • “Mature”, with several applications already • Multiblock: multiple structured grids • New, but very promising • AMR: • Oct and Quad-trees
Using the Load Balancing Framework • [Diagram labels: Automatic conversion from MPI; Cross-module interpolation; Structured; FEM; MPI-on-Charm; Irecv+; Framework path; Load database + balancer; Migration path; Charm++; Converse]
Finite Element Framework Goals • Hide parallel implementation in the runtime system • Allow adaptive parallel computation and dynamic automatic load balancing • Leave physics and numerics to user • Present clean, “almost serial” interface:
Serial code for entire mesh:
    begin time loop
        compute forces
        update node positions
    end time loop
Framework code for mesh partition:
    begin time loop
        compute forces
        communicate shared nodes
        update node positions
    end time loop
FEM Framework: Responsibilities • FEM Application: initialize, registration of nodal attributes, loops over elements, finalize • FEM Framework: update of nodal properties, reductions over nodes or partitions • Partitioner • Combiner • METIS • I/O • Charm++: dynamic load balancing, communication
Structure of an FEM Program • Serial init() and finalize() subroutines • Do serial I/O, read the serial mesh and call FEM_Set_Mesh • Parallel driver() main routine: • One driver per partitioned mesh chunk • Runs in a thread: time-loop looks like serial version • Does computation and calls FEM_Update_Field • Framework handles partitioning, parallelization, and communication
Structure of an FEM Application • [Diagram: init(), followed by three driver instances that each call Update, with Shared Nodes between adjacent drivers, followed by finalize()]
Framework Calls • FEM_Set_Mesh • Called from initialization to set the serial mesh • Framework partitions mesh into chunks • FEM_Create_Field • Registers a node data field with the framework, supports user data types • FEM_Update_Field • Updates node data field across all processors • Handles all parallel communication • Other parallel calls (Reductions, etc.)
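To make the "communicate shared nodes" step concrete, here is a small Python sketch of what an update of a node field across chunks could look like. It is not the FEM framework's API: `update_field` and `element_forces` are hypothetical names, the chunks live in one process, and the sketch assumes that the copies of a shared node are combined by summation, which the slides do not state.

```python
def element_forces(num_nodes):
    """Toy 'force' on a 1D bar: every element adds 0.5 to both of its end nodes."""
    f = [0.0] * num_nodes
    for e in range(num_nodes - 1):
        f[e] += 0.5
        f[e + 1] += 0.5
    return f


def update_field(chunks, shared):
    """Combine a node field across chunks (assumption: shared copies are summed).

    chunks: chunks[c][i] is the field value at local node i of chunk c.
    shared: list of groups; each group lists the (chunk, local_index) pairs
            that refer to the same physical node.
    """
    for group in shared:
        total = sum(chunks[c][i] for c, i in group)
        for c, i in group:
            chunks[c][i] = total  # every copy ends up with the combined value


if __name__ == "__main__":
    # Serial reference: 5 nodes, 4 elements.
    print("serial :", element_forces(5))
    # Two chunks of 3 nodes each; the last node of chunk 0 is the first of chunk 1.
    chunks = [element_forces(3), element_forces(3)]
    update_field(chunks, shared=[[(0, 2), (1, 0)]])
    print("chunked:", chunks)
```

After the update, each chunk's copy of the shared node holds the same value the serial code computes for that node, which is why the per-chunk time loop can look almost serial.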
Dendritic Growth • Studies evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid • Adaptive refinement and coarsening of grid involves re-partitioning
Crack Propagation • Explicit FEM code • Zero-volume cohesive elements inserted near the crack • As the crack propagates, more cohesive elements are added near the crack, which leads to severe load imbalance • [Figure: decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle]
“Overhead” of Multipartitioning
Load balancer in action • Automatic Load Balancing in Crack Propagation • 1. Elements added • 2. Load balancer invoked • 3. Chunks migrated
Scalability of FEM Framework • 3.1M elements, 1.5M nodes • ASCI Red • 1 processor time: 8.24 s • 1024 processors time: 7.13 ms
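Taking the two timings on this slide at face value, the implied speedup and parallel efficiency can be worked out as follows. The symbols S and E are introduced here for illustration and do not appear on the slide.

```latex
S = \frac{T_{1}}{T_{1024}} = \frac{8.24\ \text{s}}{7.13 \times 10^{-3}\ \text{s}} \approx 1156,
\qquad
E = \frac{S}{1024} \approx 1.13
```

An efficiency above 1 would mean superlinear speedup. The slide gives no further detail, so it is unclear whether the two timings were measured under directly comparable conditions.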
Parallel Collision Detection • Detect collisions (intersections) between objects scattered across processors • Approach, based on Charm++ Arrays • Overlay regular, sparse 3D grid of voxels (boxes) • Send objects to all voxels they touch • Collide voxels independently and collect results • Leave collision response to user code
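The voxel approach listed above can be sketched serially in a few lines of Python. This is an illustration only, not the library's code: the function names are hypothetical, objects are reduced to axis-aligned bounding boxes (the slides speak of polygons), and the per-voxel work that would be spread over a Charm++ array is done here in a plain loop.

```python
from collections import defaultdict
from itertools import combinations
from math import floor


def voxels_touched(box, size):
    """Yield integer voxel coordinates overlapped by an axis-aligned box.

    box = ((xmin, ymin, zmin), (xmax, ymax, zmax)); size is the voxel edge length.
    """
    lo, hi = box
    ranges = [range(floor(lo[d] / size), floor(hi[d] / size) + 1) for d in range(3)]
    for i in ranges[0]:
        for j in ranges[1]:
            for k in ranges[2]:
                yield (i, j, k)


def boxes_intersect(a, b):
    """True if two axis-aligned boxes overlap in all three dimensions."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d] for d in range(3))


def detect_collisions(boxes, size):
    """Return sorted pairs (i, j), i < j, of intersecting boxes."""
    grid = defaultdict(list)  # sparse grid: only touched voxels are stored
    for obj_id, box in enumerate(boxes):
        for v in voxels_touched(box, size):
            grid[v].append(obj_id)  # "send objects to all voxels they touch"
    hits = set()  # a set removes pairs reported by more than one voxel
    for members in grid.values():  # each voxel is independent work
        for a, b in combinations(members, 2):
            if boxes_intersect(boxes[a], boxes[b]):
                hits.add((a, b))
    return sorted(hits)


if __name__ == "__main__":
    boxes = [
        ((0, 0, 0), (1, 1, 1)),
        ((0.5, 0.5, 0.5), (1.5, 1.5, 1.5)),
        ((5, 5, 5), (6, 6, 6)),
    ]
    print(detect_collisions(boxes, size=1.0))
```

Only objects that share a voxel are ever compared, which is what keeps the serial cost close to linear when objects are small relative to the voxel size.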
Collision Detection Speed • O(n) serial performance • Single Linux PC: 2 µs per polygon serial performance • Good speedups to 1000s of processors • ASCI Red, 65,000 polygons per processor, scaling problem (to 100 million polygons)
FEM: Future Plans • Better support for implicit computations • Interface to solvers: e.g. ESI (PETSc), ScaLAPACK or POOMA’s linear solvers • Better discontinuous Galerkin method support • Fully distributed startup • Fully distributed insertion • Eliminate serial bottleneck in insertion phase • Abstraction to allow multiple active meshes • Needed for multigrid methods
Multiblock framework • For collection of structured grids • Older versions: • (Gengbin Zheng, 1999-2000) • Recent completely new version: • Motivated by ROCFLO • Like FEM: • User writes driver subroutines that deal with the life-cycle of a single chunk of the grid • Ghost arrays managed by the framework • Based on registration of data by the user program • Support for “connecting up” multiple blocks • makemblock processes geometry info
Multiblock Constituents
Terminology
Multiblock structure • Steps: • Feed geometry information to makemblock • Input: top-level blocks, number of partitions desired • Output: block file containing list of partitions, and communication structure • Run parallel application • Reads the block file • Initialization of data • Manual and info: • http://charm.cs.uiuc.edu/ppl_research/mblock/
Multiblock code example: main loop
do tStep=1,nSteps
  call MBLK_Apply_bc_All(grid, size, err)
  call MBLK_Update_field(fid,ghostWidth,grid,err)
  do k=sk,ek
    do j=sj,ej
      do i=si,ei
        ! Only relax along I and J directions-- not K
        newGrid(i,j,k)=cenWeight*grid(i,j,k) &
          &+neighWeight*(grid(i+1,j,k)+grid(i,j+1,k) &
          &+grid(i-1,j,k)+grid(i,j-1,k))
      end do
    end do
  end do
Multiblock Driver
subroutine driver()
  implicit none
  include 'mblockf.h'
  …
  call MBLK_Get_myblock(blockNo,err)
  call MBLK_Get_blocksize(size,err)
  ...
  call MBLK_Create_field(&
    &size,1, MBLK_DOUBLE,1,&
    &offsetof(grid(1,1,1),grid(si,sj,sk)),&
    &offsetof(grid(1,1,1),grid(2,1,1)),fid,err)
  ! Register boundary condition functions
  call MBLK_Register_bc(0,ghostWidth, BC_imposed, err)
  …
  ! Time loop
end
Multiblock: Future work • Support other stencils • Currently diagonal elements are not used • Applications • We need volunteers! • We will write demo apps ourselves
Adaptive Mesh Refinement • Used in various engineering applications where there are regions of greater interest • e.g. http://www.damtp.cam.ac.uk/user/sdh20/amr/amr.html • Global atmospheric modeling • Numerical cosmology • Hyperbolic partial differential equations (M.J. Berger and J. Oliger) • Problems with uniformly refined meshes for the above: • Grid is too fine-grained, thus wasting resources • Grid is too coarse, thus the results are not accurate
AMR Library • Implements a distributed grid that can be dynamically adapted at runtime • Uses the arbitrary bit indexing of arrays • Requires synchronization only before refinement or coarsening • Interoperability because of Charm++ • Uses the dynamic load balancing capability of the chare arrays
Indexing of array elements • Case of 2D mesh (4x4) • Question: who are my neighbors? • [Tree diagram; legend: node or root, leaf, virtual leaf] • Root: 0,0,0 • 2-bit level: 0,0,2 0,1,2 1,0,2 1,1,2 • 4-bit level: 0,0,4 0,1,4 1,0,4 1,1,4 0,2,4 0,3,4 1,2,4 1,3,4
Indexing of array elements (contd.) • Mathematically (for 2D): if the parent is (x, y) using n bits, then • child1: (2x, 2y) using n+2 bits • child2: (2x, 2y+1) using n+2 bits • child3: (2x+1, 2y) using n+2 bits • child4: (2x+1, 2y+1) using n+2 bits
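The child rule above translates directly into code. The following Python sketch uses hypothetical function names and represents an index as a tuple (x, y, bits); the `parent` function is simply the inverse of the rule and is an addition for illustration, not something shown on the slides.

```python
def children(x, y, nbits):
    """Indices of the four children of cell (x, y) stored with nbits bits (2D case)."""
    return [
        (2 * x, 2 * y, nbits + 2),
        (2 * x, 2 * y + 1, nbits + 2),
        (2 * x + 1, 2 * y, nbits + 2),
        (2 * x + 1, 2 * y + 1, nbits + 2),
    ]


def parent(x, y, nbits):
    """Inverse of the child rule: drop the last bit of each coordinate."""
    if nbits < 2:
        raise ValueError("the root has no parent")
    return (x // 2, y // 2, nbits - 2)


if __name__ == "__main__":
    print(children(0, 0, 0))  # children of the root
    print(children(0, 1, 2))  # children of cell 0,1,2
    print(parent(1, 3, 4))    # back up one level
```

The children of 0,0,0 and of 0,1,2 computed this way match the indices listed on the preceding tree slide.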
Pictorially 0,0,4
Communication with Nbors • In dimension x the two nbors can be obtained by: • - nbor: x-1, where x is not equal to 0 • + nbor: x+1, where x is not at the upper edge of the grid (written as 2^n on the slide; with the n-bit 2D indices used here, the largest coordinate at a level is 2^(n/2) - 1) • In dimension y the two nbors can be obtained by: • - nbor: y-1, where y is not equal to 0 • + nbor: y+1, where y is not at the upper edge of the grid
Case 1: Nbors of 1,1,2, Y dimension: -nbor 1,0,2 • Case 2: Nbors of 1,1,2, X dimension: -nbor 0,1,2 • Case 3: Nbors of 1,3,4, X dimension: +nbor 2,3,4; Nbor of 1,2,4, X dimension: +nbor 2,2,4 • [Tree diagram with nodes 0,0,0; 0,0,2 0,1,2 1,0,2 1,1,2; 0,0,4 0,1,4 1,0,4 1,1,4 0,2,4 0,3,4 1,2,4 1,3,4]
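The neighbor rule and the three cases above can be checked with a short Python sketch. The function name is hypothetical, and the upper bound is an assumption based on the indexing scheme: with nbits bits in total for a 2D index, each coordinate has nbits/2 bits, so the largest coordinate is 2^(nbits/2) - 1.

```python
def neighbors(x, y, nbits):
    """Same-level neighbor indices of cell (x, y, nbits) in 2D.

    Returns a dict with keys '-x', '+x', '-y', '+y'; a value of None means the
    cell sits on that edge of the grid. Assumption: each coordinate uses
    nbits // 2 bits, so valid coordinates are 0 .. 2**(nbits // 2) - 1.
    """
    top = 2 ** (nbits // 2) - 1
    return {
        "-x": (x - 1, y, nbits) if x > 0 else None,
        "+x": (x + 1, y, nbits) if x < top else None,
        "-y": (x, y - 1, nbits) if y > 0 else None,
        "+y": (x, y + 1, nbits) if y < top else None,
    }


if __name__ == "__main__":
    print(neighbors(1, 1, 2))  # Cases 1 and 2
    print(neighbors(1, 3, 4))  # Case 3
```

Note that a computed index such as 2,3,4 does not appear among the cells listed in the tree diagram, where its parent 1,1,2 is present. How the library routes a message in that situation is not covered by this sketch.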
Communication (contd.) • Assumption: the level of refinement of adjacent cells differs by at most one (a requirement of the indexing scheme used) • The indexing scheme is similar for the 1D and 3D cases
AMR Interface • Library tasks: • Creation of tree • Creation of data at cells • Communication between cells • Calling the appropriate user routines in each iteration • Refining: refine on criteria (specified by user) • User tasks: • Writing the user data structure to be kept by each cell • Fragmenting and combining of data for the neighbors • Fragmenting of the data of the cell for refinement • Writing the sequential computation code at each cell
Some Related Work • PARAMESH, Peter MacNeice et al. http://sdcd.gsfc.nasa.gov/RIB/repositories/inhouse_gsfc/Users_manual/amr.htm • This library is implemented in Fortran 90 • Supported on Cray T3E and SGI machines • Parallel Algorithms for Adaptive Mesh Refinement, Mark T. Jones and Paul E. Plassmann, SIAM J. on Scientific Computing, 18 (1997), pp. 686-708. (Also MSC Preprint p 421-0394.) http://www-unix.mcs.anl.gov/sumaa3d/Papers/papers.html • DAGH: Dynamic Adaptive Grid Hierarchies • By Manish Parashar & James C. Browne • In C++ using MPI
Future work • Specialized version for structured grids • Integration with multiblock • Fortran interface • Current version is C++ only • Unlike the FEM and Multiblock frameworks, which support Fortran 90 • Relatively easy to do
Summary • Frameworks are ripe for use • Well tested in some cases • Questions and answers: • MPI libraries? • Performance issues? • Future plans: • Provide all features of Charm++
AMPI: Goals • Runtime adaptivity for MPI programs • Based on multi-domain decomposition and dynamic load balancing features of Charm++ • Minimal changes to the original MPI code • Full MPI 1.1 standard compliance • Additional support for coupled codes • Automatic conversion of existing MPI programs • [Diagram labels: AMPIzer, Original MPI Code, AMPI Code, AMPI Runtime]
Charm++ • Parallel C++ with Data Driven Objects • Object Arrays/ Object Collections • Object Groups: • Global object with a “representative” on each PE • Asynchronous method invocation • Prioritized scheduling • Mature, robust, portable • http://charm.cs.uiuc.edu
Data driven execution • [Diagram: two schedulers, each with its own message queue]
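A minimal single-process sketch can illustrate data-driven execution with asynchronous method invocation and prioritized scheduling. It is written in Python and is not how Charm++ is implemented: the `Scheduler` and `Cell` classes are hypothetical, there is one queue instead of one per processor, and the convention that a lower number means higher priority is an assumption of this sketch.

```python
import heapq
from itertools import count


class Scheduler:
    """Picks the next queued message and invokes the target object's method."""

    def __init__(self):
        self._queue = []
        self._seq = count()  # tie-breaker: FIFO order within one priority

    def send(self, obj, method, *args, priority=0):
        """Asynchronous invocation: enqueue the call and return immediately."""
        heapq.heappush(self._queue, (priority, next(self._seq), obj, method, args))

    def run(self):
        """Process messages until the queue is empty."""
        while self._queue:
            _, _, obj, method, args = heapq.heappop(self._queue)
            getattr(obj, method)(*args)


class Cell:
    """A data-driven object: it only runs when a message arrives for it."""

    def __init__(self, name, sched, trace):
        self.name, self.sched, self.trace = name, sched, trace
        self.peer = None

    def ping(self, hops):
        self.trace.append((self.name, hops))
        if hops > 0:
            self.sched.send(self.peer, "ping", hops - 1)  # does not block


if __name__ == "__main__":
    sched, trace = Scheduler(), []
    a, b = Cell("a", sched, trace), Cell("b", sched, trace)
    a.peer, b.peer = b, a
    sched.send(a, "ping", 2)
    sched.send(b, "ping", 0, priority=-1)  # sent later, but runs first
    sched.run()
    print(trace)
```

No object ever waits on a receive; which object runs next is decided entirely by what is in the queue.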
Load Balancing Framework • Based on object migration and measurement of load information • Partition problem more finely than the number of available processors • Partitions implemented as objects (or threads) and mapped to available processors by LB framework • Runtime system measures actual computation times of every partition, as well as communication patterns • Variety of “plug-in” LB strategies available
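As an illustration of measurement-based load balancing, the Python sketch below remaps objects using their measured computation times. The greedy heuristic (heaviest object first, onto the currently least-loaded processor) is chosen here for simplicity and is not claimed to be any particular plug-in strategy; the function names are hypothetical, and communication patterns, which the framework also measures, are ignored.

```python
import heapq


def greedy_rebalance(loads, num_procs):
    """Map objects to processors from measured loads.

    loads: dict object_id -> measured computation time.
    Returns dict object_id -> processor. Heaviest objects are placed first,
    each on the processor with the smallest total so far.
    """
    heap = [(0.0, p) for p in range(num_procs)]  # (total load, processor)
    heapq.heapify(heap)
    mapping = {}
    for obj, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, p = heapq.heappop(heap)
        mapping[obj] = p
        heapq.heappush(heap, (total + load, p))
    return mapping


def proc_loads(loads, mapping, num_procs):
    """Total measured load per processor under a given mapping."""
    totals = [0.0] * num_procs
    for obj, p in mapping.items():
        totals[p] += loads[obj]
    return totals


if __name__ == "__main__":
    # Eight objects on two processors; objects 0 and 1 have become expensive.
    loads = {0: 5.0, 1: 5.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0}
    block = {obj: 0 if obj < 4 else 1 for obj in loads}  # initial block mapping
    print("before:", proc_loads(loads, block, 2))
    print("after :", proc_loads(loads, greedy_rebalance(loads, 2), 2))
```

Because the problem is partitioned more finely than the processor count, the balancer has enough small objects to even out the totals by migration alone.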
Load Balancing Framework
Building on Object-based Parallelism • Application-induced load imbalances • Environment-induced performance issues: • Dealing with extraneous loads on shared machines • Vacating workstations • Automatic checkpointing • Automatic prefetching for out-of-core execution • Heterogeneous clusters • Reuse: object-based components • But: must use Charm++!