
Multiresolution Analysis, Computational Chemistry, and Implications for High Productivity Parallel Programming

Aniruddha G. Shet, James Dinan, Robert J. Harrison, and P. Sadayappan


(Figure: Sub-trees from the decomposition of function space. Dashed arrows depict the flow of data between operations on tree nodes: the input data to a node operation is indicated by an arrow pointing into the node, and the output of the operation by an outgoing arrow. Labels: function interval, coefficient tensor, tree node.)

Background

Parallel Programming Challenges
• Concurrency
  • Operations on tree nodes are created as asynchronous tasks
  • Tasks are chained together in a hierarchical, nested manner to express the recursive parallelism in tree algorithms
  • Dependencies are handled by letting tasks synchronize on the completion of spawned tasks
  • Locality control permits running tasks under an owner-computes policy, launching each task where its target data resides
  • Planned support for a work-stealing technique in the language runtime for dynamic load balancing of a distributed set of tasks

const myFTree = new FTree(order=5);

def walkDownOp(node: Node) {   // top-down traversal
  /* Perform the operation on node */
  for child in node.getChild() do
    // then spawn an asynchronous task per child, on the child's home locale
    on myFTree.node2loc(child) do begin walkDownOp(child);
}
sync on myFTree.node2loc(root) do begin walkDownOp(root);   // sync waits for all spawned tasks

def walkUpOp(node: Node) {     // bottom-up traversal
  coforall child in node.getChild() do   // coforall joins: wait for all children first
    on myFTree.node2loc(child) do walkUpOp(child);
  /* Perform the operation on node */
}
sync on myFTree.node2loc(root) do begin walkUpOp(root);

• Shared memory model
  • Cilk-style fork-join task parallelism with a work-stealing runtime doing dynamic load balancing is a plausible solution, but:
    • it is not targeted at distributed data
    • the model mandates that a parent task await the completion of its child tasks, which constrains the full expression of the available parallelism
• Message passing model
  • Hard to express dynamic, irregular computations
  • The two-sided communication model introduces unnecessary overhead in reading and writing distributed tree data
  • Does not address the need for dynamic load balancing
• Partitioned Global Address Space (PGAS) models (e.g. Co-Array Fortran, UPC, Titanium)
  • Static SPMD model of parallelism; lack flexible threading capabilities
  • Maintaining distributed data structures without remote operations requires complex, low-level remote memory references
  • Does not address the need for dynamic load balancing

Multiresolution Analysis (MRA)
• Mathematical technique of function approximation
• Representation is a hierarchy of coefficients
• Dynamically adapts to guarantee the accuracy of the approximation
• Varying degree of information granularity in the hierarchy
• Trades numerical accuracy against computation time (see the schematic criterion below)

(Figure: MRA can represent different areas in an object at different levels of detail.)
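As a schematic summary of how the adaptive representation trades accuracy for work (not taken from the poster, but consistent with the thresh test in the parallel refine code further below): a node is subdivided only while its difference (wavelet) coefficients are still significant, so the function is held to the requested accuracy with coefficients only where the function demands them,

  f(x) \;\approx\; \sum_{\text{leaf } (n,\ell)} \sum_{i=0}^{k-1} s^{n}_{\ell,i}\,\phi^{n}_{\ell,i}(x),
  \qquad \text{refine node } (n,\ell) \text{ while } \lVert d^{\,n}_{\ell} \rVert \ge \varepsilon,

where s and d denote the sum (scaling) and difference (wavelet) coefficient tensors of the node at level n and translation ℓ, k is the polynomial order, and ε is the accuracy threshold.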
Coefficient tensor arithmetic
• Chapel provides a ZPL-style "array language" to simplify working with multi-dimensional arrays
• Domains, a first-class language concept denoting an index set, define the size and shape of arrays and support data-parallel iteration in creating and slicing arrays
• A range of array operators facilitates parallel operations on whole arrays or array slices, eliminating the need for tedious array indexing
• Mathematical functions involving coefficient tensors are easily expressed in the array language

/* Overloaded multiplication operator to perform vector-matrix multiplication */
def *(V: [] real, M: [] real) where V.rank == 1 && M.rank == 2 {
  const R: [i in M.domain.dim(2)] real = + reduce (V * M(..,i));
  return R;
}

/* Overloaded multiplication operator to perform matrix-vector multiplication */
def *(M: [] real, V: [] real) where M.rank == 2 && V.rank == 1 {
  const R: [i in M.domain.dim(1)] real = + reduce (M(i,..) * V);
  return R;
}

/* Norm of a vector */
def normf(V) where V.rank == 1 {
  return sqrt(+ reduce V**2);
}

What is (the) MADNESS (about)?
• Multiresolution Adaptive Numerical Environment for Scientific Simulation
• Programming environment for the solution of integral and differential equations
• Built on adaptive multiresolution analysis in a multiwavelet basis and low-separation-rank methods for scaling to higher dimensions
• Fast algorithms with guaranteed precision; trade precision for speed
• High-level composition of numerical codes: work with functions and operators
• Target applications: quantum chemistry, atomic and molecular physics, materials science, nuclear structure

Chapel Programming Model
• Multithreaded parallel programming
  • Global view of computation and data structures
  • Abstractions for data and task parallelism (a short example follows this list)
    • data: domain, forall, iterators
    • task: begin, cobegin, coforall, sync variables, atomic
  • Composition of parallelism
  • Virtualization of threads
• Locality-aware programming
  • locale: machine unit of storage and processing
  • domains may be distributed across locales
  • the on keyword binds computation to locale(s)
• Object-oriented programming
  • OOP can help manage program complexity
  • Classes and objects are provided in Chapel, but their use is typically not required
  • Advanced language features (e.g. distributions) are expressed using classes
• Generic programming and type inference
  • Type parameters
  • Latent types
  • Variables are statically typed
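The following minimal sketch is not from the poster; it simply illustrates the constructs just listed, written in the same (older) Chapel dialect as the poster's code. The identifiers n, D, A, and halve are illustrative only.

config const n = 8;                  // problem size, settable on the command line

const D: domain(2) = [1..n, 1..n];   // a first-class 2-d index set
var A: [D] real;                     // an array declared over the domain

forall (i,j) in D do                 // data parallelism over the index set
  A(i,j) = 1.0 / (i + j);

writeln("sum of A: ", + reduce A);   // whole-array reduction, no explicit indexing

cobegin {                            // two tasks run concurrently; cobegin waits for both
  on Locales(0) do writeln("task A ran on locale ", here.id);
  on Locales(numLocales-1) do writeln("task B ran on locale ", here.id);
}

/* Generic programming with type inference: halve's argument and return types
   are inferred at each call site (int, then real). */
def halve(x) { return x / 2; }
writeln(halve(10), " ", halve(10.0));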
Global-view container
• A container to store the tree; presents a global view of, and one-sided access to, the tree for tree algorithms
• Internally maintains a directory of the distributed collection of sub-trees and transparently maps an indexed node to its host locale
• Sub-trees are structured as associative arrays of node-coefficients key-value pairs
• The mapping scheme could be a simple hash function, or driven by a specialized partitioning strategy with better locality properties

class FTree {
  const tree: [LocaleSpace] SubTree;

  def FTree(order: int) {
    coforall loc in Locales do
      on loc do tree[loc.id] = new SubTree(order);
  }

  def this(node: Node) {
    const t = tree[node2loc(node).id];
    return t[node];
  }

  /* Global tree access methods */
}

class SubTree {
  const coeffDom: domain(1);
  var nodes: domain(Node);
  var coeff: [nodes] [coeffDom] real;

  /* Local tree access methods */
}

Python-like high-level programming
• Better usability when writing end-user codes in terms of functions and operators rather than their underlying representations
• The key language ideas are the object-oriented and type-inferencing features
• Future work will explore writing dimension-independent programs

/* The Fn_Test1 class wraps an analytic function */
var f = new Fn_Test1();

/* The Function class holds the numerical tree form */
var F = new Function(k=5, thresh=1e-5, f=f);

/* Overloaded arithmetic operators on the Function class invoke various tree algorithms */
var H = F + F;
H = F * F;

/* Print the numerical value at a given point */
writeln("Numerical value: ", format(" %0.8f", F(0.5)));

(Figure: A molecular orbital of the benzene molecule, with the adaptive mesh also displayed.)

Implementation Issues with MADNESS
• Multi-dimensional tree distribution
  • Multiresolution adaptive properties produce unbalanced coefficient trees (binary tree in 1-d, quad tree in 2-d, oct tree in 3-d, etc.)
  • The tree structure evolves in unscheduled ways owing to very flexible adaptive refinement
  • A scheme is needed to partition the complete tree, since the entire tree, and not just the leaf nodes, is used by some algorithms (a node-index sketch appears below)
• Two main types of applications
  • A few large trees (billions of nodes): time evolution of wavepackets in molecular physics
  • Many (thousands of) smaller trees (millions of nodes): materials and electronic structure
  • Nodes range from 1 KB to 1 MB
• Algorithmic characteristics
  • Some algorithms are recursive, tree-walking algorithms that move data up/down the tree structure; the number of levels navigated varies dynamically and may constitute a data-dependence chain
  • Some move data laterally within the same level of the tree, i.e. between neighboring nodes
  • Some apply mathematical functions to the collection of coefficient tensors, possibly combining the individual results
  • Certain algorithms operate on multiple trees having different refinement characteristics and produce a new tree
• Need to express and manage huge amounts of nested, hierarchical concurrency on distributed many-core petascale machines

Putting the Pieces Together

(Figure: Binary-tree numerical form of a 1-d analytical function. Note that some intervals are not sub-divided, owing to the adaptive nature of the refinement.)
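The compress and refine codes below index the tree by Node values, obtain a node's two children with getChild(), and map a node to a locale with node2loc. The poster does not show these pieces themselves; the following standalone sketch is hypothetical, assuming the usual 1-d level/translation indexing in which node (n, l) has children (n+1, 2l) and (n+1, 2l+1), and a simple cyclic hash as the "simple hash function" mapping option mentioned above.

class Node {
  const n: int;   // refinement level
  const l: int;   // translation (interval index) within the level

  /* The two children of this node, returned as a 2-tuple so that callers
     can write child(1) and child(2) as in the code below. */
  def getChild() {
    return (new Node(n+1, 2*l), new Node(n+1, 2*l+1));
  }
}

/* A placeholder node-to-locale map: spread nodes cyclically over locales. */
def node2loc(node: Node) {
  return Locales((node.n + node.l) % numLocales);
}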
Parallel refine algorithm

def refine(node = root) {
  const child = node.getChild();
  var sc: [0..2*k-1] real;
  sc[0..k-1] = project(child(1));
  sc[k..2*k-1] = project(child(2));
  const dc = sc*hgT;
  const nf = normf(dc[k..2*k-1]);
  if (nf < thresh) {
    // difference coefficients are negligible: accept the children's sum coefficients
    sumC[child(1)] = sc[0..k-1];
    sumC[child(2)] = sc[k..2*k-1];
  } else {
    // otherwise refine each child asynchronously on its home locale
    on sumC.node2loc(child(1)) do begin refine(child(1));
    on sumC.node2loc(child(2)) do begin refine(child(2));
  }
}
sync on sumC.node2loc(root) do begin refine(root);

Parallel compress algorithm

def compress(node = root) {
  const child = node.getChild();
  // compress both children first, in parallel on their home locales
  cobegin {
    on sumC.node2loc(child(1)) do
      if !sumC.hasCoeffs(child(1)) then compress(child(1));
    on sumC.node2loc(child(2)) do
      if !sumC.hasCoeffs(child(2)) then compress(child(2));
  }
  // then combine the children's sum coefficients into this node's sum and difference coefficients
  var sc: [0..2*k-1] real;
  sc[0..k-1] = sumC[child(1)];
  sc[k..2*k-1] = sumC[child(2)];
  const dc = sc*hgT;
  sumC[node] = dc[0..k-1];
  diffC[node] = dc[k..2*k-1];
  sumC.remove(child(1));
  sumC.remove(child(2));
}
sync on sumC.node2loc(root) do begin compress(root);

sumC and diffC are global-view tree containers (FTree instances holding the sum and difference coefficients, respectively).

(Figures, from the Solution Building Blocks panels: an adaptively refined tree; the compress and reconstruct tree algorithms; the multiplication of differently refined trees. Legend: a coefficient tensor that is added to the tree during a tree algorithm; a coefficient tensor that is removed from the tree after it has been operated upon.)

Future Work
• Recursive task parallelism
  • Language constructs to distinguish between tasks that may vs. must run in parallel
    • A "may" construct permits runtime management of parallelism
    • A "must" construct supports patterns like producer-consumer
  • Application vs. compiler control over the granularity and degree of parallelism
• DAG-based dynamic scheduling of tasks inside the Chapel runtime
  • Provide execution guarantees for a class of DAGs
• Compiler support for parallel distribution/iteration/algebraic operations on associative array structures

Research sponsored in part by the Laboratory Directed Research and Development Program and Post Masters Research Participation Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, and by DOE grant #DE-FC02-06ER25755 and NSF grant #0403342.
