Introducing Tpetra and Kokkos Chris Baker/ORNL TUG 2009 November 3-5 @ CSRI
Introducing Tpetra and Kokkos • Tpetra provides a next-generation implementation of the Petra Object Model, a framework for distributed linear algebra objects. • Tpetra is the successor to Epetra. • Kokkos is an API for programming to a generic parallel node. • The Kokkos memory model allows code to be targeted to traditional (CPU) and non-traditional (accelerated) nodes. • The Kokkos compute model provides a set of constructs for parallel computation.
Tpetra Organization • Tpetra follows the Petra Object Model currently implemented in Epetra: • Map describes the distribution of object data across nodes. • Teuchos::Comm abstracts internode communication. • Import, Export, Distributor utility classes facilitate efficient data transfer. • Operator, RowMatrix, RowGraph provide abstract interfaces. • Vector, MultiVector, CrsGraph, CrsMatrix are concrete implementations that are the workhorses of Tpetra-centered codes (see the sketch below). • Any class with significant data is templated. • Any class with significant computation uses Kokkos.
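To make the composition concrete, here is a minimal sketch (not from the slides) of a Comm, a Map built on it, and a Vector built on the Map; the constructor signatures and header names follow common Tpetra usage and may differ between releases.

// Minimal sketch of the Petra Object Model composition in Tpetra.
// Constructor signatures are illustrative and may vary by release.
#include <Teuchos_DefaultComm.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_Vector.hpp>

void petraObjectModelSketch() {
  using Teuchos::RCP;
  // Comm abstracts internode communication.
  RCP<const Teuchos::Comm<int> > comm = Teuchos::DefaultComm<int>::getComm();
  // Map describes how 100 global elements are distributed across nodes.
  typedef Tpetra::Map<int, long long> map_type;
  RCP<const map_type> map = Teuchos::rcp(new map_type(100, 0, comm));
  // Vector is a concrete distributed object built on the Map.
  Tpetra::Vector<double, int, long long> x(map);
  x.putScalar(1.0);
}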
Tpetra vs. Epetra • Most of the functionality of Epetra is present in Tpetra. • Some differences, however, prohibit a simple "find-and-replace" migration.
Tpetra Templated Classes • A limitation of Epetra is that its implementation is tied to double and int. • Deployment of Epetra discourages significant modifications. • The published interface limits the possible implementation changes. • A clean slate and compiler availability allow Tpetra to address this via template parameters to its classes. • This provides numerous capability extensions: • No 4GB limit: going beyond int enables arbitrarily large problems. • Arbitrary scalar types: float, complex, matrix<5,3>, qd_real. • Greater efficiency.
Tpetra Basic Template Parameters • Three primary template arguments: LocalOrdinal, GlobalOrdinal, Scalar. • Scalar enables the description of numerical objects over different fields. • Any mathematically well-defined type is supported. • Additionally, the type must be supported by Teuchos::ScalarTraits and Teuchos::SerializationTraits. • LocalOrdinal describes local element indices. • Intended to enable efficiency; should be chosen as small as possible. • GlobalOrdinal describes global element indices. • Intended to enable larger problem sizes. • Decoupling the two is necessary when the number of nodes is large.
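As a hypothetical illustration (not from the slides), a typical instantiation might pair a 32-bit LocalOrdinal with a 64-bit GlobalOrdinal and a single-precision Scalar:

// Hypothetical template-parameter choices; header names are assumed.
#include <Tpetra_Map.hpp>
#include <Tpetra_CrsMatrix.hpp>

typedef int       LO;  // LocalOrdinal: small, indexes only this node's data
typedef long long GO;  // GlobalOrdinal: large, allows more than 2^31 global elements
typedef float     SC;  // Scalar: any type with ScalarTraits/SerializationTraits support

typedef Tpetra::Map<LO, GO>           map_type;
typedef Tpetra::CrsMatrix<SC, LO, GO> matrix_type;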
Tpetra Template Examples Map<LocalOrdinal, GlobalOrdinal> • global_size_t getGlobalNumElements() • size_t getNodeNumElements() • LocalOrdinal getLocalElement(GlobalOrdinal gid) • GlobalOrdinal getGlobalElement(LocalOrdinal lid) CrsMatrix<Scalar, LocalOrdinal, GlobalOrdinal> • global_size_t getGlobalNumEntries() • size_t getNodeNumEntries() • void getGlobalRowView(GlobalOrdinal gid, ArrayRCP<GlobalOrdinal> &inds, ArrayRCP<Scalar> &vals)
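A brief usage sketch of the queries listed above; the method names come from the slide, while the object construction is an assumption for illustration.

#include <Teuchos_DefaultComm.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_Map.hpp>

void mapQuerySketch() {
  typedef Tpetra::Map<int, long long> map_type;
  Teuchos::RCP<const Teuchos::Comm<int> > comm = Teuchos::DefaultComm<int>::getComm();
  Teuchos::RCP<const map_type> map = Teuchos::rcp(new map_type(1000, 0, comm));

  Tpetra::global_size_t nGlobal = map->getGlobalNumElements(); // over all nodes
  size_t nLocal = map->getNodeNumElements();                   // on this node only
  long long gid = map->getGlobalElement(0);   // global index of local element 0
  int lid       = map->getLocalElement(gid);  // back to a local index
  (void) nGlobal; (void) nLocal; (void) lid;
}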
Tpetra Advanced Template Parameters • Other template arguments exist to provide additional flexibility in the Tpetra object implementation: • The Node template argument specifies a Kokkos node. • Local data structures and implementations are also flexible.
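For example, the Node argument might be supplied as a fourth template parameter; the node class names follow the slides, while the exact namespaces, headers, and defaults are assumptions that vary by Trilinos version.

#include <Tpetra_Vector.hpp>
// #include <Kokkos_TBBNode.hpp>  // assumed header name for a TBB-backed node

// Default node (typically a serial CPU node):
typedef Tpetra::Vector<double, int, long long> default_vector_type;

// Explicitly targeting a threaded node (class name taken from the slides,
// namespace assumed):
// typedef Tpetra::Vector<double, int, long long, Kokkos::TBBNode> tbb_vector_type;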
Kokkos Parallel Node API • Want: minimize the effort needed to port Tpetra to a new platform. • The goal of Kokkos is to allow code, once written, to run on any parallel node, regardless of architecture. • The difficulties are many: • Difficulty #1: many different memory architectures. • A node may have multiple, disjoint memory spaces. • Optimal performance may require special memory placement. • Difficulty #2: kernels must be tailored to the architecture. • The implementation of an optimal kernel will vary between architectures. • There is no universal binary, so separate compilation paths are needed.
Kokkos Node API • Kokkos provides two components: • The Kokkos memory model addresses Difficulty #1: • Allocation, deallocation, and efficient access of memory. • Compute buffer: a special memory allocation used exclusively for parallel computation. • The Kokkos compute model addresses Difficulty #2: • Description of kernels for parallel execution on a node. • Provides stubs for common parallel work constructs: • Parallel for loop. • Parallel reduction. • Supporting a new platform is then only a matter of implementing these models, i.e., implementing a new Node object.
Kokkos Memory Model • A generic node model must at least: • support the scenario involving distinct memory regions • allow efficient memory access under traditional scenarios • Node provides the following memory-handling routines:
ArrayRCP<T> Node::allocBuffer<T>(size_t sz);
void Node::copyToBuffer<T>(ArrayView<T> src, ArrayRCP<T> dest);
void Node::copyFromBuffer<T>(ArrayRCP<T> src, ArrayView<T> dest);
ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff);
void Node::readyBuffer<T>(ArrayRCP<T> buff);
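The sketch below shows how these routines might be used from generic code; it follows the signatures listed above, while the overall flow and the omitted kernel launches are assumptions for illustration.

#include <Teuchos_ArrayRCP.hpp>
#include <Teuchos_ArrayView.hpp>

template <class Node>
void computeOnNode(Node &node, Teuchos::ArrayView<double> hostData) {
  const size_t n = static_cast<size_t>(hostData.size());
  // Allocate a compute buffer (possibly in a separate device memory space).
  Teuchos::ArrayRCP<double> buf = node.template allocBuffer<double>(n);
  // Move the host data into the compute buffer.
  node.template copyToBuffer<double>(hostData, buf);
  // ... run parallel kernels on the node that read and write buf ...
  // Copy the results back into host-accessible memory.
  node.template copyFromBuffer<double>(buf, hostData);
}  // buf's last reference disappears here; the node's deallocator runs.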
Kokkos Compute Model • Have to find the correct level for programming the node: • Too low: code dot(x,y) for each node. • Too much work to move to a new platform. • The effort of writing dot() duplicates that of norm1(). • Writing m kernels for n nodes means m*n implementations. • Too high: code dot(x,y) once for all nodes. • Can't exploit hardware features. • The API becomes a programming language without a compiler. • Somewhere in the middle: • Parallel reduction is the intersection of dot() and norm1(). • Parallel for loop is the intersection of axpy() and mat-vec. • We need a way of fusing kernels with these basic constructs: m kernels plus 2 constructs for n nodes means only m + 2*n implementations.
Kokkos Compute Model • Template meta-programming is the answer. • This is the same approach that Intel TBB takes. • Node provides generic parallel constructs • Node::parallel_for, Node::parallel_reduce • User fills the holes in the generic construct.
Nodes and Kernels: How It Comes Together • The Kokkos developer/vendor/hero develops nodes. • The user develops kernels for the parallel constructs. • Template meta-programming does the rest: • TBBNode< DotOp<double> >::parallel_reduce • CUDANode< ComputePotentials<3D,LJ> >::parallel_for • Composition happens at compile time: • OpenMPNode + AxpyOp is equivalent to a hand-coded OpenMP axpy. • May not always be able to achieve this feat.
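The following is a hypothetical illustration of the fill-in-the-holes pattern, not the actual Kokkos interface: the member names of the kernel struct and the way the node invokes it are assumptions for exposition.

#include <cstddef>

// User-supplied kernel: only the per-element work and the combine step.
template <class Scalar>
struct DotOp {
  const Scalar *x, *y;
  typedef Scalar ReductionType;
  static Scalar identity() { return Scalar(0); }
  Scalar generate(std::size_t i) const { return x[i] * y[i]; }  // element work
  Scalar reduce(Scalar a, Scalar b) const { return a + b; }     // combine step
};

// Node-supplied construct: a serial stand-in for Node::parallel_reduce.
// A TBB, CUDA, or OpenMP node would provide its own parallel version,
// and the composition with DotOp happens entirely at compile time.
template <class WDP>
typename WDP::ReductionType serial_parallel_reduce(std::size_t n, WDP op) {
  typename WDP::ReductionType acc = WDP::identity();
  for (std::size_t i = 0; i < n; ++i) {
    acc = op.reduce(acc, op.generate(i));
  }
  return acc;
}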
Kokkos Linear Algebra Library • A subpackage of Kokkos providing a set of data structures and kernels for local parallel linear algebra objects. • Coded to the Kokkos Parallel Node API. • Tpetra (global) objects consist of a Comm and a corresponding (local) Kokkos object. • Implementing a new Node ports Tpetra without any changes to Tpetra itself.
T Tpetra::Vector<T>::dot(Tpetra::Vector<T> v) {
  T lcl = this->lclVec_->dot( v.lclVec_ );
  return comm_->reduceAll<T>(SUM, lcl);
}
Teuchos Memory Management Suite: A User Perspective Chris Baker/ORNL TUG 2009 November 3-5 @ CSRI
Teuchos Memory Management • The Teuchos utility package provides a number of memory management classes: • RCP: reference counted pointer • ArrayRCP: reference counted array • ArrayView: encapsulates the length of and pointer to an array • Array: dynamically sized array • Tpetra/Kokkos utilize these classes in place of raw pointers for: • writing bug-free code • writing simple code with simple interfaces
Teuchos::RCP • RCP is a reference-counted smart pointer. • Provides runtime protection against null dereference. • Provides automatic garbage collection. • Necessary in the context of exceptions. • Semantics are those of a C pointer. • Tpetra use: • Tracking the ownership of dynamically created objects. • Tpetra::Map objects are always passed by RCP. • Dynamically created objects are always encapsulated in an RCP: • RCP<Vector> Vector::getSubView(...) • Non-persisting situations can instead use the efficient Teuchos::Ptr.
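A minimal usage sketch in the style Tpetra follows (the Widget class is hypothetical; RCP and rcp() are the actual Teuchos names):

#include <Teuchos_RCP.hpp>

struct Widget { int value; };  // hypothetical example class

Teuchos::RCP<Widget> makeWidget() {
  return Teuchos::rcp(new Widget());  // the RCP takes ownership
}

void useWidget() {
  Teuchos::RCP<Widget> w = makeWidget();
  w->value = 42;  // pointer semantics: -> and * work as expected
}                 // last RCP goes out of scope; the Widget is deleted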
Teuchos::ArrayRCP • ArrayRCP is a reference-counted smart array. • T* does double duty in C: pointer to a single object and pointer to an array. • RCP is for the former; ArrayRCP is for the latter. • Semantics are those of a C array/pointer: • Access operators: [] * -> • Arithmetic operators: + - ++ -- += -= • All operations are bounds-checked in debug mode. • Iterators are available for optimal release performance. • Tpetra/Kokkos use: • Allocated arrays are always encapsulated in an ArrayRCP before return. • Used heavily in Kokkos for compute buffers and their views.
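A minimal sketch of ArrayRCP in use; the allocation via Teuchos::arcp() and the operators follow the semantics listed above, while the surrounding code is illustrative.

#include <Teuchos_ArrayRCP.hpp>

void arrayRcpSketch() {
  // Allocate a reference-counted array of 10 doubles.
  Teuchos::ArrayRCP<double> a = Teuchos::arcp<double>(10);
  for (Teuchos::ArrayRCP<double>::size_type i = 0; i < a.size(); ++i) {
    a[i] = 2.0 * i;  // operator[] is bounds-checked in a debug build
  }
  Teuchos::ArrayRCP<double> tail = a + 5;  // pointer-style arithmetic
  tail[0] = -1.0;                          // refers to a[5]
}  // memory is freed when the last reference goes away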
Example: ARCP and Kokkos Buffers ArrayRCP<T> Node::allocBuffer<T>(size_t sz); • The use of Teuchos::ArrayRCP greatly simplifies the management of compute buffers in the Kokkos memory model. • In the absence of a smart pointer, the Node would need to provide a deleteBuffer() method as well. • It would need to be manually called by the user. • This requires the ability to identify when the buffer can be freed. • ArrayRCP allows the Node to register a custom, Node-appropriate deallocator and additional bookkeeping data.
Example: ARCP and Kokkos Buffers ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff); • In the absence of ArrayRCP, this method requires that the user "release" the view to enable any necessary write-back to device memory. • This requires manually tracking when the view has expired. • Instead, the Node can register a custom deallocator for the ArrayRCP that will perform the write-back or other necessary bookkeeping. • This is especially helpful in the context of Tpetra. • Tpetra::MultiVector::get1dView() returns a host view of class data encapsulated in an ArrayRCP with the appropriate deallocator. • As a result, the Tpetra user isn't exposed to the Kokkos Node and doesn't have to manually release the view.
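A sketch of the idea: the node returns an ArrayRCP whose deallocator performs the write-back when the last reference disappears. The arcp() overload taking a deallocator and its free() concept are assumptions about the Teuchos interface here, and the write-back itself is stubbed out.

#include <Teuchos_ArrayRCP.hpp>
#include <cstdlib>

// Hypothetical deallocator: copy the host view back to device memory
// (stubbed out) before releasing the host allocation.
struct WriteBackDealloc {
  void free(float *hostPtr) {
    // ... copy hostPtr contents back to the device buffer ...
    std::free(hostPtr);
  }
};

Teuchos::ArrayRCP<float> makeHostView(std::size_t n) {
  float *host = static_cast<float *>(std::malloc(n * sizeof(float)));
  // Ownership and the write-back action travel with the returned ArrayRCP.
  return Teuchos::arcp(host, 0, n, WriteBackDealloc(), true);
}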
Teuchos::ArrayView • RCP is sometimes overkill; non-persisting relationships can get away with Ptr. • Non-persisting relationships involving array data similarly use the ArrayView class. • This class basically encapsulates a pointer and a size. • Supports a subset of C array semantics. • An optimized build results in very fast code: • No garbage-collection overhead. • Iterators become C pointers. • Well integrated with other classes: • Easily returned by ArrayRCP and Array.
Teuchos::Array • Array is a replacement for std::vector. • The benefit of Array is its integration with the other Teuchos memory classes (see the sketch below).
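A minimal sketch of Array and ArrayView together: Array owns the storage, and ArrayView is the lightweight way to hand it to a callee in a non-persisting relationship. The conversion via operator() follows common Teuchos usage; the surrounding code is illustrative.

#include <Teuchos_Array.hpp>
#include <Teuchos_ArrayView.hpp>

// Non-persisting relationship: the callee only reads the data.
double sum(const Teuchos::ArrayView<const double> &v) {
  double s = 0.0;
  for (Teuchos::ArrayView<const double>::const_iterator it = v.begin();
       it != v.end(); ++it) {
    s += *it;  // iterators compile down to raw pointers in an optimized build
  }
  return s;
}

void arraySketch() {
  Teuchos::Array<double> a(4, 1.0);  // like std::vector: 4 entries, each 1.0
  const double total = sum(a());     // a() yields an ArrayView over the whole array
  (void) total;
}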
Benefits of Use • The initial release of Tpetra contained no raw pointers: • Replaced by RCP, ArrayRCP, or the appropriate iterator. • Zero memory overhead w.r.t. Epetra. • Almost made me a lazier developer. • Debugging abilities are excellent: • Extends beyond normal bounds checking; additional constraints can be placed on memory access. • An optimized build results in code that is as fast as C. • These memory utilities are unique to Trilinos: • Research-level capability. • Production-level quality.