
Introducing Tpetra and Kokkos

Chris Baker/ORNL, TUG 2009, November 3-5 @ CSRI. Tpetra provides a next-generation implementation of the Petra Object Model, a framework for distributed linear algebra objects. Tpetra is a successor to Epetra.


Presentation Transcript


  1. Introducing Tpetra and Kokkos Chris Baker/ORNL TUG 2009 November 3-5 @ CSRI

  2. Introducing Tpetra and Kokkos • Tpetra provides a next-generation implementation of the Petra Object Model. • This is a framework for distributed linear algebra objects. • Tpetra is a successor to Epetra. • Kokkos is an API for programming to a generic parallel node. • The Kokkos memory model allows code to be targeted to traditional (CPU) and non-traditional (accelerated) nodes. • The Kokkos computational model provides a set of constructs for parallel computing operations.

  3. Tpetra Organization • Tpetra follows the Petra Object Model currently implemented in Epetra: • Map describes the distribution of object data across nodes. • Teuchos::Comm abstracts internode communication. • Import, Export, Distributor utility classes facilitate efficient data transfer. • Operator, RowMatrix, RowGraph provide abstract interfaces. • Vector, MultiVector, CrsGraph, CrsMatrix are concrete implementations that are the workhorse of Tpetra-centered codes. • Any class with significant data is templated. • Any class with significant computation uses Kokkos.

  4. Tpetra vs. Epetra • Most of the functionality of Epetra is present in Tpetra. • Some differences prohibit a “find-replace” migration:

  5. Tpetra Templated Classes • A limitation of Epetra is that the implementation is tied to double and int. • Deployment of Epetra discourages significant modifications. • Published interface limits the possible implementation changes. • Clean slate and compiler availability allow Tpetra to address this via template parameters to classes. • This provides numerous capability extensions: • No 4GB limit: surpassing int enables arbitrarily large problems. • Arbitrary scalar types: float, complex, matrix<5,3>, qd_real • Greater efficiency.

  6. Tpetra Basic Template Parameters • Three primary template arguments: • LocalOrdinal, GlobalOrdinal, Scalar • Scalar enables the description of numerical objects over different fields. • Any mathematically well-defined type is supported. • Additionally, the type requires support under Teuchos::ScalarTraits and Teuchos::SerializationTraits. • LocalOrdinal describes local element indices. • Intended to enable efficiency; should be chosen as small as possible. • GlobalOrdinal describes global element indices. • Intended to enable larger problem sizes. • Decoupling is necessary when the number of nodes is large.

  7. Tpetra Template Examples Map<LocalOrdinal, GlobalOrdinal> • global_size_t getGlobalNumElements() • size_t getNodeNumElements() • LocalOrdinal getLocalElement(GlobalOrdinal gid) • GlobalOrdinal getGlobalElement(LocalOrdinal lid) CrsMatrix<Scalar, LocalOrdinal, GlobalOrdinal> • global_size_t getGlobalNumEntries() • size_t getNodeNumEntries() • void getGlobalRowView(GlobalOrdinal gid, ArrayRCP<GlobalOrdinal> &inds, ArrayRCP<Scalar> &vals)
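The decoupling of LocalOrdinal and GlobalOrdinal described on slides 6 and 7 can be illustrated with a small sketch. This is not the Tpetra::Map implementation: ToyContiguousMap is a hypothetical class that assumes each process owns one contiguous block of global indices, assumes signed ordinal types, and uses -1 as an assumed "not found" value. Only the accessor names are borrowed from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a map over one contiguous block of global indices.
// Illustrative only; it is not the Tpetra::Map implementation.
template <class LocalOrdinal, class GlobalOrdinal>
class ToyContiguousMap {
public:
  // firstGlobal: first global index owned here; numLocal: number of owned entries.
  ToyContiguousMap(GlobalOrdinal firstGlobal, LocalOrdinal numLocal)
    : first_(firstGlobal), numLocal_(numLocal) {
    assert(numLocal >= 0);
  }

  std::size_t getNodeNumElements() const {
    return static_cast<std::size_t>(numLocal_);
  }

  // Global -> local index; returns -1 (assumed sentinel) if gid is not owned here.
  LocalOrdinal getLocalElement(GlobalOrdinal gid) const {
    if (gid < first_ || gid - first_ >= static_cast<GlobalOrdinal>(numLocal_)) {
      return -1;
    }
    return static_cast<LocalOrdinal>(gid - first_);
  }

  // Local -> global index; returns -1 (assumed sentinel) if lid is out of range.
  GlobalOrdinal getGlobalElement(LocalOrdinal lid) const {
    if (lid < 0 || lid >= numLocal_) {
      return -1;
    }
    return first_ + static_cast<GlobalOrdinal>(lid);
  }

private:
  GlobalOrdinal first_;
  LocalOrdinal numLocal_;
};
```

With ToyContiguousMap<int, std::int64_t>, a process can own entries whose global indices exceed the range of int while its local indices remain small, which is the point of choosing the two ordinal types independently.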

  8. Tpetra Advanced Template Parameters • Other template arguments exist to provide additional flexibility in Tpetra object implementation: • Node template argument specifies a Kokkos node. • Local data structures and implementations also flexible.

  9. Kokkos Parallel Node API • Want: minimize the effort needed to port Tpetra • The goal of Kokkos is to allow code, once written, to be run on any parallel node, regardless of architecture. • Difficulties are many • Difficulty #1: Many different memory architectures • Node may have multiple, disjoint memory spaces. • Optimal performance may require special memory placement. • Difficulty #2: Kernels must be tailored to architecture • Implementation of optimal kernel will vary between architectures • No universal binary, hence the need for separate compilation paths

  10. Kokkos Node API • Kokkos provides two components: • Kokkos memory model addresses Difficulty #1 • Allocation, deallocation and efficient access of memory • compute buffer: special memory allocation used exclusively for parallel computation • Kokkos compute model addresses Difficulty #2 • Description of kernels for parallel execution on a node • Provides stubs for common parallel work constructs • Parallel for loop • Parallel reduction • Supporting a new platform is only a matter of implementing these models, i.e., implementing a new Node object.

  11. Kokkos Memory Model • A generic node model must at least • support the scenario involving distinct memory regions • allow efficient memory access under traditional scenarios • Node provides the following memory handling routines: ArrayRCP<T> Node::allocBuffer<T>(size_t sz); void Node::copyToBuffer<T>(ArrayView<T> src, ArrayRCP<T> dest); void Node::copyFromBuffer<T>(ArrayRCP<T> src, ArrayView<T> dest); ArrayRCP<T> Node::viewBuffer<T> (ArrayRCP<T> buff); void Node::readyBuffer<T>(ArrayRCP<T> buff);

  12. Kokkos Compute Model • Have to find the correct level for programming the node: • Too low: code dot(x,y) for each node • Too much work to move to a new platform. • Effort of writing dot() duplicates that of norm1() • Too high: code dot(x,y) for all nodes. • Can’t exploit hardware features. • API becomes a programming language without a compiler. • Somewhere in the middle: • Parallel reduction is the intersection of dot() and norm1() • Parallel for loop is the intersection of axpy() and mat-vec • We need a way of fusing kernels with these basic constructs. • Hand-coding every kernel for every node: m kernels * n nodes = m*n implementations. • With generic constructs: m kernels + (2 constructs * n nodes) = m + 2n implementations.

  13. Kokkos Compute Model • Template meta-programming is the answer. • This is the same approach that Intel TBB takes. • Node provides generic parallel constructs • Node::parallel_for, Node::parallel_reduce • User fills the holes in the generic construct.

  14. Nodes and Kernels: How it comes together • Kokkos developer/Vendor/Hero develops nodes: • User develops kernels for parallel constructs. • Template meta-programming does the rest: • TBBNode< DotOp<double> >::parallel_reduce • CUDANode< ComputePotentials<3D,LJ> >::parallel_for • Composition is compile-time • OpenMPNode + AxpyOp equivalent to hand-coded OpenMPAxpy. • May not always be able to achieve this feat.
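The split between nodes and kernels on slides 12 through 14 can be sketched as follows. This is an illustration under stated assumptions, not the Kokkos API: ToySerialNode is a hypothetical single-threaded node, and the kernel member names (execute, identity, generate, reduce, ReductionType) are assumptions chosen for this example. AxpyOp and DotOp take their names from the slides, but their bodies here are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial node: supplies the two generic constructs.
// A threaded or accelerator node would replace these loops with its own
// scheduling while leaving the kernels below unchanged.
struct ToySerialNode {
  template <class Kernel>
  static void parallel_for(std::size_t begin, std::size_t end, Kernel k) {
    for (std::size_t i = begin; i < end; ++i) k.execute(i);
  }

  template <class Kernel>
  static typename Kernel::ReductionType
  parallel_reduce(std::size_t begin, std::size_t end, Kernel k) {
    typename Kernel::ReductionType acc = k.identity();
    for (std::size_t i = begin; i < end; ++i) acc = k.reduce(acc, k.generate(i));
    return acc;
  }
};

// Kernel for y = alpha*x + y; fills the hole in parallel_for.
template <class Scalar>
struct AxpyOp {
  Scalar alpha;
  const Scalar* x;
  Scalar* y;
  void execute(std::size_t i) const { y[i] += alpha * x[i]; }
};

// Kernel for the dot product; fills the holes in parallel_reduce.
template <class Scalar>
struct DotOp {
  typedef Scalar ReductionType;
  const Scalar* x;
  const Scalar* y;
  Scalar identity() const { return Scalar(0); }
  Scalar generate(std::size_t i) const { return x[i] * y[i]; }
  Scalar reduce(Scalar a, Scalar b) const { return a + b; }
};

// The node is a template parameter, so node and kernel are composed at compile time.
template <class Node, class Scalar>
Scalar dot(const std::vector<Scalar>& x, const std::vector<Scalar>& y) {
  assert(x.size() == y.size());
  DotOp<Scalar> op = { x.data(), y.data() };
  return Node::parallel_reduce(0, x.size(), op);
}
```

Because the node only ever sees the kernel through a template parameter, the compiler can inline execute() or generate() into the node's loop, which is how a composed node plus kernel can match a hand-coded version.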

  15. Kokkos Linear Algebra Library • A subpackage of Kokkos providing a set of data structures and kernels for local parallel linear algebra objects. • Coded to the Kokkos Parallel Node API • Tpetra (global) objects consist of a Comm and a corresponding (local) Kokkos object. • Implementing a new Node ports Tpetra without any changes to Tpetra. T Tpetra::Vector<T>::dot(Tpetra::Vector<T> v) { T lcl = this->lclVec_->dot( v.lclVec_ ); return comm_->reduceAll<T>(SUM, lcl); }

  16. Teuchos Memory Management Suite: A User Perspective Chris Baker/ORNL TUG 2009 November 3-5 @ CSRI

  17. Teuchos Memory Management • The Teuchos utility package provides a number of memory management classes: • RCP: reference counted pointer • ArrayRCP: reference counted array • ArrayView: encapsulates the length of and pointer to an array • Array: dynamically sized array • Tpetra/Kokkos utilize these classes in place of raw pointers for: • writing bug-free code • writing simple code with simple interfaces

  18. Teuchos::RCP • RCP is a reference-counted smart pointer • Provides runtime protection against null dereference • Provides automatic garbage collection • Necessary in the context of exceptions. • Semantics are those of a C pointer • Tpetra use: • Tracking the ownership of dynamically created objects • Tpetra::Map objects are always passed by RCP. • Dynamically created objects are always encapsulated in RCP: • RCP<Vector> Vector::getSubView(...) • Non-persisting situations allow the more efficient Teuchos::Ptr.

  19. Teuchos::ArrayRCP • ArrayRCP is a reference-counted smart array • T* holds double duty in C: pointer and pointer to array • RCP is for the former; ARCP is for the latter • Semantics are those of C array/pointer • access operators: [] * -> • arithmetic operators: + - ++ -- += -= • all operations are bounds-checked in debug mode • iterators are available for optimal release performance • Tpetra/Kokkos use: • Allocated arrays always encapsulated in ARCP before return. • Used heavily in Kokkos for compute buffers and their views.

  20. Example: ARCP and Kokkos Buffers • The use of Teuchos::ArrayRCP greatly simplifies the management of compute buffers in the Kokkos memory model. • In the absence of a smart pointer, the Node would need to provide a deleteBuffer() method as well. • Would need to be manually called by user. • This requires the ability to identify when the buffer can be freed. • ArrayRCP allows Node to register a custom, Node-appropriate deallocator and additional bookkeeping data. ArrayRCP<T> Node::allocBuffer<T>(size_t sz);

  21. Example: ARCP and Kokkos Buffers ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff); • In the absence of ArrayRCP, this method requires that the user “release” the view to enable any necessary write-back to device memory. • This requires manually tracking when the view has expired. • Instead, Node can register a custom deallocator for the ArrayRCP that will perform the write-back or other necessary bookkeeping. • This is especially helpful in the context of Tpetra. • Tpetra::MultiVector::get1dView() returns a host view of class data encapsulated in an ArrayRCP with appropriate deallocator. • As a result, the Tpetra user isn’t exposed to the Kokkos Node and doesn’t have to manually release the view.
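The write-back-on-release idea from slides 20 and 21 can be sketched with standard C++ only. This example swaps in std::shared_ptr with a custom deleter in place of Teuchos::ArrayRCP with a custom deallocator; ToyDeviceBuffer and this free-function viewBuffer are hypothetical, and a std::vector stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical "device" buffer; a std::vector stands in for device memory.
template <class T>
struct ToyDeviceBuffer {
  std::vector<T> storage;
};

// Returns a host-side copy of the device data. When the last owner of the
// returned view goes away, the custom deleter copies the host data back to
// the device buffer and then frees the host copy.
template <class T>
std::shared_ptr<std::vector<T>>
viewBuffer(const std::shared_ptr<ToyDeviceBuffer<T>>& dev) {
  assert(dev);
  std::vector<T>* host = new std::vector<T>(dev->storage);  // device -> host copy
  return std::shared_ptr<std::vector<T>>(
      host,
      [dev](std::vector<T>* h) {
        dev->storage = *h;  // write-back: host -> device
        delete h;           // release the host copy
      });
}
```

The caller never issues an explicit release: letting the view go out of scope is enough to trigger the write-back, and the deleter's captured dev keeps the device buffer alive until then.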

  22. Teuchos::ArrayView • RCP is sometimes overkill; non-persisting relationships can get away with Ptr. • Non-persisting relationships of array data similarly utilize the ArrayView class. • This class encapsulates a pointer and a size. • Supports a subset of C array semantics • Optimized build results in very fast code. • No garbage collection overhead. • Iterators become C pointers. • Well integrated with other classes • Easily returned by ArrayRCP and Array
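A minimal sketch of the pointer-plus-size idea follows. ToyArrayView is a hypothetical class, not Teuchos::ArrayView, and the subview name is an assumption for this example. Bounds checks use assert, so they disappear when NDEBUG is defined, and the iterators are plain C pointers, mirroring the debug versus optimized behavior described on the slide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view over contiguous data: a pointer and a length.
// It performs no reference counting, so the caller must keep the data alive.
template <class T>
class ToyArrayView {
public:
  ToyArrayView(T* ptr, std::size_t size) : ptr_(ptr), size_(size) {}

  std::size_t size() const { return size_; }

  // Bounds-checked in debug builds; a plain pointer offset when NDEBUG is set.
  T& operator[](std::size_t i) const {
    assert(i < size_);
    return ptr_[i];
  }

  // Iterators are raw pointers.
  T* begin() const { return ptr_; }
  T* end() const { return ptr_ + size_; }

  // View of count elements starting at offset (hypothetical helper name).
  ToyArrayView subview(std::size_t offset, std::size_t count) const {
    assert(offset <= size_ && count <= size_ - offset);
    return ToyArrayView(ptr_ + offset, count);
  }

private:
  T* ptr_;
  std::size_t size_;
};
```

Passing such a view by value costs two words and lets a function accept any contiguous storage without knowing who owns it.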

  23. Teuchos::Array • Array is a replacement for std::vector. • The benefit of Array is integration with other Teuchos memory classes.

  24. Benefits of use • Initial release of Tpetra contained no pointers: • Replaced by RCP, ArrayRCP or appropriate iterator • Zero memory overhead w.r.t. Epetra. • Almost made me a lazier developer • Debugging abilities are excellent: • Extends beyond normal bounds checking; can put additional constraints on memory access. • Runtime build results in code that is as fast as C. • These memory utilities are unique to Trilinos. • Research-level capability • Production-level quality
