
Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters



Presentation Transcript


  1. Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters • Aaron Becker (abecker3@illinois.edu) • Isaac Dooley • Laxmikant Kale • SAAHPC, July 30 2009 • Champaign-Urbana, IL

  2. Target Application • Inhomogeneous material simulation • 3D finite elements (tetrahedra), explicit structural dynamics • Simple kernels compute forces on each element at every time step • Stiffness matrix varies with location in the material, making the computation very memory intensive • An existing Charm++/ParFUM application already runs on traditional clusters

  3. Target Hardware: NCSA Lincoln • 2x Intel Harpertown E5410 CPUs • 1/2 of a Tesla S1070 (2 GPUs) • Infiniband interconnect • 192 nodes • Runs spanning Abe and Lincoln nodes may be possible in the future • Multiple powerful CPUs and GPUs on each node. How do we take advantage of all of it?

  4. Approach • Over-decompose the mesh into many partitions per node • Write GPU and CPU implementations of computational kernels • Each partition can be handled by either CPU or GPU • Choose a mapping of partitions to hardware that maximizes utilization (see the sketch below) • Partitioning, ghost management, and synchronization are handled by ParFUM on the CPU • Goal: flexibility in the number and size of partitions and in the assignment of partitions to hardware
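
As a rough illustration of the mapping step, the sketch below assigns each over-decomposed partition to either a CPU core or a GPU. The types and function are hypothetical, not the ParFUM interface; the point is only that, because every partition has both a CPU and a GPU kernel, the split is a free parameter that can be tuned for utilization.

      #include <vector>

      // Hypothetical sketch, not the actual ParFUM interface: assign each
      // over-decomposed partition to a CPU core or a GPU. The split is a
      // tunable parameter because every partition has both kernel versions.
      enum class Target { CPU, GPU };

      struct PartitionAssignment {
        int partition_id;
        Target target;
      };

      std::vector<PartitionAssignment> map_partitions(int num_partitions,
                                                      int num_gpu_partitions) {
        std::vector<PartitionAssignment> mapping;
        mapping.reserve(num_partitions);
        for (int p = 0; p < num_partitions; ++p) {
          Target t = (p < num_gpu_partitions) ? Target::GPU : Target::CPU;
          mapping.push_back({p, t});
        }
        return mapping;
      }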

  5. ParFUM Hybrid API • Management of host/device node and element data • Compatible API for writing CPU and GPU kernels • On CPU: loop over nodes or elements using iterators • On GPU: each thread is responsible for one node or element • Functions for inter-partition synchronization

  6-7. ParFUM Hybrid API • CPU version: loop over nodes with an iterator

      nodeIterator itr;
      for (nodeItr_Begin(itr); nodeItr_IsValid(itr); nodeItr_Next(itr)) {
        n_data = node_GetData(itr);
        for (int i = 0; i < dof; ++i) {
          float a_old = n_data->a[i];
          n_data->a[i] = -n_data->F[i] / n_data->mass;
          n_data->v[i] += 0.5 * dt * (n_data->a[i] + a_old);
        }
      }

  8-9. ParFUM Hybrid API • GPU version: each thread updates its own node

      n_data = node_GPU_GetData(my_node);
      for (int i = 0; i < dof; ++i) {
        float a_old = n_data->a[i];
        n_data->a[i] = -n_data->F[i] / n_data->mass;
        n_data->v[i] += 0.5 * dt * (n_data->a[i] + a_old);
      }
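
The slides show only the per-node body. Below is a minimal sketch of how that body might be wrapped in a CUDA kernel with one thread per node, as slide 5 describes; the NodeData layout, kernel name, and launch configuration are assumptions rather than the actual ParFUM code.

      // Hypothetical sketch: one CUDA thread per node, as slide 5 describes.
      // NodeData, dof, and the launch configuration are illustrative only.
      struct NodeData { float a[3]; float v[3]; float F[3]; float mass; };

      __global__ void update_nodes(NodeData* nodes, int num_nodes,
                                   float dt, int dof) {
        int my_node = blockIdx.x * blockDim.x + threadIdx.x;
        if (my_node >= num_nodes) return;     // guard the final partial block
        NodeData* n_data = &nodes[my_node];   // stands in for node_GPU_GetData
        for (int i = 0; i < dof; ++i) {
          float a_old = n_data->a[i];
          n_data->a[i] = -n_data->F[i] / n_data->mass;
          n_data->v[i] += 0.5f * dt * (n_data->a[i] + a_old);
        }
      }

      // Launch with one thread per node in the partition, e.g.:
      //   update_nodes<<<(num_nodes + 255) / 256, 256>>>(d_nodes, num_nodes, dt, 3);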

  10. Managing Data Races • Independent GPU threads introduce races when they update common data structures (e.g. nodes updating shared element quantities) • Solution: perform each write into a separate slot in the element data structure, then accumulate the values in the next element kernel • Possible alternative: graph coloring
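
A minimal sketch of the slot scheme follows. The array names, the layout, and the use of a single adjacency per node are simplifying assumptions; the point is that each node thread owns a private slot, so no two threads ever write the same word, and the following element kernel reduces the slots race-free.

      // Hypothetical sketch of the slot scheme; names and layout are assumed.
      #define NODES_PER_ELEM 4   // tetrahedral elements

      // Node kernel: each node writes its contribution into the slot
      // reserved for it inside the adjacent element, so writes never collide.
      __global__ void node_writes(float* elem_slots, const int* adjacent_elem,
                                  const int* slot_in_elem, const float* contrib,
                                  int num_nodes) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= num_nodes) return;
        int e = adjacent_elem[n];   // one adjacency shown for brevity
        int s = slot_in_elem[n];    // this node's private slot in element e
        elem_slots[e * NODES_PER_ELEM + s] = contrib[n];
      }

      // Next element kernel: accumulate the slots, again race-free because
      // each thread owns exactly one element.
      __global__ void elem_accumulate(const float* elem_slots,
                                      float* elem_value, int num_elems) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= num_elems) return;
        float sum = 0.0f;
        for (int s = 0; s < NODES_PER_ELEM; ++s)
          sum += elem_slots[e * NODES_PER_ELEM + s];
        elem_value[e] = sum;
      }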

  11. Managing GPU Partitions • Each GPU partition needs to synchronize with the host at each step • We expect a large number of GPU partitions, so how will they be managed? • [Diagram: CPU partitions run on normal CPU cores, while the many GPU partitions are all handled by a dedicated GPU manager core]
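
One plausible reading of that arrangement, sketched below with assumed names (GpuPartition, partition_kernel), is a manager core that walks its list of GPU partitions each step, launching their kernels and copying their data to and from the device. The synchronous copies here are exactly what the optimizations on slide 15 remove.

      #include <cuda_runtime.h>
      #include <vector>

      // Hypothetical sketch of a GPU manager's per-step loop; the struct and
      // kernel are illustrative, not the actual ParFUM implementation.
      struct GpuPartition {
        float* h_data;     // host copy of this partition's node/element data
        float* d_data;     // device copy
        size_t bytes;
        int    blocks, threads;
      };

      __global__ void partition_kernel(float* data) { /* per-partition work */ }

      void gpu_manager_step(std::vector<GpuPartition>& parts) {
        for (GpuPartition& p : parts) {              // launch every partition
          cudaMemcpy(p.d_data, p.h_data, p.bytes, cudaMemcpyHostToDevice);
          partition_kernel<<<p.blocks, p.threads>>>(p.d_data);
        }
        cudaDeviceSynchronize();                     // wait for all kernels
        for (GpuPartition& p : parts)                // sync results back to host
          cudaMemcpy(p.h_data, p.d_data, p.bytes, cudaMemcpyDeviceToHost);
      }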

  12. Mapping and Load Balance

  13. Mapping and Load Balance • [Figure: CPU partitions, 2 per core]

  14. Mapping and Load Balance • [Figure: GPU partitions, 34 per node]

  15. Optimization • Pack: identify the data needed for synchronization and copy only that data between host and device • Async: run all memory transfers and kernels asynchronously (enables overlap) • Overlap: let the GPU manager cores run CPU partitions while waiting for GPU partitions
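
A minimal sketch of what Pack and Async might look like in CUDA is given below; the buffer and index names are assumptions. Only the gathered ghost values cross the bus, the staging buffer is pinned, and the copy is queued on a per-partition stream, so the manager core can keep running CPU partitions (Overlap) until the data is actually needed.

      #include <cuda_runtime.h>

      // Hypothetical sketch of Pack + Async; all names are illustrative.
      // h_ghosts must be pinned memory (cudaMallocHost) for the asynchronous
      // copy to overlap with other work.
      void send_ghosts_async(const float* node_value, const int* ghost_index,
                             int num_ghosts, float* h_ghosts, float* d_ghosts,
                             cudaStream_t stream) {
        // Pack: gather only the ghost values into the staging buffer.
        for (int g = 0; g < num_ghosts; ++g)
          h_ghosts[g] = node_value[ghost_index[g]];

        // Async: queue the transfer and return immediately; the kernel that
        // consumes d_ghosts is launched on the same stream, and the manager
        // core runs CPU partitions in the meantime.
        cudaMemcpyAsync(d_ghosts, h_ghosts, num_ghosts * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
      }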

  16. Characterizing Performance • [Figures: CPU and GPU kernel performance]

  17. Scaling

  18. Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters • Aaron Becker (abecker3@illinois.edu) • Isaac Dooley • Laxmikant Kale • SAAHPC, July 30 2009 • Champaign-Urbana, IL
