Flexible Hardware Mapping for Finite Element Simulations on Hybrid CPU/GPU Clusters • Aaron Becker (abecker3@illinois.edu) • Isaac Dooley • Laxmikant Kale • SAAHPC, July 30 2009 • Champaign-Urbana, IL
Target Application • Inhomogeneous material simulation • 3D finite elements (tetrahedra), explicit structural dynamics • Simple kernels compute forces on each element at every time step • Stiffness matrix varies with location in the material, making the simulation very memory intensive • Existing Charm++/ParFUM application works on traditional clusters
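To make the kernel shape concrete, here is a minimal sketch (in plain C) of the kind of per-element force computation described above. The struct layout, field names, and the dense 12x12 per-element stiffness matrix are illustrative assumptions, not the application's actual data structures.

  /* Hypothetical explicit-dynamics force kernel: for each tetrahedral
     element, gather nodal displacements, multiply by that element's own
     stiffness matrix (it varies with position in the material), and
     scatter the resulting forces back to the nodes. */
  typedef struct { float u[3]; float F[3]; } Node;          /* displacement, force */
  typedef struct { int conn[4]; float K[12][12]; } Element; /* connectivity, stiffness */

  void compute_element_forces(const Element *elems, int n_elems, Node *nodes) {
    for (int e = 0; e < n_elems; ++e) {
      float u_local[12], f_local[12];
      for (int a = 0; a < 4; ++a)                 /* gather nodal displacements */
        for (int i = 0; i < 3; ++i)
          u_local[3*a + i] = nodes[elems[e].conn[a]].u[i];
      for (int r = 0; r < 12; ++r) {              /* f = K * u, element-local */
        f_local[r] = 0.0f;
        for (int c = 0; c < 12; ++c)
          f_local[r] += elems[e].K[r][c] * u_local[c];
      }
      for (int a = 0; a < 4; ++a)                 /* scatter forces to nodes */
        for (int i = 0; i < 3; ++i)
          nodes[elems[e].conn[a]].F[i] += f_local[3*a + i];
    }
  }

Because every element carries its own stiffness matrix in this kind of simulation, the kernel streams a large amount of matrix data per element, which is why the slide calls it very memory intensive.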
Target Hardware: NCSA Lincoln • 2x Intel Harpertown E5410 CPUs • 1/2 of a Tesla S1070 (2 GPUs) • InfiniBand interconnect • 192 nodes • Runs spanning Abe and Lincoln nodes may be possible in the future • Multiple powerful CPUs and GPUs on each node. How do we take advantage of all of it?
Approach • Over-decompose the mesh into many partitions per node • Write GPU and CPU implementations of the computational kernels • Each partition can be handled by either the CPU or the GPU • Choose a mapping of partitions to hardware that maximizes utilization (a sketch of such a mapping follows below) • Partitioning, ghost management, and synchronization are handled by ParFUM on the CPU • Goal: flexibility in the number and size of partitions and in the assignment of partitions to hardware
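A minimal sketch of what such a mapping could look like on one node, assuming a tunable fraction of the partitions goes to the GPUs and the rest are spread over the CPU cores. This is only an illustration of the flexibility being described, not the scheme used in the work; the function and parameter names are hypothetical.

  typedef enum { ON_CPU, ON_GPU } Device;

  /* Assign n_parts partitions on one node: the first gpu_share fraction
     goes to the GPUs (round-robin), the remainder to the CPU cores.
     gpu_share would be tuned so both sides finish a step at about the
     same time. */
  void map_partitions(int n_parts, double gpu_share, int n_gpus, int n_cores,
                      Device dev[], int owner[]) {
    int n_gpu_parts = (int)(gpu_share * n_parts);
    for (int p = 0; p < n_parts; ++p) {
      if (p < n_gpu_parts) { dev[p] = ON_GPU; owner[p] = p % n_gpus; }
      else                 { dev[p] = ON_CPU; owner[p] = (p - n_gpu_parts) % n_cores; }
    }
  }

Over-decomposition is what makes such a knob useful: with many more partitions than processors, the CPU/GPU split can be adjusted in small increments without repartitioning the mesh.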
ParFUM Hybrid API • Management of host/device node and element data • Compatible API for writing CPU and GPU kernels • On CPU: loop over nodes or elements using iterators • On GPU: each thread is responsible for one node or element • Functions for inter-partition synchronization
ParFUM Hybrid API (CPU version)

  nodeIterator itr;
  /* Loop over the partition's nodes with the CPU iterator API: update each
     node's acceleration from its force and mass, then advance its velocity. */
  for (nodeItr_Begin(itr); nodeItr_IsValid(itr); nodeItr_Next(itr)) {
    n_data = node_GetData(itr);
    for (int i = 0; i < dof; ++i) {
      float a_old = n_data->a[i];
      n_data->a[i] = -n_data->F[i] / n_data->mass;
      n_data->v[i] += 0.5 * dt * (n_data->a[i] + a_old);
    }
  }
ParFUM Hybrid API (GPU version)

  /* Body of the per-node GPU kernel: each thread is responsible for one
     node (my_node) and performs the same update as the CPU iterator loop. */
  n_data = node_GPU_GetData(my_node);
  for (int i = 0; i < dof; ++i) {
    float a_old = n_data->a[i];
    n_data->a[i] = -n_data->F[i] / n_data->mass;
    n_data->v[i] += 0.5 * dt * (n_data->a[i] + a_old);
  }
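For context, this is one way the GPU variant above might be wrapped and launched with one thread per node. The kernel name, the NodeData layout, and the launch parameters are assumptions; the real code accesses node data through node_GPU_GetData rather than a raw array.

  /* Hypothetical node-update kernel and launch: one GPU thread per node
     of the partition, mirroring the loop body shown above. */
  typedef struct { float a[3], v[3], F[3]; float mass; } NodeData;  /* assumed layout */

  __global__ void update_nodes(NodeData *nodes, int n_nodes, float dt, int dof) {
    int my_node = blockIdx.x * blockDim.x + threadIdx.x;
    if (my_node >= n_nodes) return;                 /* guard the last block */
    NodeData *n_data = &nodes[my_node];
    for (int i = 0; i < dof; ++i) {
      float a_old = n_data->a[i];
      n_data->a[i] = -n_data->F[i] / n_data->mass;
      n_data->v[i] += 0.5f * dt * (n_data->a[i] + a_old);
    }
  }

  void launch_node_update(NodeData *d_nodes, int n_nodes, float dt, int dof) {
    int threads = 128;
    int blocks  = (n_nodes + threads - 1) / threads;
    update_nodes<<<blocks, threads>>>(d_nodes, n_nodes, dt, dof);
  }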
Managing Data Races • Independent GPU threads introduce races when they update common data structures (e.g. nodes updating shared element quantities) • Solution: perform each write into a separate slot of the element data structure, then accumulate the slot values in the next element kernel (see the sketch below) • Possible alternative: graph coloring
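A hedged sketch of the slot idea in CUDA: every writer owns a private slot in the element record, and the following element kernel reduces the slots, so no atomics are needed. The record layout, kernel names, and the simplifying assumption that each node writes into exactly one element are all illustrative.

  #define NODES_PER_ELEM 4

  /* Each of an element's nodes gets its own write slot, so concurrent
     node threads never touch the same memory location. */
  typedef struct {
    float slot[NODES_PER_ELEM];  /* one slot per adjacent node              */
    float accumulated;           /* reduced value, produced by element pass */
  } ElemData;

  /* Node pass: thread n writes only into its private slot of the element
     it contributes to (full node-to-element adjacency omitted for brevity). */
  __global__ void node_pass(ElemData *elems, const int *elem_of_node,
                            const int *slot_of_node, const float *node_value,
                            int n_nodes) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= n_nodes) return;
    elems[elem_of_node[n]].slot[slot_of_node[n]] = node_value[n];
  }

  /* Element pass: each element thread sums its own slots; race-free because
     every thread reads and writes only its own element record. */
  __global__ void elem_pass(ElemData *elems, int n_elems) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n_elems) return;
    float sum = 0.0f;
    for (int s = 0; s < NODES_PER_ELEM; ++s)
      sum += elems[e].slot[s];
    elems[e].accumulated = sum;
  }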
Managing GPU Partitions • Each GPU partition needs to synchronize with the host at each step • We expect a large number of GPU partitions, so how will they be managed? • [Diagram: normal CPU cores each hold a few CPU Partitions, while a dedicated GPU Manager core holds the many GPU Partitions and drives them on the device]
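A minimal sketch of what a GPU Manager core might do each step, reusing the NodeData type and launch_node_update helper from the earlier sketch. The partition struct and the synchronous copy-compute-copy structure are assumptions; the optimizations on a later slide relax them.

  /* Hypothetical per-step loop of a GPU Manager core: for every GPU
     partition it owns, ship the data to the device, run the kernel, and
     bring the results back so ParFUM can do ghost exchange on the host. */
  typedef struct {
    NodeData *h_nodes;   /* host copy of the partition's node data */
    NodeData *d_nodes;   /* device copy                            */
    int n_nodes;
  } GpuPartition;

  void gpu_manager_step(GpuPartition *parts, int n_parts, float dt, int dof) {
    for (int p = 0; p < n_parts; ++p) {
      size_t bytes = (size_t)parts[p].n_nodes * sizeof(NodeData);
      cudaMemcpy(parts[p].d_nodes, parts[p].h_nodes, bytes, cudaMemcpyHostToDevice);
      launch_node_update(parts[p].d_nodes, parts[p].n_nodes, dt, dof);
      cudaMemcpy(parts[p].h_nodes, parts[p].d_nodes, bytes, cudaMemcpyDeviceToHost);
    }
    /* inter-partition synchronization (ghosts) is then handled by ParFUM on the CPU */
  }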
Mapping and Load Balance • [Figure: CPU partitions, 2 per core]
Mapping and Load Balance • [Figure: GPU partitions, 34 per node]
Optimization • Pack: identify the data needed for synchronization and copy only that data between host and device • Async: run all memory transfers and kernels asynchronously (enables overlap) • Overlap: let the GPU manager cores run CPU partitions while waiting for GPU partitions
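A hedged sketch of how the three optimizations might combine, reusing the update_nodes kernel and NodeData type from the earlier sketches. The packed-ghost fields, per-partition streams, and pinned host buffers are assumptions for illustration, and the unpack kernel that would merge received ghost data into the resident partition is omitted.

  /* Pack + Async + Overlap (illustrative): only packed ghost data crosses
     the bus, all copies and kernels go onto per-partition CUDA streams
     (host buffers assumed pinned), and the manager core is free to run
     CPU partitions while the GPU drains its streams. */
  typedef struct {
    NodeData *d_nodes;              /* full partition, resident on the device */
    NodeData *h_ghosts, *d_ghosts;  /* packed boundary data only (Pack)       */
    int n_nodes, n_ghosts;
  } PackedPartition;

  void gpu_manager_step_async(PackedPartition *parts, int n_parts,
                              cudaStream_t *streams, float dt, int dof) {
    for (int p = 0; p < n_parts; ++p) {
      size_t ghost_bytes = (size_t)parts[p].n_ghosts * sizeof(NodeData);
      /* Pack + Async: only synchronization data moves, on this partition's stream */
      cudaMemcpyAsync(parts[p].d_ghosts, parts[p].h_ghosts, ghost_bytes,
                      cudaMemcpyHostToDevice, streams[p]);
      /* (an unpack kernel merging d_ghosts into d_nodes would run here) */
      int threads = 128, blocks = (parts[p].n_nodes + threads - 1) / threads;
      update_nodes<<<blocks, threads, 0, streams[p]>>>(parts[p].d_nodes,
                                                       parts[p].n_nodes, dt, dof);
      cudaMemcpyAsync(parts[p].h_ghosts, parts[p].d_ghosts, ghost_bytes,
                      cudaMemcpyDeviceToHost, streams[p]);
    }
    /* Overlap: while the GPU works, this core can execute CPU partitions here */
    for (int p = 0; p < n_parts; ++p)
      cudaStreamSynchronize(streams[p]);
  }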
Characterizing Performance • [Figure comparing CPU and GPU performance]