N-Body Simulations Kenneth Owens
The N-Body Problem
• We wish to compute the interactions between particles (bodies) given their masses and positions
• The simulation is performed in time steps
• Forces between all pairs of bodies are computed: O(n²)
• Positions of all bodies are updated based on their current kinematics and the interactions with the other bodies: O(n)
• Time moves forward by one step
Force Computation
• The force on a body i from the N other bodies is approximated by summing the pairwise interactions, given each body's mass (m), the distance vector between the pair (r_ij), and a softening factor (ε)
• This is computed for every body against all other bodies
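The equation this slide refers to did not survive conversion; below is the standard softened-gravity acceleration from the GPU Gems 3 chapter this project builds on, which matches the interaction() code shown later (with G folded into the units and ε² passed in as softening_squared):

\[
\vec{a}_i \approx G \sum_{1 \le j \le N} \frac{m_j \, \vec{r}_{ij}}{\left( \lVert \vec{r}_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}}
\]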
Position Updates
• Euler Method: for each particle, a discrete time step (dt) is used to approximate the continuous kinematic equations, updating the velocity and then the position of each particle
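Written out with a damping factor d, this is the update implemented by the update_position function later in the deck; note it is the semi-implicit variant, since the freshly updated velocity is used to advance the position:

\[
\vec{v}_{t+dt} = d \left( \vec{v}_t + \vec{a}_t \, dt \right), \qquad
\vec{x}_{t+dt} = \vec{x}_t + \vec{v}_{t+dt} \, dt
\]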
Project Objective • Execute an n-body simulation on a distributed memory architecture with multiple GPUs on each node
Project Accomplishments
• Sequential implementation of n-body simulation code
  • Written in C
  • Compiled using gcc-4.4 with –O3
• MPI implementation
  • Written in C
  • Compiled using mpicc.mpich with gcc-4.4 with –O3
  • Executed using mpirun.mpich on 2, 5, and 10 nodes
• GPU implementation
  • Written in C with CUDA extensions
  • Compiled using nvcc with gcc-4.4 with –O3
  • Executed on Nvidia 580s
• MPI-GPU implementation
  • The MPI driver above was combined with the GPU kernel implementation
  • Compiled but not tested for correctness
Driver Source Code

void nbody(vector4d_t* positions, vector4d_t* velocities,
           vector4d_t* current_positions, vector4d_t* current_velocities,
           vector3d_t* accel, size_t size,
           value_t dt, value_t damping, value_t softening_squared)
{
    compute_forces(positions, accel, size, positions, size, softening_squared);
    update_positions(positions, velocities, current_positions, current_velocities,
                     accel, size, dt, damping);
}

• The main method of the driver calls nbody
• nbody calls two externally linked functions
  • compute_forces computes the interactions
  • update_positions updates the kinematics
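The deck does not show the driver's main method; the sketch below is one plausible shape for it, assuming a hypothetical "nbody.h" project header and made-up constants. Only the per-step call to nbody() comes from the deck.

#include <stdlib.h>
#include <string.h>
#include "nbody.h"  /* assumed project header declaring vector3d_t, vector4d_t, value_t, nbody() */

int main(void)
{
    size_t n = 16384;        /* assumed problem size */
    size_t time_steps = 10;  /* the deck reports 10 time steps */
    value_t dt = 0.016f, damping = 1.0f, softening_squared = 0.01f;  /* assumed constants */

    vector4d_t* positions          = malloc(n * sizeof(vector4d_t));
    vector4d_t* velocities         = malloc(n * sizeof(vector4d_t));
    vector4d_t* current_positions  = malloc(n * sizeof(vector4d_t));
    vector4d_t* current_velocities = malloc(n * sizeof(vector4d_t));
    vector3d_t* accel              = malloc(n * sizeof(vector3d_t));

    /* ... initialize current_positions / current_velocities with the starting state ... */
    memcpy(positions,  current_positions,  n * sizeof(vector4d_t));
    memcpy(velocities, current_velocities, n * sizeof(vector4d_t));

    for (size_t step = 0; step < time_steps; step++)
    {
        memset(accel, 0, n * sizeof(vector3d_t));  /* reset accumulated accelerations (assumption) */
        nbody(positions, velocities, current_positions, current_velocities,
              accel, n, dt, damping, softening_squared);
        /* the new state becomes the current state for the next step */
        memcpy(current_positions,  positions,  n * sizeof(vector4d_t));
        memcpy(current_velocities, velocities, n * sizeof(vector4d_t));
    }

    free(positions); free(velocities); free(current_positions);
    free(current_velocities); free(accel);
    return 0;
}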
Sequential compute_forces
• Computes the pair-wise interactions
• Hidden second loop inside the acceleration function

void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size,
                    vector4d_t* sources, size_t sources_size, value_t softening_squared)
{
    for (size_t i = 0; i < positions_size; i++)
    {
        forces[i] = acceleration(positions[i], sources, sources_size,
                                 forces[i], softening_squared);
    }
}
Sequential Interaction Computation
• Computation for an individual interaction, written in C

vector3d_t interaction(vector3d_t acceleration, vector4d_t body1, vector4d_t body2,
                       value_t softening_squared)
{
    vector3d_t force;
    /* vector from body1 toward body2, so the accumulated pull is attractive */
    force.x = body2.x - body1.x;
    force.y = body2.y - body1.y;
    force.z = body2.z - body1.z;
    float distSqr = force.x * force.x + force.y * force.y + force.z * force.z;
    distSqr += softening_squared;
    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = body2.w * invDistCube;  /* mass of body2 over distance cubed */
    acceleration.x += force.x * s;
    acceleration.y += force.y * s;
    acceleration.z += force.z * s;
    return acceleration;
}
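A small illustrative check, not from the deck: a unit-mass source one unit away along x should pull the target toward it with unit acceleration when softening is zero.

vector3d_t acc = {0.0f, 0.0f, 0.0f};
vector4d_t target = {0.0f, 0.0f, 0.0f, 1.0f};  /* w component stores the mass */
vector4d_t source = {1.0f, 0.0f, 0.0f, 1.0f};
acc = interaction(acc, target, source, 0.0f);
/* expect acc = (1, 0, 0): m / d^2 = 1, directed toward the source */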
Sequential Kinematics Update
• Updates each position based on the computed forces

void update_positions(vector4d_t* positions, vector4d_t* velocities,
                      vector4d_t* current_positions, vector4d_t* current_velocities,
                      vector3d_t* acceleration, size_t size, value_t dt, value_t damping)
{
    for (size_t i = 0; i < size; i++)
    {
        vector4d_t current_position = current_positions[i];
        vector3d_t accel = acceleration[i];
        vector4d_t current_velocity = current_velocities[i];
        update_position(&positions[i], &velocities[i], current_position,
                        current_velocity, accel, dt, damping);
    }
}
Sequential Update Function
• Implements the previously shown equations

void update_position(vector4d_t* position, vector4d_t* velocity,
                     vector4d_t current_position, vector4d_t current_velocity,
                     vector3d_t acceleration, value_t dt, value_t damping)
{
    current_velocity.x += acceleration.x * dt;
    current_velocity.y += acceleration.y * dt;
    current_velocity.z += acceleration.z * dt;
    current_velocity.x *= damping;
    current_velocity.y *= damping;
    current_velocity.z *= damping;
    current_position.x += current_velocity.x * dt;
    current_position.y += current_velocity.y * dt;
    current_position.z += current_velocity.z * dt;
    *position = current_position;
    *velocity = current_velocity;
}
CUDA Implementation
• Started with the implementation from GPU Gems 3: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
• Modified the code to handle data sizes larger than 256 that are not evenly divisible by 256
• Added the kinematics update
• The code no longer works for sizes smaller than 256
• Command-line control of grid and block size was needed anyway (see the sketch below)
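The deck never shows how grid, threads, and sharedMemSize are chosen. The helper below is a minimal sketch of one plausible setup; the function and parameter names are assumptions, while the 256-thread blocks, the grid-stride cleanup, and the per-thread shared-memory tile come from the deck.

/* Hypothetical launch-configuration helper (names are assumptions). */
void configure_launch(int positions_size, int requested_grid,
                      dim3* grid, dim3* threads, size_t* sharedMemSize)
{
    int block_size = 256;  /* tile size used throughout the deck */
    int blocks = requested_grid > 0
               ? requested_grid  /* assumed command-line override */
               : (positions_size + block_size - 1) / block_size;  /* round up */
    *threads = dim3(block_size, 1, 1);
    *grid = dim3(blocks, 1, 1);
    /* one cached float4 position per thread in the block */
    *sharedMemSize = block_size * sizeof(float4);
}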
CUDA compute_forces
• Copies data to device memory and executes the compute_forces_gpu kernel
• Note: the cudaMalloc/cudaMemcpy setup is truncated to fit the code on the slide

void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size,
                    vector4d_t* sources, size_t sources_size, value_t softening_squared)
{
    .....
    compute_forces_gpu<<<grid, threads, sharedMemSize>>>(device_positions, device_forces,
                                                         positions_size, device_sources,
                                                         sources_size, softening_squared);
    cudaThreadSynchronize();
    cudaMemcpy(forces, device_forces, positions_size * sizeof(float3),
               cudaMemcpyDeviceToHost);
    cudaFree(device_positions);
    cudaFree(device_sources);
    cudaFree(device_forces);
    err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s\n", cudaGetErrorString(err));
    }
}
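The truncated setup presumably resembles the following sketch; the device pointer names come from the visible code, but everything else is an assumption (including that vector4d_t/vector3d_t share the layout of float4/float3, which the visible device-to-host copy already implies).

/* Assumed shape of the elided allocation and host-to-device copies. */
float4* device_positions;
float4* device_sources;
float3* device_forces;
cudaMalloc((void**)&device_positions, positions_size * sizeof(float4));
cudaMalloc((void**)&device_sources,   sources_size   * sizeof(float4));
cudaMalloc((void**)&device_forces,    positions_size * sizeof(float3));
cudaMemcpy(device_positions, positions, positions_size * sizeof(float4), cudaMemcpyHostToDevice);
cudaMemcpy(device_sources,   sources,   sources_size   * sizeof(float4), cudaMemcpyHostToDevice);
cudaMemcpy(device_forces,    forces,    positions_size * sizeof(float3), cudaMemcpyHostToDevice);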
CUDA compute_forces_gpu Kernel
• Every thread computes the acceleration for its position, then strides ahead by the full grid width to the next block of indices
• For our test sizes this only acted as cleanup for sizes not evenly divisible by 256

__global__ void compute_forces_gpu(float4* positions, float3* forces, int size,
                                   float4* sources, int sources_size,
                                   float softening_squared)
{
    for (int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
         index < size;
         index += blockDim.x * gridDim.x)
    {
        float4 pos = positions[index];
        forces[index] = acceleration(pos, sources, sources_size,
                                     forces[index], softening_squared);
    }
}
CUDA force computation
• Uses float3 and float4 instead of the home-brewed vector types
• Shared memory caches 256 positions per block
• Each thread strides across the grid to update a single particle

__device__ float3 acceleration(float4 position, float4* positions, int size,
                               float3 acc, float softening_squared)
{
    extern __shared__ float4 sharedPos[];
    int p = blockDim.x;
    int q = blockDim.y;
    int n = size;
    int numTiles = n / (p * q);
    for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
    {
        sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
            positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];
        __syncthreads();
        // This is the "tile_calculation" function from the GPU Gems 3 article.
        acc = gravitation(position, acc, softening_squared);
        __syncthreads();
    }
    return acc;
}
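The gravitation() device function is not shown in the deck. Based on the "tile_calculation" function in the GPU Gems 3 chapter it is adapted from, it presumably looks like this sketch, with the body-body math mirroring the sequential interaction() function:

/* Sketch of the unshown per-tile loop, assuming the GPU Gems 3 structure. */
__device__ float3 gravitation(float4 position, float3 acc, float softening_squared)
{
    extern __shared__ float4 sharedPos[];
    /* Accumulate the interaction with every body cached in this tile. */
    for (int i = 0; i < blockDim.x; i++)
    {
        float3 r;
        r.x = sharedPos[i].x - position.x;
        r.y = sharedPos[i].y - position.y;
        r.z = sharedPos[i].z - position.z;
        float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + softening_squared;
        float invDist = rsqrtf(distSqr);
        float s = sharedPos[i].w * invDist * invDist * invDist;  /* m_j / d^3 */
        acc.x += r.x * s;
        acc.y += r.y * s;
        acc.z += r.z * s;
    }
    return acc;
}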
CUDA update_positions
• The kernel strides across the grid in the same way as the force computation
• Each thread updates a single position; all positions are updated in parallel

__global__ void update_positions_gpu(float4* positions, float4* velocities,
                                     float4* current_positions, float4* current_velocities,
                                     float3* forces, int size, float dt, float damping)
{
    for (int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
         index < size;
         index += blockDim.x * gridDim.x)
    {
        float4 pos = current_positions[index];
        float3 accel = forces[index];
        float4 vel = current_velocities[index];
        vel.x += accel.x * dt;
        vel.y += accel.y * dt;
        vel.z += accel.z * dt;
        vel.x *= damping;
        vel.y *= damping;
        vel.z *= damping;
        // new position = old position + velocity * deltaTime
        pos.x += vel.x * dt;
        pos.y += vel.y * dt;
        pos.z += vel.z * dt;
        // store new position and velocity
        positions[index] = pos;
        velocities[index] = vel;
    }
}
MPI implementation
• An O(n²)/p pipelined implementation
• Particles are divided among processes
• Particle positions are shared around a ring communication topology
• Forces for all particles are computed by passing the position data around the ring
• After all forces are computed, each process updates the kinematics of its own particles
MPI Driver
• Compiles against both the CPU and GPU implementations
• Timings have only been collected for the CPU version

for (size_t i = 0; i < time_steps; i++)
{
    memcpy(sendbuf, current_positions, num_particles * sizeof(vector4d_t));
    for (pipe = 0; pipe < size; pipe++)
    {
        if (pipe != size - 1)
        {
            MPI_Isend(sendbuf, num_particles, mpi_vector4d_t, right, pipe,
                      commring, &request[0]);
            MPI_Irecv(recvbuf, num_particles, mpi_vector4d_t, left, pipe,
                      commring, &request[1]);
        }
        /* accumulate forces against the block of positions currently in sendbuf */
        compute_forces(positions, accel, num_particles, sendbuf, num_particles,
                       softening_squared);
        if (pipe != size - 1)
        {
            MPI_Waitall(2, request, statuses);
            /* the received block is passed on in the next pipeline stage */
            memcpy(sendbuf, recvbuf, num_particles * sizeof(vector4d_t));
        }
    }
    update_positions(positions, velocities, current_positions, current_velocities,
                     accel, num_particles, dt, damping);
}
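The mpi_vector4d_t datatype used above is not defined in the deck. A straightforward way to build it, assuming vector4d_t is four contiguous floats (i.e., value_t is float), would be:

/* Assumed construction of the derived datatype used by the driver. */
MPI_Datatype mpi_vector4d_t;
MPI_Type_contiguous(4, MPI_FLOAT, &mpi_vector4d_t);
MPI_Type_commit(&mpi_vector4d_t);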
Timings
• Taken with single-precision floats for the sequential and GPU versions
• Taken on tux for the MPI version
• All runs used 10 time steps
• Wall-clock time was collected for comparison
• Memory allocation time was omitted
  • Except for device memory allocation and device data transfer
• Timings were not collected for the code using MPI to distribute data over multiple nodes with multiple GPUs
Conclusions
• We achieved several orders of magnitude speed-up by moving to a GPU
• We achieved results similar to those reported in GPU Gems 3
• The sequential implementation was not optimal: it used neither SSE nor multiple cores, so it ran well below the theoretical peak FLOPS of the Xeon CPU
• The MPI driver showed that task-level parallelism can be exploited using distributed-memory computing
To Do List
• Run the MPI-GPU version on Draco
• FMM (Fast Multipole Method) MPI implementation
• Multi-device GPU implementation