Accelerating an N-Body Simulation

Anuj Kalia, Maxeler Technologies

The CPU loads the particle data into DRAM once per iteration (every N*N cycles).
A new set of 4 values is read from DRAM in every cycle.
16 force computations are performed, based on 16 scalar inputs and the 4 values just read. The pipeline and accumulator are described on another slide.
Every pipeline outputs 12 partial sums after N cycles.
The CPU adds the 12 partial sums together for every particle, updates the velocities, updates the positions, and writes the new data back to DRAM.
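The final CPU step (update velocities, then positions) could look roughly like the sketch below. The slides do not say which integrator is used, so the semi-implicit Euler scheme, the time step dt, and the function name update_particles are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side update step (not taken from the slides).
 * Assumes the accelerations ax/ay/az have already been summed per particle.
 * Velocity is advanced first, then position uses the new velocity
 * (semi-implicit Euler), matching the order "update velocity, update position". */
void update_particles(size_t n, float dt,
                      const float *ax, const float *ay, const float *az,
                      float *vx, float *vy, float *vz,
                      float *px, float *py, float *pz)
{
    assert(dt > 0.0f);
    for (size_t i = 0; i < n; i++) {
        vx[i] += ax[i] * dt;
        vy[i] += ay[i] * dt;
        vz[i] += az[i] * dt;

        px[i] += vx[i] * dt;
        py[i] += vy[i] * dt;
        pz[i] += vz[i] * dt;
    }
}
```

After this step the updated positions would be written back to DRAM before the next iteration starts.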

Host C code

for (int j = 0; j < N / PAR; j++) {   // outer loop runs N/PAR times
    // set scalar inputs
    max_set_scalar_input(device, "RowSumKernel.N", N, FPGA_A);
    max_set_scalar_input_f(device, "RowSumKernel.EPS", EPS, FPGA_A);
    for (int p = 0; p < PAR; p++) {
        max_set_scalar_input_f(device, pi_x[p], px[j * PAR + p], FPGA_A);
        max_set_scalar_input_f(device, pi_y[p], py[j * PAR + p], FPGA_A);
        max_set_scalar_input_f(device, pi_z[p], pz[j * PAR + p], FPGA_A);
    }

    // run the kernel for N cycles
    max_run(device,
            max_output("ax", outputX, 12 * PAR * sizeof(float)),
            max_output("ay", outputY, 12 * PAR * sizeof(float)),
            max_output("az", outputZ, 12 * PAR * sizeof(float)),
            max_runfor("RowSumKernel", N),
            max_end());

    // sum up the partial sums (12 per particle)
    for (int i = 0; i < 12 * PAR; i++) {
        ax[j * PAR + (i / 12)] += outputX[i];
        ay[j * PAR + (i / 12)] += outputY[i];
        az[j * PAR + (i / 12)] += outputZ[i];
    }
}
// update velocity
// update position
// load memory

Pipeline and Accumulator
Input: 1 per cycle, the P_j data from DRAM.
Acceleration: accumulated as 12 partial sums.

Resource Usage
Resource usage for the 16-fold parallel kernel at 150 MHz:
LUTs: 156032 / 297600 (52.43%)
FFs: 166543 / 595200 (27.98%)
BRAMs: 433 / 1064 (40.70%)
(resource not labeled): 288 / 2016 (14.29%)
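The clock rate and parallelism above, together with the host loop (the kernel is run N/PAR times for N cycles each), give a rough lower bound on kernel time per iteration. The helper below is a back-of-the-envelope sketch, not a measurement: it assumes one useful cycle per clock tick and ignores host-side work, scalar-input setup, and data transfer. The function name is hypothetical.

```c
#include <assert.h>

/* Idealized kernel time for one simulation iteration, in seconds:
 * (N / PAR) kernel runs, each lasting N cycles, at clock_hz cycles per second.
 * Ignores all host overhead, so real run times will be higher. */
double kernel_seconds_per_iteration(double n, double par, double clock_hz)
{
    assert(n > 0.0 && par > 0.0 && clock_hz > 0.0);
    return (n / par) * n / clock_hz;
}
```

For example, with N = 38400, PAR = 16 and a 150 MHz clock this gives 92,160,000 cycles, or about 0.61 s per iteration under these idealized assumptions.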

Performance: Comparison [plot; axes: Seconds, Particles]

Performance: Speedup [plot; axes: Speedup, Particles]

38400 Particles
