Parallel Workload Jongeun Lee Fall 2013
N-Body
• N-body problem: to find the positions and velocities of a collection of interacting particles over a period of time
• e.g., a collection of stars (astrophysics), or of molecules or atoms (chemistry)
• Input: the mass, position, and velocity of each particle at the start of the simulation
• Output: the position and velocity of each particle at a sequence of user-specified times
The Problem
• An n-body solver that simulates the motions of planets or stars
• Particle $q$ has mass $m_q$; at time $t$, its position is $\mathbf{s}_q(t)$, and the force exerted on it by particle $k$ is $\mathbf{f}_{qk}(t)$
• Total force on $q$, exerted by all particles $0, 1, \ldots, n-1$: $\mathbf{F}_q(t) = \sum_{k \neq q} \mathbf{f}_{qk}(t)$
• Applying Newton's second law, $\mathbf{F}_q = m_q \mathbf{a}_q = m_q \mathbf{s}_q''$, gives us a system of differential equations to solve
• Now let's find $\mathbf{s}_q(t)$ and $\mathbf{s}_q'(t)$ at a sequence of user-specified times $t = 0, \Delta t, 2\Delta t, \ldots$
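The slide does not spell out the pairwise force or the time discretization, but both can be read off the serial code that follows: the force is Newtonian gravity, and the update is a forward Euler step. A minimal sketch of both:

    \mathbf{f}_{qk}(t) = -\frac{G\, m_q m_k}{\left| \mathbf{s}_q(t) - \mathbf{s}_k(t) \right|^3} \left[ \mathbf{s}_q(t) - \mathbf{s}_k(t) \right]

    \mathbf{s}_q(t + \Delta t) \approx \mathbf{s}_q(t) + \Delta t\, \mathbf{s}_q'(t),
    \qquad
    \mathbf{s}_q'(t + \Delta t) \approx \mathbf{s}_q'(t) + \Delta t\, \frac{\mathbf{F}_q(t)}{m_q}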
Serial Program
• Get input data
• for each timestep:
  • Print positions and velocities of particles
  • for each particle q: compute total force on q
  • for each particle q: compute position and velocity of q
First Inner Loop: Basic Algorithm

for each particle q {
    for each particle k != q {
        x_diff = pos[q][X] - pos[k][X];
        y_diff = pos[q][Y] - pos[k][Y];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
        forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
    }
}
Reduced Algorithm
• Exploits Newton's third law ($\mathbf{f}_{kq} = -\mathbf{f}_{qk}$): each pair force is computed once and applied to both particles

for each particle q
    forces[q][X] = forces[q][Y] = 0;
for each particle q {
    for each particle k > q {
        x_diff = pos[q][X] - pos[k][X];
        y_diff = pos[q][Y] - pos[k][Y];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        /* minus sign matches the basic algorithm: gravity is attractive */
        force_qk[X] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
        force_qk[Y] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
        forces[q][X] += force_qk[X];
        forces[q][Y] += force_qk[Y];
        forces[k][X] -= force_qk[X];
        forces[k][Y] -= force_qk[Y];
    }
}
Completing Serial Program
• Second inner loop: computing position and velocity of q

pos[q][X] += delta_t*vel[q][X];
pos[q][Y] += delta_t*vel[q][Y];
vel[q][X] += delta_t/masses[q]*forces[q][X];
vel[q][Y] += delta_t/masses[q]*forces[q][Y];
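Putting the loops together, a minimal self-contained C sketch of the serial solver (2-D, basic algorithm; the particle count, initial conditions, and step count are illustrative assumptions, not from the slides):

#include <math.h>
#include <stdio.h>

#define N 4                        /* number of particles (assumption) */
#define X 0
#define Y 1
const double G = 6.673e-11;        /* gravitational constant */

double pos[N][2], vel[N][2], forces[N][2], masses[N];

int main(void) {
    int n_steps = 100;             /* assumption */
    double delta_t = 0.01;         /* assumption */

    /* toy initialization; real input would come from a file */
    for (int q = 0; q < N; q++) {
        masses[q] = 1.0e10;
        pos[q][X] = q;  pos[q][Y] = 0.0;
        vel[q][X] = 0.0; vel[q][Y] = (q % 2) ? 1.0 : -1.0;
    }

    for (int step = 0; step < n_steps; step++) {
        /* first inner loop: total force on each particle (basic algorithm) */
        for (int q = 0; q < N; q++) {
            forces[q][X] = forces[q][Y] = 0.0;
            for (int k = 0; k < N; k++) {
                if (k == q) continue;
                double x_diff = pos[q][X] - pos[k][X];
                double y_diff = pos[q][Y] - pos[k][Y];
                double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
                double dist_cubed = dist*dist*dist;
                forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
                forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
            }
        }
        /* second inner loop: Euler update of position and velocity */
        for (int q = 0; q < N; q++) {
            pos[q][X] += delta_t*vel[q][X];
            pos[q][Y] += delta_t*vel[q][Y];
            vel[q][X] += delta_t/masses[q]*forces[q][X];
            vel[q][Y] += delta_t/masses[q]*forces[q][Y];
        }
    }
    for (int q = 0; q < N; q++)
        printf("particle %d: pos=(%g, %g) vel=(%g, %g)\n",
               q, pos[q][X], pos[q][Y], vel[q][X], vel[q][Y]);
    return 0;
}

Per the serial program outline above, the positions and velocities would normally be printed inside the timestep loop; here they are printed once at the end to keep the sketch short.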
Mapping
• How to map tasks to cores?
• Two dimensions of parallelism: n particles, and T timesteps
• Load balancing? (the basic algorithm is naturally balanced; in the reduced algorithm the k > q loop gives later particles less work)
• Shared memory vs. message passing (see the OpenMP sketch below)
• Optimized algorithms: hierarchical methods (e.g., Barnes-Hut, Fast Multipole Method) reduce the all-pairs O(n^2) force computation
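For the shared-memory option, one natural mapping is to parallelize the outer loop over particles. The slides do not fix a programming model, so the OpenMP sketch below is an assumption (compile with -fopenmp):

#include <math.h>

#define X 0
#define Y 1

/* One timestep of the basic algorithm, parallelized over particles.
   Each thread owns a block of q values and writes only forces[q],
   pos[q], and vel[q], so no synchronization is needed. */
void nbody_step(int n, double delta_t, double G,
                double pos[][2], double vel[][2],
                double forces[][2], const double masses[])
{
    #pragma omp parallel for schedule(static)
    for (int q = 0; q < n; q++) {
        forces[q][X] = forces[q][Y] = 0.0;
        for (int k = 0; k < n; k++) {
            if (k == q) continue;
            double x_diff = pos[q][X] - pos[k][X];
            double y_diff = pos[q][Y] - pos[k][Y];
            double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            double dist_cubed = dist*dist*dist;
            forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
            forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
        }
    }
    #pragma omp parallel for schedule(static)
    for (int q = 0; q < n; q++) {
        pos[q][X] += delta_t*vel[q][X];
        pos[q][Y] += delta_t*vel[q][Y];
        vel[q][X] += delta_t/masses[q]*forces[q][X];
        vel[q][Y] += delta_t/masses[q]*forces[q][Y];
    }
}

The reduced algorithm is harder to parallelize this way: the forces[k] -= ... updates create write conflicts between threads, and the k > q loop is imbalanced, so it typically needs per-thread force buffers and a cyclic mapping of particles to threads.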
Monte Carlo Method
• Popular in computational physics, numerical integration, optimization, etc.
• Basic idea: use repeated random sampling to obtain numerical results
  • e.g., how do you calculate the probability of a solitaire game coming out successfully? Play many randomly shuffled games and count the fraction of wins
• Very useful for:
  • modeling phenomena with significant uncertainty in inputs (e.g., calculation of risk in business)
  • simulating systems with many coupled degrees of freedom
  • evaluating multidimensional definite integrals with complicated boundary conditions
Source: Wikipedia
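A minimal C sketch of the sampling idea, estimating π (this example is illustrative, not from the slides): a point drawn uniformly in the unit square lands inside the quarter circle with probability π/4, so the hit fraction estimates π.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n_samples = 10000000, hits = 0;
    srand(12345);                             /* fixed seed for reproducibility */
    for (long i = 0; i < n_samples; i++) {
        double x = (double)rand() / RAND_MAX; /* uniform in [0,1] */
        double y = (double)rand() / RAND_MAX;
        if (x*x + y*y <= 1.0)                 /* inside the quarter circle? */
            hits++;
    }
    printf("pi ~ %f\n", 4.0 * hits / n_samples);
    return 0;
}

The same pattern answers the solitaire question: play many randomly shuffled games and report the fraction won. Each sample is independent, which is why Monte Carlo workloads parallelize so easily.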
MapReduce
[Figure: MapReduce dataflow, from http://mm-tom.s3.amazonaws.com/blog/MapReduce.png]
MapReduce Example
[Figure: MapReduce example, from http://blogs.vmware.com/vfabric/files/2013/05/map-reduce-core-idea_numbered.jpg]
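A toy C sketch of the canonical word-count example (the input strings and array sizes are made up): the map phase emits (word, 1) pairs, sorting stands in for the shuffle that groups equal keys, and the reduce phase sums each group.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024
#define MAX_WORD  32

/* A key-value pair emitted by the map phase. */
typedef struct { char key[MAX_WORD]; int value; } Pair;

static Pair pairs[MAX_PAIRS];
static int n_pairs = 0;

/* Map: emit (word, 1) for every word in the input chunk. */
static void map(const char *chunk) {
    char buf[256];
    strncpy(buf, chunk, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " "); w && n_pairs < MAX_PAIRS;
         w = strtok(NULL, " ")) {
        strncpy(pairs[n_pairs].key, w, MAX_WORD - 1);
        pairs[n_pairs].key[MAX_WORD - 1] = '\0';
        pairs[n_pairs].value = 1;
        n_pairs++;
    }
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
}

int main(void) {
    /* two "input splits"; in a real framework each map runs on a different node */
    map("the quick brown fox");
    map("the lazy dog and the fox");

    /* shuffle: group identical keys together (here, by sorting) */
    qsort(pairs, n_pairs, sizeof(Pair), cmp);

    /* reduce: sum the values of each run of equal keys */
    for (int i = 0; i < n_pairs; ) {
        int j = i, sum = 0;
        while (j < n_pairs && strcmp(pairs[j].key, pairs[i].key) == 0)
            sum += pairs[j++].value;
        printf("%s: %d\n", pairs[i].key, sum);
        i = j;
    }
    return 0;
}

In a real framework the map calls run on different nodes and the shuffle moves pairs across the network; the structure, however, is the same.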
Structured Grid
• A simple stencil: the value of the red node is updated by a linear combination of the values of the blue nodes (see the sketch below)
• More generally:
  • physics simulation (e.g., simulation of the strong nuclear force, temperature of an oven, etc.)
  • multiple refinements, until an error threshold is reached
  • many dimensions, and many nodes (over 10^7 for QCD)
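A minimal C sketch of such a solver (grid size, coefficients, boundary values, and the convergence test are assumptions): a 5-point Jacobi sweep replaces each interior node by the average of its four neighbors, repeating until the largest change drops below a threshold.

#include <math.h>
#include <stdio.h>

#define NX 32
#define NY 32

/* Jacobi iteration with a 5-point stencil: each interior node becomes the
   average of its four neighbors. Updates go to u_new (double-buffering),
   so every read in a sweep sees the previous iteration's values. */
void solve(double u[NX][NY], double u_new[NX][NY], double tol) {
    double max_diff;
    do {
        max_diff = 0.0;
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++) {
                u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                      u[i][j-1] + u[i][j+1]);
                double d = fabs(u_new[i][j] - u[i][j]);
                if (d > max_diff) max_diff = d;
            }
        for (int i = 1; i < NX - 1; i++)   /* copy back; boundaries stay fixed */
            for (int j = 1; j < NY - 1; j++)
                u[i][j] = u_new[i][j];
    } while (max_diff > tol);
}

int main(void) {
    static double u[NX][NY], u_new[NX][NY];  /* zero-initialized */
    for (int j = 0; j < NY; j++) u[0][j] = 100.0;   /* one hot edge */
    solve(u, u_new, 1e-4);
    printf("center value: %f\n", u[NX/2][NY/2]);
    return 0;
}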
Parallel Implementation
• Simpler case: 1-, 2-, or 3-dimensional grids
  • assign to each processor a chunk of the grid, determined by a partitioning algorithm
  • each processor computes the updates of its chunk in each iteration
  • partitioning may be done statically or dynamically
• Ghost nodes: local copies of the border grid points owned by neighboring processors (see the MPI sketch below)
• Double-buffering: updates are written to a second copy of the grid, so reads within an iteration see only the previous iteration's values
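A sketch of the ghost-node exchange for the simplest (1-D) decomposition, written with MPI as an assumption (the slides do not name a message-passing library): each process stores its local_n points in u[1..local_n] plus one ghost cell at each end, refreshed from the neighbors before every update sweep.

#include <mpi.h>
#include <stdio.h>

/* Exchange ghost cells with the left/right neighbors of a 1-D block
   decomposition. u[0] and u[local_n+1] are the ghost cells; ranks at
   the ends of the chain talk to MPI_PROC_NULL, a no-op partner. */
void exchange_ghosts(double *u, int local_n, int rank, int size, MPI_Comm comm) {
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my leftmost interior point left; receive my right ghost from the right */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my rightmost interior point right; receive my left ghost from the left */
    MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { LOCAL_N = 8 };
    double u[LOCAL_N + 2];
    for (int i = 0; i <= LOCAL_N + 1; i++) u[i] = rank;  /* toy data */

    exchange_ghosts(u, LOCAL_N, rank, size, MPI_COMM_WORLD);
    printf("rank %d: left ghost=%g right ghost=%g\n", rank, u[0], u[LOCAL_N + 1]);

    MPI_Finalize();
    return 0;
}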
Advanced Solvers
• Multigrid
  • instead of fixed-size chunks, make several copies of the grid at various granularities
  • a result from one node can propagate more quickly to far-away nodes
  • faster convergence
Advanced Solvers
• Adaptive mesh refinement (AMR)
  • finer discretization in regions where the solution changes more rapidly in space or time
  • convergence rate can improve vastly
  • after each update, check the solution and decide whether to subdivide the region
Advanced Solvers
• Invariants
  • no communication between PEs on a given refinement level, except for the ghost nodes
  • only one (or a small number) of the prior iterations is needed
Example
• Finite difference method
  • simulate the temperature of a rod, insulated except at the ends
  • temperature: $u(x,t)$
  • heat equation: $\dfrac{\partial u}{\partial t} = \alpha\, \dfrac{\partial^2 u}{\partial x^2}$
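A minimal serial C sketch (the grid resolution, α, and the initial condition are assumptions): explicit finite differences replace the derivatives, giving the update $u_i^{m+1} = u_i^m + \alpha\,\Delta t/\Delta x^2\,(u_{i+1}^m - 2u_i^m + u_{i-1}^m)$, with the end temperatures held fixed.

#include <stdio.h>

#define NX 21                      /* grid points along the rod (assumption) */

int main(void) {
    double u[NX], u_new[NX];
    double alpha = 1.0, dx = 1.0 / (NX - 1);
    double dt = 0.4 * dx * dx / alpha;    /* stable: alpha*dt/dx^2 <= 1/2 */
    double r = alpha * dt / (dx * dx);

    /* initial condition: hot spot in the middle, ends held at 0 */
    for (int i = 0; i < NX; i++)
        u[i] = (i == NX / 2) ? 100.0 : 0.0;

    for (int step = 0; step < 500; step++) {
        for (int i = 1; i < NX - 1; i++)
            u_new[i] = u[i] + r * (u[i+1] - 2.0*u[i] + u[i-1]);
        u_new[0] = u[0];                  /* boundary: fixed end temperatures */
        u_new[NX-1] = u[NX-1];
        for (int i = 0; i < NX; i++) u[i] = u_new[i];
    }
    for (int i = 0; i < NX; i++) printf("%6.2f ", u[i]);
    printf("\n");
    return 0;
}

Note the stability condition of the explicit scheme: $\alpha\,\Delta t/\Delta x^2 \le 1/2$ limits how large a timestep can be taken.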
Summary I
• Application
  • 13 dwarfs that form the core of future apps
  • span graphics, machine learning, AI, etc.
• Hardware
  • 1000s of cores on a die
  • heterogeneous cores for area and power advantages
  • shared/transactional memory and full/empty bits for synchronization
  • on-chip communication crucial for high performance
Summary II
• Programming models
  • based on psychology research to make them more intuitive
  • independent of the number of processors
• Systems software
  • using search-based autotuners instead of compilers
  • virtual machine based approach
• Success metrics
  • maximize programmer productivity and app performance
  • hardware counters & monitors to help calibrate success
  • others?
Summary III “This report is intended to be the start of a conversation about these perspectives. There is an open, exciting, and urgent research agenda…”