Parallel Workload Jongeun Lee Fall 2013
N-Body
• N-body problem: to find the positions and velocities of a collection of interacting particles over a period of time
• e.g., a collection of stars (astrophysics), or of molecules or atoms (chemistry)
• Input: the mass, position, and velocity of each particle at the start of the simulation
• Output: the position and velocity of each particle at a sequence of user-specified times
The Problem
• An n-body solver that simulates the motions of planets or stars
• Particle $q$ has mass $m_q$; at time $t$, its position is $\mathbf{s}_q(t)$, and the force exerted on it by particle $k$ is $\mathbf{f}_{qk}(t)$
• Total force on $q$, exerted by all particles $0, 1, \ldots, n-1$: $\mathbf{F}_q(t) = \sum_{k \neq q} \mathbf{f}_{qk}(t)$
• Applying Newton's second law, $\mathbf{F}_q = m_q \mathbf{a}_q = m_q \mathbf{s}_q''$, gives us a system of differential equations to solve
• Now let's find $\mathbf{s}_q(t)$ and $\mathbf{s}_q'(t)$ at a sequence of user-specified times $t = 0, \Delta t, 2\Delta t, \ldots$
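The slide does not spell out the pairwise force or the time discretization, but both can be read off the serial code that follows: the force is Newtonian gravity, and the update is a forward Euler step. A minimal sketch of both:

    \mathbf{f}_{qk}(t) = -\frac{G\, m_q m_k}{\left| \mathbf{s}_q(t) - \mathbf{s}_k(t) \right|^3} \left[ \mathbf{s}_q(t) - \mathbf{s}_k(t) \right]

    \mathbf{s}_q(t + \Delta t) \approx \mathbf{s}_q(t) + \Delta t\, \mathbf{s}_q'(t),
    \qquad
    \mathbf{s}_q'(t + \Delta t) \approx \mathbf{s}_q'(t) + \Delta t\, \frac{\mathbf{F}_q(t)}{m_q}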
Serial Program
• Get input data
• for each timestep:
  • Print positions and velocities of particles
  • for each particle q: compute total force on q
  • for each particle q: compute position and velocity of q
First Inner Loop: Basic Algorithm

for each particle q {
    for each particle k != q {
        x_diff = pos[q][X] - pos[k][X];
        y_diff = pos[q][Y] - pos[k][Y];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
        forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
    }
}
Reduced Algorithm
• Exploits Newton's third law ($\mathbf{f}_{kq} = -\mathbf{f}_{qk}$): each pair force is computed once and applied to both particles

for each particle q
    forces[q][X] = forces[q][Y] = 0;
for each particle q {
    for each particle k > q {
        x_diff = pos[q][X] - pos[k][X];
        y_diff = pos[q][Y] - pos[k][Y];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        /* minus sign matches the basic algorithm: gravity is attractive */
        force_qk[X] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
        force_qk[Y] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
        forces[q][X] += force_qk[X];
        forces[q][Y] += force_qk[Y];
        forces[k][X] -= force_qk[X];
        forces[k][Y] -= force_qk[Y];
    }
}
Completing Serial Program
• Second inner loop: computing position and velocity of q

pos[q][X] += delta_t*vel[q][X];
pos[q][Y] += delta_t*vel[q][Y];
vel[q][X] += delta_t/masses[q]*forces[q][X];
vel[q][Y] += delta_t/masses[q]*forces[q][Y];
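Putting the loops together, a minimal self-contained C sketch of the serial solver (2-D, basic algorithm; the particle count, initial conditions, and step count are illustrative assumptions, not from the slides):

#include <math.h>
#include <stdio.h>

#define N 4                        /* number of particles (assumption) */
#define X 0
#define Y 1
const double G = 6.673e-11;        /* gravitational constant */

double pos[N][2], vel[N][2], forces[N][2], masses[N];

int main(void) {
    int n_steps = 100;             /* assumption */
    double delta_t = 0.01;         /* assumption */

    /* toy initialization; real input would come from a file */
    for (int q = 0; q < N; q++) {
        masses[q] = 1.0e10;
        pos[q][X] = q;  pos[q][Y] = 0.0;
        vel[q][X] = 0.0; vel[q][Y] = (q % 2) ? 1.0 : -1.0;
    }

    for (int step = 0; step < n_steps; step++) {
        /* first inner loop: total force on each particle (basic algorithm) */
        for (int q = 0; q < N; q++) {
            forces[q][X] = forces[q][Y] = 0.0;
            for (int k = 0; k < N; k++) {
                if (k == q) continue;
                double x_diff = pos[q][X] - pos[k][X];
                double y_diff = pos[q][Y] - pos[k][Y];
                double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
                double dist_cubed = dist*dist*dist;
                forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
                forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
            }
        }
        /* second inner loop: Euler update of position and velocity */
        for (int q = 0; q < N; q++) {
            pos[q][X] += delta_t*vel[q][X];
            pos[q][Y] += delta_t*vel[q][Y];
            vel[q][X] += delta_t/masses[q]*forces[q][X];
            vel[q][Y] += delta_t/masses[q]*forces[q][Y];
        }
    }
    for (int q = 0; q < N; q++)
        printf("particle %d: pos=(%g, %g) vel=(%g, %g)\n",
               q, pos[q][X], pos[q][Y], vel[q][X], vel[q][Y]);
    return 0;
}

Per the serial program outline above, the positions and velocities would normally be printed inside the timestep loop; here they are printed once at the end to keep the sketch short.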
Mapping
• How to map tasks to cores?
• Two dimensions of parallelism: n particles, and T timesteps
• Load balancing? (the basic algorithm is naturally balanced; in the reduced algorithm the k > q loop gives later particles less work)
• Shared memory vs. message passing (see the OpenMP sketch below)
• Optimized algorithms: hierarchical methods (e.g., Barnes-Hut, Fast Multipole Method) reduce the all-pairs O(n^2) force computation
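For the shared-memory option, one natural mapping is to parallelize the outer loop over particles. The slides do not fix a programming model, so the OpenMP sketch below is an assumption (compile with -fopenmp):

#include <math.h>

#define X 0
#define Y 1

/* One timestep of the basic algorithm, parallelized over particles.
   Each thread owns a block of q values and writes only forces[q],
   pos[q], and vel[q], so no synchronization is needed. */
void nbody_step(int n, double delta_t, double G,
                double pos[][2], double vel[][2],
                double forces[][2], const double masses[])
{
    #pragma omp parallel for schedule(static)
    for (int q = 0; q < n; q++) {
        forces[q][X] = forces[q][Y] = 0.0;
        for (int k = 0; k < n; k++) {
            if (k == q) continue;
            double x_diff = pos[q][X] - pos[k][X];
            double y_diff = pos[q][Y] - pos[k][Y];
            double dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            double dist_cubed = dist*dist*dist;
            forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
            forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
        }
    }
    #pragma omp parallel for schedule(static)
    for (int q = 0; q < n; q++) {
        pos[q][X] += delta_t*vel[q][X];
        pos[q][Y] += delta_t*vel[q][Y];
        vel[q][X] += delta_t/masses[q]*forces[q][X];
        vel[q][Y] += delta_t/masses[q]*forces[q][Y];
    }
}

The reduced algorithm is harder to parallelize this way: the forces[k] -= ... updates create write conflicts between threads, and the k > q loop is imbalanced, so it typically needs per-thread force buffers and a cyclic mapping of particles to threads.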
Monte Carlo Method
• Popular in computational physics, numerical integration, optimization, etc.
• Basic idea: use repeated random sampling to obtain numerical results
  • e.g., how do you calculate the probability of a solitaire game coming out successfully? Play many randomly shuffled games and count the fraction of wins
• Very useful for:
  • modeling phenomena with significant uncertainty in inputs (e.g., calculation of risk in business)
  • simulating systems with many coupled degrees of freedom
  • evaluating multidimensional definite integrals with complicated boundary conditions
Source: Wikipedia
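A minimal C sketch of the sampling idea, estimating π (this example is illustrative, not from the slides): a point drawn uniformly in the unit square lands inside the quarter circle with probability π/4, so the hit fraction estimates π.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n_samples = 10000000, hits = 0;
    srand(12345);                             /* fixed seed for reproducibility */
    for (long i = 0; i < n_samples; i++) {
        double x = (double)rand() / RAND_MAX; /* uniform in [0,1] */
        double y = (double)rand() / RAND_MAX;
        if (x*x + y*y <= 1.0)                 /* inside the quarter circle? */
            hits++;
    }
    printf("pi ~ %f\n", 4.0 * hits / n_samples);
    return 0;
}

The same pattern answers the solitaire question: play many randomly shuffled games and report the fraction won. Each sample is independent, which is why Monte Carlo workloads parallelize so easily.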
MapReduce
[Figure: MapReduce dataflow, from http://mm-tom.s3.amazonaws.com/blog/MapReduce.png]
MapReduce Example
[Figure: MapReduce example, from http://blogs.vmware.com/vfabric/files/2013/05/map-reduce-core-idea_numbered.jpg]
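A toy C sketch of the canonical word-count example (the input strings and array sizes are made up): the map phase emits (word, 1) pairs, sorting stands in for the shuffle that groups equal keys, and the reduce phase sums each group.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024
#define MAX_WORD  32

/* A key-value pair emitted by the map phase. */
typedef struct { char key[MAX_WORD]; int value; } Pair;

static Pair pairs[MAX_PAIRS];
static int n_pairs = 0;

/* Map: emit (word, 1) for every word in the input chunk. */
static void map(const char *chunk) {
    char buf[256];
    strncpy(buf, chunk, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " "); w && n_pairs < MAX_PAIRS;
         w = strtok(NULL, " ")) {
        strncpy(pairs[n_pairs].key, w, MAX_WORD - 1);
        pairs[n_pairs].key[MAX_WORD - 1] = '\0';
        pairs[n_pairs].value = 1;
        n_pairs++;
    }
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
}

int main(void) {
    /* two "input splits"; in a real framework each map runs on a different node */
    map("the quick brown fox");
    map("the lazy dog and the fox");

    /* shuffle: group identical keys together (here, by sorting) */
    qsort(pairs, n_pairs, sizeof(Pair), cmp);

    /* reduce: sum the values of each run of equal keys */
    for (int i = 0; i < n_pairs; ) {
        int j = i, sum = 0;
        while (j < n_pairs && strcmp(pairs[j].key, pairs[i].key) == 0)
            sum += pairs[j++].value;
        printf("%s: %d\n", pairs[i].key, sum);
        i = j;
    }
    return 0;
}

In a real framework the map calls run on different nodes and the shuffle moves pairs across the network; the structure, however, is the same.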
Structured Grid
• A simple stencil: the value of the red node is updated by a linear combination of the values of the blue nodes (see the sketch below)
• More generally:
  • physics simulation (e.g., simulation of the strong nuclear force, temperature of an oven, etc.)
  • multiple refinements, until an error threshold is reached
  • many dimensions, and many nodes (over 10^7 for QCD)
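A minimal C sketch of such a solver (grid size, coefficients, boundary values, and the convergence test are assumptions): a 5-point Jacobi sweep replaces each interior node by the average of its four neighbors, repeating until the largest change drops below a threshold.

#include <math.h>
#include <stdio.h>

#define NX 32
#define NY 32

/* Jacobi iteration with a 5-point stencil: each interior node becomes the
   average of its four neighbors. Updates go to u_new (double-buffering),
   so every read in a sweep sees the previous iteration's values. */
void solve(double u[NX][NY], double u_new[NX][NY], double tol) {
    double max_diff;
    do {
        max_diff = 0.0;
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++) {
                u_new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                      u[i][j-1] + u[i][j+1]);
                double d = fabs(u_new[i][j] - u[i][j]);
                if (d > max_diff) max_diff = d;
            }
        for (int i = 1; i < NX - 1; i++)   /* copy back; boundaries stay fixed */
            for (int j = 1; j < NY - 1; j++)
                u[i][j] = u_new[i][j];
    } while (max_diff > tol);
}

int main(void) {
    static double u[NX][NY], u_new[NX][NY];  /* zero-initialized */
    for (int j = 0; j < NY; j++) u[0][j] = 100.0;   /* one hot edge */
    solve(u, u_new, 1e-4);
    printf("center value: %f\n", u[NX/2][NY/2]);
    return 0;
}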
Parallel Implementation
• Simpler case: 1-, 2-, or 3-dimensional grids
  • assign to each processor a chunk of the grid, determined by a partitioning algorithm
  • each processor computes the updates of its chunk in each iteration
  • partitioning may be done statically or dynamically
• Ghost nodes: local copies of the border grid points owned by neighboring processors (see the MPI sketch below)
• Double-buffering: updates are written to a second copy of the grid, so reads within an iteration see only the previous iteration's values
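A sketch of the ghost-node exchange for the simplest (1-D) decomposition, written with MPI as an assumption (the slides do not name a message-passing library): each process stores its local_n points in u[1..local_n] plus one ghost cell at each end, refreshed from the neighbors before every update sweep.

#include <mpi.h>
#include <stdio.h>

/* Exchange ghost cells with the left/right neighbors of a 1-D block
   decomposition. u[0] and u[local_n+1] are the ghost cells; ranks at
   the ends of the chain talk to MPI_PROC_NULL, a no-op partner. */
void exchange_ghosts(double *u, int local_n, int rank, int size, MPI_Comm comm) {
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my leftmost interior point left; receive my right ghost from the right */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send my rightmost interior point right; receive my left ghost from the left */
    MPI_Sendrecv(&u[local_n],     1, MPI_DOUBLE, right, 1,
                 &u[0],           1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { LOCAL_N = 8 };
    double u[LOCAL_N + 2];
    for (int i = 0; i <= LOCAL_N + 1; i++) u[i] = rank;  /* toy data */

    exchange_ghosts(u, LOCAL_N, rank, size, MPI_COMM_WORLD);
    printf("rank %d: left ghost=%g right ghost=%g\n", rank, u[0], u[LOCAL_N + 1]);

    MPI_Finalize();
    return 0;
}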
Advanced Solvers
• Multigrid
  • instead of fixed-size chunks, make several copies of the grid at various granularities
  • a result from one node can propagate more quickly to far-away nodes
  • faster convergence
Advanced Solvers
• Adaptive mesh refinement (AMR)
  • finer discretization in regions where the solution changes more rapidly in space or time
  • convergence rate can improve vastly
  • after each update, check the solution and decide whether to subdivide the region
Advanced Solvers
• Invariants
  • no communication between PEs on a given refinement level, except for the ghost nodes
  • only one (or a small number) of the prior iterations is needed
Example
• Finite difference method
  • simulate the temperature of a rod, insulated except at the ends
  • temperature: $u(x,t)$
  • heat equation: $\dfrac{\partial u}{\partial t} = \alpha\, \dfrac{\partial^2 u}{\partial x^2}$
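A minimal serial C sketch (the grid resolution, α, and the initial condition are assumptions): explicit finite differences replace the derivatives, giving the update $u_i^{m+1} = u_i^m + \alpha\,\Delta t/\Delta x^2\,(u_{i+1}^m - 2u_i^m + u_{i-1}^m)$, with the end temperatures held fixed.

#include <stdio.h>

#define NX 21                      /* grid points along the rod (assumption) */

int main(void) {
    double u[NX], u_new[NX];
    double alpha = 1.0, dx = 1.0 / (NX - 1);
    double dt = 0.4 * dx * dx / alpha;    /* stable: alpha*dt/dx^2 <= 1/2 */
    double r = alpha * dt / (dx * dx);

    /* initial condition: hot spot in the middle, ends held at 0 */
    for (int i = 0; i < NX; i++)
        u[i] = (i == NX / 2) ? 100.0 : 0.0;

    for (int step = 0; step < 500; step++) {
        for (int i = 1; i < NX - 1; i++)
            u_new[i] = u[i] + r * (u[i+1] - 2.0*u[i] + u[i-1]);
        u_new[0] = u[0];                  /* boundary: fixed end temperatures */
        u_new[NX-1] = u[NX-1];
        for (int i = 0; i < NX; i++) u[i] = u_new[i];
    }
    for (int i = 0; i < NX; i++) printf("%6.2f ", u[i]);
    printf("\n");
    return 0;
}

Note the stability condition of the explicit scheme: $\alpha\,\Delta t/\Delta x^2 \le 1/2$ limits how large a timestep can be taken.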
Summary I
• Application
  • 13 dwarfs that form the core of future apps
  • span graphics, machine learning, AI, etc.
• Hardware
  • 1000s of cores on a die
  • heterogeneous cores for area and power advantages
  • shared/transactional memory and full/empty bits for synchronization
  • on-chip communication crucial for high performance
Summary II
• Programming models
  • based on psychology research to make them more intuitive
  • independent of the number of processors
• Systems software
  • using search-based autotuners instead of compilers
  • virtual machine based approach
• Success metrics
  • maximize programmer productivity and app performance
  • hardware counters & monitors to help calibrate success
  • others?
Summary III “This report is intended to be the start of a conversation about these perspectives. There is an open, exciting, and urgent research agenda…”