Simulating Collective Effects on GPUs. Final Presentation @ CERN, 19.02.2016. Master's Thesis CSE, ETHZ: Stefan Hegglin. Supervision: Prof. Dr. P. Arbenz (ETHZ), Dr. K. Li (CERN BE-ABP-HSC)
Overview • PyHEADTAIL • Goal of Thesis • Implementation / Methods • Results for users, developers & profiling
PyHEADTAIL • Simulation code at CERN to study collective effects • Scriptable • Fully dynamic (Python) • Extensible: PyECLOUD… • Easy to use (Figure: Beam Instabilities: Numerical Model, Kevin Li, unpublished manuscript)
PyHEADTAIL: Drift/Kick Model • Splits the synchrotron into segments • Linear tracking along the ring segments • Kicks applied between segments (collective effects, dampers, …) • Macro-particles: typically 10^5 to 10^7 are tracked
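To make the drift/kick structure concrete, here is a minimal illustrative sketch of such a tracking loop with made-up stand-in classes (LinearSegmentMap, CollectiveKick); it is not PyHEADTAIL source code, only the pattern of linear transport interleaved with kicks:

class LinearSegmentMap(object):
    def track(self, bunch):
        pass  # linear transport of (x, xp, y, yp, z, dp) along one ring segment

class CollectiveKick(object):
    def track(self, bunch):
        pass  # collective effect, damper, ... applied at the segment boundary

segment_maps = [LinearSegmentMap() for _ in range(8)]  # say, 8 segments
kicks = [CollectiveKick() for _ in range(8)]
bunch = object()  # stands in for the macro-particle bunch

for turn in range(1000):
    for segment_map, kick in zip(segment_maps, kicks):
        segment_map.track(bunch)   # drift: linear tracking along the segment
        kick.track(bunch)          # kick: applied between segments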
Scope of Thesis • Develop a GPU interface for PyHEADTAIL • Speedup • Simplicity for the user • Simplicity for other developers • Make the CPU/GPU switch easy • Strategy used: PyCUDA as the interface between Python and CUDA (see the example below)
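For readers unfamiliar with PyCUDA, a minimal standalone example (not thesis code) of its GPUArray container, which is what the GPU interface builds on:

import numpy as np
import pycuda.autoinit                 # initialises a CUDA context
import pycuda.gpuarray as gpuarray

x = gpuarray.to_gpu(np.linspace(0.0, 1.0, 1000000))   # host -> device copy
y = 2.0 * x + 1.0                      # arithmetic runs as CUDA kernels generated by PyCUDA
print(y.get()[:5])                     # device -> host copy back to a NumPy array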
Methods / Implementation: Strategies • Use GPUArrays as much as possible: no rewriting of code • Hide all GPU-specific algorithms behind a layer: transparent to the user and to other developers • Some adaptations to PyCUDA to unify the interface and increase its scope • Streams to compute the statistics of independent dimensions concurrently (see the sketch below) • For speedup: optimise functions only if profiling deems it necessary
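As an illustration of the streams strategy, a sketch of reducing three independent planes on separate CUDA streams; it assumes PyCUDA's gpuarray.sum accepts a stream keyword and is not the actual PyHEADTAIL statistics code:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

n = 1000000
x, y, z = (gpuarray.to_gpu(np.random.randn(n)) for _ in range(3))

streams = [cuda.Stream() for _ in range(3)]
partial_sums = [gpuarray.sum(u, stream=s) for u, s in zip((x, y, z), streams)]

# .get() synchronises; the means are formed on the host
means = [float(ps.get()) / n for ps in partial_sums]
print(means)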
Results: Software Design • Setup simulation • Set the context with the context manager • Track the bunch: the correct implementation is chosen automatically • Move the data back to the CPU. The user does not need to care about the system internals.
Context & Contextmanager Details
Context: general/pmath.py, a module which contains:
• One dictionary per context (CPU/GPU) referencing the function implementations
• A function update_active_dict() which spills the contents of the currently active dictionary into the module-global namespace, so that the functions are callable via pmath.functionname()
Contextmanager: gpu/contextmanager.py, a class to be used in a with-statement (with GPU(bunch): track()) which:
• Switches the implementations by updating the active dictionary (update_active_dict() upon entering the with-statement)
• Moves the bunch data to and from the GPU in its __enter__ and __exit__ methods
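A stripped-down sketch of this dispatch mechanism; the names (_CPU_func_dict, _GPU_func_dict) are made up, the GPU dictionary is left empty, and the data movement is only indicated in comments, so this is not the actual PyHEADTAIL source:

# pmath.py (sketch)
import numpy as np

_CPU_func_dict = {'sin': np.sin, 'mean': np.mean, 'device': lambda: 'CPU'}
_GPU_func_dict = {}   # would hold pycuda.cumath.sin, custom kernels, ... on a GPU install

def update_active_dict(func_dict):
    # spill the active implementations into the module-global namespace,
    # so user code can simply call pmath.sin(x), pmath.mean(x), ...
    globals().update(func_dict)

update_active_dict(_CPU_func_dict)    # CPU is the default context

# contextmanager.py (sketch)
class GPU(object):
    def __init__(self, bunch):
        self.bunch = bunch

    def __enter__(self):
        update_active_dict(_GPU_func_dict)   # all pmath calls now dispatch to GPU versions
        # ... move the bunch coordinate arrays to the GPU (gpuarray.to_gpu) ...
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # ... move the bunch coordinate arrays back to the CPU (.get()) ...
        update_active_dict(_CPU_func_dict)   # restore the CPU implementations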
Results: Qualitative
User: add two lines of code to the script, the import and the with-statement (and set use_cython=False in the new BasicSynchrotron class):

from gpu.contextmanager import GPU
with GPU(bunch):
    for n in xrange(nturns):
        machine.track(bunch)

Developer: write code (almost) as before:
• Dispatch all mathematical function calls via the context: sin(x) --> pm.sin(x)
• Statements like these work out of the box thanks to GPUArrays: bunch.z -= a * bunch.mean_x()
Developer: Available functions
• Mathematical: sin, cos, exp, arcsin, min, max, floor
• Statistical: mean, std, emittance
• Array: diff, cumsum, seq, arange, argsort, apply_permutation, take, convolve (the GPU version runs on the CPU)
• Slicing: mean_per_slice, std_per_slice, particles_within_cuts, macroparticles_per_slice, searchsortedleft, searchsortedright
• Creation: zeros, ones
• Marker: device
• Monitor: init_bunch_buffer, init_slice_buffer
Example:

def my_kick(bunch):
    print 'Running on', pm.device()
    bunch.x -= pm.sin(bunch.z)
    bunch.xp *= bunch.mean_xp()
    a = pm.zeros(100, dtype=np.float64)
    bunch.z = pm.take(bunch.dp, indices)
    …
    bunch.z -= 3 * a
Developer: How to add new functionality
Suppose I want to write a kick like this:

def my_kick(bunch):
    print 'Running on', pm.device()
    p = np.fft.fft(bunch.x)
    p *= factor
    bunch.x = np.fft.ifft(p)

What do I do?
• Create an entry in the pmath function dictionaries for both CPU and GPU, with the same interface
• Implement both versions and call them from the dictionary (not necessary for one-liners such as np.cos: store them in the dictionary directly)
• Add tests which compare the two versions in test_dispatch.py (see the examples there)
• Check PyCUDA and scikit-cuda before implementing your own kernels! (see the sketch below)
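For the FFT example above, the pattern could look like the following sketch. The dictionary names continue those of the earlier sketch and are made up; the GPU side is shown with scikit-cuda's cuFFT wrapper and assumes a complex128 GPUArray as input; a real implementation should be checked against the scikit-cuda documentation and covered by a test in test_dispatch.py:

import numpy as np
import pycuda.gpuarray as gpuarray
from skcuda import fft as cu_fft

def _fft_cpu(x):
    return np.fft.fft(x)

def _fft_gpu(x_gpu):
    # complex-to-complex forward FFT via cuFFT; x_gpu is assumed to be complex128
    out = gpuarray.empty_like(x_gpu)
    plan = cu_fft.Plan(x_gpu.shape, np.complex128, np.complex128)
    cu_fft.fft(x_gpu, out, plan)
    return out

# same name, same interface, registered in both dictionaries
_CPU_func_dict['fft'] = _fft_cpu
_GPU_func_dict['fft'] = _fft_gpu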
User: Available trackers & kicks
Available and tested in branch feature/PyPIC_integration:
• Tracker: transverse map with/without detuning and dispersion, RF systems, linear longitudinal, drift, …
• Kick: wake kick (all types; the convolution is performed on the CPU at the moment!), damper, RFQ
• Slicing: uniform bin slicing
• Monitor: bunchmonitor, slicemonitor
Results: Profiling
• Transverse map: embarrassingly parallel; speedup of up to 27x, saturating at > 10^6 macro-particles. Why not faster? PyCUDA/GPUArray overhead (see the next slide)
• Wake field (500 slices): speedup of up to 6x; the convolution on the CPU accounts for < 10% of the runtime; no speedup for < 10^5 macro-particles
Discussion: GPUArray overhead
A statement like x = a*x + b*y invokes 3 kernel calls via PyCUDA:
tmp <- b*y
tmp2 <- a*x
x <- tmp + tmp2
This leads to:
• Low arithmetic intensity
• Lots of kernel-call overhead, especially for small problem sizes
• Memory-allocation overhead, mitigated by using a memory pool
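Where profiling shows this overhead to matter, such an expression can be fused into a single kernel. A sketch using PyCUDA's ElementwiseKernel, shown as one possible remedy rather than what PyHEADTAIL necessarily does:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# one fused kernel for x = a*x + b*y: no temporaries, a single launch
axpby = ElementwiseKernel(
    "double a, double *x, double b, double *y",
    "x[i] = a * x[i] + b * y[i]",
    "axpby")

x = gpuarray.to_gpu(np.random.randn(1000000))
y = gpuarray.to_gpu(np.random.randn(1000000))
axpby(np.float64(2.0), x, np.float64(3.0), y)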
Results: Benchmark Study
Typical application: LHC@injection instability with wake field & damper
• CPU time: ~1 day; GPU time: 5x less
• Two lines of code added to the script
• The results agree with the CPU benchmark
Conclusion
PyHEADTAIL was successfully ported to GPUs and benchmarked against the previous implementation.
+ easy to use
+ extensible
+ maintainable (Python)
- not fully exploiting the GPU
- dependent on PyCUDA