Design and Testing of GPU Server based RTC for TMT NFIRAOS
Lianqi Wang
AO4ELT3, Florence, Italy, 5/31/2013
Thirty Meter Telescope Project
Introduction
• Part of the on-going RTC architecture trade study for NFIRAOS
  • Handles MCAO and NGS SCAO modes
• Iterative algorithms
  • Hardware: FPGA, GPU, etc.
• Matrix Vector Multiply (MVM) algorithm
  • Made possible by using GPUs to compute the control matrix
  • Hardware: GPU, Xeon Phi, or just CPUs
• In this talk
  • GPU server based RTC design
  • Benchmarking the 10 GbE interface and the whole real time chain
  • Updating the control matrix as seeing changes
Most Challenging Requirements
• LGS WFS reconstruction:
  • Processing pixels from 6 order 60x60 LGS WFS
    • 2896 subapertures per WFS
    • Average 70 pixels per subaperture (0.4 MB per WFS, 2.3 MB total)
    • CCD read out: 0.5 ms
  • Compute DM commands from gradients
    • 6981 active (7673 total) DM actuators from 35k gradients
    • Using iterative algorithms or matrix vector multiply (MVM)
  • At 800 Hz, all has to finish in ~1.25 ms
• This talk focuses on MVM, in GPUs.
Number of GPUs for Applying MVM
• 925 MB control matrix
• Assuming 1.00 ms total time
• 2 GPUs per WFS, on dual GPU boards
[Hardware comparison table not recovered; footnote 1 below refers to it]
1: http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf
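The GPU count above is essentially a memory bandwidth budget: the entire 925 MB control matrix must be streamed from GPU memory once per frame, within the ~1 ms allowed for the MVM. A minimal sketch of that sizing arithmetic follows; the per-GPU sustained throughput figure is an illustrative assumption, not a number from the slides.

```c
/* Sizing sketch: the MVM is memory bandwidth bound, since the whole control
   matrix must be streamed from GPU memory once per frame. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double matrix_mb    = 925.0; /* control matrix size (slide value)        */
    const double budget_ms    = 1.00;  /* time budget for the MVM (slide value)    */
    const double gpu_sust_gbs = 80.0;  /* ASSUMED sustained MVM throughput per GPU */
    const int    n_wfs        = 6;     /* LGS WFS count                            */

    double required_gbs = (matrix_mb / 1024.0) / (budget_ms * 1e-3);
    int    gpus_total   = (int)ceil(required_gbs / gpu_sust_gbs);

    printf("required aggregate bandwidth: %.0f GB/s\n", required_gbs);
    printf("GPUs needed: %d total, %.1f per WFS\n",
           gpus_total, (double)gpus_total / n_wfs);
    return 0;
}
```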
LGS WFS to RTC Interface
• 10 GbE surpasses the requirement of 800 MB/s
  • Widely deployed
  • Low cost
• Alternatives: sFPDP, Camera Link, etc.
• Testing with our Emulex 10 GbE PCI-E board
  • Plain TCP/IP
  • No special features
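A minimal sketch of what the plain TCP/IP transfer test can look like on the sending side; the peer address, port, frame count, and one-byte acknowledgement are illustrative assumptions (the slides report one-way latency, so this round-trip-style loop only shows the shape of the test, not the exact measurement).

```c
/* Hypothetical sender: streams 0.4 MB "pixel" frames over plain TCP and
   times each frame against a 1-byte acknowledgement from the receiver. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define FRAME_BYTES (400 * 1024) /* ~0.4 MB of WFS pixels per frame */
#define NFRAMES     100000

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

int main(void) {
    char *frame = malloc(FRAME_BYTES);
    memset(frame, 0, FRAME_BYTES);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one); /* no Nagle delay */

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(9000);                       /* assumed port    */
    inet_pton(AF_INET, "192.168.10.2", &peer.sin_addr);  /* assumed peer IP */
    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        return 1;
    }

    for (int i = 0; i < NFRAMES; i++) {
        double t0 = now_ms();
        ssize_t sent = 0;
        while (sent < FRAME_BYTES) {                     /* write() may be partial */
            ssize_t n = write(fd, frame + sent, FRAME_BYTES - sent);
            if (n <= 0) { perror("write"); return 1; }
            sent += n;
        }
        char ack;
        if (read(fd, &ack, 1) != 1) { perror("read"); return 1; } /* 1-byte ack */
        printf("%d %.3f\n", i, now_ms() - t0);           /* per-frame time in ms */
    }
    free(frame);
    close(fd);
    return 0;
}
```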
A Sends 0.4 MB of Pixels to B (100,000 Frames)
• Latency: median 0.465 ms (880 MB/s)
• Jitter: RMS 13 µs, P/V 115 µs
• CPU load: 100% of one core
B Sends 0.4 MB of Pixels to A (100,000 Frames)
• Latency: median 0.433 ms (950 MB/s)
• Jitter: RMS 6 µs, P/V 90 µs
Server with GPUs
[Block diagram: 1U or 2U chassis; CPU 1 and CPU 2, each with memory; x8 10GbE adapters (WFS DC links); built-in 10GbE (central server link); K10 GPU modules and the RPG on PCIe x16; some slots not accessible in 1U form]
• Each K10 GPU module has two GPUs
Benchmarking the End to End Latency to Qualify this Architecture
• Simulation scheme: Server A emulates one LGS WFS camera and sends pixels over TCP over 10GbE to Server B, which acts as 1/6 of the RTC and produces the DM commands destined for the central server
Benchmarking Hardware Configuration
• Server A: dual Xeon W5590 @ 3.33 GHz
• Server B: Intel Core i7 3820 @ 3.60 GHz
  • Two NVIDIA GTX 580 GPU boards, or one NVIDIA GTX 590 dual GPU board
• Emulex OCe11102-NX PCIe 2.0 10GBase-CR adapters
• Connected back to back
Benchmarking Software Configuration
• Servers optimized for low latency (the scheduling-related settings are sketched below)
  • Run the real time preempt Linux kernel (rt-linux-3.6)
  • CPU affinity set to core 0
  • Lock all memory pages
  • Scheduler: SCHED_FIFO at maximum priority
  • Priority: -18
  • Disable hyper-threading
  • Disable virtualization technology and VT-d in BIOS (critical)
  • Disable all non-essential services
  • No other tasks running
• CUDA 4.0 C runtime library
• Vanilla TCP/IP
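A sketch of how the scheduling-related items above might be applied from inside the real-time process, using standard Linux calls. This is an illustration, not the NFIRAOS RTC code; the slide's "Priority: -18" is interpreted here as a nice value, since SCHED_FIFO priorities are positive.

```c
/* Sketch of the per-process low-latency setup: pin to core 0, lock all pages
   in RAM, and request SCHED_FIFO at the highest real-time priority. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    /* CPU affinity: keep the hard real-time thread on core 0 */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) perror("sched_setaffinity");

    /* Lock current and future pages so they are never swapped out */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) perror("mlockall");

    /* SCHED_FIFO at the maximum real-time priority */
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) perror("sched_setscheduler");

    /* Nice value -18 (ASSUMED reading of the slide's "Priority: -18") */
    if (setpriority(PRIO_PROCESS, 0, -18) != 0) perror("setpriority");

    return 0;
}
```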
Pipelining Scheme
[Timeline diagram, not to scale: WFS read out (0.5 ms), followed by three streams synchronized with events, from frame start to the next frame 1.25 ms later]
• Benchmarked for an LGS WFS
• MVM takes most of the time
• Memory copying is indeed concurrent with computing
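A condensed sketch of the three-stream pipelining for one chunk of an incoming WFS frame: the host-to-device pixel copy, the gradient computation, and the partial MVM accumulation run in separate CUDA streams synchronized with events. The kernel body, chunk handling, and cuBLAS call are illustrative assumptions; the actual gradient and MVM kernels are not shown in the slides.

```cpp
// Sketch of the three-stream pipeline: H2D pixel copies, gradient computation,
// and partial MVM overlap across chunks of the incoming WFS frame.
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Placeholder gradient kernel (a real one would do CoG or matched filtering).
__global__ void grad_kernel(const short *pix, float *grad, int nsa) {
    int isa = blockIdx.x * blockDim.x + threadIdx.x;
    if (isa < nsa) grad[2 * isa] = grad[2 * isa + 1] = 0.f;
}

void process_chunk(cublasHandle_t h, cudaStream_t s_copy, cudaStream_t s_grad,
                   cudaStream_t s_mvm, const short *pix_host, short *pix_dev,
                   float *grad_dev, const float *mvm_dev, float *act_dev,
                   int npix, int nsa, int nact) {
    cudaEvent_t copied, graded;
    cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&graded, cudaEventDisableTiming);

    // Stream 1: copy this chunk of pixels as soon as it arrives from 10GbE.
    cudaMemcpyAsync(pix_dev, pix_host, npix * sizeof(short),
                    cudaMemcpyHostToDevice, s_copy);
    cudaEventRecord(copied, s_copy);

    // Stream 2: compute gradients for the subapertures covered by this chunk.
    cudaStreamWaitEvent(s_grad, copied, 0);
    grad_kernel<<<(nsa + 255) / 256, 256, 0, s_grad>>>(pix_dev, grad_dev, nsa);
    cudaEventRecord(graded, s_grad);

    // Stream 3: accumulate the partial MVM (beta = 1) for these gradients.
    cudaStreamWaitEvent(s_mvm, graded, 0);
    const float alpha = 1.f, beta = 1.f;
    cublasSetStream(h, s_mvm);
    cublasSgemv(h, CUBLAS_OP_N, nact, 2 * nsa, &alpha, mvm_dev, nact,
                grad_dev, 1, &beta, act_dev, 1);

    cudaEventDestroy(copied);
    cudaEventDestroy(graded);
}
```

Called once per arriving pixel chunk, this structure lets the copy of one chunk overlap the gradient and MVM work of the previous one, which is the concurrency noted on the slide.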
End to End Timing for 100 AO Seconds (2x GTX 580)
• Latency mean: 0.85 ms
• Jitter RMS: 12 µs; jitter P/V: 114 µs
End to End Timing for 9 AO Hours (2x GTX 580)
• Latency mean: 0.85 ms
• Jitter RMS: 13 µs; jitter P/V: 144 µs
End to End Timing for 100 AO Seconds (GTX 590, dual GPU board)
• Latency mean: 0.96 ms
• Jitter RMS: 9 µs; jitter P/V: 111 µs
Summary
• For one LGS WFS, with one NVIDIA GTX 590 board and a 10 GbE interface:
  • <1 ms from end of exposure to DM commands ready
    • Pixel read out and transport (0.5 ms)
    • Gradient computation
    • MVM computation
• The RTC is made up of 6 such subsystems
  • Plus one more for soft real-time or background tasks
  • Plus one more as an (online) spare
Compute the Control Matrix
• By solving the minimum variance reconstructor in GPUs:
  (34752×6981) = (34752×62311) × (62311×62311)⁻¹ × (62311×6981) × (6981×6981)⁻¹
• The Cn2 profile is used for regularization
  • It needs to be updated frequently
• The control matrix then needs to be updated, using warm restart FDPCG: how many iterations?
• Benchmarking results of iterative algorithms in GPUs follow
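For reference, one common way to factor a minimum variance reconstructor consistent with the dimension bookkeeping above is sketched below in LaTeX. The symbols (interaction matrix G, noise and turbulence covariances C_n and C_x, fitting operators H_a and H_x, weighting W) are not defined in the slides and are introduced here only to label the factors.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% One common factorization of the MVM control matrix E (actuators x gradients):
% minimum variance tomography followed by DM fitting, with a = E g.
\[
  E \;=\; \underbrace{\left(H_a^{T} W H_a\right)^{-1} H_a^{T} W H_x}_{\text{DM fitting}}
          \;\underbrace{\left(G^{T} C_n^{-1} G + C_x^{-1}\right)^{-1} G^{T} C_n^{-1}}_{\text{tomography}}
\]
% Dimensions: 34752 gradients, 62311 tomography grid points, 6981 actuators,
% matching the factors quoted on the slide (written there in transposed order).
\end{document}
```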
Benchmarking Results of Iterative Algorithms for Tomography
• CG: Conjugate Gradients
• FD: Fourier Domain Preconditioned CG
• OSn: Over sampling n tomography layers (¼ m spacing)
How Often to Update the Reconstructor?
• Tony Travouillon (TMT) computed the covariance function of the Cn2 profile from 5 years of site testing data
  • One function per layer; independent between layers
  • Measurements have a minimum separation of 1 minute
• Interpolate the covariance to a finer time scale
  • Generate Cn2 profile time series at 10 second separation
• Pick a case every 500 seconds (60 total for an 8 hour duration)
• Run MAOS simulations (2000 time steps) with
  • the control matrix initialized with the “true” Cn2 profile: FD100 Cold (start)
  • the control matrix updated using results from 10 seconds earlier: FD10 or FD20 Warm (restart)
Distribution of Seeing of the 60 Cases
[Plot: r0 at 10 seconds earlier vs. case]
Performance of MVM
• FD100 Cold vs. CG30 (the baseline iterative algorithm)
• Only 1 point where MVM is worse than CG30
MVM Control Matrix Update (FD10/20 Warm vs. FD100 Cold)
• 8 GPUs can do FD10 in 10 seconds.
Conclusions
• We demonstrated that the end to end hard real time processing of the RTC can be done using 10 GbE and GPUs
  • Pixel transfer over 10 GbE
  • Gradient computation and MVM in 1 GPU board per WFS
  • Total latency is 0.95 ms; P/V jitter is ~0.1 ms
• Control matrix update
  • FD10 warm restart is largely sufficient with a 10 second stale Cn2 profile
Acknowledgements
• The author gratefully acknowledges the support of the TMT collaborating institutions. They are
  • the Association of Canadian Universities for Research in Astronomy (ACURA),
  • the California Institute of Technology,
  • the University of California,
  • the National Astronomical Observatory of Japan,
  • the National Astronomical Observatories of China and their consortium partners,
  • and the Department of Science and Technology of India and their supported institutes.
• This work was supported as well by
  • the Gordon and Betty Moore Foundation,
  • the Canada Foundation for Innovation,
  • the Ontario Ministry of Research and Innovation,
  • the National Research Council of Canada,
  • the Natural Sciences and Engineering Research Council of Canada,
  • the British Columbia Knowledge Development Fund,
  • the Association of Universities for Research in Astronomy (AURA),
  • and the U.S. National Science Foundation.
Other Tasks
• Copying the updated MVM matrix to the RTC (see the sketch below)
  • Do so after the DM actuator commands are ready
  • Measured 0.1 ms for 10 columns
  • 579 time steps to copy 5790 columns
• Collect statistics to update matched filter coefficients
  • Do so after the DM actuator commands are ready, or in CPUs
  • Complete before the next frame arrives
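A sketch of the column-chunked control matrix update described above: after each frame's DM commands are out, a background stream copies the next slice of columns of the new matrix, so the full 5790-column update spreads over 579 frames at roughly 10 columns (≈0.1 ms) per frame. The structure, names, and chunk handling are illustrative assumptions, not the RTC's actual code.

```cpp
// Sketch of the background control-matrix update: after the DM commands for a
// frame are sent, copy the next chunk of columns of the new MVM matrix from
// host to device in a separate stream.
#include <cuda_runtime.h>

struct MvmUpdate {
    const float *newmat_host; // pinned host copy of the updated control matrix
    float       *mat_dev;     // live control matrix on the GPU (column-major)
    int nact;                 // rows: DM actuators (6981 for NFIRAOS)
    int ncol;                 // columns: gradients handled by this WFS (~5790)
    int next_col;             // first column not yet copied
};

// Call once per frame, after the DM command send; copies up to `chunk` columns.
void mvm_update_step(MvmUpdate *u, cudaStream_t bg_stream, int chunk /* e.g. 10 */) {
    if (u->next_col >= u->ncol) return;                 // update finished
    int ncopy = u->ncol - u->next_col;
    if (ncopy > chunk) ncopy = chunk;
    size_t off   = (size_t)u->next_col * u->nact;       // column-major offset
    size_t bytes = (size_t)ncopy * u->nact * sizeof(float);
    // Async copy in a background stream so it does not delay the next frame.
    cudaMemcpyAsync(u->mat_dev + off, u->newmat_host + off, bytes,
                    cudaMemcpyHostToDevice, bg_stream);
    u->next_col += ncopy;
}
```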