June 18th, 2013 – New York City, NY, USA
Neutron Sensitivity and Software Hardening Strategies for Matrix Multiplication and FFT on Graphics Processing Units
P. Rech, L. Pilla, F. Silvestri, P. O. Navaux, and Luigi Carro
Outline
• Radiation Effects on Graphics Processing Units
• Experimental Setup
• Matrix Multiplication
  - Error Rate at Sea Level
  - Hardening Techniques
• Fast Fourier Transform
  - Error Rate at Sea Level
  - Hardening Techniques
• Conclusions
Terrestrial Radiation Environment
• Radiation is an issue even at sea level.
• Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles:
  - Muons
  - Pions
  - Protons
  - Gamma rays
  - Neutrons: 13 n/(cm^2 h) at sea level
GPU Internal Structure
[Figure: GPU block diagram with Streaming Multiprocessors (SMs), their threads, registers and shared memory, and the DRAM.]
• A GPU is an array of Streaming Multiprocessors (SMs).
• The SMs share the DRAM.
• Each SM executes various threads in parallel.
• Threads have access to registers and shared memory.
Radiation Effects on a GPU
[Figure: GPU block diagram annotated with SEU and SET locations.]
• Radiation can corrupt memory resources (SEU)...
• ...but also logic (SET)...
• ...and control circuitry: a scheduler failure may have severe repercussions.
Why Radiation Test on GPUs?
• Titan (Oak Ridge National Lab): 18,000 GPUs, so there is a high probability of having a GPU corrupted.
• Pedestrian detection* (NVIDIA Tegra): high reliability is required.
*From 2015, Euro NCAP awards 5 safety stars only to cars with pedestrian detection.
Tested Devices
• NVIDIA GeForce GTX480 (desktop board)
• NVIDIA Tesla C2050 (built-in ECC)
Radiation Test Facilities
[Figures: beamline diagram with a proton (p+) beam; Weapons Neutron Research facility.]
Neutron Spectrum
• 1 s @ISIS = 10^7 s (about 110 days) of natural irradiation @NYC
GPU Radiation Test Setup
[Figure: test setup; labels: beam spot, PC, PCI-E bus, 20 cm.]
• The PC is inside the room but out of the beam.
• A PCI-E bus extension connects the PC and the GPU.
• The extension has fuses on the power lines to prevent GPU latch-ups from affecting the PC.
GPU Radiation Test Setup
• The beam spot is 3 cm wide: the GPU is fully irradiated.
• The GPU power control circuitry is out of the beam, since a power control circuitry failure could compromise the experiment.
• The GPU DDR is also out of the beam.
Matrix Multiplication
• A x B = M, where each matrix has 2048 x 2048 elements.
• 2048 x 2048 threads, each performing 2048 sums and multiplications.
Matrix Multiplication Results
• Experimental cross section* @ISIS = 2.01 x 10^-6 cm^2 (*with double-precision data)
• The neutron spectrum @ISIS resembles the atmospheric one, so the cross section @ISIS resembles the cross section at sea level.
• Error rate = cross section x neutron flux (@sea level):
  2.01 x 10^-6 cm^2 x 13 n/(cm^2 h) = 2.60 x 10^4 FIT, i.e. 1 error every 4.5 years
• Titan (GTX): 18,000 errors every 4.5 years, i.e. 10 errors per day!
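The arithmetic on this slide can be checked with a short sketch. It assumes the usual definition of FIT (failures per 10^9 device-hours) and treats all 18,000 GPUs as identical; the function names are illustrative and not from the presentation.

```python
# Sketch: sea-level error rate from a measured cross section.
# Assumption: FIT means failures per 10^9 device-hours.
FIT_HOURS = 1e9
HOURS_PER_YEAR = 24 * 365


def error_rate_fit(cross_section_cm2, flux_n_per_cm2_h):
    """Expected errors per 10^9 hours for one device."""
    return cross_section_cm2 * flux_n_per_cm2_h * FIT_HOURS


def years_between_errors(fit):
    """Mean time between errors for one device, in years."""
    return FIT_HOURS / fit / HOURS_PER_YEAR


def fleet_errors_per_day(fit, devices):
    """Expected errors per day across `devices` identical devices."""
    return fit / FIT_HOURS * 24 * devices


fit = error_rate_fit(2.01e-6, 13.0)          # values from the slide
years = years_between_errors(fit)
per_day = fleet_errors_per_day(fit, 18_000)  # Titan-sized fleet
print(f"{fit:.3g} FIT, one error every {years:.1f} years, {per_day:.1f} errors/day")
```

Without intermediate rounding this gives roughly 2.6 x 10^4 FIT, one error about every 4.4 years per GPU, and about 11 errors per day for the fleet, in line with the rounded figures on the slide.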
Multiple Output Errors
• It was commonly believed that only a single error affects the output.
• Experimental results: Single: 42.2%, Multiple: 58.8%
• The majority of errors are multiple output errors.
Multiple Output Errors Analysis
Three different multiple-error patterns are detected:
1) 22.8% on the same row
2) 26.8% on the same column
3) 8% cluster errors
[Figure: chart of output errors [%] by type (Single, Row, Column, RND) and sketch of M with corrupted elements (x).]
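A minimal sketch of how an output could be sorted into these patterns by comparing it with a fault-free ("golden") result. The function names are hypothetical, and labelling every multiple error that is neither a row nor a column error as a cluster is an assumption of this sketch, not a rule stated in the presentation.

```python
def diff_positions(golden, observed):
    """Return the (row, col) positions where observed differs from golden."""
    return [(i, j)
            for i, (g_row, o_row) in enumerate(zip(golden, observed))
            for j, (g, o) in enumerate(zip(g_row, o_row))
            if g != o]


def classify_errors(positions):
    """Classify corrupted positions as none, single, row, column or cluster."""
    cells = set(positions)
    if not cells:
        return "none"
    if len(cells) == 1:
        return "single"
    rows = {r for r, _ in cells}
    cols = {c for _, c in cells}
    if len(rows) == 1:
        return "row"
    if len(cols) == 1:
        return "column"
    # Assumption: anything else is counted as a cluster error.
    return "cluster"


golden = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
observed = [[1, 2, 3], [0, 5, 0], [7, 8, 9]]   # two faults on row 1
print(classify_errors(diff_positions(golden, observed)))
```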
Errors on Row/Column Causes
• A column of M is calculated using the rows of A and one column of B, which is stored in the GPU cache.
• Threads on an SM share the cache.
• Cache corruption therefore causes errors on a row or column of M.
Errors Correction
1) ECC on cache memory
  - Corrects multiple errors on a row/column, which are almost 50% of the total (tested on C2050)
  - Memory availability is reduced by 12.5%*
  - Execution time is increased by up to 30%*
  *NVIDIA datasheet
2) Algorithm-Based Fault Tolerance (ABFT): a technique specifically designed for a given algorithm
  [Figure: A x B = M, with a row-check checksum and a col-check checksum (∑) attached to the matrices. *Freivalds '79]
Matrix Multiplication ABFT
• Single errors* are detected in O(N) and corrected in O(1). (*Huang and Abraham '84)
• Errors on a row/column* are detected in O(N) and corrected in O(1). (*P. Rech et al. '12)
[Figure: M with corrupted elements (X); mismatches between row-check and row-sum and between col-check and col-sum locate the errors.]
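The checksum idea can be sketched as follows. This is a sequential, integer-only illustration under stated assumptions: row-check and col-check are derived from A and B without forming M, then compared with the actual row and column sums of M. It does not reproduce the parallel O(N)/O(1) costs quoted above, and the function names are illustrative.

```python
def matmul(a, b):
    """Plain product of two integer matrices (lists of lists)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def abft_checks(a, b):
    """Expected row and column sums of A x B, computed without forming M."""
    b_row_sums = [sum(row) for row in b]
    a_col_sums = [sum(col) for col in zip(*a)]
    row_check = [sum(x * s for x, s in zip(row, b_row_sums)) for row in a]
    col_check = [sum(s * b[t][j] for t, s in enumerate(a_col_sums))
                 for j in range(len(b[0]))]
    return row_check, col_check


def abft_correct(m, row_check, col_check):
    """Repair M in place; returns 'clean', 'corrected' or 'uncorrectable'."""
    row_delta = {i: sum(row) - row_check[i] for i, row in enumerate(m)}
    col_delta = {j: sum(col) - col_check[j] for j, col in enumerate(zip(*m))}
    bad_rows = [i for i, d in row_delta.items() if d != 0]
    bad_cols = [j for j, d in col_delta.items() if d != 0]
    if not bad_rows and not bad_cols:
        return "clean"
    if len(bad_rows) == 1 and bad_cols:
        i = bad_rows[0]
        for j in bad_cols:          # single error, or errors on one row
            m[i][j] -= col_delta[j]
        return "corrected"
    if len(bad_cols) == 1 and bad_rows:
        j = bad_cols[0]
        for i in bad_rows:          # errors confined to one column
            m[i][j] -= row_delta[i]
        return "corrected"
    # Several rows and columns mismatch (or errors cancel in a sum).
    return "uncorrectable"
```

A single error shows up as exactly one mismatching row and one mismatching column; errors confined to one row show up as one mismatching row and several mismatching columns, each of which is repaired from its own column checksum.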
Cluster Errors Causes
• Cluster errors can be caused by:
  - Cache cross-talk
  - Errors in dirty cache flags
  - Pairwise bit flips in cache
  - Scheduler failure
• A scheduler failure affects the synchronization of some threads or yields incomplete results; random locations of M are then erroneous.
Cluster Errors Criticality
• Cluster errors:
  - are not corrected by ECC (tested on C2050)
  - the scheduler cannot be physically hardened
  - scheduler SW hardening* is not yet proven on GPUs (*Rossi et al. '10, *Karimi et al. '10)
• Cluster errors are less likely to occur; however, their FIT is 1.13 x 10^3, which is not negligible!
[Figure: chart of output errors [%] by type (Single, Multiple, Row, Column).]
Cluster Errors Correction
• Various mismatches occur between row-check and row-sum, and between col-check and col-sum.
• Checksum information is not enough to distinguish the errors, but we can try to correct the errors with the row checksums or the column checksums and check whether the correction succeeds.
• Experimentally, the number of corrupted locations in a cluster is ≤ 4: at most 16 checks are needed!
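One possible reading of this try-and-check idea is sketched below; it is an illustration under assumptions, not the authors' procedure. It assumes the corrupted elements share no row or column and have distinct error magnitudes, since equal magnitudes are indistinguishable from checksums alone. Each candidate cell at the intersection of a mismatching row and a mismatching column is tentatively corrected from the row checksum and accepted only if the column checksum then matches. With at most 4 mismatching rows and 4 mismatching columns, that is at most 16 trials.

```python
def try_cluster_correction(m, row_check, col_check):
    """Trial-and-verify repair of M in place.

    Returns (ok, checks): whether all checksums match afterwards, and how
    many candidate cells were tried.
    """
    def row_delta(i):
        return sum(m[i]) - row_check[i]

    def col_delta(j):
        return sum(row[j] for row in m) - col_check[j]

    bad_rows = [i for i in range(len(m)) if row_delta(i) != 0]
    bad_cols = [j for j in range(len(m[0])) if col_delta(j) != 0]
    checks = 0
    for i in bad_rows:
        for j in bad_cols:
            checks += 1
            delta = row_delta(i)
            m[i][j] -= delta            # tentative fix from the row checksum
            if col_delta(j) == 0:       # verify with the column checksum
                break                   # accepted, move to the next row
            m[i][j] += delta            # rejected, roll back
    ok = (all(row_delta(i) == 0 for i in range(len(m)))
          and all(col_delta(j) == 0 for j in range(len(m[0]))))
    return ok, checks
```

When `ok` is False the checksums cannot explain the damage under these assumptions, and recomputation would be the fallback.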
Fast Fourier Transform
[Figure: grid of 512 x 512 64-point FFTs.]
• 512 x 512 threads, each executing the Stockham algorithm on a 64-point FFT.
• At each iteration a thread updates the 64 elements in pairs (2-by-2); log2(64) = 6 iterations are required.
• A thread in one iteration uses the output of previous threads as input.
• Threads are not independent, so errors are likely to spread.
• FFT cross section = 3.69 x 10^-6 cm^2 (5.17 x 10^5 FIT)
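To illustrate why errors spread, here is a small radix-2 Stockham FFT with a hypothetical fault model: 1.0 is added to one intermediate element before a chosen iteration, and the corrupted outputs are counted. This is a toy, single-threaded sketch and not the kernel used in the experiments; the parameters `fault_stage` and `fault_index` are assumptions introduced for the example.

```python
import cmath


def stockham_fft(x, fault_stage=None, fault_index=0):
    """Radix-2 Stockham autosort FFT (len(x) must be a power of two).

    If fault_stage is given, 1.0 is added to one buffer element just
    before that stage runs (0 = the input); a hypothetical fault model.
    """
    n = len(x)
    a, b = list(x), [0j] * n
    half, stride, stage = n // 2, 1, 0
    while half >= 1:
        if stage == fault_stage:
            a[fault_index] += 1.0
        for j in range(half):
            w = cmath.exp(-2j * cmath.pi * j / (2 * half))
            for k in range(stride):
                c0 = a[k + j * stride]
                c1 = a[k + j * stride + half * stride]
                b[k + 2 * j * stride] = c0 + c1
                b[k + 2 * j * stride + stride] = (c0 - c1) * w
        a, b = b, a                      # ping-pong buffers, no bit reversal
        half, stride, stage = half // 2, stride * 2, stage + 1
    return a


def corrupted_outputs(x, fault_stage):
    """Count outputs that differ from the fault-free result."""
    golden = stockham_fft(x)
    faulty = stockham_fft(x, fault_stage=fault_stage)
    return sum(1 for g, f in zip(golden, faulty) if abs(g - f) > 1e-9)


signal = [complex(i % 7, 0) for i in range(64)]
for s in range(6):
    print(s, corrupted_outputs(signal, s))
```

In this model a fault introduced before iteration s of a 64-point FFT corrupts 64 / 2^s outputs, so a single upset never stays a single output error unless it hits the final result.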
FFT Multiple Errors
[Figure: histogram of the percentage of faulty FFTs (0 to 10%) versus the number of multiple errors per execution (2 to >130).]
• Less than 4% of executions have single errors.
• Few executions have an odd number of errors.
• Most executions have fewer than 32 errors, or 64 (a thread failure leads to the wrong update of all the 64 elements in the FFT).
• Software hardening idea: prevent error propagation.
FFT Hardening
[Figure: hardened FFT flow: input coding, FFT, output decoding, checksum generation, check.]
• All errors are detected with a well-chosen coding-decoding scheme*... (*J. Y. Jou and Abraham '88; *P. Rech et al. '13)
• ...but only when all iterations are completed: errors do propagate and FFT recomputation is required.
• ABFT-hardened FFT: divide the N-FFT into N2-FFTs and N1-FFTs (N = N1*N2), performing coding-decoding-checksum on each smaller FFT...
• ...so only the small FFT found corrupted has to be recomputed.
[Figure: check placement between the smaller FFTs; labels: check, error propagation, computational overhead.]
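The "recompute only the corrupted small FFT" idea can be sketched as below. As a stand-in for the cited coding-decoding scheme, the sketch uses a much simpler property: the outputs of any DFT sum to N times the first input sample. That check is an assumption of this example, and it misses faults whose output errors sum to zero. The `inject` hook is hypothetical and exists only to simulate a fault on the first pass.

```python
import cmath


def dft(x):
    """Naive DFT, used here as the small transform for brevity."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def checksum_ok(x, spectrum, tol=1e-6):
    """Simple linear check: sum of DFT outputs equals N * x[0]."""
    return abs(sum(spectrum) - len(x) * x[0]) <= tol


def hardened_batch(inputs, transform=dft, inject=None):
    """Transform each input; re-run only those failing the check.

    inject(index, spectrum) -> spectrum is a hypothetical fault hook
    applied to the first pass only.
    """
    outputs, recomputed = [], []
    for idx, x in enumerate(inputs):
        spectrum = transform(x)
        if inject is not None:
            spectrum = inject(idx, spectrum)
        if not checksum_ok(x, spectrum):
            spectrum = transform(x)      # recompute just this small FFT
            recomputed.append(idx)
        outputs.append(spectrum)
    return outputs, recomputed


def flip_one(idx, spectrum):
    """Corrupt one output of small FFT number 2 (illustrative fault)."""
    if idx == 2:
        spectrum = list(spectrum)
        spectrum[3] += 1.0
    return spectrum


batch = [[complex(i + 8 * r, 0) for i in range(8)] for r in range(4)]
results, redo = hardened_batch(batch, inject=flip_one)
print(redo)
```

Checking each small FFT separately keeps a fault from propagating into later stages, at the cost of one checksum per small FFT.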
Conclusions
- GPUs are very prone to neutron-induced corruption
- The radiation response depends on the executed algorithm
- The corruption of shared and critical resources leads to multiple output errors
- ECC is not sufficient to guarantee high reliability
- Software-based hardening strategies can be built by analyzing the algorithm and the experimental data
Work in Progress:
- Reduce scheduler strain by optimizing thread distribution
- Analyze cache flag corruption
- Evaluate error criticality (precision of data)