Accelerating a Software Radio Astronomy Correlator
By Andrew Woods
Supervisors: Prof. Inggs & Dr Langman
Correlator
• Radio telescopes have many separate antennas
• A correlator combines their signals to produce high-resolution images
• This is done by correlating the antenna signals with one another
• Correlating in the frequency domain is more efficient for large inputs
FPGA
• Used two Nallatech H101 boards
• Each has a Virtex-4 LX100 (V4LX100) FPGA, a PCI-X interface, 16 MB SRAM and 512 MB DDR2
• Programmed with the Dime-C tools, a C-like language aimed at software acceleration
• Con: the FPGA achieved clock rates of only around 100 MHz
• Pro: can create custom hardware for the application
  • Parallel execution
  • Pipelining
[Figure: HPRC card]
GPUs
• Processing monsters
• Achieved by devoting little chip area to cache and control logic
• Used to be fixed-function; recently became programmable
• People started using pixel shaders for general-purpose processing (GPP)
• Nvidia has released CUDA, a language specifically for general-purpose processing
• Used an Nvidia 8800 GT
  • 112 shader (stream) processors @ 1.5 GHz
FX Correlator
• Three steps for each antenna: an FFT (the F step), multiplication with every other antenna (the X step), and then integration
• The multiplication dominates the computation, so it was the function implemented on the FPGA and GPU
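The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation from the talk: the names `dft` and `fx_correlate` are hypothetical, a naive DFT stands in for a real FFT, the second spectrum in each pair is conjugated (the usual convention for a cross-correlation product), and only cross pairs with i < j are computed.

```python
import cmath

def dft(samples):
    """Naive O(M^2) discrete Fourier transform, a stand-in for a real FFT."""
    m = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / m) for t in range(m))
        for k in range(m)
    ]

def fx_correlate(antennas, fft_len):
    """antennas: one equal-length list of samples per antenna.

    Returns {(i, j): [integrated product per frequency channel]} for i < j.
    """
    n = len(antennas)
    n_blocks = len(antennas[0]) // fft_len
    acc = {(i, j): [0j] * fft_len for i in range(n) for j in range(i + 1, n)}
    for b in range(n_blocks):
        # F step: transform one block of samples from every antenna.
        spectra = [dft(a[b * fft_len:(b + 1) * fft_len]) for a in antennas]
        # X step plus integration: multiply each pair and accumulate over blocks.
        for (i, j), sums in acc.items():
            for k in range(fft_len):
                sums[k] += spectra[i][k] * spectra[j][k].conjugate()
    return acc
```

The inner multiply-accumulate loop is the part that grows with the square of the number of antennas, which is why it is the natural candidate for offloading.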
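Correlation Graphically [1]
[Figure: correlation products for frequency channels Freq 0 to Freq M. The work grows from N^2/2 products (one time step, one channel) to N^2/2 x int length, and then to N^2/2 x int length x Freq.]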
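To make the scaling on this slide concrete, the count of complex multiplications can be computed directly. This is a small sketch: the function name and the example sizes in the usage note are hypothetical, and the exact pair count N(N-1)/2 (cross pairs only) is shown next to the slide's N^2/2 approximation.

```python
def correlation_products(n_antennas, int_length, n_freq):
    """Count complex multiplies for one integration period."""
    pairs_approx = n_antennas ** 2 // 2                # the slide's N^2/2
    pairs_exact = n_antennas * (n_antennas - 1) // 2   # cross pairs, i < j
    per_pair = int_length * n_freq                     # time steps x channels
    return {
        "approx_total": pairs_approx * per_pair,
        "exact_total": pairs_exact * per_pair,
    }
```

For example, `correlation_products(64, 100, 256)` counts the products for a hypothetical 64-antenna array integrating 100 time steps over 256 channels. For large N the two totals differ by only a small fraction, which is why the N^2/2 shorthand is adequate.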
FPGA Design
• We were able to implement 96 floating-point units.
• Created a pipelined engine that computes a single output for three time steps and integrates it.
• Four of these engines fit on the FPGA, so four frequencies are computed at a time.
• Speedup of ~3x vs. a 3 GHz Xeon (SSE).
• Reaches ~85% of theoretical peak (excluding transfers).
[Figure: four engines processing Freq 0 to Freq 3 in parallel over clock cycles 0, 1, ..., N^2/2]
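One plausible way to account for the 96 floating-point units is sketched below. The slides do not give this breakdown, so the per-operation costs are assumptions: a complex multiply is taken as 4 real multiplies plus 2 real adds, a complex accumulate as 2 real adds, and every unit is assumed to produce one result per clock. The ~100 MHz clock is the figure quoted on the FPGA slide; treating "theoretical peak" as units x clock is also an assumption.

```python
ENGINES = 4          # from the slide: four engines, one per frequency
TIME_STEPS = 3       # from the slide: three time steps per engine
MULS_PER_CMUL = 4    # assumption: real multiplies in one complex multiply
ADDS_PER_CMUL = 2    # assumption: real adds in one complex multiply
ADDS_PER_CADD = 2    # assumption: real adds in one complex accumulate

def units_per_engine(time_steps=TIME_STEPS):
    """Floating-point units one engine would need under the assumptions above."""
    muls = time_steps * MULS_PER_CMUL
    adds = time_steps * (ADDS_PER_CMUL + ADDS_PER_CADD)
    return muls + adds

def peak_flops(units, clock_hz):
    """Peak rate if every unit finishes one operation per clock (assumption)."""
    return units * clock_hz

total_units = ENGINES * units_per_engine()   # 4 engines x 24 units
peak = peak_flops(total_units, 100e6)        # ~100 MHz clock
sustained = 0.85 * peak                      # the slide's ~85% of peak
```

Under these assumptions the unit count comes out at 96, matching the slide, and the sustained rate is roughly 8 GFLOP/s. The match does not confirm that this is how the design was actually partitioned.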
GPU Design [1]
• Works on thread parallelism; each thread executes on a pixel shader.
• CUDA uses lightweight threads.
• Created a thread for each output (plus redundant ones), each of which then integrates.
• Speedup of ~5x vs. a 3 GHz Xeon (SSE).
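The "one thread per output, plus redundant ones" idea can be emulated in plain Python. This is a sketch under assumptions rather than the CUDA kernel from [1]: an N x N grid of thread indices is assumed, threads with i >= j are treated as the redundant ones that exit immediately, and a sequential loop stands in for the parallel launch. All names are hypothetical.

```python
def thread_body(tid, n, spectra, out):
    """Work done by one emulated thread for a single frequency channel.

    spectra[t][a] is the complex spectrum value of antenna a at time step t.
    """
    i, j = divmod(tid, n)      # map the flat thread index to an antenna pair
    if i >= j:
        return                 # redundant thread: no output assigned to it
    # Multiply the pair at every time step and integrate the products.
    out[(i, j)] = sum(s[i] * s[j].conjugate() for s in spectra)

def launch(n, spectra):
    """Emulate launching an n x n grid; on a GPU these would run in parallel."""
    out = {}
    for tid in range(n * n):
        thread_body(tid, n, spectra, out)
    return out
```

Launching a full square grid wastes a little more than half of the threads, but it keeps the index arithmetic trivial, and since the threads are lightweight the idle ones cost little.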
Findings
• The GPU vs. the Nallatech FPGA
  • The GPU required considerably less effort
  • Performed better
  • Much cheaper (~20x)
  • Still a lot of room to squeeze out more performance (Chris Harris)
• In defense of FPGAs
  • Virtex-5 can achieve higher clock rates (up to 500 MHz)
  • The 96 multipliers on the V4LX100 are not enough; the V5SX240 has 1,056
  • About 25% of the time was spent on transfers over the older PCI-X bus
  • More power efficient
References
• [1] Chris Harris et al., The University of Western Australia (UWA), "GPU Accelerated Radio Astronomy Signal Convolution", Experimental Astronomy, 2008.