FPGA based Acceleration of Linear Algebra Computations • B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan
Outline • Double Precision Dense Matrix-Matrix Multiplication • Motivation • Related Work • Algorithm • Design • Results • Conclusions • Double Precision Sparse Matrix-Vector Multiplication • Introduction • Prasanna • DeLorimier • David Gregg et al. • What can we do?
FPGA based Double Precision Dense Matrix-Matrix Multiplication.
Motivation • FPGAs have been making inroads into high-performance computing (HPC). • Accelerating BLAS Level 3 is achieved largely by accelerating matrix-matrix multiplication. • Modern FPGAs provide an abundance of resources; we must capitalise on these.
Related Work {1/2} • The two main prior works are Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance. • Dou: • Optimised for a large Virtex-II Pro device (Xilinx). • Created his own MAC (not fully IEEE-compliant). • Sub-block dimensions must be powers of 2. • Optimised for low IO bandwidth.
Related Work {2/2} • Prasanna: • Scaling results in clock-speed degradation of about 35% (from 2 PEs to 20 PEs). • 2.1 GFLOPS on a Cray XD1 with Virtex-II Pros (XC2VP50). • For the design alone (XC2VP125) they report 15% clock degradation from 2 to 24 PEs. • They state that no platform-specific optimisations were made for the implemented design.
Algorithm • Broadcast 'A'; keep a unique block of 'B' per PE. • Multiply, feeding results through the multiplier pipeline. • The multiplier output is fed directly into an Adder + RAM (accumulator). • When the updated C values are ready, stream them out.
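The broadcast scheme above is a rank-one update (outer-product) formulation of matrix multiplication: each step broadcasts one column of A while every PE holds its own slice of the matching row of B and accumulates into its slice of C. A minimal software sketch of that formulation (function name and shapes are ours, for illustration only):

```python
import numpy as np

def rank_one_update_matmul(A, B):
    """C = A @ B built as a sum of rank-one updates: one outer
    product per column of A / row of B, mimicking the broadcast-A,
    per-PE-B accumulation scheme."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for p in range(k):
        # Column p of A is 'broadcast'; each PE would hold a unique
        # slice of row p of B and accumulate into its slice of C.
        C += np.outer(A[:, p], B[p, :])
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(6.0).reshape(3, 2)
assert np.allclose(rank_one_update_matmul(A, B), A @ B)
```

On hardware, each of the k accumulation steps maps to one pipeline pass; the per-PE accumulator (Adder + RAM) holds the running C slice between passes.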
FPGA Synthesis/PAR Data {1/2} • Table: Resource utilisation for SX95T and SX240T (post-PAR). • Table: Clock speed in MHz of the overall design for different numbers of PEs.
FPGA Synthesis/PAR Data {2/2} • Table: Resource utilisation for Virtex-II Pro XC2VP100 (post-PAR).
Conclusions • We propose a variation of the rank-one update algorithm for matrix multiplication. • We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA. • The two designs clearly show the effect of local storage on IO bandwidth. • The design achieved a clock speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA. We also achieve 5.3 GFLOPS on an XC2VP100.
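The 29.8 GFLOPS figure is consistent with each PE completing one multiply and one add per cycle at the reported clock; a quick sanity check (the two-flops-per-PE-per-cycle assumption is ours, not stated in the slides):

```python
pes = 40
clock_hz = 373e6
flops_per_pe_per_cycle = 2  # assumed: one multiply + one add each cycle

peak_gflops = pes * flops_per_pe_per_cycle * clock_hz / 1e9
# 40 PEs x 2 flops x 373 MHz = 29.84 GFLOPS peak,
# matching the 29.8 GFLOPS sustained figure above.
assert abs(peak_gflops - 29.84) < 0.01
```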
FPGA based Double Precision Sparse Matrix-Vector Multiplication.
Introduction • There are three main papers we will look at: • Viktor Prasanna: hybrid method using HLL + software + HDL. • Michael DeLorimier: maximum performance, but with unrealistic assumptions. • David Gregg et al.: the most realistic assumptions w.r.t. DRAM.
Prasanna • Uses pre-existing IP cores, specifically for an iterative solver (CG). • A 4-input reduction circuit performs the dot products, producing partial sums as output. • An adder loop with the array completes the dot-product summation; created using an HLL. • A reduction circuit at the end uses a binary tree to produce the final value. • The IPs are available. • DRAM is considered, but not realistically: the matrix orders are small, and DRAM is the bottleneck. • With their IPs they have a good architecture; however, one could change the IP and modify the datapath, e.g. substitute Dou's MAC.
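Prasanna's scheme splits each row's dot product across several lanes, then combines the partial sums in a tree, as the reduction-circuit bullets above describe. A software sketch of that dataflow over a CSR matrix (function name, round-robin lane assignment, and `n_pe` being a power of two are our assumptions for illustration):

```python
def spmv_dot_split(values, col_idx, row_ptr, x, n_pe=4):
    """SpMV where each row's dot product is split across n_pe lanes
    (Prasanna-style); partial sums are then combined pairwise,
    mimicking a binary reduction tree. n_pe must be a power of two."""
    y = []
    for r in range(len(row_ptr) - 1):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        partial = [0.0] * n_pe
        for k in range(lo, hi):
            # Nonzeros of row r are dealt round-robin to the lanes.
            partial[(k - lo) % n_pe] += values[k] * x[col_idx[k]]
        # Reduction-tree stages: pairwise combine until one value remains.
        while len(partial) > 1:
            partial = [partial[i] + partial[i + 1]
                       for i in range(0, len(partial), 2)]
        y.append(partial[0])
    return y
```

For the 2x2 matrix [[1, 2], [0, 3]] in CSR form with x = [1, 1], this returns [3.0, 3.0].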
DeLorimier • Uses BRAMs for all storage. • Targets an iterative solver, specifically CG. • The MAC requires interleaving. • Load balancing is done in their partitioner, which requires a communication stage and is very matrix/partitioner dependent. • Communication is the bottleneck. • Performance: 750 MFLOPS per processor on 16 Virtex-II 6000s, each with 5 PEs + 1 CE.
David Gregg et al. (SPAR) • They only report the use of the SPAR architecture on FPGAs. • They use very pessimistic DRAM access times, with emphasis on removing cache misses. • They do not use their Block RAMs well; maybe something interesting can be done here. • 128 MFLOPS for 3 parallel SPAR units; removing cache misses gives a peak of 570 MFLOPS.
What can we do? • Both use CSR; this is not required, so why not modify the representation? • Two approaches, which we can try simultaneously: • Prasanna: split across dot products (same row over many PEs). • DeLorimier: split across rows (many rows per PE). • Use the data from SPAR; it is a viable approach. Both do zero multiplies, whereas we can get away with one zero multiply per column. • Minimise communication or overlap it; interleaving helps here: while one stage computes, the previous one communicates.
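The alternative DeLorimier-style partitioning named above assigns whole rows to PEs, so each PE computes complete dot products independently. A minimal sketch over the same CSR arrays (function name and round-robin row assignment are illustrative choices, not the authors' partitioner):

```python
def spmv_row_split(values, col_idx, row_ptr, x, n_pe=2):
    """SpMV with rows partitioned across n_pe PEs (DeLorimier-style):
    each PE owns a subset of rows and computes their full dot products,
    so no inter-PE reduction is needed -- only the final gather."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for pe in range(n_pe):
        # Round-robin row assignment; a real partitioner would balance
        # by nonzero count instead.
        for r in range(pe, n_rows, n_pe):
            for k in range(row_ptr[r], row_ptr[r + 1]):
                y[r] += values[k] * x[col_idx[k]]
    return y
```

Contrasting the two splits: the dot-product split needs a reduction stage per row but balances work within long rows, while the row split avoids reductions but is sensitive to uneven row lengths, which is why the load-balancing partitioner matters.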
Thank You