Technion – Israel Institute of Technology Department of Electrical Engineering

Inverse Matrix Accelerator
Final Presentation
Project no. D0623, Spring 2004
High Speed Digital Systems Lab
Written by: Haim Natan, Benny Pano
Supervisor: Gregory Mironov

Project Background
Complex computations are usually run on a standard processor or a DSP, neither of which is optimized for matrix inversion. To reduce the time spent on matrix inversion, we use dedicated hardware for it: the faster hardware performs the computation and the CPU is left free for other tasks.

Project Goal
Design and implement FPGA circuitry that inverts a 625x625 matrix.

Project Requirements
• A standalone system
• The matrix is of size 625x625
• Matrix elements are 64-bit double-precision floating-point numbers
• Calculation time < 20 ms

Suggested Solutions
• Two algorithms were considered:
  • Linear algorithm of order O(N^3)
  • Monte-Carlo algorithm of order O(N^2)
• The selected hardware was the Virtex II Pro
• The selected algorithm was Monte-Carlo

The Monte-Carlo Algorithm (simplified version)
N – number of Markov chains
T – length of each chain
b – an element of the inverse matrix
MP() – a chain generator

    b_{i,j} := 0;
    For c := 1 to N do {
        k_0 := i; w_0 := 1;
        For t := 1 to T do {
            k_t := MP(k_{t-1});
            w_t := sign(d_{k_{t-1},k_t}) * w_{t-1} * E_{k_t};
            if k_t = j then b_{i,j} += w_t;
        }
    }
    b_{i,j} /= N;

The MC Algorithm (continued)
• D = I - A
• E_i = Σ_j |d_{i,j}| (the weights vector)
• P is a transition probability matrix with p_{i,j} = |d_{i,j}| / E_i, used for generating the Markov chains.
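The three definitions above can be sketched in a few lines of Python. The function name `build_tables` and the 2x2 input matrix are illustrative assumptions, not taken from the project, and the sketch assumes every E_i is non-zero.

```python
def build_tables(a):
    """Return (d, e, p) for a square matrix a given as a list of rows."""
    n = len(a)
    # D = I - A
    d = [[(1.0 if i == j else 0.0) - a[i][j] for j in range(n)]
         for i in range(n)]
    # E_i = sum over j of |d_ij| (the weights vector)
    e = [sum(abs(x) for x in row) for row in d]
    # p_ij = |d_ij| / E_i, so every row of P sums to 1 (assumes E_i != 0)
    p = [[abs(x) / e[i] for x in row] for i, row in enumerate(d)]
    return d, e, p


# Hypothetical input matrix, chosen only to give round numbers.
A = [[5.0, -4.0], [-3.0, -2.0]]
D, E, P = build_tables(A)
print(D)  # [[-4.0, 4.0], [3.0, 3.0]]
print(E)  # [8.0, 6.0]
print(P)  # [[0.5, 0.5], [0.5, 0.5]]
```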

A Small Demonstration
[Slide figure: the matrices A, D, E and P]

t   rand#   k_t   w_t    b_{1,2}
0   none    1     1      0
1   0.2     1     -8     0
2   0.9     2     -48    -48
3   0.49    1     -384   -48
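The demonstration can be replayed in software as a sanity check. The sketch below follows one chain with the three random numbers from the table and applies the update rule from the algorithm slide. The matrices D and P are hypothetical: they were picked so that E = (8, 6), as in the architecture demonstration, and so that the chain visits the same states. The way `mp` maps a random number to a state (inverse-CDF sampling over a row of P) is also an assumption.

```python
def mp(p_row, r):
    """Chain generator: pick the next state (1-based) from one row of P."""
    acc = 0.0
    for idx, prob in enumerate(p_row):
        acc += prob
        if r < acc:
            return idx + 1
    return len(p_row)


def replay(d, e, p, i, j, rands):
    """Follow one chain using the given random numbers; return table rows."""
    k, w, b = i, 1.0, 0.0
    rows = [(0, None, k, w, b)]
    for t, r in enumerate(rands, start=1):
        k_next = mp(p[k - 1], r)
        sign = 1.0 if d[k - 1][k_next - 1] >= 0 else -1.0
        # Rule from the algorithm slide: multiply by E of the new state.
        w = sign * w * e[k_next - 1]
        k = k_next
        if k == j:
            b += w  # raw accumulation, before the final division by N
        rows.append((t, r, k, w, b))
    return rows


# Hypothetical matrices consistent with the table (not from the slides).
D = [[-4.0, 4.0], [3.0, 3.0]]
E = [8.0, 6.0]
P = [[0.5, 0.5], [0.5, 0.5]]

rows = replay(D, E, P, i=1, j=2, rands=[0.2, 0.9, 0.49])
for row in rows:
    print(row)
```

The b column here is the running sum for a single chain; the algorithm averages over N chains at the end.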

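Algorithm's Architecture
[Block diagram: a row of T MP blocks fed with k = i; under each MP block a column of SW blocks, one row per weight E1 ... En; under each column an A block, with the row of A blocks fed with 0 and producing b_{i,j}.]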

Switch & Accumulator

SW (switch): inputs Kin, Tin, Ein, Rin; outputs Kout, Tout, Eout, Rout
    Eout = Ein
    Rout = Rin
    Kout = Kin
    If Rin = Kin Then Tout = Ein Else Tout = Tin

A (accumulator): inputs Kin, Tin, Win, Cin, Vin; outputs Wout, Cout, Vout; internal signal Wint
    Cout = Cin
    Wout = Win * Tin
    Wint = Wout
    If Cin = Kin Then Vout = Vin + Wint Else Vout = Vin
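A small behavioral model of the two blocks, written directly from the equations above, may help when checking the VHDL. The helper `stage`, which chains one column of switches into one accumulator, and the initial T value of 0 are assumptions. Sign handling is not part of the listed equations, so the model simply multiplies whatever signed W it is given.

```python
def switch(k_in, t_in, e_in, r_in):
    """SW block: pass E, R and K through; select E as T on a row match."""
    t_out = e_in if r_in == k_in else t_in
    return {"K": k_in, "T": t_out, "E": e_in, "R": r_in}


def accumulator(k_in, t_in, w_in, c_in, v_in):
    """A block: update the chain weight, add it to V on a column match."""
    w_out = w_in * t_in
    w_int = w_out
    v_out = v_in + w_int if c_in == k_in else v_in
    return {"C": c_in, "W": w_out, "V": v_out}


def stage(k, weights, w_in, c, v_in, t_init=0.0):
    """Hypothetical helper: one column of switches feeding one accumulator."""
    t = t_init
    for r, e in enumerate(weights, start=1):
        t = switch(k, t, e, r)["T"]
    return accumulator(k, t, w_in, c, v_in)


# Second and third steps of the architecture demonstration (E1 = 8, E2 = 6).
second = stage(k=2, weights=[8.0, 6.0], w_in=-8.0, c=2, v_in=0.0)
third = stage(k=1, weights=[8.0, 6.0], w_in=second["W"], c=2, v_in=second["V"])
print(second)  # {'C': 2, 'W': -48.0, 'V': -48.0}
print(third)   # {'C': 2, 'W': -384.0, 'V': -48.0}
```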

Architecture Demonstration
Initial values: k = 1, b_{1,2} = 0; weights E1 = 8, E2 = 6

Stage       1     2     3
MP Kout     1     2     1
SW Tout     8     6     8
A Wout      -8    -48   -384
A Vout      0     -48   -48

Basic Block Diagram
[Block diagram: an FPGA containing the Algorithm block and the Memory Controller, connected to a RAM holding A and B. Labeled links: elements request, elements transfer, read/write.]

Some scales
• 64 bit * 625 * 625 ≈ 3 MB
• Two matrices needed → 6 MB
• 20 [msec] / (625^2) = 51.2 [nsec] per matrix element → 20 MHz
• Considering an O(n^3) algorithm → 12.2 [GHz]
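The figures above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the inputs are the matrix size, element width and time limit from the project requirements, and the rates assume one elementary operation per clock cycle.

```python
N = 625                 # matrix dimension
BYTES_PER_ELEMENT = 8   # 64-bit double precision
DEADLINE_S = 20e-3      # required calculation time

matrix_bytes = BYTES_PER_ELEMENT * N * N
time_per_element_s = DEADLINE_S / N**2
rate_n2_hz = N**2 / DEADLINE_S   # one operation per matrix element
rate_n3_hz = N**3 / DEADLINE_S   # O(n^3) algorithm

print(f"one matrix:   {matrix_bytes / 1e6:.3f} MB")
print(f"two matrices: {2 * matrix_bytes / 1e6:.2f} MB")
print(f"per element:  {time_per_element_s * 1e9:.1f} ns")
print(f"O(n^2) rate:  {rate_n2_hz / 1e6:.2f} MHz")
print(f"O(n^3) rate:  {rate_n3_hz / 1e9:.1f} GHz")
```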

Encountered obstacles
• Studying the Monte-Carlo algorithm and some of its mathematical basics.
• The architecture requires a lot of FPGA cells.
• Finding a floating-point library and adjusting it to our needs.
• Getting to know all the software used in FPGA development.

Encountered obstacles (cont.)
• The floating-point units have a long delay (130 ns for the division unit alone).
• The Monte-Carlo algorithm needs delicate tuning and many iterations to achieve reasonable accuracy.
• A very wide bus is needed in order to transfer the matrix elements.

Project achievements
• Studied the Monte-Carlo algorithm and its architecture.
• Wrote a C simulation in order to check the Monte-Carlo method.
• Studied the VHDL language.
• Found a floating-point library and adjusted it to the project needs.
• Ran a simulation for the floating-point unit.

Project achievements (cont.)
• Implemented the switch and accumulator blocks in VHDL.
• Implemented a basic chain using the switch and accumulator blocks.
• Implemented a circuit that uses the floating-point library and loaded it onto the Virtex II Pro (V2P).

Things to do
• Implement the MP block, the memory controller and the computation control circuit.
• Reduce the floating-point unit delays.
• Design a communication interface to load and send the matrix.
