
Implementation of Polymorphic Matrix Inversion using Viva


Presentation Transcript


  1. Implementation of Polymorphic Matrix Inversion using Viva Arvind Sudarsanam, Dasu Aravind Utah State University

  2. Overview • Problem definition • Matrix inverse algorithm • Types of Polymorphism • Design Set-up • Hardware design flow (For LU Decomposition) • Results • Conclusions MAPLD2005/171

  3. Problem Definition • Given a 2-D matrix, A[N][N],
     A = | A[1,1] A[1,2] A[1,3] …… A[1,N] |
         | A[2,1] A[2,2] A[2,3] …… A[2,N] |
         | A[3,1] A[3,2] A[3,3] …… A[3,N] |
         |   ⋮      ⋮      ⋮          ⋮   |
         | A[N,1] A[N,2] A[N,3] …… A[N,N] |
     determine the inverse matrix A⁻¹, defined by A × A⁻¹ = I. MAPLD2005/171

  4. Algorithm flow • Step 1: LU Decomposition • Matrix A is split into two triangular matrices, L and U
     For i = 1:N
       For j = i+1:N
         A(j,i) = A(j,i) / A(i,i);
         A(j,(i+1):N) = A(j,(i+1):N) - A(j,i)*A(i,(i+1):N);
       End For j
     End For i
     MAPLD2005/171
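  For reference, the same in-place update can be sketched in plain Python; this is a minimal sketch, and the function name and the use of NumPy are illustrative rather than part of the Viva design:

     import numpy as np

     def lu_decompose_inplace(A):
         # In-place LU decomposition without pivoting, mirroring slide 4:
         # after the call, the strict lower triangle of A holds L (unit
         # diagonal implied) and the upper triangle holds U.
         N = A.shape[0]
         for i in range(N):
             for j in range(i + 1, N):
                 A[j, i] = A[j, i] / A[i, i]          # multiplier L(j,i)
                 A[j, i+1:] -= A[j, i] * A[i, i+1:]   # update rest of row j
         return A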

  5. Algorithm flow • Step 2: Inverse computation for triangular matrices • L⁻¹ and U⁻¹ are computed using a variation of Gaussian elimination
     For i = 1:N
       For j = i+1:N
         Linv(j,i+1:N) = Linv(j,i+1:N) - L(j,i)*Linv(i,i+1:N);
       End For j
     End For i
     MAPLD2005/171
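  A plain-Python illustration of Step 2 for the lower triangular factor, assuming a unit diagonal; this is one standard Gaussian-elimination formulation and its indexing may differ in detail from the slide (U⁻¹ is handled analogously):

     import numpy as np

     def unit_lower_inverse(L):
         # Invert a unit lower triangular matrix by applying to the
         # identity the same row operations that eliminate L column by
         # column: row_j -= L(j,i) * row_i.
         N = L.shape[0]
         Linv = np.eye(N)
         for i in range(N):
             for j in range(i + 1, N):
                 Linv[j, :] -= L[j, i] * Linv[i, :]
         return Linv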

  6. Algorithm flow • Step 3: Matrix multiplication • L⁻¹ and U⁻¹ are multiplied together to generate A⁻¹ (since A = LU, A⁻¹ = U⁻¹ × L⁻¹)
     For i = 1:N
       For j = 1:N
         For k = 1:N
           Ainv[i,j] = Ainv[i,j] + Uinv[i,k]*Linv[k,j]
         End For k
       End For j
     End For i
     MAPLD2005/171
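  The same multiplication as a plain-Python triple loop (illustrative names; equivalent to Uinv @ Linv in NumPy):

     import numpy as np

     def combine_inverse(Uinv, Linv):
         # Step 3: accumulate A^-1 = U^-1 * L^-1 element by element.
         N = Uinv.shape[0]
         Ainv = np.zeros((N, N))
         for i in range(N):
             for j in range(N):
                 for k in range(N):
                     Ainv[i, j] += Uinv[i, k] * Linv[k, j]
         return Ainv

  As a quick end-to-end check of the three sketches (a hypothetical test, not from the slides): decompose a copy of a well-conditioned A, split the result into L and U, invert both triangular factors, combine them, and verify that Ainv @ A is close to the identity.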

  7. Types of Polymorphism • The following parameters can be varied for the input matrix: • Data type – variable precision, signed/unsigned, and float • Information rate – the rate at which data arrives into, and leaves, the system (pipelining/parallelism) • Order tensor – matrix size (16×16, 32×32, etc.) MAPLD2005/171

  8. Polymorphism and Viva • Viva supports polymorphic hardware implementation, just as a software programming language does. • A large library of polymorphic arithmetic, control and memory modules is available. MAPLD2005/171

  9. Data Type Polymorphism [figure: polymorphic Viva object] MAPLD2005/171

  10. Information Rate Polymorphism • Clock speed can be changed based on the input data rate. • The ‘Mul’ unit shown is a truly polymorphic object: based on the input list size, the Viva compiler will generate the required number of parallel multiplier units. • The number of parallel units is denoted ‘K’. MAPLD2005/171
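  A software analogue of this behaviour, assuming two K-element input lists (a sketch only; Viva expands the object at compile time into K hardware multipliers rather than looping at run time):

     def parallel_mul(xs, ys):
         # Element-wise product of two equal-length lists; each position
         # corresponds to one of the K parallel multiplier units.
         assert len(xs) == len(ys)
         return [x * y for x, y in zip(xs, ys)]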

  11. Order Tensor Polymorphism • The value of ‘N’ is set at run time. MAPLD2005/171

  12. Design Flow – Top level block diagram [block diagram: data read From Files; Memory Units for A, L, U, L⁻¹, U⁻¹ and A⁻¹; a Central Control Unit (CCGU); and Loop Units driving the processing stages LU Decompose, Inverse of L, Inverse of U, and U⁻¹ × L⁻¹] MAPLD2005/171

  13. Design Flow MAPLD2005/171

  14. Design Flow MAPLD2005/171

  15. Hardware Design Set-up • Hardware: PE6 (Xilinx 2V6000 FPGA) of the Starbridge Hypercomputer, connected to an Intel x86 processor. (66 MHz / 33,768 Slices) • Software: Viva 2.3, developed at Starbridge Systems MAPLD2005/171

  16. Implementation – LU Decomposition [block diagram: the Loop Unit supplies indices i, j, k to the Address Generation Unit and the Computation Unit; the Memory Unit delivers the A[j,()] and A[i,()] blocks and the A[j,i] and A[i,i] elements to the Computation Unit, which writes back the updated A[j,()] and A[j,i]] MAPLD2005/171

  17. Loop Unit - Functionality Given the order of the matrix ‘N’ and the parallelism to be supported ‘K’, the following loop structure needs to be generated.
     For i = 1 to N
       For k = ((i-1)/K)*K to N+1-K in steps of K
         For j = i to N
           Generate(i,k,j);
         End j
       End k
     End i
     MAPLD2005/171
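  A direct Python transliteration of this loop nest (1-based indices are kept to match the slide; reading the division in the k bound as an integer, block-aligning division is an assumption):

     def loop_indices(N, K):
         # Yield the (i, k, j) triples produced by the Loop Unit.
         # N is the matrix order, K the parallelism (powers of two).
         for i in range(1, N + 1):
             # k steps over K-aligned block starts, from the block that
             # contains column i up to N+1-K inclusive.
             for k in range(((i - 1) // K) * K, N + 2 - K, K):
                 for j in range(i, N + 1):
                     yield i, k, j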

  18. Loop Unit - Architecture A simple register-based implementation is shown. The overall latency is 2 clock cycles. MAPLD2005/171

  19. Memory Unit - Distribution [figure: distribution of one block over the BRAMs] MAPLD2005/171

  20. Memory Unit - Architecture • BRAM memories are used to store data internally. (The matrix is expected to fit into the BRAMs; the maximum value of N is 128.) • There are ‘K’ individual BRAMs, each [(N×N)/K] words deep and as wide as the (variable) data size. • The ‘K’ values in each block of the matrix are distributed over the ‘K’ BRAMs. This results in a single-clock access time for internal memory. • A[j,()] and A[j,i] will be fetched one after the other on every iteration. • The overall latency was found to be 3 clock cycles. MAPLD2005/171
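  One plausible element-to-BRAM mapping that matches this description (the exact interleaving used in the design is not given on the slide, so the formula below is an assumption):

     def bram_location(row, col, N, K):
         # Map matrix element (row, col), 0-based, onto (BRAM index,
         # word address). Consecutive elements of a K-wide block land
         # in different BRAMs, so a whole block is read in one cycle.
         flat = row * N + col        # row-major element index
         return flat % K, flat // K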

  21. Address Generation - Functionality • Inputs: i,j,k from the Loop Unit • Outputs: Address in the BRAM for the A[j,()] and A[i,()] blocks of data • Address in the BRAM of A[j,i] and A[i,i] • The computations have been organized in such a way that A[i,()] needs to be fetched only once for processing a complete column of blocks. • Thus, only one port is required to access both A[i,()] and A[j,()] MAPLD2005/171

  22. Address Generation - Architecture ‘Shift’ used instead of multipliers: N,K assumed to be powers of 2. (Latency = 1 cc) MAPLD2005/171
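  Because N and K are powers of two, the multiplications and divisions in the address computation reduce to shifts; a small sketch of the idea (the exact address formula is an assumption, since the slide only states that shifts replace multipliers):

     def block_word_address(row, block_col, N, K):
         # addr = (row*N + block_col*K) / K, computed with shifts only.
         log2_N = N.bit_length() - 1
         log2_K = K.bit_length() - 1
         flat = (row << log2_N) + (block_col << log2_K)
         return flat >> log2_K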

  23. Computation Units - Functionality • Inputs: A[j,()] and A[i,()] blocks from the BRAM unit; A[j,i] and A[i,i] from the BRAM unit; indices i, j, k from the loop unit. • Output: the modified A[j,()] block and the A[j,i] value. • Three steps are performed: • Modify A[i,()] based on the loop indices • Perform the computations: divide, multiply, subtract • Insert A[j,i] into A[j,()] if required MAPLD2005/171
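  The core arithmetic of one computation-unit step can be sketched as follows (the index-dependent masking of A[i,()] and the re-insertion of A[j,i] are omitted; names are illustrative):

     def update_block(Aj_block, Ai_block, Aji, Aii):
         # One inner-loop step of LU decomposition on a K-wide block:
         # divide to form the multiplier, then multiply and subtract.
         lji = Aji / Aii
         new_block = [aj - lji * ai for aj, ai in zip(Aj_block, Ai_block)]
         return new_block, lji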

  24. Computation Units – Architecture (K=8) MAPLD2005/171

  25. Results for LUD – Slice Counts (N=16) Number of ROM multipliers used shown in brackets. MAPLD2005/171

  26. Results for LUD – Time Taken (in cycles) MAPLD2005/171

  27. Time taken Vs Size of Matrix (Fix16, K = 8) Equivalent ‘C’ code (N=128, Fix16) takes O(M·N³) time ≈ 702545·M ns, where ‘M’ is the number of cycles per iteration (≈ 30), on an Intel Centrino at 1.5 GHz; this corresponds to a speed-up of ~ M/6 for the FPGA implementation. MAPLD2005/171

  28. Conclusions • A polymorphic design for matrix inverse was implemented • Data type – Float/Fix16/Fix32 • Information rate (K) – 4/8/16 • Order Tensor (N) – 16/32/64/128 • Viva’s effectiveness in polymorphic implementation was evaluated. • The hardware design flow and results were shown for LU Decomposition. MAPLD2005/171

  29. Lessons learned • Pseudo polymorphism • Some of the polymorphic objects in the Viva library are pseudo-polymorphic, e.g. the floating-point and fixed-point implementations of the adder unit. • Need for a timing analysis tool • It was difficult to compute the delays associated with each block in the Viva library. • Fix32 vs. Float • The division unit in the Viva library is optimized for floating point and not for fixed point (as shown in the results). MAPLD2005/171
