A Concurrent Matrix Transpose Algorithm

A Concurrent Matrix Transpose Algorithm Pourya Jafari

Application • Frequently Used Linear Algebra Operation • Scientific Applications • FFT • Matrix Multiplication

Transpose Matrix • $B_{ij}$: item/cell at row i and column j of matrix B • $B = A^{T}$ • For all i, j we have $B_{ij} = A_{ji}$ • Simply exchange rows and columns • For simplicity we only consider square matrices • N rows and N columns, labeled 0 to N-1

An Example • Each cell is filled with its row|column number • 6 swaps for a 4×4 matrix: (4*4 – 4)/2 = 6 • In general, a square matrix of size N needs $(N^2 - N)/2$ swaps
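The swap count above can be checked with a minimal sequential sketch. The function name `transpose_in_place` and the list-of-lists layout are illustrative assumptions, not part of the slides.

```python
def transpose_in_place(a):
    """Transpose a square matrix (list of lists) in place; return the swap count."""
    n = len(a)
    swaps = 0
    for i in range(n):
        for j in range(i + 1, n):  # only cells strictly above the diagonal
            a[i][j], a[j][i] = a[j][i], a[i][j]
            swaps += 1
    return swaps


# Fill each cell with its "row|column" label, as in the example slide.
n = 4
m = [[f"{i}|{j}" for j in range(n)] for i in range(n)]
count = transpose_in_place(m)
```

Every off-diagonal pair is visited once, which is where the $(N^2 - N)/2$ count comes from.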

Parallelizing • Naïve algorithm • A thread for each swap • Quadratic number of threads • Quadratic number of communication links • → impractical

Parallelizing - 2 • More efficient Way • Assign a column to each thread • O(N) threads • Communication links? • Depends on the approach

Measure dislocation • A single swap operation can be expressed as row and column shifts • The cell at (i, j) must move from column j to column i, a shift of length K • i = j + K → K = i - j • Shift length is i - j (taken modulo N); value range is from 0 to N-1

Concurrency Scheme • Minimize communication • Pre-process inside thread • Shift each row • Intra-process/thread communication • Shift each column • Post-process inside thread • Shift each row again

Concurrency Scheme - 2 • We have the row shifts fixed based on row index • Has range 0 to N-1, • consistent with our initial finding • Now arrange the rows so that the row shift takes each cell to column i: the cell must sit in row i' = i - j • i + L = i' → L = i' - i = -j • So we shift each column j by j cells up

Steps so far • 1 → 2: Column shift j up • 2 → 3: Row shift based on row indices • 3 → 4: ? • Change of indices so far • (i, j) → (i - j, j) → (i - j, (i - j) + j) = (i - j, i) = (m, n) • One operation to change row index to j • n - m = i - (i - j) = j • Figure panels: (1), (2-a), (2-b), (3), (4)
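The index changes listed above can be traced with a small sequential sketch. It assumes all shifts are cyclic (indices taken modulo N) and reads the open step 3 → 4 as "inside column n, move row m to row n - m"; both are interpretations of the slide. The name `shift_transpose` is hypothetical.

```python
def shift_transpose(a):
    """Transpose a square matrix using column shift, row shift, in-column remap."""
    n = len(a)

    # Step 1 -> 2: shift column j up by j cells: (i, j) -> (i - j, j).
    b = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            b[(i - j) % n][j] = a[i][j]

    # Step 2 -> 3: shift row m right by m cells: (m, j) -> (m, j + m).
    c = [[None] * n for _ in range(n)]
    for m in range(n):
        for j in range(n):
            c[m][(j + m) % n] = b[m][j]

    # Step 3 -> 4 (assumed reading): inside column col, move row m to row col - m.
    d = [[None] * n for _ in range(n)]
    for m in range(n):
        for col in range(n):
            d[(col - m) % n][col] = c[m][col]
    return d


example = [[4 * i + j for j in range(4)] for i in range(4)]
result = shift_transpose(example)
```

In a column-per-thread layout, the first and last steps touch only one column each, so under this reading only the middle row shift would need communication between threads.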

Efficiency of algorithm so far • O(N) row and column operations • O(N^2) overall, considering both rows and columns • O(N) communication links • Communication is a major bottleneck • Group row shifts • Reduce communication and overall complexity

Radix Representation • Radix r • Base r numbers • For each digit place k (starting from the least significant) • For steps l from 0 to r-1 • Group all row shifts for the current step • Radix 3 • Possible digits 0, 1 and 2 • Second loop { For l=0 to 2 } • Shift all rows whose index has l in its kth digit place by l*r^k to the right
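The grouping of row shifts by digit can be sketched as follows. This is a sequential illustration under the assumption that shifts are cyclic; the function name `radix_row_shifts` and the returned round counter are hypothetical additions for checking the idea.

```python
def radix_row_shifts(rows, r):
    """Cyclically shift row m right by m cells, grouping shifts by radix-r digits.

    Returns the shifted rows and the number of grouped shift rounds used.
    """
    n = len(rows)
    rows = [list(row) for row in rows]
    rounds = 0
    k = 0
    while r ** k < n:                      # one pass per digit place, LS first
        for l in range(1, r):              # l = 0 means "no shift", so skip it
            step = (l * r ** k) % n
            for m in range(n):
                if (m // r ** k) % r == l and step:
                    # Right shift by `step`: the last `step` cells wrap to the front.
                    rows[m] = rows[m][-step:] + rows[m][:-step]
            rounds += 1
        k += 1
    return rows, rounds


start = [list(range(9)) for _ in range(9)]
shifted, rounds = radix_row_shifts(start, 3)
```

Because the digits of m weighted by r^k sum to m, each row ends up shifted by its own index, while rows sharing a digit value move together in one round.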

Special Case: Radix-2 • Only two steps, 0 and 1 • We only shift for 1 • Digits are the bit representation of the row index • Shift all rows whose index has its kth bit on, by 2^k cells • Figure: the shift for each row is the sum of its k=0 and k=1 shifts

Algorithm complexity • Depends on r (radix) • $C_1 = (r-1)[\log_r N]$ • $C_2 = b(r-1)[N/r][\log_r N]$ • Special cases • r=2 • Important when communication cost is high • Good when message size is small • r=N • Good when message size is large • Best value depends on communication costs, message size, communication link performance, number of ports, etc.
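To get a feel for the trade-off, the two expressions can be evaluated for a few radices. This sketch assumes the square brackets denote rounding up and treats b as a per-message size factor; the slide defines neither, and the helper names `digit_places` and `costs` are hypothetical.

```python
def digit_places(n, r):
    """Smallest w with r**w >= n, i.e. log_r(n) rounded up, without floats."""
    if r < 2:
        raise ValueError("radix must be at least 2")
    w = 0
    while r ** w < n:
        w += 1
    return w


def costs(n, r, b=1):
    """Evaluate C1 and C2 from the slide, assuming [x] means x rounded up."""
    w = digit_places(n, r)
    c1 = (r - 1) * w
    c2 = b * (r - 1) * -(-n // r) * w   # -(-n // r) is integer ceil(n / r)
    return c1, c2


# Evaluate a few radices for N = 64.
table = {r: costs(64, r) for r in (2, 4, 8, 64)}
```

Under these assumptions, r=2 gives the smallest C1 and r=N the smallest C2 for N = 64, which matches the two special cases listed above.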

Radix vs. message size vs. index time for 64 processors
