A Concurrent Matrix Transpose Algorithm. Pourya Jafari. Application. Frequently Used Linear Algebra Operation Scientific Applications FFT Matrix Multiplication. Transpose Matrix. $B_{i,j}$: item/cell at row i and column j of matrix B. For all i, j we have $(B^T)_{i,j} = B_{j,i}$.
A Concurrent Matrix Transpose Algorithm Pourya Jafari
Application • Frequently Used Linear Algebra Operation • Scientific Applications • FFT • Matrix Multiplication
Transpose Matrix • $B_{i,j}$: item/cell at row i and column j of matrix B • For all i, j we have $(B^T)_{i,j} = B_{j,i}$ • Simply exchange rows and columns • For simplicity we only consider square matrices • N rows and N columns, labeled 0 to N-1
An Example • Each cell is filled with its row|column number • 6 swaps, (4*4 – 4)/2 = 6 • In general, for a square matrix of size N we have $(N^2 - N)/2 = N(N-1)/2$ swaps
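The swap count can be checked with a short sequential sketch. The function name `transpose_in_place` is illustrative and not from the slides; it swaps every cell above the diagonal with its mirror cell and counts the swaps.

```python
def transpose_in_place(a):
    """Transpose a square list-of-lists matrix in place; return the swap count."""
    n = len(a)
    swaps = 0
    for i in range(n):
        # Only visit cells above the diagonal; each pairs with one cell below it.
        for j in range(i + 1, n):
            a[i][j], a[j][i] = a[j][i], a[i][j]
            swaps += 1
    return swaps


if __name__ == "__main__":
    # Each cell holds its row and column number, as in the slide's example.
    m = [[10 * i + j for j in range(4)] for i in range(4)]
    print(transpose_in_place(m))  # number of swaps for N = 4
```

For N = 4 the loop body runs (4*4 - 4)/2 times, matching the count on the slide.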
Parallelizing • Naïve algorithm • A thread for each swap • Quadratic number of threads • Quadratic number of communication links • → impractical
Parallelizing - 2 • A more efficient way • Assign a column to each thread • O(N) threads • Number of communication links? • Depends on the approach
Measure dislocation • View a single swap operation as row and column shifts • For a shift of length K along a row, the column index goes from j to i • i = j + K → K = i - j • Shift length is i - j; taken modulo N, its value ranges from 0 to N-1
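A minimal sketch of the dislocation measure, assuming the shifts are cyclic (taken modulo N), which is what keeps the value in the range 0 to N-1. The name `shift_length` is hypothetical.

```python
def shift_length(i, j, n):
    """Cyclic shift that moves column index j to column index i in a row of n cells.

    Assumption: shifts wrap around, so the result is reduced modulo n.
    """
    return (i - j) % n


if __name__ == "__main__":
    n = 4
    # Print the shift length needed by every cell (i, j) of a 4x4 matrix.
    for i in range(n):
        print([shift_length(i, j, n) for j in range(n)])
```

Every cell on the same wrapped diagonal needs the same shift, which is what later lets the shift be tied to a row index.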
Concurrency Scheme • Minimize communication • Pre-process inside thread • Shift each row • Intra-process/thread communication • Shift each column • Post-process inside thread • Shift each row again
Concurrency Scheme - 2 • We have the row shifts fixed based on row index • The shift has range 0 to N-1, consistent with our initial finding • Now arrange the rows so that the fixed row shifts bring each element to column i • The element must sit in row i' = i - j; with a column shift L, i + L = i' → L = -j • So we shift each column j by j cells up
Steps so far • 1 → 2: Shift column j up by j cells • 2 → 3: Row shift based on row indices • 3 → 4: ? • Change of indices so far • (i, j) → (i - j, j) → (i - j, i - j + j) = (i - j, i) = (m, n) • One operation to change the row index to j • n - m = i - (i - j) = j • [Figure: matrix states (1), (2-a), (2-b), (3), (4)]
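The index bookkeeping above can be sketched sequentially as follows. This is an illustration under stated assumptions, not the author's implementation: all shifts are assumed cyclic (modulo N), the open step 3 → 4 is filled in with the mapping "row m goes to row n - m" derived on the slide, and the names `rotate_right` and `transpose_by_shifts` are hypothetical.

```python
def rotate_right(row, k):
    """Cyclically shift a list k places toward higher indices."""
    k %= len(row)
    return row[-k:] + row[:-k] if k else list(row)


def transpose_by_shifts(a):
    """Transpose a square matrix with a column shift, a row shift, and a row remap."""
    n = len(a)
    # Step 1 -> 2: shift column j up by j cells, so (i, j) moves to (i - j, j).
    b = [[a[(m + j) % n][j] for j in range(n)] for m in range(n)]
    # Step 2 -> 3: shift row m right by m cells, so (m, j) moves to (m, j + m) = (i - j, i).
    c = [rotate_right(b[m], m) for m in range(n)]
    # Step 3 -> 4 (assumed): inside column col, row m moves to row col - m, which is j.
    d = [[None] * n for _ in range(n)]
    for m in range(n):
        for col in range(n):
            d[(col - m) % n][col] = c[m][col]
    return d


if __name__ == "__main__":
    a = [[10 * i + j for j in range(4)] for i in range(4)]
    for row in transpose_by_shifts(a):
        print(row)
```

In a concurrent setting each row or column operation would belong to one thread; the sketch keeps everything in one thread so the index arithmetic is easy to follow.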
Efficiency of algorithm so far • O(N) row and column operations • O(N^2) overall, considering both rows and columns • O(N) communication links • Communication is a major bottleneck • Group row shifts • Reduces communication and overall complexity
Radix Representation • Radix r • Base r numbers • For each digit place k (starting from the least significant) • For each step l from 0 to r-1 • Group all row shifts for the current step • Example: radix 3 • Possible digits are 0, 1 and 2 • Second loop { for l = 0 to 2 } • Shift all rows whose index has l in its kth digit place by l*r^k to the right
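The grouping loop can be sketched as below, assuming cyclic right shifts and assuming the goal is for row m to end up shifted by m cells in total. The names `rotate_right` and `radix_row_shifts` are hypothetical, and the loop over l starts at 1 because a digit of 0 contributes a shift of zero.

```python
def rotate_right(row, k):
    """Cyclically shift a list k places toward higher indices."""
    k %= len(row)
    return row[-k:] + row[:-k] if k else list(row)


def radix_row_shifts(matrix, r):
    """Shift row m right by m cells using grouped steps, one group per (digit place, digit)."""
    n = len(matrix)
    rows = [list(row) for row in matrix]
    steps = 0
    place = 1  # r**k for the current digit place k
    while place < n:
        for l in range(1, r):  # l = 0 would shift by zero, so it is skipped
            group = [m for m in range(n) if (m // place) % r == l]
            if group:
                steps += 1
            for m in group:
                # Every row in the group gets the same shift l * r**k.
                rows[m] = rotate_right(rows[m], l * place)
        place *= r
    return rows, steps


if __name__ == "__main__":
    n = 9
    matrix = [[10 * m + c for c in range(n)] for m in range(n)]
    shifted, steps = radix_row_shifts(matrix, 3)
    print(steps)
```

Calling the same function with r = 2 gives the bitwise variant, where a row is shifted by 2^k whenever bit k of its index is set.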
Special Case: Radix-2 • Only two digit values, 0 and 1 • We only shift for 1 • Digits are the bits of the row index • Shift all rows whose index has its kth bit set by 2^k • [Figure: shift for each row = step k=0 + step k=1]
Algorithm complexity • Depends on r (radix) • $C_1 = (r-1)\lceil \log_r N \rceil$ • $C_2 = b(r-1)\lceil N/r \rceil \lceil \log_r N \rceil$ • Special cases • r=2 • Important when communication cost is high • Good when message size is small • r=N • Good when message size is large • Best value depends on communication costs, message size, communication link performance, number of ports, etc.
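To compare radices numerically, the two cost terms can be evaluated with a small sketch. It assumes the brackets in the formulas are ceilings and treats b as a message-size factor; the slide does not define C1, C2 or b, so these readings and the function names are assumptions.

```python
def ceil_log(n, r):
    """Smallest k with r**k >= n, computed with integers to avoid float rounding."""
    k, power = 0, 1
    while power < n:
        power *= r
        k += 1
    return k


def costs(n, r, b=1):
    """Return (C1, C2) for matrix size n, radix r and message-size factor b (assumed meaning)."""
    digits = ceil_log(n, r)
    c1 = (r - 1) * digits
    c2 = b * (r - 1) * (-(-n // r)) * digits  # -(-n // r) is ceil(n / r)
    return c1, c2


if __name__ == "__main__":
    n = 64
    for r in (2, 4, 8, 64):
        print(r, costs(n, r))
```

For N = 64 the sketch shows the trade-off named on the slide: r = 2 gives the smallest C1 and r = N gives the smallest C2.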