Distributed Linear Algebra. Peter L. Montgomery, Microsoft Research, Redmond, USA. RSA 2000, January 17, 2000.
Role of matrices in factoring n • Sieving finds many relations xj² ≡ ∏i pi^eij (mod n). • Raise the jth relation to the power sj = 0 or 1, then multiply. • Left side is always a perfect square. Right side is a square if the exponents ∑j eij sj are even for all i. • Matrix equation Es ≡ 0 (mod 2), E known. • Knowing x² ≡ y² (mod n), test GCD(x − y, n). • Matrix rows represent primes pi. Entries are the exponents eij. Arithmetic is over GF(2).
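A toy illustration (not from the talk): take n = 1649 with primes 2 and 5. The relations 41² ≡ 2⁵ (mod 1649) and 43² ≡ 2³·5² (mod 1649) have exponent vectors (5, 0) and (3, 2), both equal to (1, 0) mod 2, so choosing s1 = s2 = 1 makes every exponent sum even: (41·43)² ≡ 2⁸·5² = (2⁴·5)² (mod 1649). With x = 41·43 mod 1649 = 114 and y = 2⁴·5 = 80, GCD(x − y, n) = GCD(34, 1649) = 17, a nontrivial factor of 1649 = 17·97.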
Matrix growth on RSA Challenge • RSA–140 (Jan-Feb 1999): matrix 4 671 181 × 4 704 451, weight 151 141 999; omit primes < 40; 99 Cray-C90 hours; 75% of 800 MB for matrix storage. • RSA–155 (August 1999): matrix 6 699 191 × 6 711 336, weight 417 132 631; omit primes < 40; 224 Cray-C90 hours; 85% of 1960 MB for matrix storage.
Regular Lanczos • A is a positive definite (real, symmetric) n×n matrix. • Given b, we want to solve Ax = b for x. • Set w0 = b. • wi+1 = Awi − Σ0≤j≤i cij wj if i ≥ 0 • cij = wjT A² wi / wjT A wj • Stop when wi+1 = 0.
Claims • wjT A wj ≠ 0 if wj ≠ 0 (A is positive definite). • wjT A wi = 0 whenever i ≠ j (by choice of cij and symmetry of A). • Eventually some wi+1 = 0, say for i = m (otherwise too many A-orthogonal vectors). • x = Σ0≤j≤m (wjT b / wjT A wj) wj satisfies Ax = b (the error u = Ax − b is in the space spanned by the wj’s but orthogonal to all wj, so uT u = 0 and u = 0).
Simplifying cij when i > j+1 • wjT A wj cij = wjT A² wi = (Awj)T (Awi) = (wj+1 + linear comb. of w0 to wj)T (Awi) = 0 (A-orthogonality). • Recurrence simplifies to wi+1 = Awi − cii wi − ci,i−1 wi−1 when i ≥ 1. • Little history to save as i advances.
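For orientation, here is a minimal sketch (not from the talk) of regular Lanczos in C, using the three-term recurrence above and accumulating the solution x = Σ (wjT b / wjT A wj) wj from the Claims slide; dense double-precision storage and all names are assumptions made for clarity, not how the factoring matrices are held.

```c
/*
 * Minimal regular Lanczos solver for a dense symmetric positive definite
 * matrix.  Uses the simplified recurrence
 *   w[i+1] = A*w[i] - c_ii*w[i] - c_{i,i-1}*w[i-1]
 * and accumulates x = sum_j (w_j^T b / w_j^T A w_j) * w_j.
 */
#include <stdlib.h>

static double dot(const double *u, const double *v, int n) {
    double s = 0.0;
    for (int k = 0; k < n; k++) s += u[k] * v[k];
    return s;
}

static void matvec(const double *A, const double *v, double *out, int n) {
    for (int r = 0; r < n; r++) {
        double s = 0.0;
        for (int k = 0; k < n; k++) s += A[r * n + k] * v[k];
        out[r] = s;
    }
}

/* Solve Ax = b; A is n-by-n, symmetric positive definite, row-major. */
void lanczos_solve(const double *A, const double *b, double *x, int n) {
    double *w_prev  = calloc(n, sizeof *w_prev);   /* w[i-1]                */
    double *w       = malloc(n * sizeof *w);       /* w[i]                  */
    double *Aw      = malloc(n * sizeof *Aw);      /* A * w[i]              */
    double *Aw_prev = calloc(n, sizeof *Aw_prev);  /* A * w[i-1]            */
    double *w_next  = malloc(n * sizeof *w_next);  /* w[i+1]                */
    double d_prev   = 1.0;         /* w[i-1]^T A w[i-1]; dummy value at i=0 */

    for (int k = 0; k < n; k++) { w[k] = b[k]; x[k] = 0.0; }   /* w[0] = b */

    for (int i = 0; i <= n; i++) {
        matvec(A, w, Aw, n);
        double d = dot(w, Aw, n);                  /* w[i]^T A w[i]         */
        if (d < 1e-30) break;                      /* w[i] is (numerically) zero: done */

        double t = dot(w, b, n) / d;               /* coefficient of w[i] in x */
        for (int k = 0; k < n; k++) x[k] += t * w[k];

        double cii = dot(Aw, Aw, n) / d;           /* w[i]^T A^2 w[i] / w[i]^T A w[i] */
        double cip = dot(Aw_prev, Aw, n) / d_prev; /* c_{i,i-1}                        */
        for (int k = 0; k < n; k++)
            w_next[k] = Aw[k] - cii * w[k] - cip * w_prev[k];

        /* Rotate buffers; only w[i], w[i-1] and A*w[i] are needed next time. */
        double *tmp;
        tmp = w_prev;  w_prev  = w;  w = w_next;  w_next = tmp;
        tmp = Aw_prev; Aw_prev = Aw; Aw = tmp;
        d_prev = d;
    }
    free(w_prev); free(w); free(Aw); free(Aw_prev); free(w_next);
}
```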
Major operations needed • Pre-multiply wi by A. • Inner products such as wjT A wj and wjT A² wi = (Awj)T (Awi). • Add a scalar multiple of one vector to another.
Adapting to Bx = 0 over GF(2) • B is n1×n2 with n1 ≤ n2, not symmetric. Solve Ax = 0 where A = BTB. A is n2×n2. BT has small nullspace in practice. • Right side is zero, so Lanczos gives x = 0. Solve Ax = Ay instead, where y is random. • uTu and uTAu can vanish even when u ≠ 0. Solved by Block Lanczos (Eurocrypt 1995).
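A short sketch (assumptions mine: N = 64, B held in coordinate form) of how A = BTB can be applied without ever forming A, which is all Lanczos needs; a random y then supplies the right-hand side Ay, and any x with Ax = Ay gives x ⊕ y satisfying A(x ⊕ y) = 0, since subtraction over GF(2) is XOR.

```c
/*
 * Apply A = B^T B to an (n2 x 64)-bit vector over GF(2) without forming A.
 * B is stored as a list of the positions of its nonzero entries; GF(2)
 * addition is XOR.  (A sketch only; the talk's own blocked storage format
 * for B appears on a later slide.)
 */
#include <stdint.h>
#include <string.h>

struct entry { uint32_t i, j; };     /* b_ij = 1 */

void apply_A(const struct entry *B, long nnz, long n1, long n2,
             const uint64_t *v,      /* n2 words: input vector           */
             uint64_t *t,            /* n1 words: scratch, receives B*v  */
             uint64_t *out)          /* n2 words: output B^T * (B*v)     */
{
    memset(t, 0, (size_t)n1 * sizeof *t);
    for (long k = 0; k < nnz; k++)            /* t = B v     */
        t[B[k].i] ^= v[B[k].j];

    memset(out, 0, (size_t)n2 * sizeof *out);
    for (long k = 0; k < nnz; k++)            /* out = B^T t */
        out[B[k].j] ^= t[B[k].i];
}
```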
Block Lanczos summary • Let N be the machine word length (typically 32 or 64) or a small multiple thereof. • Vectors are n1×N or n2×N over GF(2). • Exclusive OR and other hardware bitwise instructions operate on N-bit data. • Recurrences are similar to regular Lanczos. • Approximately n1/(N−0.76) iterations. • Up to N independent solutions of Bx = 0.
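Concretely (taking N = 64 as an assumption), an n×N vector over GF(2) is just an array of n machine words, and vector addition is a word-wise XOR; for N = 128 one would simply use two words per row.

```c
#include <stdint.h>

typedef uint64_t gf2row;        /* one row (N = 64 bits) of an n x N vector */

/* dst += src over GF(2): addition of two n x N vectors is a word-wise XOR. */
void vec_add(gf2row *dst, const gf2row *src, long n)
{
    for (long k = 0; k < n; k++)
        dst[k] ^= src[k];
}
```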
Block Lanczos major operations • Pre-multiply an n2×N vector by B. • Pre-multiply an n1×N vector by BT. • N×N inner product of two n2×N vectors. • Post-multiply an n2×N vector by an N×N matrix. • Add two n2×N vectors. How do we parallelize these?
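Here is a sketch of the two dense kernels on this list, again assuming N = 64, with an N×N matrix stored as 64 words (one word per row). The bit scan uses the GCC/Clang __builtin_ctzll intrinsic; production codes typically replace the inner bit loops with byte-indexed table lookups.

```c
#include <stdint.h>
#include <string.h>

/* inner = W1^T * W2 over GF(2); inner[a] is row a of the 64 x 64 result. */
void inner_product(const uint64_t *W1, const uint64_t *W2,
                   uint64_t inner[64], long n)
{
    memset(inner, 0, 64 * sizeof(uint64_t));
    for (long k = 0; k < n; k++) {
        uint64_t bits = W1[k];
        while (bits) {                       /* each set bit a of W1[k] ...     */
            int a = __builtin_ctzll(bits);
            inner[a] ^= W2[k];               /* ... contributes W2[k] to row a  */
            bits &= bits - 1;
        }
    }
}

/* out = V * M over GF(2): post-multiply an n x 64 vector by a 64 x 64 matrix. */
void post_multiply(const uint64_t *V, const uint64_t M[64],
                   uint64_t *out, long n)
{
    for (long k = 0; k < n; k++) {
        uint64_t acc = 0, bits = V[k];
        while (bits) {
            int a = __builtin_ctzll(bits);   /* bit a of V[k] selects row a of M */
            acc ^= M[a];
            bits &= bits - 1;
        }
        out[k] = acc;
    }
}
```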
Assumed processor topology • Assume a g1×g2 toroidal grid of processors. • A torus is a rectangle with its top edge connected to its bottom, and its left edge to its right (a doughnut). • Need fast communication to/from immediate neighbors north, south, east, and west. • Processor names are prc, where r is taken modulo g1 and c modulo g2. • Set gridrow(prc) = r and gridcol(prc) = c.
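A tiny sketch of this naming scheme in C; the linear rank r·g2 + c is an assumption of the sketch, since the slides only require the wraparound indexing.

```c
/* Processors p_rc on a g1 x g2 torus: indices wrap modulo the grid sizes,
 * so every processor has north, south, east and west neighbours.          */
typedef struct { int r, c; } proc;

int rank_of(proc p, int g1, int g2)
{
    int r = ((p.r % g1) + g1) % g1;     /* row index modulo g1            */
    int c = ((p.c % g2) + g2) % g2;     /* column index modulo g2         */
    return r * g2 + c;                  /* one possible linear numbering  */
}

proc east(proc p)  { p.c += 1; return p; }
proc west(proc p)  { p.c -= 1; return p; }
proc north(proc p) { p.r -= 1; return p; }
proc south(proc p) { p.r += 1; return p; }
```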
[Figure: example of a 3×3 torus of processors P1 through P9.]
Matrix row and column guardians • For 0 ≤ i < n1, a processor rowguard(i) is responsible for entry i in all n1×N vectors. • For 0 ≤ j < n2, a processor colguard(j) is responsible for entry j in all n2×N vectors. • Processor-assignment algorithms aim for load balancing.
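One plausible assignment (an illustration only; the talk does not specify the scheme) distributes rows and columns round-robin over the grid, which balances counts but not necessarily matrix weight.

```c
typedef struct { int r, c; } proc;      /* processor p_rc on the g1 x g2 grid */

/* Round-robin guardians: gridrow(rowguard(i)) cycles through the grid rows,
 * gridcol(colguard(j)) through the grid columns.                            */
proc rowguard(long i, int g1, int g2)
{
    proc p = { (int)(i % g1), (int)((i / g1) % g2) };
    return p;
}

proc colguard(long j, int g1, int g2)
{
    proc p = { (int)((j / g2) % g1), (int)(j % g2) };
    return p;
}
```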
Three major operations • Vector addition is pointwise. When adding two n2×N vectors, processor colguard(j) handles the j-th entries. Data is local. • Likewise for post-multiplying an n2×N vector by an N×N matrix. • Processors form partial N×N inner products. A central processor sums them. • These operations need little communication. • Workloads are O(#columns assigned).
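Because GF(2) addition is XOR, combining the partial N×N inner products is a bitwise-XOR reduction. A sketch using MPI follows; the function and parameter names are mine and N = 64.

```c
#include <mpi.h>
#include <stdint.h>

/* Each processor passes the 64 x 64 partial inner product formed from the
 * columns it guards; only `central` receives the completed product.        */
void combine_inner_products(const uint64_t partial[64], uint64_t total[64],
                            int central, MPI_Comm comm)
{
    MPI_Reduce(partial, total, 64, MPI_UINT64_T, MPI_BXOR, central, comm);
}
```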
Allocating B among processors • Let B = (bij) for 0 ≤ i < n1 and 0 ≤ j < n2. • Processor prc is responsible for all bij where gridrow(rowguard(i)) = r and gridcol(colguard(j)) = c. • When pre-multiplying by B, the input data from colguard(j) will arrive along grid column c, and the output data for rowguard(i) will depart along grid row r.
Multiplying u = Bv, where u is n1×N and v is n2×N • Distribute each v[j] to all prc with gridcol(colguard(j)) = c. That is, broadcast each v[j] along one column of the grid. • Each prc processes all of its bij, building partial u[i] outputs. • Partial u[i] values are summed as they advance along a grid row to rowguard(i). • Individual workloads depend upon B.
Actions by prc during multiply • Send/receive all v[j] with gridcol(colguard(j)) = c. • Zero all u[i] with rowguard(i) = pr,c+1. • At time t, where 1 ≤ t ≤ g2, adjust all u[i] with rowguard(i) = pr,c+t (t nodes east). • If t ≠ g2, ship these u[i] west to pr,c−1 and receive other u[i] from pr,c+1 on the east. • Want balanced workloads at each t.
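For orientation, here is a sketch of u = Bv expressed with MPI collectives instead of the neighbor-to-neighbor pipeline above (all names and the use of MPI_Allgatherv / MPI_Reduce_scatter are my assumptions; the pipelined scheme has the advantage that every message travels only one hop on the torus). Each grid column gathers the v-entries guarded within it, each processor multiplies its local piece of B, and each grid row XOR-combines the partial u-entries and delivers them to their rowguards.

```c
#include <mpi.h>
#include <stdint.h>
#include <string.h>

struct entry { uint32_t i, j; };   /* local indices of one nonzero b_ij */

/*
 * u = B v on one processor of the grid (N = 64, so entries are uint64_t).
 * v_mine holds the v-entries this processor guards; v_col receives every
 * v-entry guarded in this grid column; u_partial is this processor's
 * partial sum over all u-entries guarded in this grid row (ordered by rank
 * in row_comm, with u_counts[] giving each rank's share); u_mine receives
 * the completed entries this processor guards.
 */
void multiply_Bv(const struct entry *local_B, long local_nnz,
                 const uint64_t *v_mine, int v_mine_len,
                 uint64_t *v_col, int *v_counts, int *v_displs,
                 uint64_t *u_partial, long u_row_len,
                 uint64_t *u_mine, int *u_counts,
                 MPI_Comm col_comm, MPI_Comm row_comm)
{
    /* 1. Share v along the grid column. */
    MPI_Allgatherv(v_mine, v_mine_len, MPI_UINT64_T,
                   v_col, v_counts, v_displs, MPI_UINT64_T, col_comm);

    /* 2. Local sparse multiply over GF(2): XOR-accumulate partial u. */
    memset(u_partial, 0, (size_t)u_row_len * sizeof *u_partial);
    for (long k = 0; k < local_nnz; k++)
        u_partial[local_B[k].i] ^= v_col[local_B[k].j];

    /* 3. Combine along the grid row; each rowguard gets its own entries.
     *    The GF(2) sum of the partial results is a bitwise XOR.           */
    MPI_Reduce_scatter(u_partial, u_mine, u_counts,
                       MPI_UINT64_T, MPI_BXOR, row_comm);
}
```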
Multiplication by BT • Reverse roles of matrix rows and columns. • Reverse roles of grid rows and columns. • BT and B can share storage since same processor handles (B)ij during multiply by B as handles (BT)ji during multiply by BT.
Major memory requirements • Matrix data is split amongst processors. • With 65536×65536 cache-friendly blocks, an entry needs only two 16-bit offsets. • Each processor needs one vector of length max(n1/g1, n2/g2) and a few of length n2/g1g2, with N bits per entry. • Central processor needs one vector of length n2, plus rowguard and colguard.
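A sketch of that blocked format (field names are mine) and the corresponding multiply kernel over GF(2), with N = 64:

```c
#include <stdint.h>

/* The local piece of B is cut into 65536 x 65536 blocks, so each nonzero
 * inside a block is recorded as two 16-bit offsets from the block origin. */
struct block {
    uint32_t  row_base, col_base;   /* origin of this block in B           */
    long      nnz;                  /* nonzeros falling inside the block   */
    uint16_t *row_off;              /* nnz row offsets, 0..65535           */
    uint16_t *col_off;              /* nnz column offsets, 0..65535        */
};

/* u += (this block of B) * v over GF(2); vectors are arrays of 64-bit rows. */
void block_multiply(const struct block *b, const uint64_t *v, uint64_t *u)
{
    for (long k = 0; k < b->nnz; k++)
        u[b->row_base + b->row_off[k]] ^= v[b->col_base + b->col_off[k]];
}
```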
Major communications during multiply by B • Broadcast each v[j] along its entire grid column: ship n2·N bits to each of g1−1 destinations. • Forward partial u[i] along each grid row, one node at a time: total (g2−1)·n1·N bits. • When n2 ≈ n1, communication for B and BT together is 2(g1+g2−2)·n1·N bits per iteration. • That is 2(g1+g2−2)·n1² bits after n1/N iterations.
Choosing grid size • Large enough that the matrix fits in memory. • Matrix storage is about 4w/(g1g2) bytes per processor, where w is the total matrix weight. • Try to balance I/O and computation times. • Multiply cost is O(n1w/(g1g2)) per processor. • Communications cost is O((g1+g2−2)·n1²). • Prefer a square grid, to reduce g1+g2.
Choice of N and matrix • Prefer a smaller but heavier matrix if it fits, to lessen communications. • Higher N yields more dependencies, letting you omit the heaviest rows from the matrix. • Larger N means fewer but longer messages. • The size of vector elements affects cache behavior. • When N is large, inner products and post-multiplies by N×N matrices are slower.
Cambridge cluster configuration • Microsoft Research, Cambridge, UK. • 16 dual-CPU 300 MHz Pentium II nodes. • Each node: 384 MB RAM, 4 GB local disk. • Networks: dedicated fast Ethernet (100 Mb/sec) and Myrinet M2M-OCT-SW8 (1.28 Gb/sec).
Message Passing Interface (MPI) • Industry standard. • MPI implementations exist for the majority of parallel systems and interconnects, both public domain (e.g. mpich) and commercial (e.g. MPI PRO). • Supports many communication primitives, including virtual topologies (e.g. a torus).
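A sketch (assumptions mine) of setting up the earlier g1×g2 torus as an MPI virtual topology, together with the row and column sub-communicators that the collective sketches above would use:

```c
#include <mpi.h>

/* Build a periodic g1 x g2 Cartesian communicator (a torus) and obtain the
 * four neighbour ranks plus per-row and per-column sub-communicators.      */
void make_torus(int g1, int g2, MPI_Comm *torus,
                int *north, int *south, int *west, int *east,
                MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int dims[2]    = { g1, g2 };
    int periods[2] = { 1, 1 };              /* wrap both dimensions: a torus */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /*reorder*/, torus);

    /* Dimension 0 is the grid-row index, dimension 1 the grid-column index;
     * which neighbour is called "north" etc. is just a convention here.    */
    MPI_Cart_shift(*torus, 0, 1, north, south);
    MPI_Cart_shift(*torus, 1, 1, west, east);

    int keep_cols[2] = { 0, 1 };            /* vary column index: my grid row    */
    int keep_rows[2] = { 1, 0 };            /* vary row index: my grid column    */
    MPI_Cart_sub(*torus, keep_cols, row_comm);
    MPI_Cart_sub(*torus, keep_rows, col_comm);
}
```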