Distributed Linear Algebra. Peter L. Montgomery, Microsoft Research, Redmond, USA. RSA 2000, January 17, 2000.
Role of matrices in factoring n • Sieving finds many relations xj² ≡ ∏i pi^eij (mod n). • Raise the jth relation to the power sj = 0 or 1, then multiply. • Left side is always a perfect square. Right side is a square if the exponents ∑j eij sj are even for all i. • Matrix equation Es ≡ 0 (mod 2), E known. • Knowing x² ≡ y² (mod n), test GCD(x − y, n). • Matrix rows represent primes pi. Entries are the exponents eij. Arithmetic is over GF(2).
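A toy illustration (not from the talk): take n = 1649 with primes 2 and 5. The relations 41² ≡ 2⁵ (mod 1649) and 43² ≡ 2³·5² (mod 1649) have exponent vectors (5, 0) and (3, 2), both equal to (1, 0) mod 2, so choosing s1 = s2 = 1 makes every exponent sum even: (41·43)² ≡ 2⁸·5² = (2⁴·5)² (mod 1649). With x = 41·43 mod 1649 = 114 and y = 2⁴·5 = 80, GCD(x − y, n) = GCD(34, 1649) = 17, a nontrivial factor of 1649 = 17·97.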
Matrix growth on RSA Challenge • RSA–140 (Jan-Feb 1999): matrix 4 671 181 × 4 704 451, weight 151 141 999; omit primes < 40; 99 Cray-C90 hours; 75% of 800 MB for matrix storage. • RSA–155 (August 1999): matrix 6 699 191 × 6 711 336, weight 417 132 631; omit primes < 40; 224 Cray-C90 hours; 85% of 1960 MB for matrix storage.
Regular Lanczos • A is a positive definite (real, symmetric) n×n matrix. • Given b, we want to solve Ax = b for x. • Set w0 = b. • wi+1 = Awi − Σ0≤j≤i cij wj if i ≥ 0 • cij = wjT A² wi / wjT A wj • Stop when wi+1 = 0.
Claims • wjT A wj ≠ 0 if wj ≠ 0 (A is positive definite). • wjT A wi = 0 whenever i ≠ j (by choice of cij and symmetry of A). • Eventually some wi+1 = 0, say for i = m (otherwise too many A-orthogonal vectors). • x = Σ0≤j≤m (wjT b / wjT A wj) wj satisfies Ax = b (the error u = Ax − b is in the space spanned by the wj’s but orthogonal to all wj, so uT u = 0 and u = 0).
Simplifying cij when i > j+1 • wjT A wj cij = wjT A² wi = (Awj)T (Awi) = (wj+1 + linear comb. of w0 to wj)T (Awi) = 0 (A-orthogonality). • Recurrence simplifies to wi+1 = Awi − cii wi − ci,i−1 wi−1 when i ≥ 1. • Little history to save as i advances.
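For orientation, here is a minimal sketch (not from the talk) of regular Lanczos in C, using the three-term recurrence above and accumulating the solution x = Σ (wjT b / wjT A wj) wj from the Claims slide; dense double-precision storage and all names are assumptions made for clarity, not how the factoring matrices are held.

```c
/*
 * Minimal regular Lanczos solver for a dense symmetric positive definite
 * matrix.  Uses the simplified recurrence
 *   w[i+1] = A*w[i] - c_ii*w[i] - c_{i,i-1}*w[i-1]
 * and accumulates x = sum_j (w_j^T b / w_j^T A w_j) * w_j.
 */
#include <stdlib.h>

static double dot(const double *u, const double *v, int n) {
    double s = 0.0;
    for (int k = 0; k < n; k++) s += u[k] * v[k];
    return s;
}

static void matvec(const double *A, const double *v, double *out, int n) {
    for (int r = 0; r < n; r++) {
        double s = 0.0;
        for (int k = 0; k < n; k++) s += A[r * n + k] * v[k];
        out[r] = s;
    }
}

/* Solve Ax = b; A is n-by-n, symmetric positive definite, row-major. */
void lanczos_solve(const double *A, const double *b, double *x, int n) {
    double *w_prev  = calloc(n, sizeof *w_prev);   /* w[i-1]                */
    double *w       = malloc(n * sizeof *w);       /* w[i]                  */
    double *Aw      = malloc(n * sizeof *Aw);      /* A * w[i]              */
    double *Aw_prev = calloc(n, sizeof *Aw_prev);  /* A * w[i-1]            */
    double *w_next  = malloc(n * sizeof *w_next);  /* w[i+1]                */
    double d_prev   = 1.0;         /* w[i-1]^T A w[i-1]; dummy value at i=0 */

    for (int k = 0; k < n; k++) { w[k] = b[k]; x[k] = 0.0; }   /* w[0] = b */

    for (int i = 0; i <= n; i++) {
        matvec(A, w, Aw, n);
        double d = dot(w, Aw, n);                  /* w[i]^T A w[i]         */
        if (d < 1e-30) break;                      /* w[i] is (numerically) zero: done */

        double t = dot(w, b, n) / d;               /* coefficient of w[i] in x */
        for (int k = 0; k < n; k++) x[k] += t * w[k];

        double cii = dot(Aw, Aw, n) / d;           /* w[i]^T A^2 w[i] / w[i]^T A w[i] */
        double cip = dot(Aw_prev, Aw, n) / d_prev; /* c_{i,i-1}                        */
        for (int k = 0; k < n; k++)
            w_next[k] = Aw[k] - cii * w[k] - cip * w_prev[k];

        /* Rotate buffers; only w[i], w[i-1] and A*w[i] are needed next time. */
        double *tmp;
        tmp = w_prev;  w_prev  = w;  w = w_next;  w_next = tmp;
        tmp = Aw_prev; Aw_prev = Aw; Aw = tmp;
        d_prev = d;
    }
    free(w_prev); free(w); free(Aw); free(Aw_prev); free(w_next);
}
```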
Major operations needed • Pre-multiply wi by A. • Inner products such as wjT A wj and wjT A² wi = (Awj)T (Awi). • Add a scalar multiple of one vector to another.
Adapting to Bx = 0 over GF(2) • B is n1×n2 with n1 ≤ n2, not symmetric. Solve Ax = 0 where A = BTB. A is n2×n2. BT has small nullspace in practice. • Right side is zero, so Lanczos gives x = 0. Solve Ax = Ay instead, where y is random. • uTu and uTAu can vanish even when u ≠ 0. Solved by Block Lanczos (Eurocrypt 1995).
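A short sketch (assumptions mine: N = 64, B held in coordinate form) of how A = BTB can be applied without ever forming A, which is all Lanczos needs; a random y then supplies the right-hand side Ay, and any x with Ax = Ay gives x ⊕ y satisfying A(x ⊕ y) = 0, since subtraction over GF(2) is XOR.

```c
/*
 * Apply A = B^T B to an (n2 x 64)-bit vector over GF(2) without forming A.
 * B is stored as a list of the positions of its nonzero entries; GF(2)
 * addition is XOR.  (A sketch only; the talk's own blocked storage format
 * for B appears on a later slide.)
 */
#include <stdint.h>
#include <string.h>

struct entry { uint32_t i, j; };     /* b_ij = 1 */

void apply_A(const struct entry *B, long nnz, long n1, long n2,
             const uint64_t *v,      /* n2 words: input vector           */
             uint64_t *t,            /* n1 words: scratch, receives B*v  */
             uint64_t *out)          /* n2 words: output B^T * (B*v)     */
{
    memset(t, 0, (size_t)n1 * sizeof *t);
    for (long k = 0; k < nnz; k++)            /* t = B v     */
        t[B[k].i] ^= v[B[k].j];

    memset(out, 0, (size_t)n2 * sizeof *out);
    for (long k = 0; k < nnz; k++)            /* out = B^T t */
        out[B[k].j] ^= t[B[k].i];
}
```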
Block Lanczos summary • Let N be the machine word length (typically 32 or 64) or a small multiple thereof. • Vectors are n1×N or n2×N over GF(2). • Exclusive OR and other hardware bitwise instructions operate on N-bit data. • Recurrences are similar to regular Lanczos. • Approximately n1/(N−0.76) iterations. • Up to N independent solutions of Bx = 0.
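Concretely (taking N = 64 as an assumption), an n×N vector over GF(2) is just an array of n machine words, and vector addition is a word-wise XOR; for N = 128 one would simply use two words per row.

```c
#include <stdint.h>

typedef uint64_t gf2row;        /* one row (N = 64 bits) of an n x N vector */

/* dst += src over GF(2): addition of two n x N vectors is a word-wise XOR. */
void vec_add(gf2row *dst, const gf2row *src, long n)
{
    for (long k = 0; k < n; k++)
        dst[k] ^= src[k];
}
```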
Block Lanczos major operations • Pre-multiply an n2×N vector by B. • Pre-multiply an n1×N vector by BT. • N×N inner product of two n2×N vectors. • Post-multiply an n2×N vector by an N×N matrix. • Add two n2×N vectors. How do we parallelize these?
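Here is a sketch of the two dense kernels on this list, again assuming N = 64, with an N×N matrix stored as 64 words (one word per row). The bit scan uses the GCC/Clang __builtin_ctzll intrinsic; production codes typically replace the inner bit loops with byte-indexed table lookups.

```c
#include <stdint.h>
#include <string.h>

/* inner = W1^T * W2 over GF(2); inner[a] is row a of the 64 x 64 result. */
void inner_product(const uint64_t *W1, const uint64_t *W2,
                   uint64_t inner[64], long n)
{
    memset(inner, 0, 64 * sizeof(uint64_t));
    for (long k = 0; k < n; k++) {
        uint64_t bits = W1[k];
        while (bits) {                       /* each set bit a of W1[k] ...     */
            int a = __builtin_ctzll(bits);
            inner[a] ^= W2[k];               /* ... contributes W2[k] to row a  */
            bits &= bits - 1;
        }
    }
}

/* out = V * M over GF(2): post-multiply an n x 64 vector by a 64 x 64 matrix. */
void post_multiply(const uint64_t *V, const uint64_t M[64],
                   uint64_t *out, long n)
{
    for (long k = 0; k < n; k++) {
        uint64_t acc = 0, bits = V[k];
        while (bits) {
            int a = __builtin_ctzll(bits);   /* bit a of V[k] selects row a of M */
            acc ^= M[a];
            bits &= bits - 1;
        }
        out[k] = acc;
    }
}
```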
Assumed processor topology • Assume a g1×g2 toroidal grid of processors. • A torus is a rectangle with its top edge connected to its bottom, and its left edge to its right (a doughnut). • Need fast communication to/from immediate neighbors north, south, east, and west. • Processor names are prc, where r is taken modulo g1 and c modulo g2. • Set gridrow(prc) = r and gridcol(prc) = c.
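A tiny sketch of this naming scheme in C; the linear rank r·g2 + c is an assumption of the sketch, since the slides only require the wraparound indexing.

```c
/* Processors p_rc on a g1 x g2 torus: indices wrap modulo the grid sizes,
 * so every processor has north, south, east and west neighbours.          */
typedef struct { int r, c; } proc;

int rank_of(proc p, int g1, int g2)
{
    int r = ((p.r % g1) + g1) % g1;     /* row index modulo g1            */
    int c = ((p.c % g2) + g2) % g2;     /* column index modulo g2         */
    return r * g2 + c;                  /* one possible linear numbering  */
}

proc east(proc p)  { p.c += 1; return p; }
proc west(proc p)  { p.c -= 1; return p; }
proc north(proc p) { p.r -= 1; return p; }
proc south(proc p) { p.r += 1; return p; }
```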
[Figure: example of a 3×3 torus of processors P1 through P9.]
Matrix row and column guardians • For 0 ≤ i < n1, a processor rowguard(i) is responsible for entry i in all n1×N vectors. • For 0 ≤ j < n2, a processor colguard(j) is responsible for entry j in all n2×N vectors. • Processor-assignment algorithms aim for load balancing.
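One plausible assignment (an illustration only; the talk does not specify the scheme) distributes rows and columns round-robin over the grid, which balances counts but not necessarily matrix weight.

```c
typedef struct { int r, c; } proc;      /* processor p_rc on the g1 x g2 grid */

/* Round-robin guardians: gridrow(rowguard(i)) cycles through the grid rows,
 * gridcol(colguard(j)) through the grid columns.                            */
proc rowguard(long i, int g1, int g2)
{
    proc p = { (int)(i % g1), (int)((i / g1) % g2) };
    return p;
}

proc colguard(long j, int g1, int g2)
{
    proc p = { (int)((j / g2) % g1), (int)(j % g2) };
    return p;
}
```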
Three major operations • Vector addition is pointwise. When adding two n2×N vectors, processor colguard(j) handles the j-th entries. Data is local. • Likewise for post-multiplying an n2×N vector by an N×N matrix. • Processors form partial N×N inner products. A central processor sums them. • These operations need little communication. • Workloads are O(#columns assigned).
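Because GF(2) addition is XOR, combining the partial N×N inner products is a bitwise-XOR reduction. A sketch using MPI follows; the function and parameter names are mine and N = 64.

```c
#include <mpi.h>
#include <stdint.h>

/* Each processor passes the 64 x 64 partial inner product formed from the
 * columns it guards; only `central` receives the completed product.        */
void combine_inner_products(const uint64_t partial[64], uint64_t total[64],
                            int central, MPI_Comm comm)
{
    MPI_Reduce(partial, total, 64, MPI_UINT64_T, MPI_BXOR, central, comm);
}
```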
Allocating B among processors • Let B = (bij) for 0 ≤ i < n1 and 0 ≤ j < n2. • Processor prc is responsible for all bij where gridrow(rowguard(i)) = r and gridcol(colguard(j)) = c. • When pre-multiplying by B, the input data from colguard(j) will arrive along grid column c, and the output data for rowguard(i) will depart along grid row r.
Multiplying u = Bv, where u is n1×N and v is n2×N • Distribute each v[j] to all prc with gridcol(colguard(j)) = c. That is, broadcast each v[j] along one column of the grid. • Each prc processes all of its bij, building partial u[i] outputs. • Partial u[i] values are summed as they advance along a grid row to rowguard(i). • Individual workloads depend upon B.
Actions by prc during multiply • Send/receive all v[j] with gridcol(colguard(j)) = c. • Zero all u[i] with rowguard(i) = pr,c+1. • At time t, where 1 ≤ t ≤ g2, adjust all u[i] with rowguard(i) = pr,c+t (t nodes east). • If t ≠ g2, ship these u[i] west to pr,c−1 and receive other u[i] from pr,c+1 on the east. • Want balanced workloads at each t.
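For orientation, here is a sketch of u = Bv expressed with MPI collectives instead of the neighbor-to-neighbor pipeline above (all names and the use of MPI_Allgatherv / MPI_Reduce_scatter are my assumptions; the pipelined scheme has the advantage that every message travels only one hop on the torus). Each grid column gathers the v-entries guarded within it, each processor multiplies its local piece of B, and each grid row XOR-combines the partial u-entries and delivers them to their rowguards.

```c
#include <mpi.h>
#include <stdint.h>
#include <string.h>

struct entry { uint32_t i, j; };   /* local indices of one nonzero b_ij */

/*
 * u = B v on one processor of the grid (N = 64, so entries are uint64_t).
 * v_mine holds the v-entries this processor guards; v_col receives every
 * v-entry guarded in this grid column; u_partial is this processor's
 * partial sum over all u-entries guarded in this grid row (ordered by rank
 * in row_comm, with u_counts[] giving each rank's share); u_mine receives
 * the completed entries this processor guards.
 */
void multiply_Bv(const struct entry *local_B, long local_nnz,
                 const uint64_t *v_mine, int v_mine_len,
                 uint64_t *v_col, int *v_counts, int *v_displs,
                 uint64_t *u_partial, long u_row_len,
                 uint64_t *u_mine, int *u_counts,
                 MPI_Comm col_comm, MPI_Comm row_comm)
{
    /* 1. Share v along the grid column. */
    MPI_Allgatherv(v_mine, v_mine_len, MPI_UINT64_T,
                   v_col, v_counts, v_displs, MPI_UINT64_T, col_comm);

    /* 2. Local sparse multiply over GF(2): XOR-accumulate partial u. */
    memset(u_partial, 0, (size_t)u_row_len * sizeof *u_partial);
    for (long k = 0; k < local_nnz; k++)
        u_partial[local_B[k].i] ^= v_col[local_B[k].j];

    /* 3. Combine along the grid row; each rowguard gets its own entries.
     *    The GF(2) sum of the partial results is a bitwise XOR.           */
    MPI_Reduce_scatter(u_partial, u_mine, u_counts,
                       MPI_UINT64_T, MPI_BXOR, row_comm);
}
```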
Multiplication by BT • Reverse roles of matrix rows and columns. • Reverse roles of grid rows and columns. • BT and B can share storage since same processor handles (B)ij during multiply by B as handles (BT)ji during multiply by BT.
Major memory requirements • Matrix data is split amongst processors. • With 65536×65536 cache-friendly blocks, an entry needs only two 16-bit offsets. • Each processor needs one vector of length max(n1/g1, n2/g2) and a few of length n2/g1g2, with N bits per entry. • Central processor needs one vector of length n2, plus rowguard and colguard.
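A sketch of that blocked format (field names are mine) and the corresponding multiply kernel over GF(2), with N = 64:

```c
#include <stdint.h>

/* The local piece of B is cut into 65536 x 65536 blocks, so each nonzero
 * inside a block is recorded as two 16-bit offsets from the block origin. */
struct block {
    uint32_t  row_base, col_base;   /* origin of this block in B           */
    long      nnz;                  /* nonzeros falling inside the block   */
    uint16_t *row_off;              /* nnz row offsets, 0..65535           */
    uint16_t *col_off;              /* nnz column offsets, 0..65535        */
};

/* u += (this block of B) * v over GF(2); vectors are arrays of 64-bit rows. */
void block_multiply(const struct block *b, const uint64_t *v, uint64_t *u)
{
    for (long k = 0; k < b->nnz; k++)
        u[b->row_base + b->row_off[k]] ^= v[b->col_base + b->col_off[k]];
}
```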
Major communications during multiply by B • Broadcast each v[j] along its entire grid column: ship n2·N bits to each of g1−1 destinations. • Forward partial u[i] along each grid row, one node at a time: total (g2−1)·n1·N bits. • When n2 ≈ n1, communication for B and BT together is 2(g1+g2−2)·n1·N bits per iteration. • That is 2(g1+g2−2)·n1² bits after n1/N iterations.
Choosing grid size • Large enough that the matrix fits in memory. • Matrix storage is about 4w/(g1g2) bytes per processor, where w is the total matrix weight. • Try to balance I/O and computation times. • Multiply cost is O(n1w/(g1g2)) per processor. • Communications cost is O((g1+g2−2)·n1²). • Prefer a square grid, to reduce g1+g2.
Choice of N and matrix • Prefer a smaller but heavier matrix if it fits, to lessen communications. • Higher N yields more dependencies, letting you omit the heaviest rows from the matrix. • Larger N means fewer but longer messages. • The size of vector elements affects cache behavior. • When N is large, inner products and post-multiplies by N×N matrices are slower.
Cambridge cluster configuration • Microsoft Research, Cambridge, UK. • 16 dual-CPU 300 MHz Pentium II nodes. • Each node: 384 MB RAM, 4 GB local disk. • Networks: dedicated fast Ethernet (100 Mb/sec) and Myrinet M2M-OCT-SW8 (1.28 Gb/sec).
Message Passing Interface (MPI) • Industry standard. • MPI implementations exist for the majority of parallel systems and interconnects, both public domain (e.g. mpich) and commercial (e.g. MPI PRO). • Supports many communication primitives, including virtual topologies (e.g. a torus).
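A sketch (assumptions mine) of setting up the earlier g1×g2 torus as an MPI virtual topology, together with the row and column sub-communicators that the collective sketches above would use:

```c
#include <mpi.h>

/* Build a periodic g1 x g2 Cartesian communicator (a torus) and obtain the
 * four neighbour ranks plus per-row and per-column sub-communicators.      */
void make_torus(int g1, int g2, MPI_Comm *torus,
                int *north, int *south, int *west, int *east,
                MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int dims[2]    = { g1, g2 };
    int periods[2] = { 1, 1 };              /* wrap both dimensions: a torus */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /*reorder*/, torus);

    /* Dimension 0 is the grid-row index, dimension 1 the grid-column index;
     * which neighbour is called "north" etc. is just a convention here.    */
    MPI_Cart_shift(*torus, 0, 1, north, south);
    MPI_Cart_shift(*torus, 1, 1, west, east);

    int keep_cols[2] = { 0, 1 };            /* vary column index: my grid row    */
    int keep_rows[2] = { 1, 0 };            /* vary row index: my grid column    */
    MPI_Cart_sub(*torus, keep_cols, row_comm);
    MPI_Cart_sub(*torus, keep_rows, col_comm);
}
```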