Reducing number of operations: The joy of algebraic transformations
CS498DHP Program Optimization
Number of operations and execution time
• A smaller number of operations does not necessarily mean a shorter execution time:
  • because of scheduling in a parallel environment,
  • because of locality,
  • because of communication in a parallel program.
• Nevertheless, although it has to be applied carefully, reducing the number of operations is one of the important optimizations.
• In this presentation, we discuss transformations that reduce the number of operations or reduce the schedule length in an idealized parallel environment where communication costs are zero.
Scheduling
• Consider the expression tree:
  [Figure: an unbalanced expression tree over a, h, b, c, g, d, e, f built from + and * nodes]
• It can be shortened by applying
  • associativity and commutativity: a+h+b*(c+g+d*e*f), or
  • associativity, commutativity, and distributivity: a+h+b*c+b*g+b*d*e*f.
• The last expression has the shortest tree of the three. This means that, with enough resources, it is the fastest to evaluate even though it has the most operations.
Locality
• Consider:
    do i=1,n
      c(i) = a(i)+b(i)+a(i)/b(i)
    end do
    …
    do i=1,n
      x(i) = (a(i)+b(i))*t(i)+a(i)/b(i)
    end do
  versus:
    do i=1,n
      d(i) = a(i)/b(i)
      c(i) = a(i)+b(i)+d(i)
    end do
    …
    do i=1,n
      x(i) = (a(i)+b(i))*t(i)+d(i)
    end do
• The second sequence executes fewer operations, since the division a(i)/b(i) is computed only once per element, but, if n is large enough, it also incurs more cache misses, because d must be brought back into cache in the second loop. (We assume that t is computed between the two loops, so the loops cannot be fused.)
Communication in parallel programs
• Consider:
    cobegin
      …
      do i=1,n
        a(i) = ..
      end do
      send a(1:n)
      …
    //
      …
      receive a(1:n)
      …
    coend
  versus:
    cobegin
      …
      do i=1,n
        a(i) = ..
      end do
      …
    //
      …
      do i=1,n
        a(i) = ..
      end do
      …
    coend
• The second version executes more operations, since both tasks compute a(1:n), but it would execute faster if the send operation is expensive.
Approaches to reducing the cost of computation
• Eliminate (syntactically) redundant computations.
• Apply algebraic transformations to reduce the number of operations.
• Decompose sequential computations for parallel execution.
• Apply algebraic transformations to reduce the height of expression trees and thus reduce execution time in a parallel environment.
Elimination of redundant computations
• Many of these transformations were discussed in the context of compiler transformations:
  • common subexpression elimination,
  • loop invariant removal,
  • elimination of redundant counters,
  • loop unrolling (not discussed earlier, but it belongs here): it eliminates bookkeeping operations.
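As an illustration, here is a minimal C sketch of loop invariant removal combined with common subexpression elimination; the function names and the loop body are hypothetical, chosen only to make the transformations concrete.

    /* Before: s*t is loop-invariant and a[i]*2 is computed twice. */
    void before(double *x, double *y, const double *a, double s, double t, int n) {
        for (int i = 0; i < n; i++) {
            x[i] = s * t + a[i] * 2;
            y[i] = a[i] * 2 - s * t;
        }
    }

    /* After: the invariant product is hoisted out of the loop and the
       common subexpression is computed once per iteration. */
    void after(double *x, double *y, const double *a, double s, double t, int n) {
        double st = s * t;
        for (int i = 0; i < n; i++) {
            double a2 = a[i] * 2;
            x[i] = st + a2;
            y[i] = a2 - st;
        }
    }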
• However, compilers will not eliminate all redundant computations. Here is an example where user intervention is needed. The following sequence
    do i=1,n
      s = a(i)+s
    end do
    …
    do i=1,n-1
      t = a(i)+t
    end do
    …t…
may be replaced by
    do i=1,n-1
      t = a(i)+t
    end do
    s = t+a(n)
    …
    …t…
This transformation is not usually done by compilers.
Another example, from C, is the loop
    for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
        a[i][j] = 0;
      }
    }
which, if a is n × n, can be transformed into the loop below, which has fewer bookkeeping operations.
    b = &a[0][0];
    for (i = 0; i < n*n; i++) {
      *b = 0;
      b++;
    }
Applying algebraic transformations to reduce the number of operations
• For example, the expression a*(b*c)+(b*a)*d+a*e can be transformed into (a*b)*(c+d)+a*e by distributivity, and then, by associativity and distributivity, into a*(b*(c+d)+e).
• Notice that associativity has to be applied with care. For example, suppose we are operating on floating-point values, that x is very much larger than y, and that z = -x. Then (y+x)+z may give 0 as a result, while y+(x+z) gives y as the answer.
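A minimal C demonstration of this floating-point pitfall (the particular values are illustrative assumptions):

    #include <stdio.h>

    int main(void) {
        float x = 1.0e20f;   /* much larger than y */
        float y = 3.0f;
        float z = -x;
        printf("(y+x)+z = %g\n", (y + x) + z);  /* prints 0: y is absorbed into x */
        printf("y+(x+z) = %g\n", y + (x + z));  /* prints 3: x and z cancel first */
        return 0;
    }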
• The application of algebraic rules can be very sophisticated. Consider the computation of x^n. A naïve implementation would require n-1 multiplications.
• However, if we represent n in binary as n = b0 + 2(b1 + 2(b2 + …)) and notice that x^n = x^b0 * (x^(b1 + 2(b2 + …)))^2, the number of multiplications can be reduced to O(log n).
    /* Recursive exponentiation by squaring; assumes n > 0. */
    double power(double x, int n) {
        if (n == 1) return x;
        if (n % 2 == 1) return x * power(x, n - 1);
        x = power(x, n / 2);
        return x * x;
    }
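The same idea can be written iteratively, consuming the binary digits of n from the least significant end; this version is a sketch, not from the original slides:

    /* Iterative exponentiation by squaring; assumes n > 0.
       Performs O(log n) multiplications. */
    double power_iter(double x, unsigned n) {
        double result = 1.0;
        while (n > 0) {
            if (n % 2 == 1)     /* current binary digit is 1 */
                result *= x;
            x *= x;             /* advance from x^(2^i) to x^(2^(i+1)) */
            n /= 2;
        }
        return result;
    }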
Horner’s rule
• A polynomial A(x) = a0 + a1x + a2x² + a3x³ + … may be written as A(x) = a0 + x(a1 + x(a2 + x(a3 + …))). As a result, a polynomial of degree n may be evaluated at a point x′, that is, A(x′) computed, in Θ(n) time using Horner’s rule: repeated multiplications and additions, rather than the naïve method of raising x to successive powers, multiplying by the coefficients, and accumulating.
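A short C sketch of Horner’s rule; the coefficient layout (a[0..n] holding a0 through an) is an assumption:

    /* Evaluate a[0] + a[1]*x + ... + a[n]*x^n using n multiplications
       and n additions. */
    double horner(const double a[], int n, double x) {
        double result = a[n];
        for (int i = n - 1; i >= 0; i--)
            result = result * x + a[i];  /* one multiply and one add per step */
        return result;
    }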
Conventional matrix multiplication
• Asymptotic complexity: 2n^3 operations.
• Each recursion step (blocked version): 8 multiplications, 4 additions.
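For concreteness, a straightforward C version of the conventional algorithm (row-major n × n arrays are an assumption); each of the n^2 output elements costs n multiplications and n additions, giving 2n^3 operations in total:

    /* Conventional O(n^3) matrix multiplication: C = A * B. */
    void matmul(const double *A, const double *B, double *C, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];  /* n multiply-adds */
                C[i*n + j] = sum;
            }
    }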
Strassen’s Algorithm
• Asymptotic complexity: O(n^(log2 7)) = O(n^2.8…) operations.
• Each recursion step: 7 multiplications, 18 additions/subtractions.
• The asymptotic complexity is the solution of T(n) = 7T(n/2) + 18(n/2)^2.
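For reference, the seven products in the standard textbook formulation (each matrix partitioned into 2 × 2 blocks; these equations are supplied here, not reproduced from the original slides):

    M1 = (A11 + A22)(B11 + B22)
    M2 = (A21 + A22) B11
    M3 = A11 (B12 - B22)
    M4 = A22 (B21 - B11)
    M5 = (A11 + A12) B22
    M6 = (A21 - A11)(B11 + B12)
    M7 = (A12 - A22)(B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

The block additions and subtractions above account for the 18 O(n^2) terms in the recurrence.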
Winograd’s variant
• Asymptotic complexity: O(n^2.8…) operations.
• Each recursion step: 7 multiplications, 15 additions/subtractions.
Parallel matrix multiplication
• Parallel matrix multiplication can be accomplished without redundant operations.
• First observe that the time to compute a sum of n elements, given enough resources, is ⌈log2 n⌉ addition steps.
• With sufficient replication and computational resources, matrix multiplication can take just one multiplication step and ⌈log2 n⌉ addition steps: all element products are formed simultaneously, and each of the n^2 sums is then reduced in parallel.
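A C sketch of the logarithmic-depth summation, written as a sequential loop over the levels of the addition tree (the in-place pairwise scheme is an assumption; on the idealized machine each level is one parallel step):

    /* Pairwise in-place reduction of x[0..n-1]. Each iteration of the
       outer loop is one level of the addition tree, so with enough
       adders the sum takes ceil(log2 n) steps. */
    double tree_sum(double *x, int n) {
        for (int stride = 1; stride < n; stride *= 2)      /* one tree level */
            for (int i = 0; i + stride < n; i += 2 * stride)
                x[i] += x[i + stride];                     /* independent additions */
        return x[0];
    }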
Parallelism and redundancy
• Algebraic rules can be applied to reduce tree height.
• In some cases, the height of the tree is reduced at the expense of an increase in the number of operations.
Comparator (2-sorter)
• [Figure: a comparator takes inputs x and y and produces outputs min(x, y) and max(x, y)]
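In software, a comparator is simply a compare-exchange step; a minimal C sketch:

    /* Compare-exchange: afterwards *lo holds min(x, y) and *hi holds max(x, y). */
    void comparator(int *lo, int *hi) {
        if (*lo > *hi) {
            int tmp = *lo;
            *lo = *hi;
            *hi = tmp;
        }
    }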
Comparison network
• [Figure: a comparison network of depth d; each of the d stages applies up to n/2 comparators in parallel to disjoint pairs of lines]
Sorting networks
• [Figure: a sorting network is a comparison network whose outputs are sorted for every input sequence; the example traces a 0/1 input vector through the network to a sorted output]
Insertion sort network
• [Figure: the insertion sort network; comparators on disjoint lines overlap across stages, giving depth 2n - 3]
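The same network written as a fixed sequence of compare-exchanges in C, using the comparator sketch above (sequentialized here; the network runs independent comparators in the same stage to achieve the 2n - 3 depth):

    /* Insertion-sort network over a[0..n-1]: a data-independent sequence
       of compare-exchanges, unlike ordinary insertion sort, which stops
       each inner pass early. */
    void insertion_sort_network(int *a, int n) {
        for (int i = 1; i < n; i++)
            for (int j = i; j > 0; j--)
                comparator(&a[j - 1], &a[j]);  /* lines j-1 and j */
    }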