CS420 lecture six Loops

Time analysis of loops
Often easy, e.g. bubble sort:
    for i in 1..(n-1)
        for j in 1..(n-i)
            if (A[j] > A[j+1]) swap(A, j, j+1)
1. The loop body takes constant time.
2. The loop body is executed (n-1) + (n-2) + ... + 1 = n(n-1)/2 times, so bubble sort is O(n^2).
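The count of loop body executions can be checked empirically. The sketch below is a 0-based Python transcription of the bubble sort loop nest; the function name `bubble_sort_count` and the counter are additions for illustration only.

```python
def bubble_sort_count(values):
    """Bubble sort a copy of values; also return how often the loop body ran."""
    a = list(values)
    n = len(a)
    count = 0
    for i in range(1, n):            # i = 1 .. n-1, as on the slide
        for j in range(n - i):       # 0-based j, so a[j+1] stays in range
            count += 1               # one execution of the constant-time body
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a, count


if __name__ == "__main__":
    data = [5, 2, 4, 1, 3]
    result, count = bubble_sort_count(data)
    n = len(data)
    print(result, count, n * (n - 1) // 2)
```

For any input of length n the counter ends at n(n-1)/2, independent of the data, because the loop bounds do not depend on the array contents.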

Convex hull
Given a set of points in 2D ((x,y) coordinates), find the smallest convex polygon surrounding them all. The problem reduces to finding the line segments, each connecting two points of the set, that make up this polygon.

Convex hull: first attempt
Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line.
How do you find out whether all other points are on the same side? Check every pair (pi, pj) against every other point pk:
    for i = 1 to n
        for j = i+1 to n
            for k = 1 to n
                if (k != i && k != j) check(pi, pj, pk)
check is O(1), so this algorithm is O(n^3).
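A minimal Python sketch of the triple loop follows. The slides do not say how check is implemented; the cross-product sign test in `side` is one common choice and is an assumption here, as are the function names. The sketch also assumes distinct points with no three collinear, since a collinear point gives a zero sign and the segment is then rejected.

```python
from itertools import combinations


def side(p, q, r):
    """Sign of the cross product (q-p) x (r-p): > 0 if r is left of p->q, < 0 if right."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def hull_edges_brute_force(points):
    """Return every segment (p, q) that has all other points strictly on one side."""
    edges = []
    for p, q in combinations(points, 2):          # the i and j loops
        signs = [side(p, q, r) for r in points if r != p and r != q]   # the k loop
        if all(s > 0 for s in signs) or all(s < 0 for s in signs):
            edges.append((p, q))
    return edges


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 1)]
    for edge in hull_edges_brute_force(pts):
        print(edge)
```

On the square with one interior point, only the four sides of the square pass the test; every segment touching the interior point (2, 1) has points on both sides.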

the question that drives us.....

Is there a better algorithm?
• Find the lowest point P1
• Sort the remaining points by the angle they form with P1 and the horizontal, resulting in a sequence P2…Pn
• Start with P1-P2 in the current hull
• for i from 3 to n
    add Pi to the current hull
    for j from i-1 downto 3 (skipping points already eliminated)
        eliminate Pj if P1 and Pi are on different sides of line Pj-P(j-1), where P(j-1) is the point before Pj in the current hull
        if Pj stays, break
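The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's own code: the current hull is kept as a list used as a stack, the elimination test is written as the usual left-turn test via a cross product (which should agree with the "P1 and Pi on different sides" test when the current hull is a counterclockwise chain), and the points are assumed distinct with no three collinear. The names `cross` and `graham_scan` are chosen here.

```python
import math


def cross(o, a, b):
    """Cross product (a-o) x (b-o); positive for a left (counterclockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def graham_scan(points):
    """Return the hull vertices in counterclockwise order, starting at the lowest point."""
    p1 = min(points, key=lambda p: (p[1], p[0]))        # lowest point, leftmost on ties
    rest = sorted((p for p in points if p != p1),
                  key=lambda p: math.atan2(p[1] - p1[1], p[0] - p1[0]))
    hull = [p1]
    for p in rest:                                       # the i loop: add the next point
        # the j loop: eliminate points that no longer make a left turn
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    print(graham_scan([(0, 0), (4, 0), (4, 4), (0, 4), (2, 1)]))
```

The interior point (2, 1) is added and then eliminated when (4, 4) arrives, which is exactly the add/eliminate behavior described in the bullets.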

Complexity?
• find lowest: O(n)
• sort: O(nlgn)
• nested add/eliminate loop
    outer: i from 3 to n
    inner: j from i-1 downto 3
    O(?)

Nested add/eliminate loop
• O(n)!! Why?
• n-2 points are considered in the i loop. Each j loop iteration either eliminates a point, i.e. it will not be checked again, or stops the j loop. Every point is eliminated at most once and each j loop stops at most once, so the total number of points considered in all j loop iterations is O(n)
• Convex hull algorithm complexity: O(nlgn), dominated by the sort

Is there a better algorithm?
• No, but the argument is harder (lower bound arguments usually are)
• It can be shown that sorting can be reduced to convex hull (reduced: translated such that when the convex hull problem is solved, the original sorting problem is solved), and we have shown that sorting is Ω(nlgn). A faster convex hull algorithm would therefore give a faster sort.

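Reduction: x → (x, x²)
sort({3, 1, 2}) → convex hull({(3,9), (2,4), (1,1)})
All points lie on the parabola y = x², so every point is a hull vertex, and walking along the hull from the leftmost point visits the x values in sorted order.
[Figure: the points (1,1), (2,4), (3,9) on the parabola]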
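The reduction can be tried out directly. The sketch below maps each value x to (x, x²), computes the hull, and reads the sorted values off the hull. To avoid hiding a sort inside the hull routine, it uses gift wrapping (Jarvis march), which is a swapped-in hull algorithm chosen for this illustration and not the one from the earlier slides. The function names are chosen here, and the input values are assumed distinct.

```python
def cross(o, a, b):
    """Cross product (a-o) x (b-o); negative if b is clockwise of a as seen from o."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def gift_wrap(points):
    """Hull vertices in counterclockwise order, starting at the leftmost point."""
    if len(points) < 2:
        return list(points)
    start = min(points)                      # leftmost point
    hull, current = [], start
    while True:
        hull.append(current)
        candidate = None
        for p in points:                     # pick the most clockwise next point
            if p == current:
                continue
            if candidate is None or cross(current, candidate, p) < 0:
                candidate = p
        current = candidate
        if current == start:
            return hull


def sort_via_hull(values):
    """Sort distinct numbers by solving a convex hull problem on (x, x*x)."""
    points = [(x, x * x) for x in values]
    return [x for x, _ in gift_wrap(points)]


if __name__ == "__main__":
    print(sort_via_hull([3, 1, 2]))
```

Gift wrapping takes quadratic time on this input, so the sketch only illustrates the translation between the two problems; the lower bound argument itself concerns what any hull algorithm must be able to do.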

Sub-O optimizations
• Suppose you have written an asymptotically optimal program and still want to speed it up.
• Using a profiler, identify which parts of your code are the hotspots of your program.
• 10/90 rule of thumb: 90% of the time is spent in 10% of the code, the hotspots
• These are usually some of the innermost loops
• Only improve the hotspots. Leave the rest clear and simple.

Data reorganization
• Create a sentinel (a value at the boundary) to simplify loop control.
Without sentinel:
    found = false; i = 0;
    while (i < n and not found)
        if (x[i] == T) found = true;
        else i++;
With sentinel (x must have room for one extra element at index n):
    x[n] = T; i = 0;
    while (x[i] != T) i++;
    found = (i < n);
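A runnable Python version of both search loops is sketched below. The function names and the choice to return an index (or -1) are assumptions for illustration; the sentinel is appended to the list and removed again afterwards so the caller's data is left unchanged.

```python
def find_plain(x, target):
    """Linear search with two tests per iteration: bounds and value."""
    i = 0
    while i < len(x) and x[i] != target:
        i += 1
    return i if i < len(x) else -1


def find_sentinel(x, target):
    """Linear search with a sentinel: only the value test remains in the loop."""
    n = len(x)
    x.append(target)          # sentinel at the boundary guarantees termination
    i = 0
    while x[i] != target:
        i += 1
    x.pop()                   # restore the list
    return i if i < n else -1


if __name__ == "__main__":
    data = [8, 3, 5, 9]
    print(find_plain(data, 5), find_sentinel(data, 5))
    print(find_plain(data, 7), find_sentinel(data, 7))
```

The loop in `find_sentinel` has one comparison per element instead of two, which is the point of the technique; whether that is measurably faster depends on the language and compiler.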

Loop unrolling
• Loop unrolling is textually repeating the loop body so that the loop control is executed fewer times
• E.g., a median filter operator on an image executes a 3x3 inner loop for each resulting pixel; this can be fully unrolled
• Some compilers (e.g. CUDA's) allow "unroll k" pragmas
• In a linked list, if the last element points at itself, visiting the elements can be partially unrolled
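As a small illustration of partial unrolling, the sketch below sums a list with the body repeated four times per iteration, followed by a cleanup loop for the leftover elements. The function names and the unroll factor of 4 are arbitrary choices for this example. In an interpreted language such as Python the speed benefit is not guaranteed; the sketch only shows the transformation.

```python
def total_rolled(values):
    """Plain loop: one addition and one loop-control step per element."""
    s = 0
    for i in range(len(values)):
        s += values[i]
    return s


def total_unrolled4(values):
    """Same sum with the body unrolled 4 times: loop control runs about n/4 times."""
    n = len(values)
    limit = n - n % 4                 # largest multiple of 4 not exceeding n
    s = 0
    for i in range(0, limit, 4):
        s += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
    for i in range(limit, n):         # cleanup loop for the remaining 0..3 elements
        s += values[i]
    return s


if __name__ == "__main__":
    data = list(range(1, 11))
    print(total_rolled(data), total_unrolled4(data))
```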

Loop peeling • When the body of a loop tests whether it is on a boundary, and has a special case for that boundary, it is often advantageous to have separate code for the boundary avoiding the conditional in the loop body. • Eg, median filter
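A sketch of loop peeling on a simpler stand-in for the median filter: a 1D 3-point mean filter that replicates the edge values at the boundaries. The filter choice, the boundary rule, and the function names are assumptions made to keep the example short. The first version tests for the boundary inside the loop; the peeled version handles the first and last element separately so the loop body has no conditional.

```python
def smooth_with_test(a):
    """3-point mean; the boundary test sits inside the loop body."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        left = a[i - 1] if i > 0 else a[i]
        right = a[i + 1] if i < n - 1 else a[i]
        out[i] = (left + a[i] + right) / 3
    return out


def smooth_peeled(a):
    """Same filter with the two boundary iterations peeled off the loop."""
    n = len(a)
    if n < 2:
        return [float(v) for v in a]
    out = [0.0] * n
    out[0] = (a[0] + a[0] + a[1]) / 3              # peeled first iteration
    for i in range(1, n - 1):                      # interior: no conditional
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3
    out[n - 1] = (a[n - 2] + a[n - 1] + a[n - 1]) / 3   # peeled last iteration
    return out


if __name__ == "__main__":
    signal = [3, 6, 9, 12, 0]
    print(smooth_with_test(signal))
    print(smooth_peeled(signal))
```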

Loop unrolling and trivial assignments
fibonacci(n)
    a = b = c = 1;
    // what happens if the loop gets unrolled once?
    for i = 3 to n { c = a+b; a = b; b = c }
    return c;
Unrolled once, the trivial assignments a=b and b=c disappear (n/2 is integer division; this version is correct for n >= 2):
fibonacci(n)
    a = b = 1;
    for i = 1 to (n/2 - 1) { a = a+b; b = a+b }
    if odd(n) b = a+b;
    return b;
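Both versions transcribe directly to Python, which makes it easy to check that they agree. The function names are chosen here. Note that the unrolled version differs from the original for n = 1, so the comparison below starts at n = 2.

```python
def fib_rolled(n):
    """Original loop: three assignments per Fibonacci step."""
    a = b = c = 1
    for _ in range(3, n + 1):
        c = a + b
        a = b
        b = c
    return c


def fib_unrolled(n):
    """Loop unrolled once: two Fibonacci steps per iteration, no copy assignments."""
    a = b = 1
    for _ in range(n // 2 - 1):
        a = a + b
        b = a + b
    if n % 2 == 1:            # odd n needs one more step
        b = a + b
    return b


if __name__ == "__main__":
    print([fib_rolled(n) for n in range(2, 12)])
    print([fib_unrolled(n) for n in range(2, 12)])
```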

Memory hierarchy (cache) issues
• Processors are an order of magnitude faster than memories
• Both have been speeding up exponentially for ~30 years, but with different bases, so their ratio has been growing exponentially as well
• Caches keep recently used data (temporal locality) and fetch data in whole cache lines (spatial locality)

Cache issues
• memory wall
• getting over it: cache
• cache line
• cache replacement policy: LRU
• cache and memory layout of the 1D representation of 2D arrays in C
    • row access
    • column access

Data or loop reordering to improve cache performance
Matrix multiply:
    for i = 1 to n
        for j = 1 to n
            C[i,j] = 0
            for k = 1 to n
                C[i,j] += A[i,k]*B[k,j]
• B is accessed in column order. If the arrays are (as in C) stored in row major order, this causes cache misses and unnecessary reads!!
• While one row of A is read, all of B is read. If the cache cannot keep all of B and uses the Least Recently Used replacement policy, all reads of B will cause a cache miss.
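One possible loop reordering is sketched below on flat, row-major arrays: swapping the j and k loops makes the inner loop walk along a row of B (stride 1) instead of down a column (stride n). The i-k-j order and the function names are choices made for this illustration; the slides do not prescribe a particular reordering. Python lists do not expose cache behavior the way C arrays do, so the sketch shows the access pattern and that both orders compute the same product, not a timing result.

```python
def matmul_ijk(A, B, n):
    """Conventional order: the inner loop reads B with stride n (column order)."""
    C = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i * n + k] * B[k * n + j]
            C[i * n + j] = s
    return C


def matmul_ikj(A, B, n):
    """Reordered loops: the inner loop reads B and writes C with stride 1 (row order)."""
    C = [0.0] * (n * n)
    for i in range(n):
        for k in range(n):
            a = A[i * n + k]
            for j in range(n):
                C[i * n + j] += a * B[k * n + j]
    return C


if __name__ == "__main__":
    A = [1, 2, 3, 4]      # 2x2 matrix [[1, 2], [3, 4]] in row-major order
    B = [5, 6, 7, 8]      # 2x2 matrix [[5, 6], [7, 8]]
    print(matmul_ijk(A, B, 2))
    print(matmul_ikj(A, B, 2))
```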

Tiling for improved cache behavior
Instead of reading a whole row of A and computing n inner products of that whole row with whole columns of B, we can read a block of A and compute smaller inner products with sub-columns of B. (Remember blocked matrix multiply in Strassen.) These partial products are then added up.

Conventional matrix multiply
For each row A[i], all elements of B are used once, while all elements of row A[i] are used n times. A[i] may fit in the cache; B probably will not!

Tiled matrix multiply: reuse of a tile of B
• A kxk tile of A (which can fit in the cache) block multiplies with a kxk tile of B (which can fit in the cache) and thus reuses the B tile k times, potentially providing better cache use
• We can parameterize our program with k and experiment
• Data and loop reordering of matrix multiply: assignment 2
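A minimal sketch of a tiled multiply on flat, row-major arrays, parameterized by the tile size. The parameter is called `tile` here (the slide's k) to avoid a clash with the loop index k, and the function name and loop order inside a tile are assumptions for illustration. As with any Python version, this shows the loop structure and correctness only; measuring the cache effect calls for a compiled language.

```python
def matmul_tiled(A, B, n, tile):
    """Multiply two n x n row-major matrices tile by tile; partial products accumulate in C."""
    C = [0.0] * (n * n)
    for ii in range(0, n, tile):                 # tile row of A and C
        for kk in range(0, n, tile):             # tile column of A, tile row of B
            for jj in range(0, n, tile):         # tile column of B and C
                # multiply the A tile at (ii, kk) with the B tile at (kk, jj)
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i * n + k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C


if __name__ == "__main__":
    A = [1, 2, 3, 4]      # [[1, 2], [3, 4]]
    B = [5, 6, 7, 8]      # [[5, 6], [7, 8]]
    print(matmul_tiled(A, B, 2, 1))
```

The `min(...)` bounds let the tile size be any positive integer, including values that do not divide n, which is convenient when experimenting with different tile sizes.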

Experiments you can do
• Transpose B for better cache line behavior
• Tile the loop as in the example
• In array access A[i*N+j], avoid the multiply by doing pointer increments and dereferences
• You will have a number of versions of your code. Make a 2D table of results. Then make observations about your results. In a follow-up discussion, exchange your experiences
