
CS420 lecture six Loops



Presentation Transcript


  1. CS420 lecture six: Loops

  2. Time Analysis of loops Often easy: e.g. bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1]) swap(A,j,j+1) 1. loop body takes constant time 2. loop body is executed (n-1) + (n-2) + … + 1 = n(n-1)/2 times, so bubble sort is O(n²)
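The slide's pseudocode translates directly to C (switched to 0-based indexing):

```c
#include <assert.h>

/* Bubble sort as on the slide: the inner loop shrinks by one each pass,
   so the constant-time body runs (n-1) + (n-2) + ... + 1 = n(n-1)/2 times. */
void bubble_sort(int a[], int n) {
    for (int i = 1; i <= n - 1; i++)
        for (int j = 0; j < n - i; j++)   /* compare a[j] with a[j+1] */
            if (a[j] > a[j + 1]) {
                int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
            }
}
```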

  3. Convex hull Given a set of points in 2D ((x,y) coordinates), find the smallest convex polygon surrounding them all.

  4. Convex hull Given a set of points in 2D ((x,y) coordinates), find the smallest convex polygon surrounding them all. The problem reduces to finding line segments connecting points of the set.

  5. Convex hull

  6. Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line.

  7. Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line. How do you find out whether all other points are on the same side?

  8. Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line. for i = 1 to n for j = i+1 to n for k = 1 to n if (k != i && k != j) check(pi, pj, pk)

  9. Convex hull: first attempt for i = 1 to n for j = i+1 to n for k = 1 to n if (k != i && k != j) check(pi, pj, pk) check is O(1) so this algorithm is O(n³)
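The slides do not define check; one common way to sketch it (an assumption, not the lecture's code) is with the sign of a cross product, testing whether all other points fall on one side of the line through p_i and p_j:

```c
#include <assert.h>

typedef struct { double x, y; } Point;

/* Sign of the cross product (b-a) x (c-a): +1 if c is left of the directed
   line a->b, -1 if right, 0 if collinear. */
static int side(Point a, Point b, Point c) {
    double cross = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    return (cross > 0) - (cross < 0);
}

/* O(n) check: segment p[i]-p[j] is a hull edge iff every other point
   lies on one side of the line through p[i] and p[j]. */
int is_hull_edge(Point p[], int n, int i, int j) {
    int seen = 0;                       /* sign seen so far, 0 = none yet */
    for (int k = 0; k < n; k++) {
        if (k == i || k == j) continue;
        int s = side(p[i], p[j], p[k]);
        if (s == 0) continue;           /* collinear points ignored in this sketch */
        if (seen == 0) seen = s;
        else if (s != seen) return 0;   /* points on both sides: not an edge */
    }
    return 1;
}
```

Running this check for all O(n²) pairs gives the O(n³) first attempt.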

  10. the question that drives us.....

  11. is there a better algorithm? • Find lowest point P1 • Sort remaining points by the angle they form with P1 and the horizontal, resulting in a sequence P2…Pn • Start with P1-P2 in the current hull • for i from 3 to n • add Pi to the current hull for j from i-1 downto 3: eliminate Pj if P1 and Pi are on different sides of line Pj-P(j-1); if Pj stays, break

  12. is there a better algorithm? • (figure: the scan illustrated on points labeled 1–4, showing a point being eliminated)

  13. Complexity? • find lowest: O(n) • sort: O(n lg n) • nested add/eliminate loop outer: i from 3 to n inner: j from i-1 downto 3 O(?)

  14. nested add/eliminate loop • O(n)!! why? • n-2 points are considered in the i loop; the j loop either eliminates a point, i.e. it will not be checked again, or stops. The total number of points considered in all j loop iterations is therefore O(n) • Convex hull algorithm complexity: O(n lg n)
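The algorithm of slides 11–14 (find the lowest point, sort by angle, scan and eliminate) can be sketched in C; this is the standard Graham scan, with the eliminate loop written as stack pops. Details such as the comparator and tie handling are filled in here and are not from the slides:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double x, y; } Pt;

static Pt pivot;   /* the lowest point, used by the sort comparator */

static double cross(Pt o, Pt a, Pt b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

/* Order points by angle around the pivot: a before b if the turn
   pivot->a->b is counterclockwise. */
static int by_angle(const void *pa, const void *pb) {
    double c = cross(pivot, *(const Pt *)pa, *(const Pt *)pb);
    return (c < 0) - (c > 0);
}

/* Returns the hull size; hull vertices are left in h[] (capacity >= n).
   Note: p[] is reordered in place. */
int convex_hull(Pt p[], int n, Pt h[]) {
    int low = 0;
    for (int i = 1; i < n; i++)            /* find lowest point: O(n) */
        if (p[i].y < p[low].y) low = i;
    Pt t = p[0]; p[0] = p[low]; p[low] = t;
    pivot = p[0];
    qsort(p + 1, n - 1, sizeof(Pt), by_angle);   /* sort by angle: O(n lg n) */
    int m = 0;
    for (int i = 0; i < n; i++) {          /* add/eliminate scan: O(n) */
        while (m >= 2 && cross(h[m - 2], h[m - 1], p[i]) <= 0)
            m--;                           /* eliminate: point is not on the hull */
        h[m++] = p[i];
    }
    return m;
}
```

Each point is pushed once and popped at most once, which is exactly the amortized O(n) argument on slide 14.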

  15. is there a better algorithm? • no

  16. is there a better algorithm? • no, but the argument is harder (lower bound arguments usually are) • it can be shown that sorting can be reduced to convex hull (reduced: translated such that when the convex hull problem is solved, the original sorting problem is solved) and we have shown that sorting is Ω(n lg n)

  17. reduction: x → (x, x²) sort({3, 1, 2}) → convex hull({(3,9), (2,4), (1,1)}); reading the hull vertices from the lowest point up gives the sorted order 1, 2, 3

  18. Sub-O Optimizations • Suppose you have written an asymptotically optimal program, and still want to speed it up. • Using a profiler, identify which parts of your code are the hotspots of your program. • 10/90 rule of thumb: 90% of the time is spent in 10% of the code: the hotspots • Usually some of the innermost loops • Only improve the hotspots. Leave the rest clear and simple.

  19. Data reorganization • Create sentinel (value at boundary) to simplify loop control. found = false; i=0; while (i<n and not found) if (x[i]==T) found = true; else i++;

  20. Data reorganization • Create sentinel to simplify loop control. found = false; i=0; while (i<n and not found) if (x[i]==T) found = true; else i++; • Sentinel: value at boundary x[n]=T; i=0; while (x[i]!=T)i++; found = (i<n);
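In C, the sentinel version looks like this (assuming, as the slide does, that the array has room for one extra element at x[n]):

```c
#include <assert.h>

/* Linear search with a sentinel: placing T at x[n] guarantees the scan
   terminates, so the loop tests one condition per element instead of two.
   The array must have capacity n+1. */
int find_sentinel(int x[], int n, int T) {
    x[n] = T;                    /* sentinel at the boundary */
    int i = 0;
    while (x[i] != T) i++;
    return i < n;                /* found iff we stopped before the sentinel */
}
```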

  21. Loop unrolling • Loop unrolling is textually repeating the loop body so that the loop control is executed fewer times • E.g., a median filter operator on an image executes a 3x3 inner loop for each resulting pixel; this can be fully unrolled • some compilers (e.g. CUDA) allow unroll pragmas • in a linked list, if the last element points at itself, visiting the elements can be partially unrolled
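A minimal sketch of partial unrolling (a generic summation example, not from the slides): the body is repeated four times, so the compare-and-increment loop control runs roughly n/4 times instead of n.

```c
#include <assert.h>

/* Sum with the loop body textually repeated four times; a cleanup loop
   handles n not divisible by 4. */
long sum_unrolled(const int a[], int n) {
    long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* leftover elements */
        s += a[i];
    return s;
}
```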

  22. Loop peeling • When the body of a loop tests whether it is on a boundary, and has a special case for that boundary, it is often advantageous to have separate code for the boundary avoiding the conditional in the loop body. • Eg, median filter
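As a 1-D stand-in for the median-filter example (the 2-D version peels whole border rows and columns), here is a small smoothing filter with the first and last iterations peeled, so the interior loop carries no boundary test. The averaging kernel and edge replication are illustrative assumptions:

```c
#include <assert.h>

/* out[i] = average of a[i-1], a[i], a[i+1], replicating edge values.
   Instead of testing (i == 0 || i == n-1) inside the loop, the boundary
   iterations are written out separately: that's loop peeling. */
void smooth(const int a[], int out[], int n) {
    if (n == 0) return;
    if (n == 1) { out[0] = a[0]; return; }
    out[0] = (a[0] + a[0] + a[1]) / 3;                  /* peeled first */
    for (int i = 1; i < n - 1; i++)                     /* branch-free interior */
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3;
    out[n - 1] = (a[n - 2] + a[n - 1] + a[n - 1]) / 3;  /* peeled last */
}
```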

  23. Loop unrolling and trivial assignments fibonacci(n) a=b=c=1; // what happens if the loop gets unrolled once? for i = 3 to n { c=a+b; a=b; b=c } return c;

  24. Loop unrolling and trivial assignments fibonacci(n) a=b=c=1; for i = 3 to n { c=a+b; a=b; b=c } return c; fibonacci(n) a=b=1; for i = 1 to (n/2 - 1) { a=a+b; b=a+b } if odd(n) b=a+b; return b;
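Both slide versions in C, for comparison (valid for n ≥ 2): unrolling the loop once lets the roles of a and b alternate, so the trivial copies a=b; b=c disappear.

```c
#include <assert.h>

/* Original: the copies a=b; b=c run on every iteration. */
long fib(int n) {
    long a = 1, b = 1, c = 1;
    for (int i = 3; i <= n; i++) { c = a + b; a = b; b = c; }
    return c;
}

/* Unrolled once: two Fibonacci steps per iteration, no copies.
   An odd n needs one fix-up step after the loop. */
long fib_unrolled(int n) {
    long a = 1, b = 1;
    for (int i = 1; i <= n / 2 - 1; i++) { a = a + b; b = a + b; }
    if (n % 2 == 1) b = a + b;
    return b;
}
```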

  25. Memory hierarchy (cache) issues • Processors are an order of magnitude faster than memories • both have been speeding up exponentially for ~30 years, but with different bases, so their ratio has been growing exponentially as well • caches keep recently used data (temporal locality) and fetch whole cache lines (spatial locality)

  26. cache issues • memory wall • getting over it: cache • cache line • cache replacement policy: LRU • cache and memory layout of 1D representation of 2D arrays in C • row access • col access

  27. Data or loop reordering for improved cache performance Matrix multiply: for i = 1 to n for j = 1 to n C[i,j]=0 for k = 1 to n C[i,j] += A[i,k]*B[k,j]

  28. Data or loop reordering for improved cache performance Matrix multiply: for i = 1 to n for j = 1 to n C[i,j]=0 for k = 1 to n C[i,j] += A[i,k]*B[k,j] B is accessed in column order. If the arrays are (as in C) stored in row-major order, this causes cache misses and unnecessary reads!!

  29. Data or loop reordering for improved cache performance Matrix multiply: for i = 1 to n for j = 1 to n C[i,j]=0 for k = 1 to n C[i,j] += A[i,k]*B[k,j] While one row of A is read, all of B is read. If the cache cannot keep all of B and uses the Least Recently Used replacement policy, all reads of B will cause a cache miss
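One loop reordering that fixes the column-order accesses of B (a sketch; the lecture leaves the concrete reordering to assignment 2) is moving the k loop outside the j loop, so the innermost loop walks row B[k][*] sequentially:

```c
#include <assert.h>
#include <string.h>

#define N 4

/* i-k-j loop order: the innermost loop accesses C[i][*] and B[k][*]
   sequentially in row-major order, and A[i][k] is reused across it. */
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];            /* hoisted: constant over j */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}
```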

  30. Tiling for improved cache behavior Instead of reading a whole row of A and doing n whole-row-of-A, whole-column-of-B inner products, we can read a block of A and compute smaller inner products with subcolumns of B. (Remember the blocked matrix multiply in Strassen.) These partial products are then added up.

  31. Conventional matrix multiply


  38. Conventional matrix multiply While one row of C is computed, all elements of B are used once, while all of row A[i] are used n times. A[i] may fit in the cache; B probably will not!

  39. Tiled matrix multiply


  48. Reuse of tile of B • A k×k tile of A (which can fit in the cache) block-multiplies with a k×k tile of B (which can fit in the cache) and thus reuses the B tile k times, potentially providing better cache use • We can parameterize our program with k and experiment • Data and loop reordering matrix multiply: assignment 2
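A sketch of the tiled multiply (one possible realization; assignment 2 asks you to write and parameterize your own): three outer loops step over tiles, three inner loops multiply one tile pair, so each k×k tile of B stays cache-resident while it is reused.

```c
#include <assert.h>
#include <string.h>

#define N 8
#define K 4    /* tile size: chosen so a KxK tile fits in cache; here N % K == 0 */

/* Tiled matrix multiply: each KxK tile of B is loaded once and reused for
   K rows of A before moving on, instead of being streamed through the
   cache for every row of A. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    memset(C, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += K)
        for (int kk = 0; kk < N; kk += K)
            for (int jj = 0; jj < N; jj += K)
                /* multiply tile A[ii..][kk..] by tile B[kk..][jj..] */
                for (int i = ii; i < ii + K; i++)
                    for (int k = kk; k < kk + K; k++)
                        for (int j = jj; j < jj + K; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

The partial tile products accumulate into C exactly as in the blocked multiply from the Strassen lecture.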

  49. Experiments you can do • Transpose B for better cache line behavior • Tile the loop as in the example • In array access A[i*N+j] avoid the multiply by doing pointer increments and dereferences • You will have a number of versions of your code. Make a 2D table of results. Then make observations about your results. In a follow up discussion, exchange your experiences
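The third experiment (replacing the A[i*N+j] index arithmetic with pointer increments) might look like this row-sum sketch; the function and its use of a 1-D array for an n×n matrix are illustrative assumptions:

```c
#include <assert.h>

/* Sum row i of an n x n matrix stored 1-D in row-major order, walking a
   pointer instead of recomputing i*n+j for every element. */
long row_sum(const int *A, int n, int i) {
    long s = 0;
    const int *p = A + (long)i * n;      /* one multiply: start of row i */
    const int *end = p + n;
    while (p < end)
        s += *p++;                       /* increment and dereference only */
    return s;
}
```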
