
Mid-Term Review

This announcement covers the upcoming mid-term review session focusing on topics such as Latency vs. Throughput, CPU Architecture, Profiling, and Memory Performance. Key techniques in CPU architecture evolution and memory hierarchy optimizations are discussed.



Presentation Transcript


  1. Mid-Term Review

  2. Announcement • Midterm • Time: Oct. 24th, during class, 50 minutes • Location: BA1130 • Policy: closed book, no aids • Coverage: up to and including dynamic memory management

  3. Topics covered in midterm • Latency vs. throughput • CPU architecture • Profiling • Compiler and optimization • Memory performance • Memory hierarchy, optimizing for caches, virtual memory • Dynamic memory management • Malloc, garbage collection

  4. CPU architecture: key techniques
     Year | Technique                          | Processor   | CPI
     1971 | no pipeline                        | 4004        | n
     1985 | pipeline                           | 386         | close to 1
     1993 | branch prediction                  | Pentium     | closer to 1
     1995 | Superscalar                        | PentiumPro  | < 1
     1999 | Out-of-Order exe.                  | Pentium III | << 1
     2000 | Deep pipeline (shorter cycle), SMT | Pentium IV  | <<< 1

  5. Profiling • Why do we need profiling? • Amdahl's law: speedup = OldTime / NewTime • Example problem: If an optimization makes loops go 4 times faster, and applying the optimization to my program makes it go twice as fast, what fraction of my program is loops? • Solution: newtime = x*oldtime/4 + (1-x)*oldtime, so speedup = oldtime/newtime = 1/(x/4 + 1 - x) = 1/(1 - 0.75x). Setting 1/(1 - 0.75x) = 2 gives x = 2/3.
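The worked example above can be checked with a small helper. This is a sketch; the function name is mine, not from the course:

```c
/* Amdahl's law: overall speedup when a fraction x of the runtime
 * is accelerated by a factor s. Hypothetical helper for the worked
 * example above, not course-provided code. */
double amdahl_speedup(double x, double s) {
    return 1.0 / (x / s + (1.0 - x));
}
```

Plugging in x = 2/3 and s = 4 gives x/4 + (1 - x) = 1/6 + 1/3 = 1/2, so the overall speedup is 2, as the solution claims.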

  6. Profiling tools • We discussed quite a few of them • /usr/bin/time • get_seconds() • get_tsc() • gprof, gcov, valgrind • Important things: • What info. does each tool provide? • What are the limitations?
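As a reminder of what the simpler timing helpers look like, here is one plausible implementation of a get_seconds()-style function. The real course helper may differ; this is a sketch built on the POSIX gettimeofday call (get_tsc() would instead read the CPU's timestamp counter, e.g. via the rdtsc instruction):

```c
#include <sys/time.h>
#include <stddef.h>

/* Wall-clock time in seconds as a double, built on gettimeofday.
 * Sketch of a get_seconds()-style helper; the course version may
 * be implemented differently. */
double get_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}
```

Typical use: call it before and after the region of interest and subtract, keeping in mind the limitations discussed in class (wall-clock time includes other processes, and resolution is limited).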

  7. Compiler and optimization • Machine independent optimizations • Constant propagation • Constant folding • Common Subexpression Elimination • Dead Code Elimination • Loop Invariant Code Motion • Function Inlining • Machine dependent (apply differently to different CPUs) • Loop unrolling • What are the blockers for compiler optimization? • What are the trade-offs for each optimization?
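To make a few of the machine-independent optimizations concrete, here is a hand-applied before/after pair on a small made-up function (the example code is mine, not from the slides):

```c
/* Before: recomputes a common subexpression (a[i] * 2) and a
 * loop-invariant expression (n * 4) on every iteration. */
int sum_scaled_naive(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i] * 2 + a[i] * 2;   /* common subexpression */
        sum += n * 4;                 /* loop-invariant */
    }
    return sum;
}

/* After: common subexpression elimination + loop-invariant
 * code motion, leaving the same result. */
int sum_scaled_opt(const int *a, int n) {
    int sum = 0;
    int inv = n * 4;                  /* hoisted out of the loop */
    for (int i = 0; i < n; i++) {
        int t = a[i] * 2;             /* computed once per iteration */
        sum += t + t + inv;
    }
    return sum;
}
```

The trade-off question from the slide applies here too: hoisting and CSE cost extra registers, which can hurt if register pressure is already high.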

  8. Q9 from midterm 2013 Consider the following functions: int max(int x, int y) { return x < y ? y : x; } void incr(int *xp, int v) { *xp += v; } int add(int i, int j) { return i + j; } The following code fragment calls these functions: int max_sum(int m, int n) { // m and n are large integers int i; int sum = 0; for (i = 0; i < max(m, n); incr(&i, 1)) { sum = add(data[i], sum); // data is an integer array } return sum; } A). Identify all of the optimization opportunities for this code and explain each one. Also discuss whether each one can be performed by the compiler or not. (6 marks)
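One possible answer sketch (not the official solution): hoist max(m, n) out of the loop condition and inline away the small function calls. Here data is passed as a parameter for self-containment; in the exam it is an array visible to the function:

```c
/* Optimized max_sum, answer sketch only. max(m, n) is computed once
 * instead of every iteration, and incr() / add() are inlined. Note a
 * compiler cannot hoist the max(m, n) call on its own unless it can
 * prove the call has no side effects (inlining makes that visible),
 * which is one of the optimization blockers discussed above. */
int max_sum_opt(const int *data, int m, int n) {
    int i;
    int sum = 0;
    int bound = (m < n) ? n : m;   /* max(m, n), hoisted */

    for (i = 0; i < bound; i++) {  /* incr(&i, 1) inlined to i++ */
        sum += data[i];            /* add(data[i], sum) inlined */
    }
    return sum;
}
```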

  9. Loop unrolling for machine dependent optimization • Baseline: void vsum4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int sum = 0; for (i = 0; i < length; i++) sum += data[i]; *dest = sum; } • Unrolled by 3: void vsum5(vec_ptr v, int *dest) { int length = vec_length(v); int limit = length - 2; int *data = get_vec_start(v); int sum = 0; int i; for (i = 0; i < limit; i += 3) { sum += data[i]; sum += data[i+1]; sum += data[i+2]; } for ( ; i < length; i++) { sum += data[i]; } *dest = sum; } • Why does loop unrolling help?
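Unrolling reduces loop overhead, but in vsum5 every addition still feeds the single variable sum, forming one dependency chain. A natural further step (not shown on this slide) is to unroll with multiple accumulators so the additions become independent. Plain arrays stand in here for the course's vec_ptr type:

```c
/* Unroll by 3 with three separate accumulators, sketched on plain
 * arrays rather than the course's vec_ptr abstraction. The three
 * additions per iteration have no dependence on each other, so a
 * superscalar CPU can issue them in parallel. */
int vsum_3acc(const int *data, int length) {
    int sum0 = 0, sum1 = 0, sum2 = 0;
    int limit = length - 2;
    int i;
    for (i = 0; i < limit; i += 3) {
        sum0 += data[i];
        sum1 += data[i + 1];
        sum2 += data[i + 2];
    }
    for (; i < length; i++)   /* finish any leftover elements */
        sum0 += data[i];
    return sum0 + sum1 + sum2;
}
```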

  10. Executing without loop unrolling [figure: execution schedule showing 4 integer ops per loop iteration]

  11. Memory performance: cache • Motivation • L1 cache reference 0.5 ns • Main memory reference 100 ns • 200X slower!

  12. Why Caches Work • Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently • Temporal locality: recently referenced items are likely to be referenced again in the near future • Spatial locality: items with nearby addresses tend to be referenced close together in time

  13. Direct Mapped Cache • Incoming memory address divided into tag, index and offset bits • Index determines set • Tag is used for matching • Offset determines starting byte within block • Example: 32-bit address, block size = 8 bytes, S = 64 sets → tag = bits [31:9] (23 bits), index = bits [8:3] (6 bits), offset = bits [2:0] (3 bits) [figure: 64 sets, each holding a valid bit, a tag, and 8 data bytes]
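The bit-slicing for this direct-mapped example can be written out directly. The function names are mine; the field widths (3-bit offset, 6-bit index, 23-bit tag) come from the slide:

```c
#include <stdint.h>

/* Splitting a 32-bit address for the direct-mapped example above:
 * 3 offset bits (8-byte blocks), 6 index bits (64 sets), 23 tag bits.
 * Helper names are illustrative, not from the slides. */
#define OFFSET_BITS 3
#define INDEX_BITS  6

uint32_t cache_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t cache_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t cache_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```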

  14. Two-way Set Associative Cache (E = 2) • 2-way set associative: two blocks (lines) per set • Example: 32-bit address, block size = 8 bytes, S = 32 sets → tag = bits [31:8] (24 bits), index = bits [7:3] (5 bits), offset = bits [2:0] (3 bits) [figure: 32 sets, each with two lines holding a valid bit, a tag, and 8 data bytes]

  15. Two-way Set Associative Cache (E = 2) • Step 1: the index bits [7:3] select one of the 32 sets [figure: index lookup into the set array]

  16. Two-way Set Associative Cache (E = 2) • Step 2: check the valid bit and compare the tag bits [31:8] against the tag of each line in the set; a match with either line is a hit [figure: valid check and tag comparison across both lines]

  17. Two-way Set Associative Cache (E = 2) • Step 3: on a hit, the offset bits [2:0] select the starting byte within the block [figure: byte lookup within the matched block] • If no match, then one line in the set is selected for eviction and replacement • Replacement policies: random, least recently used (LRU), …
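The lookup steps of slides 15-17 can be sketched in code. The struct layout and names below are illustrative, not from the slides; the bit positions match the 2-way example above:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a 2-way set-associative lookup: 5 index bits (32 sets),
 * 3 offset bits (8-byte blocks), 24 tag bits. Struct layout and
 * names are mine, for illustration only. */
#define E 2            /* lines per set */
#define S 32           /* number of sets */

struct line { bool valid; uint32_t tag; uint8_t data[8]; };
struct set  { struct line lines[E]; };

/* Returns true on a hit; *out receives the requested byte. */
bool cache_lookup(struct set sets[S], uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & 0x7;          /* bits [2:0] */
    uint32_t index  = (addr >> 3) & 0x1f;  /* bits [7:3] selects the set */
    uint32_t tag    = addr >> 8;           /* bits [31:8] */

    for (int e = 0; e < E; e++) {          /* check both ways */
        struct line *ln = &sets[index].lines[e];
        if (ln->valid && ln->tag == tag) { /* valid bit + tag match */
            *out = ln->data[offset];
            return true;
        }
    }
    return false;  /* miss: a victim would be chosen (random, LRU, ...) */
}
```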

  18. Cache miss analysis on matrix mult. c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i++) for (j = 0; j < n; j++) for (k = 0; k < n; k++) c[i*n+j] += a[i*n+k] * b[k*n+j]; } • Assume blocks hold 8 doubles: scanning a row of a costs n/8 misses, scanning a column of b costs n misses (each access is a new block) • First iteration: n/8 + n = 9n/8 misses • Second iteration: again n/8 + n = 9n/8 misses • Total misses (entire mmm): 9n/8 * n² = (9/8) n³

  19. Tiled Matrix Multiplication c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k, i1, j1, k1; for (i = 0; i < n; i += T) for (j = 0; j < n; j += T) for (k = 0; k < n; k += T) /* T x T mini matrix multiplications */ for (i1 = i; i1 < i+T; i1++) for (j1 = j; j1 < j+T; j1++) for (k1 = k; k1 < k+T; k1++) c[i1*n+j1] += a[i1*n+k1] * b[k1*n+j1]; } • Tile size: T x T
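Tiling only reorders the updates, so both versions produce the same result. A minimal self-contained check, using flat row-major arrays and assuming T divides n (the same simplification the slide makes):

```c
/* Naive vs. tiled matrix multiply on flat row-major arrays
 * (c[i*n+j]), assuming T divides n. Sketch for checking that tiling
 * reorders cache accesses without changing the computed product. */
#define T 2

void mmm_naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

void mmm_tiled(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)
                /* T x T mini matrix multiplications */
                for (int i1 = i; i1 < i + T; i1++)
                    for (int j1 = j; j1 < j + T; j1++)
                        for (int k1 = k; k1 < k + T; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
```

The win comes from the cache: each T x T tile of a, b, and c is reused T times while it is resident, so T is chosen so the three working tiles fit in cache together.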

  20. Virtual memory: address translation with a TLB [figure: CPU chip containing the MMU and TLB, connected to cache/memory and the page table] • 1: CPU sends the virtual address (VA) to the MMU • 2: MMU extracts the VPN and looks it up in the TLB • 3: TLB returns the matching PTE (on a miss, the page table in memory is consulted instead) • 4: MMU forms the physical address (PA) and sends it to cache/memory • 5: data is returned to the CPU
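The MMU's first step is pure bit manipulation: split the VA into a virtual page number (the TLB/page-table key) and a page offset (copied unchanged into the PA). A sketch assuming 4 KB pages; the helper names are mine:

```c
#include <stdint.h>

/* Virtual-address decomposition, assuming 4 KB pages. VPN indexes
 * the TLB/page table; VPO passes through unchanged. Names are
 * illustrative. */
#define PAGE_SHIFT 12   /* 4096-byte pages */

uint64_t vpn(uint64_t va) { return va >> PAGE_SHIFT; }
uint64_t vpo(uint64_t va) { return va & ((1u << PAGE_SHIFT) - 1); }

/* On a TLB hit, the PTE supplies the physical page number (PPN),
 * which is recombined with the unchanged offset: */
uint64_t make_pa(uint64_t ppn, uint64_t va) {
    return (ppn << PAGE_SHIFT) | vpo(va);
}
```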

  21. Complete data reference analysis • Q7@midterm 2013

  22. Dynamic mem. management • Alignment • What is alignment? why alignment? • Q6@midterm 2013
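The arithmetic behind alignment is worth having at your fingertips: round a requested size up to the next multiple of the alignment. A minimal sketch (8-byte alignment is typical for malloc; align must be a power of two):

```c
#include <stddef.h>

/* Round size up to the next multiple of align, where align is a
 * power of two. This is the standard trick used inside allocators
 * to keep payloads aligned. */
size_t align_up(size_t size, size_t align) {
    return (size + align - 1) & ~(align - 1);
}
```

Alignment matters because many CPUs load aligned words faster (or require alignment outright), and because malloc must return pointers usable for any data type.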

  23. malloc/free • How do we know how much memory to free given just a pointer? • How do we keep track of the free blocks? • How do we pick a block to use for allocation when many might fit? • How do we reinsert a freed block?

  24. Keeping Track of Free Blocks • Method 1: Implicit list using lengths -- links all blocks [figure: contiguous blocks of sizes 4, 4, 6, 4, each headed by its length] • Method 2: Explicit list among the free blocks using pointers within the free blocks [figure: free blocks A, B, C linked by predecessor and successor pointers] • Method 3: Segregated free list • Different free lists for different size classes
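Method 1 also answers the first question from the previous slide: a size header in front of each block tells free how much memory the pointer covers, and lets the allocator walk the heap block by block. A sketch with an illustrative layout (one 8-byte header word per block, low bit = allocated, size-0 header terminates the heap):

```c
#include <stdint.h>
#include <stddef.h>

/* Implicit-list sketch: each block begins with a header word holding
 * its total size in bytes, with the low bit marking "allocated".
 * Layout and names are illustrative, not a definitive implementation. */
typedef uint64_t header_t;

size_t block_size(header_t h)   { return (size_t)(h & ~(header_t)1); }
int    is_allocated(header_t h) { return (int)(h & 1); }

/* First-fit search: return a pointer to the header of the first free
 * block of at least `need` bytes, or NULL. A size-0 header ends the
 * heap. */
header_t *find_fit(header_t *heap, size_t heap_words, size_t need) {
    header_t *p = heap;
    header_t *end = heap + heap_words;
    while (p < end && block_size(*p) > 0) {
        if (!is_allocated(*p) && block_size(*p) >= need)
            return p;
        p += block_size(*p) / sizeof(header_t);  /* hop to next header */
    }
    return NULL;
}
```

The implicit list makes allocation O(total blocks); explicit and segregated lists exist precisely to search only the free blocks.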

  25. Memory Utilization • Aggregate payload • malloc(p) has a payload of p bytes • aggregate payload = sum of currently allocated payloads • Peak memory utilization = (maximum aggregate payload so far) / (current heap size)

  26. Final remarks • Time/location/policy/coverage • Time: Oct. 24th, during class, 50 minutes • Location: BA1130 • Policy: closed book, no aids • Coverage: up to and including dynamic memory management • Make sure you go over the practice midterms • Best of luck!
