

  1. Program Optimization: Measurement/Loops/Parallelism
  CSCI 2021: Computer Architecture and Organization
  Chapter 5.6-5.9

  2. Today
  • Measuring performance
  • Loops
  • Exploiting instruction-level parallelism
  • Exam
  • Assignment 4 (performance optimization) out later today

  3. Last Time
  • Program optimization
  • Code motion
  • Memory aliasing
  • Procedure calls

  4. Exploiting Instruction-Level Parallelism
  • Hardware can execute multiple instructions in parallel: pipelining plus multiple hardware units for execution
  • Performance is limited by data dependencies
  • Simple transformations can yield dramatic performance improvements
  • Compilers often cannot make these transformations themselves: floating-point arithmetic lacks associativity

  5. Benchmark Example: Data Type for Vectors

/* data structure for vectors */
typedef struct {
    int len;
    double *data;   // or fixed-size: double data[MaxLen];
} vec;

[Diagram: data points to an array of elements indexed 0 .. len-1]

/* retrieve vector element and store at val */
int get_vec_element(vec *v, int idx, double *val) {
    if (idx < 0 || idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}

  6. Benchmark Computation

void combine1(vec_ptr v, data_t *dest) {
    long int i;
    *dest = IDENT;   // 0 for sum, 1 for product
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute the sum or product of the vector elements.
  • Data types: use different declarations for data_t (int, float, double)
  • Operations: use different definitions of OP (+ and *)
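For concreteness, here is one way the benchmark might be parameterized; a minimal sketch assuming a macro-based setup (the slides name data_t, OP, and IDENT but do not show their definitions):

/* Illustrative instantiation: sum of doubles.
   Swap OP to * and IDENT to 1 for the product benchmark. */
typedef double data_t;
#define OP    +
#define IDENT 0   /* identity element: 0 for +, 1 for * */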

  7. Cycles Per Element (CPE)
  • Convenient way to express the performance of a program that operates on vectors or lists; O(n) doesn't tell us enough
  • In our case: CPE = cycles per OP (*dest = *dest OP val)
  • Sum the cycles spent in the loop, then divide by n
  • T = CPE * n + Overhead; CPE is the slope of the line
  • Measure cycles using special instructions

[Plot: cycles vs. n; prog1 has slope 4.0, prog2 has slope 3.5]
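Those "special instructions" read the processor's cycle counter. A minimal sketch, assuming x86-64 with GCC or Clang and the __rdtsc intrinsic; a careful harness would also pin the core, warm the cache, and take the minimum over many trials:

#include <x86intrin.h>

/* Return the (approximate) number of cycles taken by f(). */
static unsigned long long measure_cycles(void (*f)(void)) {
    unsigned long long start = __rdtsc();   /* read time-stamp counter */
    f();
    return __rdtsc() - start;
}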

  8. Fundamental Limits
  • Latency: how long one operation takes
  • Issue time: how long to wait before issuing the next operation; can be < 1 clock cycle due to parallelism
  • Throughput = 1 / issue time (ideal or max)
  • If latency is 5 cycles but throughput is 1 op/cycle, what does that tell us?
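A worked example with illustrative numbers: latency 5 and issue time 1 mean the unit is pipelined, so up to 5 operations can be in flight at once. A chain of dependent operations runs at the latency bound (CPE = 5.0), while independent operations can approach the throughput bound (CPE = 1.0); closing that gap is exactly what the parallelism transformations below aim for.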

  9. Benchmark Performance

void combine1(vec_ptr v, data_t *dest) {
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute the sum or product of the vector elements.
  • Why won't the compiler move vec_length out of the loop? (It cannot prove the call has no side effects or that the loop leaves the vector's length unchanged.)
  • Note: the functional units are highly pipelined

  10. Basic Optimizations

void combine4(vec_ptr v, data_t *dest) {
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

  • Move vec_length out of the loop
  • Remove the call to get_vec_element
  • Avoid the bounds check on each iteration (inside get_vec_element)
  • Accumulate in a temporary
What does the temporary save?
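For contrast, a sketch of the loop without the temporary (this mirrors the textbook's intermediate version, combine3): *dest is read and written on every iteration, so each OP waits on the preceding memory update, and the compiler cannot keep the accumulator in a register because dest might alias the vector's data.

    for (i = 0; i < length; i++)
        *dest = *dest OP d[i];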

  11. Effect of Basic Optimizations

void combine4(vec_ptr v, data_t *dest) {
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

  • Eliminates the sources of overhead in the loop
Drawback?

  12. Looking at Execution: Modern CPU Design (what does this suggest?)

[Diagram: out-of-order CPU. Instruction Control: fetch control and address generation, instruction cache, instruction decode, retirement unit with register file, and branch prediction ("Prediction OK?"); decoded instructions become operations, and results become register updates. Execution: functional units for integer/branch, general integer, FP add, FP mult/div, load, and store; the load and store units exchange addresses and data with the data cache, and operation results feed back.]

  • Arithmetic units have internal pipelines
  • ~100 instructions can be "in flight"
  • Instructions are broken into micro-operations

  13. Loop Unrolling

void unroll2a_combine(vec_ptr v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

  • Benefit? Performs 2x more useful work per iteration than the original loop:
    for (i = 0; i < length; i++)
        t = t OP d[i];

  14. Effect of Loop Unrolling
  • Helps integer multiply only (the compiler does a clever optimization using associativity)
  • The others don't improve. Why? There is still a sequential dependency between iterations:
    x = (x OP d[i]) OP d[i+1];

  15. Loop Unrolling with Reassociation

void unroll2aa_combine(vec_ptr v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

  • Can this change the result of the computation? Yes, for FP: floating-point arithmetic is not associative, so regrouping can change rounding.
  • Compare to before: x = (x OP d[i]) OP d[i+1];
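The non-associativity is easy to demonstrate; a minimal sketch with values chosen to make the rounding difference visible:

#include <stdio.h>

int main(void) {
    double a = 1e20, b = -1e20, c = 1.0;
    printf("%g\n", (a + b) + c);   /* prints 1: a and b cancel first     */
    printf("%g\n", a + (b + c));   /* prints 0: c is absorbed into b + c */
    return 0;
}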

  16. Effect of Reassociation
  • Nearly 2x speedup for FP + and FP *
  • Reason: it breaks the sequential dependency:
    x = x OP (d[i] OP d[i+1]);
  • Why is that? (next slide)

[Plot annotation: results shown against the theoretical best]

  17. Reassociated Computation

x = x OP (d[i] OP d[i+1]);

  • What changed: the d[i] OP d[i+1] for the next iteration can be started early (it has no dependency on x)
  • Overall performance: with N elements and D cycles of latency per op, this should take (N/2 + 1)*D cycles, so CPE = D/2

[Dataflow diagram: the pairwise combines d0 OP d1, d2 OP d3, d4 OP d5, d6 OP d7 are independent; only the chain folding their results into the accumulator is sequential]

  18. Loop Unrolling with Separate Accumulators

void unroll2a_combine(vec_ptr v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;   // 0 or 1
    data_t x1 = IDENT;   // 0 or 1
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];     // "evens"
        x1 = x1 OP d[i+1];   // "odds"
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

  • A different form of reassociation: actual parallelism

  19. Effect of Separate Accumulators
  • 2x speedup (over unroll2) for FP + and FP *
  • Breaks the sequential dependency in a "cleaner," more obvious way:
    x0 = x0 OP d[i];
    x1 = x1 OP d[i+1];

  20. Separate Accumulators

x0 = x0 OP d[i];
x1 = x1 OP d[i+1];

  • What changed: two independent "streams" of operations
  • Overall performance: with N elements and D cycles of latency per op, this should take (N/2 + 1)*D cycles, so CPE = D/2
  • The measured CPE matches the prediction!

[Dataflow diagram: two parallel chains, one accumulating the even-indexed elements and one the odd-indexed elements, combined at the end]

  21. Unrolling & Accumulating
  • Idea
    • Can unroll to any degree L
    • Can expose more potential parallelism
  • Limitations
    • Diminishing returns: cannot go beyond the throughput limitations of the execution units
    • Short vector lengths (N < L): finish off the iterations sequentially
  • (A generalized sketch follows below.)
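A sketch of the generalized pattern for L = 4 with four accumulators; the function name is illustrative (the slides show only the 2-way version):

void unroll4x4_combine(vec_ptr v, data_t *dest) {
    int length = vec_length(v);
    int limit = length - 3;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;
    int i;
    /* Combine 4 elements at a time, one per accumulator */
    for (i = 0; i < limit; i += 4) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
        x2 = x2 OP d[i+2];
        x3 = x3 OP d[i+3];
    }
    /* Finish any remaining elements sequentially */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    *dest = (x0 OP x1) OP (x2 OP x3);
}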

  22. Amdahl’s Law
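For reference, the standard statement: if a fraction α of the original execution time is sped up by a factor k, the overall speedup is

    S = 1 / ((1 - α) + α/k)

Even as k grows without bound, the speedup is capped at 1 / (1 - α): optimizing one part of a program helps only in proportion to that part's share of the total time.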

  23. The Exam
  • Coverage: Chapter 3.7 through Chapter 4 (up to 4.5); does not include performance optimization
  • Procedure calls; stack frames; stack/frame pointer; registers
    • know the code that must be generated to carry out a procedure call, including its return
    • be able to manipulate the stack and access variables
    • recursion
  • Arrays and structures
    • know what they are; understand the C code; alignment issues
    • understand how they map to assembly (for simple structs and 1D arrays)

  24. Exam
  • Processor architecture / ISA
    • x86/Y86: we will give a cheat sheet; no need to memorize all the assembly instructions, register layouts, or instruction definitions
    • RISC vs. CISC
    • Know how to specify simple HCL; write simple logic gates
    • Be able to go from assembly instruction <-> byte-level encoding, and from basic C <-> assembly
  • SEQ and pipelined CPU
    • Hardware components: register file, ALU, etc.
    • Know the instruction stages (F, D, E, M, W)
    • Know why pipelining improves over SEQ
    • Know about data dependencies and hazards
    • Know how to measure basic performance: latency, throughput

  25. Composition
  • Mix of short-answer and work questions (multiple parts): 20% / 80%
  • Recitation will go over an old exam
  • Hints:
    • a question about arrays and structs: know the assembly level
    • a question about SEQ/PIPE
    • a question about mapping assembly back to C
  • To study:
    • review the notes, practice problems, and homework questions
    • refer back to things I *said* in class

  26. Next Time
  • Good luck on the exam
  • No office hours on Friday (out of town, sorry)
  • Have a great weekend!
