Program Optimization: Measurement/Loops/Parallelism
CSCi 2021: Computer Architecture and Organization
Chapter 5.6-5.9
Today
• Measure performance
• Loops
• Exploiting Instruction-Level Parallelism
• Exam
• Assignment 4 out later today: performance optimization
Last Time
• Program optimization
  • Code motion
  • Memory aliasing
  • Procedure calls
Exploiting Instruction-Level Parallelism
• Hardware can execute multiple instructions in parallel
  • pipelining and multiple hardware units for execution
• Performance is limited by data dependencies
• Simple transformations can give dramatic performance improvements
• Compilers often cannot make these transformations
  • lack of associativity in floating-point arithmetic (see the example below)
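Since that last bullet is the reason compilers leave these transformations to the programmer, here is a small illustration (my own example, not from the slides) of floating-point addition failing to be associative:

#include <stdio.h>

int main(void) {
    float a = 1e20f, b = -1e20f, c = 3.14f;
    printf("%f\n", (a + b) + c);   /* 3.140000: a and b cancel exactly, then c is added */
    printf("%f\n", a + (b + c));   /* 0.000000: c is lost when rounded into b */
    return 0;
}

Because (a + b) + c and a + (b + c) can differ, a compiler is not allowed to reorder FP operations the way the hand transformations later in this lecture do.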
Benchmark Example: Data Type for Vectors

/* data structure for vectors */
typedef struct {
    int len;
    double *data;    // double data[MaxLen];
} vec;

(diagram: a vec holds len and a pointer data to elements 0 .. len-1)

/* retrieve vector element and store at val */
int get_vec_element(vec *v, int idx, double *val)
{
    if (idx < 0 || idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}
Benchmark Computation

void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;    // 0 or 1
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute sum or product of vector elements
• Data Types: use different declarations for data_t (int, float, double)
• Operations: use different definitions of OP (+, *)
Cycles Per Element (CPE)
• Convenient way to express performance of a program that operates on vectors or lists; O(n) doesn't tell us enough
• In our case: CPE = cycles per OP (*dest = *dest OP val)
• Sum the cycles in the loop, divide by n
• T = CPE*n + Overhead
• CPE is the slope of the line; measure cycles using special instructions (see the sketch below)
• prog1: slope = 4.0; prog2: slope = 3.5
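A minimal sketch (an assumed harness, not shown in the lecture) of how those cycle counts could be gathered: __rdtsc() from x86intrin.h reads the x86 time-stamp counter, and measuring two vector lengths lets the fixed overhead cancel so the slope is CPE. The helper new_vec and the types vec_ptr/data_t are assumed to come from the course's benchmark code.

#include <x86intrin.h>   /* __rdtsc(): read the x86 time-stamp counter (gcc/clang) */

/* Cycles for one call of combine1 on a vector of length n (vector leaked for brevity). */
unsigned long long cycles_for(long n)
{
    vec_ptr v = new_vec(n);          /* assumed constructor from the benchmark code */
    data_t dest;
    unsigned long long start = __rdtsc();
    combine1(v, &dest);
    return __rdtsc() - start;
}

/* Slope between two lengths: T = CPE*n + Overhead, so the Overhead term cancels. */
double estimate_cpe(long n1, long n2)
{
    return (double)(cycles_for(n2) - cycles_for(n1)) / (double)(n2 - n1);
}

In practice each point would be measured several times and the minimum kept, since interrupts and cold caches inflate individual runs.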
Fundamental Limits
• Latency: how long one operation takes
• Issue time: how long to wait before issuing the next operation
  • can be < 1 clock cycle due to parallelism
• Throughput = 1/issue time (ideal or max)
• If latency is 5 cycles but the issue time is 1 cycle, what does that tell us? (worked numbers below)
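One way to read those numbers (my arithmetic, not stated on the slide): the unit is pipelined, so independent operations overlap even though each one takes 5 cycles.

  chain of n dependent ops:  T ≈ latency * n = 5n cycles                     (CPE ≈ 5)
  n independent ops:         T ≈ n * issue time + (latency - 1) ≈ n + 4      (CPE ≈ 1)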
Benchmark Performance

void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

Compute sum or product of vector elements
• Why won't the compiler move vec_length out of the loop?
• (note: the functional units are highly pipelined)
Basic Optimizations

void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

• Move vec_length out of the loop
• Remove the call to get_vec_element
• Avoid the bounds check on each iteration (in get_vec_element)
• Accumulate in a temporary
• What does the temporary save?
Effect of Basic Optimizations

void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

• Eliminates sources of overhead in the loop
• Drawback?
Looking at Execution: What does this suggest?

Modern CPU Design
(block diagram: the Instruction Control side — fetch control, instruction cache, instruction decode, register file, retirement unit, branch prediction check — issues operations to the Execution side's functional units: integer/branch, general integer, FP add, FP mult/div, load, and store, all connected to the data cache)
• Arithmetic units have internal pipelines
• 100 instructions "in flight"
• Instructions are broken into micro-operations
Loop Unrolling

void unroll2a_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

• Benefit? Perform 2x more useful work per iteration
• Compare to the original loop:
    for (i = 0; i < length; i++)
        t = t OP d[i];
Effect of Loop Unrolling
• Helps integer multiply only
  • compiler does a clever optimization (associativity)
• Others don't improve. Why?
  • Still a sequential dependency between iterations:
    x = (x OP d[i]) OP d[i+1];
Loop Unrolling with Reassociation

void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}

• Can this change the result of the computation?
• Yes, for FP. Why?
• Compare to before: x = (x OP d[i]) OP d[i+1];
Effect of Reassociation
• Nearly 2x speedup for FP +, FP *
• Reason: breaks the sequential dependency
    x = x OP (d[i] OP d[i+1]);
• Why is that? (next slide)
• (plot annotation: theoretical best)
Reassociated Computation
    x = x OP (d[i] OP d[i+1]);
• What changed:
  • Ops in the next iteration can be started early (no dependency)
• Overall Performance
  • N elements, D cycles latency/op
  • Should be (N/2 + 1)*D cycles: CPE = D/2
(dependency tree: the pairwise products d0*d1, d2*d3, d4*d5, d6*d7 are independent; only the chain combining them into x, starting from 1, is sequential)
Loop Unrolling with Separate Accumulators

void unroll2a_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;   // 0 or 1
    data_t x1 = IDENT;   // 0 or 1
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];     // "evens"
        x1 = x1 OP d[i+1];   // "odds"
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}

• Different form of reassociation: actual parallelism
Effect of Separate Accumulators
• 2x speedup (over unroll2) for FP +, FP *
• Breaks the sequential dependency in a "cleaner," more obvious way:
    x0 = x0 OP d[i];
    x1 = x1 OP d[i+1];
Separate Accumulators
    x0 = x0 OP d[i];
    x1 = x1 OP d[i+1];
• What changed:
  • Two independent "streams" of operations
• Overall Performance
  • N elements, D cycles latency/op
  • Should be (N/2 + 1)*D cycles: CPE = D/2
  • CPE matches prediction!
(dependency trees: two independent chains, one accumulating the even elements d0, d2, d4, d6 and one the odd elements d1, d3, d5, d7, each starting from 1 and combined at the end)
Unrolling & Accumulating
• Idea
  • Can unroll to any degree L
  • Can expose more potential parallelism
• Limitations
  • Diminishing returns
    • cannot go beyond the throughput limitations of the execution units
  • Short vector lengths (N < L)
    • finish off iterations sequentially
• (see the sketch below)
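A minimal sketch (my own, following the naming conventions of combine4, with a hypothetical function name) of what unrolling by L = 4 with four separate accumulators could look like:

void unroll4x4_combine(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length - 3;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;
    int i;
    /* Combine 4 elements per iteration, in 4 independent streams */
    for (i = 0; i < limit; i += 4) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
        x2 = x2 OP d[i+2];
        x3 = x3 OP d[i+3];
    }
    /* Finish off any remaining elements sequentially */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    *dest = (x0 OP x1) OP (x2 OP x3);
}

With four independent chains, CPE is limited by the functional unit's throughput rather than its latency, which is exactly where the diminishing returns mentioned above come from.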
The Exam
• Coverage
  • Chapter 3.7 through 4 (up to 4.5); does not include performance optimization
  • Procedure calls; stack frames; stack/frame pointer; registers
    • know the code that must be generated to carry out a procedure call, including its return
    • be able to manipulate the stack and access variables
    • recursion
  • Arrays and structures
    • know what they are; understand C code; alignment issues
    • understand how they map to assembly (for simple structs and 1D arrays)
Exam
• Processor Architecture / ISA
  • x86/Y86: we will give a cheat sheet; no need to memorize all the assembly instructions; register layouts; definitions of instructions
  • RISC/CISC
  • Know how to specify simple HCL; write simple logic gates
  • Be able to go from assembly instruction <-> byte-level encoding; basic C <-> assembly
• SEQ and Pipelined CPU
  • Hardware components: register file, ALU, etc.
  • Know the instruction stages (F, D, E, M, W)
  • Know why pipelining improves over SEQ
  • Know about data dependencies and hazards
  • Know how to measure basic performance: latency, throughput
Composition
• Mix of short answer and work questions (multiple parts): 20%, 80%
• Recitation will go over an old exam
• Hints:
  • question about arrays and structs: know the assembly level
  • question about SEQ/PIPE
  • question about mapping assembly back to C
• To study
  • review notes, practice problems, homework questions
  • refer back to things I *said* in class
Next Time
• Good luck on the exam
• No office hours on Friday (out of town, sorry)
• Have a great weekend!