40 likes | 126 Views
L1:. nop cmp add add bg L1 sub nop nop. 0. 1. 2. 3. 4. 5. 6. 7. 8. or. L1:. nop nop nop cmp add add bg L1 sub. 0. 1. 2. 3. 4. 5. 6. 7. 8. #ifdef FAST
E N D
L1: nop cmp add add bg L1 sub nop nop 0 1 2 3 4 5 6 7 8 or L1: nop nop nop cmp add add bg L1 sub 0 1 2 3 4 5 6 7 8 #ifdef FAST int init (int i) {return (10*(i*1000));} #endif #ifdef SLOW int init (int i) {return (10*(i*1000 + 100));} #endif int i,n,j,k; main() { n = init(10000); j = 100000; k = 1342890; for (I=n; i > 0; i--){ j += (((j&0)+1)); k += (((k&0+1)); } } L1: cmp %g1, 0 add %g2,1,%g2 add %g5,1,%g5 bg,a,pt %icc, L1 sub %g1,1,%g1 Problems: Odd-Fetch “When the target of a branch is word 1 or word 3 of an I-cache line and the fourth instruction to be fetched is a branch, the branch prediction bits from the wrong pair of instructions is used.” [UltraSPARC-I User’s Manual] Slow Alignments: Performance: Fast Slow C-code: 1.2 seconds 3.6 - 4.2 secs Assembly code: 2 cycles/iteration 6 - 7 cycles/iteration
Problems: Delay Slot “…if the address of the instructions in an [I-cache] group cross a 32-byte boundary, an implicit branch is “forced” between instructions at address 31 and 32 (low order bits). That rule has a performance immpact only if a branch is in that specific group. Care should be taken not ot place a branch in a group that crosses this boundary….A group containing instructions I0 (branch), I1, I2, and I3 will be broken, because an artificial branch is forced after address 31 and there is already a branch in the group.” [UltraSPARC User’s Manual] ... A1: Label 1: A2: nop A3: br Label 3: A4: nop … B1: Label 2: B2: cmp %l0, 0 B3: bg Label 1: B4: add %l0,-1,%l0 … C1: Label 3: C2: nop C3: br Label 2: C4: nop ... For each group that crosses a 32 byte boundary (ie crosses a cache line boundary) there is an additional cycle of runtime added per iteration. Therefore: # crossing Time/secs Cycles/Iter boundary 0 1.8 3 1 2.4 4 2 3.0 5 3 3.6 6 Cause: Each of the groups shown above already has branch in it, and when the group crosses a 32 byte boundary an “implicit branch” is added (as stated in manual). But only 1 branch can be executed each cycle, so each I-cache group that crosses a 32 byte boundary, and contains a branch takes 2 cycles to execute instead of 1.
... A1: Label 1: A2: nop A3: br Label 3: A4: nop … B1: Label 2: B2: cmp %l0, 0 B3: bg Label 1: B4: add %l0,-1,%l0 … C1: Label 3: C2: nop C3: br Label 2: C4: nop ... For each group that crosses a 32 byte boundary (ie crosses a cache line boundary) there is an additional cycle of runtime added per iteration. Therefore: # crossing Time/secs Cycles/Iter boundary 0 1.8 3 1 2.4 4 2 3.0 5 3 3.6 6 Problems: Fetching Limitation Basic Problem: Only instructions from a SINGLE I-cache line can be fetched each cycle - For straight line code, this is not an issue - For code with many CTI this can cause a slowdown even if all of the branches are predicted correctly Cause: Every time a group of instructions crosses a 32 byte boundary, two I-cache accesses must be used to fetch the whole group. And in order for the processor to sustain a high rate of instruction dispatch, a high rate of instruction fetch must be occuring from the I-cache to the I-buffer. In the worst case here though, only 1-2 out instructions out of every group fetched is actually being executed, causing the total execution rate to be cut in half.
fmovs %f1, %f1 nop nop fmuls %f2, %f2,%f2 fmovs %f3, %f3 nop nop fmuls %f4, %f4,%f4 fmovs %f5, f5 cmp %l0, 0 bg .LL7 add %l0, -1, %l0 Problems: Grouping “UltraSPARC-I can execute up to four instructions per cycle. The first three instructions in a group occupy slots that are interchangeable with respect to resources…The fourth slot can only be used for PC-based branches or for floating-point instructions.” [UltraSPARC Users Manual] What the UltraSPARC Manual calls the “fourth slot” is actually the first seqential instruction in the group. So if the first sequential instruction in the group is not a floating point operation or a CTI, then all four instructions cannot be dispatched in one group. In example to the left, the first sequential instruction in each group is a fmovs instruction, so the loop can be executed in 3 instructions, and the running time for 100, 000,000 iterations on a UltraSPARC-Enterprise 166 is 1.8 secs. nop fmovs %f1, %f1 nop fmuls %f2, %f2,%f2 fmovs %f3, %f3 nop nop fmuls %f4, %f4,%f4 fmovs %f5, f5 cmp %l0, 0 bg .LL7 add %l0, -1, %l0 In the second example, there is no way to align the groups in this loop such that the first instruction in each group is either a branch or a floating point operation. So on average 3 out of every 7 cycles will be allowed to execute 4 instructions (instead of 3) because of a floating point operation as the first instruction. Thus the execution of this loop takes 3.5 instructions on average, and the running time