Pipelines for Future Architectures in Time Critical Embedded Systems
By: R. Wilhelm, D. Grund, J. Reineke, M. Schlickling, M. Pister, and C. Ferdinand
EEL 6935 - Embedded Systems, Dept. of Electrical and Computer Engineering, University of Florida
Liza Rodriguez, Aurelio Morales
Outline
• Pipelining Review
• Timing Analysis
• Anomalies
• Domino Effects
• Architecture Classifications
• Conclusions
Pipelining Review
• Pipelining is an implementation technique in which the execution of multiple instructions is overlapped
• Pipelining takes advantage of the parallelism that exists among the actions needed to execute an instruction
• A pipeline is like an assembly line: each stage operates in parallel with the other stages
• Instructions enter at one end, progress through the stages, and exit at the other end
• Pipelining is the key implementation technique used to make fast CPUs
Pipelined Example
• Pipeline registers separate functional units to allow parallel operation
• Pipeline will stall if there is a hazard
[Figure: five-stage pipeline (Fetch, Decode, Execute, Memory, Write Back) executing LD r4, 0(r3); ADD r1, r7, r3; ADD r2, r6, r30]
• LD r4, 0(r3): 5 cycles (5)
• ADD r1, r7, r3: 1 cycle (4)
• ADD r2, r6, r30: 1 cycle (4)
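The cycle counts on this slide can be reproduced with a short calculation. The sketch below assumes an ideal five-stage pipeline with no stalls, and it reads the parenthesized numbers (5, 4, 4) as the per-instruction cycle counts of a non-pipelined, multi-cycle implementation. That reading and the function names are assumptions for illustration, not something the slides state.

```python
from typing import Sequence


def pipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Cycles for an ideal, stall-free pipeline.

    The first instruction needs num_stages cycles to pass through the
    pipeline; every following instruction completes one cycle later.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def sequential_cycles(per_instruction_cycles: Sequence[int]) -> int:
    """Cycles when instructions run one after another without overlap."""
    return sum(per_instruction_cycles)


program = ["LD r4, 0(r3)", "ADD r1, r7, r3", "ADD r2, r6, r30"]
assumed_multicycle_costs = [5, 4, 4]  # assumed meaning of the "(5) (4) (4)" labels

print("pipelined: ", pipelined_cycles(len(program)))                # 5 + 1 + 1
print("sequential:", sequential_cycles(assumed_multicycle_costs))   # 5 + 4 + 4
```

Under these assumptions the three instructions take 7 cycles when pipelined, compared with 13 when executed one at a time. A hazard would add stall cycles on top of the pipelined figure.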
Further Optimizations
• Superscalar – executes more than one instruction per clock cycle by simultaneously dispatching multiple instructions to redundant functional units
• Branch Prediction – predicts branches based on a predefined static algorithm or on dynamic branch history
• Out of order execution – instructions are dynamically scheduled to avoid hazards and dependencies that may stall the pipeline
[Figure: two parallel pipelines (Fetch, Decode, Execute, Memory, Write Back) and an out-of-order back end with reservation stations and functional units. Reservation station contents: ADD r1, r2, r3 (wait); SUB r1, r2, r3 (wait); MUL r6, r7, r8 (ready); LD r4, 0(r5) (wait); ST r2, 0(r1) (ready); LD r4, 0(r1) (wait)]
Real Time Embedded Systems
• Timing Analysis
  • The analysis of a set of tasks executing on given hardware to guarantee that timing constraints will be met
  • Timing analysis requires upper and lower bounds on the execution times of tasks to be known: Worst Case Execution Time (WCET) and Best Case Execution Time (BCET)
  • Analysis results are highly dependent on the architecture
  • An architecture without accompanying performance analysis technology should not be seriously considered for time critical embedded applications
• Desired Criteria
  • Soundness – the computed bounds are safe: the WCET bound is never below, and the BCET bound never above, any actual execution time
  • Obtainable Precision – depends on the predictability properties of the architecture
  • Analysis effort to reach precision – depends on the solution space to be explored
Timing Analysis
• Non-Pipelined Architecture – Simple
  • Add the execution times of the individual instructions to obtain a bound on the execution time of a basic block
• Pipelined Architecture – Complex
  • Instructions overlap, so individual instructions cannot be considered in isolation
  • Instructions must be considered collectively to obtain timing bounds
Timing Analysis
• Pipelined Architecture – Complex
  • To do WCET analysis, the most costly pipeline path should be selected
  • To compute a precise bound, the analysis needs to safely exclude as many “timing accidents” as possible; every accident it cannot rule out must be assumed to occur
  • Timing accidents: data hazards, branch mispredictions, occupied functional units, cache misses, etc.
  • Issues: timing anomalies and domino effects
  • Thus, timing analysis has to follow all possible successor states
  • The more performance enhancing features the pipeline has, the larger the search space
Timing Anomaly
• Formal definition – a situation where the local worst case does not contribute to the global worst case
• A more intuitive definition – a local improvement (such as a cache hit or a shorter instruction latency) that has a negative effect on the overall execution time
• Examples:
  • A cache miss may result in a shorter execution time
  • Shortening an instruction may lead to a longer execution time
Timing Anomaly Example: Cache Hit or Miss
• Instructions:
  • A: LD r4, 0(r3)
  • B: ADD r5, r4, r4
  • C: ADD r1, r6, r6
  • D: MUL r2, r1, r1
  • E: MUL r3, r2, r2
• Latencies: Miss Penalty 8 cycles; LSU 2 cycles; ALU 1 cycle; Multiplier 4 cycles
• Architecture is made up of functional units and reservation stations – similar to Tomasulo’s Algorithm
[Figure: instructions A to E scheduled on the LSU, ALU, and MULT units over cycles 1 to 13, in two panels: Cache Hit (listed order A, B, C, D, E) and Cache Miss (listed order A, C, B, D, E)]
Timing Anomaly Example: Reduced Instruction
• Instructions:
  • A: MUL r2, r1, r1
  • B: ADD r3, r2, r2
  • C: ADD r4, r5, r5
  • D: LD r6, 0(r4)
  • E: ADD r7, r6, r6
• Latencies: Miss Penalty 8 cycles; LSU 4 cycles; ALU 2 cycles; Multiplier varied between the two panels (5 cycles and 2 cycles)
• Architecture is made up of functional units and reservation stations – similar to Tomasulo’s Algorithm
[Figure: instructions A to E scheduled on the LSU, ALU, and MULT units over cycles 1 to 13, in two panels: Multiplier = 5 cycles and Multiplier = 2 cycles]
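The two anomaly examples above can be replayed with a small scheduler. This is a minimal sketch, not the model used by the paper or the slides: it assumes that instruction i enters its reservation station in cycle i (in-order dispatch, one per cycle), that each functional unit runs one instruction at a time, that the oldest ready instruction wins a unit, that a result is usable in the cycle after its producer finishes, and that register renaming removes WAR/WAW hazards. The load in the second example is assumed to hit. All names (`Instr`, `simulate`, `total_cycles`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instr:
    name: str              # label used on the slides (A..E)
    unit: str              # functional unit that executes it
    dst: str               # destination register
    srcs: Tuple[str, ...]  # source registers


def _operand_ready(program, idx, src, schedule, cycle) -> bool:
    # Look for the most recent earlier instruction that writes src.
    for prev in reversed(program[:idx]):
        if prev.dst == src:
            return prev.name in schedule and schedule[prev.name][1] < cycle
    return True  # no earlier producer: value is already in the register file


def simulate(program: List[Instr], latency: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {name: (first_cycle, last_cycle)} under the assumed model."""
    schedule: Dict[str, Tuple[int, int]] = {}
    unit_free_at: Dict[str, int] = {}
    cycle = 1
    while len(schedule) < len(program):
        for idx, ins in enumerate(program):   # program order = oldest first
            if idx + 1 > cycle:
                break                          # assumed: not yet dispatched
            if ins.name in schedule or unit_free_at.get(ins.unit, 1) > cycle:
                continue
            if all(_operand_ready(program, idx, s, schedule, cycle) for s in ins.srcs):
                last = cycle + latency[ins.name] - 1
                schedule[ins.name] = (cycle, last)
                unit_free_at[ins.unit] = last + 1
        cycle += 1
    return schedule


def total_cycles(program: List[Instr], latency: Dict[str, int]) -> int:
    return max(last for _, last in simulate(program, latency).values())


# Example 1: cache hit vs. miss (LSU 2, ALU 1, MULT 4, miss penalty 8).
cache_example = [
    Instr("A", "LSU", "r4", ("r3",)),
    Instr("B", "ALU", "r5", ("r4", "r4")),
    Instr("C", "ALU", "r1", ("r6", "r6")),
    Instr("D", "MULT", "r2", ("r1", "r1")),
    Instr("E", "MULT", "r3", ("r2", "r2")),
]
hit = {"A": 2, "B": 1, "C": 1, "D": 4, "E": 4}
miss = dict(hit, A=2 + 8)

# Example 2: multiplier latency 5 vs. 2 (LSU 4, ALU 2).
latency_example = [
    Instr("A", "MULT", "r2", ("r1", "r1")),
    Instr("B", "ALU", "r3", ("r2", "r2")),
    Instr("C", "ALU", "r4", ("r5", "r5")),
    Instr("D", "LSU", "r6", ("r4",)),
    Instr("E", "ALU", "r7", ("r6", "r6")),
]
slow_mult = {"A": 5, "B": 2, "C": 2, "D": 4, "E": 2}
fast_mult = dict(slow_mult, A=2)

print("cache hit :", total_cycles(cache_example, hit))
print("cache miss:", total_cycles(cache_example, miss))
print("MULT = 5  :", total_cycles(latency_example, slow_mult))
print("MULT = 2  :", total_cycles(latency_example, fast_mult))
```

Under these assumptions the sketch finishes the first example in 12 cycles with a cache hit and 11 with a miss, and the second in 10 cycles with the 5-cycle multiplier and 12 with the 2-cycle one. In both cases the locally better event lets B claim the ALU ahead of C, which delays the longer dependency chain behind C. The exact cycle numbers depend on the assumed dispatch model and need not match the original figures.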
Domino Effects
• Formal definition – a system exhibits a domino effect if there are two hardware states s, t such that the difference in execution time of the same program started in s and in t may be arbitrarily high, i.e., cannot be bounded by a constant
• A more intuitive definition – a minor timing accident can cause an unbounded increase in execution time
• Examples:
  • Timing accident in a loop
  • PowerPC 755 pipeline – Schneider
  • Pseudo-least-recently used (PLRU) replacement policy – Berg
Domino Effects
[Figure: two ALU schedules over cycles 1 to 20 for the instruction sequence A B A B A B A]
• A: ADD r4, r3, r3
• B: ADD r1, r2, r2
• A Dispatch: EA +5
• A Execute: Immdt
• B Dispatch: DA+4
• B Execute: DA+6
• The first A gets delayed by one clock cycle due to a dependency on the previous instruction
Classification of Architectures
• Fully Timing Compositional Architectures
  • No timing anomalies or domino effects
  • Timing analysis can safely follow worst case paths only
  • Example: ARM7
• Compositional Architectures with Constant Bounded Effects
  • Exhibit timing anomalies but no domino effects
  • Timing analysis has to consider all paths, but it can be optimized to safely discard all local non-worst case paths by adding a constant number of cycles to the worst case path, trading precision for efficiency
  • Example: Infineon TriCore
• Non Compositional Architectures
  • Exhibit timing anomalies and domino effects
  • Timing analysis has to follow all possible paths, since a local effect can influence future execution arbitrarily
  • Example: PowerPC 755
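The classification above amounts to a two-question decision. The sketch below encodes it as a lookup, assuming an architecture is described by just two flags. The names `ArchClass`, `classify`, and `analysis_strategy` are hypothetical. The slides do not cover an architecture with domino effects but no timing anomalies, so treating that case as non-compositional is a conservative assumption.

```python
from enum import Enum


class ArchClass(Enum):
    FULLY_COMPOSITIONAL = "fully timing compositional"
    CONSTANT_BOUNDED = "compositional with constant-bounded effects"
    NON_COMPOSITIONAL = "non-compositional"


def classify(has_timing_anomalies: bool, has_domino_effects: bool) -> ArchClass:
    """Map the two properties from the slide to one of the three classes."""
    if has_domino_effects:
        # Assumption: any domino effect makes the architecture non-compositional.
        return ArchClass.NON_COMPOSITIONAL
    if has_timing_anomalies:
        return ArchClass.CONSTANT_BOUNDED
    return ArchClass.FULLY_COMPOSITIONAL


def analysis_strategy(arch: ArchClass) -> str:
    """Summarize which paths a timing analysis must follow for each class."""
    return {
        ArchClass.FULLY_COMPOSITIONAL: "follow local worst-case paths only",
        ArchClass.CONSTANT_BOUNDED: "follow local worst-case paths and add a constant penalty",
        ArchClass.NON_COMPOSITIONAL: "follow all possible paths",
    }[arch]


# Examples named on the slide, with the properties the slide attributes to them.
for name, anomalies, domino in [
    ("ARM7", False, False),
    ("Infineon TriCore", True, False),
    ("PowerPC 755", True, True),
]:
    arch = classify(anomalies, domino)
    print(f"{name}: {arch.value} -> {analysis_strategy(arch)}")
```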
Conclusions
• Architectural optimizations in embedded systems are necessary to improve performance and to meet critical time constraints
  • Pipelines – multiple issue, out of order execution, branch prediction, etc.
• However, an architectural optimization may not be worth implementing if effects such as timing anomalies and domino effects will have a negative impact on timing analysis
  • How good is an optimization if you can’t measure its effects?
• A trade-off exists between the amount of execution time you can save through pipeline optimizations and the amount of precision you lose in timing analysis