Chapter 7 Performance Analysis Techniques
Outline • Real-time performance analysis • Applications of queueing theory • Input/output performance • Analysis of memory requirements
7.1 Real-time performance analysis
Theoretical preliminaries • Complexity classes P, NP, NP-complete, NP-hard • P: the class of problems that can be solved by an algorithm that runs in polynomial time on a deterministic computing machine. • NP: the class of problems that can be solved in polynomial time by a nondeterministic machine; no polynomial-time deterministic algorithm is known for them, • but a candidate solution can be verified as correct or not by a P-class algorithm. • NP-complete: a problem that belongs to the class NP and to which all other problems in NP are polynomially transformable. • NP-hard: a problem to which all problems in NP are polynomially transformable, but which has not been shown to belong to the class NP.
Examples • The Boolean satisfiability problem (N-SAT, with N Boolean variables) is NP-complete. • However, the SAT problem restricted to only two or three Boolean variables is in P, since the constant number of candidate assignments can be checked exhaustively. • Such problems can arise in requirements consistency checking. • In general, NP-complete problems in RTSs tend to be those relating to resource allocation in multitask scheduling situations. • This implies there is no easy way to find the solutions.
More examples • The problem of deciding whether it is possible to schedule a set of periodic tasks that use only semaphores to enforce mutual exclusion is NP-hard. • The multiprocessor scheduling problem with two processors, no resources, arbitrary partial-order relations, and every task having a 1-unit computation time is polynomial. • The multiprocessor scheduling problem with two processors, no resources, independent tasks, and arbitrary task computation times is NP-complete. • The multiprocessor scheduling problem with two processors, no resources, independent tasks, arbitrary partial-order relations, and task computation times of either 1 or 2 units of time is NP-complete. • Partial order: any task can call itself; if A calls B, the reverse is not possible; and if A calls B and B calls C, then A can call C.
Arguments related to parallelization • Amdahl's law • Statement: For a constant problem size, the incremental speedup approaches zero as the number of processing elements grows. • Formalism: Let N be the number of equal processors available for parallel processing, and let S (0 ≤ S ≤ 1) be the fraction of program code that is serial in nature (cannot be parallelized). The achievable speedup is
Speedup(N) = 1 / (S + (1 − S)/N)
which saturates to the limit value 1/S as N approaches infinity.
Some discussion • Amdahl's pessimistic law has been cited as an argument against parallel systems and, in particular, against massively parallel processors. • It was taken as an insurmountable bottleneck that limited the efficiency and applicability of parallelism to various problems. • Later research provided new insights into Amdahl's law and its relation to large-scale parallelism.
Flaws of Amdahl's law • Key assumption of Amdahl's law: the problem size remains constant, whereas in practice the problem size tends to scale with the size of a parallel system. • Items that scale with the problem size: the parallel or vector part of a program. • Items that do not grow with the problem size: the inherent time for vector start-up, program loading, serial bottlenecks, and I/O, which make up the serial component.
Gustafson’s law • Definition: If the firmly serial code fragment, S, and the parallelized fragment, (1 − S), are processed by a parallel computer system with N equal processors, the achievable speedup is
Speedup(N) = S + N(1 − S)
• This speedup does not saturate as N approaches infinity, so it provides a more optimistic picture of speedup. • The current “multi-core era” can be viewed as a partial consequence of Gustafson’s law. “A more efficient way to use a parallel computer is to have each processor perform similar work, but on a different section of the data … where large computations are concerned” (Hillis, 1998)
Gustafson vs. Amdahl • Figure: Gustafson's unbounded speedup compared with Amdahl's saturating speedup when 50% of the code is suitable for parallelization.
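To make the contrast concrete, below is a minimal C sketch (not from the original slides) that tabulates both formulas for the 50% serial fraction used in the figure; Amdahl's curve flattens toward 1/S = 2, while Gustafson's grows roughly linearly with N.

#include <stdio.h>

/* Amdahl: fixed problem size; speedup saturates at 1/S. */
static double amdahl(double s, int n) { return 1.0 / (s + (1.0 - s) / n); }

/* Gustafson: problem scales with N; speedup grows without bound. */
static double gustafson(double s, int n) { return s + n * (1.0 - s); }

int main(void)
{
    const double s = 0.5;               /* 50% serial fraction, as in the figure */
    printf("N      Amdahl  Gustafson\n");
    for (int n = 1; n <= 1024; n *= 4)
        printf("%4d   %6.3f  %9.1f\n", n, amdahl(s, n), gustafson(s, n));
    return 0;                           /* Amdahl -> 2.0; Gustafson -> 512.5 */
}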
Execution time estimation from program code • Analyzing RTSs to see if they meet their critical deadlines is • rarely possible to do exactly, due to the NP-completeness of most scheduling problems, • but it is possible to get a handle on the system's behavior through approximate analysis. • The first step in performing schedulability analysis is to predict, estimate, or measure the execution time of essential code units. • Methods to determine a task's execution time ei: • Using a logic analyzer (most accurate; employed in the final stages, during system integration) • Counting CPU-specific instructions manually or using automated tools • Reading the system clock before and after executing the particular program code
Example: instruction counting application • A certain program module converts raw sensor pulses into the actual acceleration components that are later compensated for temperature and other effects. • The module also has to decide whether the aircraft is still on the ground, in which case only a small acceleration reading for each of the XYZ components is allowed (represented by the symbolic constant PRE_TAKE). • The C code with the corresponding assembly instructions is given.
Example 1 • Tracing the worst-case execution path and counting the instructions shows • 12 integer instructions (7.2 µs) and • 15 floating-point instructions (75 µs), • for a total execution time of 82.2 µs. • Since this sequence of code runs in a 5-ms cycle, the corresponding time loading is only 82.2/5000 ≈ 1.6%.
Example 2: estimation on a non-pipelined CPU platform • All execution paths: • Path 1: instructions 1-4, 9-10, 12 • 7 instructions @ 0.6 µs each -> 4.2 µs (BCET) • Path 2: instructions 1-7, 11-12 • 9 instructions @ 0.6 µs each -> 5.4 µs • Path 3: instructions 1-8, 12 • 9 instructions @ 0.6 µs each -> 5.4 µs (WCET)
Example 2: estimation on a pipelined CPU platform • Assume a three-stage pipeline: • fetch (F), decode (D), execute (E) • Each stage takes 0.6 µs / 3 = 0.2 µs (see Figures 7.2, 7.3, and 7.4). • The execution time of all three paths is 2.6 µs.
Some discussion • RTS designers frequently use special software to estimate instruction execution times and CPU throughput. • Users can typically input • the CPU type, • memory speeds for different address ranges, and • the instruction mix, • and the tool then computes total instruction times and throughput.
Example 3: timing accuracy with a 60-kHz system clock • Suppose • 2000 repetitions of the program code take 450 ms, and • the clock granularity is 16.67 µs. • Hence, the execution time measurement has a high accuracy: the ±1-tick error of 16.67 µs applies to the whole 450-ms measurement, so the relative error is 16.67 µs / 450 ms ≈ 0.0037%, and the time per repetition is 450 ms / 2000 = 225 µs.
C code to compute the execution time of a code unit • API functions: • current_clock_time(): a system function that returns the current time • function_to_be_timed(): the actual code to be timed
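The timer code itself is not reproduced in this extract; the following is a hedged C sketch of how such a measurement harness is typically structured, assuming the two API functions named above and the 60-kHz (16.67-µs) clock of Example 3. Repeating the measured code many times divides the one-tick measurement error accordingly.

#include <stdio.h>

#define N_REPS  2000                /* repetitions to average out clock granularity */
#define TICK_US 16.67               /* one 60-kHz clock tick in microseconds */

extern unsigned long current_clock_time(void);  /* system clock, in ticks (assumed) */
extern void function_to_be_timed(void);         /* the code under measurement */

int main(void)
{
    unsigned long start = current_clock_time();
    for (int i = 0; i < N_REPS; i++)
        function_to_be_timed();
    unsigned long stop = current_clock_time();

    /* The +/- one-tick error applies to the whole run, so the per-call
       error is TICK_US / N_REPS, i.e., about 0.008 us here. */
    double per_call_us = (stop - start) * TICK_US / N_REPS;
    printf("estimated execution time: %.3f us\n", per_call_us);
    return 0;
}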
Analysis of polled-loop systems • The response time consists of three components: • the cumulative hardware delays involved in setting the software flag by some external device (nanoseconds), • the time for the polled loop to test the flag (microseconds), and • the time needed to process the event associated with the flag (milliseconds). • Assumption: sufficient processing time is available between consecutive events.
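For reference, a polled loop reduces to a few lines of C; the sketch below is an assumed generic form (not the slides' exact code) with comments marking where the flag-testing and processing delays arise.

extern void process_event(void);    /* event handler; its runtime is t_P */

volatile int flag = 0;              /* set by the external device or its hardware */

void polled_loop(void)
{
    for (;;) {
        if (flag) {                 /* t_F: time to test the flag */
            flag = 0;               /* re-arm for the next excitation */
            process_event();        /* t_P: time to process the event */
        }
    }
}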
Analysis of polled-loop systems • If events overlap each other, • i.e., a new event is initiated while a previous one is still being processed, • then the response time becomes worse: the Nth overlapping event may have to wait for up to N flag tests and N event processings, so its response time is bounded by
R_N = N (t_F + t_P)
where • t_F is the time to check the flag and • t_P is the time to process the event, • ignoring the time for the external device to set the flag. • In practice, some limit is placed on N, the number of events that are allowed to overlap. • Overlapping events may not be desirable at all in certain situations.
Review: coroutine central dispatcher • Two tasks, task_a and task_b, execute in parallel and in isolation. • state_a and state_b are global variables managed by the dispatcher to maintain synchronization and inter-task communication. • The dispatcher-driven phase sequence is: phase_a1(); phase_b1(); phase_a2(); phase_b2(); ….

void task_a() {
  for(;;) {
    switch(state_a) {
      case 1: phase_a1(); break;
      case 2: phase_a2(); break;
      case 3: phase_a3(); break;
    }
  }
}

void task_b() {
  for(;;) {
    switch(state_b) {
      case 1: phase_b1(); break;
      case 2: phase_b2(); break;
      case 3: phase_b3(); break;
    }
  }
}
Analysis of coroutine systems • Tracing the execution path in a two-task coroutine system: a central dispatcher calls task_1 and task_2 in turns (its switch statement is not shown here); execution begins at task_1a() and the sequence then repeats.

void task_1() {
  …
  task_1a(); return;
  task_1b(); return;
  task_1c(); return;
}

void task_2() {
  …
  task_2a(); return;
  task_2b(); return;
}

• The absence of interrupts in coroutine systems makes the determination of response time easy: • the time is obtained by tracing the worst-case execution path through all tasks. • One must first determine the execution time of each phase.
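The dispatcher's own loop is not shown in the original; a minimal sketch, assuming the two task functions above, is simply an endless alternation, so the worst-case response time is found by summing the longest phases encountered in one full trip around this loop.

extern void task_1(void);   /* runs one phase, then returns */
extern void task_2(void);   /* runs one phase, then returns */

void dispatcher(void)
{
    for (;;) {
        task_1();           /* one phase of task_1 */
        task_2();           /* one phase of task_2 */
    }
}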
Review: round-robin scheduling is simple and predictable • It achieves fair allocation of CPU resources among tasks of the same priority by time multiplexing. • Each executable task is assigned a fixed time quantum or time slice in which to execute. • A fixed-rate clock is used to initiate an interrupt at a rate corresponding to the time slice. • Figure: tasks A, B, and C take turns; A is preempted at the end of its slice, B takes over, C runs its slice, A resumes, and eventually each task completes.
Analysis of round-robin systems • Assumptions and definitions: • there are n tasks in the ready queue, no new ones arrive after scheduling, and none terminates prematurely; • let q be the constant time slice for each task; • possible slack time within a time slice is not utilized; • let c = max{c1, …, cn} be the maximum task execution time. • Since slack within a slice is not utilized, a task needing c time units occupies ⌈c/q⌉ full slices, and between its slices the other n − 1 tasks each consume q. Thus, the worst-case time T from readiness to completion for any task (an upper bound) is
T = ⌈c/q⌉ · n · q
Example: turnaround time calculation without context-switching overhead • Suppose there is only one task with a maximum execution time of 500 ms, and the time quantum is 100 ms; thus T = ⌈500/100⌉ · 1 · 100 ms = 500 ms. • Suppose there are five equally important tasks, each with a maximum execution time of 500 ms, and the time quantum is 100 ms; then T = ⌈500/100⌉ · 5 · 100 ms = 2500 ms.
Non-negligible context-switching overhead • Let o be the context-switching overhead incurred with each task switch. • Thus, each task waits no longer than (n − 1)q until its next time slice, plus an inherent overhead of n·o time units each time around for context switching, giving
T = ⌈c/q⌉ · n · (q + o)
Examples • Suppose there is one task with a maximum execution time of 500 ms, the time quantum is 40 ms, and a context switch takes 1 ms; thus T = ⌈500/40⌉ · 1 · (40 + 1) ms = 13 · 41 ms = 533 ms. • Suppose there are six equally important tasks, each with a maximum execution time of 600 ms, the time quantum is 40 ms, and a context switch costs 2 ms; then T = ⌈600/40⌉ · 6 · (40 + 2) ms = 15 · 6 · 42 ms = 3780 ms.
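A small C helper (a sketch based on the turnaround-time formula reconstructed above, not code from the slides) reproduces both results:

#include <math.h>
#include <stdio.h>

/* Worst-case turnaround T = ceil(c/q) * n * (q + o), assuming slack
   within a slice is wasted; pass o = 0 for negligible overhead. */
static double rr_turnaround(double c, double q, double o, int n)
{
    return ceil(c / q) * n * (q + o);
}

int main(void)
{
    /* one task: c = 500 ms, q = 40 ms, o = 1 ms -> 13 * 41 = 533 ms */
    printf("T1 = %.0f ms\n", rr_turnaround(500, 40, 1, 1));
    /* six tasks: c = 600 ms, q = 40 ms, o = 2 ms -> 15 * 6 * 42 = 3780 ms */
    printf("T2 = %.0f ms\n", rr_turnaround(600, 40, 2, 6));
    return 0;
}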
Selection of the time quantum q • It is desirable that q < c, so that the round-robin system actually time-multiplexes the tasks and achieves fair behavior. • If q is very large, the round-robin algorithm becomes, in effect, the first-come, first-served algorithm, since each task executes to completion within its very large time quantum.
Review: Fixed-priority scheduling: rate-monotonic approach • Theorem: Given a set of periodic tasks and preemptive priority scheduling, assigning priorities such that tasks with shorter periods have higher priorities yields an optimal scheduling algorithm. • Optimality implies: if a schedule that meets all the deadlines exists with fixed priorities, then the RM algorithm will also produce a feasible schedule.
Analysis of fixed-period/priority systems • For any task τi with an execution time of ei time units, the response time Ri is
Ri = ei + Ii     (7.7)
where Ii is the maximum possible delay in τi's execution (caused by higher-priority tasks) during the interval [t, t + Ri). • At the most critical time instant, when all higher-priority tasks are released together with task τi, Ii makes its maximum contribution to Ri.
Analysis of fixed-period systems • Consider a task τj of higher priority than τi. • Within the interval [0, Ri), the number of releases of τj will be
⌈Ri / pj⌉     (7.8)
where pj is the execution period of τj. • Each release of τj contributes ej to the amount of interference from higher-priority tasks that τi will suffer.
A recursive solution to response time • Each task τj of higher priority interferes with task τi; hence
Ii = Σ_{j ∈ HP(i)} ⌈Ri / pj⌉ ej     (7.9)
where HP(i) is the set of tasks with higher priority than τi. • Substituting this into Equation 7.7 yields
Ri = ei + Σ_{j ∈ HP(i)} ⌈Ri / pj⌉ ej     (7.10)
A recursive solution to response time • Because of the ceiling function, it is difficult to solve for Ri directly. A neat recursive solution is the following:
Ri^(m+1) = ei + Σ_{j ∈ HP(i)} ⌈Ri^m / pj⌉ ej     (7.11)
• Compute the consecutive values Ri^0 = ei, Ri^1, Ri^2, … iteratively until the first value of m is found such that Ri^(m+1) = Ri^m. • If the recursive equation does not have a solution, the value of Ri^m will continue to grow, • as in the overloaded case, where a task set has a CPU utilization factor greater than 100%.
Example: computing response times in a rate-monotonic case • Consider a task set to be scheduled rate-monotonically, as shown below. • Let us first calculate the CPU utilization factor, U = Σ ei/pi, to make sure the RTS is not overloaded.
Example: computing response times in a rate-monotonic case • The highest-priority task has a response time equal to its execution time, so R1 = 3. • The medium- and lowest-priority tasks have their response times computed iteratively according to Equation 7.11.
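Since the original task-set table is not reproduced in this extract, the C sketch below uses a hypothetical task set (only e1 = 3 is taken from the slide, with assumed periods giving U = 0.7) to show how Equation 7.11 is iterated in practice; tasks are indexed in rate-monotonic priority order.

#include <math.h>
#include <stdio.h>

/* Iterate Equation 7.11 for task i: R = e[i] + sum over higher-priority
   tasks j of ceil(R/p[j]) * e[j], until R stops changing. Returns -1
   if R grows past the period, i.e., the task is unschedulable. */
static double response_time(int i, const double e[], const double p[])
{
    double r = e[i], prev = 0.0;
    while (r != prev) {                      /* exact: values are integer sums */
        prev = r;
        r = e[i];
        for (int j = 0; j < i; j++)          /* higher-priority tasks */
            r += ceil(prev / p[j]) * e[j];
        if (r > p[i])
            return -1.0;                     /* no solution: overload */
    }
    return r;
}

int main(void)
{
    double e[] = { 3, 5, 6 };     /* execution times (e1 = 3 as above) */
    double p[] = { 10, 20, 40 };  /* hypothetical periods, U = 0.7 */
    for (int i = 0; i < 3; i++)
        printf("R%d = %.0f\n", i + 1, response_time(i, e, p));
    return 0;                     /* prints R1 = 3, R2 = 8, R3 = 17 */
}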
Analysis of non-periodic systems • In practice, an RTS having one or more aperiodic or sporadic cycles can be modeled as a rate-monotonic system, • where each non-periodic task is approximated as having a period equal to its worst-case expected inter-arrival time. • If this rough approximation leads to unacceptably high utilization, some heuristic analysis is used instead (queueing theory).
Response times for interrupt-driven systems • The calculation depends on several factors: • Interrupt latency • Scheduling/dispatching times • These are negligible when the CPU uses a separate interrupt controller supporting multiple interrupts, • and can be computed using simple instruction counting when a single interrupt is supported with an interrupt controller. • Context switch times • Determination of context save/restore times is similar to execution time estimation for any application code.
Interrupt latency • A varying period between the moment • a device requests an interrupt and • the moment the first instruction of the associated interrupt service routine executes. • Worst-case interrupt latency • occurs when all possible interrupts in the system are requested simultaneously. • Main contributors: • the number of tasks, since the RTOS needs to disable interrupts while it is processing lists of blocked or waiting tasks. • Perform latency analysis to verify that the OS is not disabling interrupts for an unacceptably long time. • In hard RTSs, keep the number of tasks as low as possible.
Another contributor is the time needed to complete the execution of the particular machine-language instruction being interrupted. • Find the WCET of every machine-language instruction by measurement, simulation, or the manufacturer's datasheet. • The instruction with the longest execution time will maximize the contribution to interrupt latency if it has just begun executing when the interrupt request arrives. • In a certain 32-bit MCU, • all fixed-point instructions take 2 µs, • floating-point instructions take 10 µs, and • special instructions such as trigonometric functions take 50 µs.
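One hedged way to use such figures is a simple worst-case latency budget; the sketch below sums the pessimistic contributors, where only the 50-µs instruction time comes from the MCU example above and the other two constants are purely hypothetical placeholders.

/* Pessimistic interrupt-latency budget (all figures in microseconds). */
#define MAX_DISABLED_US   25.0   /* hypothetical longest interrupt-disabled region */
#define LONGEST_INSTR_US  50.0   /* trigonometric instruction, from the example */
#define DISPATCH_US        1.5   /* hypothetical vectoring + ISR-entry overhead */

double worst_case_latency_us(void)
{
    /* Summing all three is deliberately conservative: in reality the
       disabled region and the interrupted instruction cannot both
       delay the same request in full. */
    return MAX_DISABLED_US + LONGEST_INSTR_US + DISPATCH_US;
}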
Deliberate disabling of interrupts by RT software • Interrupts are disabled for a number of reasons, including • protection of critical regions, • buffering routines, and • context switching. • Therefore, interrupt disabling should be allowed in system software only, not in application software.
Architecture enhancements can render a system unanalyzable for RT performance • Instruction and data caches • On a cache miss, instructions must be fetched from slower main memory, and a time-consuming replacement algorithm runs to bring the missing items into the cache. • Instruction pipelines • For worst-case analysis, assume that at every possible opportunity the pipeline needs to be flushed. • Direct memory access (DMA) • Assume that cycle stealing is occurring at every opportunity, inflating instruction fetch times. • These features improve average computing performance but destroy determinism, and thus make prediction troublesome.
Review: DMA controller • During a DMA transfer, the ordinary CPU data-transfer process cannot proceed: • the CPU proceeds only with non-bus-related activities, and • the CPU cannot service any interrupts until the DMA cycle is over. • Cycle-stealing mode • No more than a few bus cycles are used at a time for a DMA transfer; • thus, a single transfer cycle for a large data block is split into several shorter transfer cycles.
Discussion • Traditional worst-case analysis leads to impractically pessimistic outcomes. • Solution: use probabilistic performance models for caches, pipelines, and DMA. • The system may then no longer definitely meet all the required deadlines, but it is often sufficient to have a probabilistic guarantee very close to 100% instead of an absolute guarantee. • This practical relaxation dramatically reduces the WCET to be considered in schedulability analysis. • In hard RTSs, however, it remains problematic to use these advanced CPU and memory architectures.