Advanced Embedded Systems

Lecture 12: Multiprocessors in Embedded Systems (2)

• Unlike RTOSs, which provide apparent concurrency on a single processor, multiprocessor platforms offer true concurrency: high performance, but difficult to analyze and debug.
• Software for embedded multiprocessors (EMs) raises two questions:
  • How does embedded multiprocessor software differ from general-purpose multiprocessor software? Solutions from general-purpose computing may be reused, but new methods must also be devised.
  • Can embedded multiprocessor software be seen as an extension of embedded uniprocessor software? Some embedded applications can be ported from a single processor to a multiprocessor but, in general, there are important differences.
• Difference: EMs are often heterogeneous, which is less common in general-purpose multiprocessors (GPMs). Heterogeneity raises several problems:
  • Combining the software of different processors can be difficult; endianness and library compatibility problems are classic.
  • The development environments are often loosely coupled, so learning all the tools takes a long time and debugging becomes harder.
  • Different processors may have different resources and interfaces.

• Difference: delays are harder to predict in multiprocessors.
• Sources of delay variation:
  • True concurrency provided by multiprocessors;
  • Larger size of multiprocessors;
  • CPU heterogeneity;
  • Structure of the memory.
• Larger delays and larger variance in delays cause problems such as:
  • Timing bugs are hard to test for and to fix;
  • Variations in computing time make it hard to use system resources efficiently and require more decisions to be made at run time;
  • Code that performs data-dependent operations is harder to execute efficiently because of long memory access delays.
• Scheduling a multiprocessor is more difficult than scheduling a uniprocessor:
  • Heuristics must be used because no efficient optimal scheduling algorithms are known for most EMs;
  • The processors may not base their scheduling decisions on the same information;
  • One reason for the difficulty is that communication is no longer free: transferring information takes tens of clock cycles, so scheduling decisions must sometimes be made without full information.

• Embedded multiprocessor operating systems (MOSs) may be implemented in several ways:
  • Each processor may run its own OS, and the OSs communicate to coordinate their activities;
  • Several processors may run a more tightly coupled OS.
• A simple form of MOS is one master, many slaves:
  • The master PE determines the schedules for itself and for the slaves; it keeps all the information needed for scheduling;
  • The slave PEs run the processes allocated by the master;
  • This solution is suited to homogeneous multiprocessors.
• When a master makes a scheduling decision, it may need information from the slaves:
  • In a homogeneous system, the interconnection network is regular and the processors and memories are identical, so the information is received quickly;
  • In a heterogeneous system, gathering it takes longer and more information may be needed.

• Case study: the OS of the TI OMAP
  • The OMAPI standard does not define an OS, and the OMAP platform does not run a single unified OS; each processor runs its own OS;
  • The main unifying structure in OMAP is the DSP Bridge, which handles communication between the DSP and the RISC processor;
  • The bridge is organized as a master/slave structure in which the ARM is the master and the C55x is the slave; this suits most multimedia applications, where the DSP implements certain key functions while the RISC processor runs the higher levels of the application;
  • The DSP Bridge API provides functions to initiate and control DSP tasks, exchange messages with the DSP, transfer data and status to and from the DSP etc.

• Multiprocessor scheduling is NP-complete: if we want to minimize the total execution time on an arbitrary multiprocessor, there is no known way to find the shortest schedule in polynomial time.
• Near-optimal solutions can be obtained through heuristics and approximations.
• Stone's algorithm:
  • It selects the CPU that executes each process and the corresponding time;
  • It considers heterogeneous processors;
  • It gives an exact solution to the two-processor scheduling problem and heuristics for the multiprocessor scheduling problem;
  • The solution is based on an intermodule connection graph, which describes the cost of communication between two processes assigned to different processors; the cost of communication between processes on the same processor is zero;
  • The solution also uses an execution time table, which specifies the execution time of each process on each processor; some processes may not be able to run on both processors.

• The minimum running time is a compromise between the communication cost and the execution cost.
• The scheduling problem is formulated as finding a cutset of a modified version of the intermodule connection graph:
  • Two additional nodes are added to represent the processors; one is the source of the graph (CPU1) and the other is the sink (CPU2);
  • Edges are added from each module node to the source and to the sink:
    • The weight of an edge to the source = the cost of executing that node's module on CPU2 (the sink);
    • The weight of an edge to the sink = the cost of executing that node's module on CPU1 (the source).

• The cutset divides the intermodule connection graph into two sets, with the nodes in each set being assigned to the same processor.
• The weight of a cutset is the cost of the assignment of nodes to the two processors that the cutset defines.
• The allocation that minimizes the total execution time is the minimum-weight cutset, which is found by solving a maximum flow problem on the graph.
• The solution was extended to n processors by generalizing the notion of a cutset:
  • The generalized cutset divides the graph into n disjoint subsets;
  • The processor node was generalized to n different types of distinguished nodes;
  • The heuristic applies the two-processor solution iteratively to find the n-processor assignment.
• In many ESs processes are allocated statically to processors:
  • Generally, bounds on the execution time of the processes can be obtained efficiently;
  • Schedules are more difficult to obtain if there are data dependencies between the processes.
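The two-processor construction described above can be sketched in a few lines. This is a minimal illustration, not Stone's original formulation: the module names, the cost values and the function name `min_cut_assignment` are assumptions made for the example, and the maximum flow is computed with a plain breadth-first augmenting-path search.

```python
from collections import deque

def min_cut_assignment(exec_cost, comm_cost):
    """Two-processor assignment via a minimum cut (illustrative sketch).

    exec_cost: {module: (cost_on_cpu1, cost_on_cpu2)}
    comm_cost: {(module_a, module_b): cost paid if they sit on different CPUs}
    Returns (total_cost, modules_on_cpu1, modules_on_cpu2).
    """
    SRC, SNK = "CPU1", "CPU2"          # source and sink nodes
    cap = {}                            # residual capacities: cap[u][v]

    def add_edge(u, v, c):
        # Undirected edge: the same capacity in both directions.
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v][u] = cap[v].get(u, 0) + c

    for module, (on_cpu1, on_cpu2) in exec_cost.items():
        add_edge(SRC, module, on_cpu2)  # edge to source: cost on CPU2
        add_edge(module, SNK, on_cpu1)  # edge to sink: cost on CPU1
    for (a, b), cost in comm_cost.items():
        add_edge(a, b, cost)

    def reachable_from_source():
        parent = {SRC: None}
        queue = deque([SRC])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if SNK not in parent:
            break                       # no augmenting path left
        path, v = [], SNK
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= bottleneck
            cap[w][u] += bottleneck
        flow += bottleneck

    # Nodes still reachable from the source lie on the CPU1 side of the cut.
    on_cpu1 = sorted(m for m in exec_cost if m in parent)
    on_cpu2 = sorted(m for m in exec_cost if m not in parent)
    return flow, on_cpu1, on_cpu2

# Hypothetical execution time table and intermodule connection graph.
exec_cost = {"A": (2, 9), "B": (8, 3), "C": (4, 5)}
comm_cost = {("A", "B"): 3, ("B", "C"): 5}
print(min_cut_assignment(exec_cost, comm_cost))  # (13, ['A'], ['B', 'C'])
```

In this example the cut costs 13: A runs on CPU1 (2), B and C run on CPU2 (3 + 5), and the A-B communication edge is cut (3). Keeping everything on CPU1 would cost 14.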

• Data dependency is illustrated in the next figure:
  • Processor M1 runs processes P1 and P5, processor M2 runs processes P2 and P4, and processor M3 runs process P3;
  • Data dependencies exist between processes P1 and P2, P4 and P5, and P3 and P4;
  • The completion time of the processes on M2 depends on the behavior of the processes running on all the other processors in the system; for example, the response time of P3 may influence the computation time of P4 and P5, and because these share their processors with other processes (P2 and P1, respectively), those processes may also be influenced.

• The next figure presents a methodology for the distributed implementation of signal processing algorithms:

• The designer provides the task graph, the hardware platform architecture and the design constraints, such as deadlines.
• The designer also provides statistical information about process behavior in the form of probability distribution functions.
• During synthesis, each subgraph of the task graph is treated as an independent channel.
• Time is divided into slots; each process is assigned a time budget and a position in the schedule, called an interval.
• The load threshold estimator determines the schedule for the processes: it estimates the throughput required for each channel, allocates processes to PEs while trying to balance the load, and schedules the intervals to minimize latency.
• To analyze a channel, the estimator breaks it into chains, analyzes each chain individually and then combines the results into an overall estimate for the whole channel.
• Because the load may vary, the system behavior must be validated through simulation.

• Scheduling with dynamic tasks raises several problems:
  • A system that accepts dynamic tasks must decide on the fly whether it can accept such a task; in a multiprocessor, it must also decide which processor executes the task. A PE may refuse an incoming task because it is already occupied;
  • If the task is accepted by a PE that is executing other tasks, the scheduling overhead steals time from those tasks; the longer the scheduling decision takes, the less time remains for execution.
• Scheduling is easier in GPMs because they are homogeneous.
• In EMs, which are generally heterogeneous, not all tasks can run on all PEs:
  • This may simplify some scheduling decisions; for example, a task that runs only on one specific PE will always use that PE;
  • It complicates other decisions; for example, a PE able to run both general-purpose and specialized tasks may accept general-purpose tasks before a specialized one arrives. The OS must then either reserve the PE for specialized tasks, wasting resources, or move tasks on the fly to make room for the specialized task.

• Load balancing is a form of dynamic task allocation in which the tasks come from internal sources.
• To balance processor loads, it must be possible to stop a task running on one PE, save its state and restore that state on a new PE; this procedure is called task (or process) migration.
• Its difficulty depends on the situation:
  • Homogeneous multiprocessor with shared memory: the task's activation record is copied from the old PE to the new one and the task is resumed there;
  • Heterogeneous multiprocessor with shared memory: versions of the code that run on both the old and the new PE must be available. In many cases there may not be a straightforward transformation between the activation record formats of the two PEs, so a simple copy of the activation record may not be possible. Usually, specialized code in the processes saves the task state in memory, so that an explicit transfer of the activation record is not necessary;
  • Multiprocessor without shared memory: if the PEs do not execute tasks in a shared memory, both the program data and the program code must be copied from the old PE to the new one; this is generally expensive.

• A load balancing algorithm for real-time multiprocessors:
  • Each PE in the system owns a so-called buddy list of nodes with which it can share tasks;
  • The buddies can be chosen, for example, according to the communication cost between the processors;
  • The PEs send information about their state to the PEs on their buddy list, so that scheduling decisions can be made quickly;
  • Each PE is in one of the following states: underloaded, medium loaded or fully loaded;
  • When a PE's state changes, it updates all its buddies;
  • Each PE then organizes its buddy list into a preferred list, ordered, for example, by its communication distance to the other PEs;
  • When a PE wants to move a task to another PE, it searches for the first underloaded PE on its preferred list;
  • By constructing the preferred lists properly, it can be ensured that each PE is at the head of the preferred list of no more than one other PE;
  • This reduces the chance that a PE will be overwhelmed by multiple task migration requests.
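The buddy-list mechanism above can be sketched as follows. This is only an illustration under stated assumptions: the class `PE`, the load thresholds (task counts `low` and `high`) and the communication costs are hypothetical, and state updates are modeled as direct method calls rather than messages over an interconnect.

```python
class PE:
    def __init__(self, name, low=2, high=4):
        self.name = name
        self.low, self.high = low, high   # hypothetical thresholds (task counts)
        self.tasks = []
        self.preferred = []               # buddies, nearest first
        self.buddy_state = {}             # buddy name -> last reported state

    def state(self):
        n = len(self.tasks)
        if n < self.low:
            return "underloaded"
        return "medium" if n < self.high else "full"

    def set_buddies(self, buddies_with_cost):
        """buddies_with_cost: list of (pe, communication_cost) pairs."""
        ordered = sorted(buddies_with_cost, key=lambda pair: pair[1])
        self.preferred = [pe for pe, _ in ordered]
        for pe in self.preferred:
            self.buddy_state[pe.name] = pe.state()

    def notify_buddies(self):
        # Called only when this PE's state changes.
        for pe in self.preferred:
            pe.buddy_state[self.name] = self.state()

    def add_task(self, task):
        before = self.state()
        self.tasks.append(task)
        if self.state() != before:
            self.notify_buddies()

    def migrate_one(self):
        """Move one task to the first underloaded PE on the preferred list."""
        for pe in self.preferred:
            if self.buddy_state.get(pe.name) == "underloaded":
                before = self.state()
                task = self.tasks.pop()
                if self.state() != before:
                    self.notify_buddies()
                pe.add_task(task)
                return pe.name
        return None                       # no underloaded buddy known

a, b, c = PE("A"), PE("B"), PE("C")
a.set_buddies([(b, 1), (c, 3)])
b.set_buddies([(a, 1), (c, 2)])
c.set_buddies([(b, 2), (a, 3)])
for t in ["t1", "t2", "t3", "t4"]:
    a.add_task(t)                         # A becomes fully loaded
b.add_task("u1")
b.add_task("u2")                          # B becomes medium loaded
target = a.migrate_one()
print(target)                             # C
```

A skips its nearest buddy B because B last reported a medium load, and migrates the task to C, the first underloaded entry on its preferred list.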

• Services for EMs
  • Are used for building applications;
  • Examples: low-level operations, I/O device handling, interprocessor communication, scheduling etc.
• Middleware: software that provides services for applications in multiprocessors and distributed systems.
• Middleware may provide generic data services, such as data transport among processors that may have different endianness, application-specific services etc.
• Embedded systems include middleware because it:
  • Provides basic services that allow applications to be developed more quickly; those services may be specific to a certain PE or to an I/O device;
  • Simplifies porting applications from one embedded platform to another; middleware standards are particularly useful since the application itself can be moved to any platform that supports the middleware;
  • Ensures that key functions are implemented efficiently and correctly; rather than rely on users to implement all functions directly, a vendor may provide middleware that showcases the features of the platform.
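As a small illustration of a generic data service, a transport layer can fix one byte order on the wire so that processors of different endianness exchange data safely. The message layout and function names below are assumptions made for the example; only Python's standard `struct` module is used.

```python
import struct

# Hypothetical wire format: 16-bit sensor id, 32-bit timestamp, 32-bit float,
# always big-endian (">"), whatever the byte order of the sending CPU.
WIRE_FORMAT = ">HIf"

def pack_sample(sensor_id, timestamp, value):
    return struct.pack(WIRE_FORMAT, sensor_id, timestamp, value)

def unpack_sample(payload):
    return struct.unpack(WIRE_FORMAT, payload)

payload = pack_sample(7, 123456, 1.5)
print(payload.hex())                      # 00070001e2403fc00000
print(unpack_sample(payload))             # (7, 123456, 1.5)

# Reading the same bytes with the wrong byte order corrupts the value:
print(struct.unpack("<H", payload[:2]))   # (1792,)
```

Because both sides use the explicit format string instead of the native layout, the application code never has to know the endianness of the PE it runs on.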

• One of the major differences between middleware and software libraries is that middleware manages resources dynamically:
  • In a uniprocessor, the OS manages the resources and software libraries perform computational tasks based on those allocations;
  • In a multiprocessor, middleware allocates system resources by sending commands to the operating systems on the individual PEs.
• Dynamic allocation relies on the fact that the tasks performed by the system may vary over time:
  • If the resources are allocated statically, the result is an expensive, high-consumption solution;
  • Dynamic allocation uses resources more efficiently and improves the chances of coping with cases in which there are not enough resources to handle all the current requests.
• ESs increasingly employ middleware because they must perform complex tasks whose resource requirements cannot easily be evaluated statically.
• Ex.: the software stack in an ES differs from the stack in a general-purpose computer.

• Standards-based services
  • Several middleware systems have been developed from combinations of standard services: the Internet Protocol, CORBA etc.
• MPI (Message Passing Interface) is a specification of a middleware interface for multiprocessor communication:
  • It provides a set of communication services built on a few communication primitives;
  • MPI does not itself define the setup of the parallel system, meaning the number of nodes, the mapping of processes or data to nodes etc.; this setup is done before MPI starts;
  • An MPI program gives names to the nodes and is able to change the number of nodes used in an application and the allocation of programs to nodes;
  • Basic communication functions provide point-to-point, blocking communication;
  • MPI allows groups of processes to be created, defined either by name or by topology; once created, the groups can perform multicast and broadcast communication;
  • The MPI standard includes 160 functions, with a kernel of only six; all the other functions are implemented in terms of these six primitives.

• System-on-Chip (SoC) services
  • When systems-on-chip appeared, a new type of custom middleware was developed; it relies less on standard services and more on the particular features of the platform;
  • One reason: these systems are often constrained in power and energy, so services must be implemented efficiently even if standardization has to be sacrificed;
  • Another reason: although a system-on-chip may be committed to standard services for communication with the outside, it is not obliged to use standards inside the chip;
  • Another reason: today's SoCs are composed of a relatively small number of processors, so customized middleware is more efficient; if the number of processors increases, industry-standard services will probably become more useful.
• The next figure shows a typical software stack for an embedded SoC multiprocessor:

• The hardware abstraction layer provides a uniform abstraction for devices and other hardware primitives;
• The real-time operating system controls basic system resources, such as process scheduling and memory;
• The interprocess communication layer provides abstract communication services;
• The application-specific libraries provide utilities for operations specific to the application;
• The application code uses these layers to provide the end service or function.

• Middleware and services for the TI OMAP
• The figure shows the layers of software in an OMAP system:
  • The DSP provides a software interface to its functions; the C55x supports a standard called eXpressDSP for describing algorithms, which hides some of the memory and interfacing requirements of algorithms from the application code;
  • The DSP resource manager provides the basic API for the DSP functions; it controls tasks, data streams between the DSP and the CPU, and memory allocation;
  • The DSPBridge is an architecture-specific interface; it provides abstract communication, but only in the special case of a master CPU and a slave DSP. It is an example of a tailored middleware service.

• Quality-of-Service (QoS)
  • QoS for processes means that a process is reliably scheduled periodically with a given amount of computation time;
  • Some scheduling techniques, such as RMS, inherently ensure process-level QoS; for others, additional solutions are needed to provide the required level of QoS.
• QoS can be modeled with three basic concepts:
  • A contract specifies the resources that will be provided; the client may propose a set of parameters, such as the amount of bandwidth or the rate of missed packets, and the server may counter-propose a different set of parameters based on its available resources;
  • A protocol manages the establishment of the contract and its implementation;
  • A scheduler implements the terms of the contract, setting the sizes of buffers, managing bandwidth etc.
• QoS processes must obtain the resources they need, when they need them, in order to meet their deadlines:
  • Resource management algorithms that avoid deadlock and minimize scheduling delays are not sufficient;
  • Reservation methods are one way to ensure that the resources are available when they are needed.
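The contract, protocol and reservation ideas above can be sketched with a toy negotiation. Everything here is hypothetical (the class `ReservationServer`, the `negotiate` helper, the millisecond budgets); it only illustrates a client proposing parameters, a server counter-proposing from its remaining capacity, and a reservation being committed.

```python
class ReservationServer:
    """Hands out CPU time per period; a sketch, not a real scheduler."""

    def __init__(self, capacity_ms):
        self.capacity_ms = capacity_ms
        self.reserved = {}                # client -> reserved ms per period

    def available(self):
        return self.capacity_ms - sum(self.reserved.values())

    def propose(self, requested_ms):
        # Accept the request, or counter-propose whatever capacity is left.
        return min(requested_ms, self.available())

    def commit(self, client, agreed_ms):
        # Reservation: the resource is set aside so it is there when needed.
        if agreed_ms <= 0 or agreed_ms > self.available():
            raise ValueError("contract cannot be honoured")
        self.reserved[client] = agreed_ms

def negotiate(server, client, requested_ms, minimum_ms):
    """Protocol: propose, receive the (counter-)offer, accept it or give up."""
    offer = server.propose(requested_ms)
    if offer < minimum_ms:
        return None                       # no contract established
    server.commit(client, offer)
    return offer

server = ReservationServer(capacity_ms=10)
video = negotiate(server, "video", requested_ms=6, minimum_ms=4)
audio = negotiate(server, "audio", requested_ms=6, minimum_ms=3)
logger = negotiate(server, "logger", requested_ms=2, minimum_ms=1)
print(video, audio, logger)               # 6 4 None
```

The audio client accepts a counter-proposal of 4 ms instead of the 6 ms it asked for, while the logger is refused because no capacity remains.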

• A methodology for generating QoS software that can handle critical communication:
  • The system being controlled, the plant, is under the control of a quality manager and a scheduler; together they form a controller;
  • The controller generates schedules for a given level of quality, evaluates them and chooses one;
  • Next, the controller determines the feasibility of the schedule, given the execution time of the code needed to implement the QoS action;
  • Different schedules can be obtained for different QoS levels.

• Design verification
• Verifying multiprocessor software is harder than verifying uniprocessor software because:
  • The data of interest may be harder to observe and control;
  • It may be harder to drive certain parts of the system into the desired states;
  • Timing effects are harder to generate and test.
• The entire software need not be debugged on the target platform; test and debug platforms are needed for the complete verification, after which the software must be ported to the target system. The characteristics of those platforms must be taken into account.
• Some problems may nevertheless arise from differences in port characteristics, memory mappings, interrupts, real-time aspects etc.
• In ESs, the software must be verified not only for correctness but also against performance, power, energy and size constraints; real-time aspects must be verified as well.
• Multiprocessor simulators are useful for verifying these nonfunctional properties too, but their accuracy must be high (cycle level).

• The CoMET simulator:
  • Simulates EMSs;
  • The processor model is called the virtual processor model (VPM);
  • Part of the VPM is obtained from the application code; this custom model reflects the behavior of the code that can be determined statically;
  • The other part of the VPM covers the parts of the processor that must be modeled dynamically: I/O, memories etc.;
  • The simulator framework contains a backplane that connects the various virtual processor models and other hardware models.
• The MESH simulator:
  • Is a simulator for heterogeneous EMSs;
  • It uses logical and physical time; events may be mapped onto physical time or onto logical sequences;
  • The simulation is done in logical time and then mapped onto physical time;
  • Because events vary in complexity, the simulator works with macro events and micro events; each macro event is a sequence of micro events.


Testing and debugging EMSs and ESs
• Hardware tools
• Multimeter: measures voltage, current and power, and validates inputs and outputs; it is useful for static measurements, so execution must be stopped or single-stepped.
• Oscilloscope: measures signals that vary in time;
  • May have one or several inputs;
  • May store signals, which makes it useful for measuring one-off signals and for tracking signal sequences;
  • Generally offers multiple triggering options, so that external events can be monitored and internal states measured when external events occur.
• Logic analyzer:
  • Basically, it can be used to capture data or events, to measure individual instruction times or the duration of sequences, to establish synchronization points etc.;
  • More complex logic analyzers include disassemblers and decompilers for source-level debugging and performance analysis, as parts of integrated debugging environments.

• Timing instructions: to time an individual instruction, first find the location in the code segment of memory that contains the instruction. Set the logic analyzer to trigger on the opcode and address of that instruction and on the opcode and address of the next instruction, and to trace absolute time. The logic analyzer then displays the time between the fetch of the target instruction and the fetch of the next one; this is the most accurate method for determining instruction execution time.
• Timing code: set the analyzer to trigger on the starting and ending addresses and on the opcode of the first instruction of the cycle, then disable interrupts; the duration of the cycle is displayed. The duration of the whole cycle, of modules within the cycle or of sections of code within a module can be obtained in this way.
• Example:

  Location   Opcode      Instruction
  2356       6300        DPI
  2357       2701 1000   LOAD R1,1000
  2359       1401
  ...
  264B       6301        EPI

• In-circuit emulator (ICE):
  • Emulates the target CPU;
  • It is connected to a PC and to the target system, where it replaces the processor; application programs can be created and verified from the PC. Commands such as single step, fill memory, dump memory and set program counter are available to help debug the programs;
  • The in-circuit emulator plugs into the chip carrier or card slot normally occupied by the CPU;
  • Care is needed when an ICE is used in timing tests because it can introduce delays.

• Development system (DS, starter kit):
  • DSs offer functionality similar to in-circuit emulators but are not directly connected to the target system;
  • A DS is a system based on the same processor as the target system (or one from the same family), with a classical structure: CPU, program and data memories, I/O blocks; it is connected to a PC, usually through a serial interface;
  • Application programs are created on the PC, loaded into the DS and executed in order to be tested;
  • They differ somewhat from the real application programs because of different hardware features, such as port addresses, memory addresses etc.;
  • It is difficult to verify interrupts and, more generally, real-time behavior;
  • In most cases, DSs also include a prototyping area, where the hardware of an application can be built;
  • The DS package also includes software support: assembler, linker, C compiler and debugging tools; debugging commands are issued from the PC.
• DSs and ICEs are more or less versatile, depending on the number of processors, or even processor families, they support.

• Software tools
• Simulator
  • Runs on a PC and models the execution of all the instructions, and consequently of the programs, of a target processor or family of processors; it is less appropriate for critical real-time applications because it introduces delays compared to the real system.
• Monitor
  • Runs on a PC and allows programs for the target processor to be created and edited, loaded into the target and traced during execution; it includes commands through which the behavior of the target system can be monitored; it also helps to program the code memory of a microcontroller.
• Assembler
  • Transforms a source file written in the processor's assembly language into an object file and then into executable or hex files.
• Compiler
  • Transforms a high-level language source file into executable code.

• System integration methodology
  • There are several strategies for performing system integration; they are not mutually exclusive and can be combined.
• Establishing a baseline
  • The first goal in integrating the ES is to ensure that each task is properly scheduled and dispatched; it must be verified that each task runs at its prescribed rate and that the task context is saved and restored;
  • The process whereby all tasks are appropriately scheduled is called cycling;
  • A logic analyzer is useful for verifying cycle rates, by setting the triggers on the starting locations of each of the tasks involved;
  • The application code associated with each task is added only after the system cycles properly.
• Backoff method
  • After the baseline is established, modules are added; the idea is to make only one change at a time;
  • Once a module has been added, the system is tested; if it no longer cycles, a backoff step is taken: the added module is responsible for the failure and must be debugged or patched.

• The method is shown in the next figure:
  • The baseline does not contain application code; it ensures that interrupts are handled properly and that all cycles run at their rates, without interference from the application code;
  • After a section of application code is added, the cycle rates are verified;
  • If an error is detected, it is patched (if possible);
  • If the cycle rates are correct again, more code is added.
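The backoff loop can be sketched as follows. The module names, their per-cycle costs, the 10 ms budget and the helper functions `cycles_ok` and `patch` are all assumptions made for illustration; on a real system the cycling check would come from measurements, for example with a logic analyzer.

```python
def integrate(baseline, modules, cycles_ok, patch):
    """Add modules one at a time; back off when the system stops cycling."""
    system = list(baseline)
    rejected = []
    for module in modules:
        if not cycles_ok(system + [module]):
            fixed = patch(module)             # try to patch the offending module
            if fixed is None or not cycles_ok(system + [fixed]):
                rejected.append(module)       # back off: leave the module out
                continue
            module = fixed
        system.append(module)                 # keep exactly one change
    return system, rejected

BUDGET_MS = 10                                # hypothetical cycle budget

def cycles_ok(system):
    # Stand-in for a real check of the cycle rates.
    return sum(cost for _, cost in system) <= BUDGET_MS

def patch(module):
    # Hypothetical fix: only the "filter" module can be trimmed to 4 ms.
    name, _ = module
    return (name, 4) if name == "filter" else None

system, rejected = integrate(
    [("scheduler", 1)],
    [("sensor", 3), ("filter", 8), ("logger", 3)],
    cycles_ok,
    patch,
)
print(system)    # [('scheduler', 1), ('sensor', 3), ('filter', 4)]
print(rejected)  # [('logger', 3)]
```

Because only one module changes per step, a failure to cycle always points at the module that was just added.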

• Patching
  • Patching is the process of correcting errors directly on the target system by changing memory contents;
  • Patching is used in ESs because it would be too time-consuming to correct the error in the source code, recompile or reassemble, relink and download the code into the target system;
  • Patching requires an expert command of the opcodes of the target system, unless a macroassembly-level patching facility is available;
  • It also requires an accurate memory map, with the contents of all locations, and a method for writing directly to memory locations;
  • Patching is analogous to placing jumper wires on prototype hardware;
  • An in-line patch is one in which the modification fits into the space already occupied by the code:

• An oversized patch is one in which the modification needs extra memory space; the solution is a jump to an unused portion of memory, where the patched code is loaded, followed by a jump back to the instruction that follows the first jump.
• The loading of patches during system integration can be automated using batch files.
• A large number of patches becomes confusing, so a careful record of all patches must be kept.
• Patches must be eliminated from the final system; final testing must be done only on rewritten software, without patches.
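The jump-out and jump-back structure of an oversized patch can be simulated on a toy memory image. The mnemonics `JUMP`, `NOP`, `ADD` and `MUL`, the addresses and both helper functions are hypothetical; real patching works on opcodes in the target's memory map.

```python
def apply_oversized_patch(memory, addr, patch_code, free_addr):
    """Replace the instruction at addr with a jump to free_addr, store the
    patch there and finish it with a jump back to addr + 1."""
    needed = len(patch_code) + 1              # patch plus the jump back
    if any(memory[free_addr + i] != "NOP" for i in range(needed)):
        raise ValueError("patch area is not free")
    memory[addr] = f"JUMP {free_addr}"
    memory[free_addr:free_addr + needed] = patch_code + [f"JUMP {addr + 1}"]

def trace(memory, start, end):
    """List the non-jump instructions executed from start until end is reached."""
    pc, executed = start, []
    while pc != end:
        instr = memory[pc]
        if instr.startswith("JUMP "):
            pc = int(instr.split()[1])
        else:
            executed.append(instr)
            pc += 1
    return executed

# Addresses 0-3 hold code, addresses 4-9 are unused.
memory = ["LOAD R1,0", "ADD R1,1", "STORE R1,x", "HALT"] + ["NOP"] * 6
apply_oversized_patch(memory, 1, ["ADD R1,2", "MUL R1,3"], free_addr=4)
print(memory[1], "|", memory[4:7])
# JUMP 4 | ['ADD R1,2', 'MUL R1,3', 'JUMP 2']
print(trace(memory, 0, 3))
# ['LOAD R1,0', 'ADD R1,2', 'MUL R1,3', 'STORE R1,x']
```

A record of each call to `apply_oversized_patch` (address, patch area, original instruction) is the kind of patch log the text asks for.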

• The software Heisenberg uncertainty principle
  • The Heisenberg uncertainty principle comes from physics and states that the position and the momentum of a particle cannot both be known exactly at the same time; it is expressed by the relationship ∆p · ∆x ≈ h, where ∆p is the uncertainty of momentum, ∆x is the uncertainty of position and h is Planck's constant;
  • By analogy, for software: ∆r · ∆s ≈ H, where ∆r is the uncertainty of the code, ∆s is the uncertainty of the test specifications and H is some constant;
  • The more closely a system is examined, the more likely it is that the examination process affects the system being tested;
  • ESs are affected by this principle because test software affects timing;
  • Software reliability is also affected by the extra test code.

• Consider the following code:

  LOAD R1,0
  LOAD R2,1
  STORE R1,intclr   ; set clear interrupt signal low
  STORE R2,intclr   ; set clear interrupt signal high
  STORE R1,intclr   ; set clear interrupt signal low
  EPI               ; enable interrupt

• It is an ISR in which the interrupt request is cleared and the interrupt system is then immediately enabled, so that spurious interrupts can be detected.
• Suppose that the interrupt request is 4 μs long and that the STORE and LOAD instructions take 0.75 μs each.
• If the code executes immediately after the request arrives, the interrupt system is enabled while the request is still active, and a spurious interrupt is detected.
• Single-stepping through the code masks this problem, since the time between instructions is increased; the test process has introduced an uncertainty.
• Nonintrusive testing should be considered.
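The arithmetic behind this example can be checked with a short sketch. It assumes that the ISR starts exactly when the request arrives and that interrupts become enabled as soon as EPI begins executing (the text does not give a time for EPI); the per-step delay used to model single-stepping is hypothetical.

```python
INSTRUCTION_US = 0.75      # LOAD and STORE execution time from the text
REQUEST_US = 4.0           # duration of the interrupt request

isr = ["LOAD", "LOAD", "STORE", "STORE", "STORE", "EPI"]

def time_until_enable(step_delay_us=0.0):
    """Time from ISR entry until EPI starts (assumed point of enabling)."""
    before_epi = isr.index("EPI")              # 5 instructions precede EPI
    return before_epi * (INSTRUCTION_US + step_delay_us)

def spurious_interrupt(step_delay_us=0.0):
    # The request is still active if interrupts are enabled before it ends.
    return time_until_enable(step_delay_us) < REQUEST_US

print(time_until_enable())                     # 3.75
print(spurious_interrupt())                    # True
print(spurious_interrupt(step_delay_us=1e6))   # False
```

At full speed the five instructions take 5 × 0.75 = 3.75 μs, which is less than the 4 μs request, so the spurious interrupt occurs; with about one second between single steps the request has long since ended, and the bug disappears from view.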
