Advanced Embedded Systems, Lecture 11: Multiprocessors in Embedded Systems (1)
• Embedded multiprocessors: • homogeneous or • heterogeneous; • A multiprocessor is made of: • Processing elements; • Memory blocks; • Interconnection networks;
• Embedded multiprocessors (EMs) vs. typical multiprocessors: • Different types of processing elements (PEs): PEs with different features, programmable and non-programmable PEs; • Memory blocks of different sizes; private and shared memory blocks; • Specialized interconnection networks; • Both must offer high performance, but EMs must also provide: • Real-time performance: scientific multiprocessors improve average performance at the expense of predictability, whereas EMs must offer predictable performance; • Low power and energy: EMs must frequently run at low power and energy levels; low power reduces heating problems and cost, while low energy consumption increases battery life; typical multiprocessors are less sensitive to power and energy consumption; • Cost-effectiveness: EMs must provide high performance without excessive hardware; • Design techniques: • Heterogeneous multiprocessors are more energy efficient and cost-effective than homogeneous multiprocessors; • Heterogeneous memory systems improve real-time performance; • Networks-on-chip;
• The combination of high performance, low power and real time leads toward heterogeneous multiprocessors: • It is desirable to specialize the blocks of an embedded multiprocessor: the processing elements, the memories and the interconnection network; • Specialization leads to lower power consumption; examples of operations that need specialized hardware: • Bit-level operations: on a CPU they require too many registers; • Intensive input/output operations: for example in engine control, where data must be read, processed and written within a tight deadline; • Heterogeneity reduces power consumption because unnecessary hardware is removed: generalizing a function always requires additional hardware; • Drawback: specialization increases communication; • Using multiple CPUs can increase real-time performance: allocating critical processes to separate CPUs helps them meet their deadlines; • Specialized memories and interconnections increase the predictability of a process's response time;
• Embedded multiprocessor design techniques: • Design methodologies; • Modeling and simulation; • Multiprocessor design methodologies:
• The set of programs used to design and evaluate the EM is called the workload (the counterpart of benchmarks for general-purpose computers); • Many such programs were not written for embedded systems (real-time performance, low power, limited memory), and using them may lead to wrong decisions; a workload must be tailored to the EM's requirements through platform-independent optimizations; • Next, platform-independent measurements, such as dynamic instruction count and data access patterns, are performed to guide the choice of architecture; they show how well the workload matches the EM that must be designed; • An initial candidate architecture is defined, its platform-dependent characteristics are measured and the architecture is evaluated; if the platform is not appropriate it is modified and new measurements are made; if it is appropriate, the blocks of the EM are designed; • The software is mapped onto the platform; compilers and libraries can help in this phase, and most of the optimizations here are platform-dependent; operations must be allocated to processing elements, data to memories and communications to the interconnection network;
• Multiprocessor modeling and simulation: • Most multiprocessor simulators are systems of communicating simulators; the component simulators model the PEs, memory elements and interconnection networks, and the multiprocessor simulator itself handles the communication between them; • The multiprocessor simulator can be built using parallel computing techniques: • Each component simulator is a process both in the multiprocessor simulator and in the host CPU's operating system; • The operating system provides the abstraction necessary for multiprocessing: each simulator has its own state, just as each PE in the implementation has its own state; • The simulator uses the host computer's communication mechanisms, such as semaphores and shared memory, to manage the communication between the component simulators; • Simulators for classical multiprocessors assume that all PEs are of the same type; adapting them to heterogeneous multiprocessors requires additional software;
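A minimal sketch of the "system of communicating simulators" idea, under stated assumptions: threads stand in for the host OS processes (for brevity), and a single `queue.Queue` stands in for both the host's IPC mechanism and the interconnect. The functions `pe_simulator`, `memory_simulator` and `run_system` are hypothetical names; each component keeps private state and interacts with the others only through messages.

```python
import queue
import threading

STOP = object()  # shutdown token sent by each PE simulator when it finishes


def pe_simulator(pe_id, work, to_memory):
    """Toy PE model: 'computes' each item (squares it) and issues a memory write."""
    for addr, value in work:
        to_memory.put((pe_id, addr, value * value))
    to_memory.put(STOP)


def memory_simulator(inbox, n_writers, cells):
    """Toy memory model: applies write messages until every PE has stopped."""
    stopped = 0
    while stopped < n_writers:
        msg = inbox.get()
        if msg is STOP:
            stopped += 1
            continue
        _pe_id, addr, value = msg
        cells[addr] = value


def run_system(workloads):
    """Start one simulator per PE plus one memory simulator, and wait for all."""
    to_memory = queue.Queue()  # stands in for the host IPC / interconnect
    cells = {}                 # private state of the memory simulator
    threads = [threading.Thread(target=pe_simulator, args=(i, w, to_memory))
               for i, w in enumerate(workloads)]
    threads.append(threading.Thread(target=memory_simulator,
                                    args=(to_memory, len(workloads), cells)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cells
```

For a heterogeneous multiprocessor, each PE type would get its own simulator function; the communication layer stays the same.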
• Multiprocessor architectures: • physically separate embedded systems, or • embedded systems implemented on the same chip, also known as multiprocessor systems-on-chip (MPSoCs); • Philips Nexperia: an MPSoC for digital video and television applications:
• It includes two processors: a MIPS PR3940 RISC CPU, which runs the real-time operating system, and a TriMedia TM32 VLIW processor for media operations; • It includes a synchronous DRAM that satisfies the requirements of the video memory; the memory controller is connected to the rest of the circuit through a bus; • The MIPS processor is connected to a fast bus, which is connected through a bridge to a slower bus for the low-speed peripherals; the TM32 processor has its own bus; • Various peripherals are implemented on the chip: a USB controller, 3 UARTs, 2 I2C interfaces, digital audio interfaces and general-purpose I/O pins; • The circuit contains special-purpose function units and accelerators for media applications: • an image composition engine, a scaler unit, an MPEG-2 video decoder, two video input processors that can receive the NTSC and PAL broadcast standards, and a drawing engine; • These units improve efficiency by offloading work from the CPUs;
• TI OMAP multiprocessor: • It was designed for mobile multimedia applications: camera phones, portable imaging devices and so forth; • The OMAP architecture conforms to the OMAPI standard, which defines hardware and software interfaces for multimedia multiprocessors; • The fig. shows the overall structure of the OMAP hardware/software architecture; it is based on a RISC processor, an ARM9, and a DSP, a TI C55x; the two processors communicate through shared memory;
• OMAP 5912: • It contains a frame buffer for video as a separate block of memory, distinct from the main data and program memory; the frame buffer is on-chip, while the flash and SDRAM memories are off-chip; • There are 4 hardware mailboxes for multiprocessor communication; two are writable by the ARM9 and two by the C55x; all are readable by either processor; • Each processor has some dedicated I/O devices; there are also some common devices accessible through a peripheral bridge;
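The mailbox access rules above can be modeled in a few lines. This is a behavioral sketch only, not TI's register interface: which mailbox indices belong to which processor, and the choice that a read empties the mailbox, are assumptions made for illustration, and `MailboxBank` is a hypothetical name.

```python
class MailboxBank:
    """4 mailboxes: two writable only by the ARM9, two only by the C55x,
    all readable by either processor (index assignment is hypothetical)."""

    WRITERS = {0: "ARM9", 1: "ARM9", 2: "C55x", 3: "C55x"}
    CPUS = ("ARM9", "C55x")

    def __init__(self):
        self.slots = [None] * 4

    def write(self, cpu, box, message):
        # Enforce the ownership rule: only the designated processor may write.
        if self.WRITERS[box] != cpu:
            raise PermissionError(f"{cpu} may not write mailbox {box}")
        self.slots[box] = message

    def read(self, cpu, box):
        # Either processor may read; in this sketch a read empties the mailbox.
        if cpu not in self.CPUS:
            raise ValueError(f"unknown processor {cpu}")
        message, self.slots[box] = self.slots[box], None
        return message
```

Typical use: the ARM9 posts a command in one of its mailboxes and the C55x reads it, replying through one of its own.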
• The components of an EM are: • processing elements; • memories; • interconnection networks; • The processing elements perform the computations; a PE may run a single process or several processes; an EM frequently uses different kinds of units to implement its PEs: programmable processors, hardwired processors, single-function blocks etc.; • To determine the number and type of PEs, the following design methodology is recommended: • Analyze each application to determine the performance and power requirements of each of its processes; • Choose a processor type for each process, usually from a predetermined set of processor types; • Determine which processes can share a CPU, which gives the required number of PEs; • Software performance analysis can be used to determine how fast a process will run on a particular type of CPU; • Standard CPUs or configurable processors can be used;
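One way to carry out the "which processes can share a CPU" step is a bin-packing heuristic. The sketch below uses first-fit decreasing on per-process utilization; the heuristic, the utilization figures and the `capacity` limit are assumptions (a real design would use a schedulability test rather than a simple sum), and `allocate_pes` is a hypothetical name.

```python
from collections import defaultdict


def allocate_pes(processes, capacity=1.0):
    """First-fit-decreasing grouping of processes onto PEs of their chosen type.

    processes: list of (name, pe_type, utilization), where utilization is the
    fraction of one PE of that type the process needs (from performance analysis).
    capacity: assumed per-PE utilization limit.
    Returns {pe_type: [[names on PE 0], [names on PE 1], ...]}.
    """
    by_type = defaultdict(list)
    for name, pe_type, util in processes:
        if util > capacity:
            raise ValueError(f"{name} does not fit on a single {pe_type}")
        by_type[pe_type].append((util, name))

    allocation = {}
    for pe_type, items in by_type.items():
        pes = []  # each entry: [current load, [process names]]
        for util, name in sorted(items, reverse=True):  # largest first
            for pe in pes:
                if pe[0] + util <= capacity:            # first PE with room
                    pe[0] += util
                    pe[1].append(name)
                    break
            else:
                pes.append([util, [name]])              # open a new PE
        allocation[pe_type] = [names for _load, names in pes]
    return allocation
```

The number of lists returned for each type is the number of PEs of that type the design needs.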
• The memory system is a classical bottleneck in computing: memories are slower than processors and, worse, processor clock rates are increasing much faster than memory cycle times decrease; • Traditional parallel memory systems: • Used in classical multiprocessors; the memories are homogeneous; • Each bank is separately addressable; • With n banks, n accesses can be performed in parallel, giving the peak access rate; this rate is achieved only in particular cases, for example when the banks are accessed in the order 0, 1, 2, 3, 0, 1, 2, 3, …; • In reality, the probability of a sequential access sequence of length k is $p(k) = \lambda(1-\lambda)^{k-1}$, where λ is the probability of a nonsequential memory access (for example a branch);
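The run-length model is a geometric distribution: k − 1 sequential accesses (each with probability 1 − λ) followed by one nonsequential access (probability λ), so the mean run length is 1/λ. The sketch below evaluates it and shows the bank mapping that makes sequential addresses hit banks 0, 1, 2, 3, …; low-order interleaving (`address % n_banks`) is an assumption about how addresses are assigned to banks.

```python
def bank_of(address, n_banks):
    """Assumed low-order interleaving: consecutive words go to consecutive banks."""
    return address % n_banks


def p_run_length(k, lam):
    """Probability that a run of sequential accesses has length exactly k,
    given lam = probability of a nonsequential access: lam * (1 - lam)**(k - 1)."""
    return lam * (1.0 - lam) ** (k - 1)


def mean_run_length(lam):
    """Mean of the geometric distribution above."""
    return 1.0 / lam
```

With λ = 0.2 the average run is 5 accesses, so a system with more than about 5 interleaved banks will rarely sustain its peak rate on such a workload.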
• Heterogeneous memory systems (HMSs) are preferred in EMs but can coexist with homogeneous memory systems; • HMSs improve real-time performance: • Common (shared) memories are good when only functionality matters, and less so when real-time performance and predictability are required; • If a memory block is shared by several PEs, they contend for it: in general, one PE has to wait for another PE to finish its access, and in most cases it is not possible to predict when these conflicts will occur; • Conflicts can be avoided with certainty if only one, or a few, PEs access a memory, that is, if a memory dedicated to those PEs is provided; • HMSs help to reduce power consumption: • One component of the energy consumed by a memory access depends on the size of the memory block (larger blocks have longer access times); • A heterogeneous memory can be built from smaller memory blocks, reducing the access time and thus the power consumption; • The energy per access also depends on the number of ports on the memory block, so reducing the number of units that can access a given part of memory reduces energy consumption;
• Interconnection networks: • connect the PEs to the memories; • Terminology: • Client: a sender or receiver connected to a network; • Port: a client's connection to a network; • Link: a connection between two clients; • Half-duplex and full-duplex: … • Topology: the organization of the links; it determines the properties of the network; • Attributes for evaluating and comparing interconnection networks: • Throughput: the maximum available throughput from one node to another is of interest, as are the variations in data rate over time and their effect on network behavior; • Latency: the time a packet takes to travel from source to destination; when latency varies, the best-case and worst-case latencies are also important; • Energy consumption: a typical measure is the energy required to send one bit through the network; • Area: influences cost and dynamic energy consumption (through the metal area of the wires); the total area is the metal area of the wires plus the silicon area of the transistors;
• The simplest interconnection network is the bus: • small size, low performance, high energy consumption; • To estimate its performance, the bus is assumed to be operated by a master clock; • With one word per bus transaction, the bus throughput is $T = \frac{1}{(1+C)P}$ words/sec., where P is the clock period and C is the number of clock cycles required for transaction overhead (addressing, etc.); • If the bus supports block transfers, the throughput for block transactions of n words is $T_n = \frac{n}{(n+C)P}$ words/sec.; • Most of the energy consumption is dynamic energy consumption, which is determined by the capacitance that must be driven; • The capacitance of a bus has two components: the bus wires and the loads at the clients; with many clients, the load capacitance becomes significant; • The energy consumption may also be high because of the length of the wires; • A bus is not recommended for larger systems because it easily becomes saturated with traffic, so only a small number of PEs can be connected;
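The throughput formulas can be checked numerically. This sketch assumes, as the formulas do, that after the C overhead cycles one data word is transferred per clock cycle; `bus_throughput` is a hypothetical helper name.

```python
def bus_throughput(clock_period_s, overhead_cycles, block_words=1):
    """Words per second for n-word block transactions: n / ((n + C) * P).
    With block_words=1 this reduces to the single-word case 1 / ((1 + C) * P)."""
    n, c, p = block_words, overhead_cycles, clock_period_s
    return n / ((n + c) * p)
```

For example, with P = 10 ns and C = 3, single-word transactions give 25 million words/s, while 16-word blocks give about 84 million words/s, because the overhead is amortized over the whole block.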
• The crossbar is the most complex interconnection network: • It is a fully connected network that provides a path from every input port to every output port; ex. of a 4 x 4 crossbar: • It provides full connectivity for any combination of inputs and outputs; • Broadcast from one input to all outputs and multicast from one input to several selected outputs are possible; • Its disadvantage is its size: n inputs and n outputs require n² switches; however, because the switches are simple and small, crossbars with a moderate number of inputs and outputs (for example 8 x 8 with words of reasonable width) can be built on a modern VLSI chip, whereas a 10000 x 10000 crossbar is not reasonable even for 1-bit-wide words;
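A crossbar can be modeled as an n × n matrix of switch points, which makes both the n² cost and the broadcast/multicast capability explicit. The class below is a hypothetical behavioral model, not a hardware description; it assumes each output may be driven by at most one input at a time.

```python
class Crossbar:
    """n x n crossbar: closed[i][o] is True when input i drives output o."""

    def __init__(self, n):
        self.n = n
        self.closed = [[False] * n for _ in range(n)]

    def connect(self, inp, outputs):
        """Close switches from one input to one or more outputs (multicast/broadcast)."""
        outputs = list(outputs)
        for o in outputs:
            # An output already driven by a different input would be a conflict.
            if any(self.closed[i][o] for i in range(self.n) if i != inp):
                raise ValueError(f"output {o} already driven by another input")
        for o in outputs:
            self.closed[inp][o] = True

    def transfer(self, words_in):
        """words_in[i] is the word on input i; returns the word seen on each output."""
        out = [None] * self.n
        for i in range(self.n):
            for o in range(self.n):
                if self.closed[i][o]:
                    out[o] = words_in[i]
        return out

    def switch_count(self):
        return self.n * self.n  # the n^2 cost of full connectivity
```

Broadcasting is simply `connect(i, range(n))`; multicasting passes a subset of outputs.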
• If the number of inputs is too large for a given crossbar area, the solution is to use buffers; • Queues can be added at the crossbar inputs, with several traffic sources connected to each queue; a queue controller is needed to decide the order in which packets enter the queue and what to do when the queue is full; • Buffers can also be added to the switches; this increases the physical size but also the flexibility of transfers;
• Mesh networks: • Every node is connected to all of its neighbors; • A mesh network is scalable in that a network of dimension n + 1 includes subnetworks that are meshes of dimension n; • The links are short but numerous, providing multiple paths for data; • The length of the shortest path between two nodes is their Manhattan distance: the sum of the absolute differences between the corresponding indexes (coordinates) of the source and destination nodes;
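The sketch below computes the Manhattan distance and one shortest path that achieves it. The path uses dimension-ordered (XY) routing on a 2-D mesh, which is an assumed routing scheme chosen for illustration, not one named in the text; any route that never moves away from the destination has the same length.

```python
def manhattan(src, dst):
    """Minimum hop count between mesh nodes given as coordinate tuples."""
    return sum(abs(s - d) for s, d in zip(src, dst))


def xy_route(src, dst):
    """Dimension-ordered (XY) routing on a 2-D mesh: move along X first, then Y.
    Returns the list of nodes visited, including src and dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

The number of hops in the returned path, `len(path) - 1`, always equals `manhattan(src, dst)`.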
• Application-specific networks (ASNs) are appropriate for embedded systems: • Their topology is matched to the characteristics of the application; • ASNs consume less energy than a regular network of equal overall performance; • Because most embedded applications perform several different tasks simultaneously, different parts of the architecture require different network bandwidth; • Placing bandwidth where it is needed makes the network more efficient without sacrificing performance for the given application; • Routing and flow control determine the cost and the performance of the network: • Routing determines the paths; routing algorithms can be deterministic or adaptive, and they may occasionally drop packets or guarantee packet delivery; common switching techniques are circuit switching, store-and-forward, wormhole and virtual cut-through; • Flow control determines how links and buffers are allocated as packets move through the network;
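To see why the switching technique matters for latency, compare the common zero-load (no contention) latency models: store-and-forward pays the packet serialization time L/b at every hop, while wormhole and virtual cut-through forward the head of the packet as soon as it arrives and pay L/b only once. These textbook-style formulas, the parameter names and the example numbers are assumptions; contention and buffer allocation, where wormhole and virtual cut-through differ, are ignored.

```python
def store_and_forward_latency(hops, packet_bits, link_bps, router_delay_s):
    """Each router receives the whole packet before forwarding it."""
    return hops * (router_delay_s + packet_bits / link_bps)


def cut_through_latency(hops, packet_bits, link_bps, router_delay_s):
    """Wormhole / virtual cut-through, zero-load model: the serialization
    delay L/b is paid once; each hop adds only the router delay."""
    return hops * router_delay_s + packet_bits / link_bps
```

For a 512-bit packet crossing 4 hops on 32 Gbit/s links with a 2 ns router delay, store-and-forward gives 72 ns and cut-through gives 24 ns.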
• Networks-on-chip: • NoCs are the interconnection networks of single-chip multiprocessors; • Example: each switch is connected to its four nearest neighbors by two unidirectional links and to a resource; • In a 60 nm CMOS technology: • a single chip could include a 10 x 10 mesh of switches and resources; • each network link would have 256 data bits plus control signals; • each switch has a queue at each input; • selection logic at the outputs determines the order of the packets;
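The text does not say how the output selection logic orders packets. Round-robin arbitration is one common choice and is assumed here: each output port grants one requesting input queue per cycle, starting after the input it served last, so no queue starves. With four neighbors plus the local resource, a switch has 5 inputs. `RoundRobinArbiter` is a hypothetical name.

```python
class RoundRobinArbiter:
    """Assumed output-port selection logic: one grant per cycle, rotating
    priority that starts just after the last input served."""

    def __init__(self, n_inputs):
        self.n = n_inputs
        self.last = n_inputs - 1  # so input 0 has priority on the first cycle

    def grant(self, requests):
        """requests[i] is True if input queue i has a packet for this output.
        Returns the granted input index, or None if nobody is requesting."""
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i
                return i
        return None
```

If inputs 0, 2 and 4 request continuously, the grants cycle 0, 2, 4, 0, … and every queue is served in turn.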
• Another example is the SPIN network, a scalable network with a fat-tree topology: • This topology offers more bandwidth at higher levels of the tree in order to reduce contention; • The leaf nodes are the processing and memory elements; when a PE sends a message to another, the message travels up the tree until it reaches a common ancestor node, then goes back down; • One advantage of the fat-tree topology is that all routing nodes use the same routing function, which allows the same router to be used throughout the network; • The SPIN network uses two 32-bit data paths, one for each direction, for full-duplex communication; a router can choose any of the equivalent paths available to it at that moment;
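The up-then-down route can be computed from leaf indices alone. In this sketch the tree arity of 4 is an assumption made for illustration, and leaves are numbered left to right. Two leaves share an ancestor at level l exactly when their indices agree after integer division by arity^l, so the message climbs l levels and descends l levels.

```python
def common_ancestor_level(src, dst, arity=4):
    """Level of the lowest common ancestor of two leaves (leaves are level 0)."""
    level = 0
    while src != dst:
        src //= arity
        dst //= arity
        level += 1
    return level


def fat_tree_links(src, dst, arity=4):
    """Links traversed by the up-then-down route (0 if src == dst)."""
    return 2 * common_ancestor_level(src, dst, arity)
```

In a fat tree several parent routers are equivalent on the way up, which is the choice the text says a SPIN router can make adaptively; on the way down the path is fixed by the destination.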
• Design methodologies for NoCs have been developed; ex.: a methodology for designing networks for QoS-intensive applications such as multimedia: • The application requirements are specified; • The performance required from the network is determined; • The topology is determined and the network is configured with PEs and memories; • The network is simulated to evaluate its actual performance; • The network may be modified based on the performance results;
• Physically distributed embedded systems and networks: • Frequently used in cars, airplanes etc.; • These systems are more loosely coupled than multiprocessors; they generally do not share memory; • The application is distributed over the PEs; • The distributed system must provide guaranteed real-time behavior; • Reasons to build network-based embedded systems: • To execute tasks close to where events occur: ex.: engine control may require short response times; • Data reduction: ex.: initial signal processing on the data inputs reduces their volume; allocating these operations to a dedicated processor speeds up the process and reduces the load on the processor that uses the data to make decisions; • Modularity: for easier design and assembly, easier debugging (a verified module can be used to probe components in another part of the network) and fault tolerance; • The design of a distributed embedded system is an example of hardware/software co-design, since the network topology and the software running on the network nodes must be designed together;
• Time-triggered architecture (TTA): • TTA is a distributed architecture for real-time control; it offers reliability for safety-critical systems and accuracy for high-rate physical processes; • It differs from other distributed architectures in that it explicitly takes time into account; • TTA represents time as a 64-bit value: the three lower bytes hold fractions of a second and the five upper bytes hold seconds; • The next fig. presents the communication network interface; it links the communication controller, which is the low-level interface, and the host node, which is the TTA's PE;
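The 64-bit time format splits cleanly with shifts and masks. This sketch assumes the three lower bytes are a binary fraction of a second (units of 2⁻²⁴ s); the helper names are hypothetical, and the float input is only for illustration, since a real controller would count ticks directly.

```python
FRACTION_BITS = 24  # three lower bytes: binary fraction of a second
SECONDS_BITS = 40   # five upper bytes: whole seconds


def encode_tta_time(seconds):
    """Encode a non-negative time in seconds as a 64-bit TTA-style value."""
    ticks = round(seconds * (1 << FRACTION_BITS))
    if not 0 <= ticks < (1 << (SECONDS_BITS + FRACTION_BITS)):
        raise ValueError("time out of range")
    return ticks


def decode_tta_time(value):
    """Split a 64-bit value into (whole seconds, fraction in units of 2**-24 s)."""
    return value >> FRACTION_BITS, value & ((1 << FRACTION_BITS) - 1)
```

Under this assumption the resolution is 2⁻²⁴ s, about 60 ns, and the 40-bit seconds field covers 2⁴⁰ s, roughly 35,000 years.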
• TTA can be implemented on bus and star topologies; • A bus-based system uses replicated buses; the buses are passive, to avoid active components that may fail; • Each physical node consists of a node, two guardians and a bus transceiver; the guardians monitor the transmissions of the node;
• FlexRay: • a second-generation standard for automotive networks; it provides higher bandwidth and more abstract services than CAN; • It is based on TTA; • The next fig. shows a block diagram of a generic FlexRay system: • The host runs applications; • The host communicates with the communication controller, which provides high-level functions, and with the low-level bus driver; • Bus guardians are nodes that monitor the behavior of the network and take action when it is erroneous;
• FlexRay is organized around 5 levels of abstraction: • Physical level: defines the structure of connections; • Interface level: defines the physical connections; • Protocol engine: defines frame formats, communication modes and services such as messages and synchronization; • Controller host interface: provides status, configuration, message and control information to the host layer; • Host layer: runs the applications;
• FlexRay has an active star topology (the router node is active): • A node may be connected to more than one star to provide redundant connections;
• Data is encoded with a differential non-return-to-zero scheme; • The transmission rate is 10 Mbps, independent of the length of the link; bit-level arbitration is not performed, so arbitration does not limit the link's length; • Data is encapsulated in frames with the following fields: • Frame ID: identifies the frame's slot; its value ∈ {0, …, 2047}; • Payload length: the number of 16-bit words in the payload section; • Header CRC: provides error detection for the header; • Cycle count: enumerates the protocol cycles; the protocol engines use it to guide clock synchronization; • Data field: the payload, 0–254 bytes in size; • Trailer CRC: provides error detection for the whole frame;
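The field ranges above are mutually consistent: a 0–254-byte payload counted in 16-bit words gives a payload length of 0–127. The dataclass below checks those constraints. It is a sketch, not an encoder for the real bit layout: the CRCs are not modeled, the 6-bit cycle counter (0–63) is an assumption not stated in the text, and `FlexRayFrame` is a hypothetical name.

```python
from dataclasses import dataclass

MAX_PAYLOAD_BYTES = 254


@dataclass
class FlexRayFrame:
    """Frame fields from the list above; header and trailer CRCs are not modeled."""
    frame_id: int     # slot identifier, 0..2047 (11 bits)
    cycle_count: int  # protocol cycle number (assumed 6 bits: 0..63)
    payload: bytes    # 0..254 bytes, whole 16-bit words

    def __post_init__(self):
        if not 0 <= self.frame_id <= 2047:
            raise ValueError("frame ID must fit in 11 bits")
        if not 0 <= self.cycle_count <= 63:
            raise ValueError("cycle count out of range")
        if len(self.payload) > MAX_PAYLOAD_BYTES or len(self.payload) % 2:
            raise ValueError("payload must be 0..254 bytes in whole 16-bit words")

    @property
    def payload_length(self):
        """Value of the payload length field: number of 16-bit words."""
        return len(self.payload) // 2
```

A full 254-byte payload gives a payload length of 127.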
• There are 2 timing structures: the static segment and the dynamic segment; • The static segment is scheduled using a TDMA discipline: • Static segments are divided into slots of fixed and equal length; all slots are used in every segment, in the same order; • The static segment is split across two channels; synchronization frames are provided on both channels; messages can be sent on either or both channels, with less critical messages sent on only one channel; the slots are occupied by messages in ascending frame ID order; • The dynamic segment: • provides bandwidth for asynchronous, unpredictable communication; its slots are arbitrated using a deterministic mechanism; • has two channels, each of which can have its own message queue;
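Because static slots have fixed, equal length and always appear in the same order, the active slot is a simple function of time. In this sketch, integer microsecond times and 1-based slot numbering (so that slot n carries frame ID n, matching the ascending frame ID order) are assumptions; the function names are hypothetical.

```python
def static_slot_at(t_us, cycle_start_us, slot_len_us, n_static_slots):
    """Static slot number (1-based) active at time t_us, or None if t_us is
    outside the static segment of the cycle starting at cycle_start_us."""
    offset = t_us - cycle_start_us
    if offset < 0 or offset >= slot_len_us * n_static_slots:
        return None
    return offset // slot_len_us + 1


def slot_window(slot, cycle_start_us, slot_len_us):
    """(start, end) times of a static slot within the cycle."""
    start = cycle_start_us + (slot - 1) * slot_len_us
    return start, start + slot_len_us
```

This predictability is what makes the static segment suitable for critical messages: a node's transmission times are known in advance.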
• Because of its complex timing, FlexRay must be started properly: • Operation begins with a wake-up procedure that turns on the nodes; • Then a coldstart initiates the TDMA process; • At least two nodes must be able to perform a coldstart; • FlexRay has a global time source to synchronize messages: • the global time is synthesized from the nodes' clocks by the clock synchronization process, using distributed timekeeping algorithms; • The bus guardians: • prevent the nodes from transmitting outside their schedules; • are recommended, but not mandatory, in a FlexRay system; • send an enable signal to every node in the system they guard; removing the enable signal stops the node's transmission; • use their own clock to watch the bus operation; if a guardian detects a message arriving at the wrong time, it removes the enable signal; • The controller host interface provides services to the host regarding status, control (interrupt service, startup), data (message buffering) and configuration;
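The guardian behavior can be sketched as a schedule check: the guardian keeps its own copy of the static schedule, timestamps each transmission with its own clock, and withdraws a node's enable signal when the transmission falls outside a slot that node owns. The class name, the schedule format (node → owned 1-based slots) and the time origin (start of the static segment) are assumptions; once disabled, a node stays disabled in this sketch.

```python
class BusGuardian:
    """Hypothetical guardian model: one enable signal per guarded node."""

    def __init__(self, schedule, slot_len_us):
        self.schedule = schedule            # node -> list of owned slot numbers (1-based)
        self.slot_len_us = slot_len_us
        self.enabled = {node: True for node in schedule}

    def observe(self, node, t_us):
        """Called when `node` starts transmitting at guardian-local time t_us,
        measured from the start of the static segment. Returns the enable state."""
        slot = t_us // self.slot_len_us + 1
        if slot not in self.schedule[node]:
            self.enabled[node] = False      # remove the enable signal
        return self.enabled[node]
```

A node that transmits in another node's slot is cut off, so a single faulty node cannot corrupt the TDMA schedule of the others.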
• Aircraft networks: • The aircraft domain is somewhat similar to the automotive domain but has more severe requirements: • Weight is a more sensitive parameter than in cars; • Planes need more complex control because they move in 3D; • Most aspects of aircraft design, operation and maintenance are regulated; • Aircraft electronics is divided into 3 categories: • Instrumentation; • Navigation/communication; • Control; • Instrumentation (such as the altimeter or artificial horizon) uses mechanical, pneumatic or hydraulic methods; the electronics has to display the data and send it to other systems; • Navigation/communication is done by radio and is regulated; communication is by voice or data; digital electronics control the radios and display navigation data, such as moving maps that integrate navigation data onto a map; • Control electronics operate the engines and the flight surfaces (such as aileron, elevator, rudder);
• Aircraft generally use different types of networks, such as: • Control networks: perform hard real-time tasks for instrumentation and control; • Management networks: control noncritical devices; they can use nonguaranteed modes, such as Ethernet, to improve average performance and limit weight; • Passenger networks: ex.: Internet service for passengers over a satellite link; these networks are separated from the operational networks by firewalls; • Aircraft data networks are governed by several standards; ex.: ARINC 664: • It is based on Ethernet, providing higher bandwidth than previous aircraft data networks and allowing aircraft manufacturers to use standard network components; • However, basic Ethernet is used together with protocols and architectures that provide the needed real-time performance and reliability; • It divides the aircraft network into 4 domains, with firewalls between them: • the flight deck network for real-time control; • a network for equipment supplied by outside vendors; • a subnetwork for secondary operations, such as inflight entertainment; • the passenger subnetwork, which provides Internet access to passengers.