
Hardware Virtualization-driven Software Task Switching in Reconfigurable Multi-Processor System-on-Chip Architectures



  1. Hardware Virtualization-driven Software Task Switching in Reconfigurable Multi-Processor System-on-Chip Architectures 黃 翔 Dept. of Electrical Engineering National Cheng Kung University Tainan, Taiwan, R.O.C 2012.09.02

  2. Outline • Abstract • Introduction • Virtualization Middleware • Interconnection Design in Virtualization Middleware • Dynamic Mapping by Exploiting Permutation Networks • Classification of Tasks • Scheduling of Task Groups • Energy and Safety Aspects • Application Example and Discussion of Results • Summary and Outlook

  3. Abstract • We exploit a dedicated Virtualization Middleware (VMW) between an array of processors and independent software tasks. • The usually strict and static processor-to-task binding is resolved. • By introducing a dynamically reconfigurable interconnection network based on permutation networks inside this Virtualization Middleware, an easy mapping and scheduling of software task groups may be achieved.

  4. Introduction (1/2) • Recent FPGAs allow the integration of up to several dozens of soft-core processors. • Given these large logic resources, at a first glance, it may appear appropriate to allocate a dedicated processor for each software task in the system in order to gain maximum performance. • However, cost as well as power constraints may render this approach unsuitable for most scenarios. • On the other hand, structuring applications into tasks and letting them all use the same (multi-core) processor may lead to unwanted security-critical situations. • Therefore, an obvious solution is to use several independent processors that may share the burden of executing software tasks.

  5. Introduction (2/2) • Despite making things easier and faster, the employment of many processors in a SoC raises the questions of how to develop, distribute and, last but not least, schedule the software on all these processors. • In the embedded design world, a lack of experience with multi-core SoCs still causes several problems: • Few mature design tools tailored to the needs of designers of multi-core SoCs are available. • Methodologies aimed at designing parallel or at least multi-core architectures often lack a comprehensive design flow down to the hardware layer. • Suitable hardware architectures for embedded multi-processors that natively support an easy mapping and scheduling of software tasks in a SoC are still barely available. • Thus, regarding the last point, a generic architecture is needed in order to realize complex multiprocessor SoCs that do not waste logic resources but still assure both a safe and secure execution of software tasks.

  6. Virtualization Middleware(1/5) • In order to shift the execution of a task from one processor to another, it is not sufficient to just reroute the connections between memories and processors. • When being executed, each task has a context residing inside the executing processor. • This context consists of the program counter address and the content of the processor registers such as the general purpose or status registers. • Therefore, this context must also be considered when shifting the task. • The context of a task is extracted by a Code Injection Logic (CIL) that resides inside the VMW.

  7. Virtualization Middleware(2/5) • Within this module, a dedicated portion of the so-called Virtualization Machine Code of the attached processor is stored. • To preserve the extracted context of a software task, i.e., the internal states of processors, a dedicated memory region inside the VMW is allocated for each task. • This is called the Virtualization Context Memory, as seen in Figure 2.

  8. Virtualization Middleware(3/5) • When a shift of a task execution is triggered, the CIL containing the Virtualization Machine Code is multiplexed onto the instruction interface of the corresponding processor as depicted in Figure 2. • The connection between instruction memory and processor is interrupted. • The CIL then inserts nop-commands in order to empty the five stage pipeline of the processor. • After having computed the last regular instruction of the processor code, the connections from the data memory to the processor are also interrupted.

  9. Virtualization Middleware(4/5) • Now, the Virtualization Context Memory is multiplexed to the data memory interface of the processor. • Furthermore, the PC address register output of the processor is routed to the Context Memory. • After the program counter address of the next instruction to be regularly fetched has been computed by the processor, this address is thereby stored inside the Context Memory. • As the instruction interface of the processor is now connected to the CIL, the dedicated Virtualization Machine Code stored inside the CIL is fetched by the processor. • This code contains instructions which force the processor to dump all of its register contents on its data memory interface. • The task context that is being output by the processor is stored inside the Context Memory.

  10. Virtualization Middleware(5/5) • After stopping the execution of a software task and extracting its context, the connections of the data and instruction memories of a task may be rerouted to some other inactive processor. • To resume software execution the CIL again feeds a dedicated portion of Virtualization Machine Code into the processor. • This code then loads the task context stored inside the Virtualization Context Memory into the register set of the processor. • After having restored the task context, the connections to the data and instruction memories are restored. • An unconditional jump to the previously saved program counter address concludes the shift of the software task and resumes task execution on the new processor instance.
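The save-and-restore sequence described in slides 6-10 can be summarized in a small behavioral sketch. All names here (Processor, Vmw, shift_task) are illustrative stand-ins of my own; the actual middleware operates on bus-level signals and the processor's instruction interface, not on software objects.

```python
# Behavioral sketch of the VMW task-shift procedure (slides 6-10).
# Class and method names are assumptions for illustration only.

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.pc = 0
        self.regs = [0] * 32      # general purpose + status registers

class Vmw:
    def __init__(self):
        self.context_mem = {}     # task id -> Virtualization Context Memory area

    def extract_context(self, task, cpu):
        """CIL: flush the pipeline, then save PC and all registers."""
        # 1) multiplex the CIL onto the instruction interface, insert nops
        # 2) store the next PC address in the Virtualization Context Memory
        # 3) Virtualization Machine Code dumps all register contents
        self.context_mem[task] = {"pc": cpu.pc, "regs": list(cpu.regs)}

    def restore_context(self, task, cpu):
        """Load the saved context into another processor and resume."""
        ctx = self.context_mem[task]
        cpu.regs = list(ctx["regs"])
        cpu.pc = ctx["pc"]        # unconditional jump to the saved PC

    def shift_task(self, task, src, dst):
        self.extract_context(task, src)
        self.restore_context(task, dst)

vmw = Vmw()
p1, p2 = Processor(1), Processor(2)
p1.pc, p1.regs[0] = 0x40, 7
vmw.shift_task("A", p1, p2)
assert (p2.pc, p2.regs[0]) == (0x40, 7)
```

The sketch only captures the ordering of the steps: flush the pipeline, save PC and registers into the Virtualization Context Memory, then replay them into the new processor and jump to the saved PC.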

  11. Interconnection Design In VMW (1/9) • Permutation networks have already been proposed for processor-to-software communication in the past. • They consist of reconfigurable crossbar switches that are connected by a static interconnect. • In the advocated architecture, software tasks are viewed as the inputs and processors as the outputs of a permutation network. • Crossbar switches are small routing elements with two inputs and two outputs each. • They have two configurations as depicted in Figure 3. • In their first configuration, each input is forwarded to its corresponding output. • In their second configuration, the inputs are connected to the outputs in a cross manner. • The configuration of a switch may be changed, i.e., reconfigured, during runtime.
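The two switch configurations can be modeled in a couple of lines; the function name and the boolean configuration flag are illustrative, corresponding to the runtime-reconfigurable configuration bit of Figure 3.

```python
# Minimal model of a 2x2 crossbar switch (Figure 3).
# cross=False: straight configuration; cross=True: crossed configuration.

def crossbar(in0, in1, cross):
    """Straight: (in0, in1) -> (out0, out1); crossed: outputs swapped."""
    return (in1, in0) if cross else (in0, in1)

assert crossbar("A", "B", cross=False) == ("A", "B")
assert crossbar("A", "B", cross=True) == ("B", "A")
```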

  12. Interconnection Design In VMW (2/9) • In order to generate permutation networks, crossbar switches are connected to each other using static interconnects. • The design of these static interconnects determines the type of permutation network employed. • In this work, three types were evaluated. • Butterfly network • Benes network • Max-Min network

  13. Interconnection Design In VMW (3/9) • Butterfly networks offer a relatively small resource consumption and short combinatorial paths. • However, due to their low interconnectivity, some input-output combinations are not feasible. • They are well-suited for scenarios with harsh resource constraints and few dynamic binding configurations. • A Butterfly network example is depicted in Figure 4. • Because of the severe limitations in possible I/O combinations, we did not consider Butterfly Networks for the synthesis results.

  14. Interconnection Design In VMW (4/9)

  15. Interconnection Design In VMW (5/9) • A Benes network appears as a Butterfly network that is doubled and mirrored. • It has a resource consumption twice as high as a Butterfly network with the same number of inputs and outputs, but offers more flexibility. • A Benes network example is depicted in Figure 5. • However, with increasing numbers of inputs and outputs, routing becomes difficult. • Benes networks were discarded in favor of a network allowing for simple routing. • We, therefore, consider Max-Min networks only.

  16. Interconnection Design In VMW (6/9) • Max-Min networks may be seen as a sorting network if each crossbar switch is used as a comparator. • If the value inserted into the first input of a crossbar switch is lower than the value of its second input, then both inputs are routed directly to their corresponding outputs. • In the other case, the crossbar switch is configured to output the inputs in a cross manner. • A Max-Min network example with eight inputs and outputs is depicted in Figure 6. • Applying this rule throughout the complete Max-Min network, all values from the input ports are sorted at the output stage. • Therefore, if every software task connected to an input of this network is assigned the number of the desired processor, then routing the task to the dedicated processor is very easy by applying this sorting-like interconnect.
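The sorting-based routing idea can be sketched as follows. Note the hedge: the exact Max-Min wiring is given in Figure 6 and is not reproduced here; an odd-even transposition comparator network stands in for it, sharing the comparator behavior of the crossbar switches but not the paper's topology.

```python
# Sketch of sorting-based routing: each crossbar switch acts as a
# comparator on (task, destination) pairs. An odd-even transposition
# network stands in for the actual Max-Min topology of Figure 6.

def comparator(a, b):
    # route straight if the first destination is lower or equal,
    # otherwise configure the switch in cross mode
    return (a, b) if a[1] <= b[1] else (b, a)

def route(tasks):                       # tasks: list of (name, dest)
    ports = list(tasks)
    n = len(ports)
    for stage in range(n):              # n stages suffice to sort n ports
        for i in range(stage % 2, n - 1, 2):
            ports[i], ports[i + 1] = comparator(ports[i], ports[i + 1])
    return ports

# Each task is tagged with its desired processor number (0-based here);
# after the network, output port i holds the task bound for processor i.
tasks = [("A", 3), ("B", 0), ("C", 6), ("D", 1),
         ("E", 7), ("F", 2), ("G", 5), ("H", 4)]
assert [t[1] for t in route(tasks)] == list(range(8))
```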

  17. Interconnection Design In VMW (7/9)

  18. Interconnection Design In VMW (8/9) • The advantages of these networks are weakened by the fact that their structure is not balanced. • The number of crossbar switches that have to be passed to route from an input to an output varies for different paths. • This results in varying combinatorial path delays on the chip. • Furthermore, they have the highest resource consumption of all networks discussed in the scope of this work. • Permutation networks usually do not offer full flexibility, i.e., they do not permit all possible input-to-output combinations at the same time. • Thus, some of these combinations cause so-called blockades. • Such a situation occurs if two inputs of a crossbar switch need to use the same output in order to establish their routes. • As visible from Figure 7, this usually is an undesired situation.
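Reduced to a single switch, the blockade condition is simply two inputs competing for one output port. The function and the 0/1 port encoding below are illustrative, not taken from the paper.

```python
# Toy blockade check for one crossbar switch: a blockade occurs when
# both inputs request the same output port (the situation of Figure 7).

def blocked(req0, req1):
    """req0/req1: output port (0 or 1) requested by each input."""
    return req0 == req1

assert blocked(0, 0) is True      # both inputs want output 0: blockade
assert blocked(0, 1) is False     # routable (straight configuration)
assert blocked(1, 0) is False     # routable (crossed configuration)
```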

  19. Interconnection Design In VMW (9/9) • We will demonstrate how to make use of a certain type of blockade in order to schedule tasks that are grouped to share a processor resource.

  20. Dynamic Mapping by Exploiting Permutation Networks(1/5) • Mapping of tasks to processors of the advocated architecture is accomplished by defining a Binding Vector (BV). • BVt denotes which software tasks are assigned to which processor at the point in time t. • A BV contains a set of processors P with elements pi and a set of tasks S with elements sj. • Software tasks may be assigned to processors using the following syntax: BVt = (pa : (sx), pb : (sy), . . . ) • Furthermore, in a BV some software tasks may be selected to form a task group SGi ⊂ S, which is intended to be executed on the same processor: BVt = (pa : (SGx), pb : (sy), . . . ) • Thus, tasks that will share a processor resource, e.g., the elements of SGx, are called a task group.

  21. Dynamic Mapping by Exploiting Permutation Networks(2/5) • An example of a BV for eight tasks being assigned to four processors in a permutation network with eight inputs and outputs each is given in the following assignment: BV1= (1:(A,D,F),3:(B,C),6:(G),7:(E:4,H)) • Tasks A, D, and F have to share processor 1. • Tasks B and C share processor 3. • Processor 6 is exclusively dedicated to task G. • Tasks E and H are then assigned to processor 7. • Task E furthermore features an optional budget value of 4. • It defines how much processing time shall be granted to this task with respect to the other tasks sharing the same processor. • Based on these assignments, the mapping of the tasks to their corresponding processors through the interconnection network is accomplished by Algorithm 1.
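The BV string notation above is regular enough to be machine-readable. The following parser is a hypothetical illustration (the paper does not specify how BVs are entered); the default budget of 1 for tasks without an explicit value is an assumption.

```python
# Hypothetical parser for the Binding Vector notation of the slides,
# e.g. "1:(A,D,F),3:(B,C),6:(G),7:(E:4,H)". The parser itself and the
# default budget of 1 are assumptions for illustration.
import re

def parse_bv(text):
    """Return {processor: [(task, budget), ...]}."""
    bv = {}
    for proc, group in re.findall(r"(\d+):\(([^)]*)\)", text):
        tasks = []
        for entry in group.split(","):
            name, _, budget = entry.partition(":")   # "E:4" -> ("E", 4)
            tasks.append((name, int(budget) if budget else 1))
        bv[int(proc)] = tasks
    return bv

bv1 = parse_bv("1:(A,D,F),3:(B,C),6:(G),7:(E:4,H)")
assert bv1[7] == [("E", 4), ("H", 1)]   # task E carries a budget of 4
assert bv1[6] == [("G", 1)]             # processor 6 runs G exclusively
```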

  22. Dynamic Mapping by Exploiting Permutation Networks(3/5)

  23. Dynamic Mapping by Exploiting Permutation Networks(4/5) • Routing is performed by a routing logic added to the VMW. • Blockades occurring while applying the example BV are shown by highlighting the affected crossbar switches in Figure 6. • Depending on the type and complexity of the interconnection network, performing intensive backtracking after an unwanted blockade has been detected may not be feasible in terms of computing time.

  24. Dynamic Mapping by Exploiting Permutation Networks(5/5) • Therefore, different solutions may be applicable if no solution is found after a given time. • At first, the designer may define and compute a set of BVs right from the start to determine whether they will be routable during system runtime or not. • The designer may also change the assignment of the software tasks or the number of processors employed. • Alternatively, the designer might switch to a permutation network with higher flexibility in order to achieve the desired binding. • In the current implementation, however, changing the interconnection network requires a re-synthesis of the system. • Furthermore, as a fall-back alternative, the old BV may remain active if no routing for the new BV can be found. • During the operation of the system, updated BVs may be entered at any time - either by the user or by a system scheduler instance that has previously resolved detected task dependencies.

  25. Classification of Tasks (1/2) • Within this classification, tasks being executed in embedded systems are assigned to three types. • Tasks of the first type have to run continuously to ensure a correct system execution. • These tasks often have harsh real-time constraints. • Examples are tasks in flight management systems on airplanes or collision-avoidance systems in cars. • Since they are time-critical, scheduling these tasks may be risky and therefore, if possible, they run on a dedicated processor. • Tasks of the second type run periodically. • These tasks usually have no hard timing requirements. • Therefore, they may be scheduled with other non-critical tasks in the system or may even be completely halted for a certain amount of time. • Examples are tasks that periodically read out temperature sensor data, such as in the engine control of a car.

  26. Classification of Tasks (2/2) • Tasks of the third type are characterized by definite completion. • These tasks may perform a calculation, a data transfer, or both, and may terminate thereafter. • Examples are initialization routines for system start-up. • If the dependencies of other tasks that wait for a task of this type to complete have been resolved, then this kind of task may easily be scheduled.

  27. Scheduling of Task Groups (1/8) • Scheduling software tasks sharing a processing resource in the proposed architecture is done in a time division scheme. • The basic quantum of a time division step is a certain integer value in terms of the underlying clock cycle duration. • The optional budget parameter given in the BV determines a multiple of this basic quantum of the processing time. • The higher the budget, the more processing time is granted to a task. • A task group scheduler manages the access of independent task groups to their corresponding processor resource. • The architecture enhanced for this purpose is given in Figure 8.
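The budget-weighted time division scheme can be illustrated with a toy scheduler. This is a sketch under assumptions: the default budget of 1 and the turn-based ordering are inferred from the BV1 example; Algorithm 2 in the paper defines the actual procedure.

```python
# Toy model of the task group scheduler: weighted round robin in which
# a task's budget multiplies the basic time quantum. Default budget of
# 1 and the exact turn ordering are assumptions.

def schedule(group, turns):
    """group: list of (task, budget); return the running task per quantum."""
    timeline = []
    for _ in range(turns):
        for task, budget in group:
            timeline.extend([task] * budget)   # budget x basic quantum
            # a virtualization (context save/restore) runs between tasks
    return timeline

# Processor 7 of BV1: task E has budget 4, task H the default budget
assert schedule([("E", 4), ("H", 1)], turns=2) == \
       ["E", "E", "E", "E", "H", "E", "E", "E", "E", "H"]
```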

  28. Scheduling of Task Groups (2/8) Figure 8: Detailed Virtualization Middleware with Task Group Scheduler, Routing Logic, and Binding Vector Interface.

  29. Scheduling of Task Groups (3/8) • In order to schedule tasks that are assigned to the same processor, the steps denoted in Algorithm 2 are executed. • Based on the budget value, a timer inside the task group scheduler checks whether the task currently running has any processing time left in the current turn. • If this is not the case, then the timer triggers a scheduling event. • For the given BV1, the task group scheduling procedure results in the execution sequence depicted on the left-hand side of Figure 9. • Processors 1 and 3 show the basic time division scheme, whereas on processor 7, the higher budget assigned to task E leads to a longer processing time in each of its turns.

  30. Scheduling of Task Groups (4/8) • Between each task switching, a virtualization procedure is executed.

  31. Scheduling of Task Groups (5/8)

  32. Scheduling of Task Groups (6/8) • The user or a system scheduling instance, which resolves task dependencies and determines which tasks may run in parallel, may change the binding vector at any time. • Algorithm 3 is executed in case that a new BV becomes available.

  33. Scheduling of Task Groups (7/8) • After updating the Max-Min network of Figure 6 with the new BV BV2 = (1:(D:2,E),3:(B:2,G:5),6:(A:3,H),8:(C,F:3)), the task execution sequence generated by the task group scheduler is as given in Figure 9 b). • Note that on processor 7, task E is interrupted by the BV update although it had some granted processing time left. • After the BV update, tasks are now being executed with new time budgets and, partially, on other processors than before. • As the VMW encapsulates the memory controllers of the data and instruction memories, it is possible for the VMW to read out instructions transferred from memories to processors. • By exploiting this feature, a so-called self-scheduling of task groups may be enabled. • The proposed architecture provides dedicated scheduling instructions that trigger scheduling events and may be inserted into the software tasks.

  34. Scheduling of Task Groups (8/8) • Given a task graph, a linear execution order of tasks may be derived. • We assume each task to be of the second or third type. • If those tasks feature the dedicated scheduling instructions, they are able to indicate the end of their current computation. • This triggers the next task of the group via a virtualization event. • By means of this scheme, self-organizing list scheduling is achieved and the main scheduler in the VMW can be skipped completely.
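The self-scheduling idea can be sketched as the VMW watching the instruction stream for the dedicated scheduling instruction. The "YIELD" sentinel and the dict-of-programs model below are assumed stand-ins; the real mechanism inspects instructions on the memory-to-processor interface.

```python
# Sketch of self-scheduling: a dedicated scheduling instruction
# (modeled by the sentinel "YIELD", an assumed encoding) ends a
# task's computation and hands over to the next task of the group.

YIELD = "YIELD"

def self_schedule(programs):
    """programs: {task: [instr, ...]} in the group's linear order."""
    trace = []
    for task, instrs in programs.items():   # linear order from the task graph
        for instr in instrs:
            if instr == YIELD:              # scheduling instruction observed:
                break                       # virtualization event, next task
            trace.append((task, instr))
    return trace

progs = {"init": ["i0", "i1", YIELD], "calc": ["c0", YIELD]}
assert self_schedule(progs) == [("init", "i0"), ("init", "i1"), ("calc", "c0")]
```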

  35. Energy and Safety Aspects (1/3) • Some processors may remain unused depending on the contents of the BV. • For the example BV1 in Figure 6, these are the processors 2, 4, 5, and 8. • Without modification, they would remain in an infinite loop trying to fetch instructions. • This behavior, however, consumes energy. • Therefore, each processor which is currently unused, is automatically deactivated by disabling its clock input. • If shortcomings in energy supply force the system to save energy, each of the processors currently being active may be independently scaled down to one of four clock ratios, which are user-definable.
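The clock-gating decision itself is just set arithmetic over the current BV, as the example above shows: every processor not named in the BV is deactivated. The mapping representation below is illustrative.

```python
# Sketch of the clock-gating decision: any processor not named in the
# current Binding Vector has its clock input disabled. The BV is given
# here as a processor -> tasks mapping; names are illustrative.

def unused_processors(bv, all_procs):
    return sorted(set(all_procs) - set(bv))

bv1 = {1: ["A", "D", "F"], 3: ["B", "C"], 6: ["G"], 7: ["E", "H"]}
# Matches the example BV1: processors 2, 4, 5, and 8 are clock-gated.
assert unused_processors(bv1, range(1, 9)) == [2, 4, 5, 8]
```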

  36. Energy and Safety Aspects (2/3) • If a shortcoming in energy supply is only transient, then a temporary re-scheduling may be applied. • By defining a new binding vector, less time-critical tasks may be grouped together to share a processor. • Consequently, it takes more time for these tasks to be completed. • Some processors will then remain unused and may temporarily be deactivated. • If energy supply recovers to the normal level, then the original BV may be restored and the temporarily unused processors may be reactivated. • Related to the method described above, an update of the BV during a temporary shortcoming in energy supply may also exclude several tasks from being executed. • Alternatively, instead of excluding them from the BV, their budget value may be decreased as well.

  37. Energy and Safety Aspects (3/3) • One of the fundamental safety and security concepts in embedded systems design is never to run security-relevant software tasks together with other software modules on the same resource. • Failures or intentionally inserted exploits in other software parts may lead to unwanted behavior of the software in terms of security. • However, by exploiting the advocated virtualization approach, software tasks are physically separated at all times. • No processor is able to address data from software which is not currently bound to it. • The context information of a task inside the virtualization memory is strictly bound to its task. • Even a harmful task which shares its processor resource with a security-relevant task cannot access any information of that task, because sleeping tasks as well as their corresponding Virtualization Context Memory areas are not linked to any other memory block or processor.

  38. Application Example and Discussion of Results (1/4) • As an example to demonstrate advantages of the proposed virtualization approach, a symmetric encryption and decryption scenario was implemented. • Various software tasks independently encrypt and decrypt data stored in memory using the AES-128 encryption scheme. • Pre-calculated results stored inside the tasks’ data memories are used for comparison in order to determine whether the result of a computation has been corrupted by the virtualization and reconfiguration procedures.

  39. Application Example and Discussion of Results (2/4) • For task group scheduling, the timing overhead is given in Table 1. • For a BV update, timing overhead is marginally larger, as listed in Table 2.

  40. Application Example and Discussion of Results (3/4) • The resource overhead generated by the proposed architecture is given in Table 3. • For comparison reasons, the resource consumption of a MicroBlaze soft-core processor is included. • Mainly due to the rather long paths in the combinatorial permutation network, the resulting maximum clock frequency of this implementation is 41 MHz in this case. • In contrast, without the VMW, a MicroBlaze processor may run at up to 125 MHz.

  41. Application Example and Discussion of Results (4/4) • The proposed approach is able to dynamically set up various software-to-processor bindings and to schedule between independent software task groups. • However, a disadvantage of the interconnection networks as proposed in this paper is the long combinatorial paths between the inputs and outputs of the network. • This lowers the achievable maximum clock frequency of the system. • In order to use a dozen or more processors in the proposed architecture, presumably other interconnection types have to be considered.

  42. Summary and Outlook • The architecture exploits a dedicated virtualization middleware between processors and tasks featuring a dynamically reconfigurable interconnection network that is used for task group scheduling. • The execution of software tasks may be shifted to another processor at any time. • Moreover, aspects that cover the modifications of permutation networks in order to support a number of software tasks that considerably exceeds the number of processors in the system are being evaluated.
