Conclusions from the European Roadmap on Control of Computing Systems

Karl-Erik Årzén, Anders Robertsson, Dan Henriksson (LTH, Lund University, Sweden); Mikael Johansson, Håkan Hjalmarsson, Karl Henrik Johansson (Royal Institute of Technology, Sweden). FeBiD’06, Vancouver, April 3, 2006

Background: There has recently been considerable research interest, initiated by both academia and industry, in control-based methods for resource management in real-time computing and communication systems. In most cases, the problem concerns the allocation of memory, computing and/or communication resources.

Examples • Performance control of web-servers, • Dynamic resource management in embedded systems, • Traffic control in communication networks, • Transaction management in database servers, • Autonomic computing etc.

eBusiness • Multi-tier systems of Web browsers, business logic and databases • Feedback at various levels • Queue control • IBM, HP, Microsoft, Amazon, … Challenges: • Modeling formalisms (DES, ODEs, queueing theory, …) • Design of software and computing systems for controllability [courtesy J. Hellerstein]

ARTIST2 (www.artist-embedded.org/FP6) • Roadmap: outcome of the ARTIST2 workshop in Lund, Sweden, May 2005 • EU/IST FP6 Network of Excellence • Embedded Systems Design • NSF-supported workshop on ”Future trends in control of computer systems” by Hellerstein, Tilbury & Abdelzaher, May 2005

Roadmap Available for download at http://www.control.lth.se/user/karlerik/roadmap1.pdf Experiment: You have wireless network access – try the server! …or not.

An admission (control) problem

Report from Swed. Emergency Management Agency

How to handle the overload problem? • Overprovisioning (more capacity than needed on average) • Admission control: some requests are denied access, but the server continues to operate • Change of service (e.g., ”sending text-only at high loads”)

Why is control of computing systems interesting? • Multidisciplinary: • Several ”new” challenges • Not covered within one traditional ”research domain” (queueing theory, computer science, systems and control, …) • Need systematic tools for design and analysis • Robustness to ”disturbances” • Better performance • Cost of operating computing systems is rising/dominating (60-90%) [Hellerstein et al., 2005]

Outline • Background & Motivation • Computer systems in a control theoretic framework • Modeling issues • Roadmap: Research challenges in • Control of server systems, • Control of CPU resources, • Feedback scheduling of control systems, • Control of communication networks, • Error control of software systems, • Control middleware. [4.15 pm] Panel – Top Three Challenges in Control of Networks and Systems

Contents of roadmap Six research areas: • Control of server systems, • Control of CPU resources • Feedback scheduling of control systems, • Control of communication networks, • Error control of software systems, • Control middleware. ”…how flexibility, adaptivity, performance and robustness can be achieved in a real-time computing or communication system through the use of control theory”

Modeling Formalisms • Heuristic approach vs. model-based control • ”Inherent robustness” from feedback control • One reason why many ad hoc strategies work • More can be gained (systematic design & analysis) Basic principle: Use simple enough models for design and analysis • The model should capture the essential dynamics and behave similarly to the system for different distributions and load cases.

Modeling Formalisms • Identification • Sampling (SoH), ”noise”, inherent nonlinearities • ”First principles” (conservation, queueing theory) • Computing systems: discrete-event dynamic systems (DEDS) + real-time systems => timed automata or timed Petri nets • Risk of state-space explosion (does not scale with arrival/service rates) • Well suited for safety and blocking properties, but how do they relate to stability and robustness?

Modeling of queueing-systems • Discrete event models • Queue theoretic model (Markov chains etc.) • Flow models (cont. time / average models) • Discrete time models
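To make the flow-model and discrete-time entries concrete, here is a minimal sketch of a discrete-time fluid queue. It assumes the update x(k+1) = max(0, x(k) + h(λ(k) − µ)) with a constant service rate; the function name, the sampling period h and the numbers are illustrative assumptions, not taken from the roadmap.

```python
def simulate_flow_queue(arrival_rates, service_rate, h=1.0, x0=0.0):
    """Discrete-time fluid queue: x[k+1] = max(0, x[k] + h * (lam[k] - mu)).

    arrival_rates: arrival rate lam[k] in each sampling interval (assumed known)
    service_rate:  constant service rate mu (assumption)
    h:             sampling period
    """
    x = x0
    trajectory = [x]
    for lam in arrival_rates:
        # The queue grows with the rate surplus and cannot become negative.
        x = max(0.0, x + h * (lam - service_rate))
        trajectory.append(x)
    return trajectory


# Two intervals of overload followed by three intervals of underload.
queue = simulate_flow_queue([12.0, 12.0, 8.0, 8.0, 8.0], service_rate=10.0)
print(queue)
```

Such an average model ignores the stochastic variation that a Markov-chain or discrete-event model would capture, which is one reason it is mainly trusted at high loads.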

Modeling aspects ”Gain-scheduling” (standard control principle): • ”Choose among different control parameters depending on, e.g., the operating condition”. • A ”good” model structure for the corresponding computing system may change with the workload (e.g., for server systems) • Flow models OK for high loads • DEDS models feasible for low loads • Interpolation between different model structures?! • Transient vs. steady-state behavior
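The gain-scheduling principle can be sketched as a lookup from operating condition to controller parameters. The scheduling variable (utilization), the region boundaries and the gain values below are hypothetical placeholders chosen only for illustration.

```python
def scheduled_gains(utilization, schedule):
    """Return the gains of the first operating region that covers `utilization`.

    schedule: list of (upper_bound, gains) pairs sorted by upper_bound.
    """
    for upper_bound, gains in schedule:
        if utilization <= upper_bound:
            return gains
    # Above the last bound: reuse the gains of the highest region.
    return schedule[-1][1]


# Hypothetical schedule: more cautious gains as the server approaches saturation.
SCHEDULE = [
    (0.5, {"kp": 0.8, "ki": 0.20}),  # low load
    (0.8, {"kp": 0.4, "ki": 0.10}),  # medium load
    (1.0, {"kp": 0.2, "ki": 0.05}),  # high load
]

print(scheduled_gains(0.65, SCHEDULE))
```

Interpolating between neighboring regions instead of switching abruptly would be a natural refinement, in the spirit of the question about interpolation between model structures.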

Actuator Mechanisms • The difference between the service rate, µ, and the arrival rate, λ, determines the delay experienced by the requests. • Enqueue actuators (changing the arrival rate): • Admission control mechanism • Change inter-arrival period of task ”upstream” in multi-tiered system • Dequeue actuators (changing the service rate): • Number of server threads • Quality adaptation • Dynamic voltage scaling
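As a worked illustration of how µ − λ drives the delay, the sketch below assumes an M/M/1 queue, for which the mean time in system is 1/(µ − λ). The M/M/1 assumption and all rates are illustrative; real servers rarely satisfy it exactly.

```python
def mm1_mean_delay(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lam); unbounded if lam >= mu."""
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


baseline = mm1_mean_delay(90.0, 100.0)            # no control action
enqueue_act = mm1_mean_delay(0.8 * 90.0, 100.0)   # admission control admits 80 %
dequeue_act = mm1_mean_delay(90.0, 120.0)         # e.g. more threads raise mu

print(baseline, enqueue_act, dequeue_act)
```

Both actuator types widen the gap µ − λ; the enqueue actuator does so by rejecting work, the dequeue actuator by spending more resources or lowering quality.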

Actuators - Implementation aspects • Crucial not to trigger network resends (lost packets), cf. ”Heracles and the Hydra” [Sha ?] • Gate model: • Call gapping: accept the first u(kh) calls in each control interval • Percent blocking: preserves distribution
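The two gate mechanisms can be contrasted in a few lines. This is a minimal sketch: the function names are hypothetical, and the random thinning used for percent blocking is one common way to realize it, assumed here rather than taken from the slides.

```python
import random


def call_gapping(num_arrivals, u):
    """Accept the first u arrivals of the control interval and reject the rest."""
    return [i < u for i in range(num_arrivals)]


def percent_blocking(num_arrivals, accept_fraction, rng):
    """Accept each arrival independently with probability accept_fraction."""
    return [rng.random() < accept_fraction for _ in range(num_arrivals)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
gapped = call_gapping(5, u=2)
thinned = percent_blocking(5, accept_fraction=0.4, rng=rng)
print(gapped, thinned)
```

Call gapping front-loads the accepted calls within each interval, whereas independent thinning spreads rejections over the whole interval, which is why it leaves the shape of the arrival process intact.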

Example: Highway congestion in LA [Varaiya et al.] Related research areas: similarities/differences of the different domains • Traffic flow control • Manufacturing and supply chains • Communication networks • Power networks with respect to • Where does the congestion appear? • Routing? • Available information (dest.)? • Time/distance matters? • ”Packet dropping”: OK or not? • Control action?

Control of server systems • Temporal control locally at server • Direct or ”indirect” objective (service provider vs. customer) • Queue-management and load balancing • Inherent nonlinearities • Multi-tiered systems including large eCommerce systems

Example: Admission control Objective: • Good transient behavior for traffic changes • Preserve good performance for overload situations Measure of admission • queue length • average time • utilization • CPU load / energy consumption • memory • …

Example: Feedforward + feedback
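One way to combine feedforward and feedback in an admission controller is sketched below. It is an assumption-laden illustration, not the scheme shown on the original slide: a PI controller on queue length (feedback) corrects a nominal admitted rate equal to an estimated service rate, and the measured arrival rate (feedforward) converts that rate into an admission fraction. The class name, gains and rates are all hypothetical.

```python
class AdmissionController:
    """PI feedback on queue length plus feedforward from the measured arrival rate."""

    def __init__(self, kp, ki, queue_ref, service_rate, h):
        self.kp = kp
        self.ki = ki
        self.queue_ref = queue_ref
        self.service_rate = service_rate  # estimated service rate (assumption)
        self.h = h                        # control interval
        self.integral = 0.0

    def admit_fraction(self, queue_length, arrival_rate):
        """Return the fraction of arriving requests to admit in the next interval."""
        if arrival_rate <= 0.0:
            return 1.0
        error = self.queue_ref - queue_length
        # Desired admitted rate: nominal service rate plus PI correction.
        desired_rate = self.service_rate + self.kp * error + self.integral
        # Feedforward: scale by the measured arrival rate.
        fraction = desired_rate / arrival_rate
        saturated = min(1.0, max(0.0, fraction))
        # Simple anti-windup: integrate only while the actuator is not saturated.
        if saturated == fraction:
            self.integral += self.ki * self.h * error
        return saturated


ctrl = AdmissionController(kp=0.5, ki=0.1, queue_ref=20.0, service_rate=100.0, h=1.0)
first = ctrl.admit_fraction(queue_length=20.0, arrival_rate=200.0)
second = ctrl.admit_fraction(queue_length=40.0, arrival_rate=200.0)
third = ctrl.admit_fraction(queue_length=20.0, arrival_rate=50.0)
print(first, second, third)
```

The feedforward term reacts immediately to a traffic change, while the feedback term removes the residual error caused by an inaccurate service-rate estimate.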

Control of server systems • Prediction and state estimation based control • State and actuator constraints • Interesting region: When do the flow models cease to be valid? • Changing models and criteria in different load situations... • Very exciting new results on discrete-event based estimation and control • DE-sampling vs. DT-sampling • control: ratio 1/5, • bandwidth allocation: 1/2

Server systems - Research challenges • Modeling issues (as discussed before) • Control + queueing theory = ? • Event-based control – theory gap • Control objectives • References (load, utilization) • Performance metrics and cost functions • (upcrossing probabilities) • Security, reliability, availability, efficiency… • Design patterns/Control patterns • Software structure + control structure and analysis for software design • Well known in, e.g., process control (ratio control, cascade, mid-ranging, etc.) • When should a queue problem be considered as • an admission problem? • a delay control problem? • Large-scale distributed systems / multi-tier systems • Distributed control, MPC, …

Control of CPU resources • A large number of feedback-based or adaptive global QoS management systems have been proposed. • Early ad hoc schemes of multi-level feedback queue scheduling • Control-theoretical approaches using FC-EDF, EUCON [Stankovic, Lu, Buttazzo, …] [Figure: the FC-EDF scheme (from [Stankovic et al., 1999])]

Control of CPU resources – The challenges and research directions • Multiprocessor systems • Power-aware CPU scheduling • Dynamic Voltage Scaling • Joint optimization problem of minimizing energy while still meeting real-time constraints • Already receives considerable attention from the research community • End-to-end resource management: • Resource management in distributed systems where an activity spans multiple nodes • Hierarchical resource allocation schemes • Cascaded structures with local allocation • Efficient feedback scheduling mechanisms • Scheduling algorithm overhead – online optimization doable?

Feedback scheduling of control tasks Actuation: • Task period h_i Solve two different problems: • Resource regulation • Control the total utilization to avoid overloads • Optimal resource distribution • Assign individual task periods to optimize performance
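A minimal sketch of the resource-regulation part: if the utilization U = Σ C_i / h_i exceeds a setpoint, all task periods are stretched by the same factor. Uniform rescaling is only one simple distribution policy, assumed here for illustration; it does not address the optimal resource distribution problem. Names and numbers are hypothetical.

```python
def utilization(exec_times, periods):
    """Total CPU utilization U = sum(C_i / h_i)."""
    return sum(c / h for c, h in zip(exec_times, periods))


def rescale_periods(exec_times, periods, u_ref):
    """Stretch all periods uniformly so that the utilization does not exceed u_ref."""
    u = utilization(exec_times, periods)
    if u <= u_ref:
        return list(periods)  # no overload: keep the nominal periods
    scale = u / u_ref
    return [h * scale for h in periods]


exec_times = [2.0, 3.0]        # measured execution times C_i (ms)
nominal_periods = [10.0, 10.0]  # nominal periods h_i (ms), U = 0.5
new_periods = rescale_periods(exec_times, nominal_periods, u_ref=0.25)
print(new_periods, utilization(exec_times, new_periods))
```

In a feedback scheduler the execution times would be measured or estimated online, so the rescaling is repeated every time the scheduler runs.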

Example: Dynamic Real-Time Scheduling of Model Predictive Controllers • Based on on-line optimization of a cost function • Convex optimization problem solved in each sample • Iterative anytime algorithm • Result gradually refined up to a certain bound • Attractive control strategy • Straightforward to use for multi-variable processes • Ability to handle constraints • Unattractive real-time properties • High computational demands • Very large variations in execution times [Henriksson et al., 2004]

Example: Feedback scheduling of MPC control tasks. Main idea: A process in stationarity may need fewer resources than a process in a transient phase. Use feedback from the optimization algorithm to determine • for each MPC task, when to terminate the optimization and output the control signal (the optimization may be terminated early and still produce acceptable results), and • which of several ready MPC tasks should be scheduled for execution. [Henriksson et al., 2004]

Current values of the cost functions act as dynamic task priorities • Constitutes an on-line QoS measure for the task • Reflects the relative importance of the tasks • Feedback scheduler distributes the computing resources • Schedules MPC task with highest cost • Invoked after each iteration • Implemented as a separate task
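The priority rule above can be sketched as a loop that, after every optimizer iteration, hands the CPU to the task with the highest current cost. This is an illustrative toy, not the implementation of Henriksson et al.: the `improve` callback stands in for one iteration of each task's optimizer, and the halving behavior is purely hypothetical.

```python
def run_feedback_scheduler(initial_costs, iterations, improve):
    """Repeatedly run one optimizer iteration of the task with the highest cost.

    initial_costs: dict mapping task name to its current cost-function value
    improve:       stand-in for one optimizer iteration; maps old cost to new cost
    """
    costs = dict(initial_costs)
    order = []
    for _ in range(iterations):
        # The current cost acts as a dynamic priority.
        task = max(costs, key=costs.get)
        order.append(task)
        costs[task] = improve(costs[task])
    return order, costs


order, final_costs = run_feedback_scheduler(
    {"A": 8.0, "B": 3.0}, iterations=4, improve=lambda c: c / 2.0
)
print(order, final_costs)
```

Task A, which starts far from its optimum (as a process in a transient would), receives most of the iterations until its cost drops below that of task B.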

Cooperative robot task under resource constraints • Master and slave configuration • Ball and beam application

Problems: • MPC tasks exhibit very large variations in execution time • Traditional scheduling theory not applicable Solutions: • Premature termination of optimization • Dynamic scheduling based on cost functions

The challenges and research directions for feedback scheduling of control tasks include all the challenges and research directions of control of CPU resources. Additionally, the following items are important: • Temporal robustness indices • Formal performance guarantees • It is an open question whether the flexibility implied by feedback scheduling can be combined with formal guarantees

Control of Communication Networks Example: • Feedback control is embedded in the TCP protocol in the form of a sliding window mechanism. • Introduced in the 1980s to solve the congestive failure problems that had brought down the network. • We have not experienced system-wide congestive failures again even though the network has grown by orders of magnitude. • This is a testament to the effectiveness of feedback control in a highly dynamic, decentralized, and fast-changing environment. Remark: [9.00] Robust yet Fragile: Intrinsic Tradeoffs in Layered Architectures
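The feedback loop in TCP congestion control is often summarized as additive increase, multiplicative decrease (AIMD) of the congestion window. The sketch below is a deliberately simplified illustration of that rule only; it omits slow start, fast recovery and timeouts, and the parameter names and default values are assumptions.

```python
def aimd_step(cwnd, loss_detected, increase=1.0, decrease=0.5, min_window=1.0):
    """One round-trip update of the congestion window (in segments)."""
    if loss_detected:
        # Multiplicative decrease: back off quickly when congestion is signaled.
        return max(min_window, cwnd * decrease)
    # Additive increase: probe gently for more bandwidth.
    return cwnd + increase


cwnd = 10.0
trace = [cwnd]
for loss in [False, False, True, False]:
    cwnd = aimd_step(cwnd, loss)
    trace.append(cwnd)
print(trace)
```

Each sender runs this loop using only locally observed loss signals, which is what makes the scheme decentralized.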

Control of Communication Networks • Feedback control mechanisms are fundamental for the separation of communication layers • Gives robustness and allows local optimization and refinements Example • Reliable data transfer over wireless link through suitable feedback control of • transmission power • modulation scheme • channel coding

Research Challenges in Control of Communication Networks • Architectures and model abstractions for network control • Network models suitable for control and observer design • Robustness of large scale and distributed systems • Resource management in wireless networks • Cross-layer adaptation for new services and optimized performance

Cross-layer adaptation for improved performance of cellular and wired networks • Bandwidth variations in the radio link degrade performance due to large end-to-end delay and an improper transport protocol • A proxy between the cellular and wired networks adapts the sending rate to bandwidth variations using available radio link state information [Figure: TCP path from the application server over the Internet and the proxy into the 3G cellular network (3G-GGSN, 3G-SGSN, RNC, BTS) to the terminal; bandwidth variations on the radio link]

Proxy hybrid control law • Controller in proxy regulates sending rate based on • Events generated by bandwidth changes obtained from RNC • Sampled measurements of queue length in RNC [Möller et al., 2005]
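The hybrid structure (event-driven plus sampled inputs) can be sketched as below. This is not the control law of Möller et al.; it is a hypothetical proportional law, assumed only to show how bandwidth-change events and periodic queue samples can feed the same rate computation. All names, units and gains are illustrative.

```python
class ProxyRateController:
    """Hypothetical hybrid law: events set the feedforward term, samples drive feedback."""

    def __init__(self, gain, queue_ref, bandwidth):
        self.gain = gain
        self.queue_ref = queue_ref
        self.bandwidth = bandwidth
        self.queue_length = queue_ref  # assume the queue starts at its reference

    def _rate(self):
        # Send at the reported bandwidth, corrected by the queue-length error.
        correction = self.gain * (self.queue_ref - self.queue_length)
        return max(0.0, self.bandwidth + correction)

    def on_bandwidth_event(self, bandwidth):
        """Called when the RNC reports a bandwidth change (event-driven part)."""
        self.bandwidth = bandwidth
        return self._rate()

    def on_queue_sample(self, queue_length):
        """Called periodically with the measured RNC queue length (sampled part)."""
        self.queue_length = queue_length
        return self._rate()


proxy = ProxyRateController(gain=0.5, queue_ref=10.0, bandwidth=384.0)
rate_after_sample = proxy.on_queue_sample(30.0)
rate_after_event = proxy.on_bandwidth_event(64.0)
print(rate_after_sample, rate_after_event)
```

Reacting to the bandwidth event immediately, rather than waiting for the queue to build up, is what distinguishes this structure from a purely sampled feedback loop.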

Experimental evaluation • Improved time-to-serve-user and link utilization compared to traditional end-to-end protocol • Stability and robustness analysis of new protocol • Ongoing experimental evaluation and testing with [Möller et al., 2005] [Figure: bandwidth utilization, new protocol vs. end-to-end protocol]

Network-aware control architecture • Estimate network state • Delay • Data loss probability • Bandwidth • Adjust controller accordingly [Figure: block diagram with labels: network observer, wireless network, control law, plant observer]

Network-aware controllers Control algorithms to cope with communication imperfections • Control under network delay • Control under data loss • Control under bandwidth limitation • Control under topology constraints Characteristics depend on network technology [Figure: block diagram with labels: network observer, wireless network, control law, plant observer]

Delay estimation • Internet round-trip time (RTT) data are noisy with a piecewise constant average • Complex network dynamics are hard to model • RTT estimation in TCP • Improved estimation through a Kalman filter with hypothesis test (CUSUM filter) [Jacobsson et al., 2004]
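For comparison, the sketch below shows the standard smoothed-RTT filter used in TCP (exponential averaging with gain 1/8, as specified in RFC 6298) next to a simplified change detector. The detector is an assumption-laden stand-in: it uses a running mean with a two-sided CUSUM test and restarts the mean on an alarm, rather than the Kalman filter of Jacobsson et al. The drift and threshold values are hypothetical.

```python
def ewma_rtt(samples, alpha=0.125):
    """Smoothed RTT: srtt <- (1 - alpha) * srtt + alpha * sample."""
    srtt = samples[0]
    estimates = [srtt]
    for r in samples[1:]:
        srtt = (1.0 - alpha) * srtt + alpha * r
        estimates.append(srtt)
    return estimates


def cusum_change_points(samples, drift, threshold):
    """Two-sided CUSUM on the residual from a running mean; restart on alarm."""
    mean, n = samples[0], 1
    g_pos = g_neg = 0.0
    alarms = []
    for k, r in enumerate(samples[1:], start=1):
        residual = r - mean
        g_pos = max(0.0, g_pos + residual - drift)
        g_neg = max(0.0, g_neg - residual - drift)
        if g_pos > threshold or g_neg > threshold:
            alarms.append(k)
            mean, n = r, 1          # restart the estimate at the new level
            g_pos = g_neg = 0.0
        else:
            n += 1
            mean += (r - mean) / n  # running mean within the current segment
    return alarms


rtt_ms = [100.0] * 5 + [150.0] * 5  # piecewise constant average, noise-free
alarms = cusum_change_points(rtt_ms, drift=5.0, threshold=20.0)
smoothed = ewma_rtt([100.0, 108.0])
print(alarms, smoothed)
```

The exponential filter needs many samples to converge after a level shift, whereas a detector that restarts the estimate can follow the jump at once while still averaging out noise between jumps.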

Control middleware Middleware: • A software abstraction layer that mediates the interactions between components or applications • Commonly used in distributed systems to provide communication services • Java-RMI, Microsoft’s .COM, and CORBA… • Networked embedded system applications, • e.g., mobile systems and sensor systems. • GAIA [Román et al., 2002], WSAMI [Issarny et al., 2005], and AURA

Control middleware Research Directions • The most important research item for control middleware is to develop these systems from research prototypes to something that may be used more widely. • Middleware functionality: Still an open question whether the middleware should • be passive, i.e., provide sensing and actuation services that the application can use to itself implement the feedback control, or if it should be • active, i.e., the middleware should be responsible for the actual control loop. Both of these approaches have advantages and disadvantages.

Error control of software systems [L. Sha] • The idea behind error control of software is to use ideas similar to those of feedback control to detect malfunctioning software components and, in that case, fall back on a well-tested core software component that is able to provide the basic application service with guarantees on performance and safety. • Provide techniques and tools that support making the semantic assumptions of each software component explicit and machine checkable.

Simple and reliable core • System remains in recoverable states • SIMPLEX architecture [Sha] • High assurance vs. high performance • Need to stay in recoverable state • Runs in parallel, cf. ”bumpless transfer” • ORTGA [FeBID’06] • Maximum stability region • How to detect conditions for switches? (FDI) • False alarm vs. non-recovery risk of instability
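The switching idea can be sketched as a decision function: the high-performance command is used only if the state it leads to is still recoverable by the simple core controller. This is a minimal illustration under strong assumptions (a known one-step prediction model and an explicit recoverable region); the function and parameter names are hypothetical and not taken from the Simplex literature.

```python
def simplex_select(state, u_high_perf, u_safe, predict, is_recoverable):
    """Return the high-performance command only if it keeps the state recoverable.

    predict:        one-step model, (state, command) -> next state (assumed known)
    is_recoverable: membership test for the recoverable region of the safe controller
    """
    if is_recoverable(predict(state, u_high_perf)):
        return u_high_perf
    return u_safe


# Toy plant: scalar integrator x+ = x + u, recoverable region |x| <= 1.
def predict(x, u):
    return x + u


def is_recoverable(x):
    return abs(x) <= 1.0


accepted = simplex_select(0.5, u_high_perf=0.3, u_safe=-0.1,
                          predict=predict, is_recoverable=is_recoverable)
rejected = simplex_select(0.5, u_high_perf=0.8, u_safe=-0.1,
                          predict=predict, is_recoverable=is_recoverable)
print(accepted, rejected)
```

A larger recoverable region gives the high-performance component more freedom, while a conservative one causes more false alarms, which mirrors the trade-off listed in the last bullet.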


Conclusions • Thank you for your attention! • Questions? • Panel debate
