Self-* Networks of Unmanned Vehicles • CSE 597c, Fall 2006 • Introduction to Self-* Systems • Sept. 14, 2006 • Bhuvan Urgaonkar
Definition • Self-* Systems • A regular expression • Self-tuning, self-configuring, self-healing, self-stabilizing, … • Autonomic computing [IBM] • Inspired by the autonomic nervous system of a living organism • In humans and other vertebrates, the part of the nervous system that regulates the involuntary activity of the heart, intestines, and glands.
Some History • What do you think the first Self-* system was? • Wind/water mill? • Emergence of (semi-) autonomous systems starting with the industrial revolution • Steam engine, printing press, car, … • Could carry out certain tasks without human intervention • Development of feedback-control theory, signal processing • Thermostat (Albert Butz of the Thermo-Electric Regulator Co., Minneapolis, 1885) • Cruise control • Early 20th century onwards • Major advances in engineering & emergence of computing • Now you could program a mechanical/electrical/… system • More complex autonomous systems
More History • Artificial Intelligence • Make a machine/computer do what a (smart/able) human can do • Learn like a human does • Sometimes easy, very often not! • Turing Test • A computer that can pose as a human passes the Turing test • A definition of Self-* ness? • Would imitating human behavior alone be enough?
Complexity of Modern Systems • Computer systems grew in complexity • Others as well, but let's talk about CS • NYTimes: All science is computer science • Complex h/w, s/w, distributed systems, heterogeneity, … • Can no longer be managed by handing an operator a manual, as in the WW II era • IBM's DB2 database server has about 80 parameters! • Modern systems operate in highly dynamic conditions • Human-intervention-based operation often infeasible • Error-prone • Slow • Expensive • …
Operating Environments that Prohibit Human Participation • Robots or machines operating in mines, under oceans, volcanic areas, … • Must “take care” of themselves
Defining Self-* ness • The Turing test doesn’t quite capture Self-* ness • Sometimes we want better than what even the smartest/fastest human can do! • Not quite the same as the original AI goal • And not a superset of it • Some intersection, but also some orthogonal requirements
Outline • Motivation and history • Examples • Self-* networks/distributed systems • Relevant areas/useful techniques • Summary
Example 1: General-purpose Operating Systems • CPU scheduling and memory management • The first computers did batch processing of jobs • A human would schedule the jobs • Then multiprogramming emerged • Dynamically changing set of processes • Interleaving of computation and I/O • Response-time-sensitive processes such as editors • The CPU scheduler had to adapt to these dynamics • Self-tuning behavior was desired • Same for the memory manager
Self-tuning Systems • Keep output within desired bounds even when the external environment is changing • [Diagram: external environment and inputs feed the system components, which produce the system output (e.g., performance); a feedback path carries the output back into the system]
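As a concrete illustration of the loop above, here is a minimal Python sketch (not from the lecture; the knob name, the load model, and the response-time model are all illustrative assumptions): the controller compares the measured output to a target and adjusts an internal "capacity" knob while the external load keeps changing.

```python
# Minimal sketch of a self-tuning feedback loop (hypothetical names and models).
import random


def offered_load() -> float:
    """Stand-in for the changing external environment."""
    return random.uniform(1.0, 10.0)


def measured_output(capacity: int, load: float) -> float:
    """Stand-in measurement: response time grows with load, shrinks with capacity."""
    return load / max(capacity, 1)


def self_tuning_loop(target: float = 1.0, steps: int = 20) -> int:
    capacity = 1                          # internal state the system may change
    for _ in range(steps):
        load = offered_load()             # external input
        output = measured_output(capacity, load)
        error = output - target           # feedback: distance from the goal
        if error > 0:
            capacity += 1                 # output too high -> add capacity
        elif error < -0.5 * target and capacity > 1:
            capacity -= 1                 # comfortably below target -> shed capacity
    return capacity


print("final capacity:", self_tuning_loop())
```

The point of the sketch is only the structure: measure the output, compare against the goal, and act on internal state; the external environment is never under the system's control.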
Example 2: Mission-critical Operating Systems • OSes running on spacecraft • The system had to discover errors and recover on its own • Self-healing systems • Initial/simple solutions: high degree of redundancy • Introduce redundancy to deal with failures • Implement mechanisms to quickly discover failures • OK for a spacecraft, but not for a more “down-to-earth” system • Could be very expensive • How can a system self-heal without excessive redundancy? • Later: software became very complex • S/w failures became a far more serious problem than h/w failures! • Software engineering, programming languages
Self-healing Systems • Keep output within reasonable bounds even when internal components fail • What’s different from a self-tuning system? • Failures are internal events; changes in operating environment are external events • Note: Failures might be induced by external events • [Diagram: the same feedback loop, now with a component failure inside the system affecting the system output (e.g., performance)]
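A minimal sketch of the redundancy-plus-detection idea from the previous slide, with an assumed failure model and hypothetical names (this is not any real system's API): failed replicas are detected and replaced so the target level of redundancy is restored without human intervention.

```python
# Minimal self-healing sketch: monitor a replica pool, detect failures, respawn.
import random


class ReplicaPool:
    def __init__(self, target_size: int):
        self.target_size = target_size
        self.next_id = 0
        self.replicas = [self._spawn() for _ in range(target_size)]

    def _spawn(self) -> dict:
        replica = {"id": self.next_id, "alive": True}
        self.next_id += 1
        return replica

    def detect_failures(self) -> None:
        # Stand-in failure model: internal components fail with small probability.
        for r in self.replicas:
            if random.random() < 0.1:
                r["alive"] = False

    def heal(self) -> None:
        # Drop dead replicas and spawn replacements up to the target size.
        self.replicas = [r for r in self.replicas if r["alive"]]
        while len(self.replicas) < self.target_size:
            self.replicas.append(self._spawn())


pool = ReplicaPool(target_size=5)
for _ in range(10):
    pool.detect_failures()
    pool.heal()          # failures are handled internally, no human in the loop
print("replicas alive:", len(pool.replicas))
```

Note how this folds into the earlier feedback picture: the failure is an internal event, and the "controller" restores the invariant (target_size live replicas) rather than chasing an external input.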
Self-Stabilization • [Diagram: state space with good (green) and bad (blue) states] • Guaranteed to return to a good state, eventually, on its own • Related to fault tolerance • How?
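The canonical worked example of this property is Dijkstra's K-state token ring; the sketch below is an addition (not part of the slides). The ring starts in an arbitrary, possibly bad state and, after finitely many moves by privileged nodes, converges to a good state in which exactly one node holds the privilege.

```python
# Dijkstra's K-state self-stabilizing token ring (classic example).
import random

N = 5          # number of nodes in the ring
K = N + 1      # states per node; K >= N guarantees stabilization
x = [random.randrange(K) for _ in range(N)]   # arbitrary (possibly bad) start


def privileged(i: int) -> bool:
    if i == 0:
        return x[0] == x[N - 1]       # node 0 is privileged when equal to its left neighbor
    return x[i] != x[i - 1]           # other nodes are privileged when they differ


def step(i: int) -> None:
    if i == 0:
        x[0] = (x[0] + 1) % K         # node 0 moves to a fresh value
    else:
        x[i] = x[i - 1]               # other nodes copy their left neighbor


for _ in range(200):                  # fair scheduler: pick some privileged node
    candidates = [i for i in range(N) if privileged(i)]
    step(random.choice(candidates))   # at least one node is always privileged

assert sum(privileged(i) for i in range(N)) == 1   # converged: exactly one token
```

No matter how the array x is initialized, the ring ends up with a single circulating privilege, which is exactly the "return to a good state, eventually, on its own" guarantee.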
Classification of Self-* Systems • Self-tuning • Performance • Self-healing • Failure handling • Self-stabilizing • Convergence • Is this a good classification? • Note: Not necessarily a non-intersecting classification
Defining Self-* ness (contd.) • First, define it for each member of our classification
Quantifying Self-tunability • How good is the system at meeting performance targets under dynamic operating conditions? • E.g., Can the system ensure that response-time degradation is always at most proportional to the increase in the request arrival rate? • Note: The system can change its internal state (e.g., increase its capacity dynamically) to achieve its goal
Quantifying the Goodness of a Self-healing System • How good is the system at maintaining functionality under failures? • E.g. 1, Can the system continue functioning even after N failures? • E.g. 2, Can the system continue to offer the same response time even after N failures?
Quantifying the Goodness of a Self-stabilizing System • How long does it take the system to return to a good state after a perturbation?
Defining Self-* ness (contd.) • One approach: Define a vector whose individual elements characterize self-tunability, goodness of self-healing, and self-stabilization • E.g., <ST=excellent, SH=poor, SS=good> • Conflicting goals! • E.g., maintaining performance might require fewer components; dealing with failures might require redundancy • Need to understand what is more important • Context dependent • The relative importance of the various self-* properties varies across systems
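One possible (entirely hypothetical) way to package the three quantification questions above into such a vector; the metric definitions and thresholds below are illustrative assumptions, not definitions from the course.

```python
# Hypothetical self-* vector: raw metrics mapped to coarse grades.
from dataclasses import dataclass


@dataclass
class SelfStarVector:
    self_tuning: float       # e.g., ratio of response-time growth to load growth (lower is better)
    self_healing: float      # e.g., fraction of injected failures survived without SLA violation
    self_stabilizing: float  # e.g., seconds to return to a good state after a perturbation


def grade(v: SelfStarVector) -> dict:
    """Map raw metrics to grades, mirroring <ST=..., SH=..., SS=...>."""
    return {
        "ST": "excellent" if v.self_tuning <= 1.0 else "poor",
        "SH": "good" if v.self_healing >= 0.9 else "poor",
        "SS": "good" if v.self_stabilizing <= 10.0 else "poor",
    }


print(grade(SelfStarVector(self_tuning=0.8, self_healing=0.5, self_stabilizing=4.0)))
# -> {'ST': 'excellent', 'SH': 'poor', 'SS': 'good'}
```

The conflict noted above shows up directly here: improving SH (more redundancy) can worsen the cost or performance terms behind ST, so the weights placed on each element are context dependent.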
Outline • Motivation and history • Examples • Self-* networks/distributed systems • Relevant areas/useful techniques • Summary
Distributed Systems • How do things change? • Cons: Problems associated with a distributed system • Data consistency • Larger communication delays • Heterogeneity • More failures, more kinds of failures • … • Pros: • More sources of redundancy might mean better self-healing • More resources might mean more options to self-tune • Any more?
Example 3: Networking: TCP/IP • Simple AIMD-based congestion control • Decentralized, only at the end points • Has worked pretty well! • Scaled to the current Internet • I consider TCP a good Self-tuning protocol • What about link failures and how IP handles them?
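A minimal sketch of the AIMD rule (illustrative only, not TCP's full state machine): the congestion window grows additively while transfers succeed and is cut multiplicatively on loss, so each end point tunes itself to the available bandwidth using purely local feedback.

```python
# Additive-increase / multiplicative-decrease (AIMD) window update.
def aimd_update(cwnd: float, loss_detected: bool,
                increase: float = 1.0, decrease: float = 0.5) -> float:
    """Return the next congestion window (in segments)."""
    if loss_detected:
        return max(1.0, cwnd * decrease)   # multiplicative decrease on loss
    return cwnd + increase                 # additive increase per RTT


# Example: window trajectory over a few round trips with one loss event.
cwnd = 1.0
for rtt, loss in enumerate([False, False, False, True, False, False]):
    cwnd = aimd_update(cwnd, loss)
    print(f"RTT {rtt}: cwnd = {cwnd}")
```

The self-tuning quality comes from the asymmetry: gentle probing upward, sharp backoff on congestion, with no central coordinator anywhere in the network.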
Example 4: Enterprise/Utility Computing • Varying workloads, complex applications • Human management infeasible, error-prone • How to manage resources to maximize revenue while meeting client requirements?
Example 5: Search Engine: Google • Web content highly dynamic • Self-tuning: • How good is the search engine at keeping up with changes in Web content? • Self-healing: • Thousands of servers and disks in their data center, failures every few hours! • Does google.com keep working despite these failures? How much human intervention does this need?
Outline • Motivation and history • Examples • Self-* networks/distributed systems • Relevant areas/useful techniques • Summary
Relevant areas/useful techniques • Multi-criteria optimization techniques (economics) • Analytical modeling (e.g., to infer the resource needs of an app) • Measurement techniques • Feedback-control theory (reactive) • Statistical techniques for prediction, learning (reactive + proactive) • Biological, ecological, social networks • How do termites with pinhead-sized brains build air-conditioned colonies? • Theoretical CS: online algorithms, approximation algorithms • Distributed computing • Systems issues • Efficient & bug-free software, prototyping, simulation, experiment design
Outline • Motivation and history • Examples • Self-* networks/distributed systems • Relevant areas/useful techniques • Summary
Summary: Key Principles • Keep it simple, silly! • Occam’s razor • E.g., partial automation vs. complete automation • Understand and define system goals clearly • Which Self-* properties are essential, which are not? • Understand system properties, operating environments • One size may not fit all • Measurements • Prediction, classification, learning, feedback control • Design for agility (assuming online operation) • Efficient algorithms & systems mechanisms