Action Breakout Session
Anil, AP, Nina Bhatti, Charles Berdnall, Joe Hellerstein, Wei Hu, Anthony Joseph, Randy Katz, Li, Machi Mukund Kimmo Raatikanen, Siva
Breakout Goal
• Identify research questions and issues related to adaptive action invocation to enhance the dependability and security of distributed systems
• Customer is the “system administrator,” not the end user
Breakout Process
• Define actions by example
• Discuss cross-layer interaction and coordination
• Distill underlying principles
Key Observations
• Distinguish between control actions (e.g., “slow down”) and data actions (e.g., “drop packets”)
• Distinguish between internal/locally performed actions and actions that affect global behavior
• Control loops operate at multiple levels, regionally and globally
• Performance-related actions are the basic building block
• The control system itself can be the target of an adversarial attack
Working Examples
• Network Storage Service; Media Streaming Service
  • Multiple instances of the service at various places in the network
  • Direct requests to the best available service instance
  • Balance requests among service instances
  • Fall back to an alternative service instance in the face of failure or DoS attack
  • Coordinate measurements on client side and server side to reduce load through admission control and content adaptation
  • Distinguish between server overload and network overload
  • For clients “not in the loop” (heterogeneous clients, adversarial clients), proxy the necessary behavior inside the network
• Network Denial of Service
  • Overload data traffic and starve control traffic
  • Secondary performance effects: session resets, router CPUs driven to high utilization, etc.
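The "direct requests", "balance", and "fall back" bullets above can be sketched as a single selection step. This is only an illustration: the `Instance` fields, the `max_load` threshold, and the choice of round-trip time as the meaning of "best" are assumptions, not something the breakout specified.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool   # hypothetical health signal: False if failed or under attack
    load: float     # fraction of capacity in use, 0.0 to 1.0
    rtt_ms: float   # round-trip time measured from the client side


def pick_instance(instances, max_load=0.8):
    """Return the best available instance, or None if no instance is usable."""
    usable = [i for i in instances if i.healthy]
    if not usable:
        return None
    # Balance: prefer instances below the load threshold.
    # Fall back: if every healthy instance is busy, accept any healthy one.
    preferred = [i for i in usable if i.load < max_load] or usable
    # "Best" is assumed to mean lowest RTT, with load as the tie-breaker.
    return min(preferred, key=lambda i: (i.rtt_ms, i.load))


if __name__ == "__main__":
    fleet = [
        Instance("near", healthy=True, load=0.9, rtt_ms=10.0),
        Instance("far", healthy=True, load=0.3, rtt_ms=40.0),
        Instance("down", healthy=False, load=0.1, rtt_ms=5.0),
    ]
    print(pick_instance(fleet).name)
```

Separating the health filter from the load filter keeps the fallback path explicit: an overloaded but reachable instance is still preferred over none at all.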
Control Theoretic Viewpoint
• Black boxes that are managed by a control system
• Actuation points that can be acted upon to control the system
  • E.g., apply backpressure to clients to slow down request rate (control); degrade content quality (data)
  • E.g., prioritize/reserve bandwidth for control traffic; policy settings are control actions, enforcement of policy is a data action
• Single vs. independent control loops: which is better?
• Theory provides tools for managing “disturbances”
• Note that the control system can itself be the target of attack
• Hellerstein: Action is a change to a configuration
  • E.g., buffer pool size, weights in load balancer
  • E.g., uninstall/reinstall software
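One way to picture backpressure as a control action is a small feedback loop that adjusts the fraction of requests admitted until measured utilization reaches a target. The sketch below assumes an integral controller and a toy plant model (utilization = offered load × admitted fraction); the function names, gain, and setpoint are illustrative choices, not values from the session.

```python
def make_integral_controller(setpoint, gain, initial=1.0, lo=0.0, hi=1.0):
    """Return a step function that nudges an actuator value toward the setpoint."""
    value = initial

    def step(measured):
        nonlocal value
        error = setpoint - measured
        # Integral action: accumulate the error, clamped to the actuator range.
        value = min(hi, max(lo, value + gain * error))
        return value

    return step


def simulate(offered_load=1.5, setpoint=0.8, gain=0.3, steps=50):
    """Toy plant (an assumption): utilization = offered_load * admitted fraction."""
    controller = make_integral_controller(setpoint, gain)
    admit = 1.0
    utilization = offered_load * admit
    for _ in range(steps):
        admit = controller(utilization)      # control action: change a setting
        utilization = offered_load * admit   # effect on the managed black box
    return admit, utilization


if __name__ == "__main__":
    admit, utilization = simulate()
    print(f"admit fraction {admit:.3f}, utilization {utilization:.3f}")
```

In Hellerstein's framing the action here is a configuration change (the admission fraction); the dropping or delaying of individual requests that enforces it would be the data action.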
General Observations
• Causality and Visibility
  • Actions can lead to cascaded actions
  • Can interactions/side effects be modeled/made explicit?
  • Action graph model: probability that a following action will be invoked as the result of a given current action
    • In general, difficult to determine in advance
    • Could it be learned via observe/analyze?
  • Is it feasible to place action points at every potential bottleneck site?
  • Note that routers are badly designed black boxes; extracting their internal state is difficult and time consuming
  • Tradeoff between centralized collection of state that may be “complete” but out-of-date vs. decentralized collection that may be more timely but globally incomplete
  • Principle of containment: first do no harm; local actions are potentially less disastrous than global actions
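The "learned via observe/analyze" question for the action graph can be made concrete with a frequency estimate over logged action sequences. This is a minimal sketch under strong assumptions: each trace is an ordered list of action names from one incident, consecutive entries are treated as cause and effect, and the action names themselves are hypothetical.

```python
from collections import Counter, defaultdict


def learn_action_graph(traces):
    """Estimate P(next action | current action) from observed action sequences."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each (current, following) pair of consecutive actions.
        for current, following in zip(trace, trace[1:]):
            counts[current][following] += 1
    graph = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        graph[current] = {nxt: n / total for nxt, n in followers.items()}
    return graph


if __name__ == "__main__":
    observed = [
        ["throttle_clients", "degrade_content"],
        ["throttle_clients", "reroute"],
        ["throttle_clients", "degrade_content", "reroute"],
    ]
    for action, followers in learn_action_graph(observed).items():
        print(action, followers)
```

The estimate is conditional on some action having followed; it says nothing about how often an action triggers no cascade at all, and adjacency in a log is only a proxy for causality.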
General Observations
• Managing Disturbances
  • Instabilities arise where delays in taking action are introduced
    • Latencies in response
    • Imperfect knowledge of the state
  • Tradeoff in making decisions based on longer intervals spanning more state vs. shorter intervals spanning less state
  • Time intervals adapt … short time to ensure useful work always being done
  • E.g., Disk scheduling in Storage Server
    • You can only do work you are aware of
    • Keep the queues short to achieve best performance
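The claim that delay breeds instability can be seen in a toy simulation: the same feedback rule that settles quickly on fresh measurements oscillates with growing amplitude once it acts on stale ones. The plant model, gain, and delay values below are assumptions chosen for illustration, and actuator clamping is deliberately left out so the raw loop behavior is visible.

```python
from collections import deque


def tail_error(delay, steps=60, gain=0.6, offered_load=1.5, setpoint=0.8):
    """Largest |utilization - setpoint| over the last 10 steps of a toy loop.

    The controller sees the utilization measured `delay` steps ago.
    """
    admit = 1.0
    # Measurements in flight, oldest first, prefilled with the initial state.
    pipeline = deque([offered_load * admit] * (delay + 1), maxlen=delay + 1)
    errors = []
    for _ in range(steps):
        observed = pipeline[0]                   # stale when delay > 0
        admit += gain * (setpoint - observed)    # unclamped integral action
        utilization = offered_load * admit
        pipeline.append(utilization)             # oldest entry drops out
        errors.append(abs(utilization - setpoint))
    return max(errors[-10:])


if __name__ == "__main__":
    for d in (0, 3):
        print(f"delay={d}: tail error {tail_error(d):.3g}")
```

With no delay the error shrinks at every step; with a delay of three steps the identical gain overshoots repeatedly, which is one reason shorter decision intervals and short queues help.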
General Observations
• Predictive actions
  • Waiting too long to detect problem limits ability to respond
  • Characterize workload/response changes as signature of impending system performance failure
  • Response to workload changes: “gradual” vs. cliff degradation
  • E.g., as I/O workload grows, predict increases in response latency
  • E.g., IBM detects changes to slope of activity to trigger resource allocation to manage flash crowds in web server farms
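Slope-based triggering of the kind mentioned in the last bullet can be sketched generically by comparing the trend of the most recent window of samples against the window before it. The slide gives no detail of IBM's method, so this is not a reconstruction of it; the window size, threshold, and function names are assumptions.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample interval)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def detect_slope_change(series, window=5, threshold=3.0):
    """Return the first index where the recent slope exceeds the previous
    window's slope by more than `threshold`, or None if it never does."""
    for end in range(2 * window, len(series) + 1):
        previous = slope(series[end - 2 * window:end - window])
        recent = slope(series[end - window:end])
        if recent - previous > threshold:
            return end - 1   # index of the newest sample in the recent window
    return None


if __name__ == "__main__":
    steady = [100 + 2 * i for i in range(20)]                # requests/s, linear growth
    surge = steady[:10] + [steady[9] + 50 * k for k in range(1, 11)]
    print(detect_slope_change(steady), detect_slope_change(surge))
```

Triggering on a change in slope rather than on an absolute level is what makes the action predictive: resources can be requested while the load is still below the point of cliff degradation.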
General Observations
• Don’t ignore the human decision maker
  • Human operators in the loop
  • Research challenge: visualizing the configuration and state of the system to a human decision maker
  • Higher order configuration and administration tools and frameworks