
Understanding and dealing with operator mistakes in Internet services

K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University. OSDI 2003. Vivo Project, http://vivo.cs.rutgers.edu (based on slides from the authors’ OSDI presentation).



  1. Understanding and dealing with operator mistakes in Internet services
     K. Nagaraja, F. Oliveira, R. Bianchini, R. Martin, T. Nguyen, Rutgers University
     OSDI 2003
     Vivo Project, http://vivo.cs.rutgers.edu
     (based on slides from the authors’ OSDI presentation)
     Fabián E. Bustamante, Winter 2006

  2. Motivation
     • Internet services are ubiquitous, e.g., Google, Yahoo!, eBay
     • Users expect 24x7 availability, but service outages still happen!
     • A significant number of outages in Internet services are the result of operator actions:
       1. The architecture is complex
       2. Systems are constantly evolving
       3. Lack of tools for operators to reason about the impact of their actions: offline testing, emulation, simulation
     • Very little detail is available on operator mistakes
       • Details are strongly guarded by companies and administrators
     CS 395/495 Autonomic Computing Systems, EECS, Northwestern University

  3. This work
     • Understanding: gather detailed data on operators’ mistakes
       • What categories of mistakes?
       • What is the impact on the service?
       • How do mistakes correlate with experience and impact?
       • Caveat: this is not a complete study of operator behavior
     • Approaches to dealing with operator mistakes: prevention, recovery, automation
     • Validation: allow operators to evaluate the correctness of their actions before exposing them to the service
       • Like offline testing, but with:
         • A virtual environment (an extension of the online environment)
         • A real workload
         • Migration back and forth with minimal operator involvement

  4. Contributions
     • Detailed information on operator tasks and mistakes
       • 43 experiments yielding detailed data on operator behavior, including 42 mistakes
       • 64% immediately degraded throughput
       • 57% were software configuration mistakes
       • Human experiments are possible and valuable!
     • Designed and prototyped a validation infrastructure
       • Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
       • 2 techniques that allow operators to validate their actions
     • Demonstrated that validation is a promising technique for reducing the impact of operator mistakes
       • 66% of all mistakes observed in the operator study were caught
       • 6 of 9 mistakes were caught in live operator experiments with validation
       • Successfully tested with synthetically injected mistakes

  5. Talk outline
     • Approach and contributions
     • Operator study: understanding the mistakes
       • Representative environment
       • Choice of human subjects and experiments
       • Results
     • Validation: preventing exposure of mistakes
     • Conclusion and future work

  6. Multi-tiered Internet services
     • Online auction service, similar to eBay
     • Client emulator exercises the service
     • Tier 1: Web servers
     • Tier 2: application servers
     • Tier 3: database
     • Code from the DynaServer project!

  7. Tasks, operators & training
     • Tasks: two categories
       • Scheduled maintenance tasks (proactive), e.g., a software upgrade
       • Diagnose-and-repair tasks (reactive), e.g., a disk failure
     • Operator composition
       • 14 computer science graduate students
       • 5 professional programmers (Ask Jeeves)
       • 2 sysadmins from our department
     • Categorization of operators, based on a filled-in questionnaire
       • 11 novices: some familiarity with the setup
       • 5 intermediates: experience with a similar service
       • 5 experts: in charge of a service requiring high uptime
     • Operator training
       • Novice operators were given warm-up tasks
       • Material describing the service, and detailed steps for the tasks

  8. Experimental setup
     • Service
       • 3-tier auction service and client emulator from Rice University’s DynaServer project
       • Loaded at 35% of capacity
     • Machines
       • 2 Web servers (Apache)
       • 5 application servers (Tomcat)
       • 1 database machine (MySQL)
     • Operator assistance & data capture
       • Monitoring of service throughput
       • Modified bash shell for command and result traces
       • Manual observation
         • Noting anomalies in operator behavior
         • Bailing out ‘lost’ operators

  9. Example trace
     • Task: add an application server
     • Mistake: Apache misconfiguration
     • Impact: degraded throughput
     [Figure: trace annotated with three events: first Apache misconfigured and restarted; second Apache misconfigured and restarted; application server added]

  10. Sampling of other mistakes
      • Adding a new application server
        • Omission of the new application server from the backend member list
        • Syntax errors, duplicate entries, wrong hostnames
        • Launching the wrong version of the software
      • Migrating the database for a performance upgrade
        • Incorrect privileges for accessing the database
        • Security vulnerability
        • Database installed on the wrong disk
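
Several of the mistakes listed for adding an application server (omission from the member list, syntax errors, duplicate entries, wrong hostnames) are mechanical enough that a small checker could flag them. The following is only a sketch under stated assumptions: the `host:port` line format, the function name `check_member_list`, and the `expected_hosts` set are hypothetical and are not taken from the study or from any Apache or Tomcat configuration format.

```python
import re

# Hypothetical entry format: one "host:port" per line (an assumption for illustration).
ENTRY_RE = re.compile(r"^(?P<host>[A-Za-z0-9.-]+):(?P<port>\d{1,5})$")


def check_member_list(lines, expected_hosts):
    """Return a list of human-readable problems found in a backend member list."""
    problems = []
    seen = set()
    for lineno, raw in enumerate(lines, start=1):
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        match = ENTRY_RE.match(entry)
        if match is None:
            problems.append(f"line {lineno}: syntax error in {entry!r}")
            continue
        host = match.group("host")
        if host in seen:
            problems.append(f"line {lineno}: duplicate entry for {host}")
        seen.add(host)
        if host not in expected_hosts:
            problems.append(f"line {lineno}: unknown hostname {host}")
    # Any expected server that never appeared was omitted from the list.
    for host in sorted(set(expected_hosts) - seen):
        problems.append(f"missing: {host} is not in the member list")
    return problems


if __name__ == "__main__":
    config = ["app1:8009", "app1:8009", "app9:8009", "app2 8009"]
    for problem in check_member_list(config, {"app1", "app2", "app3"}):
        print(problem)
```

A checker like this only covers static mistakes in one file; mistakes that span several components would still need the kind of validation described later in the talk.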

  11. Operator mistakes: category vs. impact
      • 64% of all mistakes had an immediate impact on service performance
      • 36% resulted in latent faults
      • Obs. #1: A significant number of mistakes can be checked by testing in a realistic environment
      • Obs. #2: Undetectable latent errors will still require online recovery techniques

  12. Operator mistakes
      • Misconfigurations account for 57% of all errors
      • Configuration mistakes spanning multiple components (global misconfigurations) are more likely
      • Obs. #1: Tools to manipulate and check configurations are crucial
      • Obs. #2: Care is needed when maintaining multiple versions of software

  13. Operator categories
      • Experts also made mistakes!
      • The complexity of the tasks executed by experts was higher

  14. Summary of operator study
      • 43 experiments yielded 42 mistakes
      • 27 (64%) of the mistakes had an immediate impact on service performance
      • 24 (57%) were software configuration mistakes
      • Mistakes were made across all operator categories
      • Traces of operator commands and service performance for all experiments are available at http://vivo.cs.rutgers.edu
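
As a quick consistency check on the figures above, the percentages can be recomputed from the raw counts. This is only a worked calculation; the helper name `percent` is illustrative.

```python
def percent(count, total):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


TOTAL_MISTAKES = 42

print(percent(27, TOTAL_MISTAKES))                   # immediate impact: 64
print(percent(24, TOTAL_MISTAKES))                   # software configuration: 57
print(percent(TOTAL_MISTAKES - 27, TOTAL_MISTAKES))  # remaining (latent) mistakes: 36
```

The last figure matches the 36% of mistakes reported earlier as resulting in latent faults.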

  15. Talk outline
      • Approach and contributions
      • Operator study: understanding the mistakes
      • Validation: preventing exposure of mistakes
        • Technique
        • Experimental evaluation
      • Conclusion and future work

  16. Validation of operator’s actions
      • Validation
        • Allow the operator to check the correctness of his/her actions before exposing their impact to the service interface (clients)
      • Correctness is tested by:
        • Migrating the component(s) to a virtual sandbox environment,
        • Subjecting them to a real load,
        • Comparing their behavior to known correct behavior, and
        • Migrating them back to the online environment
      • Types of validation:
        • Replica-based: compare with an online replica (in real time)
        • Trace-based: compare with logged behavior
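
The comparison step of the two validation types can be sketched in a few lines. This is a minimal illustration, not the authors' infrastructure: components are modeled as plain Python callables, and the names `ValidationReport`, `replica_based`, and `trace_based` are assumptions made for the example. Migration into and out of the sandbox is outside the scope of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationReport:
    checked: int = 0
    mismatches: list = field(default_factory=list)

    @property
    def passed(self):
        # A run passes only if something was checked and nothing differed.
        return self.checked > 0 and not self.mismatches


def replica_based(component, replica, requests):
    """Send each request to both the component under validation and an
    online replica, and compare the two responses."""
    report = ValidationReport()
    for request in requests:
        expected = replica(request)
        actual = component(request)
        report.checked += 1
        if actual != expected:
            report.mismatches.append((request, expected, actual))
    return report


def trace_based(component, trace):
    """Replay logged (request, response) pairs against the component and
    compare each response with the logged one."""
    report = ValidationReport()
    for request, expected in trace:
        actual = component(request)
        report.checked += 1
        if actual != expected:
            report.mismatches.append((request, expected, actual))
    return report


if __name__ == "__main__":
    healthy = lambda req: req.upper()
    broken = lambda req: "ERROR" if req == "bid" else req.upper()
    print(replica_based(broken, healthy, ["bid", "browse"]).mismatches)
    print(trace_based(healthy, [("bid", "BID")]).passed)
```

Exact equality is the simplest possible comparator; real services have non-deterministic responses, so a practical comparator would need to be more tolerant.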

  17. Validating a component: replica-based
      [Diagram labels: Client requests; Online slice; Validation slice; Tier 1: Web servers; Tier 2: application servers; Tier 3: database; Web server proxy; Database proxy; Shunt; Application state; Compare (three comparison points)]

  18. Validating a component: trace-based
      [Diagram labels: Client requests; Online slice; Validation slice; Tier 1: Web servers; Tier 2: application servers; Tier 3: database; Web server proxy; Database proxy; Shunt; State (two instances); Compare (two comparison points)]

  19. Implementation details
      • Shunting is performed in the middleware layer
        • Each request is tagged with a unique ID all along the request path
      • Component proxies can be constructed with little effort (the MySQL proxy is ~384 NCSL, compared with 402K NCSL)
        • Reuse discovery and communication interfaces and a common messaging core
      • State management requires a well-defined export and import API
        • Stateful servers often support such an API
      • Comparator functions detect errors
        • Simple throughput, flow, and content comparators
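
To make the role of request IDs and comparators concrete, here is a rough sketch of a content comparator and a throughput comparator. The function names, the dictionary keyed by request ID, and the 10% tolerance are all assumptions for illustration; the slides do not specify how the comparators are implemented, and the flow comparator is not sketched here.

```python
def content_mismatches(online, validation):
    """Compare responses keyed by request ID.

    Both arguments map a request ID to a response body. Returns the sorted
    IDs whose validation response differs from, or is missing relative to,
    the online response.
    """
    return sorted(rid for rid in online if validation.get(rid) != online[rid])


def throughput_ok(online_rps, validation_rps, tolerance=0.1):
    """Accept the validated component if its throughput is within
    `tolerance` (a hypothetical threshold) of the online throughput."""
    return validation_rps >= online_rps * (1 - tolerance)


if __name__ == "__main__":
    online = {1: "item list", 2: "bid accepted", 3: "user page"}
    validation = {1: "item list", 2: "bid rejected"}
    print(content_mismatches(online, validation))
    print(throughput_ok(100.0, 95.0), throughput_ok(100.0, 60.0))
```

Tagging every request with an ID is what lets the comparator pair up responses from the two slices even when they arrive in a different order.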

  20. Validating our prototype: results
      • Live operator experiments
        • The operator chose the type and duration of validation, and could skip validation
        • Validation caught 6 of 9 mistakes in the 8 experiments with validation
      • Mistake-injection experiments
        • Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)
      • Operator-emulation experiments
        • Operator command scripts derived from the 42 operator mistakes
        • Both trace-based and replica-based validation caught 22 mistakes
        • Multi-component validation caught 4 latent (component interaction) mistakes

  21. Reduction in impact with validation

  22. Fewer mistakes with validation

  23. Shunting & buffering overheads
      • Shunting overhead for replica-based validation: 39% additional CPU
        • All requests and responses are captured and forwarded to the validation slice
      • Trace-based validation is slightly better: 32% additional CPU
      • Overhead is incurred on a single component, and only during validation
      • Various optimizations can reduce the overhead to 13-22%
        • Examples: response summaries (64 bytes), sampling (at session boundaries)
      • Buffering capacity is needed during state checkpointing and duplication
        • Only about 150 requests need to be buffered for small state sizes
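
One way to picture the response-summary optimization is to forward a fixed-size digest of each response instead of the full body. The slides state only that the summary is 64 bytes; using SHA-512 (whose digest happens to be 64 bytes) is an assumption made for this sketch, as are the function names.

```python
import hashlib


def summarize(response: bytes) -> bytes:
    """Return a fixed-size 64-byte summary of a response body.

    SHA-512 is an illustrative choice; the original work may compute
    its summaries differently.
    """
    return hashlib.sha512(response).digest()


def responses_match(online_summary: bytes, validation_response: bytes) -> bool:
    """Compare a forwarded summary with the locally produced response."""
    return online_summary == summarize(validation_response)


if __name__ == "__main__":
    page = b"<html>auction item 42</html>"
    summary = summarize(page)
    print(len(summary))
    print(responses_match(summary, page))
```

The trade-off is that a digest can only report that two responses differ, not where, so a failed comparison may still require fetching the full response.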

  24. Caveats, limitations & open issues
      • Non-determinism increases the complexity of comparators and proxies
        • E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session IDs, timestamps
      • Hard state management may require operator intervention
        • A component requires initialization before online migration
      • Bootstrapping the validation
        • Validating an intended modification of service behavior: nothing to compare with!
      • How long to validate? What types of validation?
        • Time spent in validation implies reduced online capacity
      • Future work: taking validation further
        • Validate operator actions on databases and network components
        • Combine validation with diagnosis to assist operators
        • Other validation techniques: model-based validation
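
The non-determinism caveat can be illustrated with a comparator that masks fields expected to differ between two correct responses. This is a sketch under assumptions: the `sessionid=` parameter name, the timestamp layout, and the function names are hypothetical and would have to be adapted to the actual service.

```python
import re

# Hypothetical patterns for fields that legitimately differ between runs.
SESSION_RE = re.compile(r"(sessionid=)[0-9A-Fa-f]+", re.IGNORECASE)
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


def normalize(response: str) -> str:
    """Replace session IDs and timestamps with fixed placeholders."""
    response = SESSION_RE.sub(r"\1<SESSION>", response)
    return TIMESTAMP_RE.sub("<TIME>", response)


def equivalent(online: str, validation: str) -> bool:
    """Treat two responses as equal if they match after normalization."""
    return normalize(online) == normalize(validation)


if __name__ == "__main__":
    a = "sessionid=ab12cd generated 2003-12-01 10:00:00 price=10"
    b = "sessionid=ff00ee generated 2003-12-01 10:00:07 price=10"
    print(equivalent(a, b))
```

Each masked field is also a place where a real mistake could hide, which is one reason non-determinism makes comparators harder to get right.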
