Why Do Upgrades Fail? (and What Can We Do About It?)
Tudor Dumitraş, Priya Narasimhan
PARALLEL DATA LABORATORY, Carnegie Mellon University

Upgrades in Enterprise Systems
• Increasing cost of downtime
• Most outages are due to planned events (e.g. software upgrades)
• Software upgrades are unreliable
  • AOL (1996): outage, routing-table corruption
  • AT&T Wireless (2003): outages, data loss, $100M loss
  • Hospital system (2006): medication unavailable in ER
• Causes of upgrade failures:¹
  • Broken dependencies
  • Removed behavior
  • Bugs in new version
  • Incompatibility with legacy configurations
¹ Survey from [Crameri et al., SOSP 2007]

Outline
• Reasons for unplanned downtime
• Four types of upgrade faults
• Upgrade-fault frequencies
• Upgrade-fault impacts
• Reasons for planned downtime
• Trade-offs for online upgrades
• Challenges for reliable, online upgrades

Upgrade Faults
• Procedure violations
  • Omitted action
  • Incorrect action
  • Spurious action
  • Order inversion
• Three sources of upgrade-fault data
  • User study [Nagaraja et al., OSDI 2004]
  • Survey [Oliveira et al., USENIX 2006]
  • Field study (Apache bug reports from 2007)
Procedure violations occur in 43% of cases

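The four kinds of procedure violation can be illustrated by comparing the steps an operator performed against the documented procedure. The following is only a sketch, not the method used in the cited studies: the function `find_violations`, the `(action, argument)` step format, and the assumption that each action name occurs at most once are all hypothetical.

```python
from enum import Enum


class Violation(Enum):
    OMITTED = "omitted action"
    INCORRECT = "incorrect action"
    SPURIOUS = "spurious action"
    ORDER_INVERSION = "order inversion"


def find_violations(expected, performed):
    """Compare two lists of (action, argument) steps.

    Assumes (for illustration) that every action name is unique in each list.
    Returns a list of (Violation, action_name) pairs.
    """
    expected_args = dict(expected)
    performed_args = dict(performed)
    violations = []

    # Omitted: a required action never happened.
    # Incorrect: the action happened, but with a different argument.
    for name, arg in expected:
        if name not in performed_args:
            violations.append((Violation.OMITTED, name))
        elif performed_args[name] != arg:
            violations.append((Violation.INCORRECT, name))

    # Spurious: an action that is not part of the procedure.
    for name, _ in performed:
        if name not in expected_args:
            violations.append((Violation.SPURIOUS, name))

    # Order inversion: two actions that are consecutive in the procedure
    # (ignoring omitted ones) were performed in the opposite order.
    position = {name: index for index, (name, _) in enumerate(performed)}
    common = [name for name, _ in expected if name in position]
    for earlier, later in zip(common, common[1:]):
        if position[earlier] > position[later]:
            violations.append((Violation.ORDER_INVERSION, later))

    return violations
```

For example, given a procedure "stop, back up, install v2, start", an operator who installs v1 before the backup, reboots, and never restarts the service would trigger all four categories.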
Hidden Dependencies
Dependencies that cannot be detected automatically or are overlooked due to their complexity

Hidden-Dependency Examples
• Service location
  • File path
  • Network address
• Dynamic linking
  • Library conflicts
  • Defective components
• Database schema
  • Application/DB mismatch
  • Missing indexes
• Access privileges (excessive / insufficient / unavailable)
  • File system
  • Database objects
  • URLs
• Configuration-parameter constraints
• Replication degree
• Storage-space availability
• Client access to system-under-upgrade
• Cached data
  • SSL certificates
  • DNS lookups
  • Buffer cache
• Listening-port conflicts
• Protocol mismatch
• Entropy for random-number generation
• Request scheduling
• Disk speed

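A few of the dependencies listed above (file paths, listening ports, storage space) can at least be probed before an upgrade starts. The sketch below uses only the Python standard library; the helper names and the idea of a named "preflight" check list are assumptions made for illustration. Most of the other items on the list have no equally simple probe.

```python
import os
import shutil
import socket


def path_exists(path):
    """File-path dependency: the referenced file or directory is present."""
    return os.path.exists(path)


def port_is_free(port, host="127.0.0.1"):
    """Listening-port conflict: try to bind the port the new version needs."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def has_free_space(path, required_bytes):
    """Storage-space availability on the file system that holds `path`."""
    return shutil.disk_usage(path).free >= required_bytes


def preflight(checks):
    """Run named zero-argument checks; return the names of those that fail."""
    return [name for name, check in checks.items() if not check()]
```

A caller would pass a dictionary such as `{"data dir": lambda: path_exists("/var/lib/app")}` (a hypothetical path) and abort the upgrade if the returned list is non-empty.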
Statistical Cluster Analysis
[Figure: cluster diagram of the faults (distance axis from 0.0 to 1.0); fault source: user study (u), survey (s), field study (f)]
• Type 1: simple configuration or procedural errors
• Type 2: semantic configuration errors
• Type 3: broken environmental dependencies
• Type 4: data-access errors

Upgrade-Fault Frequencies
[Figure: number of system administrators by reported failure frequency (1% to 90%) [Crameri et al., SOSP 2007]]
[Figure: probability density of fault frequency (0% to 60%) for Types 1 to 4, from the user study and the survey; x marks the frequency estimate, [ ] the confidence interval]
Mean: 8.6%; Max: 50%

Tolerating Upgrade Faults
• Type 1
  • Check the syntax of configuration files
  • Currently catches 38%–83% of typos¹
• Type 2
  • Check constraints among configuration parameters¹
• Type 3
  • Package management (e.g. Microsoft Update, Debian APT)
• Type 4
  • Validate the actions of database administrators²
Best practice: phased deployment, to minimize risks³
¹ Keller et al., DSN 2008
² Oliveira et al., USENIX 2006
³ Information Technology Infrastructure Library (ITIL) 2007

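The Type 1 and Type 2 defenses can be sketched together: first parse the configuration file (a syntax check), then evaluate constraints that span several parameters. This is an illustrative sketch and not the tool from Keller et al.; the INI format, the `[server]` section, the parameter names, and the sample constraint are assumptions.

```python
import configparser


def validate(config_text, constraints):
    """Return a list of problems found in an INI-style configuration.

    `constraints` is a list of (description, predicate) pairs; each predicate
    receives a dict of integer parameters from the (assumed) [server] section.
    """
    parser = configparser.ConfigParser()
    try:
        # Type 1: malformed lines, missing sections, non-numeric values.
        parser.read_string(config_text)
        params = {key: int(value) for key, value in parser["server"].items()}
    except (configparser.Error, KeyError, ValueError) as exc:
        return [f"syntax: {exc}"]

    # Type 2: each value is well-formed, but the combination is invalid.
    return [
        f"constraint: {description}"
        for description, holds in constraints
        if not holds(params)
    ]


# Hypothetical constraint: clients must fit into the configured thread pool.
CONSTRAINTS = [
    (
        "max_clients <= server_limit * threads_per_child",
        lambda p: p["max_clients"] <= p["server_limit"] * p["threads_per_child"],
    ),
]
```

Separating the two stages mirrors the slide: a typo is caught by the parser alone, whereas a semantic error passes the parser and is only visible when parameters are checked against each other.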
Upgrade-Fault Impacts
• Rolling Upgrades
• Big Flip
• Imago
[Figure: faults injected (0 to 3) per fault type (1 to 4), in three panels: Rolling Upgrades, Big Flip, Imago. Fault impacts shown: latent error, security vulnerability, increased latency, degraded throughput, full outage]

Why Do Upgrades Fail?
• Online upgrades: more vulnerable to upgrade faults
• Rolling upgrades: break hidden dependencies
  • Can have a global impact
  • Due to states with mixed versions
• Big flip: has single points of failure
  • Example: the database (vulnerable to Type 4 faults)
• In-place upgrades: introduce latent errors
Atomic end-to-end upgrades can be more reliable than piecewise, phased upgrades

Why Do We Break Dependencies?
[Figure: Front-end and Datastore]
• Some dependencies cannot be detected automatically
• Dependency-resolution is NP-complete
• Shared-library dependencies

Reasons for Planned Downtime (1)
From the history of Wikipedia upgrades:
[Figure: schema diagram; labels: old_title = cur_title, cur_id]

Reasons for Planned Downtime (2)
Offline upgrade:
    ALTER TABLE old ADD COLUMN old_id INT(8) UNSIGNED NOT NULL;
    UPDATE old, cur SET old_id = cur_id WHERE old_title = cur_title;
    ALTER TABLE old DROP COLUMN old_title;
Online upgrade:
• INSERT
  • old: add old.old_id column
  • cur: update old.old_id
• UPDATE
  • old.old_title: update old.old_id
  • cur.cur_title: update old.old_id
  • cur.cur_id: update old.old_id
• DELETE
  • old: delete row
  • cur: update old.old_id
Stop using old schema
Cannot compute incrementally

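The difference between the two approaches can be tried out on a toy database. The sketch below uses SQLite (through Python's `sqlite3` module) as a stand-in for the MySQL statements on the slide, so the SQL differs: the multi-table `UPDATE` becomes a correlated subquery, `old_id` gets a default of 0, and the `DROP COLUMN` step is left out. The table contents, the `old_text` column, and the function names are assumptions. Only the "INSERT into old" handler of the online upgrade is shown, as a trigger; the other handlers listed on the slide would each need their own.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with simplified `cur` and `old` tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cur (cur_id INTEGER PRIMARY KEY, cur_title TEXT NOT NULL);
        CREATE TABLE old (old_title TEXT NOT NULL, old_text TEXT);
        INSERT INTO cur VALUES (1, 'Main_Page'), (2, 'Upgrade');
        INSERT INTO old VALUES ('Main_Page', 'rev a'), ('Orphan', 'rev b');
    """)
    return conn


def add_and_backfill(conn):
    """Offline-style conversion: add old_id and fill it from cur in bulk."""
    conn.execute("ALTER TABLE old ADD COLUMN old_id INTEGER NOT NULL DEFAULT 0")
    conn.execute("""
        UPDATE old
           SET old_id = (SELECT cur_id FROM cur
                          WHERE cur.cur_title = old.old_title)
         WHERE EXISTS (SELECT 1 FROM cur
                        WHERE cur.cur_title = old.old_title)
    """)


def install_insert_handler(conn):
    """Online-style handler: keep old_id current for rows inserted later
    by application code that still uses the old schema."""
    conn.executescript("""
        CREATE TRIGGER old_insert_sync AFTER INSERT ON old
        BEGIN
            UPDATE old
               SET old_id = COALESCE(
                       (SELECT cur_id FROM cur
                         WHERE cur.cur_title = NEW.old_title), 0)
             WHERE rowid = NEW.rowid;
        END;
    """)
```

Even in this reduced form, the online path needs extra logic for every kind of write against either table, which is one reason the offline path with downtime is often chosen.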
Reasons for Planned Downtime (3)
• DB index redefinitions
• Sync changes to application servers and DB
• Drop columns from DB
• Table joins in DB
• Aggregates (e.g., max(), min())
• Convert article text to UTF8
• Long running
• Bulk updates can hang up DB replication
• Might overload the infrastructure

Trade-offs for Online Upgrades
Design axes: In-Place vs. Out-of-Place; Mixed Versions vs. No Mixed Versions
Trade-offs shown across the four combinations:
• Additional HW
• Additional storage
• Risk of propagating corrupted data
• Need indirection layer
• Potential overhead
• Installation downtime
• Conversions impose overhead
• Risk of breaking hidden dependencies

Conclusions
• Hidden dependencies cause upgrade failures
• Localized changes can induce global failures
• Dependency tracking has fundamental limitations
• DB-schema changes often impose downtime
• Challenges for reliable, online upgrades:
  • Handling hidden dependencies
  • Computationally-intensive data conversions
  • Upgrade testing and fault management

Backup Slides

Reasons for Upgrading
• Protect against attacks and errors
  • No changes to interfaces or data formats
• Add new features
  • Backwards compatible
• Migrate to new platform (end-of-life, efficiency reasons)
  • Data conversions required
• Switch vendors
  • Different systems, not different versions
  • Interface changes
• Change business processes
  • Arbitrary changes

Classification Methodology
• 55 distinct faults from three studies
• Five classification variables:
  • Root cause of fault (e.g. procedure, configuration)
  • Broken hidden dependency (if any)
  • Fault location
  • Original classification
  • Cognitive level involved
    • Skill-based: simple, repetitive tasks
    • Rule-based: problems solved by pattern-matching
    • Knowledge-based: reasoning from first principles

Upgrade-Fault Characteristics
[Figure: faults plotted by broken hidden dependency and by root cause (configuration faults vs. procedural faults), grouped into Types 1 to 4; each point is labeled with its source: user study (u), survey (s), field study (f). Hidden-dependency categories: database schemas, storage-space availability, access privileges, request scheduling, cached data, parameter constraints, shared libraries, listening ports, communication protocols, network addresses, file paths, replication degrees]