Introduction to Massive Upgrades and Changes

Introduction to Massive Upgrades and Changes. Instructors: Tom Limoncelli With Material From: “The Practice of System and Network Administration” by Limoncelli & Hogan http://www.EverythingSysadmin.com. Class Exercise. Multi-Purpose Server Upgrade

Presentation Transcript


  1. Introduction to Massive Upgrades and Changes Instructors: Tom Limoncelli With Material From: “The Practice of System and Network Administration” by Limoncelli & Hogan http://www.EverythingSysadmin.com

  2. Class Exercise Multi-Purpose Server Upgrade Select a machine from your network and walk through what would be involved in upgrading the OS. Intro to Massive Upgrades and Changes -- www.EverythingSysadmin.com

  3. Our gift to all attendees • The Paper-O-Matic • (paperclip not included)

  4. Exercise • Service Checklist – list services • Plan each service – supported on new OS? • Document Verification (test) procedure • Document Back-out plan • Schedule the big event – when & how long • Announce as appropriate – where and when? • Test, Upgrade, Test • Communicate Completion

  5. Paper-O-Matic Alternatives?

  6. Introductions Your instructors: • Tom Limoncelli – SA since 1988, UNIX since 1991. Currently Director of Network Operations, Lumeta Corp. Previously at Bell Labs. • Co-author of “The Practice of System and Network Administration”

  7. Definition of “Massive” Scope larger than “normal” projects • Impacts a large number of customers • Failure will be highly visible Examples: • Upgrading a server • Rolling out a new application • Renumbering IP networks • Changes on a large WAN • Day-long reorganization

  8. Other Commonalities • Large number of SAs on team • Highly visible to customers • Expensive • Potential for expensive mistakes

  9. What causes failure? • Lack of planning -> chaos • Miscommunication -> chaos • Lack of documentation -> chaos • LACK OF PROCESS -> chaos Change management reins in chaos

  10. OVERVIEW • Class Exercise: Upgrading 1 Server (40 min) • Introductions (5 min) • Change Management Basics (30 min) • Service Conversion Theory (15 min) • BREAK • Class Discussion: Nagano (10 min) • Technique: IP Renumbering (30 min) • Managing Maintenance Windows (40 min)

  11. Change Management Basics

  12. Definition: Change Management • The process that ensures effective planning, implementation, and post-event analysis of changes made to a system. • Changes should be documented, reproducible, backed by a back-out plan, and communicated as appropriate.

  13. What’s it all about? • To the casual observer: • Documented change requests, approved or rejected before implementation. • Change management is: • Scheduling – for least impact • Communication – within team, to customers, to management • Planning – all eventualities covered

  14. Formal or Informal? • The larger the site, the more formality is required. • Large sites often have a change-control council that meets weekly to approve requests. • Smaller sites simply need a manager’s approval.

  15. Change Requests Handout #1: Quiet Time Is Coming • A written document • What will be changed • What is the expected impact/outage • When is change needed by • Who requests change, who is it for • Back-out plan • Responsible people
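
The change-request fields listed on this slide can be sketched as a small record type. This is only an illustration: the class name, field names, and the `missing_fields` helper are assumptions made for the example, not part of the handout.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ChangeRequest:
    """One written change request (field names are illustrative)."""
    what: str                # what will be changed
    expected_impact: str     # expected impact / outage
    needed_by: date          # when the change is needed by
    requested_by: str        # who requests the change
    requested_for: str       # who the change is for
    back_out_plan: str       # how to undo the change
    responsible: List[str] = field(default_factory=list)  # responsible people

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty, so an incomplete request can be sent back."""
        text_fields = ("what", "expected_impact", "requested_by",
                       "requested_for", "back_out_plan")
        missing = [name for name in text_fields if not getattr(self, name).strip()]
        if not self.responsible:
            missing.append("responsible")
        return missing
```

A request with no back-out plan and nobody assigned would report both gaps, which is the kind of check a reviewer would otherwise do by eye.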

  16. Types of Changes: • Routine Updates • Major Updates • Sensitive Updates

  17. Routine Updates • Can happen at any time. Invisible to customers • Ex: Updating a directory/authentication server, debugging a printing problem, altering monitoring systems, enabling an existing router interface. • Failure scope: minimal • Communication needed: None

  18. Major Updates • Affect a large number of systems or require a significant system, network, or service outage • Ex: upgrading authentication systems, changes to email or printing infrastructure, upgrading core network infrastructure, installing new (non-hotplug) router interface. • Failure scope: affects many, many people • Communication needed: email or similar

  19. Sensitive Updates • Do not seem major, but would cause significant outages if something went wrong. • Ex: Altering router configurations, global access policies, firewall configurations, alterations to a critical server, installing card in router that “should be hot-plug”. • Communication needed: “pull” mechanism like web site, newsgroup, forewarn helpdesk

  20. Classification Notes: • Different definitions at different sites, or parts of sites. • One e-commerce company considered adding a new host to its corporate network “routine”, but adding one to the customer-visible network “sensitive”.

  21. When to do updates? • Major Updates – based on organization’s maintenance window and SLAs • Sensitive Updates – should happen outside of peak usage times to minimize impact and maximize time to discover & rectify problems • Routine Updates – any time (what about network quiet times?)
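
The three update classes and their timing rules can be sketched as a small decision function. The names `UpdateType`, `COMMUNICATION`, and `may_proceed` are hypothetical, and the sketch assumes that a quiet time blocks every class of update, routine ones included; a real site would substitute its own definitions.

```python
from enum import Enum


class UpdateType(Enum):
    ROUTINE = "routine"
    MAJOR = "major"
    SENSITIVE = "sensitive"


# Communication each class calls for, paraphrased from the slides.
COMMUNICATION = {
    UpdateType.ROUTINE: "none",
    UpdateType.MAJOR: "email or similar",
    UpdateType.SENSITIVE: "pull mechanism: web site, newsgroup, forewarn helpdesk",
}


def may_proceed(kind, *, in_maintenance_window, peak_usage, quiet_time):
    """Return True if an update of this class may go ahead right now."""
    if quiet_time:
        return False  # assumption: quiet times block all non-repair changes
    if kind is UpdateType.MAJOR:
        return in_maintenance_window  # major updates wait for the window
    if kind is UpdateType.SENSITIVE:
        return not peak_usage  # sensitive updates avoid peak usage
    return True  # routine updates can happen at any time
```

Keeping the rules in one function makes the site-specific judgment calls (what counts as peak usage, which window applies) explicit inputs rather than tribal knowledge.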

  22. Network Quiet Times • Official days where all changes (outside of repairing outages) are forbidden. Sometimes global, often local. • Examples: • The last 15 days before tax filings are due each quarter • 2 weeks before a major software release is scheduled to ship (and until 3 days after shipment)
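
A quiet-time calendar like the release example above can be checked mechanically. In this minimal sketch the function names are hypothetical, windows are treated as inclusive date ranges, and the ship date is assumed to be known in advance.

```python
from datetime import date, timedelta


def release_quiet_time(ship_date, before=timedelta(weeks=2), after=timedelta(days=3)):
    """Return the inclusive (start, end) dates of the quiet time around a release."""
    return ship_date - before, ship_date + after


def in_quiet_time(day, windows):
    """True if `day` falls inside any (start, end) quiet-time window."""
    return any(start <= day <= end for start, end in windows)
```

For a release shipping on 20 June, `release_quiet_time(date(2030, 6, 20))` covers 6 June through 23 June, so a change proposed for 10 June would be refused and one for 24 June allowed.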

  23. Handout #2: SAMPLE CM POLICY A policy you can adopt NOW

  24. The CM Meeting: • Meetings where proposed changes are reviewed, discussed, and scheduled (if approved). • Typically weekly or monthly depending on quantity of change.

  25. Sidebar: Daily CM Meetings? • A .com company had stability problems significant enough to be front-page news • Had daily meetings due to extreme growth rate. (Mostly Change Control rather than CM) • Postponed CM Requests on days of “bad weather”. • Daily meetings let them deduce “what changed” when problems sprang up.

  26. Meetings formally document: • What will be done and when • How long will the change take • What can go wrong • Testing procedures • Back-out plans Side benefit: Forces you to think these things out.

  27. Meetings Communicate: • Make other people aware of changes • They can recognize potential source of problems • Meeting should include representatives from across the company • They can then communicate within their own group about the changes

  28. CM Meeting’s Global Impact • Attendees develop an overall view of what’s happening within the company • Senior SAs/managers can spot problems before they happen • Reduces entropy and leads to stability

  29. Communicating Changes • How are CM issues communicated to customers? (Email, Newsletters, etc.?) • When to communicate: • When there will be an outage • When procedures/software will change • Communication via email should only be to customers that will be affected

  30. Explicit approval vs. objection Email pre-announcing an outage gives customers an opportunity to request a reschedule: • Explicit Objection – Outage will happen unless someone explicitly objects • Email: “To request that this maint. window be rescheduled, please contact Joe Smith.” • vs. Explicit approval – Outage will happen only after explicit approval. • In person: Request at the CM board meeting.

  31. Case Study: WAN CM (handout #1) • “The secret to a reliable WAN is good procedure.” • Maintained schedule of outages and “Quiet Times” • Scope (Global or local), Impact & Risk • All changes to back-bone routers required approval by CM Request Board • LAN routers only need CM approval if outage expected.

  32. Case Study: Network Life-cycle • “Build-out” (birth) • Entry -- New construction • CM in the form of documenting rather than approval • Goal -- Get to certification • “Certification” • Entry -- Installation complete, testing done, check-list of requirements met (VRRP/HSRP, ) • Goal -- Maintain uptime/reliability/performance • “Decommission” • Entry -- Elvis has left the building • Goal -- Eliminate dependence, in order, by deadline

  33. OVERVIEW • Class Exercise: Upgrading 1 Server (40 min) • Introductions (5 min) • Change Management Basics (30 min) • Service Conversion Theory (15 min) • BREAK • Class Discussion: Nagano (10 min) • Technique: IP Renumbering (30 min) • Managing Maintenance Windows (40 min)

  34. Service Conversion Theory • Definition • Theory of Pillars vs. Layers • Prepare the customers • Minimize Intrusiveness • Flash cuts vs. phased approach • Back-out plans • Grouping changes Ex: “Rioting Mob”

  35. Definition: Service Conversion Any change that requires touching many hosts to make a single, or many, changes • The same 1 change on hundreds of hosts • The same 50 changes on hundreds of hosts

  36. Examples: • Service being replaced: • New client software on each host • Or, Each client re-pointed to new server • Rolling out new software to each client • IP Renumbering • Enabling new feature: Moving to DHCP • Splitting customers over a new server • To load-balance or to divide company before spin-off

  37. Prepare the customers Does the new service require customers to change work methods? • Can they use the old client? • Is training available? • Is new documentation complete and distributed? • Is the helpdesk trained on • potential conversion problems? • the new software itself?

  38. Minimize SA Intrusiveness Ultimately, you want to minimize intrusiveness to the humans • Does the conversion require service outage? • Can outage be avoided? • Can you minimize the outage duration? • Can the outage be scheduled out of hours? • Will we visit the customer’s PC more than once? • Can the visits be avoided? Combined?

  39. Flash cut vs. phased approach • Flash cut – Change all at once • Upgrade a server “in place” • (implies little/no ability to back out) • Phased approach – Slower and safer • Provide old and new service for a period (like new area codes) • Or, budget for duplicate hardware, install off-line, move clients over slowly

  40. Successful Flash-cuts The secret to successful flash-cuts is testing, testing, testing Example: The new calendar system doesn’t communicate with the old one, so data will be exported and all clients will be required to switch on a specific day. • New calendar system on new hardware • Major amounts of load-testing performed • Trial users test new system (with understanding that data will be wiped on conversion day) • QA metrics defined and met. • Documentation & training for customers • Helpdesk trained on new software and conversion

  41. Successful Phased Conversion • “One, Some, Many” Technique • Test conversion w/successively larger groups. • If entire group converted successfully, move to larger group. • If any failures, revise process, shrink group. • “One, some, many” • One – My machine. Large incentive to get right • Some – Co-workers and SAs that can give feedback. • Many – Larger and larger groups, starting with the least risk averse
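
The "one, some, many" loop above can be sketched in a few lines. Everything named here is an assumption for illustration: `convert` stands in for whatever performs the conversion on one host and returns True on success, the group sizes are arbitrary, and failed hosts are set aside for manual follow-up rather than retried. The "revise process" step is human work that the code cannot do; it only shrinks the group so the next attempt risks fewer hosts.

```python
def one_some_many(hosts, convert, group_sizes=(1, 5, 25, 100)):
    """Convert hosts in successively larger groups.

    Returns (converted, failed). After a clean group the next group is
    larger; after any failure the next group is smaller.
    """
    remaining = list(hosts)
    converted, failed = [], []
    level = 0  # index into group_sizes; start with "one"
    while remaining:
        size = group_sizes[level]
        group, remaining = remaining[:size], remaining[size:]
        failures = [host for host in group if not convert(host)]
        converted.extend(host for host in group if host not in failures)
        failed.extend(failures)
        if failures:
            level = max(level - 1, 0)  # shrink the group
        else:
            level = min(level + 1, len(group_sizes) - 1)  # grow the group
    return converted, failed
```

Ordering `hosts` so that your own machine comes first, then co-workers, then the least risk-averse customers reproduces the sequence the slide describes.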

  42. Pillars vs. Layers Suppose: 50 tasks to be done on all hosts Layered approach – Perform one task for all hosts before moving on to next task. Pillars approach – Perform all required tasks on a host before moving to next host.
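
The two approaches differ only in which loop is on the outside. A minimal sketch, with hypothetical function names, that returns the order in which (task, host) pairs would be carried out:

```python
def layered(hosts, tasks):
    """Layers: perform one task on every host before moving to the next task."""
    return [(task, host) for task in tasks for host in hosts]


def pillared(hosts, tasks):
    """Pillars: perform every task on one host before moving to the next host."""
    return [(task, host) for host in hosts for task in tasks]
```

With hosts `a` and `b` and tasks `t1` and `t2`, the layered order is t1 on a, t1 on b, t2 on a, t2 on b, while the pillared order finishes host `a` completely before touching host `b`.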

  43. What to layer or pillar? Layer tasks that are not intrusive to customers. Pillar tasks that are. Example: a new calendar server • Layer: creating accounts • Pillar: visiting customer to install new software, freezing schedule and converting it to new system, first connection to initialize password
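
The rule of thumb above can be sketched as a planner that layers the non-intrusive tasks and pillars the intrusive ones. The `plan` function, the `(name, intrusive)` task format, and the choice to run all layered work before any customer visit are assumptions made for this example.

```python
def plan(hosts, tasks):
    """Order work so customers are interrupted once.

    `tasks` is a list of (name, intrusive) pairs. Non-intrusive tasks are
    layered across all hosts first; intrusive tasks are then pillared so
    each host gets a single visit.
    """
    background = [name for name, intrusive in tasks if not intrusive]
    visit = [name for name, intrusive in tasks if intrusive]
    steps = [(name, host) for name in background for host in hosts]
    steps += [(name, host) for host in hosts for name in visit]
    return steps
```

For the calendar example, account creation would run across every host up front, and the software install and schedule conversion would then happen together in one visit per customer.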

  44. Pillar benefits The pillared approach means scheduling one period with each customer, which is less annoying to the customer. • Scheduling and re-visiting missed customers has extremely high overhead • Two 5-minute meetings are more work than a single 10-minute meeting • Multiple visits = multiple annoyances

  45. The Rioting Mob Technique • Tom’s group needed to make many changes to 1000 hosts in 1 month. • UNIX: Script written and tested (one, some, many) • Windows: 5-6 manual changes • Other devices: Ad hoc (mostly IP addr) • Layered all server-side changes. • How could we do the pillars?

  46. Rioting Mob Example • Numbered each hallway & announced schedule • Mon: Convert a hallway • Tue: Fix problems and improve process • Wed: Convert another hallway • Thu: Fix problems and improve process • Fri: No changes (so we couldn’t break anything and ruin our weekend). Catch up with other work and email.

  47. First Try: • 9am: entire team starts at hallway • 2 PC techs went office-to-office down left-hand side making changes. • 2 UNIX techs went office-to-office down left-hand side making changes. • Similar pairs went down right-hand side • 2 senior SAs available to debug and/or handle oddball hosts • SAs called into “command central” to request IP addresses

  48. Tuesday • Cleaned up anything we broke • Brainstormed on how to improve • Detailed what happened minute by minute • Detailed problems • Brainstormed on solutions • Revised process

  49. New Process • Make initial pass through hallway: • Give customer a gentle warning to log out • Call in requests for IP addresses • Identify nonstandard machines for senior SAs to focus on • Second pass through hallway: • Do actual conversion

  50. Results • Conversion much smoother, customers happier • “Tue/Thu brainstorms” eventually dwindled to nil as the process was perfected. • Soon conversions done by noon, Tue/Thu used for planning
