Don't gamble when it comes to reliability

Learn from the Uber Site Reliability Engineering team about the importance of knowing when things are broken, avoiding global changes, and moving traffic instead of fixing in place. Discover how to make your mitigations normal and keep your systems reliable.

nakashima

Presentation Transcript


  1. Don't gamble when it comes to reliability. Tom Croucher, Uber Site Reliability Engineering, tomc@uber.com, @sh1mmer

  2. You are in a 5th floor office. You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. >

  3. You are in a 5th floor office. You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. > south

  4. You approach the elevator. You can smell lunch. You receive a page: “Blackbox is alerting in 6 cities” >

  5. You approach the elevator. You can smell lunch. You receive a page: “Blackbox is alerting in 6 cities” > commandeer conference room

  6. You are in an incident war room. Everything is f̨̧͔̳̝̫̳̥̭͙̰̜͕̝̣̗͍̤̺̦̼̘̘̲̮͔̠̩̞ͅ ̲̻͕͖̬̞͔̣̹̮͈̝̗͓̗̲̫̘̜̱̫̦̭̼̭̙͇͖̭̥̼ͅ ̧̨̨̢̧̗̞̝̤̙̗̗̣͎̗̭̯̘̤̩̬̦͍̖͈͚̭͎̱̠͜ ̧͖̳̗͙̖̣̙̣̦̣̦̬̲͇̰̩̠̠̳̼͕̗̲̞̼̭ͅͅ ̨̡̡̞̣̗̳̱̻̮͔̩̘̳̲͕̰͎͓͔̜̳͕͇͙̠̯͓͓͜ ̢̢̳̘̗̤͕̦͕̞̮̟͎̞̭̠̳̙̻̤̰̺̻̼ͅͅͅ ̢̨̨̣͓̠̦͈͙̝̯̮̮̘̙̥͍͖̯̟̪͖̣̜̬̪̩̭ͅ ̧̢̲̺͎̠̙̰͖͈̻̹̘̦̳͇͚̘̩̼͚̰̹̦͓̰̟̞̥͈͚͍̹͎͙̙͕̥ ̧̡̡̧͔̝͎̭̩̞̞̭̼̺̳͚͎̙͉͕̥̗̲͕͓̻͎͖͇̟̣̭͔̙̞̖̙̬̼͜ͅ ̨̡̭̭͕̫̭̫̘̠͙̖̪̺̻̥̪̼̘̭̖̩͉̦͜͜͜ͅͅ >

  7. Internal dashboards aren’t working

  8. Rule 1. Always Know When It’s Broken

  9. Rule 1. Always Know When It’s Broken. Diagram labels: Uber Datacenter, Test Trips, Alerts, Tom’s Phone, Blackbox monitoring system.
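A blackbox monitor of the kind slide 9 labels can be sketched roughly as follows. This is a minimal illustration, not Uber's system: `run_test_trips`, `failing_cities`, `alert_message`, and the `probe` callable are hypothetical names, and the probe is injected so that the sketch stays independent of any real endpoint. The alert wording mirrors the page text quoted on slide 4.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    city: str
    ok: bool


def run_test_trips(cities: List[str], probe: Callable[[str], bool]) -> List[ProbeResult]:
    """Run one synthetic test trip per city from outside the system.

    `probe` is a hypothetical callable that returns True when a test trip
    succeeds. A probe that raises is counted as a failure, because a
    monitor that crashes on errors cannot report them.
    """
    results = []
    for city in cities:
        try:
            ok = bool(probe(city))
        except Exception:
            ok = False
        results.append(ProbeResult(city, ok))
    return results


def failing_cities(results: List[ProbeResult]) -> List[str]:
    """Return the cities whose test trip failed, sorted by name."""
    return sorted(r.city for r in results if not r.ok)


def alert_message(results: List[ProbeResult], min_failing: int = 1) -> Optional[str]:
    """Build a page message, or None when fewer than `min_failing` cities fail."""
    failing = failing_cities(results)
    if len(failing) < min_failing:
        return None
    noun = "city" if len(failing) == 1 else "cities"
    return f"Blackbox is alerting in {len(failing)} {noun}"
```

The design point is that the probe exercises the product the way a user would and pages a person directly, so it keeps working when internal dashboards do not.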

  10. Many services are suddenly losing a high % of requests...

  11. ...so all these teams jump on a conference call.

  12. It began with a simple mistake.

  13. insert firewall all_services

  14. insert [into] firewall [group] all_services [group]

  15. insert [into] all_services [group] firewall [group]
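Slides 14 and 15 show two readings of the same command that differ only in which name is the target group. The real tool's syntax is not shown in the deck, so the following is a hypothetical sketch of one way to make that kind of swap harder: an `insert_into` helper (an invented name) whose `target` and `member` parameters are keyword-only.

```python
from typing import Dict, Set

Groups = Dict[str, Set[str]]


def insert_into(groups: Groups, *, target: str, member: str) -> None:
    """Add `member` to the existing group `target`.

    Both names are keyword-only, so a caller must spell out which one is
    the target; two bare positional names are rejected with a TypeError.
    """
    if target not in groups:
        raise KeyError(f"unknown target group: {target}")
    groups[target].add(member)


# Usage: the call site states its own intent.
groups: Groups = {"firewall": set(), "all_services": set()}
insert_into(groups, target="all_services", member="firewall")
```

This does not stop someone from typing the wrong target, but it removes the ambiguity of argument order that the slides illustrate.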

  16. While fixing the Redis issue the firewall change was pushed to the shared service cluster globally.

  17. Rule 2. Avoid Global Changes

  18. Rule 2. Avoid Global Changes
      Unplanned failures, by layer:
      • Internet: network provider failure, e.g. dropping BGP routes, etc
      • Network: switch/router bug/failure
      • Software: software bug
      • Compute: machine failure, TORS failure, etc
      • Cooling: chiller failure, BMS failure, etc
      • Power: grid failure, UPS failure, etc
      Planned changes (layers shown: Network, Software, Compute, Cooling, Power): You deployed. You deployed. You messed with it.

  19. Rule 2. Avoid Global Changes. Having multiple network providers, availability zones, datacenters, racks, switches, routers, and service instances protects you against fate, or other people’s mistakes. If you deploy all at once, nothing can protect you from your own mistakes.
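One common way to avoid deploying all at once is a staged rollout that stops at the first unhealthy stage. The sketch below is an assumption about how that could look, not the process the talk describes: `staged_rollout`, `apply_change`, `is_healthy`, and `rollback` are hypothetical names, and the stage names in the usage are invented.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Apply a change one stage at a time instead of globally.

    After each stage, every target in that stage is health-checked. On the
    first unhealthy stage, everything touched so far is rolled back in
    reverse order and later stages are never touched.
    Returns (succeeded, targets_that_were_touched).
    """
    applied: List[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            applied.append(target)
        if not all(is_healthy(t) for t in stage):
            for t in reversed(applied):
                rollback(t)
            return False, applied
    return True, applied


# Usage with hypothetical stage names: one canary, then one datacenter at a time.
ok, touched = staged_rollout(
    [["canary"], ["dc1-a", "dc1-b"], ["dc2-a"]],
    apply_change=lambda t: None,
    is_healthy=lambda t: t != "dc1-b",
    rollback=lambda t: None,
)
```

With stages like these, a bad change is contained to the targets already touched, and the redundancy listed on the slide can still absorb the failure.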

  20. ...so we find and fix the firewall issue.

  21. It’s been a long day, but everything is back to normal.

  22. You are on the 6th floor patio. In front of you is <a PR approved beverage>. Exits are West. >

  23. You are on the 6th floor patio. In front of you is <a PR approved beverage>. Exits are West. > drink beverage

  24. You are on the 6th floor patio. You receive a page: “Blackbox is alerting in 8 cities” Exits are West. >

  25. You are on the 6th floor patio. You receive a page: “Blackbox is alerting in 8 cities” Exits are West. > exit east

  26. A few rebooted machines just got the bad firewall config back in a single datacenter. So the team failed over to another datacenter.

  27. Rule 3. Moving traffic is faster than fixing

  28. Rule 3. Moving traffic is faster than fixing. Diagram labels: Consistent, Available, Partition Tolerant.

  29. Rule 3. Moving traffic is faster than fixing. Diagram labels: “UNAVAILABLE”, Consistent, Available, Partition Tolerant.

  30. Rule 3. Moving traffic is faster than fixing. Diagram labels: client, client, “Can A client Proceed?”

  31. Rule 3. Moving traffic is faster than fixing. Diagram labels: client, client, “Can A client Proceed?”
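Moving traffic away from a failing datacenter can be sketched as a change to routing weights. The deck does not show how Uber's failover is implemented, so `drain` and the datacenter names below are hypothetical; the sketch only illustrates why the mitigation is quick: it changes a small table rather than diagnosing a fault.

```python
from typing import Dict


def drain(weights: Dict[str, float], failed: str) -> Dict[str, float]:
    """Return new routing weights with `failed` set to zero.

    The failed datacenter's share is redistributed to the remaining
    datacenters in proportion to their current weights, so the result
    sums to 1.0.
    """
    if failed not in weights:
        raise KeyError(f"unknown datacenter: {failed}")
    remaining = {dc: w for dc, w in weights.items() if dc != failed and w > 0}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy datacenter left to take traffic")
    new_weights = {dc: 0.0 for dc in weights}
    for dc, w in remaining.items():
        new_weights[dc] = w / total
    return new_weights


# Usage with invented names: dc1 is failing, so dc2 and dc3 absorb its share.
after = drain({"dc1": 0.5, "dc2": 0.25, "dc3": 0.25}, "dc1")
```

Note that the surviving datacenters take on more load than before, which is exactly the pressure that later slides say must be tested in advance.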

  32. ...the traffic moved to the new datacenter starts failing...

  33. ...the existing traffic in the new datacenter starts failing.

  34. User Authentication Flow (diagram). Logic: if token in nginx cache, do service req; else do fast-auth req. Component labels: Frontend, User Cache, Micro Services, nginx, Varnish, HAProxy, Health Checker, worker, service. Edge labels: health check, fast auth, service request.

  35. Load balance if request hostname is localhost

      backend userCache0 { … }
      backend userCache1 { … }
      …

      director userCache round-robin {
        { .backend = userCache0; }
        { .backend = userCache1; }
        …
        { .backend = userCache11; }
      }

      sub vcl_recv {
        if (req.http.host ~ "^localhost") {
          set req.backend = userCache;
        }
        …
      }
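The routing rule on slide 35 can be modelled in a few lines to make its effect easier to see. This is a sketch in Python rather than VCL, with hypothetical names (`make_director`, `choose_backend`); it assumes the elided backends are numbered consecutively from `userCache0` to `userCache11`, and it returns `None` for any request the visible rule does not cover, since the rest of `vcl_recv` is elided on the slide.

```python
import itertools
import re
from typing import Iterator, List, Optional


def make_director(backends: List[str]) -> Iterator[str]:
    """Round-robin director: yields backends in order, forever."""
    return itertools.cycle(backends)


def choose_backend(host: str, director: Iterator[str]) -> Optional[str]:
    """Mirror the visible rule: a Host header starting with "localhost"
    is spread across the user cache backends; anything else is left to
    rules that the slide does not show."""
    if re.match(r"localhost", host):
        return next(director)
    return None


# Assumption: twelve backends, userCache0 through userCache11.
director = make_director([f"userCache{i}" for i in range(12)])
first = choose_backend("localhost:8080", director)
second = choose_backend("localhost:8080", director)
```

In this model, two consecutive requests addressed to localhost land on different backends, so a request that looks local is not necessarily served by the local instance.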

  36. Rule 4. Make your mitigations normal

  37. Rule 4. Make your mitigations normal
      • Do test drills of your mitigations regularly
        • Test at peak traffic
        • Test without telling anyone else first
        • Make a plan of tests you need
        • Keep a log of what you’ve tested and when
      • Understand how mitigations affect your system
        • How much extra pressure does adding 25%, 50%, 100% extra traffic put on a datacenter?
        • Load test, and capacity plan based on failover needs at peak, not just peak
        • Load test even more often than you drill
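The question of how much pressure 25%, 50%, or 100% extra traffic puts on a datacenter reduces to simple arithmetic once peak load and capacity are known. The sketch below uses invented numbers and a hypothetical helper, `utilization_after_failover`; real capacity is rarely a single number, so treat this as a first approximation to check against load tests.

```python
def utilization_after_failover(peak_rps: float, capacity_rps: float, extra_fraction: float) -> float:
    """Fraction of capacity in use when a datacenter at peak absorbs extra traffic.

    `extra_fraction` is the added share of its own peak load: 0.25 for
    25% more, 1.0 for double. A result above 1.0 means the datacenter
    cannot take the failover at peak.
    """
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return peak_rps * (1 + extra_fraction) / capacity_rps


# Invented numbers: a datacenter that peaks at 600 req/s and is sized for 1000 req/s.
for extra in (0.25, 0.50, 1.00):
    load = utilization_after_failover(600, 1000, extra)
    status = "ok" if load <= 1.0 else "OVER CAPACITY"
    print(f"+{extra:.0%} traffic -> {load:.0%} of capacity ({status})")
```

In this example the datacenter looks comfortable at its own peak (60%) yet cannot absorb a full failover, which is the gap between planning for peak and planning for failover at peak.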

  38. Rule 1. Always Know When It’s Broken
      Rule 2. Avoid Global Changes
      Rule 3. Moving traffic is faster than fixing
      Rule 4. Make your mitigations normal
      Tom Croucher, Uber Site Reliability Engineering, tomc@uber.com, @sh1mmer
