Learn from the Uber Site Reliability Engineering team about the importance of knowing when things are broken, avoiding global changes, and the effectiveness of moving traffic. Discover how to make your mitigations normal and ensure the reliability of your systems.
Don't gamble when it comes to reliability Tom Croucher Uber Site Reliability Engineering tomc@uber.com @sh1mmer
You are in a 5th floor office. You overhear some office workers: “...service can’t talk to Redis”. Exits are North, South, East, and West. > south
You approach the elevator. You can smell lunch. You receive a page: “Blackbox is alerting in 6 cities” > commandeer conference room
You are in an incident war room. Everything is f̨̧͔̳̝̫̳̥̭͙̰̜͕̝̣̗͍̤̺̦̼̘̘̲̮͔̠̩̞ͅ ̲̻͕͖̬̞͔̣̹̮͈̝̗͓̗̲̫̘̜̱̫̦̭̼̭̙͇͖̭̥̼ͅ ̧̨̨̢̧̗̞̝̤̙̗̗̣͎̗̭̯̘̤̩̬̦͍̖͈͚̭͎̱̠͜ ̧͖̳̗͙̖̣̙̣̦̣̦̬̲͇̰̩̠̠̳̼͕̗̲̞̼̭ͅͅ ̨̡̡̞̣̗̳̱̻̮͔̩̘̳̲͕̰͎͓͔̜̳͕͇͙̠̯͓͓͜ ̢̢̳̘̗̤͕̦͕̞̮̟͎̞̭̠̳̙̻̤̰̺̻̼ͅͅͅ ̢̨̨̣͓̠̦͈͙̝̯̮̮̘̙̥͍͖̯̟̪͖̣̜̬̪̩̭ͅ ̧̢̲̺͎̠̙̰͖͈̻̹̘̦̳͇͚̘̩̼͚̰̹̦͓̰̟̞̥͈͚͍̹͎͙̙͕̥ ̧̡̡̧͔̝͎̭̩̞̞̭̼̺̳͚͎̙͉͕̥̗̲͕͓̻͎͖͇̟̣̭͔̙̞̖̙̬̼͜ͅ ̨̡̭̭͕̫̭̫̘̠͙̖̪̺̻̥̪̼̘̭̖̩͉̦͜͜͜ͅͅ >
Rule 1. Always Know When It’s Broken
[Diagram labels: Blackbox monitoring system, Test Trips, Uber Datacenter, Alerts, Tom’s Phone]
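The idea behind Rule 1 can be illustrated with a minimal sketch of a blackbox probe: run a synthetic test trip per city from outside the system and page when any fail. This is not Uber's Blackbox implementation; `run_probes`, `alert_message`, and the `test_trip` callback are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    city: str
    ok: bool


def run_probes(cities: List[str], test_trip: Callable[[str], bool]) -> List[ProbeResult]:
    """Run one synthetic test trip per city (test_trip is a hypothetical callback)."""
    results = []
    for city in cities:
        try:
            ok = test_trip(city)
        except Exception:
            ok = False  # a probe that crashes counts as a failed trip
        results.append(ProbeResult(city, ok))
    return results


def alert_message(results: List[ProbeResult]) -> Optional[str]:
    """Return a page message if any city is failing, otherwise None."""
    failing = [r.city for r in results if not r.ok]
    if not failing:
        return None
    noun = "city" if len(failing) == 1 else "cities"
    return f"Blackbox is alerting in {len(failing)} {noun}"
```

The probe only exercises the system the way a user would, so it can report that something is broken without knowing which internal component failed.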
While the team was fixing the Redis issue, the firewall change was pushed globally to the shared service cluster.
Rule 2. Avoid Global Changes
Unplanned failures, by layer:
Internet: network provider failure, e.g. dropping BGP routes, etc.
Network: switch/router bug/failure
Software: software bug
Compute: machine failure, TORS failure, etc.
Cooling: chiller failure, BMS failure, etc.
Power: grid failure, UPS failure, etc.
Planned changes (Network, Software, Compute, Cooling, Power): you deployed, you deployed, you messed with it
Rule 2. Avoid Global Changes
Having multiple network providers, availability zones, datacenters, racks, switches, routers, and service instances protects you against fate, or other people’s mistakes.
If you deploy all at once, nothing can protect you from your own mistakes.
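One way to avoid deploying all at once is a staged rollout that stops at the first unhealthy stage. The sketch below is an illustration under assumptions, not the process described in the talk: `staged_rollout` and its `apply_change`, `is_healthy`, and `rollback` callbacks are hypothetical.

```python
from typing import Callable, List


def staged_rollout(
    stages: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change one stage at a time.

    Returns the stages where the change remains applied. If any stage turns
    unhealthy, every stage touched so far is rolled back and later stages
    are never touched.
    """
    applied: List[str] = []
    for stage in stages:
        apply_change(stage)
        applied.append(stage)
        if not is_healthy(stage):
            for done in reversed(applied):
                rollback(done)
            return []
    return applied
```

With stages ordered from smallest to largest blast radius (for example one rack, then one datacenter, then the rest), a bad change is caught while the untouched stages can still absorb traffic.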
It’s been a long day, but everything is back to normal.
You are on the 6th floor patio. In front of you is <a PR approved beverage>. Exits are West. > drink beverage
You are on the 6th floor patio. You receive a page: “Blackbox is alerting in 8 cities” Exits are West. > exit east
In a single datacenter, a few rebooted machines picked up the bad firewall config again, so the team failed over to another datacenter.
Rule 3. Moving traffic is faster than fixing
[Diagram: the CAP triangle (Consistent, Available, Partition Tolerant), labeled “UNAVAILABLE”]
Rule 3. Moving traffic is faster than fixing
[Diagram: two clients. Can a client proceed?]
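Moving traffic can be as simple as changing routing weights. The following minimal sketch assumes traffic is split across datacenters by fractional weights; `drain_datacenter` and the datacenter names are hypothetical and do not describe Uber's actual failover tooling.

```python
from typing import Dict


def drain_datacenter(weights: Dict[str, float], failed: str) -> Dict[str, float]:
    """Send no traffic to `failed`; split all traffic among the other
    datacenters in proportion to their current weights."""
    if failed not in weights:
        raise KeyError(failed)
    remaining = {dc: w for dc, w in weights.items() if dc != failed and w > 0}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy datacenter left to take traffic")
    new_weights = {dc: 0.0 for dc in weights}
    for dc, w in remaining.items():
        new_weights[dc] = w / total  # normalized so the weights sum to 1
    return new_weights
```

For example, draining `dc1` from `{"dc1": 0.5, "dc2": 0.25, "dc3": 0.25}` leaves `dc2` and `dc3` with half of the traffic each, so each of them sees double its previous load.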
...the traffic moved to the new datacenter starts failing...
...the existing traffic in the new datacenter starts failing.
User Authentication Flow
if token in nginx cache: do service req; else: do fast-auth req
[Diagram labels: Frontend, User Cache, Micro Services, nginx workers, Varnish workers, service workers, HAProxy, Health Checker; arrows labeled health check, fast auth, service request]
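The branching rule on this slide can be sketched in a few lines. This only mirrors the slide's pseudocode; `route_request`, `token_cache`, and the two handler callbacks are hypothetical stand-ins, and what happens after a fast-auth request (for instance, whether the token is then cached) is not stated in the slides and is not modeled here.

```python
from typing import Callable, Set


def route_request(
    token: str,
    token_cache: Set[str],
    service_request: Callable[[str], str],
    fast_auth_request: Callable[[str], str],
) -> str:
    """If the token is in the cache, do the service request;
    otherwise do the fast-auth request."""
    if token in token_cache:
        return service_request(token)
    return fast_auth_request(token)
```

One thing the sketch makes visible: every token that is missing from the local cache takes the fast-auth branch.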
Load balance if request hostname is localhost

    backend userCache0 { … }
    backend userCache1 { … }
    …
    director userCache round-robin {
        { .backend = userCache0; }
        { .backend = userCache1; }
        …
        { .backend = userCache11; }
    }

    sub vcl_recv {
        if (req.http.host ~ "^localhost") {
            set req.backend = userCache;
        }
        …
    }
Rule 4. Make your mitigations normal
• Do test drills of your mitigations regularly
• Test at peak traffic
• Test without telling anyone else first
• Make a plan of tests you need
• Keep a log of what you’ve tested and when
• Understand how mitigations affect your system
• How much extra pressure does adding 25%, 50%, 100% extra traffic put on a datacenter?
• Load test, and capacity plan based on failover needs at peak, not just peak
• Load test even more often than you drill
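The capacity question in the list above comes down to simple arithmetic. The sketch below assumes load and capacity are measured in the same unit (for example requests per second); the function names and the numbers in the usage note are illustrative, not figures from the talk.

```python
def load_after_failover(peak_load: float, extra_fraction: float) -> float:
    """Load on a datacenter at peak after absorbing extra traffic.

    extra_fraction is the added share of its own peak,
    e.g. 0.25, 0.5, or 1.0 for 25%, 50%, or 100% extra traffic.
    """
    return peak_load * (1.0 + extra_fraction)


def has_headroom(capacity: float, peak_load: float, extra_fraction: float) -> bool:
    """True if measured capacity covers peak load plus the failover traffic."""
    return load_after_failover(peak_load, extra_fraction) <= capacity
```

As a hypothetical example, a datacenter load tested to 1500 requests per second with a normal peak of 1000 can absorb 25% or 50% extra traffic, but not 100%, which would require 2000. Planning for peak alone would have reported plenty of headroom.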
Rule 1. Always Know When It’s Broken
Rule 2. Avoid Global Changes
Rule 3. Moving traffic is faster than fixing
Rule 4. Make your mitigations normal
Tom Croucher, Uber Site Reliability Engineering, tomc@uber.com, @sh1mmer