Sorry... ... but this will be the only slide in Lithuanian
When things break: it’s not a big deal! Resilience and Remediation
Agenda • About me • Intro to resilience • Credits to authors • Let it crash! • Distributed resilience • Remediation patterns • QA
About me • BS & MS in Software Engineering, KTU • Developer for 10+ years… • …and team lead, tech lead, consultant, agile pioneer, certified scrum master, architect • Backend and distributed systems developer for the last 4+ years
hard.core team in BI department • Data warehouse and business intelligence • Tens of TB of information • 500 million transactions per day • Various availability requirements • Always online, always consistent, always up to date
Reliability • a.k.a. availability • The most critical component has two nines (0.99) • Gmail had 0.99984 in 2010
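To put those availability figures in perspective, here is a minimal sketch that converts an availability fraction into expected downtime per year. The function names are made up for illustration, and the serial formula assumes component failures are independent, which is rarely exactly true.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year for an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes failures are independent (an illustrative simplification).
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    for label, availability in [("two nines", 0.99), ("0.99984", 0.99984)]:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: about {minutes:.0f} minutes of downtime per year")
```

Two nines allow roughly 3.65 days of downtime per year, while 0.99984 allows about 84 minutes. Chaining two 0.99 components in series already drops the total to about 0.98.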
QCon London 2011 • My goal • Bring new ideas • Leave comfort zone • Evolve • Visit it! • InfoQ.com • Credits to authors
Let it crash! Why should we?
Let it crash! Erlang is a programming language used to build massively scalable real-time systems with requirements on high availability
Let it Crash! Idea behind • The world is concurrent • Things in the world don’t share data • Things communicate with messages • Things fail
Let it crash! Erlang • Runtime system with functional language • Simple threading • Message passing (no locks!) • CouchDB, Riak, Facebook chat (ejabberd), RabbitMQ, GitHub • We don’t use it :)
Let it Crash! Defensive programming • How to do it - write code to… • solve a problem • check all possible inputs • handle all possible errors/exceptions

    // Defensive version: validates inputs and handles every exception
    public void DoStuff(string message, Action<string> action)
    {
        if (String.IsNullOrEmpty(message))
            throw new ArgumentNullException("message", "The message is null or empty.");
        if (action == null)
            throw new ArgumentNullException("action", "The action is null.");
        try
        {
            action(message);
        }
        catch (BoringMessageException)
        {
            return;
        }
        catch (OffensiveMessageException exc)
        {
            this.ReportOffender(SystemState.CurrentUser);
            return;
        }
        catch (ActionMonkeyIsBusyException exc)
        {
            throw new RetryLaterException(exc, DateTime.Now.AddYears(1), action, message);
        }
        catch
        {
            throw;
        }
    }

    // Input checks only
    public void DoStuff(string message, Action<string> action)
    {
        if (String.IsNullOrEmpty(message))
            throw new ArgumentNullException("message", "The message is null or empty.");
        if (action == null)
            throw new ArgumentNullException("action", "The action is null.");
        action(message);
    }

    // Just solve the problem
    public void DoStuff(string message, Action<string> action)
    {
        action(message);
    }
Let it Crash! Defensive programming • How to do it - write code to… • solve a problem • check all possible inputs • handle all possible errors/exceptions • Result • More code, more bugs • Obscure code, unclear logic • Error handling is poorly tested • It is very hard to defend against everything
Let it Crash! Erlang way of managing applications
Let it Crash! According to Joe Armstrong • Do not code defensively • If you can’t do what you want, die! • Nobody should stop you from crashing • Let another process do the recovery • But be reasonable…
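The "die and let another process recover" idea can be sketched outside Erlang too. The following is a rough Python illustration, not Erlang's actual supervisor behaviour: the worker contains no error handling at all, and a separate `supervise` function (a hypothetical name) restarts it a limited number of times before escalating.

```python
from typing import Callable, List


class RestartLimitExceeded(Exception):
    """Raised when the worker keeps crashing; the failure escalates upward."""


def supervise(worker: Callable[[], str], max_restarts: int = 3) -> str:
    """Run the worker; on a crash, restart it up to max_restarts times."""
    crashes: List[Exception] = []
    while True:
        try:
            return worker()
        except Exception as exc:  # the supervisor, not the worker, handles failure
            crashes.append(exc)
            if len(crashes) > max_restarts:
                raise RestartLimitExceeded(
                    f"gave up after {len(crashes)} crashes"
                ) from exc


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a demo worker that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def worker() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            # No defensive code here: the worker simply dies.
            raise RuntimeError("cannot do what I want, so I die")
        return f"done on attempt {state['calls']}"

    return worker


if __name__ == "__main__":
    print(supervise(make_flaky_worker(2)))
```

The restart limit is the "be reasonable" part: a worker that never recovers should not be restarted forever, so the supervisor gives up and passes the problem on.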
Story time! Legacycache
Let it crash! Couple of hints
Let it Crash! My takeaways • “Let it crash” is not • a design crutch • an excuse to lose vital data • Principle of Least Surprise • Handle what you can • Let someone else do the rest
Distributed Systems, Databases and Resilience … or just Distributed Resilience
Distributed Resilience Plan your failures • You can usually prevent full system crash • But how will it behave on partial failure? • Plan and understand… • …before the users tell you • You think you know what will break… • … you’re probably wrong
Distributed Resilience Failure is not bad • Best way to avoid failures is to fail constantly! • Netflix “Chaos Monkey” • Navigates in infrastructure • Kills random processes • Monitors how system recovers
Distributed Resilience Harvest and yield • Harvest is how much of your data an answer reflects • Yield is the probability that a request gets an answer
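As a rough illustration of the two metrics, following the usual definitions (yield as completed requests over offered requests, harvest as data reflected in the answer over the complete data), here is a small sketch. The function names and the four-shard scenario are made up for the example.

```python
def yield_fraction(completed_requests: int, offered_requests: int) -> float:
    """Fraction of offered requests that got an answer."""
    if offered_requests == 0:
        return 1.0  # nothing was asked, so nothing failed
    return completed_requests / offered_requests


def harvest_fraction(shards_answering: int, shards_total: int) -> float:
    """Fraction of the complete data reflected in an answer."""
    if shards_total <= 0:
        raise ValueError("shards_total must be positive")
    return shards_answering / shards_total


if __name__ == "__main__":
    # Hypothetical scenario: 4 shards, 1 is down, 1000 requests arrive.
    # Option A: answer everyone from the 3 live shards.
    print("A:", yield_fraction(1000, 1000), harvest_fraction(3, 4))
    # Option B: refuse the 250 requests that need the dead shard.
    print("B:", yield_fraction(750, 1000), harvest_fraction(4, 4))
```

The same partial failure can be spent either on harvest (everyone gets a slightly incomplete answer) or on yield (some users get no answer at all). Which trade is acceptable depends on the data.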
Distributed Resilience It’s just a flesh wound! Problems • Finding single points of failure • Security • Configuration • Administration, Shared, Central • Removing • Hard to avoid, even harder to remove • Minimization is the target
Distributed Resilience Share nothing - sharding • Pros • Index sizes • Security (out of the box)? • Spreads hardware risk • Cons • More hardware • Harder than it looks • Automatic sharding is even harder • Shared info • Complex releases [Diagram labels: Proxy, Config]
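A minimal sketch of hash-based shard routing, assuming the simplest possible scheme (hash modulo shard count). The key format and function names are hypothetical. It also hints at why automatic resharding is hard: with this naive scheme, changing the shard count moves most keys.

```python
import hashlib
from typing import Iterable


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's hash())."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_keys(keys: Iterable[str], old_count: int, new_count: int) -> int:
    """Count keys that land on a different shard after resizing."""
    return sum(
        1 for key in keys if shard_for(key, old_count) != shard_for(key, new_count)
    )


if __name__ == "__main__":
    keys = [f"customer-{i}" for i in range(1000)]
    print("keys moved when growing from 4 to 5 shards:", moved_keys(keys, 4, 5))
```

Schemes such as consistent hashing reduce the number of moved keys, at the cost of more moving parts in the proxy and configuration.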
Distributed resilience Load balancing • Pros • Rather easy • Lots of products • Cons • Sessions are a killer • Configuration [Diagram: four Service nodes behind the balancer]
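The "rather easy" part can be sketched as a round-robin balancer that skips backends marked as down. This is an illustrative toy with hypothetical names, not how any particular product works. Note that it keeps no per-user state, which is exactly why server-side sessions hurt: the next request from the same user may land on a different service.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hand out backends in turn, skipping the ones marked unhealthy."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        self._healthy: Set[str] = set(self._backends)
        self._next = 0

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self) -> str:
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["service-1", "service-2", "service-3"])
    balancer.mark_down("service-2")
    print([balancer.pick() for _ in range(4)])
```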
Distributed Resilience Mirrors & Replication • Pros – Various! • Lots of products • Very secure? • Very fast? • Cons – Various! • Hardware • Slow? • Inconsistent?
Distributed resilience NoSQL • Not (always) ACID • Atomicity • Consistency • Isolation • Durability • BASE • Basically Available • Soft state • Eventual consistency • CAP theorem • a.k.a. Brewer's theorem • Consistency • Availability • Partition tolerance • Examples: CouchDB, HBase, Memcached, Cassandra, Redis, MarkLogic, BigTable, Riak, MongoDB, SimpleDB
Next slide :) Sample time!
Samples from adform • Sensitive statistics • Real time bidding [Diagram labels: Google, User sees, Node1, Node2, Bid Service1, Bid Service2]
Remediation patterns For better releases
Remediation patterns Vocabulary • Remediation • Recovery to known state after a failed release • Recovery • Returning system to working state • “Fixing sh*t when it breaks” • It’s all about… • Prevention • Patterns of low risk release • Patterns of incremental delivery Yea!
Remediation patterns Background • A release is a risky operation • The best way to fail a release is to do it only once • Don’t touch it if it works! • Agile • Time to market • Continuous deployment • 20 releases in 2 weeks @ adform
Remediation patterns Prevention
Remediation patterns Problems • The hard bits: • Testing in the production environment • Creating maintainable acceptance tests • Testing cross-functional requirements
Remediation patterns Reducing risk • Canary releasing • Partial release • Observe effects • Release for the rest
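A canary release needs a way to send a stable subset of users to the new version. One common approach, sketched here with hypothetical function names, is to hash the user id into one of 100 buckets and compare the bucket with the rollout percentage. Raising the percentage then only adds users; nobody flips back and forth.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Stable bucket in the range 0..99 derived from the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_canary(user_id: str, percent: int) -> bool:
    """True if this user should see the canary at the given rollout percentage."""
    return canary_bucket(user_id) < percent


def pick_version(user_id: str, percent: int) -> str:
    return "canary" if in_canary(user_id, percent) else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 5, 50, 100):
        count = sum(in_canary(user, percent) for user in users)
        print(f"{percent}% rollout -> {count} of {len(users)} users on canary")
```

Observing the effects is then a matter of comparing error rates and business metrics between the two groups before releasing to the rest.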
Remediation patterns Reducing risk • Dark launch • Release to invisible infrastructure • Direct some of the real load to the dark side
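A dark launch can be sketched as request mirroring: the live service always answers the user, and a fraction of requests is also replayed against the dark service, whose result and errors are recorded but never returned. All names here are hypothetical, and a real setup would usually mirror asynchronously so the dark side cannot add latency.

```python
import random
from typing import Any, Callable, List


def handle_with_dark_launch(
    request: str,
    live: Callable[[str], str],
    dark: Callable[[str], str],
    mirror_fraction: float,
    dark_errors: List[Exception],
    rng: Any = random,
) -> str:
    """Serve from the live service; mirror a share of traffic to the dark one."""
    response = live(request)
    if rng.random() < mirror_fraction:
        try:
            dark(request)  # result is discarded on purpose
        except Exception as exc:
            dark_errors.append(exc)  # record it, never show it to the user
    return response


if __name__ == "__main__":
    def live_service(request: str) -> str:
        return f"live:{request}"

    def dark_service(request: str) -> str:
        raise RuntimeError("new version is not ready yet")

    errors: List[Exception] = []
    answers = [
        handle_with_dark_launch(f"r{i}", live_service, dark_service, 0.2, errors)
        for i in range(100)
    ]
    print(len(answers), "users served,", len(errors), "dark failures recorded")
```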
Remediation patterns Reducing risk • Feature toggles • Develop on trunk, or else… • Feature toggle / branch by abstraction
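A feature toggle in its simplest form is a named boolean checked at the point where old and new behaviour diverge, so unfinished work can live on trunk while staying switched off. The sketch below uses made-up names (`FeatureToggles`, `new_pricing`) and an in-memory dictionary; real setups typically load the flags from configuration.

```python
from typing import Dict, Mapping


class FeatureToggles:
    """In-memory toggle store; unknown features default to off."""

    def __init__(self, flags: Mapping[str, bool]) -> None:
        self._flags: Dict[str, bool] = dict(flags)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def price_quote(amount: float, toggles: FeatureToggles) -> float:
    """Old and new code paths coexist; the toggle picks one at run time."""
    if toggles.is_enabled("new_pricing"):  # hypothetical feature name
        return round(amount * 0.9, 2)
    return amount


if __name__ == "__main__":
    toggles = FeatureToggles({"new_pricing": False})
    print(price_quote(100.0, toggles))
    toggles.set("new_pricing", True)  # switch on without a new release
    print(price_quote(100.0, toggles))
```

Switching the toggle off again is also the cheapest remediation: the release stays in place and only the risky path is disabled.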
Monitoring Separate slide, cause it is so damn important! • Monitoring is essential • Remote watchdogs • Watchdogs for watchdogs • Business metrics are essential • Root cause analysis • The game: why? why? why? • Root cause graph
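The "watchdogs for watchdogs" idea can be sketched with heartbeats: every watched thing reports in regularly, and anything silent for longer than a timeout is flagged. The watchdog itself sends heartbeats to a second watchdog. Class and service names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List


class Watchdog:
    """Flags anything that has not sent a heartbeat within the timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def beat(self, name: str, now: float) -> None:
        """Record a heartbeat from the named component at time `now`."""
        self._last_seen[name] = now

    def stale(self, now: float) -> List[str]:
        """Names whose last heartbeat is older than the timeout, sorted."""
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    services = Watchdog(timeout_seconds=30)
    meta = Watchdog(timeout_seconds=60)  # watches the watchdog

    services.beat("bid-service", now=0)
    meta.beat("services-watchdog", now=0)

    print("silent services at t=45:", services.stale(now=45))
    print("silent watchdogs at t=100:", meta.stale(now=100))
```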
Next slide :) Sample time!
[Root cause graph labels] No tests in front • Testing identified 2 fires • Ownership problems • Delayed checklist • No review on checklists • Different knowledge levels • Only one person writing checklist • Developers don’t use it
Final word • Troll appears • Things break • Learn to bend (one way or another)
Almost done Shameless ads
We are recruiting! • 100+ in Lithuania • 60+ in development • Architects • Analytics • Programmers • QA • http://www.adform.com/site/company/careers/