Sorry... but this slide will be the only slide in Lithuanian

When things break: it's not a big deal! Resilience and Remediation

Agenda • About me • Intro to resilience • Credits to authors • Let it crash! • Distributed resilience • Remediation patterns • Q&A

About me • BS & MS in Software Engineering, KTU • Developer for 10+ years… • …and team lead, tech lead, consultant, agile pioneer, certified scrum master, architect • Backend and distributed systems developer for the last 4+ years

hard.core team in BI department • Data warehouse and business intelligence • Tens of TB of information • 500 million transactions per day • Various availability requirements • Always online, always consistent, always up to date

Reliability • a.k.a. availability • Most critical component has two nines (.99) • Gmail had 0.99984 @ 2010
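
To make the nines concrete, an availability fraction can be converted into expected downtime per year. The sketch below is only illustrative arithmetic; the function name and the 365-day year are assumptions, not part of the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# The two figures quoted on the slide.
for label, availability in [("two nines", 0.99), ("Gmail 2010", 0.99984)]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label}: about {minutes:.0f} minutes of downtime per year")
```

Two nines allow roughly 5,256 minutes (about 3.65 days) of downtime a year, while 0.99984 allows about 84 minutes.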

QCon London 2011 • My goal • Bring new ideas • Leave comfort zone • Evolve • Visit it! • InfoQ.com • Credits to authors

Things break, live with it

Let it crash! Why should we?

Let it crash! Erlang is a programming language used to build massively scalable real-time systems with requirements on high availability

Let it Crash! Idea behind • The world is concurrent • Things in the world don’t share data • Things communicate with messages • Things fail

Let it crash! Erlang • Runtime system with functional language • Simple threading • Message passing (no locks!) • CouchDB, Riak, Facebook chat (ejabberd), RabbitMQ, GitHub • We don’t use it :)

Let it Crash! Defensive programming • How to do it - write code to… • solve a problem • check all possible inputs • handle all possible errors/exceptions

    // Step 1: fully defensive, every input checked and every exception handled.
    public void DoStuff(string message, Action<string> action)
    {
        if (String.IsNullOrEmpty(message))
            throw new ArgumentNullException("message", "The message is null or empty.");
        if (action == null)
            throw new ArgumentNullException("action", "The action is null.");
        try
        {
            action(message);
        }
        catch (BoringMessageException)
        {
            return;
        }
        catch (OffensiveMessageException)
        {
            this.ReportOffender(SystemState.CurrentUser);
            return;
        }
        catch (ActionMonkeyIsBusyException exc)
        {
            throw new RetryLaterException(exc, DateTime.Now.AddYears(1), action, message);
        }
        catch
        {
            throw;
        }
    }

    // Step 2: exception handling removed, input checks kept.
    public void DoStuff(string message, Action<string> action)
    {
        if (String.IsNullOrEmpty(message))
            throw new ArgumentNullException("message", "The message is null or empty.");
        if (action == null)
            throw new ArgumentNullException("action", "The action is null.");
        action(message);
    }

    // Step 3: only the code that solves the problem.
    public void DoStuff(string message, Action<string> action)
    {
        action(message);
    }

Let it Crash! Defensive programming • How to do it - write code to… • solve a problem • check all possible inputs • handle all possible errors/exceptions • Result • More code, more bugs • Obscure code, unclear logic • Error handling is poorly tested • It is very hard to defend against everything

Let it Crash! Erlang way of managing applications

Let it Crash! According to Joe Armstrong • Do not code defensively • If you can’t do what you want, die! • Nobody should stop you from crashing • Let another process do the recovery • But be reasonable…
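
The idea that the worker simply dies and another party handles recovery can be sketched in a few lines. This is an in-process analogy only, not Erlang/OTP supervision: `supervise`, `make_flaky_worker`, and the restart limit are hypothetical names chosen for illustration.

```python
from typing import Callable


class RestartLimitExceeded(Exception):
    """Raised when the worker keeps crashing and the supervisor gives up."""


def supervise(worker: Callable[[], str], max_restarts: int = 3) -> str:
    """Run worker; on any crash start it again, at most max_restarts times."""
    crashes = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            crashes += 1
            if crashes > max_restarts:
                # "Be reasonable": escalate instead of restarting forever.
                raise RestartLimitExceeded(f"gave up after {crashes} crashes") from exc


def make_flaky_worker(failures_before_success: int) -> Callable[[], str]:
    """Build a worker that crashes a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def worker() -> str:
        # No defensive code here: if the work cannot be done, just die.
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("cannot do what I want, dying")
        return "done"

    return worker


print(supervise(make_flaky_worker(2)))  # crashes twice, third start succeeds
```

The worker contains no error handling at all; the restart policy lives in one place, which is the part that would be tested and tuned.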

Story time! Legacycache

Let it crash! Couple of hints

Let it Crash! My takeaways • “Let it crash” is not • a design crutch • an excuse to lose vital data • Principle of Least Surprise • Handle what you can • Let someone else do the rest

Distributed Systems, Databases and Resilience … or just Distributed Resilience

Distributed Resilience Plan your failures • You can usually prevent a full system crash • But how will it behave on a partial failure? • Plan and understand… • …before the users tell you • You think you know what will break… • … you’re probably wrong

Distributed Resilience Failure is not bad • The best way to avoid failures is to fail constantly! • Netflix “Chaos Monkey” • Roams the infrastructure • Kills random processes • Monitors how the system recovers

Distributed Resilience Harvest and yield • Harvest is the fraction of your data reflected in a response • Yield is the probability that a request completes
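
A small worked example may help here, using the usual reading of the two terms: yield as completed requests over all requests, harvest as data available over all data. The numbers (4 shards, 1 of them down, 1,000 requests) and the function names are assumptions made up for the illustration.

```python
def yield_fraction(completed_requests: int, total_requests: int) -> float:
    """Share of requests that got an answer at all."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return completed_requests / total_requests


def harvest_fraction(shards_answered: int, shards_total: int) -> float:
    """Share of the data that an answer actually reflects."""
    if shards_total <= 0:
        raise ValueError("shards_total must be positive")
    return shards_answered / shards_total


# One of four shards is down; every query fans out to all shards.
# Option A: refuse any query that cannot see all data (protect harvest).
option_a = (yield_fraction(0, 1000), 1.0)
# Option B: answer from the three live shards (protect yield).
option_b = (yield_fraction(1000, 1000), harvest_fraction(3, 4))

print("refuse partial answers: yield, harvest =", option_a)
print("serve partial answers:  yield, harvest =", option_b)
```

On a partial failure the system has to give up one of the two, and which one is acceptable depends on the feature: search results tolerate a lower harvest, a payment usually does not.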

Distributed Resilience It’s just a flesh wound! Problems • Finding single points of failure • Security • Configuration • Administration, Shared, Central • Removing • Hard to avoid, even harder to remove • Minimization is the target

Distributed Resilience Share nothing - sharding • Pros • Index sizes • Security (OotB)? • Spreads HW risk • Cons • More HW • Harder than it looks • Auto is even harder • Shared info • Complex releases [Diagram: Proxy, Config]
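
A minimal sketch of hash-based sharding shows both the appeal and one reason why automatic resharding is hard. The key format, the shard counts, and `shard_for` are assumptions for illustration; real systems often use consistent hashing or a lookup service instead of a plain modulo.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's hash())."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


keys = [f"customer-{i}" for i in range(1000)]

# Every proxy computes the same shard for the same key, with no shared state.
print("customer-42 lives on shard", shard_for("customer-42", 4))

# Adding a fifth shard changes the home of most keys under plain modulo.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved} of {len(keys)} keys move when going from 4 to 5 shards")
```

With plain modulo hashing, growing from 4 to 5 shards relocates the large majority of keys, which is part of why sharding is harder than it looks and why releases that touch the shard layout get complex.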

Distributed resilience Load balancing • Pros • Rather easy • Lots of products • Cons • Sessions are the killer • Configuration [Diagram: four Service nodes]
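
The "rather easy" part can be sketched as a round-robin picker that skips nodes marked as down. The class and node names are hypothetical, and health checking itself is left out.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hand out nodes in turn, skipping the ones currently marked down."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self._nodes: List[str] = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._healthy: Set[str] = set(self._nodes)
        self._next = 0

    def mark_down(self, node: str) -> None:
        self._healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self._nodes:
            self._healthy.add(node)

    def pick(self) -> str:
        # Try each node at most once per call, starting where we left off.
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in self._healthy:
                return node
        raise RuntimeError("no healthy nodes")


balancer = RoundRobinBalancer(["service-1", "service-2", "service-3"])
balancer.mark_down("service-2")
print([balancer.pick() for _ in range(4)])
```

The same caller lands on a different node on each request, so any session state kept in a node's memory is lost. That is the "session" problem from the slide: sessions have to be made sticky, shared, or moved out of the services.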

Distributed Resilience Mirrors & Replication • Pros – Various! • Lots of products • Very secure? • Very fast? • Cons – Various! • Hardware • Slow? • Inconsistent?

Distributed resilience NoSQL • Not (always) ACID • Atomicity • Consistency • Isolation • Durability • BASE • Basically Available • Soft state • Eventually consistent • CAP theorem • a.k.a. Brewer's theorem • Consistency • Availability • Partition tolerance • Products shown: CouchDB, HBase, Memcached, Cassandra, Redis, MarkLogic, BigTable, Riak, MongoDB, SimpleDB

Next slide :) Sample time!

Samples from Adform • Sensitive statistics • Real-time bidding [Diagram: Google, User sees, Node1, Node2, Bid Service1, Bid Service2]

Remediation patterns For better releases

Remediation patterns Vocabulary • Remediation • Recovery to known state after a failed release • Recovery • Returning system to working state • “Fixing sh*t when it breaks” • It’s all about… • Prevention • Patterns of low risk release • Patterns of incremental delivery Yea!

Remediation patterns Background • A release is a risky operation • The best way to fail a release is to do it only once • Don’t touch it if it works! • Agile • Time to market • Continuous deployment • 20 releases in 2 weeks @ Adform

Remediation patterns Prevention

Remediation patterns Problems • The hard bits: • Testing in the production environment • Creating maintainable acceptance tests • Testing cross-functional requirements

Remediation patterns Reducing risk • Canary releasing • Partial release • Observe effects • Release for the rest
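
One way to implement the partial release is to place each user in a stable bucket and send only the lowest buckets to the new version. This is a sketch under assumptions: `in_canary`, the 100-bucket scheme, and the user id format are illustrative, not how any particular product does it.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Decide, stably per user, whether this user gets the new release."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < canary_percent


users = [f"user-{i}" for i in range(10_000)]

# Step 1: partial release, then observe the effects on this group.
canary_users = [u for u in users if in_canary(u, 10)]
print(f"{len(canary_users)} of {len(users)} users see the new version")

# Step 2: release for the rest by raising the percentage to 100.
assert all(in_canary(u, 100) for u in users)
```

Because the bucket depends only on the user id, a user who is in the canary at 10% stays in it at 50%, so nobody flips back and forth while the percentage is raised.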

Remediation patterns Reducing risk • Dark launch • Release to invisible infrastructure • Direct some of real load to dark-side
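
Directing part of the real load to the dark side can be sketched as a wrapper that mirrors a fraction of requests to the new code, discards its answer, and never lets its failures reach the user. All names and the synchronous call are assumptions for illustration; a real setup would mirror asynchronously so the dark path cannot add latency.

```python
import random
from typing import Callable, List

Handler = Callable[[str], str]


def handle_with_dark_launch(
    request: str,
    live: Handler,
    dark: Handler,
    dark_fraction: float,
    rng: random.Random,
    dark_errors: List[Exception],
) -> str:
    """Serve from live; mirror a share of requests to dark and ignore its result."""
    if rng.random() < dark_fraction:
        try:
            dark(request)  # result is thrown away, users never see it
        except Exception as exc:
            dark_errors.append(exc)  # recorded for monitoring, not re-raised
    return live(request)


def live_handler(request: str) -> str:
    return f"live:{request}"


def broken_dark_handler(request: str) -> str:
    raise RuntimeError("new code is not ready yet")


errors: List[Exception] = []
rng = random.Random(0)
responses = [
    handle_with_dark_launch(f"req-{i}", live_handler, broken_dark_handler, 0.2, rng, errors)
    for i in range(100)
]
print(f"{len(responses)} live responses served, {len(errors)} dark failures recorded")
```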

Remediation patterns Reducing risk • Feature toggles • Develop on trunk, or else… • Feature toggle / branch by abstraction
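
A feature toggle in its simplest form is a named switch checked at the point where old and new behaviour diverge, so unfinished work can live on trunk while switched off. The toggle name, the `FeatureToggles` class, and the price formats below are hypothetical.

```python
from typing import Iterable, Set


class FeatureToggles:
    """In-memory set of enabled feature names (a config file or DB in practice)."""

    def __init__(self, enabled: Iterable[str] = ()) -> None:
        self._enabled: Set[str] = set(enabled)

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)


def price_label(amount: float, toggles: FeatureToggles) -> str:
    """Both code paths ship together; the toggle picks one at run time."""
    if toggles.is_enabled("new_price_format"):
        return f"EUR {amount:,.2f}"
    return f"{amount:.2f} EUR"


toggles = FeatureToggles()
print(price_label(1234.5, toggles))  # old behaviour, toggle is off
toggles.enable("new_price_format")
print(price_label(1234.5, toggles))  # new behaviour, no redeploy needed
```

Switching the toggle off again is also the remediation step: the release is rolled back by configuration rather than by a new deployment.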

Monitoring A separate slide, because it is so damn important! • Monitoring is essential • Remote watchdogs • Watchdogs for watchdogs • Business metrics are essential • Root cause analysis • The game: why? why? why? • Root cause graph
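
"Watchdogs for watchdogs" can be sketched with heartbeats: every watchdog reports in regularly, and a second-level check flags any watchdog that has gone silent. The class, the watchdog names, and the 30-second timeout are assumptions; timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List


class Heartbeats:
    """Track the last heartbeat per component and report the silent ones."""

    def __init__(self, timeout_seconds: float) -> None:
        self._timeout = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def beat(self, name: str, now: float) -> None:
        self._last_seen[name] = now

    def stale(self, now: float) -> List[str]:
        """Names whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._timeout
        )


monitor = Heartbeats(timeout_seconds=30)
monitor.beat("disk-watchdog", now=0)
monitor.beat("queue-watchdog", now=0)
monitor.beat("disk-watchdog", now=40)  # queue-watchdog has stopped reporting

print("silent watchdogs at t=45:", monitor.stale(now=45))
```

A watchdog that dies quietly looks exactly like a healthy system, which is why the check on the watchdogs themselves should run somewhere remote from the things it watches.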

Next slide :) Sample time!

[Root cause graph] • No tests in front • Testing identified 2 fires • Ownership problems • Delayed checklist • No review on checklists • Different knowledge levels • Only one person writing checklist • Developers don’t use it

Final word Troll appears Things break Learn to bend (one way or another)

Almost done Shameless ads

We are recruiting! • 100+ in Lithuania • 60+ in development • Architects • Analytics • Programmers • QA • http://www.adform.com/site/company/careers/
