Case Study How shifting to a DevOps Culture Enabled Performance and Capacity Improvements
Greg Burton, Senior Software Engineer, Hotel Infrastructure and Platform. Ori Rawlings, Senior Software Engineer, Hotel Shopping and Optimization.
Over $11.4B in total gross bookings in 2013 • Launched in 2001 • ~1,600 employees • Chicago, IL
What are we talking about today? Capacity! Specifically, 2x the capacity.
What are we talking about today? Development, Operations, and the stuff the business needed to get done.
Legacy technology • 10+ years old • 3+ version control systems • 2+ RPC frameworks • 3+ JDK major versions • Developers cycle through, move on to other projects
The Chasm of Responsibility. Development: • Designing, building, and testing end user features • Patching bugs in apps • Requesting production changes. Operations: • Responding to pages during on-call rotation • Monitoring site health and performance • Deploying changes to production • Network configuration • Database administration • Racking and bootstrapping servers. But who owns the work that falls in between?
The Chasm of Responsibility. Who owns these, Development or Operations? • Performance tuning and optimization? • Performance regression testing? • Identifying capacity bottlenecks? • Maintenance of legacy performance tuning?
Sourcing several areas of expertise. Development brings: • Expertise on app internals • Knowing which features are important or can be changed • How services interact and collaborate to achieve functionality. Operations brings: • A rough sense of where the pressure and pain points are in the infrastructure • Understanding of JVM tuning • A sense of hardware capability.
Shared goal for Dev and Ops teams Double the hotel search capacity of our entire stack
Multiple limiting factors stood between us and the search capacity goal: various hosts, apps, and databases in the search stack.
Limiting factor: database capacity (1). Qaiser, Database Architect: database load exceeds maximum recommended levels during peak traffic periods. Limited options: no budget to buy additional database capacity. Frustration: not familiar with the application code, but intuitively suspects that there are huge inefficiencies. Willing partner: excited to team up with developers to assess database load from multiple perspectives.
Limiting factor: database capacity (2) Leveraging the Top 10 Query Report
Limiting factor: database capacity (3). Findings: the #1 and #3 top queries came from offline processes that reload in-memory caches; the #9 and #13 top queries were unnecessary, created by a bug in the code. [Diagram: instances 0 through N, each running a refresh process that reloads its internal cache from the shared database.]
Results: database capacity (4). As the changes rolled out: 40% reduction in CPU usage, 50% reduction in connection requests. Total time spent: 4 weeks.
Limiting factor: Hotel Search Engine (1). Legacy: for years, we relied on horizontal scaling (adding more instances) to increase capacity. Consequence: horizontal scaling was no longer a viable option, because each new instance adds load to the database in order to refresh its in-memory caches. [Diagram: instances 0 through N, each refreshing an internal cache from the shared database.]
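To make that consequence concrete, here is a minimal, hypothetical sketch of the per-instance refresh pattern described above (the class name, interval, and query are illustrative, not the actual Orbitz code). Because every instance independently reloads its cache from the shared database, total refresh load grows linearly with instance count, which is why adding instances stopped being a viable way to add capacity.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical per-instance cache refresher; every running instance repeats this work. */
public class HotelCacheRefresher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile Map<String, String> cache = Map.of();

    public void start() {
        // Each instance schedules its own full reload against the shared database,
        // so N instances generate N times this query load.
        scheduler.scheduleAtFixedRate(this::reload, 0, 15, TimeUnit.MINUTES);
    }

    private void reload() {
        cache = queryDatabaseForAllHotelData(); // placeholder for the expensive full-reload query
    }

    private Map<String, String> queryDatabaseForAllHotelData() {
        return Map.of(); // stand-in; the real version would hit the database
    }
}
```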
Limiting factor: Hotel Search Engine (2). Many ideas from many people… Bobby, Director of Operations Center, feels like: the host is badly tuned and uses hardware inefficiently; developers are not aware of the costs of supporting so many instances. Chosen direction: draw a line in the sand. Can we meet the capacity goal by tuning the host on existing hardware?
Limiting factor: Hotel Search Engine (3). Why hadn't something as fundamental as JVM tuning been done? Ops perspective: developers are responsible for using their hosts intelligently and keeping the JVM tuned when they add features. Dev perspective: let's focus on features; there are people in Operations who make sure the overall host operates well. Outcome of this disconnect: JVM tuning fell into an ownerless chasm between Dev and Ops. Shift of perspective: a shared capacity goal encouraged a sense of host-level ownership.
Limiting factor: Hotel Search Engine (4) Investigation: We suspected that the JVM was not tuned well, but did not have a deep understanding of JVM dynamics. Our first thought: We don’t know much about doing this. Are there others at the company who are the “right people for this job”? Our next thought: Why not us? Why don’t we learn aggressively and become experts ourselves? Why don’t we apply a holistic, methodical approach to tuning?
Limiting factor: Hotel Search Engine (5). Hypothesis: new JVM tunings based on an informed, methodical approach will increase capacity per instance. Original tuning: an undersized young generation of heap memory meant that garbage collections happened frequently and recovered little memory. New tuning: a larger young generation leads to fewer garbage collections and more memory recovered with each one (live objects have time to become garbage). [Diagram: young generation heap space filling up with request objects alongside live data, under the original and new tunings.]
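As a hedged illustration of how the garbage collection overhead behind this hypothesis can be quantified before and after a tuning change, the snippet below uses the standard java.lang.management API. This is generic measurement code, not the tuning itself; the actual young-generation sizes were set via JVM command-line flags that are not reproduced here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Prints cumulative GC counts and pause time; run identically under each tuning to compare. */
public class GcOverheadProbe {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        long totalGcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total pause%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            totalGcMs += gc.getCollectionTime();
        }
        // Fraction of wall-clock time the JVM has spent collecting garbage so far.
        System.out.printf("GC overhead: %.2f%%%n", 100.0 * totalGcMs / uptimeMs);
    }
}
```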
Limiting factor: Hotel Search Engine (7) Performance testing: There was no established culture of performance testing. We had to invest the time to establish it. Establish an environment: Operations and Development built it carefully to guarantee parity with production. This is the only way that testing can produce valid conclusions. Establish a trusted benchmark: Before introducing changes, we reproduced the capacity bottleneck that we were trying to eliminate in production. Drive load reliably and repeatably: We developed JMeter test suites. Source: Mature Optimization Handbook, Carlos Bueno, 2013
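We drove load with JMeter. Purely to illustrate what "reliable and repeatable" means in practice (fixed concurrency, fixed duration, the same request mix every run), here is a minimal hypothetical load driver in Java; the endpoint, thread count, and duration are made-up examples, not our actual JMeter suites.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Minimal, hypothetical load driver: fixed concurrency and duration so runs are comparable. */
public class SearchLoadDriver {
    public static void main(String[] args) throws InterruptedException {
        String target = "http://perf-test-host:8080/hotel/search?city=CHI"; // hypothetical endpoint
        int threads = 50;
        long endAt = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(10);
        HttpClient client = HttpClient.newHttpClient();
        AtomicLong completed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                HttpRequest request = HttpRequest.newBuilder(URI.create(target)).GET().build();
                while (System.currentTimeMillis() < endAt) {
                    try {
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                        completed.incrementAndGet();
                    } catch (Exception e) {
                        // A real suite would also record errors and latency percentiles.
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(11, TimeUnit.MINUTES);
        System.out.println("Requests completed: " + completed.get());
    }
}
```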
Limiting factor: Hotel Search Engine (8) Test results: We confirmed our hypothesis. Tuning reduced the garbage collection overhead, which produced extra capacity. Iterate quickly: the performance test environment allowed us to test many scenarios, zeroing in on the best one. Trust: we used the same hardware as production, scaled down proportionally
Limiting factor: Hotel Search Engine (9). Tunings deployed in production: this host was now capable of reaching our capacity goal, and we reduced the instance count by 25%. Not so fast! This was only one host in a complex system; the next limiting factor awaits. Total time spent: 2 months.
Limiting factor: Markup service (1). Another opportunity for tuning, similar to what was done with the Hotel Search Engine. Payback for previous investments: we were able to leverage the foundation built earlier (the performance testing environment, a growing knowledge base, and a mature working relationship with Operations) to deploy this tuning effort in 1 week. Result: we improved capacity to meet the target while reducing the instance count from 110 to 16. Total time spent: 1 week.
Limiting factor: Markup service (2). But wait… it was not a storybook ending. Performance actually got worse after our deployment. Why? Production did not match performance test results: an environmental difference meant that we missed an important production dynamic. Non-obvious problem: visible impacts at the application level had to be traced down to a cause at the operations level, a challenge requiring DevOps collaboration.
Limiting factor: Markup service (3). What we avoided: working separately and passing deliverables over "the wall", which minimizes learning and is not conducive to reaching a shared understanding of the problem; inconclusive experiments (instead, commit to a hypothesis and run an experiment, rather than "chasing" hypotheses in the middle of an experiment). What we did with Operations: guerilla-style meetings, huddled around our desks and whiteboards whenever the work required it; shrink the haystack, debating hypotheses and planning experiments to test them (falsified hypotheses are still valuable because they tell you what the problem is not); eliminate red herrings, never accepting hunches as facts or settling for tempting workarounds.
Limiting factor: Markup service (4). The root cause: the TCP connection tracking (conntrack) table on the host servers was filling up and dropping packets. After tuning: TCP connections for incoming requests were spread over fewer instances and fewer host servers, so each remaining host tracked far more connections than before. Solution: reduce the conntrack table timeout for TCP connections in the TIME_WAIT state from several minutes to several seconds. [Diagram: incoming request connections fanning out across VM 0 through VM 15 and their instances on the host servers.]
Limiting factor: Markup service (5). Why did we miss this in our performance testing? Tomcat behavior: it reuses connections up to a certain count; above that, it opens and closes a connection for each request, generating much higher connection volume. The environmental difference: production had hundreds of unique clients, enough to cross the Tomcat connection threshold; the performance testing environment had a scaled-down number of clients and fewer overall connections, so it did not. [Diagram: many clients calling the Markup host through Tomcat in production vs. a few clients in the performance testing environment.] Total time spent: 4 weeks.
Stability: Long-standing production issues (1) Learning to live with something that’s wrong, rather than fixing it. Why does it happen? Code problems that have operational consequences are not within the scope of Dev or Ops alone. We end up with operational workarounds to development problems.
Stability: Long-standing production issues (2). Specific problem: our webapp degrades after 1-2 weeks of uptime, requiring restarts to avoid impairment. Dividend of the performance testing environment: we reproduced the issue there on the first day, which allowed us to iterate quickly through experiments. Dividend of the growing knowledge base: we limited our scope to this specific issue by determining its exact profile with a series of metrics. Dividend of the mature working relationship with Operations: Ops was a strong partner, driven by a growing confidence in our ability to solve famously difficult issues together.
Stability: Long-standing production issues (3). Valuable by-product: the knowledge and approach began to spread rapidly as we started to see good results. Ben, Developer: had no previous exposure to working with an Operations mindset; learned voraciously from our recent experiences; moved beyond a feature-level perspective to consider the operational health of the entire host and hardware. Guilty phase: operations issues take time away from "real work", i.e. feature work. Mental shift: operations-driven development is productive work.
Stability: Long-standing production issues (4) Isolated the problem with experiments: leveraged the performance testing environment to test hypotheses Every test must be conclusive. Negative conclusions are also valuable because they shrink the size of the “haystack” in which we are searching. Inconclusive tests yield no progress. “This is probably not the cause” is not a valuable outcome. Document the results of every test. We conducted 22 different tests, and the results can easily be confused or forgotten. Exp 00: Reproduction of issue in performance testing environment Exp 01: Simplify JVM arguments to minimal set … Exp 12: Remove most flow execution listeners … Exp 21: Set CMSClassUnloadingEnabled to collect PermGen Exp 22: Reuse Xstream instance rather than one per request
Stability: Long-standing production issues (5). The root cause: by misusing a library, we introduced a custom classloader leak that gradually drove up minor garbage collection times. One line of code was the problem: it created a new object for every request, rather than reusing one. Difficult for Ops to track down, because the answer was in the application code, so they settled for workarounds. Difficult for Dev to know or care about, because the impact was limited to Ops. Consequence: the misuse remained in the codebase for over a year before it was discovered! Total time spent: 2 months.
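Experiment 22 ("reuse Xstream instance rather than one per request") names the fix that stuck. As a hedged sketch of the pattern involved, assuming the misused library was XStream as the experiment list suggests, and with an illustrative class name that is not the actual Orbitz code, the difference looks like this:

```java
import com.thoughtworks.xstream.XStream;

public class HotelResponseSerializer {

    // Leak-prone pattern: build a new XStream instance on every request. Each instance
    // carries its own converter and metadata caches, so per-request construction churns
    // objects (and classloader-related state) that slowly inflates garbage collection work.
    public String serializePerRequest(Object response) {
        return new XStream().toXML(response);
    }

    // The fix: configure one instance once and reuse it. An XStream instance is designed to
    // be thread-safe for marshalling once its configuration is complete, so a single shared
    // instance can serve every request.
    private static final XStream SHARED = new XStream();

    public String serializeShared(Object response) {
        return SHARED.toXML(response);
    }
}
```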
Key Take-aways • Establish shared goals to break down barriers between Dev and Ops, which leads to shared understanding of problems • Commit to skepticism and use hypothesis testing as a contract to evaluate everyone's ideas • When justifying time spent vs. value produced, account for the investments that produce reusable value • Recognize that bugs with no functional impact can have huge operational impact
Questions? We’re hiring! @OrbitzTalent http://careers.orbitz.com