Just-In-Time Scalability: Agile Methods to Support Massive Growth

What is IMVU?

Behind the scenes...
IMVU is LAMP, plus...
• Perlbal • Memcached • Solr • MogileFS
plus...
• BuildBot • eAccelerator • Linux (Debian) • memcached • Nagios • Perl • Roundup • rrd • Subversion
• ADODB • b2evolution • Coppermine • feed2js • FreeTag • Incutio XML-RPC • jrcache • JSON-PHP • Magpie • osCommerce • phpBB • Phorum • SimpleTest • Selenium
• Audiere • Boost • Cal3D • CFL • NSIS • Pixomatic • Python • pywin32 • SCons • wxPython

Before and After Architecture
Before: We started with a small site, a mess of open source, and a small team that didn't know much about scaling.
After: We ended with a large site, a medium-sized team, and an architecture that has scaled.
We never stopped. We used a roadmap and a compass, made weekly changes in direction, regularly shipped code on Wednesday to handle the next weekend's capacity constraints, and shipped new features the whole time.

Before and After Architecture (1/4) November

Before and After Architecture (2/4) December

Before and After Architecture (3/4) February

Before and After Architecture (4/4) May

Advance planning vs. fast response
“Driving”
• Continuously figure out what is going to go wrong soon
• Quickly fix it, without breaking something else
• Get feedback along the way
“Rocket ship”
• Figure out in advance what is going to go wrong
• Build a plan that prevents those things from happening
• Execute your plan
• Get feedback when done

Questions to ask
“Driving”
• How do you know you will be able to fix the problem in time?
• How can you be sure you won't cause collateral damage?
• How can you be sure you won't code yourself into a corner?
“Rocket ship”
• Are you sure you know what is going to happen?
• Are you sure you can execute?
• Can you afford it?
• Do you need feedback?

Continuous Ship
• Deploy new software quickly
  • At IMVU, time from check-in to production = 20 minutes
• Tell a good change from a bad change (quickly)
• Revert a bad change quickly
• Work in small batches
  • At IMVU, a large batch = 3 days' worth of work
• Break large projects down into small batches
• Don't have the same problem twice – fix the root cause of each class of problems
IMVU pushes code to production 20-30 times every day

Cluster Immune System
What it looks like to ship one piece of code to production:
• Run tests locally (SimpleTest, Selenium)
  • Everyone has a complete sandbox
• Continuous Integration Server (BuildBot)
  • All tests must pass or “shut down the line”
  • Automatic feedback if the team is going too fast
• Incremental deploy
  • Monitor cluster and business metrics in real-time
  • Reject changes that move metrics out-of-bounds
• Alerting & Predictive monitoring (Nagios)
  • Monitor all metrics that stakeholders care about
  • If any metric goes out-of-bounds, wake somebody up
  • Use historical trends to predict acceptable bounds
When customers see a failure:
• Fix the problem for customers
• Improve your defenses at each level
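The incremental-deploy and predictive-monitoring steps can be sketched roughly as follows. This is a minimal Python illustration, not IMVU's tooling: the function names, the mean ± k standard deviations rule for "acceptable bounds", and the batch size are all assumptions made for the example.

```python
from statistics import mean, stdev


def acceptable_bounds(history, k=3.0):
    """Predict acceptable bounds from historical samples.

    Assumption: bounds are mean +/- k standard deviations; the deck does
    not say which rule was actually used.
    """
    mu = mean(history)
    sigma = stdev(history)
    return (mu - k * sigma, mu + k * sigma)


def out_of_bounds(metrics, bounds):
    """Return the sorted names of metrics whose value is outside its bounds."""
    bad = []
    for name, value in metrics.items():
        low, high = bounds[name]
        if not (low <= value <= high):
            bad.append(name)
    return sorted(bad)


def incremental_deploy(hosts, deploy, revert, read_metrics, bounds, batch_size=2):
    """Deploy in small batches and check metrics after each batch.

    If any metric leaves its bounds, revert every host deployed so far
    (most recent first) and report which metrics rejected the change.
    `deploy`, `revert`, and `read_metrics` are hypothetical callbacks.
    """
    deployed = []
    for start in range(0, len(hosts), batch_size):
        for host in hosts[start:start + batch_size]:
            deploy(host)
            deployed.append(host)
        bad = out_of_bounds(read_metrics(), bounds)
        if bad:
            for host in reversed(deployed):
                revert(host)
            return False, bad
    return True, []
```

Checking after every batch is what limits the damage: a bad change reaches only the first few hosts before it is rejected and rolled back.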

Case Study: Sharding
Problem: Spread write queries across multiple databases
Solution:
• Intercept and redirect queries based on SQL comments
• Move one table or sub-system at a time
• Our experience: one engineer horizontally partitions one table or small sub-system in one week
• New engineers figure this out in about 5 minutes

db_query("INSERT INTO inventory (customers_id, products_id) VALUES ($customer_id, $product_id)");

db_query("/*shard customer://$customer_id */ INSERT INTO inventory (customers_id, products_id) VALUES ($customer_id, $product_id)");

• Learning: cross-shard joins & transactions aren't required
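One way the comment-based interception could work is sketched below in Python (the deck's own helper is the PHP `db_query`). The regular expression, the shard names, and the modulo mapping from customer id to shard are assumptions for illustration; the deck does not say how a shard is chosen.

```python
import re

# Matches comments such as: /*shard customer://42 */
SHARD_COMMENT = re.compile(r"/\*\s*shard\s+(\w+)://(\d+)\s*\*/")

# Hypothetical database names, used only for this example.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]
DEFAULT_DB = "db-main"


def route_query(sql, shards=SHARDS, default=DEFAULT_DB):
    """Pick the database a query should be sent to.

    Queries without a shard comment keep going to the default database,
    which is what lets one table or sub-system move at a time.
    """
    match = SHARD_COMMENT.search(sql)
    if match is None:
        return default
    key = int(match.group(2))
    # Assumption: a simple modulo mapping from id to shard.
    return shards[key % len(shards)]
```

For example, `route_query("/*shard customer://42 */ INSERT INTO inventory ...")` selects `db-shard-2` under these assumptions, while an untagged query still goes to `db-main`.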

Case Study: Caching
Problem: Cache frequently read data in memcached
Solution:
• Intercept and cache queries based on SQL comments

db_query_cache(BUDDY_CACHE_TIME, "/*shard customer://$customer_id */ /*cache-class customer://$customer_id/buddies */ SELECT friend_id, buddy_order FROM customers_friends WHERE customers_id=$customer_id");

db_query("/*shard customer://$customer_id */ DELETE FROM customers_friends WHERE customers_id = $customer_id AND friend_id = $friend_id");
db_flush_cacheclass("customer://$customer_id/buddies");

• Learning: Flushing the cache is critical to users and performance
  • When a customer spends $24.95, they want the benefits immediately
• Learning: Test the cache behavior for critical systems
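The cache-and-flush pattern above might look roughly like this in Python. The `QueryCache` class, its in-memory dict standing in for memcached, and the generation-counter approach to flushing a cache class are all assumptions for illustration; the deck does not describe how `db_flush_cacheclass` is implemented.

```python
import re
import time

# Matches comments such as: /*cache-class customer://7/buddies */
CACHE_CLASS = re.compile(r"/\*\s*cache-class\s+(\S+?)\s*\*/")


class QueryCache:
    """Caches query results by cache class (a dict stands in for memcached)."""

    def __init__(self, run_query, clock=time.time):
        self._run_query = run_query   # hypothetical callable: sql -> rows
        self._clock = clock
        self._entries = {}            # (class, generation, sql) -> (expires_at, rows)
        self._generation = {}         # class -> current generation number

    def query_cache(self, ttl, sql):
        match = CACHE_CLASS.search(sql)
        if match is None:
            return self._run_query(sql)   # untagged queries are never cached
        cache_class = match.group(1)
        key = (cache_class, self._generation.get(cache_class, 0), sql)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        rows = self._run_query(sql)
        self._entries[key] = (now + ttl, rows)
        return rows

    def flush_cacheclass(self, cache_class):
        # Bumping the generation makes every older entry of this class
        # unreachable; a real cache would let those entries expire or be evicted.
        self._generation[cache_class] = self._generation.get(cache_class, 0) + 1
```

A test for a critical system would then assert that a read after the write and flush returns the new rows, not the cached ones.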

Case Study: Steering Data Design
Problem: Improve database schemas and data design to meet scalability requirements without downtime
Solution:
• Measure to find the real problems (harder than it sounds)
• Migrate to new design that takes advantage of sharding and/or caching

Case Study: Steering Data Design
Problem: You can't bulk move large frequently accessed data
Solution:
• Copy on read
  • Use when you are read bound
  • Reads check cache, new location, and copy to new location if missing
  • Writes go to new location if data has been migrated, otherwise old
• Copy on write
  • Use when you are write bound
  • Reads check cache, new location, then old location
  • Writes go to new location, copying to new location if missing
• Copy all
  • Use when file system fills up
  • Reads & writes go to new location, falling back to old location if missing
  • Cron copies data a few records at a time
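The three migration strategies can be sketched as follows. This is a simplified Python model, not IMVU's code: the old and new locations are plain dicts of record id to field dict, the cache layer is left out, and `MigratingStore` and its method names are hypothetical.

```python
class MigratingStore:
    """Old and new locations modelled as dicts: record_id -> dict of fields."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    # Copy on read: reads migrate the record, writes follow wherever it lives.
    def read_copy_on_read(self, key):
        if key not in self.new and key in self.old:
            self.new[key] = dict(self.old[key])
        return self.new.get(key)

    def write_copy_on_read(self, key, fields):
        target = self.new if key in self.new else self.old
        target.setdefault(key, {}).update(fields)

    # Copy on write: reads fall back to the old location, writes migrate.
    def read_copy_on_write(self, key):
        if key in self.new:
            return self.new[key]
        return self.old.get(key)

    def write_copy_on_write(self, key, fields):
        if key not in self.new:
            # Copy the existing record first so a partial update keeps
            # the fields that were not part of this write.
            self.new[key] = dict(self.old.get(key, {}))
        self.new[key].update(fields)

    # Copy all: a periodic job moves a few remaining records at a time.
    def copy_batch(self, limit):
        pending = [k for k in self.old if k not in self.new][:limit]
        for key in pending:
            self.new[key] = dict(self.old[key])
        return len(pending)
```

In each case the old location stays readable until every record has been copied, which is what avoids downtime during the move.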

“Thank You for Listening!”
