GDC Tutorial, 2005
Building Multi-Player Games
Case Study: The Sims Online, Lessons Learned
Larry Mellon
TSO: Overview
• Initial team: little to no MMP experience
• Engineering estimate: switching from 4-8 player peer-to-peer to MMP client/server would take no additional development time!
• No code / architecture / tool support for:
  • The long-term, continually changing nature of the game
  • Non-deterministic execution, dual platform (Win32 / Linux)
• Overall process designed for single-player complexity and a small development team:
  • Limited nightly builds, minimal daily testing
  • Limited design reviews, limited scalability testing, no "maintainable / extensible" implementation requirement
TSO: Case Study Outline (Lessons Learned)
• Poorly designed SP → MP → MMP transitions
• Scaling
  • Team & code size, data set size
  • Build & distribution
  • Architecture: logical & code
  • Visibility: development & operations
  • Testability: development, release, load
• Multi-player, non-determinism
• Persistent user data vs. code/content updates
• Patching / new content / custom content
Scalability (Team Size & Code Size)
• What were the problems?
  • Side-effect breaks & the ability to work in parallel
  • Limited encapsulation + poor testability + non-determinism = TROUBLE
  • Independent module design & its impact on the overall system (initially, no system architect)
  • #include structure
    • Win32 / Linux, compile times, pre-compiled headers, ...
• What worked
  • Move to the new architecture via refactoring & scaffolding
    • HSB, incSync, nullView simulator, nullView client, ...
  • Rolling integrations: never dark
  • Sandboxing & pumpkins
Scalability (Build & Distribution)
• To developers, customers & fielded servers
• What didn't work (well enough)
  • Pulling builds from developers' workstations
  • Shell scripts & manual publication
• What worked well
  • Heavy automation with web tracking
    • Repeatability, speed, visibility
  • Hierarchies of promotion & test
Scalability (Architecture)
• Logical versus physical versus code structure
  • Only the physical structure was not a major, MAJOR issue
• Logical: replicated computing vs. client/server
  • Security & stability implications
• Code: client/server isolation & code sharing
  • Multiple, concurrent logic threads shared code & data, each impacting the others
  • nullView client & simulator
  • Regulators vs. protocols: bug counts & state machines
Go to the Final Architecture ASAP
[Diagram: the multiplayer architecture runs a Sim on every client ("Here be Sync Hell"); it evolves to the client/server architecture, where a single authoritative Sim services undemocratic request/command traffic from the clients ("Sim Nice").]
Final Architecture ASAP: Make Everything Smaller & Separate
Final Architecture ASAP: Reduce Complexity of Branches
[Diagram: shared code and shared state riddled with branches on packet arrival, if (client), if (server), #ifdef (nullview), forking into client events and server events. The result: more packets, and client & server teams constantly breaking each other via changes to shared state & code.]
Final Architecture ASAP: "Refactoring"
• Decomposed the code into multiple DLLs
  • Found the Simulator
• Interfaces
  • Reference counting
  • Client/server subclassing (see the sketch below)
• How it helped:
  • Reduced coupling. Even reduced compile times!
  • Developers in different modules broke each other less often.
  • We went everywhere and learned the code base.
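The shared-code problem is easier to see in code. Below is a minimal, hypothetical C++ sketch (SimHost, ClientSim, ServerSim and Packet are illustrative names, not TSO's actual classes) of what client/server subclassing buys: shared dispatch code talks to one interface, and per-platform behavior lives in subclasses instead of #ifdef or if (client) branches.

```cpp
// Sketch: replacing #ifdef / if(client) branches with client/server
// subclassing behind a shared interface. All names are hypothetical.
#include <memory>
#include <iostream>

struct Packet { int id; };

// Shared interface: common code talks only to this.
class SimHost {
public:
    virtual ~SimHost() = default;
    virtual void onPacket(const Packet& p) = 0;   // side-specific handling
};

class ClientSim : public SimHost {
public:
    void onPacket(const Packet& p) override {
        std::cout << "client: apply update " << p.id << "\n";  // drive presentation
    }
};

class ServerSim : public SimHost {
public:
    void onPacket(const Packet& p) override {
        std::cout << "server: validate & simulate " << p.id << "\n";  // authoritative path
    }
};

// Shared dispatch code no longer branches on which side it is running.
void dispatch(SimHost& host, const Packet& p) { host.onPacket(p); }

int main() {
    std::unique_ptr<SimHost> host = std::make_unique<ServerSim>();  // chosen once at startup
    dispatch(*host, Packet{42});
}
```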
Final Architecture ASAP: It Had to Always Run
• Initially, clients wouldn't behave predictably
  • We could not even play test
  • Game design was demoralized
• We needed a bridge, now!
Final Architecture ASAP: Incremental Sync
• A quick, temporary solution...
  • Couldn't wait for the final system to be finished
  • High overhead; couldn't ship it
• We took partial state snapshots on the server and restored to them on the client (sketched below)
• How it helped:
  • We could finally see the game as it would be
  • Allowed parallel game design and coding
  • Bought time to lay in the "right" stuff
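A rough sketch of the incremental-sync idea, assuming a simple keyed object store (ObjectState, captureSnapshot and restoreSnapshot are illustrative names, not TSO's code): the server captures a partial snapshot of recently dirtied objects and the client simply overwrites its local copies, trading bandwidth for guaranteed agreement.

```cpp
// Sketch of the "incremental sync" bridge: server captures a partial
// snapshot of authoritative state; client stomps its local copy with it.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ObjectState {            // minimal per-object state (hypothetical fields)
    float x = 0, y = 0;
    uint32_t animId = 0;
};

using Snapshot = std::unordered_map<uint32_t, ObjectState>;  // objectId -> state

// Server side: capture only objects dirtied since the last snapshot.
Snapshot captureSnapshot(const std::unordered_map<uint32_t, ObjectState>& world,
                         const std::vector<uint32_t>& dirtyIds) {
    Snapshot snap;
    for (uint32_t id : dirtyIds) {
        auto it = world.find(id);
        if (it != world.end()) snap[id] = it->second;
    }
    return snap;
}

// Client side: overwrite local state whether or not local prediction agreed.
// High overhead, but it keeps every view converging on the server's truth.
void restoreSnapshot(std::unordered_map<uint32_t, ObjectState>& localWorld,
                     const Snapshot& snap) {
    for (const auto& [id, state] : snap)
        localWorld[id] = state;
}

int main() {
    std::unordered_map<uint32_t, ObjectState> serverWorld, clientWorld;
    serverWorld[1] = ObjectState{2.f, 3.f, 7};
    Snapshot snap = captureSnapshot(serverWorld, {1});
    restoreSnapshot(clientWorld, snap);    // client now matches the server
}
```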
Architecture: Conclusions
• Keep it simple, stupid!
  • Client/server
• Keep it clean
  • DLL/module integration points
  • #ifdefs must die!
• Keep it alive
  • Plan for a constant system-architect role: review all modules for their impact on the team, other modules & extensibility
• Expose & control all inter-process communication
  • See Regulators: state machines that control transactions (sketched below)
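As an illustration of the regulator idea (not the actual TSO implementation), here is a minimal state machine wrapping one client/server transaction: every inter-process exchange gets an explicit, inspectable control point instead of ad-hoc protocol code scattered through the game logic.

```cpp
// Hypothetical "regulator": an explicit state machine for one transaction.
#include <cassert>

enum class RegState { Idle, RequestSent, Committed, Failed };

class Regulator {
public:
    RegState state() const { return state_; }

    void sendRequest() {                 // client asks the server to do something
        assert(state_ == RegState::Idle);
        state_ = RegState::RequestSent;
    }
    void onServerAck(bool ok) {          // server's authoritative answer arrives
        assert(state_ == RegState::RequestSent);
        state_ = ok ? RegState::Committed : RegState::Failed;
    }
    void reset() { state_ = RegState::Idle; }

private:
    RegState state_ = RegState::Idle;
};

int main() {
    Regulator buyObject;
    buyObject.sendRequest();             // e.g. client -> server: "buy chair"
    buyObject.onServerAck(true);         // server confirms; transaction committed
    return buyObject.state() == RegState::Committed ? 0 : 1;
}
```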
TSO: Case Study Outline (Lessons Learned)
• Poorly designed SP → MP → MMP transitions
• Scaling
  • Team & code size, data set size
  • Build & distribution
  • Architecture: logical & code
  • Visibility: development & operations
  • Testability: development, release, load
• Multi-player, non-determinism
• Persistent user data vs. code/content updates
• Patching / new content / custom content
Visibility
• Problems
  • Debugging a client/server issue was very slow & painful
  • Knowing what to work on next was largely guesswork
  • Reproducing system failures from the live environment
  • Knowing how one build or server cluster differed from another was, again, largely guesswork
• What we did that worked
  • Log / crash aggregators & filters
  • Live "critical event" monitor
  • Esper: live player & engine metrics
  • Repeatable load testing
  • Web-based dashboard: health, status, where is everything
  • Fully automated build & publish procedures
Visibility via "Bread Crumbs": Aggregated Instrumentation Flags Trouble Spots
[Chart: aggregated instrumentation counters over time, with a spike flagging a server crash.]
Quickly Find Trouble Spots
[Chart: the DB byte count oscillates out of control, then the server crashes.]
Drill Down for Details
[Chart: a single DB request is clearly at fault.]
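The "bread crumb" counters behind charts like these can be sketched in a few lines. The BreadCrumbs class and counter labels below are hypothetical stand-ins for TSO's Esper/dashboard pipeline, but they show the shape: cheap counters scattered through the code, flushed each interval to an aggregator that can flag anomalies such as a runaway DB byte count.

```cpp
// Illustrative instrumentation counters, rolled up per interval for a dashboard.
#include <map>
#include <string>
#include <iostream>

class BreadCrumbs {
public:
    void hit(const std::string& label, long amount = 1) { counts_[label] += amount; }

    // Dump and reset each interval; an upstream aggregator would graph these.
    void flush(std::ostream& out) {
        for (const auto& [label, count] : counts_)
            out << label << "=" << count << "\n";
        counts_.clear();
    }
private:
    std::map<std::string, long> counts_;
};

int main() {
    BreadCrumbs crumbs;
    crumbs.hit("db.request.bytes", 1532);   // e.g. the oscillating DB byte count
    crumbs.hit("sim.tick");
    crumbs.flush(std::cout);
}
```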
TSO: Case Study Outline (Lessons Learned)
• Poorly designed SP → MP → MMP transitions
• Scaling
  • Team & code size, data set size
  • Build & distribution
  • Architecture: logical & code
  • Visibility: development & operations
  • Testability: development, release, load
• Multi-player, non-determinism
• Persistent user data vs. code/content updates
• Patching / new content / custom content
Testability
• Development, release, load: all had show-stopper problems
  • QA coordination / speed / cost
  • Repeatability, non-determinism
  • Need for many, many tests per day, each with multiple inputs (two to two thousand players per test)
Testability: What Worked
• Automated testing for repeatability & scale
  • Scriptable test clients: mirrored actual user play sessions
  • Changed the game's architecture to increase testability
  • External test harnesses to control 50+ test clients per CPU, 4,000+ per session (see the harness sketch below)
  • Push-button UI to configure, run & analyze tests (developer & QA)
  • Constantly updated baselines, with "monkey test" stats
  • Pre-checkin regression
  • QA: web-driven state machine to control testers & collect/publish results
• What didn't work
  • Event recorders, unit testing
  • Manual-only testing
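A minimal sketch of the external-harness idea, assuming a hypothetical RunScriptedClient entry point (a real harness would launch nullView client processes and feed each one a script): spin up many headless scripted clients per CPU and tally pass/fail, so one run can gate check-ins or feed a load test.

```cpp
// Sketch: launch many headless scripted clients and aggregate results.
#include <atomic>
#include <thread>
#include <vector>
#include <iostream>

bool RunScriptedClient(int clientId);   // hypothetical: plays one scripted session

int main() {
    const int kClients = 50;            // "50+ test clients per CPU"
    std::atomic<int> failures{0};
    std::vector<std::thread> clients;

    for (int i = 0; i < kClients; ++i)
        clients.emplace_back([i, &failures] {
            if (!RunScriptedClient(i)) ++failures;
        });
    for (auto& t : clients) t.join();

    std::cout << failures << " of " << kClients << " clients failed\n";
    return failures == 0 ? 0 : 1;
}

// Stub so the sketch builds; a real harness would spawn a client process.
bool RunScriptedClient(int) { return true; }
```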
MMP Automated Testing: Approach
• Push-button ability to run large-scale, repeatable tests
• Cost
  • Hardware / software
  • Human resources
  • Process changes
• Benefit
  • Accurate, repeatable, measurable tests during development and operations
  • Stable software, faster, measurable progress
  • Base key decisions on fact, not opinion
Why Spend the Time & Money?
• System complexity, non-determinism, scale
  • Tests provide hard data in a confusing sea of possibilities
• End users: a high Quality of Service bar
• Dev team: greater comfort & confidence
  • Tools augment your team's ability to do their jobs
  • Find problems faster
  • Measure / change / measure: repeat as necessary
• Production & executives: come to depend on this data to a high degree
Scripted Test Clients
• Scripts are emulated play sessions, just as somebody would play the game
• Command steps: what the player does to the game
• Validation steps: what the game should do in response (see the script sketch below)
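Expressed as code, a script is just an ordered list of command and validation steps. The Step and Script types below are illustrative, not TSO's real scripting API; the session content is a made-up example.

```cpp
// Sketch: a test script as an emulated play session of commands + validations.
#include <functional>
#include <string>
#include <vector>
#include <iostream>

struct Step {
    std::string description;
    std::function<bool()> run;    // command or validation; false = test failure
};

using Script = std::vector<Step>;

bool runScript(const Script& script) {
    for (const auto& step : script) {
        if (!step.run()) {
            std::cout << "FAILED: " << step.description << "\n";
            return false;
        }
    }
    return true;
}

int main() {
    bool chatArrived = false;                       // stand-in for observed game state
    Script enterLotAndChat = {
        {"command: enter lot",       [] { return true; }},
        {"command: send chat 'hi'",  [&] { chatArrived = true; return true; }},
        {"validate: chat displayed", [&] { return chatArrived; }},
    };
    return runScript(enterLotAndChat) ? 0 : 1;
}
```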
Scripts Tailored to Each Test Application
• Unit testing: 1 feature = 1 script
• Load testing: representative play session
  • The average Joe, times thousands
• Shipping quality: corner cases, feature completeness
• Integration: test code changes for catastrophic failures
Scripted Players: Implementation
[Diagram: the test client replaces the game client's GUI with a script engine; both drive the same client-side game logic and state through the presentation layer, so scripted commands follow the same path as real player input.]
Process Shift: Earlier Tools Investment Equals More Gain
[Chart: amount of work done vs. time, from project start to target launch. With weak test support, MMP developer efficiency flattens out below "good enough"; with strong test support it keeps climbing through launch.]
Process Shifts: Automated Testing Changes the Shape of the Development Progress Curve
• Stability (code base & servers)
  • Keep developers moving forward, not bailing water
• Scale & feature completeness
  • Focus developers on key, measurable roadblocks
Process Shift: Measurable Targets, Projected Trend Lines
[Chart: core functionality tests passing for any feature (e.g. # of clients) plotted against time, projected as a trend line from the first passing test ("now") to the target ("complete") at any milestone (e.g. Alpha). Actionable progress metrics, early enough to react.]
Process Shift: Load Testing (Before Paying Customers Show Up)
• Expose issues that only occur at scale
• Establish hardware requirements
• Establish that play is acceptable at scale
TSO: Case Study Outline (Lessons Learned)
• Poorly designed SP → MP → MMP transitions
• Scaling
  • Team & code size, data set size
  • Build & distribution
  • Architecture: logical & code
  • Visibility: development & operations
  • Testability: development, release, load
• Multi-player, non-determinism
• Persistent user data vs. code/content updates
• Patching / new content / custom content
User Data
• Oops!
  • Users stored much more data (with much more variance) than we had planned for
  • Caused many DB failures, city failures
  • BIG problem: their persistent data has to work, always, across all builds & DB instances
• What helped
  • Regression testing, each build, against a live set of user data
• What would have helped more
  • Sanity checks against the DB
  • Range checks against user data (sketched below)
  • Better code & architecture support for validation of user data
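For illustration, the kind of range/sanity check the last bullets call for might look like the sketch below (UserLot, its fields and the limits are hypothetical): validate each persistent record on load, before the simulator or DB trusts it.

```cpp
// Sketch: sanity/range checks on persistent user data before it is used.
#include <vector>
#include <string>

struct UserLot {                 // hypothetical persistent record
    std::vector<int> objectIds;
    std::string lotName;
    int simoleons = 0;
};

bool validateUserLot(const UserLot& lot, std::string* whyNot) {
    if (lot.objectIds.size() > 10000) { *whyNot = "too many objects";   return false; }
    if (lot.lotName.size() > 64)      { *whyNot = "lot name too long";  return false; }
    if (lot.simoleons < 0)            { *whyNot = "negative currency";  return false; }
    return true;   // record is sane enough to hand to the simulator / DB
}

int main() {
    UserLot lot;
    lot.simoleons = -5;                       // corrupt record
    std::string reason;
    bool ok = validateUserLot(lot, &reason);  // ok == false, reason == "negative currency"
    return ok ? 0 : 1;
}
```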
Patching / New Content / Custom Content
• Oops!
  • The initial patch budget of 1 MB was blown in the first week of operations
  • New content required a stronger, more predictable process
  • Custom content required infrastructure able to easily add new content, on the fly
• Key issue: all effort had gone into going live, not into creating a sustainable process once live
• Conclusion: designing these in would have been much easier than retrofitting...
Lessons Learned
• autoTest: scripted test clients and instrumented code rock!
  • Collection, aggregation and display of test data is vital to making day-to-day decisions
  • Lessens the panic
  • "Scale & break" is a very clarifying experience
  • Stable code & servers greatly ease the pain of building an MMP game
  • Hard data (not opinion) is both illuminating and calming
• autoBuild: make it push-button, with instant web visibility
  • Use early, use often to get bugs out before going live
• Budget for a strong architect role & a strong design-review process for the entire game lifecycle
  • Scalability, testability, patching, new content & long-term persistence are requirements: MUCH cheaper to design in than to frantically retrofit
  • The KISS principle is mandatory, as is expecting change
Lessons Learned
• Visibility: tremendous volumes of data require automated collection & summarization
  • Provide drill-down access to details from summary-view web pages
• Get some people on board who have been burned before: a lot of TSO's pain could easily have been avoided, but little distributed-system experience or MMP design insight existed in the early phases of the project
  • Fred Brooks, the 31st programmer
• Strong tools & process pay off for large teams & long-term operations
  • Measure & improve your workspace, constantly
• Non-determinism is painful & unavoidable
  • Minimize its impact via explicit design support, and use strong, constant calibration to understand it
Biggest Wins
• Code isolation
• Scaffolding
• Tools: build / test / measure, information management
• Pre-checkin regression / load testing
Biggest Losses
• Architecture: massively peer-to-peer
• Early lack of tools
• #ifdef across platform / function
• "Critical path" dependencies

More details: www.maggotranch.com/MMP (3 TSO Lessons Learned talks)