Presentation Transcript


  1. GDC Tutorial, 2005. Building Multi-Player Games Case Study: The Sims Online Lessons Learned, Larry Mellon

  2. TSO: Overview • Initial team: little to no MMP experience • Engineering estimate: switching from 4–8 player peer-to-peer to MMP client/server would take no additional development time! • No code / architecture / tool support for • Long-term, continually changing nature of the game • Non-deterministic execution, dual platform (win32 / Linux) • Overall process designed for single-player complexity, small development team • Limited nightly builds, minimal daily testing • Limited design reviews, limited scalability testing, no “maintainable/extensible” implementation requirement

  3. TSO: Case Study Outline (Lessons Learned) • Poorly designed SP → MP → MMP transitions • Scaling: team & code size, data set size; build & distribution; architecture (logical & code) • Visibility: development & operations • Testability: development, release, load • Multi-player, non-determinism • Persistent user data vs. code/content updates • Patching / new content / custom content

  4. Scalability (Team Size & Code Size) • What were the problems • Side-effect breaks & ability to work in parallel • Limited encapsulation + poor testability + non-determinism = TROUBLE • Independent module design & impact on overall system (initially, no system architect) • #include structure • win32 / Linux, compile times, pre-compiled headers, ... • What worked • Move to new architecture via Refactoring & Scaffolding • HSB, incSync, nullView Simulator, nullView client, … • Rolling integrations: never dark • Sandboxing & pumpkins

  5. Scalability (Build & Distribution) • To developers, customers & fielded servers • What didn’t work (well enough) • Pulling builds from developer’s workstations • Shell scripts & manual publication • What worked well • Heavy automation with web tracking • Repeatability, Speed, Visibility • Hierarchies of promotion & test

  6. Scalability (Architecture) • Logical versus physical versus code structure • Only physical was not a major, MAJOR issue • Logical: replicated computing vs. client/server • Security & stability implications • Code: client/server isolation & code sharing • Multiple, concurrent logic threads were sharing code & data, each impacting the others • Nullview client & simulator • Regulators vs. protocols: bug counts & state machines
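
Slide 6's "nullview client & simulator" is worth unpacking: the simulation talks to an abstract view, so a do-nothing view lets the same code run headless on Linux servers or under test. Below is a minimal sketch of that idea; the class and method names (IView, NullView, Simulator::Tick) are illustrative assumptions, not TSO's actual code.

```cpp
// Hypothetical sketch of the "null view" idea: the simulation talks to an
// abstract view interface, so a do-nothing implementation lets the same
// simulation code run headless on servers or in automated tests.
#include <cstdio>
#include <memory>
#include <string>

struct IView {                       // illustrative interface name
    virtual ~IView() = default;
    virtual void ShowObject(const std::string& id, int x, int y) = 0;
};

class RenderedView : public IView {  // client build: draws to the screen
public:
    void ShowObject(const std::string& id, int x, int y) override {
        std::printf("draw %s at (%d,%d)\n", id.c_str(), x, y);
    }
};

class NullView : public IView {      // server/test build: does nothing
public:
    void ShowObject(const std::string&, int, int) override {}
};

class Simulator {                    // same simulation code either way
public:
    explicit Simulator(std::unique_ptr<IView> view) : view_(std::move(view)) {}
    void Tick() { view_->ShowObject("sim_object", 3, 4); }
private:
    std::unique_ptr<IView> view_;
};

int main() {
    Simulator headless(std::make_unique<NullView>());   // no #ifdef needed
    Simulator client(std::make_unique<RenderedView>());
    headless.Tick();
    client.Tick();
}
```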

  7. Go to the Final Architecture ASAP [diagram: the multiplayer architecture, where each client runs its own Sim ("Here be Sync Hell"), versus the client/server architecture, where a single authoritative Sim receives "nice, undemocratic" requests/commands from the clients; evolve from the former to the latter]

  8. Final Architecture ASAP: Make Everything Smaller & Separate

  9. Final Architecture ASAP: Reduce Complexity of Branches [diagram: on packet arrival, shared code and shared state branch on if (client), if (server) and #ifdef (nullview) into client events and server events, with ever more packets arriving] Client & server teams would constantly break each other via changes to shared state & code.

  10. Final Architecture ASAP: “Refactoring” • Decomposed into multiple DLLs • Found the Simulator • Interfaces • Reference counting • Client/server subclassing • How it helped: • Reduced coupling. Even reduced compile times! • Developers in different modules broke each other less often. • We went everywhere and learned the code base.
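
As a rough illustration of the refactoring bullets above, here is a hedged sketch of intrusive reference counting combined with client/server subclassing behind a shared interface; all names (RefCounted, GameObject, ClientGameObject, ServerGameObject) are invented for the example and do not reflect TSO's real classes.

```cpp
// Minimal sketch of intrusive reference counting plus client/server
// subclassing; shared logic lives in the base class, platform-specific
// behaviour in subclasses instead of if (server) branches in shared code.
#include <atomic>
#include <cstdio>

class RefCounted {
public:
    void AddRef() { ++refs_; }
    void Release() { if (--refs_ == 0) delete this; }
protected:
    virtual ~RefCounted() = default;
private:
    std::atomic<int> refs_{1};      // starts owned by the creator
};

class GameObject : public RefCounted {
public:
    virtual void OnEvent() = 0;     // shared contract
};

class ClientGameObject : public GameObject {
public:
    void OnEvent() override { std::puts("client: update presentation"); }
};

class ServerGameObject : public GameObject {
public:
    void OnEvent() override { std::puts("server: validate and persist"); }
};

int main() {
    GameObject* obj = new ServerGameObject();  // refcount == 1
    obj->AddRef();                             // a second owner appears
    obj->OnEvent();
    obj->Release();
    obj->Release();                            // last Release deletes
}
```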

  11. Final Architecture ASAP: It Had to Always Run • Initially clients wouldn’t behave predictably • We could not even play test • Game design was demoralized • We needed a bridge, now!

  12. Final Architecture ASAP: Incremental Sync • A quick, temporary solution… • Couldn’t wait for the final system to be finished • High overhead, couldn’t ship it • We took partial state snapshots on the server and restored to them on the client • How it helped: • Could finally see the game as it would be • Allowed parallel game design and coding • Bought time to lay in the “right” stuff.
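
A minimal sketch of the incremental-sync bridge described above, assuming a toy state model: the server captures a partial snapshot of what changed and the client simply restores to it, discarding whatever it had drifted to. The types and names (SimState, TakeSnapshot, ApplySnapshot) are hypothetical.

```cpp
// Rough sketch of "incremental sync": server-side partial snapshots,
// client-side blind restore. Field names are invented for illustration.
#include <cstdio>
#include <map>
#include <string>

struct SimState {
    std::map<std::string, int> objects;   // e.g. object id -> position/motive
};

// Server side: capture a partial snapshot (only what changed since last send).
std::map<std::string, int> TakeSnapshot(const SimState& server,
                                        const SimState& lastSent) {
    std::map<std::string, int> delta;
    for (const auto& [id, value] : server.objects) {
        auto it = lastSent.objects.find(id);
        if (it == lastSent.objects.end() || it->second != value)
            delta[id] = value;            // only ship what differs
    }
    return delta;
}

// Client side: restore to the snapshot, discarding local drift.
void ApplySnapshot(SimState& client, const std::map<std::string, int>& delta) {
    for (const auto& [id, value] : delta) client.objects[id] = value;
}

int main() {
    SimState server{{{"lamp", 1}, {"sim_sam", 7}}};
    SimState lastSent{{{"lamp", 1}, {"sim_sam", 5}}};
    SimState client = lastSent;           // client has drifted out of sync

    ApplySnapshot(client, TakeSnapshot(server, lastSent));
    std::printf("sim_sam on client is now %d\n", client.objects["sim_sam"]);
}
```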

  13. Architecture: Conclusions • Keep it simple, stupid! • Client/server • Keep it clean • DLL/module integration points • #ifdef’s must die! • Keep it alive • Plan for a constant system architect role: review all modules for impact on team, other modules & extensibility • Expose & control all inter-process communication • See Regulators: state machines that control transactions
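
Since regulators come up twice (slides 6 and 13), here is a hedged sketch of the idea: a small state machine that owns a single client/server transaction, so every request, grant and commit is an explicit, checkable transition. The transaction (BuyObjectRegulator) and its states are invented for illustration.

```cpp
// Sketch of a "regulator": a state machine that controls one transaction so
// out-of-order messages become visible errors instead of silent corruption.
#include <cstdio>

class BuyObjectRegulator {
public:
    enum class State { Idle, Requested, Granted, Committed, Failed };

    void RequestPurchase(int objectId, int price) {
        if (state_ != State::Idle) return Fail("request while busy");
        objectId_ = objectId; price_ = price;
        state_ = State::Requested;          // -> wait for server decision
    }
    void OnServerGrant() {
        if (state_ != State::Requested) return Fail("grant out of order");
        state_ = State::Granted;
    }
    void OnServerCommit() {
        if (state_ != State::Granted) return Fail("commit before grant");
        state_ = State::Committed;
        std::printf("bought object %d for %d\n", objectId_, price_);
    }
    State CurrentState() const { return state_; }

private:
    void Fail(const char* why) {
        std::printf("regulator error: %s\n", why);
        state_ = State::Failed;             // bad transitions are visible
    }
    State state_ = State::Idle;
    int objectId_ = 0, price_ = 0;
};

int main() {
    BuyObjectRegulator reg;
    reg.RequestPurchase(42, 150);
    reg.OnServerGrant();
    reg.OnServerCommit();
}
```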

  14. TSO: Case Study Outline (Lessons Learned) • Poorly designed SP → MP → MMP transitions • Scaling: team & code size, data set size; build & distribution; architecture (logical & code) • Visibility: development & operations • Testability: development, release, load • Multi-player, non-determinism • Persistent user data vs. code/content updates • Patching / new content / custom content

  15. Visibility • Problems • Debugging a client/server issue was very slow & painful • Knowing what to work on next was largely guesswork • Reproducing system failures from the live environment was difficult • Knowing how one build or server cluster differed from another was again largely guesswork • What we did that worked • Log / crash aggregators & filters • Live “critical event” monitor • Esper: live player & engine metrics • Repeatable load testing • Web-based dashboard: health, status, where is everything • Fully automated build & publish procedures
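
A small sketch of the "bread crumb" instrumentation style the next slides visualize: code increments named counters and a periodic flush writes one line per metric for an external aggregator or dashboard to pick up. The Metrics class and metric names are assumptions, not the Esper implementation.

```cpp
// Illustrative "bread crumb" instrumentation: named counters, flushed as
// plain log lines that an aggregator or dashboard can tail and chart.
#include <cstdio>
#include <ctime>
#include <map>
#include <string>

class Metrics {
public:
    void Count(const std::string& name, long delta = 1) { counters_[name] += delta; }
    void Flush(std::FILE* out) {
        std::time_t now = std::time(nullptr);
        for (const auto& [name, value] : counters_)
            std::fprintf(out, "%ld metric %s %ld\n",
                         static_cast<long>(now), name.c_str(), value);
        counters_.clear();                 // restart per reporting interval
    }
private:
    std::map<std::string, long> counters_;
};

int main() {
    Metrics metrics;
    metrics.Count("db.bytes_written", 4096);   // sprinkled through the code
    metrics.Count("packets.received", 12);
    metrics.Count("sim.crash_recovered");
    metrics.Flush(stdout);                     // aggregator tails this output
}
```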

  16. Visibility via “Bread Crumbs”: Aggregated Instrumentation Flags Trouble Spots [chart: aggregated instrumentation leading up to a server crash]

  17. Quickly Find Trouble Spots [chart: DB byte count oscillates out of control, then the server crashes]

  18. Drill Down for Details [chart: a single DB request is clearly at fault]

  19. TSO: Case Study Outline (Lessons Learned) • Poorly designed SP → MP → MMP transitions • Scaling: team & code size, data set size; build & distribution; architecture (logical & code) • Visibility: development & operations • Testability: development, release, load • Multi-player, non-determinism • Persistent user data vs. code/content updates • Patching / new content / custom content

  20. Testability • Development, release, load: all had show-stopper problems • QA coordination / speed / cost • Repeatability, non-determinism • Need for many, many tests per day, each with multiple inputs (two to two thousand players per test)

  21. Testability: What Worked • Automated testing for repeatability & scale • Scriptable test clients: mirrored actual user play sessions • Changed the game’s architecture to increase testability • External test harnesses to control 50+ test clients per CPU, 4,000+ per session • Push-button UI to configure, run & analyze tests (developer & QA) • Constantly updated baselines, with “Monkey Test” stats • Pre-checkin regression • QA: web-driven state machine to control testers & collect/publish results • What didn’t work • Event recorders, unit testing • Manual-only testing

  22. MMP Automated Testing: Approach • Push-button ability to run large-scale, repeatable tests • Cost • Hardware / Software • Human resources • Process changes • Benefit • Accurate, repeatable measurable tests during development and operations • Stable software, faster, measurable progress • Base key decisions on fact, not opinion

  23. Why Spend The Time & Money? • System complexity, non-determinism, scale • Tests provide hard data in a confusing sea of possibilities • End users: high Quality of Service bar • Dev team: greater comfort & confidence • Tools augment your team’s ability to do their jobs • Find problems faster • Measure / change / measure: repeat as necessary • Production & executives: come to depend on this data to a high degree

  24. Scripted Test Clients • Scripts are emulated play sessions: just like somebody plays the game • Command steps: what the player does to the game • Validation steps: what the game should do in response
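
A compact sketch of what a scripted play session could look like under the command-step / validation-step split described above; the FakeGame stand-in and the script contents are invented purely to make the example runnable.

```cpp
// A script is a list of command steps (what the player does) and validation
// steps (what the game should do in response); a failing step fails the run.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct FakeGame {                  // stand-in for the real client under test
    int simoleons = 100;
    void BuyChair() { simoleons -= 50; }
};

struct Step {
    std::string name;
    std::function<bool(FakeGame&)> run;   // returns false on validation failure
};

int main() {
    std::vector<Step> script = {
        {"command: buy a chair",
         [](FakeGame& g) { g.BuyChair(); return true; }},
        {"validate: money was deducted",
         [](FakeGame& g) { return g.simoleons == 50; }},
    };

    FakeGame game;
    for (const auto& step : script) {
        bool ok = step.run(game);
        std::printf("%-32s %s\n", step.name.c_str(), ok ? "PASS" : "FAIL");
        if (!ok) return 1;        // a failing step fails the whole session
    }
    return 0;
}
```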

  25. Scripts Tailored to Each Test Application • Unit testing: 1 feature = 1 script • Load testing: representative play session • The average Joe, times thousands • Shipping quality: corner cases, feature completeness • Integration: test code changes for catastrophic failures

  26. Scripted Players: Implementation [diagram: the test client is a game client with the Game GUI swapped out for a Script Engine; both drive the same Presentation Layer and client-side game logic, sending commands in and reading game state back]
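
To make the diagram concrete, here is a hedged sketch of the wiring it implies: the GUI and the script engine are interchangeable drivers of one presentation-layer interface, so a test client is just a game client with the GUI swapped out. The interface and function names (IPresentationLayer, IssueCommand, QueryState) are illustrative, not TSO's real API.

```cpp
// The GUI and the script engine drive the same seam into client-side game
// logic, so scripted players exercise exactly the code path real players do.
#include <cstdio>
#include <string>

struct IPresentationLayer {              // the seam both drivers share
    virtual ~IPresentationLayer() = default;
    virtual void IssueCommand(const std::string& cmd) = 0;
    virtual std::string QueryState(const std::string& key) = 0;
};

class ClientLogic : public IPresentationLayer {
public:
    void IssueCommand(const std::string& cmd) override { last_ = cmd; }
    std::string QueryState(const std::string&) override { return last_; }
private:
    std::string last_;
};

// A human drives this path via the GUI...
void GuiClick(IPresentationLayer& game) { game.IssueCommand("use_fridge"); }

// ...and the script engine drives the same path, no GUI involved.
void ScriptStep(IPresentationLayer& game) {
    game.IssueCommand("use_fridge");
    std::printf("state after step: %s\n", game.QueryState("last_cmd").c_str());
}

int main() {
    ClientLogic logic;
    GuiClick(logic);      // game client
    ScriptStep(logic);    // test client
}
```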

  27. Process Shift: Earlier Tools Investment Equals More Gain [chart: MMP developer efficiency, amount of work done vs. time from project start to target launch; strong test support pulls steadily ahead, while weak test support ends up “not good enough”]

  28. Process Shifts: Automated Testing Changes the Shape of the Development Progress Curve • Stability (code base & servers): keep developers moving forward, not bailing water • Scale & feature completeness: focus developers on key, measurable roadblocks

  29. Process Shift: Measurable Targets, Projected Trend Lines [chart: core functionality tests passing for any feature (e.g. # clients) plotted over time, from the first passing test through any milestone (e.g. Alpha) toward the target] Actionable progress metrics, early enough to react.

  30. Process Shift: Load Testing (Before Paying Customers Show Up) • Expose issues that only occur at scale • Establish hardware requirements • Establish that play is acceptable at scale

  31. Client-Server Comparison

  32. TSO: Case Study Outline (Lessons Learned) • Poorly designed SP → MP → MMP transitions • Scaling: team & code size, data set size; build & distribution; architecture (logical & code) • Visibility: development & operations • Testability: development, release, load • Multi-player, non-determinism • Persistent user data vs. code/content updates • Patching / new content / custom content

  33. User Data • Oops! • Users stored much more data (with much more variance) than we had planned for • Caused many DB failures, city failures • BIG problem: their persistent data has to work, always, across all builds & DB instances • What helped • Regression testing, each build, against a live set of user data • What would have helped more • Sanity checks against the DB • Range checks against user data • Better code & architecture support for validation of user data
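
As a sketch of the range and sanity checks the slide wishes had existed, the snippet below validates persistent user records on load and rejects out-of-range values before they can reach the DB or a city server; the record fields and limits are invented for illustration.

```cpp
// Validate persistent user records on load; reject anything out of range
// rather than letting a corrupt record take down a DB or city server.
#include <cstdio>
#include <string>
#include <vector>

struct UserRecord {
    std::string name;
    int simoleons;
    int ownedObjects;
};

bool Validate(const UserRecord& r, std::string* why) {
    if (r.name.empty() || r.name.size() > 64) { *why = "bad name length"; return false; }
    if (r.simoleons < 0 || r.simoleons > 10'000'000) { *why = "simoleons out of range"; return false; }
    if (r.ownedObjects < 0 || r.ownedObjects > 5'000) { *why = "object count out of range"; return false; }
    return true;
}

int main() {
    std::vector<UserRecord> fromDb = {
        {"sam", 1200, 35},
        {"corrupt", -999, 900000},   // the kind of record that caused failures
    };
    for (const auto& r : fromDb) {
        std::string why;
        if (!Validate(r, &why))
            std::printf("rejecting record '%s': %s\n", r.name.c_str(), why.c_str());
    }
}
```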

  34. Patching / New Content / Custom Content • Oops! • Initial patch budget of 1 MB was blown in the first week of operations • New content required a stronger, more predictable process • Custom content required infrastructure able to easily add new content on the fly • Key issue: all effort had gone into going live, not into creating a sustainable process once live • Conclusion: designing these in would have been much easier than retrofitting…

  35. Lessons Learned • autoTest: scripted test clients and instrumented code rock! • Collection, aggregation and display of test data is vital for making day-to-day decisions • Lessen the panic • Scale & break is a very clarifying experience • Stable code & servers greatly ease the pain of building an MMP game • Hard data (not opinion) is both illuminating and calming • autoBuild: make it push-button with instant web visibility • Use early, use often to get bugs out before going live • Budget for a strong architect role & a strong design review process for the entire game lifecycle • Scalability, testability, patching & new content & long-term persistence are requirements: MUCH cheaper to design in than to frantically retrofit • The KISS principle is mandatory, as is expecting change

  36. Lessons Learned • Visibility: tremendous volumes of data require automated collection & summarization • Provide drill-down access to details from summary-view web pages • Get some people on board who’ve been burned before: a lot of TSO’s pain could have been easily avoided, but there was little distributed-systems experience or awareness of MMP design issues in the early phases of the project • Fred Brooks, the 31st programmer • Strong tools & process pay off for large teams & long-term operations • Measure & improve your workspace, constantly • Non-determinism is painful & unavoidable • Minimize its impact via explicit design support & use strong, constant calibration to understand it

  37. Biggest Wins • Code isolation • Scaffolding • Tools: build / test / measure, information management • Pre-checkin regression / load testing

  38. Biggest Losses • Architecture: massively peer-to-peer • Early lack of tools • #ifdef across platform / function • “Critical path” dependencies More details: www.maggotranch.com/MMP (3 TSO Lessons Learned talks)
