1 / 39

Common C++ Performance Mistakes in Games

This talk discusses common performance mistakes in game development using C++. Topics include issues with the STL, NIH syndrome, tool knowledge, C runtime, and function calls.

dmarquez
Download Presentation

Common C++ Performance Mistakes in Games

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Common C++ Performance Mistakes in Games Pete Isensee Xbox Advanced Technology Group

  2. About the Data • ATG reviews code to find bottlenecks and make perf recommendations • 50 titles per year • 96% use C++ • 1 in 3 use “advanced” features like templates or generics

  3. Why This Talk Is Important • The majority of Xbox games are CPU bound • The CPU bottleneck is often a language or C++ library issue • These issues are not usually specific to the platform

  4. Format • Definition of the problem • Examples • Recommendation • For reference • A frame is 17 or 33 ms (60fps / 30fps) • Bottlenecks given in ms per frame

  5. Issue: STL • Game using std::list • Adding ~20,000 objects every frame • Rebuilding the list every frame • Time spent: 6.5 ms/frame! • ~156K overhead (2 pointers per node) • Objects spread all over the heap

  6. std::set and map • Many games use set/map as sorted lists • Inserts are slow (log(N)) • Memory overhead: 3 ptrs + color • Worst case in game: 3.8 ms/frame

  7. std::vector • Hundreds of push_back()s per frame • VS7.1 expands vector by 50% • Question: How many reallocations for 100 push_back()s? • Answer: 13! (1,2,3,4,5,7,10,14,20,29,43,64,95)

  8. Clearly, the STL is Evil

  9. Use the Right Tool for the Job • The STL is powerful, but it’s not free • Filling any container is expensive • Be aware of container overhead • Be aware of heap fragmentation and cache coherency • Prefer vector, vector::reserve()

  10. The STL is Evil, Sometimes • The STL doesn’t solve every problem • The STL solves some problems poorly • Sometimes good old C-arrays are the perfect container • Mike Abrash puts it well: • “The best optimizer is between your ears”

  11. Issue: NIH Syndrome • Example: Custom binary tree • Sorted list of transparent objects • Badly unbalanced • 1 ms/frame to add only 400 items • Example: Custom dynamic array class • Poorer performance than std::vector • Fewer features

  12. Optimizations that Aren’t void appMemcpy( void* d, const void* s, size_t b ) { // lots of assembly code here ... } appMemcpy( pDest, pSrc, 100 ); // bottleneck • appMemcpy was slower than memcpy for anything under 64K

  13. Invent Only What You Need • std::set/map more efficient than the custom tree by 10X • Tested and proven • Still high overhead • An even better solution • Unsorted vector or array • Sort once • 20X improvement

  14. Profile • Run your profiler • Rinse. Repeat. • Prove the improvement. • Don’t rewrite the C runtime or STL just because you can. There are more interesting places to spend your time.

  15. Issue: Tool Knowledge • If you’re a programmer, you use C/C++ every day • C++ is complex • CRT and STL libraries are complex • The complexities matter • Sometimes they really matter

  16. vector::clear • Game reused global vector in frame loop • clear() called every frame to empty the vector • C++ Standard • clear() erases all elements (size() goes to 0) • No mention of what happens to vector capacity • On VS7.1/Dinkumware, frees the memory • Every frame reallocated memory

  17. Zero-Initialization struct Array { int x[1000]; }; struct Container { Array arr; Container() : arr() { } }; Container x; // bottleneck • Costing 3.5 ms/frame • Removing : arr() speeds this by 20X

  18. Know Thine Holy Standard • Use resize(0) to reduce container size without affecting capacity • T() means zero-initialize PODs. Don’t use T() unless you mean it. • Get a copy of the C++ Standard. Really. • www.techstreet.com; search on 14882 • Only $18 for the PDF

  19. Issue: C Runtime void BuildScore( char* s, int n ) { if( n > 0 ) sprintf( s, “%d”, n ); else sprintf( s, “” ); } • n was often zero • sprintf was a hotspot

  20. qsort • Sorting is important in games • qsort is not an ideal sorting function • No type safety • Comparison function call overhead • No opportunity for compiler inlining • There are faster options

  21. Clearly, the CRT is Evil

  22. Understand Your Options • itoa() can replace sprintf( s, “%d”, n ) • *s = ‘\0’ can replace sprintf( s, “” ) • std::sort can replace qsort • Type safe • Comparison can be inlined • Other sorting options can be even faster: partial_sort, partition

  23. Issue: Function Calls • 50,000-100,000 calls/frame is normal • At 60Hz, Xbox has 12.2M cycles/frame • Function call/return averages 20 cycles • A game calling 61,000 functions/frame spends 10% CPU (1.7 ms/frame) in function call overhead

  24. Extreme Function-ality • 120,000 functions/frame • 140,000 functions/frame • 130,000 calls to a single function/frame (ColumnVec<3,float>::operator[]) • And the winner: • 340,000 calls per frame! • 9 ms/frame of call overhead

  25. Beware Elegance • Elegance → levels of indirection → more functions → perf impact • Use algorithmic solutions first • One pass through the world • Better object rejection • Do AI/physics/networking less often than once/frame

  26. Inline Judiciously • Remember: inline is a suggestion • Try “inline any suitable” compiler option • 15 to 20 fps • 68,000 calls down to 47,000 • Try __forceinline or similar keyword • Adding to 5 funcs shaved 1.5 ms/frame • Don’t over-inline

  27. Issue: for loops // Example 1: Copy indices to push buffer for( DWORD i = 0; i < dwIndexCnt; ++i ) *pPushBuffer++ = arrIndices[ i ]; // Example 2: Initialize vector array for( DWORD i = 0; i < dwMax; ++i ) mVectorArr[i] = XGVECTOR4(0,0,0,0); // Example 3: Process items in world for( itr i = c.begin(); i < c.end(); ++i ) Process( *i );

  28. Watch Out For For • Never copy/clear a POD with a for loop • std::algorithms are optimized; use them memcpy( pPushBuffer, arrIndices, dwIndexCnt * sizeof(DWORD) ); memset( mVectorArr, 0, dwMax * sizeof(XGVECTOR4) ); for_each( c.begin(), c.end(), Process );

  29. Issue: Exception Handling • Most games never throw • Most games never catch • Yet, most games enable EH • EH adds code to do stack unwinding • A little bit of overhead to a lot of code • 10% size increase is common • 2 ms/frame in worst case

  30. Disable Exception Handling • Don’t throw or catch exceptions • Turn off the C++ EH compiler option • For Dinkumware STL • Define “_HAS_EXCEPTIONS=0” • Write empty _Throw and _Raise_handler; see stdthrow.cpp and raisehan.cpp in crt folder • Add #pragma warning(disable: 4530)

  31. Issue: Strings • Programmers love strings • Love hurts • ~7000 calls to stricmp in frame loop • 1.5 ms/frame • Binary search of a string table • 2 ms/frame

  32. Avoid strings • String comparisons don’t belong in the frame loop • Put strings in an table and compare indices • At least optimize the comparison • Compare pointers only • Prefer strcmp to stricmp

  33. Issue: Memory Allocation • Memory overhead • Xbox granularity/overhead is 16/16 bytes • Overhead alone is often 1+ MB • Too many allocations • Games commonly do thousands of allocations per frame • Cost: 1-5 ms/frame

  34. Hidden Allocations • push_back(), insert() and friends typically allocate memory • String constructors allocate • Init-style calls often allocate • Temporary objects, particularly string constants that convert to string objects

  35. Minimize Per-Frame Allocations • Use memory-friendly data structures, e.g. arrays, vectors • Reserve memory in advance • Use custom allocators • Pool same-size allocations in a single block of memory to avoid overhead • Use the explicit keyword to avoid hidden temporaries • Avoid strings

  36. Other Tidbits • Compiler settings: experiment • dynamic_cast: just say no • Constructors: performance killers • Unused static array space: track this • Loop unrolling: huge wins, sometimes • Suspicious comments: watch out • “Immensely slow matrix multiplication”

  37. Use the Right Tool for the Job The STL is Evil, Sometimes Invent Only What You Need Profile Know Thine Holy Standard Understand Your Options Beware Elegance Inline Judiciously Watch Out For For Disable Exception Handling Avoid Strings Minimize Per-frame Allocations Wrap Up

  38. Call to Action: Evolve! • Pass the rubber chicken • Share your C++ performance mistakes with your team • Mentor junior programmers • So they only make new mistakes • Don’t stop learning • You can never know enough C++

  39. Questions • Fill out your feedback forms • Email: pkisensee@msn.com • This presentation: www.tantalon.com/pete.htm

More Related