C++ on Next-Gen Consoles: Effective Code for New Architectures • Pete Isensee, Development Manager, Microsoft Game Technology Group
Last Year at GDC • Chris Hecker ranted • What did he say? • Programmers: danger ahead • Out-of-order execution: good • In-order execution: bad • Microsoft and Sony are going to screw you • You are so hosed. Game over, man. • “There’s absolutely nothing you can do about this”
Console Hardware Architectures • Optimized to do floating-point math • Optimized for multithreaded tasks • Optimized to run games • Not optimized to run general purpose code • Not optimized to do branch prediction, code reordering, instruction pipelining or other out-of-order magic • Large L2 caches • Large latencies
We’re Game Programmers. We Love Challenges. • We will make games on these consoles • The solution is not assembly language • The solution is to tailor our C/C++ engines, inner loops and bottleneck functions to the realities of the hardware • Remember: C++ code can make or break your game’s performance
Not Covering • Profiling (do it) • Multithreading (do it) • Memory allocation (avoid in game loop) • Compiler settings (experiment) • Exception handling (avoid it)
Topics for Today • Thinking about L2 • Optimize memory access • Use CPU caches effectively • Thinking about in-order processing • Avoid function call overhead • Tips for efficient math • Avoid hidden C++ inefficiencies
Optimize Memory Access • Proverb: thou shalt treat memory as if it were thy hard drive • You will be memory-bound on new consoles • Recommendations • Never read from the same place twice in a frame • Read data sequentially • Write data sequentially • Use everything you read
Minimize Data Passes • Game frame loops often access data twice • Or three times • Or more • Optimize for a single pass • Consider less frequent operations • AI • Physics, collision • Networking • Particle systems • [Figure: Multiple Pass Architecture]
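A minimal sketch of the single-pass idea. The `Particle` struct, the update steps and both function names are hypothetical and not from the talk. The two functions compute the same result, but the first streams the array through the cache three times per frame and the second only once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical particle type, for illustration only.
struct Particle {
    float pos;
    float vel;
    float age;
};

// Multiple passes: every particle is pulled through the cache three times.
void updateMultiPass(std::vector<Particle>& ps, float dt) {
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].vel -= 9.8f * dt;
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].pos += ps[i].vel * dt;
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].age += dt;
}

// Single pass: each particle is read once, fully updated, and written once,
// in sequential order.
void updateSinglePass(std::vector<Particle>& ps, float dt) {
    for (std::size_t i = 0; i < ps.size(); ++i) {
        Particle& p = ps[i];
        p.vel -= 9.8f * dt;
        p.pos += p.vel * dt;
        p.age += dt;
    }
}
```

Whether the fused loop wins in practice depends on the data size and the work per element, so it is worth confirming with a profiler.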
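Pointer Aliasing Explained
    void init( float *a, const float *b )
    {
        a[0] = 1.0f - *b;
        a[1] = 1.0f - *b;
    }
• Nominal case: b points outside a, so both elements receive 1.0f - *b
• Worst case: b aliases a[0]
    float a[2]={0.0f};
    init( a, &a[0] );
• Here a[0] becomes 1.0 and a[1] becomes 0.0, because the second read of *b sees the value just stored in a[0]
• [Figure: memory diagrams of a and b for the nominal case and the worst case]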
A Solution: Restrict • Restrict keyword tells the compiler there’s no aliasing • Restrict permits the compiler to generate much more efficient code
    void init( float* __restrict a, const float* __restrict b )
    {
        a[0] = 1.0f - *b;   // compiler can do
        a[1] = 1.0f - *b;   // the right thing
    }
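A small sketch contrasting the two versions of `init`. The names `initMayAlias`, `initNoAlias` and `aliasingDemo` are made up for this example. `__restrict` is a compiler extension (accepted by MSVC, GCC and Clang), not standard C++, and it is a promise made by the caller: passing overlapping pointers to the restrict version is undefined behavior, so the aliased call below is made only on the unrestricted one.

```cpp
#include <cassert>

// Without restrict the compiler must assume a and b may overlap,
// so *b has to be reloaded after the store to a[0].
void initMayAlias(float* a, const float* b) {
    a[0] = 1.0f - *b;
    a[1] = 1.0f - *b;
}

// With __restrict the caller promises that a and b never overlap,
// so *b can be loaded once and kept in a register.
void initNoAlias(float* __restrict a, const float* __restrict b) {
    a[0] = 1.0f - *b;
    a[1] = 1.0f - *b;
}

// Worst case from the slide: b aliases a[0].
void aliasingDemo() {
    float a[2] = {0.0f, 0.0f};
    initMayAlias(a, &a[0]);
    assert(a[0] == 1.0f);
    assert(a[1] == 0.0f);  // 1.0f - a[0] after the first store, not 1.0f - 0.0f
}
```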
What to Restrict • Use restrict widely • Pointer parameters of functions • Local pointers • Pointers in structs/classes • But not on: • Function return types • Casts • Global pointers (maybe) • References (maybe)
Use the CPU Caches Effectively • The L2 cache is your best friend • Using the cache well is an art • Ensure you have a good profiler by your side
Keep the Working Set Small • Pack commonly used data together • Frequently used data might deserve its own struct/class • Keep rarely used data separate • Example: texture file names • Consider bitfields • Bitfields are extremely efficient on PowerPC • Consider other forms of lossless compression
Inefficient Structs Are Bad Mojo
    struct InefficientCar
    {
        bool manual;        // padding here
        wheel wheels[8];    // 8 wheels?
        bool convertible;   // more pad
        char engine;        // 4 bits used
        char file[32];      // rarely used
        double maxAccel;    // double?
    };
• sizeof(InefficientCar) = 80
Carefully Design Structures
    struct EfficientCar
    {
        wheel wheels[4];        // 4 wheels
        wheel *moreWheels;
        char *file;             // stored elsewhere
        float maxAccel;         // float
        unsigned engine:4;      // bitfields
        unsigned manual:1;
        unsigned convertible:1;
    };
• sizeof(EfficientCar) = 32
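A sketch that lets the compiler check the size claim. The slides never define `wheel`, so a 4-byte `wheel` holding a single float is assumed here. Under that assumption the first struct should come out at 80 bytes on common ABIs and the second at 32 bytes with 32-bit pointers (larger with 64-bit pointers), which is why the check only asserts the ordering, not exact sizes.

```cpp
#include <cstddef>

// Hypothetical 4-byte wheel; the real type is not shown in the talk.
struct wheel { float radius; };

struct InefficientCar {
    bool   manual;        // followed by padding up to wheel alignment
    wheel  wheels[8];
    bool   convertible;
    char   engine;        // only 4 bits are needed
    char   file[32];      // rarely used, yet carried in every instance
    double maxAccel;      // forces 8-byte alignment on most ABIs
};

struct EfficientCar {
    wheel    wheels[4];
    wheel*   moreWheels;  // extra wheels stored elsewhere
    char*    file;        // file name stored elsewhere
    float    maxAccel;
    unsigned engine      : 4;
    unsigned manual      : 1;
    unsigned convertible : 1;
};

// Fails the build if a later edit makes the "efficient" layout the larger one.
static_assert(sizeof(EfficientCar) < sizeof(InefficientCar),
              "EfficientCar should be the smaller layout");
```

A `static_assert` on `sizeof` is a cheap way to notice when a later change silently grows a hot structure.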
Choose the Right Container • Prefer contiguous containers • Or at least mostly contiguous • Examples: array, vector, deque • Avoid node-based containers • List, set/map, binary trees, hash tables • If you must use a tree, consider a custom allocator for memory locality • Vector + std::sort is often faster (and smaller) than set or map or hash tables, by an order of magnitude
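One way to use a vector in place of a set, as a rough sketch. The helper names `makeSortedSet` and `contains` are invented for this example. The data is sorted and de-duplicated once, then looked up with a binary search over contiguous memory.

```cpp
#include <algorithm>
#include <vector>

// Build step: sort and drop duplicates so the vector behaves like a set.
void makeSortedSet(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}

// Lookup step: binary search; requires that v is already sorted.
bool contains(const std::vector<int>& sorted, int key) {
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```

This fits best when the data is built once (or rarely) and queried often. Frequent insertions into the middle of a sorted vector shift elements and can cancel the gain.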
Avoid Function Call Overhead • Function call overhead was a surprising cause of performance issues on Xbox • The same is true on Xbox 360 and PS3 • Fortunately, there are lots of solutions • Research compiler settings. On Xbox 360: • Inline “any suitable” • Enable link-time code generation • Spend time ensuring the compiler is inlining the right things
Avoid Virtual Functions • Weigh the limitations of virtual functions • Adds a branch instruction • Branch is always mispredicted • Compiler is limited in how it can optimize • Consider replacing • virtual void Draw() = 0; • With • Xbox360.cpp: void Draw() { ... } • Windows.cpp: void Draw() { ... } • PS3.cpp: void Draw() { ... }
Maximize Leaf Functions • Leaf functions don’t call other functions, ever • If a potential leaf function calls another function, the high-level function: • Is much less likely to be inlined • Must set up a stack frame • Must set up registers • Potential solutions • Remove the inner function completely • Inline the inner function • Provide two versions of the outer function
Unroll Inner Loops • Compiler can’t unroll loops where n is variable • Even unrolling from ++i to i+=4 can be a significant gain • Eliminates three branch instructions • Increases opportunity for code scheduling • Don’t forget to hoist invariants out, too
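Example Unrolling
    // original
    for( i=a.beg(); i!=a.end(); ++i )
        process(i);
    // unrolled by 4; assumes the element count is a multiple of 4,
    // otherwise i steps past e
    e = a.end();    // invariant hoisted out of the loop
    for( i=a.beg(); i!=e; i+=4 )
    {
        process(i);
        process(i+1);
        process(i+2);
        process(i+3);
    }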
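The unrolled loop on the slide only terminates cleanly when the element count is a multiple of four. A hedged sketch of the usual fix follows: unroll the bulk of the range and finish the leftovers one at a time. The `process` body and the `processAll` name are placeholders for this example.

```cpp
#include <cstddef>

// Hypothetical per-element work standing in for process() on the slide.
inline void process(float* p) { *p *= 2.0f; }

void processAll(float* data, std::size_t n) {
    float* i = data;
    float* const e4 = data + (n - n % 4);  // end of the part unrolled by 4
    float* const e  = data + n;            // hoisted loop invariant

    // Main loop: one branch per four elements.
    for (; i != e4; i += 4) {
        process(i);
        process(i + 1);
        process(i + 2);
        process(i + 3);
    }
    // Remainder loop: at most three leftover elements.
    for (; i != e; ++i) {
        process(i);
    }
}
```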
Pass Native Types by Value • Tradition says that “large” types are passed by pointer or reference, but be careful • New consoles have really large registers • Native types include • 64-bit int (__int64) • VMX vector (__vector4) – 128 bits! • Pass structs by pointer or reference • One exception: pass structs consisting of bitfields <= 64 bits by value
Know Data Type Performance • int32 and int64 have equivalent perf • float and double have equivalent perf • int8 and int16 are slower than int • They generate extra instructions • High bits cleared or sign-extended • Example: int32 adds 2X faster than int16 adds • Recommendations • Store as smallest type required • Load into int32, int64 or double for calculations
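A tiny sketch of the "store small, compute wide" recommendation. The `sumSamples` function and its sample data are invented for this example. The samples stay 16-bit in memory to keep the working set small, while the loop counter and the accumulator use register-sized types.

```cpp
#include <cstddef>
#include <cstdint>

// Samples are stored compactly as int16 but accumulated in an int32,
// so the running total needs no narrowing after each add.
std::int32_t sumSamples(const std::int16_t* samples, std::size_t n) {
    std::int32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += samples[i];  // widened once on load
    }
    return total;
}
```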
Use Native Vector Types • In CS 101, you learned to create abstract data types, such as matrices typedef std::vector<float,4> vec; typedef std::vector<vec,4> matrix; • This code is an abomination • At least on Xbox 360 and PS3 • Xbox 360 and PS3 have dedicated vector math units called VMX units • Use them!
Your Math Buddies • __vector4 (4 32-bit floats; 128-bit register) • XMVECTOR (typedef for vector4) • XMMATRIX (array of 4 vector4s) • XMVECTOR operators (+,-,*,/) • Hundreds of XMVECTOR and XMMATRIX functions • Xbox 360-specific, but similar constructs in PS3 compilers
Avoid Floating-Point Branches • FP branches are slow • Cache has to be flushed • ~10X slower than int branches • Avoid loops with float test expressions • Eliminate altogether if possible • Can be faster to calculate values you won’t use! • Compare integers instead • Replace with fsel when possible • 10-20X performance gain
The fsel Option in Detail • Definition of hardware implementation:
    float fsel(float a, float b, float c)
    {
        return ( a >= 0.0f ) ? b : c;
    }
• You can replace expressions like
    v = ( w < x ) ? y : z;      // slow
• With faster expressions like
    v = fsel( w - x, z, y );    // turbo
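A portable sketch for experimenting with the select idiom off-console. It assumes the semantics of the PowerPC `fsel` instruction, which selects its second operand when the first is greater than or equal to zero and its third operand otherwise. `fselPortable`, `pickBranch` and `pickSelect` are made-up names. The stand-in still compiles to whatever the host compiler chooses, so it shows the rewrite, not the speedup.

```cpp
// Stand-in with fsel semantics: b when a >= 0, otherwise c.
inline float fselPortable(float a, float b, float c) {
    return (a >= 0.0f) ? b : c;
}

// Branchy form of a comparison-driven select.
inline float pickBranch(float w, float x, float y, float z) {
    return (w < x) ? y : z;
}

// Select form: w < x means w - x < 0, which picks the third operand,
// so y goes last and z goes in the middle.
inline float pickSelect(float w, float x, float y, float z) {
    return fselPortable(w - x, z, y);
}
```

The two forms agree for ordinary finite inputs. They can differ for NaN operands or when `w - x` is not representable as expected (for example, infinity minus infinity), so the rewrite is only safe where those cases cannot occur.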
Prefer Platform-Specific Funcs • The C runtime (CRT) is not usually the best option when performance matters • Xbox 360 examples • Prefer CreateFile to fopen or C++ streams • Options for asynchronous reads and other goodness • Prefer XMemCpy to memcpy • 2-6X faster • Prefer XMemSet to memset • 8-14X faster
Avoid Hidden C++ Inefficiencies • C++ rocks the house! • C++ can bring your game to its knees! • Consider these innocuous snippets • Quaternion q; • s.push_back( k ); • if( (float)i > f ) • obj->Draw(); • GameObject arr[1000]; • a = b + c; • i++;
C++ is Dangerous • With power comes responsibility • Beware constructors • Is initialization the right thing to do? • Beware hidden allocations • Conversion casts may have significant cost • Use virtual functions with care • Beware overloaded operators • Stick to known idioms • Operator++ should be a constant-time operation. • Really.
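To make "beware hidden allocations" concrete, here is a small sketch that counts how often `push_back` has to grow a vector's buffer. The `countReallocations` helper is hypothetical. Each growth is a heap allocation plus a copy or move of every element, none of which is visible at the call site.

```cpp
#include <cstddef>
#include <vector>

// Returns how many times the vector's capacity changed while n ints
// were appended; each change means a reallocation behind push_back.
std::size_t countReallocations(std::size_t n, bool reserveFirst) {
    std::vector<int> v;
    if (reserveFirst) {
        v.reserve(n);  // one up-front allocation
    }
    std::size_t reallocations = 0;
    std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != capacity) {
            ++reallocations;
            capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With `reserveFirst` set, the count stays at zero. Without it, the exact count depends on the standard library's growth policy.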
Summary • There absolutely are many things you can do to efficiently program next-gen consoles • Two key issues: L2/memory and in-order processing • Treat memory as you would a hard disk • Watch out for those branches; use tricks like fsel • Prefer a light C++ touch
What’s Next • Our games are only as good as the weakest member of the team • Share what you’ve learned • “The sharing of ideas allows us to stand on one another’s shoulders instead of on one another’s feet” – Jim Warren
Questions • pkisensee@msn.com • Fill out your feedback forms