1. GDCE August 31 - 2001 Optimising for Direct3D Richard Huddy
NVIDIA Corporation
RichardH@nvidia.com
3. Objectives (unchanged)
Run at monitor refresh rate (not just 60Hz) and with sufficient headroom to be able to stay there…
Remember that refresh rates go over 75Hz...
Scale with processor and graphics power
Scalability is probably the toughest problem to overcome on the PC
Include low spec machines without adding an extra 12 months to the schedule
But let’s all drop support for software rasterizers
And consider H/W TnL only (40M+ TAM already)
4. How to get there…
Looking inside with:
VTune
Which tells you about what the CPU has been up to…
Build with symbols
Build ‘Release’
Prevent inlining of functions
NVTune (NVIDIA’s “Stats Drivers”)
Which tells you what the driver was told…
This can be very different from what you told the API…
Includes quite a lot of runtime activity to do with forcing known-good state after BeginScene()
5. Can’t I just use RDTSC?
No – because RDTSC tells you how many clock cycles the call took until it returned, not how long the work took to complete
Can I just comment out some of the calls?
Yes – but be careful about the dynamics
6. Pure Software Optimisation
Watch out for C++ constructors etc.
I’ve seen games which are limited by the ability of the CPU to construct matrices and vectors!
STL is fine if you’re careful – but you must be careful
__ftol – look for it in the profile
When you unroll your loops don’t unroll by more than 8
and consider re-rolling them when profiling
Always profile at the lowest possible resolution and without waiting for VBlank
it’s the equivalent of running on hardware from two or three years into the future
7. AGP writes from CPU
Write combining means that AGP writes should look like this:
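The slide's illustration is not reproduced in this transcript. As a rough sketch of the usual write-combining advice (not the author's own code): build each vertex in ordinary cached memory, then store it to the destination once, moving forward through memory, and never read the destination back. `Vertex`, `fillSequential` and the `heights` input are hypothetical names, and `dst` merely stands in for the pointer you would get from locking a vertex buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 32-byte vertex layout, used only for illustration.
struct Vertex {
    float x, y, z;     // position
    float nx, ny, nz;  // normal
    float u, v;        // texture coordinates
};

// Fill 'count' vertices into 'dst', which stands in for a locked,
// write-combined buffer. Each vertex is assembled in a local variable and
// then stored once; 'dst' is walked front to back and never read.
void fillSequential(Vertex* dst, const float* heights, std::size_t count)
{
    assert(dst != nullptr && heights != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        Vertex v;                              // lives in cached memory
        v.x = static_cast<float>(i);
        v.y = heights[i];
        v.z = 0.0f;
        v.nx = 0.0f; v.ny = 1.0f; v.nz = 0.0f;
        v.u = static_cast<float>(i) / static_cast<float>(count);
        v.v = 0.0f;
        dst[i] = v;                            // one whole-vertex store
    }
}
```

The pattern to avoid is the opposite one: skipping around the buffer, writing only some fields of each vertex, or reading back what was just written.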
8. AGP reads from CPU
9. D3D Hardware TnL Principles I
Don’t switch VB more than necessary
That remains true for DX8
Don’t switch FVF either
Because that always switches VB
Prefer the most compact FVF
But there are special considerations too
Disable LOCALVIEWER
Unless it’s a clear quality win
10. Hardware TnL Principles II
Be clever in your use of Vertex Buffers
Correct use of flags is critical
The more you tell us the more we know
But don’t get too clever. Implementing your own round-robin scheme is the wrong way to handle it
Tuning for H/W is mostly the same but…
Almost never use ProcessVertices to help out
Sparse VBs don’t hurt hardware (but S/W dies…)
11. Hardware TnL Principles III
Worry less about individual CPU cycles
Send your polys in batches of 200+
Prefer strips to lists and fans
Degenerate tris are handy here
Though it won’t matter much for low poly scenes
Indexing is a great idea
It’s the only way to get the cache working for you
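One way to use degenerate triangles is to stitch several indexed strips into a single strip so they can go down in one call. The sketch below is an illustration under stated assumptions rather than the author's code: `stitchStrips` and `countRealTriangles` are hypothetical helpers, indices are 16-bit, and a triangle counts as degenerate when any two of its indices match. An extra duplicate is added when the first strip has an odd length so the second strip keeps its original winding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::uint16_t;

// Join strip 'b' onto strip 'a' with degenerate (zero-area) triangles.
// The second strip must start at an even position to keep its winding,
// so an odd-length 'a' gets one extra repeated index.
std::vector<Index> stitchStrips(const std::vector<Index>& a,
                                const std::vector<Index>& b)
{
    if (a.empty()) return b;
    if (b.empty()) return a;

    std::vector<Index> out(a);
    out.push_back(a.back());            // repeat last index of 'a'
    if (a.size() % 2 != 0)
        out.push_back(a.back());        // parity fix for odd-length 'a'
    out.push_back(b.front());           // repeat first index of 'b'
    out.insert(out.end(), b.begin(), b.end());
    return out;
}

// Count the triangles in a strip that are not degenerate.
std::size_t countRealTriangles(const std::vector<Index>& strip)
{
    std::size_t count = 0;
    for (std::size_t k = 0; k + 2 < strip.size(); ++k) {
        const Index i0 = strip[k], i1 = strip[k + 1], i2 = strip[k + 2];
        if (i0 != i1 && i1 != i2 && i0 != i2)
            ++count;
    }
    return count;
}
```

Stitching `{0,1,2,3}` to `{4,5,6,7}` yields `{0,1,2,3,3,4,4,5,6,7}`: four real triangles plus four degenerate ones.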
12. Hardware TnL – Handling LOD
Static LOD for all things that are animated
To save CPU cycles when they’re distant
3 Levels of LOD should suffice
Don’t bother with any LOD for most instances
It’s faster to just draw it…
Landscapes
Use no LOD unless you’re writing a flight sim
Impostors need special handling
Because they’re usually drawn in small batches
13. GeForce Principles I
The exact FVF to use can depend on the pattern of use…
True random access benefits from n*32 byte formats (see website for more detail)
Sequential access into the VB generally benefits most from small FVFs
Think of the GPU’s memory cache and this all makes sense…
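To make the two cases concrete, here is a small sketch comparing a compact layout with one padded to a multiple of 32 bytes. The struct names and `roundUpTo32` are hypothetical, and the explicit `pad` member is only a stand-in: how the padding is expressed in a real FVF (for example as an unused texture coordinate set) is outside this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical compact layout for sequentially accessed vertices:
// position + packed diffuse colour + one texture coordinate set.
struct CompactVertex {
    float x, y, z;
    std::uint32_t diffuse;
    float u, v;
};

// Hypothetical layout for randomly accessed vertices: the same data
// padded so that the stride is a multiple of 32 bytes.
struct PaddedVertex {
    float x, y, z;
    std::uint32_t diffuse;
    float u, v;
    float pad[2];   // unused; only there to reach 32 bytes
};

static_assert(sizeof(CompactVertex) == 24, "compact layout should be 24 bytes");
static_assert(sizeof(PaddedVertex) == 32, "padded layout should be 32 bytes");

// Round a vertex size up to the next multiple of 32 bytes.
constexpr std::size_t roundUpTo32(std::size_t bytes)
{
    return (bytes + 31) / 32 * 32;
}
```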
14. GeForce Principles II
Try to load balance the pixel engine and the vertex engine
Pixels are stamped out in 2x2 blocks per clock
Or in 2x1, or in 1x1 blocks when using heavy MultiTexture
Vertices are cached:
Before TnL (just as raw memory)
After TnL (so re-using these saves vertex engine work)
15. GeForce Principles III
That means that the cost of a vertex shader can be paid once, or multiple times – depending on whether you achieve cache re-use
The only way to save the full pixel cost is to use the Z buffer…
Fast Z rejects are very early
So, draw roughly front to back
That’s a per-object sort, NOT per poly
Use Clear() to clear – don’t draw triangles to do it
Don’t ask for stencil unless you really want it
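A rough front-to-back ordering only needs one depth value per object. The sketch below assumes a hypothetical `SceneObject` record whose `viewDepth` is the distance from the camera to the object's centre (or bounding sphere); nothing here is taken from the talk beyond the idea of sorting per object rather than per polygon.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-object record; 'viewDepth' is assumed to be the
// camera-space distance to the object's centre, computed elsewhere.
struct SceneObject {
    int   id;
    float viewDepth;
};

// Order objects nearest-first so that later, more distant objects can be
// rejected by the Z test before their pixels are shaded.
void sortFrontToBack(std::vector<SceneObject>& objects)
{
    std::sort(objects.begin(), objects.end(),
              [](const SceneObject& a, const SceneObject& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```

The ordering only needs to be approximate, so it can be cheap; alpha-blended objects usually need their own back-to-front pass and are not covered here.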
16. Win2K considerations
Kernel transitions are much more expensive
But are minimized by the runtime’s batching into command queues
You can’t ignore/trap ALT-TAB
AGP support is ‘imperfect’ – see www.geforcefaq.com for some handy tips
You need SP2 installed plus our DX8 drivers for the best experience…
It’s no good profiling on Win2K – this won’t tell you about the real consumers’ experience
17. Anti-Aliasing
Q: How much memory does AA use?
A: The algorithm is simple:
Take the memory for the non-AA case
Subtract the memory used by the normal Z buffer
Add the memory for a single super colour buffer and a single super Z buffer
Each super buffer is a normal buffer multiplied by the sample count
With triple buffering that’s 3 normal buffers, 1 super colour buffer and 1 super Z buffer
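The arithmetic above can be written out directly. This is a sketch under stated assumptions: `DisplayMode`, `nonAaBytes` and `aaBytes` are hypothetical names, the "normal" buffers are taken to be the colour buffers of the swap chain, and real drivers may add alignment or padding that is not modelled here.

```cpp
#include <cstdint>

// Hypothetical description of a display mode.
struct DisplayMode {
    std::uint64_t width;
    std::uint64_t height;
    std::uint64_t colourBytesPerPixel;
    std::uint64_t zBytesPerPixel;
    std::uint64_t colourBuffers;   // 3 for triple buffering
};

// Memory for the non-AA case: the colour buffers plus one Z buffer.
std::uint64_t nonAaBytes(const DisplayMode& m)
{
    const std::uint64_t pixels = m.width * m.height;
    return pixels * (m.colourBuffers * m.colourBytesPerPixel + m.zBytesPerPixel);
}

// Memory for the AA case, following the steps on the slide:
// non-AA total, minus the normal Z buffer, plus one super colour buffer
// and one super Z buffer (each scaled by the sample count).
std::uint64_t aaBytes(const DisplayMode& m, std::uint64_t sampleCount)
{
    const std::uint64_t pixels      = m.width * m.height;
    const std::uint64_t normalZ     = pixels * m.zBytesPerPixel;
    const std::uint64_t superColour = sampleCount * pixels * m.colourBytesPerPixel;
    const std::uint64_t superZ      = sampleCount * normalZ;
    return nonAaBytes(m) - normalZ + superColour + superZ;
}
```

For 1024x768 with 32-bit colour, 32-bit Z, triple buffering and 4 samples, this gives 12,582,912 bytes (12 MB) without AA and 34,603,008 bytes (33 MB) with it.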
18. Anti-Aliasing
Q: Why should I care?
A: Because if you run out of local video memory you are likely to have to texture from AGP memory – which is noticeably slower
19. Vertex Caches
Learn what typical hardware can do to help
TNT has a 16-vertex cache
GeForce has an effective 10-vertex cache
GeForce3 has an effective 18-vertex cache
Other cards will vary, but bigger is always better
NVIDIA’s Vertex caches are FIFOs (not LRU)
Re-use recent indices
‘Recent’ is defined by the cache size
Use highly local winding in your strips
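To see how an index order interacts with a FIFO cache, it can help to simulate one offline. The function below is a simplified model, not a description of the actual hardware: `countCacheMisses` is a hypothetical helper, a hit leaves the FIFO order untouched (which is what distinguishes FIFO from LRU), and each miss stands for one vertex that has to be transformed again.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Count how many indices miss a FIFO vertex cache of 'cacheSize' entries.
// On a hit nothing moves; on a miss the index is appended and, if the
// cache is over capacity, the oldest entry is evicted.
std::size_t countCacheMisses(const std::vector<std::uint16_t>& indices,
                             std::size_t cacheSize)
{
    std::deque<std::uint16_t> fifo;
    std::size_t misses = 0;
    for (std::uint16_t idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) != fifo.end())
            continue;                   // hit: FIFO order is unchanged
        ++misses;
        fifo.push_back(idx);
        if (fifo.size() > cacheSize)
            fifo.pop_front();           // evict the oldest entry
    }
    return misses;
}
```

Running the same index list with cache sizes of 10, 16 and 18 gives a quick feel for how well a mesh ordering suits each of the chips listed above.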
20. What is “highly local winding”?
It’s the scheme in which you revisit previous vertices to make full use of them before they escape from the cache.
On simple meshes it’s this kind of pattern:
22. The Limits of Profilers I
A profile tells you what the CPU was doing (and it shows you the effects, not the causes).
It doesn’t (usually) tell you:
Why
Who provoked it
Whether it overlapped graphics activity
What you intended…
What would improve things
23. The Limits of Profilers II
They never tell you what the GC was doing
Remember that Lock() is a highly parallel operation
They don’t ever tell you directly:
Why the GC was so busy
Whether the run-time is a real problem, or,
Whether you have been bullying the run-time
What to do about it
24. How to interpret a profile…
There are no “Good” Profiles
Only ever:
“Bad” Profiles
or,
“Not necessarily bad” profiles
25. Aquanox – System load
Driver = 58%
Other32B = 4%
Aquanox = 24%
Promising…
26. Aquanox – what the driver is doing
SpinLock = 48%
The driver is just looping, waiting for space (i.e. the chip is backed-up)
27. Recent shipping D3D Title
Game3DD = 54%
GAME_exe = 4%
Msvcrt = 9%
3D_D3D = 6%
NVDD32 = 12%
?
28. 3DMark “Car chase” - High Detail
29. 3DMark “Lobby scene” - Low Detail
30. What else?
Indexed Tri-Strips are the best primitives that there are…
Use DX8 index buffers
Fewer VBs still
The Stats driver and VTune will tell you about the bad things you may be doing
31. Why do IHVs complain about Lock()?
OK. Let’s try a multiple choice question:
When is locking a RenderTarget slow?
32. Summary
Use VBs and IBs for everything
Never Lock anything that’s not a VB or IB
FB locks cost you CPU time
Batch like crazy
Don’t do anything you don’t need to
And certainly don’t do it twice…
33. The profile says it’s the runtime…
There are several possibilities
Maybe the runtime is handling something badly…
Or maybe:
You’re calling the runtime too often
You’re asking it to do something the GC should handle (like clipping, or transforming)
I’ve never seen a well written, balanced app that is runtime-limited on the GeForce family
34. The profile says it’s the driver…
This can typically be caused by two sorts of behaviour:
Good apps, or
Excessive Locks
Poor use of VBs
Being hardware limited
“App too simple” – like many of the SDK samples
After all, they’re not performance tutorials, they’re all about functionality
35. It should be in the startup code…
Be cautious about using expensive API methods in performance sensitive code:
ValidateDevice()
CreateVB()
DestroyVB()
CreateStateBlock()
AssembleVertexShader()
CreateTexture()
36. Stencil when you don’t want it
If you clear only Z then we have to preserve Stencil - even if you’re not using it. And that means Clear() runs slowly
Read-Modify-Write every pixel
So, don’t clear by drawing polygons
If there’s no stencil value in there then we can write anything we like…
37. DrawPrimitiveUP
How many times do I have to say,
“Use VBs for everything”
There are almost zero cases where avoiding VBs makes sense. If you think you have one of those cases then:
a) You haven’t really - talk to me about it
b) Consult a doctor – it’s as bad as that
38. Not clipping when using a VShader
The Vertex Shader route requires you to turn clipping on or off in the same way as the FF route
The standard pipeline description might (wrongly) be taken as indicating that clipping always happens after the shader, but you need to enable clipping to make it happen.
The GeForce family won’t care…
Because they have an infinite guardband
39. Making Render Targets fast
Use a Z buffer which matches your render target in:
Colour depth,
Width, and,
Height
This may mean you need multiple Z buffers, but they’re worth it
When clearing a render target try to Clear() all of it – not just some part, and don’t draw triangles to clear it
40. Don’t let your art control your code
Since you need to be able to render in sizeable batches you must not allow the art path to break this…
So, a landscape which is made up of tiles with many textures dotted around will be hard to batch
Instead try to get the artists to differentiate areas with a small number of detail textures
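The effect of texture variety on batching can be estimated with a few lines of code. The sketch below is an illustration only: `Tile`, `countBatches` and `sortByTexture` are hypothetical, and it assumes that consecutive tiles sharing a texture can be merged into one draw call (in practice that also requires their geometry to sit together in the vertex and index buffers).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical landscape tile: only the texture it needs matters here.
struct Tile {
    int tileId;
    int textureId;
};

// Number of draw calls needed if consecutive tiles that share a texture
// are merged and every texture change starts a new batch.
std::size_t countBatches(const std::vector<Tile>& tiles)
{
    std::size_t batches = 0;
    for (std::size_t i = 0; i < tiles.size(); ++i) {
        if (i == 0 || tiles[i].textureId != tiles[i - 1].textureId)
            ++batches;
    }
    return batches;
}

// Group tiles by texture so each texture is set once per frame.
void sortByTexture(std::vector<Tile>& tiles)
{
    std::stable_sort(tiles.begin(), tiles.end(),
                     [](const Tile& a, const Tile& b) {
                         return a.textureId < b.textureId;
                     });
}
```

After sorting, the number of batches equals the number of distinct textures, which is why a small texture set matters more than the submission order.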
41. Drawing text one character at a time
Did I mention batching?
Case I saw:
No VBs for text, drawing around 30 chars per frame
Ran text alone at 160 fps on a Pentium III ~600MHz
That’s roughly twice as fast as Quake3*
(*) But then Quake 3 does physics, renders the world, casts shadows, handles animations, deals with input, and a few other things…
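The alternative to one draw call per character is to build the whole string into a single vertex array and submit it once. The sketch below shows only the CPU-side assembly and rests on several assumptions: `TextVertex` and `buildTextBatch` are hypothetical names, glyphs are fixed width, and the font texture is assumed to be a 16x16 grid of cells indexed by character code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical screen-space vertex for text quads.
struct TextVertex {
    float x, y;   // position
    float u, v;   // texture coordinates into the font atlas
};

// Build two triangles (six vertices) per character for the whole string,
// so the result can be copied into one vertex buffer and drawn in one call.
std::vector<TextVertex> buildTextBatch(const std::string& text,
                                       float originX, float originY,
                                       float glyphW, float glyphH)
{
    const float cell = 1.0f / 16.0f;    // assumed 16x16 glyph atlas
    std::vector<TextVertex> verts;
    verts.reserve(text.size() * 6);

    float penX = originX;
    for (char ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        const float u0 = static_cast<float>(c % 16) * cell;
        const float v0 = static_cast<float>(c / 16) * cell;
        const float u1 = u0 + cell;
        const float v1 = v0 + cell;
        const float x0 = penX,    x1 = penX + glyphW;
        const float y0 = originY, y1 = originY + glyphH;

        verts.push_back({x0, y0, u0, v0});   // first triangle
        verts.push_back({x1, y0, u1, v0});
        verts.push_back({x0, y1, u0, v1});
        verts.push_back({x1, y0, u1, v0});   // second triangle
        verts.push_back({x1, y1, u1, v1});
        verts.push_back({x0, y1, u0, v1});

        penX += glyphW;                      // advance to the next glyph
    }
    return verts;
}
```

Thirty characters then become 180 vertices in one batch instead of thirty separate draw calls.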
42. Failing to optimize for the CPU
Because even with modern fast CPUs most games are still CPU limited
Lock is primarily a CPU cost – because the CPU can’t do anything productive while it waits for the lock
This means that Lock is “cheap” on systems where the CPU is not the bottleneck.
43. NVTune I
Analyze interaction with driver
Works on NVIDIA hardware only
Windows 98/Windows 2000 capable
Hotkey capable
Online help via F1 function key
Logging
Frame Rate Display
Natural Extension to VTune
44. NVTune II
Future enhancements include:
Bottleneck analysis inside chip (i.e. fill bound, memory bound or T&L bound?)
Number of Z-buffer checks vs. Z-buffer writes (depth complexity)
Cross machine (TCP/IP) capture
Logging of clock cycles on a per function basis.
Capture start on frame rate threshold
Your suggestions?
45. NVTune III
NVTune is available free at http://partners.nvidia.com/Marketing/Developer/SwDevStaticPages.nsf/pages/
You must be a registered NVIDIA developer
46. NVTune in action…
Do a demo here…
47. Questions…?