Optimising for Direct3D

1. GDCE August 31 - 2001 Optimising for Direct3D Richard Huddy NVIDIA Corporation RichardH@nvidia.com

3. Objectives (unchanged)
   - Run at monitor refresh rate (not just 60Hz) and with sufficient headroom to be able to stay there...
     - Remember that refresh rates go over 75Hz...
   - Scale with processor and graphics power
     - Scalability is probably the toughest problem to overcome on the PC
   - Include low spec machines without adding an extra 12 months to the schedule
     - But let's all drop support for software rasterizers
     - And consider H/W TnL only (40M+ TAM already)

4. How to get there...
   - Looking inside with:
   - VTune
     - Which tells you what the CPU has been up to...
     - Build with symbols
     - Build "Release"
     - Prevent inlining of functions
   - NVTune (NVIDIA's "Stats Drivers")
     - Which tells you what the driver was told...
     - This can be very different from what you told the API...
     - Includes quite a lot of runtime activity to do with forcing known-good state after BeginScene()

5. Can't I just use RDTSC?
   - No - because RDTSC tells you how many clock cycles the call took until it returned, not how long the work took to complete
   - Can I just comment out some of the calls?
     - Yes - but be careful about the dynamics

6. Pure Software Optimisation
   - Watch out for C++ constructors etc.
     - I've seen games which are limited by the ability of the CPU to construct matrices and vectors!
   - STL is fine if you're careful - but you must be careful
   - __ftol - look for it in the profile
   - When you unroll your loops don't do more than 8, and consider re-rolling them when profiling
   - Always profile at the lowest possible resolution and without waiting for VBlank
     - It's the equivalent of running on hardware from two or three years into the future
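
To illustrate the __ftol point: on the x87 MSVC compilers of the time, a plain float-to-int cast compiled to a call to __ftol, which switches the FPU rounding mode to truncation and back again on every conversion. One commonly suggested alternative (an assumption here, not something the talk prescribes) is to convert using the current rounding mode instead. The sketch below uses std::lrint from standard C++11; the function names are hypothetical, and note that the two functions do not return the same values.

```cpp
#include <cmath>

// Truncating conversion: the form that showed up as __ftol in profiles
// on older x87 compilers.
inline int truncateToInt(float value) {
    return static_cast<int>(value);
}

// Conversion using the current floating-point rounding mode
// (round-to-nearest-even by default), so no mode switch is required.
inline int roundToInt(float value) {
    return static_cast<int>(std::lrint(value));
}
```

Because the results differ (2.7f truncates to 2 but rounds to 3), this only fits code that does not depend on truncation.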

7. AGP writes from CPU
   - Write combining means that AGP writes should look like this:

8. AGP reads from CPU

9. D3D Hardware TnL Principles I
   - Don't switch VB more than necessary
     - That remains true for DX8
   - Don't switch FVF either
     - Because that always switches VB
   - Prefer the most compact FVF
     - But there are special considerations too
   - Disable LOCALVIEWER
     - Unless it's a clear quality win

10. Hardware TnL Principles II
    - Be clever in your use of Vertex Buffers
      - Correct use of flags is critical
      - The more you tell us the more we know
    - But don't get too clever
      - Implementing your own round-robin scheme is the wrong way to handle it
    - Tuning for H/W is mostly the same but...
      - Almost never use ProcessVertices to help out
      - Sparse VBs don't hurt hardware (but S/W dies...)

11. Hardware TnL Principles III
    - Worry less about individual CPU cycles
    - Send your polys in batches of 200+
    - Prefer strips to lists and fans
      - Degenerate tris are handy here
      - Though it won't matter much for low poly scenes
    - Indexing is a great idea
      - It's the only way to get the cache working for you
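
The remark about degenerate triangles can be sketched as follows: two indexed strips are joined into one by repeating indices, so both can be submitted as a single batch. This is a minimal sketch in plain C++ with no Direct3D calls; the function name and the choice of 16-bit indices are assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Joins two indexed triangle strips into one by inserting degenerate
// (zero-area) triangles between them.
std::vector<std::uint16_t> stitchStrips(const std::vector<std::uint16_t>& first,
                                        const std::vector<std::uint16_t>& second) {
    if (first.empty()) return second;
    if (second.empty()) return first;

    std::vector<std::uint16_t> joined(first);
    joined.push_back(first.back());    // repeat the last index of the first strip
    joined.push_back(second.front());  // repeat the first index of the second strip
    // Winding alternates with position inside a strip. If the first strip has
    // an odd number of indices, one extra duplicate keeps the second strip's
    // original winding.
    if (first.size() % 2 != 0) {
        joined.push_back(second.front());
    }
    joined.insert(joined.end(), second.begin(), second.end());
    return joined;
}
```

The repeated indices produce triangles with zero area, which cost some index bandwidth but draw no pixels.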

12. Hardware TnL - Handling LOD
    - Static LOD for all things that are animated
      - To save CPU cycles when they're distant
      - 3 levels of LOD should suffice
    - Don't bother with any LOD for most instances
      - It's faster to just draw it...
    - Landscapes
      - Use no LOD unless you're writing a flight sim
    - Impostors need special handling
      - Because they're usually drawn in small batches

13. GeForce Principles I
    - The exact FVF to use can depend on the pattern of use...
      - True random access benefits from n*32 byte formats (see website for more detail)
      - Sequential access into the VB generally benefits most from small FVFs
    - Think of the GPU's memory cache and this all makes sense...
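
As a rough illustration of the n*32 byte point, the sketch below computes the stride of two vertex layouts and rounds it up to the next multiple of 32 bytes. The struct layouts and the helper name are hypothetical, and the sizes assume 4-byte floats with no compiler-inserted padding.

```cpp
#include <cstddef>
#include <cstdint>

// Rounds a vertex stride up to the next multiple of 32 bytes.
constexpr std::size_t padStrideTo32(std::size_t strideBytes) {
    return (strideBytes + 31) / 32 * 32;
}

// Hypothetical layout: position, normal and one texture coordinate set.
// 8 floats = 32 bytes, already a multiple of 32.
struct VertexPosNormalTex {
    float position[3];
    float normal[3];
    float uv[2];
};

// Hypothetical layout with a packed diffuse colour added: 36 bytes, which
// would need padding to 64 bytes to stay on an n*32 byte format.
struct VertexPosNormalColourTex {
    float position[3];
    float normal[3];
    std::uint32_t diffuse;
    float uv[2];
};
```

The second layout shows the trade-off the slide describes: padding helps true random access, while sequential access would usually prefer the unpadded 36 bytes.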

14. GeForce Principles II
    - Try to load balance the pixel engine and the vertex engine
    - Pixels are stamped out in 2x2 blocks per clock
      - Or in 2x1, or in 1x1 blocks when using heavy MultiTexture
    - Vertices are cached:
      - Before TnL (just as raw memory)
      - After TnL (so re-using these saves vertex engine work)

15. GeForce Principles III
    - That means that the cost of a vertex shader can be paid once, or multiple times - depending on whether you achieve cache re-use
    - The only way to save the full pixel cost is to use the Z buffer...
      - Fast Z rejects are very early
      - So, draw roughly front to back
      - That's a per-object sort, NOT per poly
    - Use Clear() - not tris
    - Don't ask for stencil unless you really want it
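
The per-object front-to-back ordering might look something like the sketch below. The DrawItem struct and its fields are hypothetical; a real application would sort its own scene objects by whatever depth measure it already has.

```cpp
#include <algorithm>
#include <vector>

struct DrawItem {
    int objectId;     // hypothetical handle for whatever gets drawn
    float viewDepth;  // distance from the camera along the view direction
};

// Sorts whole objects nearest-first so that early Z rejection can discard
// hidden pixels. This is a per-object sort, not a per-polygon one.
void sortFrontToBack(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });
}
```

Since the slide only asks for a rough ordering, an approximate depth such as the distance to each object's bounding-sphere centre is likely to be enough.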

16. Win2K considerations
    - Kernel transitions are much more expensive
      - But are minimized by the runtime's batching into command queues
    - You can't ignore/trap ALT-TAB
    - AGP support is "imperfect" - see www.geforcefaq.com for some handy tips
    - You need SP2 installed plus our DX8 drivers for the best experience...
    - It's no good profiling on Win2K - this won't tell you about the real consumers' experience

17. Anti-Aliasing
    - Q: How much memory does AA use?
    - A: The algorithm is simple:
      - Take the memory for the *non* AA case
      - Subtract the memory used by the normal Z buffer
      - Add the memory for a single super colour and super Z buffer
        - Multiply by sample count to get the super scale
      - That's 3 normal buffers, 1 super colour buffer and 1 super Z buffer for triple buffering
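
The algorithm above can be written out as a small calculation. This is a sketch under stated assumptions: the struct and function names are hypothetical, buffers are assumed to be tightly packed with no alignment padding, and the example figures (1024x768, 32-bit colour, 32-bit Z, 4 samples) are chosen only for illustration.

```cpp
#include <cstdint>

struct FrameBufferConfig {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t bytesPerColourPixel;
    std::uint32_t bytesPerDepthPixel;  // Z (plus stencil, if any)
    std::uint32_t colourBufferCount;   // 3 for triple buffering
};

// Memory for the non-AA case: all colour buffers plus one Z buffer.
std::uint64_t nonAaBytes(const FrameBufferConfig& c) {
    const std::uint64_t pixels = std::uint64_t(c.width) * c.height;
    return pixels * c.bytesPerColourPixel * c.colourBufferCount
         + pixels * c.bytesPerDepthPixel;
}

// Memory for the AA case: take the non-AA total, subtract the normal Z
// buffer, then add one super colour buffer and one super Z buffer, each
// scaled by the sample count.
std::uint64_t aaBytes(const FrameBufferConfig& c, std::uint32_t sampleCount) {
    const std::uint64_t pixels = std::uint64_t(c.width) * c.height;
    const std::uint64_t normalZ = pixels * c.bytesPerDepthPixel;
    const std::uint64_t superColour = pixels * c.bytesPerColourPixel * sampleCount;
    const std::uint64_t superZ = normalZ * sampleCount;
    return nonAaBytes(c) - normalZ + superColour + superZ;
}
```

With those example figures, the non-AA case needs 12,582,912 bytes (12 MB) and the 4-sample case needs 34,603,008 bytes (33 MB).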

18. Anti-Aliasing
    - Q: Why should I care?
    - A: Because if you run out of local video memory you are likely to have to texture from AGP memory - which is noticeably slower

19. Vertex Caches
    - Learn what typical hardware can do to help
      - TNT has a 16 vertex cache
      - GeForce has an effective 10 vertex cache
      - GeForce3 has an effective 18 vertex cache
      - Other cards will vary, but bigger is always better
    - NVIDIA's vertex caches are FIFOs (not LRU)
    - Re-use recent indices
      - "Recent" is defined by the cache size
    - Use highly local winding in your strips
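
To get a feel for how index order interacts with a FIFO cache, a small simulator can count how many indices would miss a cache of a given size. This is an illustrative model only, with a hypothetical function name; it is not a description of how any particular chip implements its cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts how many indices miss a FIFO vertex cache holding cacheSize entries.
// Each miss stands for a vertex that has to be transformed again.
std::size_t countFifoCacheMisses(const std::vector<std::uint16_t>& indices,
                                 std::size_t cacheSize) {
    assert(cacheSize > 0);
    std::deque<std::uint16_t> cache;
    std::size_t misses = 0;
    for (std::uint16_t index : indices) {
        const bool hit =
            std::find(cache.begin(), cache.end(), index) != cache.end();
        if (hit) continue;  // FIFO: a hit does not refresh the entry's age
        ++misses;
        if (cache.size() == cacheSize) cache.pop_front();  // evict the oldest
        cache.push_back(index);
    }
    return misses;
}
```

Running an index list through this with sizes such as 10, 16 and 18 gives a quick estimate of how well a mesh ordering re-uses recent vertices.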

20. What is "highly local winding"?
    - It's the scheme in which you revisit previous vertices to make full use of them before they escape from the cache
    - On simple meshes it's this kind of pattern:

22. The Limits of Profilers I
    - A profile tells you what the CPU was doing (and it shows you the effects, not the causes)
    - It doesn't (usually) tell you:
      - Why
      - Who provoked it
      - Whether it overlapped graphics activity
      - What you intended...
      - What would improve things

23. The Limits of Profilers II
    - They never tell you what the GC was doing
      - Remember that Lock() is a highly parallel operation
    - They don't ever tell you directly:
      - Why the GC was so busy
      - Whether the run-time is a real problem, or,
      - Whether you have been bullying the run-time
      - What to do about it

24. How to interpret a profile...
    - There are no "Good" profiles
    - Only ever:
      - "Bad" profiles, or,
      - "Not necessarily bad" profiles

25. Aquanox - System load
    - Driver = 58%
    - Other32B = 4%
    - Aquanox = 24%
    - Promising...

26. Aquanox - what the driver is doing
    - SpinLock = 48%
    - The driver is just looping, waiting for space (i.e. the chip is backed-up)

27. Recent shipping D3D Title
    - Game3DD = 54%
    - GAME_exe = 4%
    - Msvcrt = 9%
    - 3D_D3D = 6%
    - NVDD32 = 12%
    - ?

28. 3DMark "Car chase" - High Detail

29. 3DMark "Lobby scene" - Low Detail

30. What else?
    - Indexed tri-strips are the best primitives that there are...
    - Use DX8 index buffers
    - Fewer VBs still
    - The Stats driver and VTune will tell you about the bad things you may be doing

31. Why do IHVs complain about Lock()?
    - OK. Let's try a multiple choice question:
    - When is locking a RenderTarget slow?

32. Summary
    - Use VBs and IBs for everything
    - Never Lock anything that's not a VB or IB
      - FB locks cost you CPU time
    - Batch like crazy
    - Don't do anything you don't need to
      - And certainly don't do it twice...

33. The profile says it's the runtime...
    - There are several possibilities
    - Maybe the runtime is handling something badly...
    - Or maybe:
      - You're calling the runtime too often
      - You're asking it to do something the GC should handle (like clipping, or transforming)
    - I've never seen a well written, balanced app that is runtime-limited on the GeForce family

34. The profile says it's the driver...
    - This can typically be caused by two sorts of behaviour:
      - Good apps, or
      - Excessive Locks
      - Poor use of VBs
      - Being hardware limited
      - "App too simple" - like many of the SDK samples
        - After all, they're not performance tutorials, they're all about functionality

35. It should be in the startup code...
    - Be cautious about using expensive API methods in performance sensitive code:
      - ValidateDevice()
      - CreateVB()
      - DestroyVB()
      - CreateStateBlock()
      - AssembleVertexShader()
      - CreateTexture()

36. Stencil when you don't want it
    - If you clear only Z then we have to preserve Stencil - even if you're not using it
    - And that means Clear() runs slowly
      - Read-Modify-Write every pixel
    - So, don't clear by drawing polygons
      - If there's no stencil value in there then we can write anything we like...

37. DrawPrimitiveUP
    - How many times do I have to say, "Use VBs for everything"?
    - There are almost zero cases when avoiding VBs makes sense
    - If you think you have one of those cases then:
      - a) You haven't really - talk to me about it
      - b) Consult a doctor - it's as bad as that

38. Not clipping when using a VShader
    - The Vertex Shader route requires you to turn clipping on or off in the same way as the FF route
    - The standard pipeline description might (wrongly) be taken as indicating that clipping always happens after the shader, but you need to enable clipping to make it happen
    - The GeForce family won't care...
      - Because they have an infinite guardband

39. Making Render Targets fast
    - Use a Z buffer which matches your render target in:
      - Colour depth,
      - Width, and,
      - Height
    - This may mean you need multiple Z buffers, but they're worth it
    - When clearing a render target try to Clear() all of it - not just some part, and don't draw triangles to clear it

40. Don't let your art control your code
    - Since you need to be able to render in sizeable batches you must not allow the art path to break this...
    - So, a landscape which is made up of tiles with many textures dotted around will be hard to batch
    - Instead try to get the artists to differentiate areas with a small number of detail textures

41. Drawing text one character at a time
    - Did I mention batching?
    - Case I saw:
      - No VBs for text, drawing around 30 chars per frame
      - Ran text alone at 160 fps on a Pentium III ~600 MHz
      - That's roughly twice as fast as Quake 3*
    - (*) But then Quake 3 does physics, renders the world, casts shadows, handles animations, deals with input, and a few other things...
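
One way to avoid the one-character-at-a-time pattern is to build the geometry for a whole string first and submit it as a single indexed batch. The sketch below only builds the CPU-side arrays; copying them into a vertex buffer and index buffer is left out. The structs, the function name, and the 16x16 fixed-size glyph grid are all assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct TextVertex {
    float x, y;  // screen position
    float u, v;  // texture coordinate into a font texture
};

struct TextBatch {
    std::vector<TextVertex> vertices;
    std::vector<std::uint16_t> indices;
};

// Builds one quad (4 vertices, 6 indices) per character so the whole string
// can be drawn with a single indexed call. Assumes a hypothetical font
// texture laid out as a 16x16 grid of fixed-size glyphs, and strings short
// enough for 16-bit indices.
TextBatch buildTextBatch(const std::string& text, float originX, float originY,
                         float glyphWidth, float glyphHeight) {
    TextBatch batch;
    batch.vertices.reserve(text.size() * 4);
    batch.indices.reserve(text.size() * 6);
    const float cell = 1.0f / 16.0f;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        const float u0 = static_cast<float>(c % 16) * cell;
        const float v0 = static_cast<float>(c / 16) * cell;
        const float x0 = originX + glyphWidth * static_cast<float>(i);
        const std::uint16_t base =
            static_cast<std::uint16_t>(batch.vertices.size());
        batch.vertices.push_back({x0, originY, u0, v0});
        batch.vertices.push_back({x0 + glyphWidth, originY, u0 + cell, v0});
        batch.vertices.push_back({x0, originY + glyphHeight, u0, v0 + cell});
        batch.vertices.push_back(
            {x0 + glyphWidth, originY + glyphHeight, u0 + cell, v0 + cell});
        const std::uint16_t quad[6] = {0, 1, 2, 2, 1, 3};
        for (std::uint16_t offset : quad) {
            batch.indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return batch;
}
```

For the roughly 30 characters per frame mentioned above, this would produce 120 vertices and 180 indices, submitted in one call instead of 30.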

42. Failing to optimize for the CPU
    - Because even with modern fast CPUs most games are still CPU limited
    - Lock is primarily a CPU cost - because the CPU can't do anything productive while it waits for the lock
    - This means that Lock is "cheap" on systems where the CPU is not the bottleneck

43. NVTune I
    - Analyze interaction with driver
    - Works on NVIDIA hardware only
    - Windows 98/Windows 2000 capable
    - Hotkey capable
    - Online help via F1 function key
    - Logging
    - Frame Rate Display
    - Natural Extension to VTune

44. NVTune II
    - Future enhancements include:
      - Bottleneck analysis inside chip (i.e. fill bound, memory bound or T&L bound?)
      - # z-buffer checks vs. z-buffer writes (depth complexity)
      - Cross machine (TCP/IP) capture
      - Logging of clock cycles on a per function basis
      - Capture start on frame rate threshold
      - Your suggestions?

45. NVTune III
    - NVTune available free at http://partners.nvidia.com/Marketing/Developer/SwDevStaticPages.nsf/pages/
    - You must be a registered NVIDIA developer

46. NVTune in action...
    - Do a demo here...

47. Questions...?
