The challenge of migration : desktop to handheld Phil Atkin Product Manager 3D Graphics September 2004. Topics. Overview Definitions What does ‘desktop’ mean? What does ‘handheld’ mean? Challenges Management of 3D resources Management of CPU resources Case study
The challenge of migration: desktop to handheld • Phil Atkin • Product Manager 3D Graphics • September 2004
Topics Overview • Definitions • What does ‘desktop’ mean? • What does ‘handheld’ mean? • Challenges • Management of 3D resources • Management of CPU resources • Case study • Realities of porting a desktop 3D framework to handheld • Demonstrations (Intel / Intrinsyc Carbonado) • Performance (PowerBook vs. Carbonado) • Conclusions
Desktop vs. handheld systems • Desktop system • CPU + GPU + 3D API • Powerful - 1GHz up to >3GHz CPU with SIMD floating-point • Big caches • Minimum ‘Free3D’ chipset • Maximum GeForce 6800 / Radeon X800 • OpenGL 1.5 transitioning to OpenGL 2.0 • Handheld system (PowerVR 3D) • CPU + GPU + 3D API • CPU ranges from 100MHz to 500+MHz • Small caches • CPU may or may not have FP capability • Minimum MBX Lite no VGP - 1M tris, 100M pixels • Maximum MBX VGP - 4M tris, 350M pixels, free AA • OpenGL ES 1.0 transitioning to OpenGL ES 1.1
Handheld 3D • Delivering accelerated handheld 3D is all about power management • All chip vendors have access to similar process technologies • Leads to similar power / MHz • Leads to similar performance / mW • All system vendors have access to similar battery technologies • Leads to similar ‘talk time / game-time’ per recharge • Some architectures have clear power/performance advantages • Tile-based rendering, on-die framebuffers - minimize data passing between chips • These factors lead to a relatively narrow spectrum of capabilities • Low-end and high-end systems only differ by 3-4x • Admittedly PowerVR sets a high baseline, but the generalization holds
Observations • Even low-end handheld 3D accelerators will offer excellent performance • On par with 2nd / 3rd generation desktop accelerators • Efficient API is in place and standardized • Hence the path from the driver to the hardware is sorted - but … • What about the path from the application to the driver? • How to structure application code to keep hardware busy? • Despite relatively narrow spectrum of 3D capabilities • Potential for extremely large disparity between systems • Floating point-less CPU, rasterizer-only 3D • Very high performance CPU / FPU, vertex-programmable 3D • How to develop or port with such a spread of computational capabilities?
The challenge • Management of 3D capabilities is not the challenge • The usual techniques learned in the desktop space can be used • Resolution / triangle count / texture filtering / AA quality • Management of CPU resources is the challenge • Lowering vertex counts to GPU will inherently lower CPU load • But the problem is far bigger in scope than just this • The data type float is essentially unavailable at the low end • Platform CPUs have such diverse capabilities - either • Stratify in software, code explicitly to each market stratum • Or code in a floating-point agnostic manner • The latter is achievable and allows a single code base across platforms
Why bother porting to an FPU-less platform? • Consider the following 3 likely classes of handheld device • Class A • High-performance CPU, FPU, GPU with vertex processing • Class B • High-performance CPU, GPU with vertex processing • Class C • CPU, rasterizer • Classes B and C will likely be smaller die, lower cost • Will likely ship in higher volumes • If so - • will offer more revenue opportunities for software vendors • yet platforms do not have floating-point capability • But a Class A device may win out • Software vendors must cover all the bases to guarantee success
Why not just make everything fixed point? • Because your desktop platform • Will be faster in floating-point • Does not have fixed-point OpenGL ES entrypoints! • If you really need • The same code base to run on desktop and handheld • High performance on all classes of handheld systems • You need to abstract out your numeric format • C++ class, build-time switchable from 16.16 to float
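A minimal sketch of such a build-time switchable number class follows. The class name `Real`, the macro `REAL_IS_FLOAT` and the member names are hypothetical and are not taken from the talk. The fixed-point branch assumes a 32x32 to 64-bit multiply is available and that right-shifting a negative value is an arithmetic shift, which holds on common compilers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical build switch: define REAL_IS_FLOAT on FPU targets,
// leave it undefined to get 16.16 fixed point.
class Real {
public:
    Real() : v_(0) {}

    static Real fromInt(int i) {
#ifdef REAL_IS_FLOAT
        return Real(static_cast<float>(i));
#else
        assert(i >= -32768 && i <= 32767);          // 16 integer bits incl. sign
        return Real(static_cast<int32_t>(i) * 65536);
#endif
    }
    static Real fromFloat(float f) {                // intended for constants and tools
#ifdef REAL_IS_FLOAT
        return Real(f);
#else
        return Real(static_cast<int32_t>(f * 65536.0f));
#endif
    }
    float toFloat() const {
#ifdef REAL_IS_FLOAT
        return v_;
#else
        return static_cast<float>(v_) / 65536.0f;
#endif
    }
    int toInt() const {                             // rounds toward minus infinity
#ifdef REAL_IS_FLOAT
        return static_cast<int>(std::floor(v_));
#else
        return static_cast<int>(v_ >> 16);
#endif
    }

    friend Real operator+(Real a, Real b) { return Real(a.v_ + b.v_); }
    friend Real operator-(Real a, Real b) { return Real(a.v_ - b.v_); }
    friend Real operator*(Real a, Real b) {
#ifdef REAL_IS_FLOAT
        return Real(a.v_ * b.v_);
#else
        // 16.16 * 16.16 gives 32.32; keep the middle 32 bits.
        return Real(static_cast<int32_t>((static_cast<int64_t>(a.v_) * b.v_) >> 16));
#endif
    }
    friend Real operator/(Real a, Real b) {
        assert(b.v_ != 0);
#ifdef REAL_IS_FLOAT
        return Real(a.v_ / b.v_);
#else
        return Real(static_cast<int32_t>((static_cast<int64_t>(a.v_) * 65536) / b.v_));
#endif
    }

private:
#ifdef REAL_IS_FLOAT
    typedef float Storage;
#else
    typedef int32_t Storage;
#endif
    explicit Real(Storage v) : v_(v) {}
    Storage v_;
};
```

Application code written against `Real` compiles unchanged in either mode, for example `Real half = Real::fromInt(1) / Real::fromInt(2);`.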
Porting desktop software - 4 step program • Observations • Debugging on a handheld is no fun • The porting process needs to be derisked as much as possible • Strive to get as close as possible to the handheld codebase without leaving the desktop • Code extremely defensively - make no assumptions regarding performance • ‘Portification’ • Yes, I know it’s not a real word… • The process of preparing for the port without actually executing on it • Step 1 - implement the abstracted real number class • Step 2 - portify 3D code • Step 3 - portify application code • Step 4 - do the port
Step 1 - implement real number class • C++ operators for +-*/ and type conversion • Note ARM does not have a divide instruction • Recommendation - normalize / reciprocate / multiply / denormalize • ARM does have a normalize instruction - CLZ • Functions for common but expensive operations • E.g. implement your own sqrt and trig • Why - because you may wish to sidestep glRotate() etc. • These functions will of course work in fixed or float • Hence testability on desktop is high and immediate
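The normalize / reciprocate / multiply / denormalize recommendation can be sketched roughly as below for unsigned 16.16 operands. The talk does not say how the reciprocal is obtained, so the Newton-Raphson iteration, its seed constants and the final remainder fix-up are assumptions made for illustration, as are the function names. The portable loop in `countLeadingZeros` stands in for the single CLZ instruction.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the ARM CLZ instruction.
inline int countLeadingZeros(uint32_t v) {
    assert(v != 0);
    int n = 0;
    while ((v & 0x80000000u) == 0) { v <<= 1; ++n; }
    return n;
}

// Unsigned 16.16 divide without a divide instruction.
// Returns floor(a * 65536 / b); the caller handles signs and must
// ensure b != 0 and that the quotient fits in 32 bits.
inline uint32_t fixedDivide(uint32_t a, uint32_t b) {
    assert(b != 0);

    // 1. Normalize: shift b until its top bit is set, so that
    //    m = n / 2^32 lies in [0.5, 1).
    const int shift = countLeadingZeros(b);
    const uint64_t n = static_cast<uint64_t>(b) << shift;

    // 2. Reciprocate: Newton-Raphson for x ~= 1/m, with x in 2.30 format.
    //    Seed x0 = 48/17 - (32/17) * m (assumed, a common textbook seed).
    uint64_t x = 3031741620u - ((2021161080u * n) >> 32);
    for (int i = 0; i < 3; ++i) {
        const uint64_t t = (n * x) >> 32;        // m * x in 2.30
        x = (x * ((1ull << 31) - t)) >> 30;      // x * (2 - m * x)
    }

    // 3. Multiply and 4. denormalize in one shift.
    uint64_t q = (static_cast<uint64_t>(a) * x) >> (46 - shift);

    // Fix-up: the estimate can be off by a few units in the last place,
    // so correct it using the remainder (multiplies and adds only).
    int64_t r = (static_cast<int64_t>(a) << 16) - static_cast<int64_t>(q * b);
    while (r < 0)                        { --q; r += b; }
    while (r >= static_cast<int64_t>(b)) { ++q; r -= b; }
    return static_cast<uint32_t>(q);
}
```

Because the same routine runs on the desktop, its results can be compared against an ordinary `/` there before the code ever reaches a handheld.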
Step 2 - portify 3D code • Isolate your 3D code if not already done • Minimize #include <gl/gl.h> • Modify 3D code so it is OpenGL / OpenGL ES agnostic • Modify it so it is floating point / fixed point agnostic • And obviously modify your data too • Make your world representable by 16.16
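One way to check that world data is representable in 16.16 is to convert it in an offline tool and reject anything out of range. The sketch below is an assumption about how such a tool might look; `convertToFixed` is a hypothetical name and the round-to-nearest policy is a choice, not something stated in the talk.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts float coordinates to 16.16 fixed point, rounding to nearest.
// Returns false and clears 'out' if any value falls outside the
// representable range [-32768, 32768).
inline bool convertToFixed(const std::vector<float>& in, std::vector<int32_t>& out) {
    out.clear();
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float v = in[i];
        if (!(v >= -32768.0f && v < 32768.0f)) {   // also rejects NaN
            out.clear();
            return false;
        }
        out.push_back(static_cast<int32_t>(std::floor(v * 65536.0f + 0.5f)));
    }
    assert(out.size() == in.size());
    return true;
}
```

A failed conversion is a signal to rescale or re-origin the world on the desktop rather than to discover overflow on the device.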
Step 3 - portify application code • Work out what maths absolutely must be floating-point • Replace everything else with real number class • But be really careful - for example • Really common case - distance between 2 points - Pythagoras • Squaring those numbers will blow up for almost all cases • Code defensively - implement a ‘radius’ function that will not blow up • OK, you could keep this example as floats • But floats are so very expensive without FPU • It’s a common operation, and it’s easy to get it right in fixed-point • Remember - conservation of CPU cycles is the challenge • The hardware developers and Khronos have taken care of the 3D • CPU cycles are precious, conserve them
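To illustrate the defensive ‘radius’ idea: in 16.16 any value above roughly 181 overflows when squared, so a sketch like the one below pre-scales both components until their squares fit in 32 bits, takes an integer square root, then scales back. This is one possible approach under stated assumptions, not the implementation from the talk; `isqrt32` and `fixedRadius` are hypothetical names, and the result keeps about 15 significant bits.

```cpp
#include <cassert>
#include <cstdint>

// Bit-by-bit integer square root: returns floor(sqrt(v)).
inline uint32_t isqrt32(uint32_t v) {
    uint32_t root = 0;
    uint32_t bit = 1u << 30;
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= root + bit) {
            v -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

// Length of the vector (dx, dy), all values raw 16.16.
// Uses 32-bit arithmetic only and cannot overflow for any input.
inline uint32_t fixedRadius(int32_t dx, int32_t dy) {
    // Absolute values, computed in unsigned arithmetic so INT32_MIN is safe.
    const uint32_t ax = dx < 0 ? 0u - static_cast<uint32_t>(dx) : static_cast<uint32_t>(dx);
    const uint32_t ay = dy < 0 ? 0u - static_cast<uint32_t>(dy) : static_cast<uint32_t>(dy);
    const uint32_t larger = ax > ay ? ax : ay;

    // Scale down until the larger component is below 2^15,
    // so each square is below 2^30 and the sum below 2^31.
    int shift = 0;
    while ((larger >> shift) >= 0x8000u) ++shift;
    const uint32_t sx = ax >> shift;
    const uint32_t sy = ay >> shift;
    assert(sx < 0x8000u && sy < 0x8000u);

    // The raw units are linear, so scaling back is a plain left shift.
    return isqrt32(sx * sx + sy * sy) << shift;
}
```

The shift loop could be replaced by a single CLZ on ARM. The return type is unsigned because the length of a very large vector can exceed the positive range of signed 16.16.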
Step 4 - port to the handheld platform • This step is really easy if the last 3 went well ... • Take cross-compiler • Turn on all the #ifdefs you prepared earlier • Type ‘make’ • Or under Embedded Visual C++ hit F7 • It will just work. Trust me, it will.
Case study - the Mobile Scene Graph • Framework for 3D applications • Initial implementation - desktop • Interactive landscape, architecture and garden design review • Straightforward design • Classic app + cull + draw, frustum culling • C++, STL, polymorphic, RTTI • Target platform PowerBook G3 500MHz / OpenGL / glut • Transitioned into • Desktop - interactive landscape, architecture and garden design review • Handheld - experimental testbed for OpenGL ES rendering • Target platforms • PowerBook G3 500MHz / OpenGL 1.4 / glut • Intel / Intrinsyc Carbonado / OpenGL ES 1.0 / egl • Great opportunity to take on a port • Aiming for 100% application source code compatibility • Aiming to deliver highest possible performance on desktop and handheld
MSG Implementation details • ‘MSGReal’ • Build-time switchable float or OpenGL ES 16.16 fixed point • C++ operators provide +-*/ and common type conversions • Functions provide trig, sqrt / recipsqrt • All expensive operations implemented by piecewise quadratics • Additional 4.12 ‘MSGShortFix’ type • Intermediate product fits into 32 bits, no double-length maths • Superbright unclamped colour accumulation • Reflection-mapping via quadratic approximation without overflow • Only 2 internal functions use floating-point • Plane fitter for frustum construction • Determinant calculation in matrix inverter
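The point about the 4.12 type can be illustrated with a short sketch: two 16-bit operands multiply into a 32-bit product, so no 64-bit intermediate is needed. The struct and function names here are hypothetical and only approximate what a type like ‘MSGShortFix’ might do; the actual MSG code is not shown in the talk. C++11 brace initialization is assumed.

```cpp
#include <cassert>
#include <cstdint>

// 4.12 fixed point: sign bit, 3 integer bits, 12 fraction bits.
// Representable range is [-8, 8), which leaves headroom above 1.0
// for unclamped ("superbright") colour sums.
struct ShortFix {
    int16_t raw;
};

inline ShortFix shortFixFromFloat(float f) {
    assert(f >= -8.0f && f < 8.0f);
    return ShortFix{ static_cast<int16_t>(f * 4096.0f) };
}

inline float shortFixToFloat(ShortFix s) {
    return static_cast<float>(s.raw) / 4096.0f;
}

inline ShortFix shortFixMul(ShortFix a, ShortFix b) {
    // 4.12 * 4.12 gives 8.24, which always fits in a 32-bit register.
    const int32_t product = static_cast<int32_t>(a.raw) * b.raw;
    // Caller must ensure the result is within [-8, 8).
    return ShortFix{ static_cast<int16_t>(product >> 12) };
}
```

For example, `shortFixMul(shortFixFromFloat(1.5f), shortFixFromFloat(2.0f))` holds 3.0, a value a clamped 0 to 1 colour format could not represent.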
Porting realities - timescales • Approximately 3 man-months of portification • Difficult to measure accurately • Coding was in progress as portification began • Approximately 20,000 lines of code • Only 800 lines can see <gl/gl.h> • Just 8 #ifdefs in this module • i.e. if this is representative, the portification process is manageable • 2 evening porting sessions • Just 6 hours at the desk from ‘move code onto PC’ to ‘run on handheld’ • … and one evening should have been enough • Then performance tuning • Anticipated >30Hz was only 15-20Hz • Now tuned up to >40Hz with no change in geometric load
Porting realities - gotchas • Handheld specific • Performance not linear with clock for a variety of reasons • e.g. caching behaviour, driver behaviour, architectural • Limited container class and template support • Some C++ operations will hurt more than you expect • Very slow RTTI • STL list operations sort(), push_back(), pop_front() proved surprisingly expensive • 3D gotchas • Unanticipated differences in behaviour • E.g. multiple strips from single pointer setup - multiple TnL on Carbonado • Would benefit from glMultiDrawElements • Short tristrip performance • Would benefit from glMultiDrawElements!! • Best performance - glDrawElements(GL_TRIANGLES) • Fixed-point to integer conversion in OpenGL ES interface
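Given that a single indexed triangle list performed best, one way to avoid many short strips is to flatten them into one index buffer on the desktop. The following is a sketch under that assumption; `appendStripAsTriangles` is a hypothetical helper, and skipping degenerate triangles is a design choice rather than something the talk prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Appends the triangles of one triangle strip to an indexed triangle list.
// Every second triangle in a strip has reversed winding, so its first two
// indices are swapped to keep a consistent facing.
inline void appendStripAsTriangles(const std::vector<uint16_t>& strip,
                                   std::vector<uint16_t>& out) {
    for (std::size_t i = 2; i < strip.size(); ++i) {
        uint16_t a = strip[i - 2];
        uint16_t b = strip[i - 1];
        uint16_t c = strip[i];
        if (a == b || b == c || a == c) continue;   // drop degenerate triangles
        if (i & 1) std::swap(a, b);                 // odd triangle: flip winding
        out.push_back(a);
        out.push_back(b);
        out.push_back(c);
    }
    assert(out.size() % 3 == 0);
}
```

After all strips of an object have been appended, the whole list can be submitted with one `glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, indices)` call.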
Demonstrations • MSGRefMap - arithmetic performance test • Single object, reflection mapped • Cull time virtually zero • Virtually all cycles spent in reflection-map code • This is fixed-point on all platforms • 16-bit skybox textures • MSGHurricane - frustum-culling test • 2048 objects in hierarchical terrain • unlit, 8-bit luminance texture • 7 animated aircraft • lit with 2 lights • 16-bit aircraft texture • 16-bit skybox textures
Performance • MSGRefMap • PowerBook floating point: OpenGL renderer 116 Hz, NULL renderer 1360 Hz • PowerBook fixed point: NULL renderer 1620 Hz • Carbonado fixed point: OpenGL ES renderer 35.9 Hz, NULL renderer 668.4 Hz • Carbonado floating point: NULL renderer 101.2 Hz • MSGHurricane • PowerBook floating point: OpenGL renderer 122 Hz, NULL renderer 1890 Hz • PowerBook fixed point: NULL renderer 960 Hz • Carbonado fixed point: OpenGL ES renderer 34.6 Hz, NULL renderer 271.5 Hz • Carbonado floating point: NULL renderer 46.25 Hz • Fixed-point code averages 6x faster than FP emulation • Despite data structure traversal and other non-arithmetic code • Despite fixed point reflection-mapping code in floating point version • This is a fast CPU, yet it is too slow in FP emulation running MSGHurricane
Last word on performance • The missing case - • Floating point application code • Fixed point framework / middleware • Estimated by isolating application cycles on Carbonado • Time spent in application = 11% of frame time (NULL renderer) • MSGHurricane • Fixed point frame time = 0.0037 sec • Floating point frame time = 0.021 sec • Mixed-mode frame = (89% * 0.0037) + (11% * 0.021) = 0.011 sec • Estimated 88 Hz mixed-mode rate • Within 33 ms budget • But scale processor back to 150MHz and it becomes too slow again • And this is just a demo - just splines, no physics, no gameplay • Floating-point emulation is just too slow for even the simplest case
Conclusions • The software migration process can be relatively painless • Source code should be ‘portified’ - i.e. made • 3D API agnostic • Isolate and encapsulate your 3D API interactions • Structure desktop code to be OpenGL ES friendly • Floating point agnostic • Abstract out your real number format • At minimum in middleware layer • Ideally allow fixed-point from application down to hardware • You can do all this from the safety of your workstation • No handheld platform debugging until project is mature • MSG ported to Carbonado in 2 evenings with just printf • And if you get it right • It will just port and just work - but may require some tuning • Performance will be high across platforms • Resulting software will be highly portable and reusable