Harnessing the Horsepower of OpenGL ES Hardware Acceleration

Presentation Transcript


  1. Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy. robert.simpson@bitboys.com

  2. Outline • Why hardware? • What is in the hardware • Architecture of embedded devices • Tips for improving performance • Future Trends

  3. Why Hardware? • Performance • Software can only achieve around 1 Mpixels/s (100MHz ARM) • With a VGA screen that's only about 3fps, worked out below (plus it uses all your CPU) • Current hardware can easily deliver over 100 Mpixels/s • Power is a bigger issue • Mobile devices are all about conserving power • Running apps on dedicated hardware is almost always more power efficient • Functionality • Many features are cheap to implement in hardware, e.g. anti-aliasing (AA) and bilinear texture filtering
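As a rough check on those numbers (assuming each pixel is touched only once per frame): VGA is 640 x 480 = 307,200 pixels, so 1 Mpixel/s divided by roughly 0.3 Mpixels per frame gives about 3 frames/s, while 100 Mpixels/s over the same screen corresponds to over 300 frames/s of raw fill rate.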

  4. What is in the Hardware? • Definitely In • Rasterization • Depth buffers • Stencil buffers • Texture pipeline • Alpha blending • Maybe available • Triangle Setup • Transform and lighting • Texture cache • 2D operations • Vertex cache • Programmable vertex shaders (via extension) • Not yet • Programmable pixel shaders (coming soon)

  5. Evolution of hardware [pipeline diagram]: three generations of the same pipeline (Command Decode → T&L → Clip → Triangle Setup → Rasterizer → Texture → Blend) are shown, with each successive generation adding more on-chip caches along the pipeline.

  6. Mobile Architecture • CPU and graphics on the same chip • Unified memory architecture • Cache for the CPU, only limited cache for the graphics device • No internal framebuffer • No internal z buffer • Limited texture cache

  7. Mobile vs. PC Architecture [diagram]: in the mobile case the CPU and GPU share a single SOC and one pool of memory over a common bus; in the PC case the CPU and GPU each have their own memory and are connected through a chipset.

  8. Mobile Architecture – What this means • Limited memory bandwidth • Hardware number crunching is relatively cheap; moving data around is the main problem • Frame buffer access (or geometry buffer access for tiled architectures) is a heavy user • Texture bandwidth is an issue • The CPU typically cannot perform floating point in hardware • Rather different from the PC world; this means a more rapid move to hardware vertex shaders • Any advantages? • Lower resolution means less data to move around • Easier to access the frame/depth buffer because of unified memory

  9. Tips for Improving Performance • Golden rules: • Don't do things you don't need to do • Let the hardware do as much as possible • Don't interrupt the pipeline • Minimize data transfer to and from the GPU

  10. Don't do what you don't need to do • What frame rate do you actually need? Achieving 100fps may be a good thing on a PC, but mobile LCD displays have slower response and the extra frames just eat power • Not using depth testing? Turn off the z buffer: saves at least one read per pixel • Not using alpha blending? Turn it off: saves reading the colour buffer • Not using textures? Don't send texture coordinates to the GPU: reduces geometry bandwidth • Seems simple, but how many people actually check? (a minimal sketch follows below)
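A minimal OpenGL ES 1.x sketch of that checklist; the helper name is illustrative, and each disable should of course only be issued for features the app genuinely does not use:

    #include <GLES/gl.h>

    /* Illustrative helper: call before drawing opaque, untextured,
     * non-blended geometry.  Each disable removes per-pixel work. */
    static void disable_unused_state(void)
    {
        glDisable(GL_DEPTH_TEST);                      /* no z buffer: saves a read per pixel */
        glDisable(GL_BLEND);                           /* no blending: saves reading the colour buffer */
        glDisable(GL_TEXTURE_2D);                      /* no texturing */
        glDisableClientState(GL_TEXTURE_COORD_ARRAY);  /* and no texture coordinates sent to the GPU */
    }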

  11. Let the hardware do it • Hardware is very good at doing computation • Transform and lighting is fast, and will become even more useful with vertex shaders • Texture blending is 'free': don't use multi-pass (see the multitexturing sketch below) • Anti-aliasing is very cheap • Hardware is good when streaming data, but works best when data is aligned with memory and when large blocks are transferred • BUT: memory bandwidth is still in short supply
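As one example of letting the hardware blend textures in a single pass, here is a hedged OpenGL ES 1.1 sketch that modulates a base map with a light map on two texture units instead of rendering the geometry twice; it assumes the hardware exposes at least two texture units and that both texture objects (and their texture coordinate arrays) are set up elsewhere:

    #include <GLES/gl.h>

    /* Single-pass base map * light map using ES 1.1 multitexturing. */
    static void setup_single_pass_lightmap(GLuint base_tex, GLuint light_tex)
    {
        glActiveTexture(GL_TEXTURE0);
        glEnable(GL_TEXTURE_2D);
        glBindTexture(GL_TEXTURE_2D, base_tex);
        glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);

        glActiveTexture(GL_TEXTURE1);
        glEnable(GL_TEXTURE_2D);
        glBindTexture(GL_TEXTURE_2D, light_tex);
        glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);
        /* Texture coordinates for unit 1 are supplied separately via
         * glClientActiveTexture(GL_TEXTURE1) + glTexCoordPointer(). */
    }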

  12. Minimize Data Transfer • Use vertex buffers: the driver can optimize storage (alignment, format) • Reduce vertex size: use byte coordinates, e.g. for normals • Reduce the size of textures: textures are cached, but on-chip memory is limited • Use mip-mapping: better quality and lower data transfer (sketched below) • Use texture compression where available
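A small OpenGL ES 1.1 sketch of the mip-mapping point: letting the driver generate the mip chain and using a mip-mapped minification filter means distant pixels read the small levels instead of pulling the full-size texture across the bus. The function name and the 16-bit RGB565 format are illustrative choices:

    #include <GLES/gl.h>

    /* Upload a 16-bit texture and ask the driver to generate mipmaps (ES 1.1). */
    static GLuint upload_mipmapped_rgb565(const void *pixels, GLsizei w, GLsizei h)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP, GL_TRUE);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, w, h, 0,
                     GL_RGB, GL_UNSIGNED_SHORT_5_6_5, pixels);
        return tex;
    }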

  13. Memory Alignment • It is always best to align data • Alignment to the bus width is very important; alignment to DRAM pages gives some benefit • Embedded devices have different types of memory • The bus width may be 32-bit, 64-bit or 128-bit • There is a penalty, possibly several cycles, for crossing a DRAM page (typically 1-4 Kbytes) • A compact, aligned vertex layout is sketched below
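Putting slides 12 and 13 together, here is a hedged sketch of a compact, aligned vertex layout: byte normals shrink the vertex, and explicit padding keeps the stride at 16 bytes so a vertex never straddles an assumed 32- or 64-bit bus transaction. The exact bus width and ideal stride are platform assumptions worth checking:

    #include <GLES/gl.h>

    /* 16-byte interleaved vertex: short positions/UVs, byte normals, padding. */
    typedef struct {
        GLshort pos[3];     /* 6 bytes */
        GLshort uv[2];      /* 4 bytes */
        GLbyte  normal[3];  /* 3 bytes */
        GLbyte  pad[3];     /* pad the stride to 16 bytes */
    } PackedVertex;

    /* Compile-time check that the struct really is 16 bytes. */
    typedef char assert_vertex_size[(sizeof(PackedVertex) == 16) ? 1 : -1];

    static void set_packed_vertex_pointers(const PackedVertex *v)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glEnableClientState(GL_NORMAL_ARRAY);
        glVertexPointer(3, GL_SHORT, sizeof(PackedVertex), v->pos);
        glTexCoordPointer(2, GL_SHORT, sizeof(PackedVertex), v->uv);
        glNormalPointer(GL_BYTE, sizeof(PackedVertex), v->normal);
    }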

  14. Batch Data • Why? • Big overhead for interpreting the command and setting up a new transfer • Memory latency large • How? • Combine geometry into large arrays • Use Vertex Buffer Objects

  15. Batch Data • Don't: multiple API calls, lots of overhead, slow • Do: fewer API calls, less overhead, much faster • The two approaches are contrasted in the sketch below
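A hedged sketch of the difference using OpenGL ES 1.x vertex arrays; GL_VERTEX_ARRAY is assumed to be enabled and the array data valid:

    #include <GLES/gl.h>

    /* Don't: one pointer setup and one draw call per quad. */
    static void draw_unbatched(const GLfloat *quads, int quad_count)
    {
        int i;
        for (i = 0; i < quad_count; ++i) {
            glVertexPointer(3, GL_FLOAT, 0, quads + i * 4 * 3);
            glDrawArrays(GL_TRIANGLE_FAN, 0, 4);
        }
    }

    /* Do: combine the geometry into one array and issue a single call. */
    static void draw_batched(const GLfloat *triangles, int vertex_count)
    {
        glVertexPointer(3, GL_FLOAT, 0, triangles);
        glDrawArrays(GL_TRIANGLES, 0, vertex_count);
    }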

  16. Vertex Buffer Objects [diagram] • Even better • Step 1: Create (the vertex data moves from the client side into a server-side VBO) • Step 2: Draw (the GPU sources vertices from the VBO) • The driver can optimize storage
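A minimal OpenGL ES 1.1 sketch of the two steps; the function names are illustrative, and GL_STATIC_DRAW assumes the geometry does not change every frame:

    #include <GLES/gl.h>

    /* Step 1: create the VBO once; the driver chooses server-side storage. */
    static GLuint create_vbo(const GLfloat *vertices, GLsizeiptr size_in_bytes)
    {
        GLuint vbo;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, size_in_bytes, vertices, GL_STATIC_DRAW);
        return vbo;
    }

    /* Step 2: draw from the VBO each frame; no per-frame vertex upload. */
    static void draw_vbo(GLuint vbo, GLsizei vertex_count)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, (const void *)0);  /* byte offset into the VBO */
        glDrawArrays(GL_TRIANGLES, 0, vertex_count);
    }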

  17. Vertex Buffer Objects [diagram] • Ultimately: Draw • The GPU reads the vertices from a VBO held in its local memory

  18. Don't interrupt the pipeline • GPUs are streaming architectures: they have high bandwidth but also long latencies • Minimize state changes: set up state and then render as much as possible, e.g. don't mix 2D and 3D operations • Don't lock buffers unnecessarily (OK at the end of a scene) • Don't use GetState unnecessarily: its main use is for 3rd-party libraries, and the app should know what the state is • All of these can cause interruptions to data streaming and may even cause the entire pipeline to be flushed
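One way to follow the GetState advice is to track state on the application side; a small hedged sketch (the global and helper are illustrative):

    #include <GLES/gl.h>

    /* Remember what was last set instead of querying it back with glGet*,
     * which can force the driver to synchronize with the pipeline. */
    static GLuint g_bound_texture = 0;

    static void bind_texture_cached(GLuint tex)
    {
        if (tex != g_bound_texture) {   /* only touch GL state when it actually changes */
            glBindTexture(GL_TEXTURE_2D, tex);
            g_bound_texture = tex;
        }
    }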

  19. More on State Changes • Major changes of state will always be expensive: changing textures (consider using a texture atlas), texture environment modes, light types (e.g. turning on specular) • Some are quite cheap: turning individual lights on and off, enabling/disabling the depth buffer • BUT this varies with the hardware (and driver) implementation

  20. Be nice to the Vertex Cache • Hardware typically holds 8-32 vertices: smaller than in software implementations, and likely to vary between implementations • In some implementations it is possible to trade the number of cached vertices against vertex size • Hardware cannot detect when two vertices are the same • To make efficient use of the cache, use triangle strips, triangle fans, or indexed triangles

  21. Vertex Cache • Individual triangles: 3 vertices/triangle • Triangle strip: 1 vertex/triangle (ideal case) • Indexed triangles: 0.5 vertices/triangle (ideal case)
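A short sketch of the indexed case in OpenGL ES 1.x: two triangles sharing an edge need only four unique vertices, and the shared ones can be served from the vertex cache instead of being transformed twice:

    #include <GLES/gl.h>

    static const GLfloat quad_vertices[4 * 3] = {
        0.0f, 0.0f, 0.0f,
        1.0f, 0.0f, 0.0f,
        1.0f, 1.0f, 0.0f,
        0.0f, 1.0f, 0.0f,
    };
    static const GLubyte quad_indices[6] = { 0, 1, 2,  0, 2, 3 };

    /* 6 indices but only 4 vertices: indices 0 and 2 can hit the vertex cache. */
    static void draw_indexed_quad(void)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, quad_vertices);
        glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, quad_indices);
    }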

  22. Depth Complexity • Drawing hidden triangles is expensive • Costs: • Transferring the geometry • Transform & lighting • Fetching texels for hidden pixels • Reading and writing the colour and depth buffer • Storing and reading the geometry (tiled or delay stream architectures) • Deferred rendering is not a magic solution • Can even make matters worse in some situations

  23. Depth Complexity - solutions • Render scene front to back • Eliminates many writes to colour and depth buffer • Can reduce texture fetches (early z kill) • Use standard occlusion culling techniques • View frustum culling (don’t draw objects that are out of view) • DPVS • Portals
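A hedged sketch of the front-to-back idea: sort opaque objects by their distance from the camera before drawing, so the depth test can reject hidden pixels early. The DrawItem record and its fields are illustrative; how the per-object depth is computed is up to the engine:

    #include <stdlib.h>

    typedef struct {
        float    depth;      /* distance from the camera, computed elsewhere */
        unsigned object_id;  /* whatever the engine uses to issue the draw */
    } DrawItem;

    static int compare_front_to_back(const void *a, const void *b)
    {
        float da = ((const DrawItem *)a)->depth;
        float db = ((const DrawItem *)b)->depth;
        return (da > db) - (da < db);   /* nearest first */
    }

    /* Sort opaque draw items nearest-first before submitting them. */
    static void sort_front_to_back(DrawItem *items, size_t count)
    {
        qsort(items, count, sizeof(DrawItem), compare_front_to_back);
    }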

  24. Future Trends • Vertex shaders commonplace • More local caches for graphics hardware • Display resolution increasing • More pixel pipelines • Better power management • Integration of other functionality • Video

  25. Conclusion • Embedded graphics architecture development is accelerating • Tracking PC development (but taking some ‘short cuts’) • Large variation in the performance of current hardware • Developers cannot rely on hardware to ‘do everything’ • Many techniques for getting the most out of hardware.

  26. Questions?
