
NVIDIA Hardware

NVIDIA Hardware. Karl Hillesland November 2, 2000. Major release in fall, improvement in spring NV10: GeForce 256 (Fall 1999) NV15: GeForce2 GTS (Spring 2000) NV11: GeForce2 MX (Summer 2000) NV16: GeForce2 Ultra (Fall 2000) NV20: ??? (Anandtech: Dec 2000 - April 2001)

Presentation Transcript


  1. NVIDIA Hardware. Karl Hillesland, November 2, 2000.

  2. Cards discussed: major release in fall, improvement in spring. NV10: GeForce 256 (Fall 1999); NV15: GeForce2 GTS (Spring 2000); NV11: GeForce2 MX (Summer 2000); NV16: GeForce2 Ultra (Fall 2000); NV20: ??? (Anandtech: Dec 2000 - April 2001); NV25?: X-Box (Fall 2001).

  3. GeForce 256: 0.22 um process, 23 M transistors; 120 MHz core; 128-bit memory at 166 MHz SDR or 150 MHz DDR, up to 128 MB (64 MB is the largest I have heard of); AGP 4x with fast writes; 350 MHz RAMDAC; DVD; TV-out.

  4. GeForce 256 Triangles: 15 MTris/s (BenMark5 gives 13 M; I have seen other references to 14.5 M); up to 6 triangles “in flight” at a time; 2-matrix vertex skinning; texture coordinate generation (plus emboss, reflection, cube map); 8 lights.

  5. BenMark5: NV10: 13 MTris/s; NV15: 24 MTris/s.

  6. QuadEngine(TM) Architecture (from summer 99 notes): four independent pipelined engines (Transform Engine, Lighting Engine, Setup Engine, Rendering Engine). Industry-leading 3D performance: 15-25 M triangles/second; sustained DMA, transform/clip/light, setup, rasterize and render rate. Extremely efficient: >70% of the chip active at all times; up to 6 triangles “in flight” at a time. Super-pipelined design: very low latency between engines.

  7. GeForce 256 pixels/texels: 4 pixel pipes, one texture each; can do 2-texture multi-texturing by coupling pipes; 24/8-bit Z/stencil, 32-bit color (note: 4 * (24 + 8 + 32) = 256); register combiners; texture compression; 8-tap anisotropic filtering; range-based fog; anti-aliasing(?).

  8. GeForce 256 -> GeForce2 GTS: 2 textures per pipe; 25 M transistors; 0.18 micron technology; 200 MHz core clock; 166 MHz DDR (“333” MHz) memory; 25 MTris/s (BenMark5 gives 24 MTris/s); flat panel support.

  9. GeForce2 GTS -> GeForce2 MX: two pixel pipes removed (2 remain, 2 textures each); dual-head support; “Digital Vibrance Control”; low power and heat; slower core clock (175 MHz); either 64- or 128-bit memory possible; cheaper (intended for the ~$100 range).

  10. GeForce2 GTS -> GeForce2 Ultra: faster core clock (250 MHz); faster memory (225 MHz DDR, “450” MHz); expensive (~$500).

  11. GeForce -> Quadro: increased clock rates; acceleration of some common CAD-oriented features (e.g., anti-aliased lines).

  12. Bandwidths: AGP 4x: 1.2 GB/s; video memory: 333 MHz * 128 bits = 5.3 GB/s; PCI: 132 MB/s; host: PC100 with SDRAM = 1.6 GB/s.
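
The video memory figure on slide 12 is clock rate times bus width. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes); the function name is chosen here for illustration, and the PCI clock and width (33 MHz, 32 bits) are the standard PCI figures rather than numbers stated on the slide:

```python
def peak_bandwidth_bytes_per_s(effective_mhz, bus_bits):
    """Peak bandwidth = transfers per second * bytes per transfer."""
    return effective_mhz * 1e6 * bus_bits / 8

# Video memory from slide 12: "333" MHz effective (166 MHz DDR), 128-bit bus.
video = peak_bandwidth_bytes_per_s(333, 128)

# PCI: 33 MHz, 32-bit bus (assumed standard PCI parameters).
pci = peak_bandwidth_bytes_per_s(33, 32)

print(f"video memory: {video / 1e9:.1f} GB/s")
print(f"PCI:          {pci / 1e6:.0f} MB/s")
```

These are peak rates; sustained rates on a real bus are lower.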

  13. Vertex Bandwidth: Q3 -> 18 bytes per vertex (position: 2 * 3 = 6 bytes; texture coords, 2 textures: 2 * 2 * 2 = 8 bytes; color: 4 bytes). The double eagle: 10/16 bytes per vertex (position: 2 * 3 = 6 bytes; color: 4 bytes).

  14. Vertex Bandwidth, Q3: AGP 4x: 1.2 GB/s / 18 = 67 M verts/s; video memory: 5.3 GB/s / 18 = 294 M verts/s; PCI: 132 MB/s / 18 = 7.3 M verts/s; host, PC100 with SDRAM: 1.6 GB/s / 18 = 89 M verts/s.
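
The vertex rates on slide 14 are each bus bandwidth divided by the 18-byte Q3 vertex from slide 13. A small sketch of that division, using the bandwidth figures as quoted in the slides; the names `BANDWIDTH` and `max_vertex_rate` are illustrative:

```python
# Bytes per vertex for the Q3 layout on slide 13:
# position 3 * 2 bytes, two texture coordinate sets 2 * 2 * 2 bytes, color 4 bytes.
BYTES_PER_VERTEX = 3 * 2 + 2 * 2 * 2 + 4

# Peak bandwidths as quoted on slide 12, in bytes per second.
BANDWIDTH = {
    "AGP 4x": 1.2e9,
    "Video memory": 5.3e9,
    "PCI": 132e6,
    "Host (PC100 SDRAM)": 1.6e9,
}

def max_vertex_rate(bytes_per_s, bytes_per_vertex=BYTES_PER_VERTEX):
    """Upper bound on vertices per second if the bus carries only vertex data."""
    return bytes_per_s / bytes_per_vertex

for name, bw in BANDWIDTH.items():
    print(f"{name}: {max_vertex_rate(bw) / 1e6:.1f} M verts/s")
```

These are upper bounds: they assume the bus carries nothing but vertex data and always runs at its peak rate.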

  15. Add indices: assume “perfect strips” (one new vertex for each triangle). Each triangle -> 3 indices, 1 new vertex: 18 + 2 bytes/index * 3 indices/tri = 24 bytes/tri. Indices and vertices may come across different buses. A vertex cache can save some bandwidth.
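
The per-triangle cost on slide 15 can be worked through directly. With the slide's inputs (one new 18-byte vertex and three 2-byte indices per triangle) the sum is 24 bytes. The sketch below assumes vertices and indices share one bus, which the slide notes is not always the case; the function name is illustrative:

```python
def bytes_per_triangle(vertex_bytes=18, index_bytes=2,
                       indices_per_tri=3, new_verts_per_tri=1):
    """Bus traffic per triangle for indexed geometry with perfect vertex reuse."""
    return new_verts_per_tri * vertex_bytes + indices_per_tri * index_bytes

per_tri = bytes_per_triangle()

# Triangle rate if everything crosses AGP 4x at the 1.2 GB/s quoted on slide 12.
agp_tris_per_s = 1.2e9 / per_tri

print(f"{per_tri} bytes/tri, {agp_tris_per_s / 1e6:.0f} M tris/s over AGP 4x")
```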

  16. Texture Compositing (diagram): texture fetching supplies Tex0 and Tex1; the fragment color passes through Texture Environment 0 and Texture Environment 1, then Color Sum with the specular color, then Fog Application with the fog color/factor.

  17. Register Combiners: replace blending of fragment, texture, fog, and secondary colors; provide configurable 8-bit, signed math per-pixel operations; combiners can be cascaded for more sophisticated computations (hardware limit on levels, currently 2).

  18. Register Combiners (diagram): texture fetching feeds Texture 0 and Texture 1 into a register set that also holds fragment color, specular color, fog color/factor, and Spare 0. General Combiner 0 and General Combiner 1 each take 4 RGB inputs and 4 alpha inputs and produce 3 RGB outputs and 3 alpha outputs. The Final Combiner takes 6 RGB inputs and 1 alpha input.

  19. Input/Output mappings: input mappings: invert, negate, bias by 1/2, expand by 2; output mappings: bias by 1/2, scale by 1/2, 2 or 4.
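
The mappings on slide 19 can be sketched as small functions on floats. This is an illustration only: the function names are made up here, and three details are assumptions rather than statements from the slides. They are the clamping of inputs to [0, 1] before invert/bias/expand, the order of bias before scale on output, and the final clamp to [-1, 1].

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

# Input mappings.
def invert(x):
    """1 - x on the unsigned range [0, 1]."""
    return 1.0 - clamp(x, 0.0, 1.0)

def negate(x):
    return -x

def half_bias(x):
    """Bias by 1/2: shift [0, 1] to [-0.5, 0.5]."""
    return clamp(x, 0.0, 1.0) - 0.5

def expand(x):
    """Expand by 2: stretch [0, 1] to [-1, 1]."""
    return 2.0 * clamp(x, 0.0, 1.0) - 1.0

# Output mapping: optional bias of -1/2, then scale by 1/2, 1, 2 or 4.
def output_map(x, scale=1.0, bias=0.0):
    return clamp((x + bias) * scale, -1.0, 1.0)

print(expand(0.5), invert(0.25), output_map(0.75, scale=2.0, bias=-0.5))
```

Expand is what lets a normal vector stored in an unsigned texture be read back as a signed vector before a dot product.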

  20. General Combiner, RGB (diagram): the input and output registers are primary color, secondary color, texture 0, texture 1, spare 0, spare 1, fog, constant color 0, constant color 1, and zero; the diagram marks some of them as not writeable or not readable. Four inputs A, B, C, D each pass through an input map. Computations: A B + C D -or- A B mux C D; A B -or- A · B; C D -or- C · D. Results pass through scale and bias to the output registers.
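
A rough software model of the RGB general combiner on slide 20, with colors as 3-tuples of floats. All names are illustrative. The mux option is left out because the slides do not give its selection rule, and the clamp to [-1, 1] is an assumption. Whether the hardware permits the sum output together with a dot product is not covered here.

```python
def mul(a, b):
    """Component-wise product of two RGB triples."""
    return tuple(x * y for x, y in zip(a, b))

def dot(a, b):
    """Dot product, replicated across the three channels."""
    d = sum(x * y for x, y in zip(a, b))
    return (d, d, d)

def clamp3(v):
    return tuple(max(-1.0, min(1.0, x)) for x in v)

def general_combiner_rgb(a, b, c, d, ab_dot=False, cd_dot=False):
    """Return the three RGB outputs: (A B or A.B, C D or C.D, A B + C D)."""
    ab = dot(a, b) if ab_dot else mul(a, b)
    cd = dot(c, d) if cd_dot else mul(c, d)
    total = tuple(x + y for x, y in zip(ab, cd))
    return clamp3(ab), clamp3(cd), clamp3(total)

a, b = (0.5, 0.5, 0.5), (1.0, 0.0, 0.5)
c, d = (0.25, 0.25, 0.25), (1.0, 1.0, 1.0)
print(general_combiner_rgb(a, b, c, d))
print(general_combiner_rgb(a, b, c, d, ab_dot=True)[0])
```

The dot-product path is what makes per-pixel lighting possible: a normal fetched from a texture can be dotted with a light vector held in a constant color register.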

  21. General Combiner, Alpha (diagram): same registers and input maps as the RGB portion. Computations: A B + C D -or- A B mux C D; A B; C D. Results pass through scale and bias to the output registers.

  22. Final Combiner (diagram): input registers are primary color, secondary color, texture 0, texture 1, spare 0, spare 1, fog, constant color 0, constant color 1, and zero, plus two derived inputs: the product E F, and spare 0 + secondary color. Inputs A, B, C, D, E, F, G each pass through an input map. Fragment RGB out = A B + (1 - A) C + D; fragment alpha out = G.
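
The final combiner formula on slide 22, A B + (1 - A) C + D, is a linear interpolation between B and C controlled by A, plus an additive term D. A minimal sketch, with the function name invented here and the clamp of the result to [0, 1] assumed rather than taken from the slide:

```python
def final_combiner(a, b, c, d, g):
    """Return (fragment RGB, fragment alpha) = (A B + (1 - A) C + D, G)."""
    rgb = tuple(ai * bi + (1.0 - ai) * ci + di
                for ai, bi, ci, di in zip(a, b, c, d))
    rgb = tuple(max(0.0, min(1.0, x)) for x in rgb)
    return rgb, g

# Example use as a fog blend: A = fog factor, B = lit color, C = fog color, D = zero.
fog_factor = (0.25, 0.25, 0.25)
lit_color = (1.0, 0.5, 0.0)
fog_color = (0.5, 0.5, 0.5)
zero = (0.0, 0.0, 0.0)

print(final_combiner(fog_factor, lit_color, fog_color, zero, g=1.0))
```

With A set to the fog factor, the formula reproduces conventional fog blending, and D leaves room to add a specular term afterwards.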

  23. X-Box (Abrash on Dr. Dobbs): Intel PIII/733 with 238 KB cache; 250-300 MHz core; DVD, hard disk; custom sound with 64 3D-audio channels.

  24. X-Box Transform/lighting: 125 M Tris Gouraud, transformed, shaded, two textures; with one infinite light, 62.45 MTris/sec; with 8 local lights, 8 MTris/sec; 125 M particles/s (single-color front-facing squares); vertex programs; surface engine “works with CPU” for Catmull-Clark, Bezier, Loop, and uniform B-splines at 50 MTris/sec.

  25. Vertex Programs: replace transformation and lighting; custom vertex lighting; custom skinning and blending; custom texture coordinate generation; custom matrix operations; custom vertex computations of your choice.

  26. Vertex Programs: input is an untransformed, unlit vertex; create a transformed vertex; optionally compute lighting, texture coordinates, fog coordinates, point sizes.

  27. Vertex Programs cont.: does 4-vector floating-point math; 17 instructions: ARL, MOV, MUL, ADD, MAD, RCP, RSQ, DP3, DP4, DST, MIN, MAX, SLT, SGE, EXP, LOG, LIT.
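
To give a feel for the instruction set on slide 27, the sketch below models three of the listed instructions (DP4, MAD, MAX) on 4-tuples and uses four DP4s to transform a vertex by a 4x4 matrix, the usual way a vertex program replaces the fixed-function transform. It is a software illustration, not vertex program syntax: Python floats stand in for the hardware's number format, write masks and swizzles are ignored, and the helper `transform` is made up here.

```python
def dp4(a, b):
    """DP4: four-component dot product."""
    return sum(x * y for x, y in zip(a, b))

def mad(a, b, c):
    """MAD: component-wise a * b + c."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def vmax(a, b):
    """MAX: component-wise maximum."""
    return tuple(max(x, y) for x, y in zip(a, b))

def transform(matrix_rows, v):
    """One DP4 per matrix row yields each component of the transformed vertex."""
    return tuple(dp4(row, v) for row in matrix_rows)

# Translation by (2, 3, 4), stored as four row vectors (e.g. in program parameters).
rows = [
    (1.0, 0.0, 0.0, 2.0),
    (0.0, 1.0, 0.0, 3.0),
    (0.0, 0.0, 1.0, 4.0),
    (0.0, 0.0, 0.0, 1.0),
]
print(transform(rows, (1.0, 1.0, 1.0, 1.0)))
```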

  28. Vertex Program Registers (diagram): 16x4 vertex attribute registers feed the vertex program (128 instructions), which reads 96x4 program parameters (e.g., modelview projection matrix), uses 12x4 temporary registers, and writes 15x4 vertex result registers.

  29. Using Vertex Programs (OpenGL): programs are arrays of GLubytes (“strings”); created and managed similarly to texture objects; no penalty for switching in and out of vertex program mode; execution time is roughly proportional to program length.

  30. X-Box memory bandwidth: UMA with GPU in control; 64 MB, 128-bit, 200 MHz DDR RAM; 1 GPix/sec fill rate; plus “occlusion circuitry” and “automatic z compression”.

  31. X-Box bandwidth diagram

  32. X-Box Textures: 4 textures per pixel (but takes two clocks for >2); one texture can be used as a lookup into the next texture; 8 general register combiners + final combiner; 3D textures; cube maps, compression, etc.; 2- or 4-sample anti-aliasing.

  33. Texture compression (OpenGL): DXTC/S3TC; pre-compressed (DDS file); compressed by driver; DXT1/S3TC, DXT3, DXT5 (not DXT2, DXT4); ugly (be careful of trickery though).
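
For a sense of what the formats on slide 33 save, the sketch below computes compressed sizes. The block sizes are not on the slide: DXT1 stores each 4x4 texel block in 8 bytes, and DXT3 and DXT5 use 16 bytes per block. The function name is illustrative, and mipmaps are ignored.

```python
BLOCK_BYTES = {"DXT1": 8, "DXT3": 16, "DXT5": 16}

def dxt_size_bytes(width, height, fmt):
    """Size of one mip level; dimensions round up to whole 4x4 blocks."""
    blocks_w = (width + 3) // 4
    blocks_h = (height + 3) // 4
    return blocks_w * blocks_h * BLOCK_BYTES[fmt]

w = h = 256
uncompressed = w * h * 4  # 32-bit RGBA
for fmt in ("DXT1", "DXT3", "DXT5"):
    size = dxt_size_bytes(w, h, fmt)
    print(f"{fmt}: {size} bytes, {uncompressed // size}:1 versus 32-bit RGBA")
```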
