A Look at NVIDIA s Fermi GPU Architecture

Presentation Transcript


    1. A Look at NVIDIA’s Fermi GPU Architecture. Richard Walsh, Parallel Applications and Systems Manager, CUNY HPC Center. September 15, 2012.

    2. GPU vs CPU, Fermi vs Nehalem: functional-unit footprint of the instruction-oriented Nehalem x86-64 ISA and micro-architecture (4 cores).

    3. GPU vs CPU, Fermi vs Nehalem: functional-unit footprint of the data-oriented Fermi PTX ISA and micro-architecture (512 (448) cores).

    4. Fermi vs Tesla, Quick Comparisons: basic advances over 3 generations of NVIDIA GPGPUs.

    5. Fermi vs Tesla, Quick Comparisons (additional basic comparisons)
       Fermi’s clock is somewhat slower (Fermi: 1.15 GHz, Tesla: 1.44 GHz), but Fermi has more cores (Fermi: 14 SMs x 32 cores = 448 per GPU; Tesla: 30 SMs x 8 cores = 240 per GPU), and Fermi’s 64-bit peak performance is much higher:
       Fermi: 32-bit floating-point 1.03 Tflops, 64-bit floating-point 515 Gflops
       Tesla: 32-bit floating-point 1.03 Tflops, 64-bit floating-point 86 Gflops

    6. Measured Improvement in DP

    7. Fermi vs Tesla, Architecture Top Down

    8. Fermi Streaming Multiprocessor (SM)
       Local compute: 32 simple cores (1 FPU, 1 ALU each, pipelined), 16 load/store units, 4 special function units
       Local data storage: 128KB of general-purpose registers, 64KB shared memory / L1 cache
       Instruction issue: dual issue (2 warps = 2 x 32 instructions), 4 pipelines

    9. Fermi Improves Accuracy over Tesla
       Fully compliant with the IEEE 754-2008 standard: applies to both 32-bit and 64-bit floating-point, including rounding modes and denormalized numbers
       Higher-precision fused multiply-add (FMA)
       64-bit is now 1/2 of 32-bit speed (like a CPU), not 1/10
       What does this mean for the programmer?
       Equivalent (or better) numerical accuracy on the GPU, which eliminates the need to error-check GPU results
       Code kernels for GPU and CPU (more) similarly
       Somewhat more complex (slower??) 32-bit pipeline

    10. MAD vs FMA: GPU and CPU peak performance is measured in terms of combined multiply-add operations.

    11. Fermi Includes Cache, unlike Tesla
       Fermi's memory hierarchy now includes cache(s):
       768 KB L2 (unified) cache shared by all SMs
       64 KB SM shared memory / L1 cache, now configurable: divide into 16/48 or 48/16 KB segments, set by function call; shared by all threads in a block (same SM)
       What does it mean for the programmer?
       To program shared memory you must know the reference pattern, but many algorithms with locality only define it at runtime; in that case, ask for the large (48K) L1
       L2 helps algorithms with data shared across SMs
       Registers now spill to cache, not memory!
       Eases the programming burden (but encourages laziness ;-) ): just maximize the size of L1

    12. Fermi Cache Hierarchy

    13. Fermi Improves Memory Hierarchy
       Fermi’s memory subsystem reworked:
       Cache, memory, and buses have ECC: standard for large HPC systems with long jobs, but it slows performance 10-15%
       64-bit virtual address space (40-bit physical): 1 TByte is the current limit (not 4 GBytes); catches up with CPU memory architecture
       The hierarchy is now a single address space referenced by a single pointer type: improves performance, eases compiler optimization, enables full C++ support

    14. Fermi Improves Memory Hierarchy
       Fermi’s memory subsystem reworked (cont.):
       6 GDDR5 controllers running at 1.5 GHz
       Fermi peak: 6 x 8 bytes x 1.5 GHz x 2 transfers = 144 GB/sec; best measured: ~120 GB/sec
       Tesla peak: 8 x 8 bytes x 0.8 GHz x 2 transfers = 102 GB/sec; best measured: ~80 GB/sec
       Up to 6 GBytes of device memory, but our Fermis have only 3 GBytes of memory, less than our Teslas
       PCI-E 16x Gen 2 bus offers 2-way memory transfers
       Other things being equal, why might one say that peak memory bandwidth on a GPU does not need to be as high as it does on a CPU ... ??

    15. Fermi, Tesla Pointer Differences: Tesla’s PTX ISA 1.0 manages three distinct pointer types, while Fermi’s PTX 2.0 integrates these into a single 64-bit integer address.

    16. Fermi Improves Instruction Pipeline
       Fermi has a wider, improved instruction-issue pipeline:
       The GigaThread Engine maps threads to SM schedulers
       The scheduler can now issue 2 warps at once; a warp is 32 instructions wide, with no dependencies; warps are issued in thread order
       Each SM can have 32 x 48 = 1536 threads in flight
       The scheduler targets 4 pipelined functional-unit groups:
       2 16-unit, 2-cycle floating-point/integer groups
       1 16-unit, 2-cycle load-store group
       1 4-unit, 8-cycle special-function group

    17. Fermi Improves Instruction Pipeline: Performance Benefits
       More threads in flight, fewer bubbles in the pipeline, more forward progress, more latency-hiding opportunities

    18. Fermi Instruction Pipeline: Fermi warp schedulers, view #1

    19. Fermi Instruction Pipeline: Fermi warp scheduler, view #2

    20. Fermi Has a New ISA, PTX 2.0
       Fermi’s PTX 2.0 is a virtual instruction set (Tesla’s PTX 1.0 was also): mapped to machine instructions at runtime, which separates hardware design from compiler design
       Fermi’s PTX 2.0 adds:
       Predication for all instructions: marks code for both sides of a condition (execute if true, skip if false); smaller hardware footprint, better performance
       64-bit virtual addresses (only 40 bits are used)
       A single unified address space: just one type of load instruction
       Fine-grained exception handling

    21. Fermi Improves Fine-Grained Control
       Fermi delivers lower average task granularity:
       Atomic memory ops are implemented in the L2 cache, by a special dedicated integer unit local to L2: 5 to 20x faster synchronizations and reductions, important for pattern recognition, ray tracing, and sorting
       10x faster kernel-to-kernel context switching (the GigaThread Engine does this)
       Concurrent kernel execution (from the same code): 16 kernels at once (1 per SM); more efficient, packed use of GPU resources; easier to combine computation and rendering; exploits more data-level parallelism

    22. Fermi Kernel Packing Lowers Granularity: advantages of concurrent kernel execution

    23. On-line References, White Papers
       Here is a good list of references on Fermi:
       http://www.nvidia.com/content/PDF/fermi_white_papers/D.Patterson_Top10InnovationsInNVIDIAFermi.pdf
       http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf
       http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf
       http://www.nvidia.com/content/PDF/fermi_white_papers/T.Halfhill_Looking_Beyond_Graphics.pdf
       http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932

    24. CUNY HPC Center. Thank you ... Questions?
