
Parallel Computers Today

Two Nvidia 8800 GPUs > 1 TFLOPS. LANL / IBM Roadrunner > 1 PFLOPS. Intel 80-core chip > 1 TFLOPS. TFLOPS = 10^12 floating point ops/sec. PFLOPS = 1,000,000,000,000,000 floating point ops/sec (10^15).

kathyjones

Presentation Transcript


  1. Parallel Computers Today • Two Nvidia 8800 GPUs > 1 TFLOPS • LANL / IBM Roadrunner > 1 PFLOPS • Intel 80-core chip > 1 TFLOPS • TFLOPS = 10^12 floating point ops/sec • PFLOPS = 1,000,000,000,000,000 floating point ops/sec (10^15)
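
To make the two units on slide 1 concrete, here is a minimal sketch that converts an operation count into run time at each rate. The workload size `work` and the helper `seconds_for` are hypothetical and chosen only for illustration; the slide gives no workload, and sustained rates on real codes are normally below the quoted peaks.

```python
# Unit definitions taken from slide 1.
TFLOPS = 10**12  # floating point operations per second
PFLOPS = 10**15  # floating point operations per second


def seconds_for(flop_count, rate):
    """Seconds needed to execute flop_count operations at a sustained rate."""
    return flop_count / rate


# Hypothetical workload of 10^18 operations (an assumption, not from the slides).
work = 10**18

ratio = PFLOPS // TFLOPS             # how many TFLOPS make one PFLOPS
t_tera = seconds_for(work, TFLOPS)   # run time on a 1 TFLOPS machine
t_peta = seconds_for(work, PFLOPS)   # run time on a 1 PFLOPS machine

print(ratio, t_tera, t_peta)
```

Under these assumptions the same job drops from roughly eleven and a half days at 1 TFLOPS to under seventeen minutes at 1 PFLOPS.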

  2. Columbia (10240-processor SGI Altix, 50 Teraflops, NASA Ames Research Center)

  3. Beowulf (18-processor cluster, lab machine)

  4. AMD Opteron quad-core die

  5. The nVidia G80 GPU • 128 streaming floating point processors @ 1.5 GHz • 1.5 GB shared RAM with 86 GB/s bandwidth • 500 Gflops on one chip (single precision)
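
The figures on slide 5 can be combined into two rough ratios: how many operations each processor must issue per clock cycle to reach the quoted peak, and how many operations the chip can perform per byte of memory traffic. This is a back-of-the-envelope sketch, assuming the slide's bandwidth figure is in gigabytes per second and that all numbers are peak values; the variable names are illustrative.

```python
# Peak figures quoted on slide 5 (single precision).
processors = 128
clock_hz = 1.5e9
peak_flops = 500e9
bandwidth_bytes_per_s = 86e9   # assumption: the slide's figure means gigabytes/sec
bytes_per_float = 4            # size of one single-precision value

# Operations each processor must issue per cycle to reach the quoted peak.
flops_per_proc_per_cycle = peak_flops / (processors * clock_hz)

# Operations available per byte moved to or from the on-board RAM.
flops_per_byte = peak_flops / bandwidth_bytes_per_s

# Operations available per single-precision value loaded from RAM.
flops_per_float_loaded = flops_per_byte * bytes_per_float

print(round(flops_per_proc_per_cycle, 2))
print(round(flops_per_byte, 2))
print(round(flops_per_float_loaded, 2))
```

On these numbers a kernel needs on the order of twenty operations per value fetched from RAM before arithmetic, rather than memory bandwidth, becomes the limit, which is one reason the question of where the data lives matters so much.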

  6. The Computer Architecture Challenge • Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices. • Originally, because linear algebra is the middleware of scientific computing. • Nowadays, mostly for bragging rights. [Figure: the factorization PA = LU]
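
For readers who want to see the computation slide 6 refers to, the following is a minimal sketch of Gaussian elimination with partial pivoting on a small dense matrix, producing a row permutation P, a unit lower triangular L, and an upper triangular U with PA = LU. It is plain, unoptimized Python for illustration only, not the blocked, tuned code that benchmark runs use; the function names and the 3x3 test matrix are assumptions introduced here.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P A = L U.

    Returns (perm, l, u), where row i of P A is row perm[i] of A.
    """
    n = len(a)
    u = [[float(x) for x in row] for row in a]  # working copy, becomes U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the largest-magnitude entry in column k
        # on or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]  # carries along multipliers stored so far
            perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]    # multiplier that eliminates u[i][k]
            l[i][k] = m
            u[i][k] = 0.0
            for j in range(k + 1, n):
                u[i][j] -= m * u[k][j]
    return perm, l, u


def matmul(x, y):
    """Dense matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


# Small example matrix (an assumption, chosen so both steps need a row swap).
a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
perm, l, u = lu_partial_pivot(a)
pa = [a[i] for i in perm]   # rows of A reordered by the permutation
lu = matmul(l, u)           # should match pa up to rounding
```

The triple loop performs about (2/3)n^3 floating point operations for an n-by-n matrix while touching only n^2 data, which helps explain why dense factorization can run close to a machine's peak rate.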

  7. Top 500 List • http://www.top500.org/list/2008/11/100

  8. Generic Parallel Machine Architecture • Key architecture question: Where is the interconnect, and how fast? • Key algorithm question: Where is the data? [Diagram: storage hierarchy with several processors (Proc), each above its own Cache, L2 Cache, L3 Cache, and Memory, with a label marking "potential interconnects"]

  9. Multicore SMP Systems [Block diagrams of four systems with their caches, memory controllers, and memory bandwidths: Intel Clovertown, AMD Opteron, Sun Niagara2, IBM Cell Blade]

  10. More Detail on GPU Architecture

  11. Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts • http://www.csm.ornl.gov/workshops/HPA/documents/1-arch/feeding_the_beast_perrone.pdf

  12. Cray XMT (highly multithreaded shared memory)
