
Performance and Productivity of Emerging Architectures

Jeremy Meredith, Sadaf Alam, Jeffrey Vetter. Future Technologies.


Presentation Transcript


  1. Performance and Productivity of Emerging Architectures. Jeremy Meredith, Sadaf Alam, Jeffrey Vetter. Future Technologies.

  2. Overview
     [Figure: performance (GF) of GPU versus CPU by year, 1998 to 2006; vertical axis 0 to 300 GF.]

  3. Intel Xeon multicore CPU
     Dual-core (Woodcrest) and quad-core (Clovertown).
     Programming models: MPI, OpenMP, Pthreads.
     Resource contention: memory bandwidth, I/O bandwidth.
     [Block diagram labels: Dempsey, Woodcrest, Clovertown; DIB, 1066/1333 MHz, 8.5/10.5 GB/s, up to 21 GB/s; Blackford MCH; FBD memory channels; ESI; ESB-2 I/O bridge; Sunrise Lake; PCI-X; 10 GbE; I/OAT; iSCSI; SAS/SATA-2; configurable set of PCIe ports (x8); PCIe slot.]

  4. IBM Cell Broadband Engine
     Cell heterogeneous multicore processor.
     8 synergistic processing element (SPE) cores.
     200 GFLOPS at 3.2 GHz.
     200-GB/s memory bandwidth.
     Programming models: multisource compilation; threadlike launch semantics for SPEs.
     [Block diagram labels: eight SPEs, each with SPU, SXU, LS, and MFC (16 B/cycle); EIB (up to 96 B/cycle); PPE with PPU, PXU, L1, and L2 (32 B/cycle, 16 B/cycle); two BICs (16 B/cycle, 2x); FlexIO; dual XDR.]
     Source: M. Gschwind et al., Hot Chips 17, August 2005.

  5. Graphics processing unit (GPU)
     NVIDIA 7900GTX.
     24 SIMD pixel pipelines.
     200 GFLOPS at 650 MHz.
     50-GB/s memory bandwidth.
     Programming models: multiple-source compilation; gather semantics define parallelism; host CPU drives program setup and data transfer.
     [Block diagram labels: vertex shader units; triangle setup; Z-cull; shader instruction dispatch; pixel pipelines; L2 Tex; fragment crossbar; ROP engine; four memory partitions.]
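The phrase "gather semantics define parallelism" on the GPU slide can be illustrated with a small sketch. In a gather formulation, each output element is computed independently: it may read any input locations it needs, but it writes only its own slot, so the elements can be processed in parallel. The Python below only illustrates that access pattern on a CPU; it is not GPU shader code, and the function name `gather_sum` and the neighbor lists are hypothetical.

```python
def gather_sum(values, neighbors):
    """For each output i, sum the inputs listed in neighbors[i].

    Each iteration reads arbitrary input positions (a gather) but writes
    only out[i], so the iterations do not depend on one another.
    """
    out = [0.0] * len(neighbors)
    for i, idx_list in enumerate(neighbors):
        total = 0.0
        for j in idx_list:
            total += values[j]   # read from wherever the index points
        out[i] = total           # write only to this element's own slot
    return out


# Hypothetical inputs chosen for illustration.
values = [1.0, 2.0, 3.0, 4.0]
neighbors = [[1, 2], [0], [0, 1, 3], []]
result = gather_sum(values, neighbors)
```

Because no iteration writes to another iteration's output, the outer loop could be distributed across independent pipelines without synchronization; a scatter formulation (writing to computed locations) would not have that property.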

  6. Cray XMT system
     MTA-I and MTA-II processor architecture.
     XT3/4 scalable infrastructure.
     AMD Torrenza technology.
     Fine-grain multithreading (128 concurrent threads).
     Uniform memory hierarchy in MTA-I and MTA-II.
     [Figure: extreme multithreading with Cray MTA-II. Labels: programs running in parallel (1 to 4); serial code; sub-problem A; sub-problem B; concurrent threads of computation (i = 0 ... n); hardware streams (128); unused streams; instruction ready pool; pipeline of executing instructions.]
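A rough way to see why many hardware streams help is a toy issue model: one instruction issues per cycle from any ready stream, and a stream that has just issued is unavailable for a fixed number of cycles. This is a simplified sketch, not a model of the actual MTA pipeline; the function name `utilization` and the latency values are assumptions made for illustration.

```python
def utilization(num_streams, latency, cycles):
    """Fraction of cycles in which some stream could issue an instruction.

    Toy model: after a stream issues at cycle t, it is next ready at
    cycle t + latency. At most one instruction issues per cycle.
    """
    ready_at = [0] * num_streams
    issued = 0
    for t in range(cycles):
        for s in range(num_streams):
            if ready_at[s] <= t:
                ready_at[s] = t + latency   # stream waits on its last operation
                issued += 1
                break                       # only one issue slot per cycle
    return issued / cycles


# Hypothetical latency of 100 cycles; 128 streams matches the slide.
few_streams = utilization(num_streams=2, latency=4, cycles=8)
many_streams = utilization(num_streams=128, latency=100, cycles=1000)
```

In this model the issue slot stays busy whenever the number of streams is at least the latency, which is the intuition behind keeping an instruction ready pool fed by many concurrent streams.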

  7. Application kernels

  8. Performance
     OpenMP improves performance moderately on commodity CPUs.
     The high parallelism of Cell and GPU can improve performance, but more so when memory access is regular.
     [Bar chart: speedup (axis 0 to 8) for Molecular Dynamics and Covariance Matrix. Series: Reference (Woodcrest, single thread); OpenMP (Woodcrest, 2 cores); OpenMP (Clovertown, 4 cores); Cell (8 SPEs); GPU (NVIDIA 7900GTX); MTA-2 (1 processor, 128 streams).]
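The slides do not show the kernel source, so the following is only a plausible serial sketch of the two kernel types named in the chart, written to contrast their access patterns. The covariance loop sweeps every sample in a fixed order, while the molecular dynamics loop does work only for pairs inside a cutoff, so its memory traffic depends on the data. The Lennard-Jones potential in reduced units, the all-pairs search, and both function names are assumptions, not details taken from the presentation.

```python
def covariance_matrix(samples):
    """Sample covariance of the columns of `samples` (a list of rows)."""
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[k] for row in samples) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            acc = 0.0
            for row in samples:              # same regular sweep for every (a, b)
                acc += (row[a] - means[a]) * (row[b] - means[b])
            cov[a][b] = acc / (n - 1)
    return cov


def lj_forces(positions, cutoff):
    """Lennard-Jones forces in reduced units (epsilon = sigma = 1)."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    cutoff2 = cutoff * cutoff
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i == j:
                continue
            rij = [pi[k] - pj[k] for k in range(3)]
            r2 = sum(c * c for c in rij)
            if r2 >= cutoff2:                # data-dependent skip: irregular work
                continue
            inv_r2 = 1.0 / r2
            inv_r6 = inv_r2 ** 3
            # F_i = 24 * (2 r^-14 - r^-8) * (r_i - r_j)
            coeff = 24.0 * inv_r2 * inv_r6 * (2.0 * inv_r6 - 1.0)
            for k in range(3):
                forces[i][k] += coeff * rij[k]
    return forces


cov = covariance_matrix([[1.0, 2.0], [3.0, 6.0]])
forces = lj_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], cutoff=2.5)
```

In both kernels the outermost loop iterations are independent, which is the part a parallel programming model would distribute across cores, SPEs, or streams.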

  9. Productivity
     Despite small performance increases through OpenMP, the ease of using it means that productivity can still be increased.
     Conversely, the high speed of Cell and GPU means that even substantial effort results in higher productivity.
     [Bar chart: productivity improvement (axis 0.0 to 3.0) for Molecular Dynamics and Covariance Matrix. Series: Reference (Woodcrest, single thread); OpenMP (Woodcrest, 2 cores); OpenMP (Clovertown, 4 cores); Cell (8 SPEs); GPU (NVIDIA 7900GTX); MTA-2 (1 processor, 128 streams).]
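The slides do not state how productivity improvement is computed. One simple way to combine the two quantities discussed here is to divide speedup by relative development effort; the sketch below assumes that definition. The function name `relative_productivity` and the sample inputs are hypothetical and are not values read from the chart.

```python
def relative_productivity(speedup, relative_effort):
    """Speedup divided by effort, both relative to the reference version.

    Assumed definition: a value above 1.0 means the port gained more in
    performance than it cost in additional development effort.
    """
    if relative_effort <= 0:
        raise ValueError("relative_effort must be positive")
    return speedup / relative_effort


# Illustrative inputs only (not measurements from the presentation).
easy_port = relative_productivity(speedup=1.5, relative_effort=1.1)
hard_port = relative_productivity(speedup=6.0, relative_effort=3.0)
```

Under this assumed definition, both cases on the slide come out above 1.0: a modest speedup obtained cheaply, and a large speedup obtained with substantial effort.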

  10. Productivity scaling
      Increased parallelism from all architectures with little increased effort results in higher productivity.
      [Bar chart: speedup (axis 0 to 3) for Molecular Dynamics and Covariance Matrix; caption: scaling when accessing multiple devices. Series: Reference (Woodcrest, single thread); OpenMP (Woodcrest, 2 cores); OpenMP (Clovertown, 4 cores); MTA-2 (32 processors).]

  11. Contacts
      Jeremy Meredith, Future Technologies Group, (865) 241-5842, jsmeredith@ornl.gov
      Sadaf Alam, Future Technologies Group, (865) 241-1533, alamrs@ornl.gov
      Jeffrey Vetter, Future Technologies Group, (865) 576-7115, vetter@ornl.gov
      Meredith_Architectures_SC07
