A many-core GPU architecture. Larrabee
GPUs vs CPUs
Price, performance, and evolution.
Definitions
• CPU (Central Processing Unit): general-purpose processor able to execute computer programs.
• GPU (Graphics Processing Unit): dedicated graphics rendering device.
Price and Performance
• The NVIDIA GeForce 6800 Ultra can reach 40 GFLOPS, whereas a 3 GHz Intel Pentium 4 can reach only 6 GFLOPS. [1]
• More impressive still, current cards such as the ATI HD 5870, AMD FireStream 9250, and NVIDIA GeForce 9800 run at between 1 and 3 TFLOPS.
• Reasons for this include highly parallel vector processing, fast onboard memory, and a constrained pipeline that streams data without stalls.
Evolution
• GPU performance has approximately doubled every 6 months since the mid-1990s.
• CPU performance doubles every 18 months on average (the rate popularly attributed to Moore's law).
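
To make the gap between the two doubling rates concrete, here is a minimal sketch of the compound growth they imply. The function name `growth_factor` and the three-year horizon are illustrative assumptions, not figures from the slides.

```python
def growth_factor(years, doubling_months):
    """Factor by which performance grows after `years`
    if it doubles once every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# Hypothetical three-year horizon, chosen only for illustration.
gpu_growth = growth_factor(3, 6)    # six doublings
cpu_growth = growth_factor(3, 18)   # two doublings

print(f"GPU: x{gpu_growth:.0f}, CPU: x{cpu_growth:.0f}")
```

Under these assumed rates, three years give a GPU six doublings against two for a CPU, so the relative gap itself grows by a factor of 16 over that period.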
Current trends
How we use GPUs.
Alternative applications
• New trends show GPUs being used in scientific computing with data-parallel algorithms. Examples include:
Clustering
• GPU clustering to simulate the dispersion of airborne contaminants in New York City.
Image Stitching
• Fast seamless stitching and tone-mapping of gigapixel images. (~1 hour on a notebook PC)
Molecular Dynamics
• Molecular dynamics to evaluate forces between atoms that do not share bonds.
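
The molecular dynamics case shows why such workloads are data-parallel: the force on each atom can be computed independently of the force on every other atom. The sketch below assumes a Lennard-Jones potential, which the slides do not name; `lj_force_on`, `all_forces`, `epsilon`, and `sigma` are hypothetical names chosen for illustration. It runs serially in plain Python, but each call to `lj_force_on` is the kind of independent work item a GPU would assign to its own thread.

```python
def lj_force_on(i, positions, epsilon=1.0, sigma=1.0):
    """Net non-bonded force on atom i from all other atoms.

    Assumes a Lennard-Jones potential V(r) = 4*eps*((s/r)**12 - (s/r)**6);
    the choice of potential is an assumption, not taken from the slides.
    """
    xi, yi, zi = positions[i]
    fx = fy = fz = 0.0
    for j, (xj, yj, zj) in enumerate(positions):
        if j == i:
            continue
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        r2 = dx * dx + dy * dy + dz * dz
        s6 = (sigma * sigma / r2) ** 3
        # Force vector = 24*eps*(2*(s/r)**12 - (s/r)**6) / r**2 * (ri - rj)
        scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
        fx += scale * dx
        fy += scale * dy
        fz += scale * dz
    return (fx, fy, fz)


def all_forces(positions):
    """One independent computation per atom: the data-parallel part."""
    return [lj_force_on(i, positions) for i in range(len(positions))]


forces = all_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

With two atoms one unit apart, the forces are equal and opposite, as Newton's third law requires.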
Architecture
How it is built.
Key differences
TYPICAL GPU
• Ordered sequence of rendering steps.
• Fixed hardware dedicated to each step.
LARRABEE
• Runs most of its pipeline in software on multiple general-purpose x86 cores.
• This allows the rendering pipeline to be reconfigured dynamically, so steps can be skipped or extra resources allocated when required.
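
As a rough illustration of what a software pipeline buys, the sketch below models the rendering steps as an ordinary list of functions, so skipping a step is just a matter of building a different list. The stage names (`transform`, `cull`, `shade`) and their toy behavior are hypothetical and are not Larrabee's actual pipeline stages.

```python
def transform(points):
    """Toy transform stage: shift every point by (1, 1)."""
    return [(x + 1, y + 1) for x, y in points]


def cull(points):
    """Toy culling stage: drop points outside a 4x4 screen."""
    return [(x, y) for x, y in points if 0 <= x < 4 and 0 <= y < 4]


def shade(points):
    """Toy shading stage: attach a color to each surviving point."""
    return [((x, y), "white") for x, y in points]


def run_pipeline(stages, data):
    """Run the data through each stage in order."""
    for stage in stages:
        data = stage(data)
    return data


points = [(0, 0), (5, 5)]
full = run_pipeline([transform, cull, shade], points)
no_cull = run_pipeline([transform, shade], points)  # culling step skipped
```

In fixed-function hardware the sequence of stages is wired in; here it is data that can be changed at run time, which is the property the slide describes.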
Larrabee CPU Core
• The Larrabee core is “derived” from the Pentium processor.
• 1 scalar unit for single operations and 1 vector unit for multiple operations.
• 32KB L1 instruction cache and 32KB L1 data cache.
• 256KB L2 cache per core; the L2 caches are connected by a ring network.
Details
• The 32KB L1 cache is 4 times larger than the 8KB cache of the original Pentium.
• This is because each core performs four-way multithreading to reduce thread-switching overhead. (Not to be confused with simultaneous multithreading.)
• The 256KB L2 caches share a ring network. If a core cannot find data in its own L2 cache, it places a request on the ring network, and the data is eventually located in another core's L2 cache.
• Uses a rendering technique called binning, which divides the screen into regions and renders polygons region by region.
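
The binning step in the last bullet can be sketched as follows. This is a minimal version that assumes square tiles and assigns each triangle to every tile its bounding box overlaps, a conservative test chosen for brevity rather than the method described in [2]. The name `bin_triangles` and its parameters are hypothetical.

```python
def bin_triangles(triangles, width, height, tile):
    """Map each screen tile (tx, ty) to the indices of the triangles
    whose bounding box overlaps it.

    Bounding-box overlap is conservative: a triangle may be listed in a
    tile it does not actually cover. That is an assumption of this sketch.
    """
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the tile range to the screen.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(max_tx, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


triangles = [
    [(0, 0), (3, 0), (0, 3)],  # fits inside the top-left tile
    [(2, 2), (6, 2), (2, 6)],  # bounding box spans all four tiles
]
bins = bin_triangles(triangles, width=8, height=8, tile=4)
```

Once the bins are built, each tile can be rendered independently from its own triangle list, which is how the work is spread across cores.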
Benefits of Larrabee
• Game physics
• Real-time ray tracing
• Image and video processing
• Physical simulation
• Extended rendering capabilities
References
• [1] Zhe Fan, Feng Qiu, A. Kaufman, S. Yoakum-Stover. GPU Cluster for High Performance Computing. ACM/IEEE Supercomputing Conference 2004, November 6-12, 2004, Pittsburgh, PA.
• [2] L. Seiler et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, vol. 27, no. 3, Article 18, August 2008.