Larrabee

A many-core GPU architecture.

GPUs vs CPUs Price, performance, and evolution.

Definitions • CPU (Central Processing Unit) – general-purpose processor able to execute computer programs. • GPU (Graphics Processing Unit) – dedicated graphics rendering device.

Price and Performance • The NVIDIA GeForce 6800 Ultra can reach 40 GFLOPS, whereas a 3 GHz Intel Pentium 4 reaches only 6 GFLOPS. [1] • More impressive still, current cards such as the ATI HD 5870, AMD FireStream 9250, and NVIDIA GeForce 9800 run between 1 and 3 TFLOPS. • Reasons for this include highly parallel vector processing, fast onboard memory, and pipeline constraints which stream data without stalls.

Evolution • GPU performance has approximately doubled every 6 months since the mid-1990s. • CPU performance doubles every 18 months on average (Moore’s law).
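To make the two doubling rates concrete, here is a minimal sketch that compounds them over a few years. The function name `growth_factor` is illustrative, and the doubling periods (6 and 18 months) are taken from the slide as-is rather than measured. Over three years, doubling every 6 months gives 2^6 = 64x, while doubling every 18 months gives 2^2 = 4x.

```python
def growth_factor(years, doubling_months):
    """Factor by which performance grows after `years`,
    assuming it doubles once every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling periods quoted on the slide (assumed, not measured here).
GPU_DOUBLING_MONTHS = 6
CPU_DOUBLING_MONTHS = 18

for years in (1, 3, 5):
    gpu = growth_factor(years, GPU_DOUBLING_MONTHS)
    cpu = growth_factor(years, CPU_DOUBLING_MONTHS)
    print(f"{years} yr: GPU x{gpu:.0f}, CPU x{cpu:.1f}, gap x{gpu / cpu:.1f}")
```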

Current trends How we use GPUs.

Alternative applications • GPUs are increasingly used for scientific computing with data-parallel algorithms. Examples include:

Clustering GPU clustering to simulate the dispersion of airborne contaminants in New York City.

Image Stitching Fast seamless stitching and tone-mapping of gigapixel images. (~1 hour on a notebook PC)

Molecular Dynamics Molecular dynamics to evaluate forces between atoms that do not share bonds.

Architecture How it is built.

Key differences TYPICAL GPU • Ordered sequence of rendering steps. • Fixed hardware dedicated to each step. LARRABEE • Runs most of its pipeline in software on multiple general-purpose x86 cores. • This allows the rendering pipeline to be reconfigured dynamically, so steps can be skipped or given extra resources when required.
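The idea of a pipeline that lives in software can be sketched as an ordinary list of functions, where skipping a step is just a matter of leaving it out of the list. This is only a toy illustration: the stage names (`vertex_shading`, `rasterization`, `pixel_shading`, `blending`) and the `run_pipeline` helper are hypothetical and do not come from Larrabee's actual renderer.

```python
# Each stage is a plain function; here it only records that it ran.
def vertex_shading(trace):
    trace.append("vertex shading")
    return trace


def rasterization(trace):
    trace.append("rasterization")
    return trace


def pixel_shading(trace):
    trace.append("pixel shading")
    return trace


def blending(trace):
    trace.append("blending")
    return trace


def run_pipeline(stages):
    """Run the given stages in order and return the list of stages executed."""
    trace = []
    for stage in stages:
        trace = stage(trace)
    return trace


# A fixed-function GPU would always run every step in this order.
FULL_PIPELINE = [vertex_shading, rasterization, pixel_shading, blending]

# In software the sequence is just data, so a step can be dropped at run time
# (hypothetical case: fully opaque geometry that needs no blending).
OPAQUE_PIPELINE = [s for s in FULL_PIPELINE if s is not blending]

print(run_pipeline(FULL_PIPELINE))
print(run_pipeline(OPAQUE_PIPELINE))
```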

Larrabee CPU Core • The Larrabee core is “derived” from the Pentium processor. • 1 scalar unit for single operations and 1 vector unit for multiple operations. • 32 KB L1 instruction cache and 32 KB L1 data cache. • 256 KB L2 cache per core; the L2 caches are connected by a ring network.

Details • The 32 KB L1 cache is 4 times larger than the 8 KB L1 cache of the original Pentium. • This is because each core performs four-way multithreading, which reduces thread-switching overhead. (Not to be confused with simultaneous multithreading.) • The 256 KB L2 caches share a ring network. If a core is unable to find data in its own L2 cache, it places a request on the ring network, and the data is eventually supplied from elsewhere on the ring, such as another core's L2 cache. • Uses a rendering technique called binning, which divides the screen into regions and renders the polygons that fall in each region.
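The binning step can be sketched as follows: split the screen into square tiles and record, for each tile, which triangles might touch it. This is a minimal illustration under stated assumptions, not Larrabee's implementation: the function name `bin_triangles`, the 64-pixel tile size, and the use of a bounding-box overlap test (which is conservative and may list a triangle in a tile it does not actually cover) are all choices made for the example.

```python
def bin_triangles(triangles, width, height, tile):
    """Map each (tile_x, tile_y) to the indices of triangles whose
    screen-space bounding box overlaps that tile."""
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, then convert to tile indices.
        x0 = max(int(min(xs)), 0) // tile
        x1 = min(int(max(xs)), width - 1) // tile
        y0 = max(int(min(ys)), 0) // tile
        y1 = min(int(max(ys)), height - 1) // tile
        # Off-screen triangles produce empty ranges and land in no bin.
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


# Two hypothetical triangles on a 256x256 screen with 64x64 tiles.
triangles = [
    [(10, 10), (50, 10), (10, 50)],     # fits inside tile (0, 0)
    [(60, 60), (130, 60), (60, 130)],   # bounding box spans tiles 0..2 in x and y
]
bins = bin_triangles(triangles, width=256, height=256, tile=64)

# Each tile can now be rendered independently from its own triangle list.
for tile_coord in sorted(bins):
    print(tile_coord, bins[tile_coord])
```

Because every tile ends up with its own triangle list, tiles can be handed to different cores and rendered independently.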

Benefits of Larrabee • Game physics • Real-time ray tracing • Image and video processing • Physical simulation • Extended rendering capabilities

References • [1] Z. Fan, F. Qiu, A. Kaufman, S. Yoakum-Stover. 2004. GPU Cluster for High Performance Computing. ACM/IEEE Supercomputing Conference 2004, November 06-12, Pittsburgh, PA. • [2] L. Seiler et al. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, vol. 27, no. 3, Article 18, August 2008.
