
Larrabee

Larrabee. Eric Jogerst, Cortlandt Schoonover, Francis Tan. Intel’s new approach to a GPU, considered to be a hybrid that combines the functions of a multi-core CPU with the functions of a GPU.

nishan

Presentation Transcript


  1. Larrabee • Eric Jogerst • Cortlandt Schoonover • Francis Tan

  2. Larrabee • Intel’s new approach to a GPU • Considered to be a hybrid between a multi-core CPU and a GPU • Combines functions of a multi-core CPU with the functions of a GPU

  3. Larrabee

  4. Larrabee Fetch

  5. Fetch • Utilizes a hardware prefetcher • Supports four threads of execution • Separate register files for each thread • Switches threads to cover cases where the compiler cannot schedule code without stalls, or where the prefetcher has not yet received new instructions • Inactive thread data is written to the core’s local L2 cache
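
The latency-hiding effect of switching among four threads can be illustrated with a toy model. This is only a sketch under stated assumptions, not Larrabee's actual selection logic: the round-robin policy, the `run_core` name, and the encoding of each instruction as a stall count are all hypothetical.

```python
def run_core(threads, max_cycles=100):
    """Toy model of one in-order core choosing among hardware threads.

    `threads` is a list of per-thread instruction lists. Each entry is the
    number of stall cycles the thread must wait after that instruction
    issues (0 means it can issue again on the next cycle).
    Returns (cycles_used, trace); trace[c] is the thread id that issued in
    cycle c, or None if every thread was stalled.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    trace = []
    last = -1             # thread that issued most recently
    cycle = 0
    while cycle < max_cycles and any(pc[t] < len(threads[t]) for t in range(n)):
        issued = None
        # Assumed policy: round-robin, starting after the last issuing thread.
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                issued = t
                break
        if issued is not None:
            stall = threads[issued][pc[issued]]
            pc[issued] += 1
            ready_at[issued] = cycle + 1 + stall
            last = issued
        trace.append(issued)
        cycle += 1
    return cycle, trace
```

In this model a single thread running `[2, 0]` leaves the pipeline idle for two cycles, while four such threads keep it busy every cycle, because another thread is always ready when one stalls.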

  6. Larrabee Pipeline organization

  7. Pipeline • Pipeline derived from the dual-issue Pentium processor, which has five stages • Short, inexpensive execution pipeline • Pairing rules for primary and secondary instruction pipes are deterministic • Allows compilers to perform offline analysis with a wide scope

  8. Pipeline • Pairing rules for primary and secondary instruction pipes are deterministic • Allows compilers to perform offline analysis with a wide scope • All instructions can be issued on the primary pipeline • Minimizes the combinatorial problems for a compiler • Secondary pipeline can execute a large subset of the x86 instruction set • Small and cheap • Power wasted by failing to dual-issue on every cycle is minimal
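
Because the pairing rules are deterministic, a compiler can predict issue cycles from the instruction sequence alone. The sketch below shows the idea with an invented rule. The opcode names, the `SECONDARY_OK` set, and the rule itself are assumptions for illustration; the slides do not list which instructions the secondary pipe accepts, and data dependencies are ignored here.

```python
# Hypothetical set of opcodes the secondary pipe can execute.
SECONDARY_OK = {"load", "store", "add", "branch"}

def count_issue_cycles(instrs, secondary_ok=SECONDARY_OK):
    """Count issue cycles for an in-order dual-issue pipe with a fixed rule.

    Assumed rule: each cycle the next instruction goes to the primary pipe
    (which accepts anything); the instruction after it is paired on the
    secondary pipe only if its opcode is in `secondary_ok`.
    """
    cycles = 0
    i = 0
    while i < len(instrs):
        if i + 1 < len(instrs) and instrs[i + 1] in secondary_ok:
            i += 2   # dual issue: primary + secondary
        else:
            i += 1   # single issue on the primary pipe
        cycles += 1
    return cycles
```

Since the count depends only on the static opcode order, a compiler could evaluate a function like this offline to compare candidate instruction schedules.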

  9. Pipeline • Each core has its own pipeline • Based upon the 5-stage Pentium • Dual-issues instructions • In-order execution • Pipeline is shared between threads • Hardware can switch between threads that have instructions ready to execute

  10. Pipeline • Software-rendering pipeline is designed to minimize the number of locks and other synchronization events • Graphics-rendering pipeline is written with high-level languages and tools • Enables developers to add innovative rendering capabilities

  11. Larrabee SIMD organization

  12. Vector Processor Unit • 16-wide vector processor unit (VPU) • Executes integer, single-precision float, and double-precision float instructions • VPU and its registers are approximately one-third the area of the processor core • Tradeoff • Increased computational density • Wider VPUs are harder to keep highly utilized
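
One way to see the width tradeoff is to count how many lanes do useful work when a fixed number of independent elements is processed in vector-sized chunks. This is a toy model, assuming the only source of waste is the partially filled last chunk; `lane_utilization` is a name made up for this sketch.

```python
import math

def lane_utilization(n_items, width):
    """Fraction of vector lanes doing useful work when `n_items` independent
    elements are processed in chunks of `width` lanes."""
    if n_items <= 0 or width <= 0:
        raise ValueError("n_items and width must be positive")
    vector_ops = math.ceil(n_items / width)   # last chunk may be partly empty
    return n_items / (vector_ops * width)
```

For 20 elements, a 16-wide unit needs only 2 vector operations where a 4-wide unit needs 5, but in this model it uses 62.5% of its lanes where the 4-wide unit uses 100%.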

  13. Vector Processor Unit • VPU instructions can be predicated by a mask register • Mask controls which parts of a vector register or memory location are written and which are left untouched • Advantages • Reduces branch misprediction penalties • Gives instruction scheduler greater freedom
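
Mask predication can be sketched in plain Python, with lists standing in for vector registers. This is an illustration of the concept only, not the Larrabee instruction set; `masked_add` and `abs_diff` are hypothetical helper names.

```python
def masked_add(dest, a, b, mask):
    """Per-lane predicated add: lanes whose mask bit is set receive
    a[i] + b[i]; all other lanes keep the old value from dest."""
    if not (len(dest) == len(a) == len(b) == len(mask)):
        raise ValueError("all operands must have the same number of lanes")
    return [x + y if m else d for d, x, y, m in zip(dest, a, b, mask)]

def abs_diff(a, b):
    """Branch-free |a - b| per lane: run both sides of the 'if' on every
    lane and let complementary masks decide which result is written."""
    zeros = [0] * len(a)
    neg_a = [-x for x in a]
    neg_b = [-y for y in b]
    ge = [x >= y for x, y in zip(a, b)]       # mask for the "then" side
    lt = [not m for m in ge]                  # mask for the "else" side
    out = masked_add(zeros, a, neg_b, ge)     # lanes where a >= b: a - b
    return masked_add(out, b, neg_a, lt)      # remaining lanes:    b - a
```

Both sides of the conditional execute as straight-line code, so there is no per-element branch to mispredict.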

  14. Number of Cores • Many-core processor • Planned to have 24 to 48 cores

  15. Number of Cores

  16. Number of Cores

  17. Larrabee System on-chip components

  18. System On-Chip Components • x86 compute cores • Dual-issue, in-order processors that support the x86 instruction set with Larrabee extensions • Connected to the ring network, with a high-bandwidth connection to the adjacent L2 cache subset

  19. System On-Chip Components • L2 cache subsets • High-bandwidth access to the adjacent CPU core • Connected directly to the ring network • Coherent cache; uses the ring network to check coherency when allocating new cache lines

  20. System On-Chip Components • Ring network nodes • Simple bi-directional routers with a 512-bit data path in each direction (1024 bits total width) • Organized in rings of 8-16 cores and other devices • Interconnected with other rings • All data moved between cores and fixed-function units passes through the ring network
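
A bi-directional ring lets a message travel whichever way around is shorter. The sketch below computes that choice. The slides do not describe the routing policy, so shortest-path routing, the clockwise node numbering, and the `ring_route` name are assumptions.

```python
def ring_route(src, dst, n_nodes):
    """Return (direction, hops) for the shorter way around a bi-directional
    ring whose nodes are numbered 0..n_nodes-1 clockwise.

    Ties go clockwise, an arbitrary choice made for this sketch.
    """
    if not (0 <= src < n_nodes and 0 <= dst < n_nodes):
        raise ValueError("node index out of range")
    cw = (dst - src) % n_nodes    # hops going clockwise
    ccw = (src - dst) % n_nodes   # hops going counter-clockwise
    return ("cw", cw) if cw <= ccw else ("ccw", ccw)
```

On a 16-node ring the worst case under this policy is 8 hops, half of what a one-directional ring would need.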

  21. System On-Chip Components • Fixed-function logic components • Provide rasterization, interpolation, and other commonly needed functions • Directly connected to the ring network • Will be spread among the cores to provide lower latency and load balancing on the ring network

  22. System On-Chip Components • Memory & I/O interface • Provides and manages communication between the ring network and off-chip devices • Manages initial routing and tasking of cores

  23. Larrabee Memory Hierarchy

  24. Memory Interface

  25. Larrabee On-Chip Interconnect

  26. On-Chip Interconnect • Ring interconnect bus • Similar to the ring bus in the Sony Cell processor

  27. Ring Bus

  28. Ring Bus Features • Bi-directional • 512 bits in each direction • Presumably running at core clock speed • Each element can take data from one direction on odd clock cycles and from the other direction on even clock cycles

  29. Ring Bus Comparisons • Half the bit width of AMD’s R600/RV670 ring bus • The higher clock speed of Larrabee’s bus should make up for the difference in bandwidth
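
The width-versus-clock argument is simple arithmetic: peak bandwidth is bytes per transfer times transfers per second. The sketch below makes that explicit. It assumes one transfer per clock, and the clock values in the usage note are placeholders rather than published figures for either chip.

```python
def ring_bandwidth_gb_s(width_bits, clock_ghz):
    """Peak bandwidth in GB/s for a bus `width_bits` wide that moves one
    transfer per clock cycle at `clock_ghz`."""
    return width_bits / 8 * clock_ghz

def clock_ratio_to_match(narrow_bits, wide_bits):
    """How many times faster the narrow bus must be clocked to match the
    peak bandwidth of the wide bus."""
    return wide_bits / narrow_bits
```

For example, a 512-bit bus at a placeholder 1.0 GHz and a 1024-bit bus at 0.5 GHz both peak at 64 GB/s, so a bus with half the width breaks even only when its clock is at least twice as fast.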

  30. Ring Bus Tradeoff Analysis

  31. Ring Bus Tradeoff Analysis • Pros: • Straightforward, not complex • Able to deliver high bandwidth • Great performance if memory clients need high bandwidth. • Cons: • Waste of chip area if most applications don’t need high memory bandwidth • That area could be spent elsewhere to increase performance in a different way.

  32. Larrabee Multithreading organization

  33. Multithreading Organization • Superscalar • In-Order • Four Threads of execution • Dual issue (with a vector processing unit)

  34. Comparison to OO Execution

  35. Larrabee Vector Processor [figure; labels: “8 per clock”, “8 per clock”]

  36. Scheduling Policy • Software controlled • Software-controlled scheduling makes Larrabee more flexible than a typical GPU

  37. Software Controlled Scheduling • Pros: • Flexible: can choose the scheduler to suit the application • Worst case won’t be so bad (as compared to a hardware-encoded scheduling policy) • Cons: • Overhead of the scheduler takes a bite out of performance • Programmer overhead of selecting the correct scheduler
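
The flexibility of software-controlled scheduling comes from treating the policy as replaceable code. A minimal sketch follows. The task names, the `(name, cost)` task format, and both policies are hypothetical and are not taken from Larrabee's renderer.

```python
from collections import deque

def fifo_policy(queue):
    """Pick tasks in arrival order."""
    return queue.popleft()

def shortest_first_policy(queue):
    """Pick the task with the smallest estimated cost."""
    task = min(queue, key=lambda t: t[1])
    queue.remove(task)
    return task

def run_tasks(tasks, policy):
    """Drain `tasks`, a list of (name, cost) pairs, in the order chosen by
    `policy`, and return the task names in execution order."""
    queue = deque(tasks)
    order = []
    while queue:
        name, _cost = policy(queue)
        order.append(name)
    return order
```

Swapping `fifo_policy` for `shortest_first_policy` changes the execution order without touching `run_tasks`. The extra function call on every pick is a small instance of the scheduler overhead listed under the cons.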

  38. Criticism • NVIDIA • “like a GPU from 2006” • Unrealistic performance projections • Motivated by interest to retain market share

  39. Possible Market • DreamWorks Animation • Xbox / PlayStation • Scientific research

  40. Questions?
