Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan
Larrabee • Intel’s new approach to a GPU • Considered to be a hybrid between a multi-core CPU and a GPU • Combines functions of a multi-core CPU with the functions of a GPU
Larrabee Fetch
Fetch • Utilizes a hardware prefetcher • Supports four threads of execution • Separate register files for each thread • Switches threads to cover cases where the compiler is unable to schedule code without stalls, or where the prefetcher has not yet received new instructions • Inactive thread data is written to the core’s local L2 cache
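A minimal sketch of the thread switching described on this slide. The round-robin selection order and the names `pick_thread` and `stalled` are assumptions made for illustration; the slides do not say how the hardware chooses the next thread.

```python
NUM_THREADS = 4  # four hardware threads per core, per the slide


def pick_thread(current, stalled):
    """Return the next thread that is not stalled.

    Searches round-robin, starting with the thread after `current` and
    ending with `current` itself. Returns None if every thread is stalled.
    The round-robin order is an assumption, not a documented policy.
    """
    for step in range(1, NUM_THREADS + 1):
        candidate = (current + step) % NUM_THREADS
        if not stalled[candidate]:
            return candidate
    return None
```

For example, if threads 0 and 1 are waiting on memory, `pick_thread(0, [True, True, False, False])` selects thread 2, so the core keeps issuing instead of stalling.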
Larrabee Pipeline organization
Pipeline • Pipeline derived from the dual-issue Pentium processor, which has 5 stages • Short, inexpensive execution pipeline • Pairing rules for primary and secondary instruction pipes are deterministic • Allows compilers to perform offline analysis with a wide scope
Pipeline • All instructions can be issued on the primary pipeline • Minimizes the combinatorial problems for a compiler • Secondary pipeline can execute a large subset of the x86 instruction set • Secondary pipeline is small and cheap • Power wasted by failing to dual-issue on every cycle is minimal
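Because the pairing rules on the pipeline slides are deterministic, a compiler can predict offline which adjacent instructions will dual-issue. A rough sketch of that analysis follows. The contents of `SECONDARY_OK`, the dependency rule, and the instruction tuple format are all assumptions made for illustration, not Larrabee's actual rules.

```python
# Illustrative subset of operations the secondary pipe can run (assumed).
SECONDARY_OK = {"load", "store", "add", "branch"}


def can_pair(first, second):
    """Deterministic pairing check for two adjacent instructions.

    Each instruction is a tuple (op, dest, sources). `first` goes to the
    primary pipe, which accepts any instruction. `second` must be an op the
    secondary pipe supports and must neither read nor overwrite the
    destination of `first`.
    """
    _, dest1, _ = first
    op2, dest2, srcs2 = second
    if op2 not in SECONDARY_OK:
        return False
    if dest1 is not None and (dest1 in srcs2 or dest1 == dest2):
        return False
    return True


def issue_cycles(program):
    """Count issue cycles for an in-order, dual-issue pass over `program`."""
    cycles, i = 0, 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2  # both pipes used this cycle
        else:
            i += 1  # only the primary pipe used
        cycles += 1
    return cycles
```

With these assumed rules, a vector multiply followed by an independent load pairs into one cycle, while two dependent adds take one cycle each.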
Pipeline • Each core has its own pipeline • Based upon the 5-stage Pentium • Dual-issues instructions • In-order execution • Pipeline is shared between threads • Hardware can switch between threads that have instructions ready to execute
Pipeline • Designed software-rendering pipeline to minimize the number of locks and other synchronization events • Graphics-rendering pipeline written with high-level languages and tools • Enables developers to add innovative rendering capabilities
Larrabee SIMD organization
Vector Processor Unit • 16-wide vector processor unit (VPU) • Executes integer, single-precision float, and double-precision float instructions • VPU and its registers are approximately one-third the area of the processor core • Tradeoff • Wider VPUs increase computational density • Wider VPUs are harder to keep fully utilized
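A rough sketch of how lane utilization depends on VPU width. The function name `vpu_utilization` and the model (work is split into full-width passes, with the last pass possibly part-empty) are simplifying assumptions for illustration.

```python
import math


def vpu_utilization(num_elements, width=16):
    """Fraction of vector lanes doing useful work.

    Assumes `num_elements` items are processed in passes of `width` lanes,
    so the final pass may leave some lanes idle.
    """
    if num_elements <= 0 or width <= 0:
        raise ValueError("num_elements and width must be positive")
    passes = math.ceil(num_elements / width)
    return num_elements / (passes * width)
```

Under this model, 20 elements keep a 4-wide unit fully busy (`vpu_utilization(20, 4)` is 1.0) but fill only 62.5% of the lanes of a 16-wide unit (`vpu_utilization(20, 16)` is 0.625), although the 16-wide unit finishes in 2 passes instead of 5.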
Vector Processor Unit • VPU instructions can be predicated by a mask register • Mask controls which parts of a vector register or memory location are written and which are left untouched • Advantages • Reduces branch misprediction penalties • Gives instruction scheduler greater freedom
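A minimal sketch of mask predication as described on this slide, using plain Python lists to stand in for vector registers. The helper names `masked_write` and `vector_abs` are hypothetical; real hardware does this in a single instruction.

```python
def masked_write(dest, result, mask):
    """Return a new vector where masked lanes take `result`.

    Lanes whose mask entry is true are overwritten with the new result;
    lanes whose mask entry is false keep the old value from `dest`.
    """
    if not (len(dest) == len(result) == len(mask)):
        raise ValueError("dest, result and mask must have the same length")
    return [r if m else d for d, r, m in zip(dest, result, mask)]


def vector_abs(values):
    """Absolute value without a per-element branch.

    A compare produces the mask, the negation is computed for every lane,
    and the mask decides which lanes actually receive it.
    """
    mask = [v < 0 for v in values]
    negated = [-v for v in values]
    return masked_write(values, negated, mask)
```

The point of `vector_abs` is that the per-element `if` becomes data (the mask) rather than control flow, which is why predication avoids branch mispredictions.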
Number of Cores • Many-core processor • Planned to have 24 to 48 cores
Larrabee System on-chip components
System On-Chip Components • x86 compute cores • Dual-issue, in-order processors that support the x86 instruction set with Larrabee extensions • Connected to the ring network, with a high-bandwidth connection to the adjacent L2 cache subset
System On-Chip Components • L2 Cache subsets • High bandwidth access to adjacent CPU • Connected directly to the ring network • Coherent cache, uses the ring network to check coherency when allocating new cache lines
System On-Chip Components • Ring Network Nodes • Simple bi-directional routers with a 512-bit data path in each direction (1024 bits of total width) • Organized in rings of 8-16 cores and other devices • Interconnected with other rings • All data moved between cores and fixed-function units passes through the ring network
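A small sketch of why a bi-directional ring helps: a message can travel whichever way around is shorter. Shortest-direction routing, the tie-break, and the name `ring_route` are assumptions for illustration; the slides do not describe the actual routing policy.

```python
def ring_route(src, dst, num_nodes):
    """Pick the shorter direction around a bi-directional ring.

    Returns (direction, hops), where direction is +1 for clockwise and -1
    for counter-clockwise. Ties are sent clockwise (an arbitrary choice).
    """
    if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
        raise ValueError("src and dst must be valid node indices")
    clockwise = (dst - src) % num_nodes
    counter_clockwise = (src - dst) % num_nodes
    if clockwise <= counter_clockwise:
        return (1, clockwise)
    return (-1, counter_clockwise)
```

On a 16-node ring, node 0 reaches node 13 in 3 hops counter-clockwise rather than 13 hops clockwise, so the worst case is half the ring.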
System On-Chip Components • Fixed function logic components • Provides rasterization, interpolation and other commonly needed functions • Directly connected to the ring network • Will be spread among the cores to provide lower latency and load balancing on the ring network
System On-Chip Components • Memory & I/O interface • Provides and manages communication between the Ring Network and off chip devices. • Manages initial routing and tasking of cores
Larrabee Memory Hierarchy
Larrabee On-Chip Interconnect
On-Chip Interconnect • Ring interconnect bus • Similar to the Sony Cell processor.
Ring Bus Features • Bi-directional • 512 bits in each direction • Presumably running at core speed • Each element can accept data from one direction on odd clock cycles and from the other direction on even clock cycles
Ring Bus Comparisons • Compared to AMD’s R600/RV670 bus, it is half the bit-width. • The higher clock speed of Larrabee’s bus should make up for the difference in bandwidth.
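The comparison above comes down to width times clock. A small sketch of that arithmetic follows; the function name and the clock figures in the usage note are made up for illustration and are not published specifications of either chip.

```python
def bus_bandwidth_gb_s(width_bits, clock_ghz):
    """Peak bandwidth in GB/s: bytes moved per cycle times cycles per ns."""
    if width_bits <= 0 or clock_ghz <= 0:
        raise ValueError("width_bits and clock_ghz must be positive")
    return width_bits / 8 * clock_ghz
```

With hypothetical clocks, a bus of half the width needs twice the clock to break even: `bus_bandwidth_gb_s(512, 2.0)` and `bus_bandwidth_gb_s(1024, 1.0)` both give 128.0 GB/s.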
Ring Bus Tradeoff Analysis • Pros: • Straightforward, not complex • Able to deliver high bandwidth • Great performance if memory clients need high bandwidth. • Cons: • Waste of chip area if most applications don’t need high memory bandwidth • That area could be spent elsewhere to increase performance in a different way.
Larrabee Multithreading organization
Multithreading Organization • Superscalar • In-Order • Four Threads of execution • Dual issue (with a vector processing unit)
Larrabee Vector Processor [diagram; labels: 8 per clock, 8 per clock]
Scheduling Policy • Software controlled • More flexible than a typical GPU because scheduling is done in software
Software Controlled Scheduling • Pros: • Flexible: can choose the scheduler to suit the application • Worst case won’t be so bad (as compared to a hardware-encoded scheduling policy) • Cons: • Overhead of the scheduler takes a bite out of performance • Programmer overhead of selecting the correct scheduler
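A minimal sketch of what choosing the scheduler to suit the application can look like when scheduling lives in software: the policy is just a function that can be swapped. The task format, the two policies, and the names `run`, `fifo_policy`, and `shortest_first_policy` are hypothetical and not taken from Larrabee's software stack.

```python
from collections import deque


def fifo_policy(ready):
    """Run tasks in arrival order."""
    return ready.popleft()


def shortest_first_policy(ready):
    """Run the cheapest pending task first."""
    task = min(ready, key=lambda t: t[1])
    ready.remove(task)
    return task


def run(tasks, policy):
    """Run (name, cost) tasks one at a time in the order `policy` picks.

    Returns the task names in execution order. Each policy call is pure
    software work, which is the scheduler overhead listed under Cons.
    """
    ready = deque(tasks)
    order = []
    while ready:
        name, _cost = policy(ready)
        order.append(name)
    return order
```

Given `[("shade", 5), ("raster", 2), ("setup", 1)]`, `fifo_policy` runs shade, raster, setup, while `shortest_first_policy` runs setup, raster, shade. A fixed hardware scheduler would offer only one of these behaviors.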
Criticism • NVIDIA • “like a GPU from 2006” • Unrealistic performance projections • Motivated by interest to retain market share
Possible Market • DreamWorks Animation • Xbox / PlayStation • Scientific research