130 likes | 261 Views
Review: Multiprocessor Basics. Q1 – How do they share data? Q2 – How do they coordinate? Q3 – How scalable is the architecture? How many processors?. CMP: Multiprocessors On One Chip.
E N D
Review: Multiprocessor Basics • Q1 – How do they share data? • Q2 – How do they coordinate? • Q3 – How scalable is the architecture? How many processors?
CMP: Multiprocessors On One Chip • By placing multiple processors, their memories and the IN all on one chip, the latencies of chip-to-chip communication are drastically reduced • ARM multi-chip core Configurable # of hardware intr Private IRQ Interrupt Distributor Per-CPU aliased peripherals CPU Interface CPU Interface CPU Interface CPU Interface Configurable between 1 & 4 symmetric CPUs CPU L1$s CPU L1$s CPU L1$s CPU L1$s Snoop Control Unit CCB Private peripheral bus I & D 64-b bus Optional AXI R/W 64-b bus Primary AXI R/W 64-b bus
Multithreading on A Chip • Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions • Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor • Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread • The caches, buffers can be shared (although the miss rates may increase if they are not sized accordingly) • The memory can be shared through virtual memory mechanisms • Hardware must support efficient thread context switching
Types of Multithreading • Fine-grain – switch threads on every instruction issue • Round-robin thread interleaving (skipping stalled threads) • Processor must be able to switch threads on every clock cycle • Advantage – can hide throughput losses that come from both short and long stalls • Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads • Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses) • Advantages – thread switching doesn’t have to be essentially free and much less likely to slow down the execution of an individual thread • Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss • Pipeline must be flushed and refilled on thread switches
Multithreaded Example: Sun’s Niagara (UltraSparc T1) • Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction) 4-way MT SPARC pipe 4-way MT SPARC pipe 4-way MT SPARC pipe 4-way MT SPARC pipe 4-way MT SPARC pipe 4-way MT SPARC pipe 4-way MT SPARC pipe 4-way MT SPARC pipe I/O shared funct’s Crossbar 4-way banked L2$ Memory controllers
Niagara Integer Pipeline • Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient Fetch Thrd Sel Decode Execute Memory WB RegFilex4 ALU Mul Shft Div D$ DTLB Stbufx4 Crossbar Interface Inst bufx4 Thrd Sel Mux I$ ITLB Decode Instr type Thread Select Logic Cache misses Traps & interrupts Thrd Sel Mux Resource conflicts PC logicx4
Simultaneous Multithreading (SMT) • A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP) • Most have more machine level parallelism than most programs can effectively use (i.e., than have ILP) • With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them • Need separate rename tables (ROBs) for each thread • Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle • Intel’s Pentium 4 SMT called hyperthreading • Supports just two threads (doubles the architecture state)
Coarse MT Fine MT Threading on a 4-way SS Processor Example SMT Issue slots → Thread A Thread B Time → Thread C Thread D
Multicore Xbox360 – “Xenon” processor • To provide game developers with a balanced and powerful platform • Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache • 165M transistors total • 3.2 Ghz Near-POWER ISA • 2-issue, 21 stage pipeline, with 128 128-bit registers • Weak branch prediction – supported by software hinting • In order instructions • Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else • An ATI-designed 500MZ GPU w/ 512MB of DDR3DRAM • 337M transistors, 10MB framebuffer • 48 pixel shader cores, each with 4 ALUs
DVD HDD Port Front USBs (2) Wireless MU ports (2 USBs) Rear USB (1) Ethernet IR Audio Out Flash Systems Control Core 0 Core 1 Core 2 L1D L1I L1D L1I L1D L1I 1MB UL2 XMA Dec GPU 512MB DRAM BIU/IO Intf SMC MC1 3D Core 10MB EDRAM Video Out Analog Chip Video Out MC0 Xenon Diagram
The PS3 “Cell” Processor Architecture • Composed of a Non-SMP Architecture • 234M transistors @ 4Ghz • 1 Power Processing Element, 8 “Synergistic” (SIMD) PE’s • 512KB L2 $ - Massively high bandwidth (200GB/s) bus connects it to everything else • The PPE is strangely similar to one of the Xenon cores • Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT • The real differences lie in the SPEs (21M transistors each) • An attempt to ‘fix’ the memory latency problem by giving each processor complete control over it’s own 256KB “scratchpad” – 14M transistors • Direct mapped for low latency • 4 vector units per SPE, 1 of everything else – 7M trans.
What about the Software? • Makes use of special IBM “Hypervisor” • Like an OS for OS’s • Runs both a real time OS (for sound) and non-real time (for things like AI) • Software must be specially coded to run well • The single PPE will be quickly bogged down • Must make use of SPEs wherever possible • This isn’t easy, by any standard • What about Microsoft? • Development suite identifies which 6 threads you’re expected to run • Four of them are DirectX based, and handled by the OS • Only need to write two threads, functionally