12. Multithreaded Processors. Dezső Sima, Fall 2006
Overview
1 Introduction
2 Overview
3 Coarse grain multithreading
4 Fine grain multithreading
5 Simultaneous multithreading
1. Introduction (1)
Aim of multithreading: to raise performance compared to superscalar execution or multitasking by increasing parallelism at execution time.
Thread: a flow of control.
Main features of multithreading:
Threads
• belong to the same process,
• usually share a common address space (otherwise multiple virtual-to-real address translation paths need to be maintained in parallel),
• are executed simultaneously (overlapped or in parallel).
Thread management:
• creation, control and termination of threads,
• maintaining multiple sets of thread states,
• context switching between threads.
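The thread features listed above (same process, common address space, creation and termination) can be illustrated at the software level. The following is a minimal sketch using Python's standard `threading` module; the names `counter`, `worker` and `run` are made up for this illustration and do not come from the slides.

```python
import threading

# One object in the process's single address space; every thread sees it.
counter = {"value": 0}
lock = threading.Lock()


def worker(increments: int) -> None:
    """Each thread is a separate flow of control updating the shared object."""
    for _ in range(increments):
        with lock:  # serialize updates so no increment is lost
            counter["value"] += 1


def run(num_threads: int = 4, increments: int = 1000) -> int:
    counter["value"] = 0
    threads = [
        threading.Thread(target=worker, args=(increments,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()  # thread creation
    for t in threads:
        t.join()   # wait for thread termination
    return counter["value"]


if __name__ == "__main__":
    print(run())
```

Because all threads belong to one process, no copying or address translation is needed to share `counter`; separate processes would each get their own copy.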
1. Introduction (2)
Implementation of multithreading (while executing multithreaded apps/OSs)
Software implementation:
• execution of multithreaded apps/OSs on a single threaded processor by time sharing,
• multiple threads are maintained concurrently by the OS (multithreaded OSs).
Hardware implementation:
• execution of multithreaded apps/OSs on a multithreaded processor concurrently,
• multiple threads are maintained concurrently by the processor (multithreaded processors).
Fast context switching between threads is required.
1. Introduction (3)
Basic options to implement multithreaded processors:
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing)
• Multithreaded cores
[Figure labels: Chip, Core, MT core, L2/L3, L3/Memory]
1. Introduction (4)
Requirement of software multithreading: maintaining multiple thread states concurrently by the OS, including the PC, FX/FP registers and state registers.
Core enhancements needed in case of multithreaded cores:
• Maintaining multiple thread states, including the PC, architectural registers and state registers (in case of merged architectural and rename registers, by providing appropriately large FX/FP register files).
• Maintaining multiple thread microstates, pertaining to rename mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer, store queue, etc.
1. Introduction (6) Multithreaded OSs: • Windows NT • OS/2 • Unix w/Posix • most OSs developed from the 90’s on
Principle of sequential, multitask and multithreaded programming
[Figure: process/thread management example. Labels: processes P1-P3 created via fork()/exec() or CreateProcess(); threads T1-T6 created via CreateThread() and merged via join()]
Execution of sequential, multitask and multithreaded programs [Table; only the headings survive: Description, Key Advantages, Key Issues]
Implementation of multiprocessing and multithreading (2) [Table; only the headings survive: OS Support, Performance Level, Software Development]
2. Overview 2.1 Thread scheduling
Thread scheduling while implementing software multithreading on a traditional superscalar processor
The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed).
Figure 2.1: Thread scheduling in a traditional superscalar processor Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
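The context switch described above (save the state of the suspended thread, load the state of the thread to be executed) can be sketched as follows. This is an illustrative model only: `ThreadState`, `Core` and `context_switch` are hypothetical names, and the four-register file is an arbitrary placeholder size rather than a property of any real processor.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThreadState:
    """Saved state of one thread: program counter plus registers."""
    pc: int = 0
    registers: List[int] = field(default_factory=lambda: [0] * 4)


class Core:
    """Hypothetical single-threaded core: holds exactly one thread's state."""

    def __init__(self) -> None:
        self.pc = 0
        self.registers = [0] * 4


def context_switch(core: Core, saved: Dict[int, ThreadState],
                   old_tid: int, new_tid: int) -> None:
    # 1. Save the state of the suspended thread.
    saved[old_tid] = ThreadState(core.pc, list(core.registers))
    # 2. Load the state of the thread to be executed.
    state = saved[new_tid]
    core.pc = state.pc
    core.registers = list(state.registers)


if __name__ == "__main__":
    core = Core()
    saved = {0: ThreadState(), 1: ThreadState(pc=100, registers=[1, 2, 3, 4])}
    core.pc = 8                      # thread 0 has made some progress
    context_switch(core, saved, old_tid=0, new_tid=1)
    print(core.pc, saved[0].pc)
```

On a single threaded processor the OS performs both steps in software, which is why each switch costs time; multithreaded cores avoid the copying by keeping several such states in hardware.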
Thread scheduling in CMPs
Cores execute different threads independently.
Figure 2.2: Thread scheduling in a CMP Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
2. Overview Thread scheduling in multithreaded cores Coarse grain MT
Threads are switched by means of rapid, HW-supported context switches. Figure 2.3: Thread scheduling in a 4-way coarse grained multithreaded processor Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
2. Overview Thread scheduling in multithreaded cores Coarse grain MT Fine grain MT
The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle. Figure 2.4: Thread scheduling in a 4-way fine grained multithreaded processor Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
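The per-cycle thread choice can be sketched as a small scheduler model. The sketch below assumes a round robin selection policy that skips threads which are not ready; the function name `fine_grain_schedule` and the representation of readiness as one set of thread ids per cycle are assumptions made for this illustration.

```python
from typing import List, Optional, Set


def fine_grain_schedule(ready_by_cycle: List[Set[int]],
                        num_threads: int) -> List[Optional[int]]:
    """Pick one thread per cycle, round robin among the ready threads.

    ready_by_cycle[c] is the set of thread ids able to issue in cycle c.
    Returns the chosen thread id per cycle, or None if no thread is ready.
    """
    schedule: List[Optional[int]] = []
    last = num_threads - 1  # so that thread 0 is tried first
    for ready in ready_by_cycle:
        chosen = None
        # Try the threads in round robin order, starting after the last pick.
        for offset in range(1, num_threads + 1):
            candidate = (last + offset) % num_threads
            if candidate in ready:
                chosen = candidate
                break
        schedule.append(chosen)
        if chosen is not None:
            last = chosen
    return schedule


if __name__ == "__main__":
    # Thread 1 is stalled for the whole interval and is simply skipped.
    print(fine_grain_schedule([{0, 2, 3}] * 4, num_threads=4))
```

In contrast to the coarse grained scheme, no thread owns the pipeline for more than one cycle at a time, so a stalled thread costs nothing as long as another thread is ready.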
2. Overview Thread scheduling in multithreaded cores Coarse grain MT Fine grain MT Simultaneous MT (SMT)
Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle. SMT: Proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington). Figure 2.5: Thread scheduling in a 4-way simultaneous multithreaded processor Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
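What distinguishes SMT from fine grain MT is that instructions from several threads may share the issue slots of a single cycle. A minimal sketch of one cycle follows, assuming a strict priority selection policy; `smt_issue`, its parameters and the example numbers are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def smt_issue(available: Dict[int, int], priority: List[int],
              issue_width: int) -> Dict[int, int]:
    """Fill the issue slots of one cycle from several threads.

    available: thread id -> number of instructions ready this cycle.
    priority:  thread ids, highest priority first (assumed policy).
    Returns thread id -> number of instructions issued this cycle.
    """
    issued: Dict[int, int] = {}
    free_slots = issue_width
    for tid in priority:
        if free_slots == 0:
            break
        count = min(available.get(tid, 0), free_slots)
        if count > 0:
            issued[tid] = count
            free_slots -= count
    return issued


if __name__ == "__main__":
    # A 4-wide core: thread 0 has only one ready instruction,
    # so thread 1 fills the remaining three slots in the same cycle.
    print(smt_issue({0: 1, 1: 3, 2: 2}, priority=[0, 1, 2], issue_width=4))
```

A single thread rarely has enough ready instructions to fill a wide issue window every cycle; drawing from several threads at once raises slot utilization.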
2.2 Overview of multithreaded cores (1)
Superscalars, RISCs:
• IBM, single core multithreaded: RS64 IV (SStar) (2000), 2T, 0.18 µm / 44 mtrs
• IBM, dual core multithreaded: POWER5 (2004), 2T, 0.13 µm / 276 mtrs
• DEC/Compaq, single core multithreaded: Alpha 21464 (V8) (2003), 4T, 0.13 µm / 250 mtrs
• Sun, multi core multithreaded: UltraSPARC T1 (Niagara) (2005), 8 cores/4T, 0.09 µm / 279 mtrs
Figure 2.6: Multithreaded cores (1)
2.2 Overview of multithreaded cores (2)
Superscalars, CISCs (Intel):
• Single core multithreaded: Pentium 4 (Northwood) (2002), 0.13 µm / 55 mtrs
• Dual core multithreaded: Pentium EE 840 (4/2005), 0.09 µm / 230 mtrs; Pentium EE 955/965 (Presler) (4/2005), 0.065 µm / 2*188 mtrs
VLIWs (Intel):
• Dual core multithreaded: Montecito (2006?), 2*Itanium 2 (Madison), 0.09 µm / 1730 mtrs
Figure 2.7: Multithreaded cores (2)
2.2 Overview of multithreaded cores (3)
Underlying core(s)
Scalar core(s):
• SUN UltraSPARC T1 (2005) (Niagara): up to 8 cores, 4 threads
Superscalar core(s):
• IBM RS64 IV (2000) (SStar): 2-way
• Pentium 4 (2002): 2-way
• DEC 21464 (2003): Dual-core/2-way
• IBM POWER5 (2005): Dual-core/2-way
• Pentium EE 840 (2005): Dual-core/2-way
• Pentium EE 955/965 (2005): Dual-core/2-way
VLIW core(s):
• SUN MAJC 5200 (2000): Quad-core/4-way (dedicated use)
• Intel Montecito (2006?): Dual-core/2-way
3. Coarse grain multithreading 3.1 Overview (1) Thread scheduling in multithreaded cores Coarse grain MT Fine grain MT Simultaneous MT (SMT)
3. Coarse grain multithreading 3.1 Overview (2)
Coarse grain MT
• Scalar based
• Superscalar based: IBM RS64 IV (2000) (SStar), 2T
• VLIW based: SUN MAJC 5200 (2000), Quad-core/4T (dedicated use); Intel Montecito (2006?), Dual-core/2T
3.2 Case example 1: IBM RS 64 IV (1)
Microarchitecture: 4-way superscalar, dual-threaded.
Used in IBM’s iSeries and pSeries commercial servers.
Optimized for commercial server workloads, such as on-line transaction processing, Web serving and ERP (Enterprise Resource Planning).
Instruction fetch width: 8 instr./cycle
Architectural state:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special purpose privileged mode reg.s, such as the MSR (machine state reg.),
• status and control reg.s, such as T priority.
Each T executes in its own effective address space. Units used for address translation, such as the SRs (Segment Address Reg.s), need to be duplicated.
Duplicated resources: ~ +5 % chip area
Both single threaded and multithreaded modes of execution are supported.
3.2 Case example 1: IBM RS 64 IV (2) 6XX bus IERAT: Effective to real address translation cache (2x64 entries) Figure 3.1: Microarchitecture of IBM’s RS 64 IV Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
3.2 Case example 1: IBM RS 64 IV (3)
Aim: Commercial workloads
• large working sets and
• frequently occurring task switches
• need for large L1$s
• high cache miss rates
Thread switching (strongly simplified):
Two Ts are implemented: a foreground T and a background T.
The foreground T executes until a long latency event occurs, such as a cache miss or an IERAT miss. Subsequently, a T switch is performed and the background T begins to execute. After the miss is serviced, a T switch back to the foreground T occurs.
The Thread Switch Buffer holds up to 8 instructions from the background T, to eliminate the latency of the I$.
Threads can be allocated different priorities by explicit instructions.
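The strongly simplified switching scheme above can be modeled as a per-cycle timeline. The sketch below is an assumption-laden illustration, not IBM's actual control logic: it ignores the thread switch penalty, priorities and misses of the background thread, and the function name `coarse_grain_timeline` and its parameters are made up.

```python
from typing import List, Set


def coarse_grain_timeline(total_cycles: int, miss_cycles: Set[int],
                          miss_latency: int) -> List[str]:
    """Return which thread ("FG" or "BG") occupies the core in each cycle.

    miss_cycles: cycles in which the foreground thread, if running,
                 incurs a long latency miss (hypothetical input).
    """
    timeline: List[str] = []
    miss_done_at = None  # cycle in which the outstanding miss is serviced
    for cycle in range(total_cycles):
        if miss_done_at is not None and cycle >= miss_done_at:
            miss_done_at = None          # switch back to the foreground T
        if miss_done_at is None:
            timeline.append("FG")
            if cycle in miss_cycles:     # long latency event: switch away
                miss_done_at = cycle + 1 + miss_latency
        else:
            timeline.append("BG")        # background T fills the wait
    return timeline


if __name__ == "__main__":
    print(coarse_grain_timeline(8, miss_cycles={2}, miss_latency=3))
```

The cycles marked "BG" would be idle on a single threaded processor; in this model they are spent on useful work of the background thread.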
3.2 Case example 1: IBM RS 64 IV (4) Figure 3.2: Thread switch on data cache miss in IBM’s RS 64 IV Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
3.2 Case example 2: SUN MAJC 5200 (1)
Aim: Dedicated use, high-end graphics, networking with wire-speed computational demands.
Microarchitecture:
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs., e.g. 64 regs.,
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.
Each processor is a 4-wide VLIW and can be 4-way multithreaded.
3.2 Case example 2: SUN MAJC 5200 (2) Figure 3.3: General view of SUN’s MAJC 5200 Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
3.2 Case example 2: SUN MAJC 5200 (3) Figure 3.4: The principle of private, unified register files associated with each FU Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
3.2 Case example 2: SUN MAJC 5200 (4)
Threading
Each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun).
Implementation of 4-way multithreading: by executing each T by one of the 4 FUs („Vertical multithreading”).
Thread switch: following a cache miss, the processor saves the T state and begins to process the next T.
Example: comparison of program execution without and with multithreading on a 4-wide VLIW
Considered program:
• it consists of 100 instructions,
• on average 2.5 instrs./cycle are executed,
• a cache miss occurs after every 20 instructions,
• latency of serving a cache miss: 75 cycles.
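The numbers of this example can be worked through with a small model. The sketch below rests on assumptions that the slide does not state: every block of 20 instructions (8 cycles at 2.5 instrs./cycle) ends in a miss, including the last block; thread switches are free; and each of the four threads runs the same program. The function names are made up, and the resulting cycle counts are properties of this model, not figures from Sun's whitepaper.

```python
INSTRUCTIONS = 100      # from the slide
IPC = 2.5               # from the slide
INSTRS_PER_MISS = 20    # from the slide
MISS_LATENCY = 75       # cycles, from the slide

SEGMENTS = INSTRUCTIONS // INSTRS_PER_MISS      # 5 blocks per program
SEGMENT_CYCLES = round(INSTRS_PER_MISS / IPC)   # 8 cycles per block


def single_thread_cycles() -> int:
    """Each block runs, then the processor idles for the whole miss."""
    return SEGMENTS * (SEGMENT_CYCLES + MISS_LATENCY)


def multithreaded_cycles(num_threads: int) -> int:
    """Switch to another ready thread whenever the running thread misses."""
    ready_at = [0] * num_threads        # cycle at which each thread may run
    remaining = [SEGMENTS] * num_threads
    now = 0
    while any(remaining):
        runnable = [t for t in range(num_threads)
                    if remaining[t] and ready_at[t] <= now]
        if not runnable:                # all threads wait for memory
            now = min(ready_at[t] for t in range(num_threads) if remaining[t])
            continue
        tid = min(runnable, key=lambda t: ready_at[t])
        now += SEGMENT_CYCLES           # execute one block of 20 instructions
        ready_at[tid] = now + MISS_LATENCY
        remaining[tid] -= 1
    return max(ready_at)                # last miss serviced


if __name__ == "__main__":
    print(single_thread_cycles())       # 415 under the stated assumptions
    print(multithreaded_cycles(4))      # 439 under the stated assumptions
```

Under these assumptions one program takes 415 cycles alone, whereas four such programs finish in 439 cycles when multithreaded, instead of 4 × 415 = 1660 cycles when run one after the other. The three other threads cover only 24 of the 75 miss cycles, so part of each miss remains exposed.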
3.2 Case example 2: SUN MAJC 5200 (5) Figure 3.5: Execution for subsequent cache misses in a single threaded processor Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
3.2 Case example 2: SUN MAJC 5200 (6) Figure 3.6: Execution for subsequent cache misses in SUN’s MAJC 5200 Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
3.2 Case example 3: Intel Montecito (1)
Aim: High end servers
Main differences between Itanium 2 and Montecito:
• split L2 caches,
• larger unified L3 cache,
• duplicated architectural states maintained.
Additional support of dual-threading:
• the branch prediction structures provide T tagging,
• per-thread return stack structures,
• per-thread ALATs (Advanced Load Address Tables).
Additional core area needed: ~ 2 %.
3.2 Case example 3: Intel Montecito (2) Figure 3.7: Microarchitecture of Intel’s Itanium 2 Source: McNairy, C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55
3.2 Case example 3: Intel Montecito (3) Figure 3.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table) Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
3.2 Case example 3: Intel Montecito (4)
Thread switches: 5 event types cause thread switches, such as L3 cache misses and programmed switch hints.
Total switch penalty: 15 cycles
Example for thread switching: if the control logic detects that a thread doesn’t make progress, a thread switch will be initiated.
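The 15-cycle switch penalty implies a simple break-even consideration: switching pays off only when the expected stall is longer than the penalty. The following back-of-the-envelope sketch assumes that the other thread is ready to run and that the penalty is paid once per switch; `cycles_recovered` and the example stall lengths are hypothetical.

```python
SWITCH_PENALTY = 15  # cycles, total switch penalty quoted on the slide


def cycles_recovered(stall_cycles: int,
                     switch_penalty: int = SWITCH_PENALTY) -> int:
    """Cycles of useful work gained by switching instead of waiting.

    Simplified model: the other thread runs for the part of the stall
    that is left after paying the switch penalty.
    """
    return max(0, stall_cycles - switch_penalty)


if __name__ == "__main__":
    for stall in (10, 15, 100):  # hypothetical stall lengths in cycles
        print(stall, cycles_recovered(stall))
```

This is consistent with switching on long latency events such as L3 cache misses rather than on short stalls, which the penalty would consume entirely.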
3.2 Case example 3: Intel Montecito (5) Figure 3.9: Thread switch in Intel’s Montecito vs single thread execution Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4. Fine grain multithreading 4.1 Overview (1) Thread scheduling in multithreaded cores Coarse grain MT Fine grain MT Simultaneous MT (SMT)
4. Fine grain multithreading 4.1 Overview (2)
Fine grain MT
• Round robin selection policy: scalar based, superscalar based, VLIW based
• Priority based selection policy: scalar based, superscalar based, VLIW based
Example: SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
4.2 Case example: SUN UltraSPARC T1 (1)
Aim: Commercial server applications, such as
• web servicing,
• transaction processing,
• ERP (Enterprise Resource Planning),
• DSS (Decision Support Systems).
Characteristics of commercial server applications:
• large working sets,
• poor locality of memory references,
• high cache miss rates,
• low prediction accuracy for data dependent branches.
Memory latency strongly limits performance. Multithreading is used to hide memory latency.
4.2 Case example: SUN UltraSPARC T1 (3) Figure 4.3: Block diagram of SUN’s UltraSPARC T1 Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2 Case example: SUN UltraSPARC T1 (2) Structure: • 8 scalar cores, 4-way multithreaded each. • All 32 threads share an L2 cache of 3 MB, built up of 4 banks, • 4 memory channels with on chip DDR2 memory controllers. It runs under Solaris.
4.2 Case example: SUN UltraSPARC T1 (4) Figure 4.3: SUN’s UltraSPARC T1 chip Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf
4.2 Case example: SUN UltraSPARC T1 (6) Figure 4.3: Microarchitecture of the core of SUN’s UltraSPARC T1 Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2 Case example: SUN UltraSPARC T1 (5) Processor Elements (Sparc pipes): • Scalar FX-units, 6-stage pipeline • all Processor Elements share a single FP-unit Each thread of a processor element has its private: • PC-logic • register file, • instruction buffer, • store buffer.
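The split between private per-thread resources and shared resources can be written down as a data structure. The sketch below is only a model of the organization named on the slides (8 cores, 4 threads each, one shared FP unit, L2 in 4 banks); the class names and the 32-entry register file are placeholders chosen for illustration, not actual hardware parameters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HardwareThread:
    """Resources that each thread of a processor element has privately."""
    pc: int = 0
    register_file: List[int] = field(default_factory=lambda: [0] * 32)  # size assumed
    instruction_buffer: List[str] = field(default_factory=list)
    store_buffer: List[int] = field(default_factory=list)


@dataclass
class SparcPipe:
    """One scalar processor element holding four hardware threads."""
    threads: List[HardwareThread] = field(
        default_factory=lambda: [HardwareThread() for _ in range(4)])


@dataclass
class Chip:
    cores: List[SparcPipe] = field(
        default_factory=lambda: [SparcPipe() for _ in range(8)])
    shared_fp_units: int = 1   # a single FP unit shared by all elements
    l2_banks: int = 4          # shared L2 cache built up of 4 banks

    def total_threads(self) -> int:
        return sum(len(core.threads) for core in self.cores)


if __name__ == "__main__":
    print(Chip().total_threads())
```

Keeping a full copy of these resources per thread is what lets the core pick a different thread in every cycle without saving or restoring any state.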