

  1. Dezső Sima 2011 December Platforms II. (Ver. 1.6) Sima Dezső, 2011

  2. 3. Platform architectures

  3. Contents
Platform architectures
3.1. Design space of the basic platform architecture
3.2. The driving force for the evolution of platform architectures
3.3. DT platforms
3.3.1. Design space of the basic architecture of DT platforms
3.3.2. Evolution of Intel's home user oriented multicore DT platforms
3.3.3. Evolution of Intel's business user oriented multicore DT platforms
3.4. DP server platforms
3.4.1. Design space of the basic architecture of DP server platforms
3.4.2. Evolution of Intel's low cost oriented multicore DP server platforms
3.4.3. Evolution of Intel's performance oriented multicore DP server platforms

  4. Contents
3.5. MP server platforms
3.5.1. Design space of the basic architecture of MP server platforms
3.5.2. Evolution of Intel's multicore MP server platforms
3.5.3. Evolution of AMD's multicore MP server platforms

  5. 3.1. Design space of the basic platform architecture

  6. 3.1 Design space of the basic platform architecture (1)
A platform architecture covers three aspects:
• Architecture of the processor subsystem: interpreted only for DP/MP systems; in SMPs it specifies the interconnection of the processors and the chipset, in NUMAs it specifies the interconnections between the processors.
• Architecture of the memory subsystem: specifies the point and the layout of the interconnection.
• Architecture of the I/O subsystem: specifies the structure of the I/O subsystem (will not be discussed).
Example: a Core 2/Penryn based MP SMP platform, in which the processors are connected to the MCH by individual FSBs, memory is attached to the MCH via serial FB-DIMM channels, and the chipset consists of two parts, designated as the MCH and the ICH.

  7. 3.1 Design space of the basic platform architecture (2)
The notion of basic platform architecture: of the three aspects of a platform architecture (the architectures of the processor, memory and I/O subsystems), the basic platform architecture comprises the architecture of the processor subsystem and the architecture of the memory subsystem.


  9. 3.1 Design space of the basic platform architecture (3)
Architecture of the processor subsystem: interpreted only for DP and MP systems; the interpretation depends on whether the multiprocessor system is an SMP or a NUMA.
• SMP systems: the scheme of attaching the processors to the rest of the platform (e.g. processors sharing an FSB to the MCH).
• NUMA systems: the scheme of interconnecting the processors.
(Figure: examples of both schemes.)

  10. 3.1 Design space of the basic platform architecture (4)
a) Scheme of attaching the processors to the rest of the platform (in case of SMP systems):
• DP platforms: single FSB or dual FSBs.
• MP platforms: single FSB, dual FSBs or quad FSBs.
(Figure: the processors attached to the MCH and memory via the respective FSB configurations.)

  11. 3.1 Design space of the basic platform architecture (5)
b) Scheme of interconnecting the processors (in case of NUMA systems):
• Partially connected mesh.
• Fully connected mesh.
(Figure: four processors with per-processor memories, interconnected as a partially or a fully connected mesh.)

  12. 3.1 Design space of the basic platform architecture (6)
The notion of basic platform architecture (recap): the basic platform architecture comprises the architecture of the processor subsystem and the architecture of the memory subsystem; the architecture of the I/O subsystem is not considered.

  13. 3.1 Design space of the basic platform architecture (7)
Architecture of the memory subsystem (MSS): determined by two aspects,
• the point of attaching the MSS, and
• the layout of the interconnection.

  14. 3.1 Design space of the basic platform architecture (8)
a) Point of attaching the MSS (Memory Subsystem) (1): should memory be attached to the MCH or to the processor?
(Figure: a platform with memory attached to the MCH vs. a platform with memory attached to the processor.)

  15. 3.1 Design space of the basic platform architecture (9)
Point of attaching the MSS – assessing the basic design options (2)
• Attaching memory to the MCH (Memory Control Hub): longer access time (~20-70%); as the memory controller is on the MCH die, the memory type (e.g. DDR2 or DDR3) and speed grade are not bound to the processor chip design.
• Attaching memory to the processor(s): shorter access time (~20-70%); as the memory controller is on the processor die, the memory type (e.g. DDR2 or DDR3) and speed grade are bound to the processor chip design.

  16. 3.1 Design space of the basic platform architecture (10)
Related terminology
• Attaching memory to the MCH (Memory Control Hub): in DT platforms, systems with off-die memory controllers; in DP/MP platforms, shared memory DP/MP systems, i.e. SMP systems (Symmetrical Multiprocessors).
• Attaching memory to the processor(s): in DT platforms, systems with on-die memory controllers; in DP/MP platforms, distributed memory DP/MP systems, i.e. NUMA systems (systems with non-uniform memory access).

  17. 3.1 Design space of the basic platform architecture (11)
Example 1: Point of attaching the MSS in DT systems
• Attaching memory to the MCH: DT system with an off-die memory controller (the processor connects over the FSB to the MCH, to which the memory is attached, and on to the ICH). Examples: Intel's processors before Nehalem.
• Attaching memory to the processor: DT system with an on-die memory controller (memory attached directly to the processor). Examples: Intel's Nehalem and subsequent processors.

  18. 3.1 Design space of the basic platform architecture (12)
Example 2: Point of attaching the MSS in DP servers
• Attaching memory to the MCH: shared memory DP server, aka Symmetrical Multiprocessor (SMP); memory does not scale with the number of processors. Examples: Intel's processors before Nehalem.
• Attaching memory to the processor(s): distributed memory DP server, aka system with non-uniform memory access (NUMA); memory scales with the number of processors. Examples: Intel's Nehalem and subsequent processors.

  19. 3.1 Design space of the basic platform architecture (13)
Point of attaching the MSS – examples
• Attaching memory to the MCH: UltraSPARC II (1C) (~1997); AMD's K7 lines (1C) (1999-2003); POWER4 (2C) (2001); PA-8800 (2004), PA-8900 (2005) and all previous PA lines; Core 2 Duo line (2C) (2006) and all preceding Intel lines, Core 2 Quad line (2x2C) (2006/2007), Penryn line (2x2C) (2008); Montecito (2C) (2006).
• Attaching memory to the processor(s): UltraSPARC III (2001) and all subsequent Sun lines; Opteron server lines (2C) (2003) and all subsequent AMD lines; POWER5 (2C) (2005) and subsequent POWER families; Nehalem lines (4C) (2008) and all subsequent Intel lines; Tukwila (4C) (2010).
(Figure: point of attaching the MSS.)

  20. 3.1 Design space of the basic platform architecture (14)
b) Layout of the interconnection
• Attaching memory via parallel channels: data are transferred over parallel buses, e.g. 64 data bits plus address, command and control as well as clock signals in each cycle.
• Attaching memory via serial links: data are transferred over point-to-point links in the form of packets, e.g. 16 cycles/packet on a 1-bit wide link or 4 cycles/packet on a 4-bit wide link, as the sketch below illustrates.
(Figure: attaching memory via parallel channels or serial links.)
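To make the cycle counts above concrete, here is a minimal Python sketch (illustrative only, not from the slides) computing how many cycles one packet needs on links of different widths:

```python
def cycles_per_packet(packet_bits: int, link_width_bits: int) -> int:
    """Cycles needed to serialize one packet over a link of the given width."""
    # Ceiling division: a partially filled last cycle still costs a full cycle.
    return -(-packet_bits // link_width_bits)

print(cycles_per_packet(16, 1))   # 16 cycles/packet on a 1-bit wide link
print(cycles_per_packet(16, 4))   # 4 cycles/packet on a 4-bit wide link
print(cycles_per_packet(16, 16))  # 1 cycle, as on a parallel bus of equal width
```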

  21. 3.1 Design space of the basic platform architecture (15)
b1) Attaching memory via parallel channels
The memory controller and the DIMMs are connected by a single parallel memory channel, or a small number of memory channels, to synchronous DIMMs such as SDRAM, DDR, DDR2 or DDR3 DIMMs.
Example 1: Attaching DIMMs via a single parallel memory channel to the memory controller that is implemented on the chipset [45]

  22. 3.1 Design space of the basic platform architecture (16)
Example 2: Attaching DIMMs via 3 parallel memory channels to memory controllers implemented on the processor die (this is Intel's Tylersburg DP platform, aimed at the Nehalem-EP processor and used for up to 6 cores) [46]

  23. 3.1 Design space of the basic platform architecture (17)
The number of lines of the parallel channels
The number of lines needed depends on the kind of memory module, as indicated below:
• SDRAM: 168-pin
• DDR: 184-pin
• DDR2: 240-pin
• DDR3: 240-pin
All these DIMM modules provide an 8-byte wide datapath and optionally ECC and registering.

  24. 3.1 Design space of the basic platform architecture (18)
b2) Attaching memory via serial links
Serial memory links are point-to-point interconnects that use differential signaling. Two options exist:
• Serial links attach S/P converters with parallel channels: the processor/MCH drives serial links to S/P converters, which in turn drive parallel channels to the DIMMs.
• Serial links attach FB-DIMMs: the FB-DIMMs themselves provide buffering and S/P conversion.

  25. 3.1 Design space of the basic platform architecture (19)
Example 1: FB-DIMM links in Intel's Bensley DP platform aimed at Core 2 processors (1)
• Processors: Xeon 5000 (Dempsey, 2x1C), Xeon 5100 (Woodcrest, 2C), Xeon 5300 (Clovertown, 2x2C), Xeon 5200 (Harpertown, 2C) and Xeon 5400 (Harpertown, 2x2C), i.e. 65 nm Pentium 4 Prescott DP (2x1C) and Core 2 (2C/2x2C) based lines.
• Platform: the processors connect over FSBs to the E5000 MCH; memory is attached over FB-DIMM channels with DDR2-533; the MCH connects over ESI to the 631xESB/632xESB IOH.
• ESI: Enterprise System Interface, 4 PCIe lanes at 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).

  26. 3.1 Design space of the basic platform architecture (20)
Example 2: SMI links in Intel's Boxboro-EX platform aimed at the Nehalem-EX processors (1)
• Processors: Xeon 6500 (Nehalem-EX, Becton, 8C) or Xeon E7-2800 (Westmere-EX, 10C); the two processors are interconnected by QPI and connect over QPI to the 7500 IOH, which attaches over ESI to the ICH10 and to the ME.
• Memory: each processor drives SMI links to four SMBs, which attach DDR3-1067 memory.
• SMI: serial link between the processor and the SMB. SMB: Scalable Memory Buffer, with parallel/serial conversion.
This is the Nehalem-EX aimed Boxboro-EX scalable DP server platform (for up to 10 cores).

  27. 3.1 Design space of the basic platform architecture (21)
Example 2: The SMI link of Intel's Boxboro-EX platform aimed at the Nehalem-EX processors (2) [26]
• The SMI interface builds on the Fully Buffered DIMM architecture with a few protocol changes, such as those intended to support DDR3 memory devices.
• It has the same layout as FB-DIMM links (14 outbound and 10 inbound differential lanes as well as a few clock and control lanes).
• It needs altogether about 50 PCB traces.

  28. 3.1 Design space of the basic platform architecture (22)
Design space of the architecture of the MSS: spanned by two dimensions,
• the point of attaching memory: to the MCH or to the processor(s), and
• the layout of the interconnection: parallel channels attach DIMMs, serial links attach S/P converters with parallel channels, or serial links attach FB-DIMMs.
(Figure: the six resulting design options, shown for both MCH-attached and processor-attached memory.)

  29. 3.1 Design space of the basic platform architecture (23)
Max. number of memory channels that can be implemented while using particular design options of the MSS
Moving through the fields of the design space of the architecture of the MSS from left to right and from top to bottom allows an increasing number of memory channels (nM) to be implemented, as discussed in Section 4.2.5 and indicated in the next figure.

  30. 3.1 Design space of the basic platform architecture (24)
(Figure: the design space of the architecture of the MSS, annotated with the increasing number of implementable memory channels.)

  31. 3.1 Design space of the basic platform architecture (25)
The design space of the basic platform architecture (1): the basic platform architecture comprises the architecture of the processor subsystem and the architecture of the memory subsystem (the architecture of the I/O subsystem is not considered).

  32. 3.1 Design space of the basic platform architecture (26)
The design space of the basic platform architectures (2): obtained as the combinations of the options available for the main aspects discussed:
• Architecture of the processor subsystem: the scheme of attaching the processors (in case of SMP systems) or the scheme of interconnecting the processors (in case of NUMA systems).
• Architecture of the memory subsystem (MSS): the point of attaching the MSS and the layout of the interconnection.

  33. 3.1 Design space of the basic platform architecture (27)
The design spaces of the basic architectures of DT, DP and MP platforms will be discussed next:
• Design space of the basic architecture of DT platforms: Section 3.3.1
• Design space of the basic architecture of DP server platforms: Section 3.4.1
• Design space of the basic architecture of MP server platforms: Section 3.5.1

  34. 3.2. The driving force for the evolution of platform architectures

  35. 3.2 The driving force for the evolution of platform architectures (1)
The peak per processor bandwidth demand of a platform
Let's consider a single processor of a platform and the bandwidth available to it (BW). The available (peak) memory bandwidth of a processor (BW) is the product of
• the number of memory channels available per processor (nM),
• their width (w), as well as
• the transfer rate of the memory used (fM):
BW = nM x w x fM
• BW needs to be scaled with the peak performance of the processor.
• The peak performance of the processor increases linearly with the core count (nC).
Consequently, the per processor memory bandwidth (BW) needs to be scaled with the core count (nC). A small numeric sketch follows.
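As a quick illustration of BW = nM x w x fM, the following Python snippet computes the peak bandwidth of a triple-channel configuration (only the formula comes from the slide; the DDR3-1333 figures are assumed example values):

```python
def peak_bandwidth_gb_s(n_m: int, w_bytes: int, f_m_mt_s: float) -> float:
    """Peak per-processor memory bandwidth, BW = nM * w * fM, in GB/s.

    n_m      -- memory channels per processor
    w_bytes  -- channel width in bytes (8 B for the DIMMs discussed here)
    f_m_mt_s -- memory transfer rate in MT/s (e.g. 1333 for DDR3-1333)
    """
    return n_m * w_bytes * f_m_mt_s / 1000.0  # MB/s -> GB/s

# E.g. 3 channels of DDR3-1333, as on a Nehalem-EP style platform:
print(peak_bandwidth_gb_s(3, 8, 1333))  # ~32 GB/s
```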

  36. 3.2 The driving force for the evolution of platform architectures (2)
If we assume a constant width for the memory channels (w = 8 byte), it can be stated that nM x fM needs to be scaled with the number of cores, i.e. it needs to be doubled approximately every two years. This statement summarizes the driving force for raising the bandwidth of the memory subsystem, and at the same time it is the major motivation for the evolution of platform architectures.

  37. 3.2 The driving force for the evolution of platform architectures (3)
The bandwidth wall
• As core counts (nC) have recently been doubling roughly every two years, the per processor bandwidth demand of platforms (BW) also doubles roughly every two years, as discussed before.
• On the other hand, memory speed (fM) doubles only approximately every four years, as indicated e.g. in the next figure for Samsung's memory technology.

  38. 3.2 The driving force for the evolution of platform architectures (4)
Evolution of the memory technology of Samsung [12]
The time span between e.g. DDR-400 and DDR3-1600 is approximately 7 years; this means roughly a doubling of memory speed (fM) every 4 years, as the short check below confirms.
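A quick arithmetic check of this doubling time (a sketch; the 4x speed-up over ~7 years is from the slide, the rest is derived):

```python
import math

speedup = 1600 / 400                      # DDR-400 -> DDR3-1600: 4x
years = 7                                 # time span from the slide
growth_per_year = speedup ** (1 / years)  # ~1.22x per year
doubling_time = math.log(2) / math.log(growth_per_year)
print(doubling_time)                      # ~3.5 years, i.e. roughly every 4 years
```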

  39. 3.2 The driving force for the evolution of platform architectures (5)
• This causes a widening gap between the bandwidth demand and the bandwidth growth achievable through increasing memory speed alone.
• This gap can be designated as the bandwidth wall.
It is the task of the developers of platform architectures to overcome the bandwidth wall by providing the needed number of memory channels.

  40. 3.2 The driving force for the evolution of platform architectures (6)
The square root rule of scaling the number of memory channels
It can be shown that when the core count (nC) increases according to Moore's law and memory subsystems evolve by using the fastest available memory devices, as is typical, then the number of memory channels available per processor needs to be scaled as
nM(nC) = √2 x √nC
to provide a linear scaling of the overall bandwidth with the core count (nC). The scaled number of memory channels available per processor (nM(nC)) and the increased device speed (fM) together then provide the needed linear scaling of the per processor bandwidth (BW) with nC. The above relationship can be termed the square root rule of scaling the number of memory channels; the sketch below checks it numerically.
Remark
For multiprocessors incorporating nP processors, the total number of memory channels of the platform (NM) amounts to NM = nP x nM.
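A numeric check of the square root rule (a sketch with assumed starting values of 2 cores and 2 channels at year 0; the doubling rates are from the preceding slides):

```python
import math

# nC doubles every 2 years, fM doubles every 4 years; per the square root
# rule, nM = sqrt(2) * sqrt(nC) keeps BW = nM * w * fM linear in nC.
for year in range(0, 9, 2):
    n_c = 2 * 2 ** (year / 2)            # core count
    f_m = 2 ** (year / 4)                # memory speed, normalized to 1
    n_m = math.sqrt(2) * math.sqrt(n_c)  # square root rule
    bw_per_core = n_m * f_m / n_c        # constant width w omitted
    print(f"year={year}  nC={n_c:5.1f}  nM={n_m:4.1f}  BW/core={bw_per_core:.2f}")
# BW/core stays constant at 1.00, i.e. total BW scales linearly with nC.
```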

  41. 3.3. DT platforms
3.3.1. Design space of the basic architecture of DT platforms
3.3.2. Evolution of Intel's home user oriented multicore DT platforms
3.3.3. Evolution of Intel's business user oriented multicore DT platforms

  42. 3.3 DT platforms
3.3.1 Design space of the basic architecture of DT platforms

  43. 3.3.1 Design space of the basic architecture of DT platforms (1)
The design space spans the point of attaching the MSS against the layout of the interconnection:
• Attaching memory to the MCH via parallel channels (parallel channels attach DIMMs): Pentium D/EE up to Penryn (up to 4C).
• Attaching memory to the processor via parallel channels: 1st gen. Nehalem up to Sandy Bridge (up to 6C).
• Attaching memory via serial links (serial links attach S/P converters with parallel channels, or FB-DIMMs): not used in DT platforms.
(Figure: the DT design space, with the number of memory channels increasing across the options.)

  44. 3.3.1 Design space of the basic architecture of DT platforms (2)
Evolution of Intel's DT platforms (overview)
• Attaching memory to the MCH via parallel channels (parallel channels attach DIMMs): Pentium D/EE 2x1C (2005/6), Core 2 2C (2006), Core 2 Quad 2x2C (2007), Penryn 2C/2x2C (2008).
• Attaching memory to the processor via parallel channels: 1st gen. Nehalem 4C (2008), Westmere-EP 6C (2010), 2nd gen. Nehalem 4C (2009), Westmere-EP 2C+G (2010), Sandy Bridge 2C/4C+G (2011), Sandy Bridge-E 6C (2011).
• Attaching memory via serial links (S/P converters or FB-DIMMs): not used, as there is no need for higher memory bandwidth through serial memory interconnection in DT platforms.

  45. 3.3.2 Evolution of Intel's home user oriented multicore DT platforms (1)
• Anchor Creek (2005): Pentium D/Pentium EE (2x1C); FSB to the 945/955X/975X MCH; 2/4 DDR2 DIMMs up to 4 ranks, up to DDR2-667; DMI to the ICH7.
• Bridge Creek (2006, Core 2 aimed), Salt Creek (2007, Core 2 Quad aimed) and Boulder Creek (2008, Penryn aimed): Core 2 (2C)/Core 2 Quad (2x2C)/Penryn (2C/2x2C); FSB to the 965/3-/4-Series MCH; up to DDR2-800 and DDR3-1067, respectively; DMI to the ICH8/9/10.
• Tylersburg (2008): 1st gen. Nehalem (4C)/Westmere-EP (6C); memory attached to the processor, up to DDR3-1067; QPI to the X58 IOH; DMI to the ICH10.

  46. 3.3.2 Evolution of Intel's home user oriented multicore DT platforms (2)
• Tylersburg (2008): 1st gen. Nehalem (4C)/Westmere-EP (6C); up to DDR3-1067; QPI to the X58 IOH; DMI to the ICH10.
• Kings Creek (2009): 2nd gen. Nehalem (4C)/Westmere-EP (2C+G); up to DDR3-1333; DMI and FDI to the 5-Series PCH.
• Sugar Bay (2011): Sandy Bridge (4C+G); up to DDR3-1333; DMI2 and FDI to the 6-Series PCH.

  47. 3.3.2 Evolution of Intel's home user oriented multicore DT platforms (3)
• Tylersburg (2008): 1st gen. Nehalem (4C)/Westmere-EP (6C); up to DDR3-1067; QPI to the X58 IOH; DMI to the ICH10.
• Waimea Bay (2011): Sandy Bridge-E (4C/6C); up to DDR3-1600 (DDR3-1600: up to 1 DIMM per channel, DDR3-1333: up to 2 DIMMs per channel); DMI2 to the X79 PCH.

  48. 3.3.3 Evolution of Intel's business user oriented multicore DT platforms (1)
• Lyndon (2005): Pentium D/Pentium EE (2x1C); FSB to the 945/955X/975X MCH; 2/4 DDR2 DIMMs up to 4 ranks, up to DDR2-667; DMI to the ICH7; Gigabit Ethernet LAN connection via the 82573E GbE (Tekoe) controller with ME, attached over LCI.
• Averill Creek (2006, Core 2 aimed), Weybridge (2007, Core 2 Quad aimed) and McCreary (2008, Penryn aimed): Core 2 (2C)/Core 2 Quad (2x2C)/Penryn (2C/2x2C); FSB to the Q965/Q35/Q45 MCH with ME; up to DDR2-800 and DDR3-1067, respectively; DMI to the ICH8/9/10; Gigabit Ethernet LAN connection via the 82566/82567 LAN PHY, attached over LCI/GLCI.
• Piketon (2009): 2nd gen. Nehalem (4C)/Westmere-EP (2C+G); up to DDR3-1333; DMI and FDI to the Q57 PCH with ME; Gigabit Ethernet LAN connection via the 82578 GbE LAN PHY, attached over PCIe 2.0/SMBus 2.0 and C-link.

  49. 3.3.3 Evolution of Intel's business user oriented multicore DT platforms (2)
• Piketon (2009): 2nd gen. Nehalem (4C)/Westmere-EP (2C+G); up to DDR3-1333; DMI and FDI to the Q57 PCH with ME; Gigabit Ethernet LAN connection via the 82578 GbE LAN PHY, attached over PCIe 2.0/SMBus 2.0.
• Sugar Bay (2011): Sandy Bridge (4C+G); up to DDR3-1333; DMI2 and FDI to the Q67 PCH with ME; Gigabit Ethernet LAN connection attached over PCIe 2.0/SMBus 2.0.

  50. 3.4. DP server platforms
3.4.1. Design space of the basic architecture of DP server platforms
3.4.2. Evolution of Intel's low cost oriented multicore DP server platforms
3.4.3. Evolution of Intel's performance oriented multicore DP server platforms
