Multicore: Commercial Processors
Some Examples • Desktop and Server/Enterprise Space • Intel • AMD • SUN Microsystems • The Embedded Space: Freescale Semiconductor
Focus • The Chip Level Architecture • What do we have on chip? • The Core Architecture • Note the presence/absence/configuration of concepts studied earlier in class • Rationalize the design decisions that led to the preceding • What can/should we expect next? • Building systems using multicore chips
Intel Core Duo • Homogeneous cores • Bus-based on-chip interconnect • Shared memory • Traditional I/O • Cores: classic OOO (reservation stations, issue ports, schedulers, etc.) • L2 cache: large, shared, set associative, with prefetch, etc. Source: Intel Corp.
Intel Core Duo: Vital Stats • 151 million transistors; shared 2 MB L2 cache • Each core has a 12-stage pipeline (Yonah) • Low-power (less than 25 watts) dual-core microprocessor • Supports Intel’s Vanderpool virtualization technology • EM64T (Intel x86-64 extensions) is not supported • Desktop market: not a severe limitation, given the lack of 64-bit OS and software • The Sossaman processor for servers, which is based on Yonah, also lacks EM64T support, which is a severe disadvantage in that market • Communication between the L2 cache and both execution cores is handled by an arbitration bus unit • Eliminates cache coherency traffic over the FSB • Raises the core-to-L2 latency • The increase in clock frequency offsets the impact • Core processors communicate with the system chipset over a 667 MT/s front side bus (FSB), up from the 533 MT/s used by the fastest Pentium M • Intel Core Solo uses the same two-core die as the Core Duo, but with only one active core • Dies on which one core fails quality control can still be sold • Core 2 Duo processors will also include the ability to disable one core to conserve power
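A back-of-the-envelope sketch of what the FSB step from 533 MT/s to 667 MT/s means for peak bandwidth. The slide does not state the bus width; the 8-byte (64-bit) data path below is an assumption, and the function name is illustrative only.

```python
def fsb_peak_bandwidth_gbs(mega_transfers_per_sec, bus_width_bytes=8):
    """Peak bus bandwidth in GB/s (10^9 bytes/s).

    bus_width_bytes=8 is an assumed 64-bit data path, not a figure
    taken from the slide.
    """
    return mega_transfers_per_sec * 1e6 * bus_width_bytes / 1e9


core_duo = fsb_peak_bandwidth_gbs(667)
pentium_m = fsb_peak_bandwidth_gbs(533)

print(f"Core Duo FSB peak:  {core_duo:.2f} GB/s")
print(f"Pentium M FSB peak: {pentium_m:.2f} GB/s")
print(f"Ratio:              {core_duo / pentium_m:.2f}x")
```

Under that assumption the peak rises from roughly 4.3 GB/s to roughly 5.3 GB/s. Both cores share this one bus for all main-memory traffic.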
The Core™ micro-architecture Source: Ars Technica
The Core Execution core Source: Ars Technica
Intel Core Duo • High memory latency due to the lack of an on-die memory controller (further aggravated by the system chipset's use of DDR-II RAM) • Main-memory transactions have to pass through the Northbridge of the chipset • Higher latency than AMD's Turion platform • Weakness shared by the entire line of Pentium processors • The L2 cache is quite effective at hiding main-memory latency • Execution units • Three 64-bit integer exec units • one CIU (complex) + two SIU (simple) • Two FPUs • Poor Floating Point Unit (FPU) throughput • Little improvement in "performance per watt" in single-threaded applications compared to its predecessor
Core 2 Duo and Core Duo Source: Intel Corp. • Very similar architectures • Bump in the processor speed • Increase in Level 2 cache (2MB to 4MB) • Both chips are built on a 65-nm process and support a 667 MHz front-side bus (FSB) • 14-stage pipeline
Intel Core 2 Duo Source: Hard Core Hardware
Wide Dynamic Execution Source: Bit Tech
Wide Dynamic Execution Source: Bit Tech
Wide Dynamic Execution • Pipeline width of 4 instructions per core (Pentium M/Pentium 4 NetBurst: 3) • Delivery of more instructions per clock cycle • Pipeline depth of 14 stages vs. 31 in the Pentium 4 Prescott • Compromise between efficient execution of short instructions and long instructions • Ops fusion • Less work for the processor pipeline to run • Micro-ops fusion • keeps micro-ops decoded from the same x86 instruction together as a single fused micro-op • Macro-ops fusion • works on the x86 instructions themselves, not just their micro-op derivatives • Instruction loads and micro-ops can be reduced by approximately 15% and 10%, respectively
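A toy sketch of macro-ops fusion. It assumes the fused pattern is a compare or test immediately followed by a conditional jump, and it works on mnemonic strings rather than real encodings. The helper names and the instruction sequence are made up for illustration; this is not a model of Intel's decoder.

```python
FUSIBLE_FIRST = {"cmp", "test"}  # assumed set of fusible leading instructions


def is_conditional_jump(instr):
    """Treat any 'j..' mnemonic other than the unconditional 'jmp' as conditional."""
    mnemonic = instr.split()[0]
    return mnemonic.startswith("j") and mnemonic != "jmp"


def macro_fuse(instructions):
    """Merge each cmp/test with a directly following conditional jump."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (current.split()[0] in FUSIBLE_FIRST
                and following is not None
                and is_conditional_jump(following)):
            fused.append(f"{current} + {following}")  # one op instead of two
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


program = [
    "mov eax, [mem]",
    "cmp eax, ebx",
    "jne target",
    "add ecx, 1",
    "test ecx, ecx",
    "jz done",
    "jmp loop",
]
fused_program = macro_fuse(program)
print(len(program), "instructions ->", len(fused_program), "ops after fusion")
```

In this hand-picked sequence two compare-and-branch pairs collapse, so seven instructions occupy five pipeline slots. Real code has far fewer fusible pairs, which is consistent with the modest percentages quoted on the slide.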
Intelligent Power Capability Source: Bit Tech
Intelligent Power Capability • SpeedStep technology • Dynamic clock speed reduction • Intel mobile processors include this already • Enhanced SpeedStep used in Core 2 Duo • Controller that turns on sections of the processor as needed. One core can be shut down for single-threaded applications • Power consumption decreased by enhancements to Intel's 65nm process node • use low-k dielectrics and strained silicon • use low-leakage and "sleep" transistors
Advanced Smart Cache Source: Bit Tech
Advanced Smart Cache Source: Bit Tech • Both cores share data stored in the L2 cache via an arbitration bus unit embedded in the cache • Cache space is allocated dynamically between the two cores, minimising bus traffic by allowing both cores to access one copy of the data • Does a larger L2 cache matter? • Studies indicate that going from 2MB to 4MB improves execution time only slightly for most applications (2-4%)
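A toy model of why dynamic allocation in a shared cache can help when the two cores have unequal working sets. It assumes a fully associative LRU cache counted in lines, with hypothetical working sets of 6 and 2 lines. None of these parameters come from the slides, and real L2 caches are set associative.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[addr] = True
        return False


def count_hits(trace, cache_for_core):
    """trace: list of (core, line); cache_for_core: dict mapping core -> cache."""
    return sum(cache_for_core[core].access((core, line)) for core, line in trace)


def make_trace(rounds=10):
    trace = []
    for _ in range(rounds):
        for k in range(6):
            trace.append((0, k))      # core 0 cycles over 6 lines
            trace.append((1, k % 2))  # core 1 alternates between 2 lines
    return trace


trace = make_trace()
shared = LRUCache(8)
hits_shared = count_hits(trace, {0: shared, 1: shared})
hits_split = count_hits(trace, {0: LRUCache(4), 1: LRUCache(4)})
print(f"shared 8-line cache: {hits_shared}/{len(trace)} hits")
print(f"two 4-line caches:   {hits_split}/{len(trace)} hits")
```

With the same total capacity, the shared cache holds both working sets, while the split configuration leaves core 1's cache half empty and core 0 thrashing. The trace is constructed to make the effect visible; it says nothing about how often real workloads behave this way.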
Smart Memory Access Source: Bit Tech
Smart Memory Access • Improved prefetch units • Memory disambiguation • Allows re-ordering instructions more efficiently (Figures: memory aliasing; execution with and without memory disambiguation. Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8) Source: Ars Technica
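A minimal sketch of the aliasing check behind memory disambiguation: a load may be moved ahead of earlier stores only if none of them writes the address it reads. The sketch assumes every address is already known, whereas real hardware has to predict this before store addresses are computed and recover when the prediction is wrong. The operation list and function name are hypothetical.

```python
def can_hoist_load(ops, load_index):
    """True if the load at load_index aliases no earlier store in ops.

    ops is a list of (kind, address) tuples in program order,
    where kind is "load" or "store".
    """
    kind, addr = ops[load_index]
    if kind != "load":
        raise ValueError("ops[load_index] is not a load")
    return all(
        not (earlier_kind == "store" and earlier_addr == addr)
        for earlier_kind, earlier_addr in ops[:load_index]
    )


ops = [
    ("store", 0x100),
    ("store", 0x200),
    ("load", 0x300),  # touches no stored address: safe to run early
    ("load", 0x100),  # aliases the first store: must wait for it
]
for index, (kind, addr) in enumerate(ops):
    if kind == "load":
        print(f"load {addr:#x}: hoistable = {can_hoist_load(ops, index)}")
```

Without disambiguation, both loads would wait behind both stores even though only the second one depends on them.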
Advanced Digital Media Boost Source: Bit Tech
Advanced Digital Media Boost • Streaming SIMD Extension (SSE) instructions • SSE instructions are an extension of the standard x86 instruction set • Used in multimedia encoding, decoding, image manipulation and encryption • 128-bit SSE instructions execute in a single pass through 128-bit-wide execution units • Up from 64-bit-wide units in the previous generation • Double the SSE performance of the previous generation
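A small arithmetic sketch of where the "double the SSE performance" figure comes from, assuming the 128-bit and 64-bit numbers on the slide refer to the width of the execution datapath. The function names are illustrative.

```python
def passes_per_sse_op(datapath_bits, sse_width_bits=128):
    """Number of passes through the datapath needed for one SSE operation."""
    return -(-sse_width_bits // datapath_bits)  # ceiling division


def lanes_per_pass(datapath_bits, element_bits):
    """How many elements of the given size are processed in one pass."""
    return datapath_bits // element_bits


for width in (64, 128):
    print(
        f"{width}-bit datapath: {passes_per_sse_op(width)} pass(es) per op, "
        f"{lanes_per_pass(width, 32)} single-precision lanes per pass"
    )
```

Halving the passes per operation doubles peak throughput. Code limited by memory bandwidth or dependencies would see less.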
Comparison of SSE to prior processors Source: Ars Technica
Intel Conroe vs. Presler • What is the major difference? • Shared L2 (Conroe) versus separate caches (Presler) Source: Bit Tech
Intel’s Roadmap for Multicore (Figure: roadmap for mobile, desktop, and enterprise processors, 2006 to 2008, moving from single-core and dual-core parts with 512KB to 4MB of cache toward quad-core and eight-core 45nm parts with up to 16MB of shared cache) Source: Adapted from Tom’s Hardware • Drivers are • Market segments • More cache • More cores • 80 core processor prototype has been designed!
Intel Chipset Example Source: Extreme Tech
References and Links • http://www.intel.com/products/processor/coreduo/ • http://en.wikipedia.org/wiki/Intel_Core • http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1 • http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/ • http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop • http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/ • http://www.hardcoreware.net/reviews/review-347-1.htm • http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1 • http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1 • http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1 • http://arstechnica.com/articles/paedia/cpu/core.ars/1 • http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4 • http://www.extremetech.com/article2/0,1697,1988794,00.asp • http://www.coreduoinfo.com/blog/about-intel-core-duo/ • http://67.91.114.164/intel_c2d_info.htm • http://www.pcper.com/article.php?aid=272&type=expert
Dual Core AMD Opteron Source: AMD
AMD Multicore (Dualcore) Opteron (Diagram: Core 0 and Core 1, each with a 1-MB L2, connected to a shared Northbridge) • Two AMD Opteron CPU cores on a single die • Each has 1MB L2 cache • 90nm, ~205 million transistors • Approximately same die size as 130nm single-core AMD Opteron processor • 95 watt power envelope • fits into 90nm power infrastructure • Introduced with “K8” Revision E core in April 2005 Source: AMD
Opteron Core Pipeline Source: Chip Architect
AMD Opteron Processor Core Architecture (Diagram: 64KB L1 Icache; fetch and branch prediction; scan/align/decode with fastpath and microcode engine; 72-entry instruction control unit; integer and FP decode & rename; three reservation stations feeding the ALUs, AGUs, and multiplier; 36-entry FP scheduler feeding FADD, FMUL, and FMISC; 44-entry load/store queue; 64KB L1 Dcache) Source: The 3D shop
Dual Core AMD Opteron • AMD64 technology • Runs 32-bit applications and is 64-bit capable • Compatible with the x86 software infrastructure • Enables a single architecture across 32- and 64-bit environments • Direct Connect Architecture • NUMA system • Each processor shares its memory with other processors in the system • Integrated Memory Controller on-die • DDR2 DRAM memory controller offers memory BW up to 10.7 GB/s per processor • HyperTransport • Point-to-point interconnect can be used to build a mesh of multiple-processor Opteron systems • Scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets • 24.0 GB/s peak bandwidth per processor
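A sketch of how the two peak-bandwidth figures on this slide can be reproduced. The breakdown is an assumption, not something the slide states: dual-channel DDR2-667 with 8-byte channels for the memory controller, and three HyperTransport links at 8 GB/s each for the interconnect total.

```python
def ddr_peak_gbs(mega_transfers_per_sec, channels=2, channel_bytes=8):
    """Peak DRAM bandwidth in GB/s; channel count and width are assumptions."""
    return mega_transfers_per_sec * 1e6 * channels * channel_bytes / 1e9


def hypertransport_peak_gbs(links=3, per_link_gbs=8.0):
    """Aggregate HyperTransport bandwidth; the link count is an assumption."""
    return links * per_link_gbs


memory_bw = ddr_peak_gbs(667)
ht_bw = hypertransport_peak_gbs()
print(f"memory controller: {memory_bw:.1f} GB/s")
print(f"HyperTransport:    {ht_bw:.1f} GB/s")
```

Both values are per processor, so in a multi-socket system the aggregate memory bandwidth grows with the number of sockets rather than being capped by one shared bus.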
Dual Core AMD Opteron • Not a simple aggregation of K8 cores • Integrated the cores for efficiency • Dual-core Opteron acts very much like an SMP system • Compatible with existing single-threaded and multi-threaded (hyperthreaded) software • MOESI coherency protocol (O – “Owned”) • Updates through system request interface • SSE3 support with 10 new instructions • Quad-core upgradeability • Hardware-assisted AMD Virtualization • Optimized Power Management
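A textbook-style sketch of MOESI state transitions for a single cache line, seen from one cache. It is a simplification for study purposes and not a description of the Opteron's actual implementation. The event names and the `others_have_copy` flag are made up for this example.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def next_state(state, event, others_have_copy=False):
    """Return the line's next state in this cache after the given event."""
    if event == "local_read":
        if state is State.I:
            # A miss fills the line: Shared if another cache holds it, else Exclusive.
            return State.S if others_have_copy else State.E
        return state
    if event == "local_write":
        # Writing requires ownership; other copies are invalidated.
        return State.M
    if event == "snoop_read":
        if state is State.M:
            # Supply the dirty data and keep responsibility for it: Owned.
            return State.O
        if state is State.E:
            return State.S
        return state
    if event == "snoop_write":
        # Another cache takes ownership, so this copy becomes stale.
        return State.I
    raise ValueError(f"unknown event: {event}")


state = State.I
for event in ["local_read", "local_write", "snoop_read", "snoop_write"]:
    state = next_state(state, event)
    print(f"{event:12s} -> {state.value}")
```

The Owned state is what distinguishes MOESI from MESI: a modified line can be shared with another cache without first being written back to memory.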
Dual Core AMD Opteron Source: Elec Design
AMD Opteron (SOI) Source: Chip Architect
AMD 64 bit Core • 1MB L2 Cache • Detailed discussion of the 64-bit core architecture at: • http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
Multiprocessor Systems using AMD Opteron (Diagrams: a legacy system in which the CPUs share a bus to a Memory Controller Hub and an I/O Hub with PCI-E bridges, USB, and PCI; and an AMD64 system in which each CPU connects through an SRQ and crossbar to its own memory controller and to HyperTransport links, labeled 8 GB/s, leading to PCI-E bridges and I/O hubs) • Legacy x86 Architecture • CPUs, Memory, I/O all share a bus • Major bottleneck to performance • Faster CPUs or more cores for performance • Symmetric Multiprocessing • AMD64 Direct Connect Architecture • Eliminates FSB bottleneck • HyperTransport™ Technology interconnect for high bandwidth and low latency • Each CPU has its own memory • Each CPU can access the main memory of another processor, transparent to the programmer • Different from SMP Source: AMD
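A sketch of why access cost in a Direct Connect system depends on where the data lives: a request for memory attached to another processor has to cross one or more HyperTransport links. The four-socket square topology below is a hypothetical example, not one taken from the slides, and hop count stands in for latency.

```python
from collections import deque


def hops(topology, src, dst):
    """Minimum number of links between two processors (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == dst:
            return distance
        for neighbour in topology[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    raise ValueError("destination is unreachable")


# Hypothetical 4-socket system wired as a square: 0-1, 0-2, 1-3, 2-3.
SQUARE = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

for memory_node in sorted(SQUARE):
    print(f"CPU 0 -> memory on node {memory_node}: "
          f"{hops(SQUARE, 0, memory_node)} hop(s)")
```

Local memory needs no hops, neighbours need one, and the opposite corner needs two. This non-uniformity is the NUMA property, and it is why software that keeps data near the thread using it tends to benefit.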
Multiprocessor Systems using AMD Opteron Source: XBitlabs
Cache coherency Source: Chip Architect
AMD Athlon 64 X2 Source: AMD
References and Links • http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1 • http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html • http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml • http://en.wikipedia.org/wiki/Opteron • http://en.wikipedia.org/wiki/Athlon_64_X2 • http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html • http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html • http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp • http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html • http://www.extremetech.com/article2/0,1697,1675784,00.asp • http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991 • http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm • http://www.nextcomputing.com/advantages/thruadv.shtml • http://arstechnica.com/news.ars/post/20060817-7535.html • http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html
SUN – UltraSPARC T1 • Eight cores, each 4-way threaded • 1.2 GHz • Cache • 16K 4-way 32B L1-I • 8K 4-way 16B L1-D • 3MB on-chip L2 cache partitioned into four banks, with four memory controllers • Data moves between the L2 and the cores over an integrated crossbar switch to provide high throughput Source: Sun
SUN – UltraSPARC T1 Source: Sun
SUN – UltraSPARC T1 Pipeline • T1's integer pipeline • Fetch, Thread Selection, Decode, Execute, Memory Access, Writeback Source: Sun
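A simplified sketch of the Thread Selection stage: each cycle the core picks one of its four hardware threads that is ready to issue and skips threads that are stalled. Plain round-robin is used here as a stand-in for the real selection policy, and the readiness pattern is invented for the example.

```python
def select_threads(ready_by_cycle, num_threads=4):
    """Pick one ready thread per cycle in round-robin order.

    ready_by_cycle: list of sets of ready thread ids, one set per cycle.
    Returns the thread issued in each cycle, or None if no thread was ready.
    """
    issued = []
    last = num_threads - 1  # start so that thread 0 is considered first
    for ready in ready_by_cycle:
        choice = None
        for step in range(1, num_threads + 1):
            candidate = (last + step) % num_threads
            if candidate in ready:
                choice = candidate
                break
        issued.append(choice)
        if choice is not None:
            last = choice
    return issued


ready_by_cycle = [
    {0, 1, 2, 3},
    {0, 1, 2, 3},
    {0, 3},        # threads 1 and 2 stalled, e.g. waiting on memory
    {0, 1, 2, 3},
    set(),         # every thread stalled: the pipeline idles this cycle
    {1},
]
schedule = select_threads(ready_by_cycle)
print(schedule)
```

As long as at least one thread is ready, the pipeline keeps issuing, which is how a simple in-order core hides memory latency across threads.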
SUN UltraSPARC T2 – Niagara 2 Source: Sun
SUN UltraSPARC T2 • UltraSPARC T2 has 8 SPARC cores with 8 threads per core • 8-stage integer pipeline (as opposed to 6 for T1) • Twice the performance of T1 on a transactional workload (under the same power envelope) • Clock frequency increased to 1.4 GHz from 1.2 GHz • One PCI Express port (x8 1.0) • Two 10 Gigabit Ethernet ports with packet classification and filtering • L2 cache size increased to 4 MB shared (8 banks, 16-way associative) • 1 floating point unit per core • Eight encryption engines • Four dual-channel FBDIMM memory controllers • 711 signal I/O pins, 1831 total
UltraSparc T2 Core Microarchitecture Source: Realworld Tech