The x86 Server Platform … Resistance is futile. Dec 6, 2004
Server Shipments – Total vs x86
Market Share: Servers, United States, 2Q04
Source: Michael McLaughlin, "Market Share: Servers, United States, 2Q04," 7 October 2004, Gartner
x86 Platform CPUs
• Intel
  • Xeon MP – Gallatin (future is Potomac)
  • Xeon SP/DP – EM64T – Nocona
  • Itanium II MP – Madison (future is Montecito)
• AMD
  • Opteron
Gallatin – MP
• 130 nm
• 3 GHz
• 4 MB L3 cache
• FSB – 400 MHz
Nocona – Single Processor with EM64T
• 90 nm
• Clock speed – 3.2–3.6 GHz
• L3 – 4 MB
• FSB – 800 MHz
Itanium II – Madison
• 130 nm
• 9 MB L3 cache
• 1.6 GHz
• FSB – 400 MHz
STOP
• Why multi-core?
• … and while we're at it, why multi-threading?
• It's all about balancing
  • Silicon real estate
  • Compiler technology
  • Cost
  • Power
… to meet the constant pressure to double performance every 18 months
Memory Latency vs CPU Speed
[Chart: microprocessor on-chip clock (operating frequency, GHz) vs. commodity DRAM access frequency ((10⁻⁹ s)⁻¹), both on log scales, over production years 1990–2010]
Processor Architecture
• As latency falls toward 0 and bandwidth grows toward ∞, we approach the perfect CPU
• A great deal of innovation has centered on approximating this perfect world:
  • CISC
  • CPU cache
  • RISC
  • EPIC
  • Multi-threading
  • Multiple cores
Complex Instruction Set Computer (CISC)
• Hardware implements assembler instructions directly
  • MULT A, B
  • Hardware loads registers, multiplies, and stores the result
• Multiple clocks are needed per instruction
• RAM requirements are relatively small
• Compilers translate high-level languages down to assembler instructions – Von Neumann hardware
http://www.hardwarecentral.com/hardwarecentral/tutorials/2427
CPU Cache
• When CPU speeds started to increase, memory latency emerged as a bottleneck
• CPU caches were used to keep local references "close" to the CPU
• For SMP systems, memory banks were more than a clock away
• It is not uncommon today to find three orders of magnitude between the fastest and slowest memory latency (see the sketch below)
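A minimal C sketch (not from the slides; the array size is illustrative) of how much cache locality matters: both loops below do identical arithmetic, but the row-major walk uses every byte of each fetched cache line, while the column-major walk misses on nearly every access.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096  /* 4096 x 4096 doubles = 128 MB, far larger than any cache */

int main(void) {
    double *m = malloc((size_t)N * N * sizeof *m);
    if (!m) return 1;
    for (size_t i = 0; i < (size_t)N * N; i++)
        m[i] = 1.0;

    /* Row-major walk: consecutive addresses, so each fetched cache line
       is fully consumed before it is evicted. */
    clock_t t0 = clock();
    double sum = 0.0;
    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            sum += m[r * N + c];
    double t_rows = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Column-major walk: stride of N doubles (32 KB), so nearly every
       access touches a different cache line and misses. */
    t0 = clock();
    double sum2 = 0.0;
    for (size_t c = 0; c < N; c++)
        for (size_t r = 0; r < N; r++)
            sum2 += m[r * N + c];
    double t_cols = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("row-major: %.2f s   column-major: %.2f s   (sums: %.0f, %.0f)\n",
           t_rows, t_cols, sum, sum2);
    free(m);
    return 0;
}
```

On a typical machine the column-major pass runs several times slower, even though both loops retire the same number of instructions: the difference is purely memory latency.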
Reduced Instruction Set Computer (RISC)
• Hardware is simplified – fewer transistors are needed for the full instruction set
• RAM requirements are higher, to store intermediate results and more code
• Compilers are more complex
• Clock speeds increase because instructions are simpler
• Deterministic, simple instructions allow pipelining (contrasted with CISC in the sketch below)
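To make the CISC/RISC contrast concrete, here is an illustrative C function restating the slides' MULT A, B example; the instruction sequences in the comments are schematic, not any real ISA.

```c
/* Schematic CISC vs RISC: both compute C = A * B. */
int multiply(int a, int b) {
    /* CISC (one complex, multi-cycle instruction):
     *     MULT A, B          ; fetch operands, multiply, store result
     *
     * RISC (several simple instructions):
     *     LOAD  r1, A
     *     LOAD  r2, B
     *     MUL   r3, r1, r2
     *     STORE r3, C
     * More code bytes in RAM, but each instruction is uniform and
     * predictable, which is exactly what makes pipelining possible. */
    return a * b;
}
```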
Pipelining
[Diagram: pipeline stages filling over successive cycles, hardware utilization climbing from 25% busy through 40%, 60%, and 80% to 100% busy – enabling higher clock speeds]
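As a back-of-the-envelope idealization (a standard textbook formula, not from the slides): a k-stage pipeline spends k cycles producing its first result and then completes one instruction per cycle, so over n instructions

$$\text{speedup} = \frac{n \cdot k}{k + (n - 1)} \;\longrightarrow\; k \quad \text{as } n \to \infty,$$

i.e. once the pipeline is full (100% busy in the diagram), throughput approaches k times that of the unpipelined design.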
Branch Prediction
• While processing instructions in parallel, branches occur
• Branch prediction is used to increase the probability that the correct path is fetched into the pipeline
• If the prediction is incorrect, the pipeline contents are "dead" and the CPU stalls
• Statistics:
  • 10%–20% of instructions are branches
  • Predictions are incorrect about 10% of the time
• As the pipeline deepens, the probability of a miss somewhere in flight increases, and more cycles are discarded
  • 80-deep pipeline / 20% branches / 10% miss rate => ~80% chance of a miss in flight, with a penalty of up to 80 cycles
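Sanity-checking the slide's numbers: with 20% branches and a 10% miss rate, each instruction carries a 0.2 × 0.1 = 2% misprediction risk, so across 80 in-flight instructions

$$1 - (1 - 0.02)^{80} \;=\; 1 - 0.98^{80} \;\approx\; 1 - 0.20 \;=\; 0.80,$$

i.e. roughly an 80% chance that at least one instruction in the pipeline was fetched down the wrong path.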
Itanium II EPIC Instruction Set – Explicitly Parallel Instruction Computing
• The compiler can indicate code that can be executed in parallel
• Both sides of a branch are pipelined
• No cycles are lost to misprediction
• The pipeline can be deeper
• Complexity continues to move into the compiler
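A loose analogy in plain C (a sketch only; EPIC's predicate registers are far more general than this): a conditional expression lets the compiler compute both arms and select a result, rather than redirecting the instruction stream.

```c
/* Branch-free selection: many compilers turn this into a conditional
 * move, so both "paths" (a and b) are evaluated and one is selected.
 * With no branch there is nothing to mispredict and no pipeline flush.
 * EPIC generalizes the idea: predicate registers guard whole groups of
 * instructions, and the compiler schedules both sides explicitly. */
int select_max(int a, int b) {
    return (a > b) ? a : b;
}
```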
Multiple Cores
• Fabrication sizes continue to shrink
• The additional real estate has been used to put more and more memory on the die
• Multi-core technology provides a new way to exploit the additional space
• Clock rates cannot continue to climb, due to excessive heat:
  • P = C · V² · f
  • C – switch capacitance, V – supply voltage, f – clock frequency
• Multiple cores are the next step to providing faster execution times for applications
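A worked illustration of why the power equation favors cores over clocks (assuming, as is roughly true in practice, that supply voltage must scale with clock frequency, so single-core power grows like f³): run two cores at half the frequency and half the voltage,

$$P_{\text{dual}} \;=\; 2 \cdot C \left(\frac{V}{2}\right)^{2} \frac{f}{2} \;=\; \frac{1}{4}\, C V^{2} f,$$

a quarter of the single-core power budget while, for work that parallelizes, keeping the same aggregate instruction throughput.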
AMD Opteron 800 Series
• 130 nm
• Clock speed – 1.4–2.4 GHz
• L2 – 1 MB
• 6.4 GB/s HyperTransport
Architectural Comparison
[Diagram: a four-way Opteron system, with CPUs linked point-to-point by 6.4 GB/s HyperTransport™ and 144-bit DDR memory attached per processor, vs. a four-way Xeon system, where the CPUs share a front-side bus into an SNC with Memory Address Buffers, PCI-X bridges, and I/O hubs]
Mapping Workloads onto Architecture
• Consider a dichotomy of workloads:
• Large memory model – needs a large, single system image and a large amount of coherent memory
  • Database apps – SQL Server / Oracle
  • Business intelligence – data warehousing + analytics
  • Memory-resident databases
  • 64-bit architectures allow memory addressability above 1 TB
• Small/medium memory model – cost-effective for workloads that do not require extensive shared memory/state
  • Stateless applications and web services
  • Web servers
  • Clusters of systems for parallelized applications and grids
Large Server Vendors
• Intel announcement (Nov 19): Otellini said product development, marketing, and software efforts (for Itanium) will all now be aimed at "greater than four-way systems". He also said, "The mainframe isn't dead. That's where I'd like to push Itanium over time."
• The size of the SMP is affected by Intel's chipset support for coherent memory
• OEM vendors (Unisys, HP, SGI, Fujitsu, IBM)
  • Each has a unique "chip set" to build basic four-ways into large SMP systems
  • IBM has POWER5, which is a direct competitor
• Intel 32-bit and EM64T
  • This could emerge as the flagship product
Where Are We Going?
• Since the early CISC computers, we have moved more and more of the complexity out to the compiler to achieve parallelism and fully exploit the silicon "real estate"
• Power requirements, along with smaller fabrication sizes, have pushed CPU vendors to exploit multiple cores
• The key to performance on these future machines will be the application's ability to exploit parallelism