System Architecture: Big Iron (NUMA) Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang
• SQL Server Execution Plan Cost Model
• True cost structure by system architecture
• Decoding statblob (distribution statistics)
• SQL Clone – statistics-only database
• Tools
  • ExecStats – cross-reference index use by SQL execution plan
  • Performance Monitoring
  • Profiler/Trace aggregation
Scaling SQL on NUMA Topics
• OLTP – Thomas Kejser session: “Designing High Scale OLTP Systems”
• Data Warehouse
• Ongoing Database Development
• Bulk Load – SQL CAT paper + TK session: “The Data Loading Performance Guide”

Other sessions with common coverage:
• Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
• Inside the SQL Server Query Optimizer, Conor Cunningham
• Notes from the field: High Performance Storage, John Langford
• SQL Server Storage – 1000GB Level, Brent Ozar
Symmetric Multi-Processing (SMP)
[Diagram: four CPUs on a shared system bus through the MCH, with PXH and ICH attached]
• In SMP, processors are not dedicated to specific tasks (as in ASMP); there is a single OS image and each processor can access all memory
• SMP makes no reference to the memory architecture
• Not to be confused with Simultaneous Multi-Threading (SMT)
• Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)
Non-Uniform Memory Access (NUMA)
[Diagram: four nodes, each with four CPUs and a memory controller, joined by node controllers over a shared bus or crossbar]
• NUMA architecture – the path to memory is not uniform
• Node: processors, memory, and separate or combined memory + node controllers
• Nodes connected by shared bus, crossbar, or ring
• Traditionally 8-way and larger systems
• Local memory latency ~150ns, remote node memory ~300–400ns; can cause erratic behavior if the OS/code is not NUMA-aware
AMD Opteron
[Diagram: four Opteron sockets with HT2100 and HT1100 I/O bridges]
• Technically, Opteron is NUMA, but remote node memory latency is low; no negative impact or erratic behavior
• For practical purposes it behaves like an SMP system
• Local memory latency ~50ns, 1 hop ~100ns, 2 hops ~150ns(?)
• Actual behavior is more complicated because of snooping (cache coherency traffic)
8-way Opteron System Architecture
[Diagram: eight Opteron sockets (CPU 0–7) connected by Hyper-Transport links]
• The Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links
• In the 8-way layout, the top and bottom-right processors use 2 HT links to connect to other processors and the 3rd HT link for IO
• CPU 1 and CPU 7 require 3 hops to reach each other
Nehalem System Architecture
• Intel Nehalem generation processors have Quick Path Interconnect (QPI)
• Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4
• 8-way glue-less is possible
NUMA Local and Remote Memory
• Local memory is closer than remote
• Physical access time is shorter
• What is the actual access time?
• With the cache coherency requirement!
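One way to answer the “actual access time” question is to measure it. Below is a minimal sketch of a pointer-chasing latency microbenchmark, assuming a Linux system with libnuma installed (build with gcc -O2 chase.c -lnuma); the node numbers, buffer size, and iteration count are illustrative, not part of the original slides.

```c
/* Pointer-chase latency sketch: allocate a buffer on one NUMA node, run the
 * chase from another node, and report ns per dependent load.
 * Sizes and node numbers are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>

#define ELEMS (64 * 1024 * 1024 / sizeof(size_t))  /* 64 MB, larger than any L3 */
#define ITERS (20 * 1000 * 1000)

int main(void)
{
    if (numa_available() < 0) { puts("no NUMA support"); return 1; }

    int mem_node = 1, cpu_node = 0;        /* remote case; set both to 0 for local */
    numa_run_on_node(cpu_node);            /* pin execution to cpu_node */

    size_t *buf = numa_alloc_onnode(ELEMS * sizeof(size_t), mem_node);
    if (!buf) { puts("allocation failed"); return 1; }

    /* Build one random cycle through the buffer (Sattolo's algorithm),
     * so every load depends on the previous one and defeats prefetching. */
    for (size_t i = 0; i < ELEMS; i++) buf[i] = i;
    srand(12345);
    for (size_t i = ELEMS - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;     /* j < i keeps the permutation one cycle */
        size_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (long i = 0; i < ITERS; i++)
        idx = buf[idx];                    /* dependent loads expose raw latency */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (idx=%zu)\n", ns / ITERS, idx);

    numa_free(buf, ELEMS * sizeof(size_t));
    return 0;
}
```

Running once with memory and CPU on the same node, and again with them on different nodes, gives the local-versus-remote latency gap for the machine at hand, including whatever the cache-coherency protocol adds.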
HT Assist – Probe Filter
• Part of the L3 cache is used as a directory cache (image source: ZDNet)
Source Snoop Coherency
From the HP PREMA Architecture whitepaper:
• All reads result in snoops to all other caches, …
• The memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980G7
From the HP PREMA Architecture whitepaper:
• Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, and reduces latency to local memory
(*only cache tags, not cache data)
HP ProLiant DL980 Architecture
• Node controllers reduce effective memory latency
Superdome 2 – Itanium, sx3000
• Agent – remote ownership tag + L4 cache tags
• 64M eDRAM L4 cache data
IBM x3850 X5 (glue-less)
• Connect two 4-socket nodes to make an 8-way system
Fujitsu R900
• 4 IOH
• 14 x8 PCI-E slots, 2 x4, 1 x8 internal
OS Memory Models
[Diagrams: logical processor and memory-line numbering across nodes for a 4-node and an 8-node (64 logical processor) system under each model]
• SUMA: Sufficiently Uniform Memory Access – memory lines are interleaved across nodes
• NUMA: memory is first interleaved within a node; the stripe is then spanned across nodes
Windows OS NUMA Support
[Diagram: memory striping across 4 nodes under SUMA vs. NUMA]
• Memory models
  • SUMA – Sufficiently Uniform Memory Access: memory is striped across NUMA nodes
  • NUMA – separate memory pools by node
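For a NUMA-aware application on Windows, node-local memory can be requested explicitly. The sketch below uses the standard Win32 NUMA allocation APIs (Windows Server 2008+); the preferred node and allocation size are illustrative, and this is not a description of how SQL Server itself allocates memory.

```c
/* Minimal sketch of node-preferred allocation on Windows. */
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes 0..%lu\n", highestNode);

    /* Ask for 256 MB of memory preferentially placed on node 0 */
    SIZE_T size = 256 * 1024 * 1024;
    void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                 0 /* preferred node */);
    if (p == NULL) {
        printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }

    /* Physical pages are assigned on first touch; touch them here */
    memset(p, 0, size);
    printf("Allocated %lu MB with node 0 preference\n", (unsigned long)(size >> 20));

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```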
Memory Model Example: 4 Nodes
• SUMA memory model
  • Memory accesses are uniformly distributed
  • 25% of memory accesses local, 75% remote
• NUMA memory model
  • Goal is better than 25% local node access
  • True local access time also needs to be faster
  • Cache coherency may increase local access time
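A back-of-envelope version of this comparison is sketched below, using the illustrative latencies from the earlier slide (~150ns local, ~300ns remote) and an assumed 90% local-hit target for the NUMA-aware case; none of these exact numbers are claimed by the slides.

```c
/* Expected average memory latency as a function of the local-access fraction. */
#include <stdio.h>

static double effective_ns(double local_fraction, double local_ns, double remote_ns)
{
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns;
}

int main(void)
{
    double local_ns = 150.0, remote_ns = 300.0;   /* illustrative figures */

    /* SUMA on 4 nodes: accesses uniformly distributed, 1/4 land on the local node */
    printf("SUMA (25%% local): %.0f ns\n", effective_ns(0.25, local_ns, remote_ns));

    /* NUMA-aware placement: the goal is a much higher local fraction */
    printf("NUMA (90%% local): %.0f ns\n", effective_ns(0.90, local_ns, remote_ns));
    return 0;
}
```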
Architecting for NUMA: End-to-End Affinity
[Diagram: App Server → TCP Port → CPU (NUMA node) → Memory → Table, with each geographic group mapped to its own node]
• Web tier determines the port for each user by group (but should not be by geography!)
• Affinitize each port to a NUMA node
• Each node accesses localized data (partition?)
• OS may allocate a substantial chunk from Node 0?

Group → TCP port:
North East 1440, Mid Atlantic 1441, South East 1442, Central 1443, Texas 1444, Mountain 1445, California 1446, Pacific NW 1447
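As a rough illustration of the "affinitize work to a node" idea, and not SQL Server's actual mechanism (SQL Server exposes port and CPU affinity through its own configuration), a NUMA-aware server process can pin one worker per node and route each port's connections to that worker. The sketch below uses the Windows group-affinity APIs; the per-node work is left as a stub.

```c
/* Sketch: one worker thread pinned to each NUMA node (Windows Server 2008 R2+). */
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI worker(LPVOID arg)
{
    USHORT node = (USHORT)(ULONG_PTR)arg;
    GROUP_AFFINITY ga;
    if (GetNumaNodeProcessorMaskEx(node, &ga))
        SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL);
    printf("worker bound to node %u\n", (unsigned)node);
    /* ... service connections/work items associated with this node's port ... */
    return 0;
}

int main(void)
{
    ULONG highest = 0;
    GetNumaHighestNodeNumber(&highest);

    HANDLE threads[64];
    ULONG n = (highest + 1 > 64) ? 64 : highest + 1;   /* wait API limit */
    for (ULONG i = 0; i < n; i++)
        threads[i] = CreateThread(NULL, 0, worker, (LPVOID)(ULONG_PTR)i, 0, NULL);

    WaitForMultipleObjects((DWORD)n, threads, TRUE, INFINITE);
    for (ULONG i = 0; i < n; i++) CloseHandle(threads[i]);
    return 0;
}
```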
HP-UX LORA
• HP-UX – not Microsoft Windows
• Locality-Optimized Resource Alignment
• 12.5% interleaved memory
• 87.5% NUMA node local memory
System Tech Specs
Processors          Cores  DIMM  PCI-E G2       Total cores  Max memory  Base
2 x Xeon X56x0      6      18    5 x8+, 1 x4    12           192G*       $7K
4 x Opteron 6100    12     32    5 x8, 1 x4     48           512G        $14K
4 x Xeon X7560      8      64    4 x8, 6 x4†    32           1TB         $30K
8 x Xeon X7560      8      128   9 x8, 5 x4‡    64           2TB         $100K

Memory pricing: 8GB ~$400 each – 18 x 8G = 144GB for $7,200; 64 x 8G = 512GB for ~$26K
16GB ~$1,100 each – 12 x 16G = 192GB for ~$13K; 64 x 16G = 1TB for ~$70K
* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB
† Dell R910 and HP DL580G7 have different PCI-E
‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots
Operating System
• Windows Server 2003 RTM, SP1
  • Network limitations (default)
  • Scalable Networking Pack (KB912222)
• Windows Server 2008
• Windows Server 2008 R2 (64-bit only)
  • Breaks the 64 logical processor limit
  • NUMA IO enhancements? Impacts OLTP (search: MSI-X)
• Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server
• Don’t try to do DW on SQL Server 2000
SQL Server Version
• SQL Server 2000
  • Serious disk IO limitations (1GB/sec?)
  • Problematic parallel execution plans
• SQL Server 2005 (fixed most SQL Server 2000 problems)
  • 64-bit on x64 (Opteron and Xeon)
  • SP2 – performance improvement 10%(?)
• SQL Server 2008 & R2
  • Compression, filtered indexes, etc.
  • Star join, parallel query to partitioned table
Configuration
• SQL Server startup parameter: -E
• Trace flags 834, 836, 2301
• Auto date correlation
  • Order date < A, ship date > A
  • Implied: order date > A - C, ship date < A + C
• Port affinity – mostly OLTP
• Dedicated processor?
  • for log writer?