330 likes | 588 Views
System Architecture: Big Iron (NUMA). Joe Chang jchang6@yahoo.com www.qdpma.com. About Joe Chang. SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools
E N D
System Architecture: Big Iron (NUMA) Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang • SQL Server Execution Plan Cost Model • True cost structure by system architecture • Decoding statblob (distribution statistics) • SQL Clone – statistics-only database • Tools • ExecStats – cross-reference index use by SQL-execution plan • Performance Monitoring, • Profiler/Trace aggregation
Scaling SQL on NUMA Topics • OLTP – Thomas Kejser session • “Designing High Scale OLTP Systems” • Data Warehouse • Ongoing Database Development • Bulk Load – SQL CAT paper + TK session • “The Data Loading Performance Guide” Other Sessions with common coverage: Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham Notes from the field: High Performance Storage, John Langford SQL Server Storage – 1000GB Level, Brent Ozar
CPU CPU CPU CPU System Bus MCH PXH PXH ICH Symmetric Multi-Processing SMP, processors are not dedicated to specific tasks (ASMP), single OS image, each processor can acess all memory SMP makes no reference to memory architecture? Not to be confused to Simultaneous Multi-Threading (SMT) Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)
Non-Uniform Memory Access CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Memory Controller Memory Controller Memory Controller Memory Controller Node Controller Node Controller Node Controller Node Controller Shared Bus or X Bar • NUMA Architecture - Path to memory is not uniform • Node: Processors, Memory, Separate or combined Memory + Node Controllers • Nodes connected by shared bus, cross-bar, ring • Traditionally, 8-way+ systems Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware
Opteron Opteron Opteron Opteron HT2100 HT2100 HT1100 AMD Opteron Technically, Opteron is NUMA, but remote node memory latency is low, no negative impact or erratic behavior! For practical purposes: behave like SMP system Local memory latency ~50ns, 1 hop ~100ns, two hop 150ns? Actual: more complicated because of snooping (cache coherency traffic)
CPU 0 CPU 1 CPU 2 CPU 3 CPU 4 CPU 5 CPU 6 CPU 7 8-way Opteron Sys Architecture Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links. Note 8-way top and bottom right processors use 2 HT to connect to other processors, 3rd HT for IO, CPU 1 & 7 require 3 hops to each other
Nehalem System Architecture Intel Nehalem generation processors have Quick Path Interconnect (QPI) Xeon 5500/5600 series have 2, Xeon 7500 series have 4 QPI 8-way Glue-less is possible
NUMA Local and Remote Memory • Local memory is closer than remote • Physical access time is shorter • What is actual access time? • With cache coherency requirement!
HT Assist – Probe Filter part of L3 cache used as directory cache ZDNET
Source Snoop Coherency From HP PREMA Architecture whitepaper: All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980G7 From HP PREAM Architecture whitepaper: Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory (*only cache tags, not cache data)
HP ProLiant DL980 Architecture Node Controllers reduces effective memory latency
Superdome 2 – Itanium, sx3000 Agent – Remote Ownership Tag + L4 cache tags 64M eDRAM L4 cache data
IBM x3850 X5 (Glue-less) Connect two 4-socket Nodes to make 8-way system
Fujitsu R900 4 IOH 14 x8 PCI-E slots, 2 x4, 1x8 internal
44 60 14 56 62 50 62 60 12 30 58 28 46 48 52 54 31 51 49 61 63 45 63 59 55 29 57 61 53 47 13 15 46 24 38 56 26 44 42 10 8 40 42 34 58 40 36 32 25 43 41 57 47 27 37 35 59 41 11 39 43 9 45 33 22 28 52 38 20 24 18 16 26 54 22 36 6 30 4 20 53 29 39 19 27 21 17 55 7 37 25 21 23 23 31 5 50 2 14 8 0 34 6 10 32 4 16 0 48 18 12 2 33 5 9 11 13 1 1 3 17 19 7 15 51 35 49 3 Node 0 Node 0 Node 0 Node 0 Node 0 Node 0 Node 0 Node 0 OS Memory Models SUMA: Sufficiently Uniform Memory Access Memory interleaved across nodes 2 1 NUMA: first interleaved within a node, then spanned across nodes 1 2 Memory stripe is then spanned across nodes
24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 Node 0 Node 1 Node 2 Node 3 6 7 14 15 22 23 30 31 4 5 12 13 20 21 28 29 2 3 10 11 18 19 26 27 0 1 8 9 16 17 24 25 Node 0 Node 1 Node 2 Node 3 OS Memory Models SUMA: Sufficiently Uniform Memory Access Memory interleaved across nodes 2 1 NUMA: first interleaved within a node, then spanned across nodes 2 1 Memory stripe is then spanned across nodes
24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 Node 0 Node 1 Node 2 Node 3 6 7 14 15 22 23 30 31 5 13 21 29 4 12 20 28 2 3 10 11 18 19 26 27 0 1 8 9 16 17 24 25 Node 0 Node 1 Node 2 Node 3 Windows OS NUMA Support • Memory models • SUMA – Sufficiently Uniform Memory Access • NUMA – separate memory pools by Node Memory is striped across NUMA nodes
Memory Model Example: 4 Nodes • SUMA Memory Model • memory access uniformly distributed • 25% of memory accesses local, 75% remote • NUMA Memory Model • Goal is better than 25% local node access • True local access time also needs to be faster • Cache Coherency may increase local access
Cal Tex SE NE Node 6 Node 4 Node 2 Node 0 0-0 6-0 4-0 3-0 5-0 1-0 2-0 7-0 0-1 1-1 2-1 3-1 4-1 5-1 6-1 7-1 PNW Mnt MidA Cen Node 5 Node 1 Node 7 Node 3 Architecting for NUMA End to End Affinity App Server TCP Port CPU Memory Table Web determines port for each user by group (but should not be by geography!) Affinitize port to NUMA node Each node access localized data (partition?) OS may allocate substantial chunk from Node 0? North East 1440 1441 1442 1443 1444 1445 1446 1447 Mid Atlantic South East Central Texas Mountain California Pacific NW
Cal Tex SE NE Node 6 Node 4 Node 2 Node 0 0-0 6-0 4-0 3-0 5-0 1-0 2-0 7-0 0-1 1-1 2-1 3-1 4-1 5-1 6-1 7-1 PNW Mnt MidA Cen Node 5 Node 1 Node 7 Node 3 Architecting for NUMA End to End Affinity App Server TCP Port CPU Memory Table Web determines port for each user by group (but should not be by geography!) Affinitize port to NUMA node Each node access localized data (partition?) OS may allocate substantial chunk from Node 0? North East 1440 1441 1442 1443 1444 1445 1446 1447 Mid Atlantic South East Central Texas Mountain California Pacific NW
HP-UX LORA • HP-UX – Not Microsoft Windows • Locality-Optimizer Resource Alignment • 12.5% Interleaved Memory • 87.5% NUMA node Local Memory
System Tech Specs Processors Cores DIMM PCI-E G2 Total Cores Max memory Base 2 x Xeon X56x0 6 18 5 x8+,1 x4 12 192G* $7K 4 x Opteron 6100 12 32 5 x8, 1 x4 48 512G $14K 4 x Xeon X7560 8 64 4 x8, 6 x4† 32 1TB $30K 8 x Xeon X7560 8 128 9 x8, 5 x4‡ 64 2TB $100K 8GB $400 ea 18 x 8G = 144GB, $7200, 64 x 8G = 512GB - $26K 16GB $1100 ea 12 x16G =192GB, $13K, 64 x 16G = 1TB – $70K Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB, † Dell R910 and HP DL580G7 have different PCI-E ‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots
Operating System • Windows Server 2003 RTM, SP1 • Network limitations (default) • Scalable Networking Pack (912222) • Windows Server 2008 • Windows Server 2008 R2 (64-bit only) • Breaks 64 logical processor limit • NUMA IO enhancements? Impacts OLTP Search: MSI-X Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server Don’t try to do DW on SQL Server 2000
SQL Server version • SQL Server 2000 • Serious disk IO limitations (1GB/sec ?) • Problematic parallel execution plans • SQL Server 2005 (fixed most S2K problems) • 64-bit on X64 (Opteron and Xeon) • SP2 – performance improvement 10%(?) • SQL Server 2008 & R2 • Compression, Filtered Indexes, etc • Star join, Parallel query to partitioned table
Configuration • SQL Server Startup Parameter: E • Trace Flags 834, 836, 2301 • Auto_Date_Correlation • Order date < A, Ship date > A • Implied: Order date > A-C, Ship date < A+C • Port Affinity – mostly OLTP • Dedicated processor ? • for log writer ?