2.54k likes | 2.55k Views
Explore the system architecture of NUMA (Non-Uniform Memory Access), including its impact on memory latency and performance. Learn about CPU configurations, memory controllers, shared buses, and the differences between NUMA and SMP systems.
E N D
System Architecture: Big Iron (NUMA) Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang • SQL Server Execution Plan Cost Model • True cost structure by system architecture • Decoding statblob (distribution statistics) • SQL Clone – statistics-only database • Tools • ExecStats – cross-reference index use by SQL-execution plan • Performance Monitoring, • Profiler/Trace aggregation
Scaling SQL on NUMA Topics • OLTP – Thomas Kejser session • “Designing High Scale OLTP Systems” • Data Warehouse • Ongoing Database Development • Bulk Load – SQL CAT paper + TK session • “The Data Loading Performance Guide” Other Sessions with common coverage: Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham Notes from the field: High Performance Storage, John Langford SQL Server Storage – 1000GB Level, Brent Ozar
CPU CPU CPU CPU System Bus MCH PXH PXH ICH Symmetric Multi-Processing SMP, processors are not dedicated to specific tasks (ASMP), single OS image, each processor can acess all memory SMP makes no reference to memory architecture? Not to be confused to Simultaneous Multi-Threading (SMT) Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)
Non-Uniform Memory Access CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Memory Controller Memory Controller Memory Controller Memory Controller Node Controller Node Controller Node Controller Node Controller Shared Bus or X Bar • NUMA Architecture - Path to memory is not uniform • Node: Processors, Memory, Separate or combined Memory + Node Controllers • Nodes connected by shared bus, cross-bar, ring • Traditionally, 8-way+ systems Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware
Opteron Opteron Opteron Opteron HT2100 HT2100 HT1100 AMD Opteron Technically, Opteron is NUMA, but remote node memory latency is low, no negative impact or erratic behavior! For practical purposes: behave like SMP system Local memory latency ~50ns, 1 hop ~100ns, two hop 150ns? Actual: more complicated because of snooping (cache coherency traffic)
CPU 0 CPU 1 CPU 2 CPU 3 CPU 4 CPU 5 CPU 6 CPU 7 8-way Opteron Sys Architecture Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links. Note 8-way top and bottom right processors use 2 HT to connect to other processors, 3rd HT for IO, CPU 1 & 7 require 3 hops to each other
Nehalem System Architecture Intel Nehalem generation processors have Quick Path Interconnect (QPI) Xeon 5500/5600 series have 2, Xeon 7500 series have 4 QPI 8-way Glue-less is possible
NUMA Local and Remote Memory • Local memory is closer than remote • Physical access time is shorter • What is actual access time? • With cache coherency requirement!
HT Assist – Probe Filter part of L3 cache used as directory cache ZDNET
Source Snoop Coherency From HP PREMA Architecture whitepaper: All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980G7 From HP PREAM Architecture whitepaper: Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory (*only cache tags, not cache data)
HP ProLiant DL980 Architecture Node Controllers reduces effective memory latency
Superdome 2 – Itanium, sx3000 Agent – Remote Ownership Tag + L4 cache tags 64M eDRAM L4 cache data
IBM x3850 X5 (Glue-less) Connect two 4-socket Nodes to make 8-way system
24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 Node 0 Node 1 Node 2 Node 3 6 7 14 15 22 23 30 31 4 5 12 13 20 21 28 29 2 3 10 11 18 19 26 27 0 1 8 9 16 17 24 25 Node 0 Node 1 Node 2 Node 3 OS Memory Models SUMA: Sufficiently Uniform Memory Access Memory interleaved across nodes 2 1 NUMA: first interleaved within a node, then spanned across nodes 2 1 Memory stripe is then spanned across nodes
24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 Node 0 Node 1 Node 2 Node 3 6 7 14 15 22 23 30 31 5 13 21 29 4 12 20 28 2 3 10 11 18 19 26 27 0 1 8 9 16 17 24 25 Node 0 Node 1 Node 2 Node 3 Windows OS NUMA Support • Memory models • SUMA – Sufficiently Uniform Memory Access • NUMA – separate memory pools by Node Memory is striped across NUMA nodes
Memory Model Example: 4 Nodes • SUMA Memory Model • memory access uniformly distributed • 25% of memory accesses local, 75% remote • NUMA Memory Model • Goal is better than 25% local node access • True local access time also needs to be faster • Cache Coherency may increase local access
Cal Tex SE NE Node 6 Node 4 Node 2 Node 0 0-0 6-0 4-0 3-0 5-0 1-0 2-0 7-0 0-1 1-1 2-1 3-1 4-1 5-1 6-1 7-1 PNW Mnt MidA Cen Node 5 Node 1 Node 7 Node 3 Architecting for NUMA End to End Affinity App Server TCP Port CPU Memory Table Web determines port for each user by group (but should not be by geography!) Affinitize port to NUMA node Each node access localized data (partition?) OS may allocate substantial chunk from Node 0? North East 1440 1441 1442 1443 1444 1445 1446 1447 Mid Atlantic South East Central Texas Mountain California Pacific NW
Cal Tex SE NE Node 6 Node 4 Node 2 Node 0 0-0 6-0 4-0 3-0 5-0 1-0 2-0 7-0 0-1 1-1 2-1 3-1 4-1 5-1 6-1 7-1 PNW Mnt MidA Cen Node 5 Node 1 Node 7 Node 3 Architecting for NUMA End to End Affinity App Server TCP Port CPU Memory Table Web determines port for each user by group (but should not be by geography!) Affinitize port to NUMA node Each node access localized data (partition?) OS may allocate substantial chunk from Node 0? North East 1440 1441 1442 1443 1444 1445 1446 1447 Mid Atlantic South East Central Texas Mountain California Pacific NW
HP-UX LORA • HP-UX – Not Microsoft Windows • Locality-Optimizer Resource Alignment • 12.5% Interleaved Memory • 87.5% NUMA node Local Memory
System Tech Specs Processors Cores DIMM PCI-E G2 Total Cores Max memory Base 2 x Xeon X56x0 6 18 5 x8+,1 x4 12 192G* $7K 4 x Opteron 6100 12 32 5 x8, 1 x4 48 512G $14K 4 x Xeon X7560 8 64 4 x8, 6 x4† 32 1TB $30K 8 x Xeon X7560 8 128 9 x8, 5 x4‡ 64 2TB $100K 8GB $400 ea 18 x 8G = 144GB, $7200, 64 x 8G = 512GB - $26K 16GB $1100 ea 12 x16G =192GB, $13K, 64 x 16G = 1TB – $70K Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB, † Dell R910 and HP DL580G7 have different PCI-E ‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots
Operating System • Windows Server 2003 RTM, SP1 • Network limitations (default) • Scalable Networking Pack (912222) • Windows Server 2008 • Windows Server 2008 R2 (64-bit only) • Breaks 64 logical processor limit • NUMA IO enhancements? Impacts OLTP Search: MSI-X Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server Don’t try to do DW on SQL Server 2000
SQL Server version • SQL Server 2000 • Serious disk IO limitations (1GB/sec ?) • Problematic parallel execution plans • SQL Server 2005 (fixed most S2K problems) • 64-bit on X64 (Opteron and Xeon) • SP2 – performance improvement 10%(?) • SQL Server 2008 & R2 • Compression, Filtered Indexes, etc • Star join, Parallel query to partitioned table
Configuration • SQL Server Startup Parameter: E • Trace Flags 834, 836, 2301 • Auto_Date_Correlation • Order date < A, Ship date > A • Implied: Order date > A-C, Ship date < A+C • Port Affinity – mostly OLTP • Dedicated processor ? • for log writer ?
Storage Performance for Data Warehousing Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang • SQL Server Execution Plan Cost Model • True cost structure by system architecture • Decoding statblob (distribution statistics) • SQL Clone – statistics-only database • Tools • ExecStats – cross-reference index use by SQL-execution plan • Performance Monitoring, • Profiler/Trace aggregation
Organization Structure • In many large IT departments • DB and Storage are in separate groups • Storage usually has own objectives • Bring all storage into one big system under full management (read: control) • Storage as a Service, in the Cloud • One size fits all needs • Usually have zero DB knowledge Of course we do high bandwidth, 600MB/sec good enough for you?
Data Warehouse Storage • OLTP – Throughput with Fast Response • DW – Flood the queues for maximum through-put Do not use shared storage for data warehouse! Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs. Just tell us how much capacity you need and don’t worry about anything else. My advice: stay away from shared storage, controlled by different team.
Nominal and Net Bandwidth • PCI-E Gen 2 – 5 Gbit/sec signaling • x8 = 5GB/s, net BW 4GB/s, x4 = 2GB/s net • SAS 6Gbit/s – 6 Gbit/s • x4 port: 3GB/s nominal, 2.2GB/sec net? • Fibre Channel 8 Gbit/s nominal • 780GB/s point-to-point, • 680MB/s from host to SAN to back-end loop • SAS RAID Controller, x8 PCI-E G2, 2 x4 6G • 2.8GB/s Depends on the controller, will change!
SAS x4 SAS x4 RAID PCI-E x8 SAS x4 SAS x4 SAS x4 SAS x4 RAID PCI-E x8 SAS x4 SAS x4 RAID PCI-E x8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RAID PCI-E x8 RAID PCI-E x4 SAS x4 2 x10GbE PCI-E x4 PCI-E x4 2 x10GbE Storage – SAS Direct-Attach Many Fat Pipes Very Many Disks Option A: 24-disks in one enclosure for each x4 SAS port. Two x4 SAS ports per controller Option B: Split enclosure over 2 x4 SAS ports, 1 controller Balance by pipe bandwidth Don’t forget fat network pipes
8Gb FC 8Gb FC HBA 8Gb FC 8Gb FC PCI-E x8 8Gb FC 8Gb FC 8Gb FC 8Gb FC HBA PCI-E x8 8Gb FC 8Gb FC HBA PCI-E x4 8Gb FC 8Gb FC . . . . . . 8Gb FC 8Gb FC 8Gb FC . . . . . . HBA . . PCI-E x4 8Gb FC 8Gb FC 8Gb FC . . . . . . . . . . . . . . . . HBA PCI-E x4 . . HBA HBA PCI-E x4 PCI-E x4 2 x10GbE PCI-E x4 PCI-E x4 2 x10GbE Storage – FC/SAN PCI-E x8 Gen 2 Slot with quad-port 8Gb FC If 8Gb quad-port is not supported, consider system with many x4 slots, or consider SAS! SAN systems typically offer 3.5in 15-disk enclosures. Difficult to get high spindle count with density. 1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?
SAS x4 SAS x4 SAS SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD PCI-E x8 SAS x4 SAS x4 SAS x4 SAS x4 SAS PCI-E x8 SAS x4 SAS x4 . . . . . . . . SAS PCI-E x8 . . . . . . . . . . . . . . . . . . . . . . . . SAS PCI-E x8 PCI-E x4 RAID SAS x4 2 x10GbE PCI-E x4 PCI-E x4 2 x10GbE Storage – SSD / HDD Hybrid No RAID w/SSD? Storage enclosures typically 12 disks per channel. Can only support bandwidth of a few SSD. Use remaining bays for extra storage with HDD. No point expending valuable SSD space for backups and flat files Log: Single DB – HDD, unless rollbacks or T-log backups disrupts log writes. Multi DB – SSD, otherwise to many RAID1 pairs to logs
SSD • Current: mostly 3Gbps SAS/SATA SDD • Some 6Gbps SATA SSD • Fusion IO – direct PCI-E Gen2 interface • 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s • No RAID ? • HDD is fundamentally a single point failure • SDD could be built with redundant components • HP report problems with SSD on RAID controllers, Fujitsu did not?
Big DW Storage – iSCSI • Are you nuts?
Storage Configuration - Arrays Shown: two 12-disk Arrays per 24-disk enclosure Options: between 6-16 disks per array SAN systems may recommend R10 4+4 or R5 7+1 Very Many Spindles Comment on Meta LUN
Processors GHz Total Cores Mem GB SQL Q1 sec SF Total MB/s MB/s per core 2 Xeon 5355 2.66 8 64 5sp2 85.4 100 1,165.5 145.7 Conroe 2 Xeon 5570 2.93 8 144 8sp1 42.2 100 2,073.5 259.2 Nehalem 2 Xeon 5680 3.33 12 192 8r2 21.0 100 4,166.7 347.2 Westmere 4 Xeon 7560 8 Xeon 7560 2.26 2.26 32 64 640 512 8r2 8r2 37.2 183.8 3000 300 7,056.5 14,282 220.5 223.2 Neh.-EX Data Consumption Rate: Xeon TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M Data consumption rate is much higher for current generation Nehalem and Westmere processors than Core 2 referenced in Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.
8 Opt 8439 8 Opt 8439 2.8 2.8 48 48 256 512 8sp1 8rtm 166.9 49.0 300 1000 5,242.7 5,357.1 109.2 111.6 2 Opt 6176 4 Opt 6176 2.3 2.3 24 48 192 512 8r2 8r2 20.2 31.8 300 100 4,331.7 8,254.7 180.5 172.0 - Magny-C Data Consumption Rate: Opteron TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M Processors GHz Total Cores Mem GB SQL Q1 sec SF Total MB/s MB/s per core 4 Opt 8220 2.8 8 128 5rtm 309.7 300 868.7 121.1 8 Opt 8360 2.5 32 256 8rtm 91.4 300 2,872.0 89.7 Barcelona 8 Opt 8384 2.7 32 256 8rtm 72.5 300 3,620.7 113.2 Shanghai Istanbul Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core! (at 2.3GHz versus 2.8 for Istanbul), or is this Win/SQL 2K8 R2?
Processors GHz Total Cores Mem GB SQL Q1 sec SF Total MB/s MB/s per core 2 Xeon 5355 2.66 8 64 5sp2 85.4 100 1165.5 145.7 2 Xeon 5570 2.93 8 144 8sp1 42.2 100 2073.5 259.2 2 Xeon 5680 3.33 12 192 8r2 21.0 100 4166.7 347.2 2 Opt 6176 2.3 24 192 8r2 20.2 100 4331.7 180.5 4 Opt 8220 2.8 8 128 5rtm 309.7 300 868.7 121.1 8 Opt 8360 2.5 32 256 8rtm 91.4 300 2872.0 89.7 8 Opt 8384 2.7 32 256 8rtm 72.5 300 3620.7 113.2 8 Opt 8439 2.8 48 256 8sp1 49.0 300 5357.1 111.6 4 Opt 6176 2.3 48 512 8r2 31.8 300 8254.7 172.0 8 Xeon 7560 2.26 64 512 8r2 183.8 3000 14282 223.2 Data Consumption Rate TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M Barcelona Shanghai Istanbul Magny-C
Processors BW Core Total Cores Target MB/s PCI-E x8-x4 SAS HBA Storage Units/Disks Actual Bandwidth Storage Units/Disks 2 Xeon X5680 350 12 4200 5 - 1 2 2 - 48 5 GB/s 4 - 96 4 Opt 6176 8400 175 48 5 - 1 4 4 - 96 10 GB/s 8 - 192 4 Xeon X7560 8000 32 250 6 - 4 6 6 - 144 15 GB/s 12 - 288 8 Xeon X7560 225 64 14400 9 - 5 11† 10 - 240 26 GB/s 20 - 480 Storage Targets 2U disk enclosure 24 x 73GB 15K 2.5in disks $14K, $600 per disk † 8-way : 9 controllers in x8 slots, 24 disks per x4 SAS port 2 controllers in x4 slots, 12 disk 24 15K disks per enclosure, 12 disks per x4 SAS port requires 100MB/sec per disk, possible but not always practical 24 disks per x4 SAS port requires 50MB/sec, more achievable in practice Think: Shortest path to metal (iron-oxide)
Model Disks BW (KB/s) Sequential IOPS “Random” IOPS Sequential- Rand IO ratio Optimizer - 10,800 1,350 320 4.22 SAS 2x4 24 2,800,000 350,000 9,600 36.5 SAS 2x4 48 2,800,000 350,000 19,200 18.2 FC 4G 30 360,000 45,000 12,000 3.75 SSD 8 2,800,000 350,000 280,000 1.25 Your Storage and the Optimizer Assumptions 2.8GB/sec per SAS 2 x4 Adapter, Could be 3.2GB/sec per PCI-E G2 x8 HDD 400 IOPS per disk – Big query key lookup, loop join at high queue, and short-stroked, possible skip-seek. SSD 35,000 IOPS The SQL Server Query Optimizer make key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio A DW configured storage system has a 18-36 ratio, 30 disks per 4G FC about matches the QO, SSD is in the other direction
Data Consumption Rates TPC-H SF100 Query 1, 9, 13, 21 TPC-H SF300 Query 1, 9, 13, 21