Maximizing Desktop Application Performance on Dual-Core PC Platforms Richard A. Brunner AMD Fellow Advanced Micro Devices
Session Outline • Thread-Level Parallelism • Introduction • Processor techniques for Thread-Level Parallelism • AMD Dual-Core Technology • Silicon Basics • How Software detects AMD Dual-Core Technology • Multi-Threaded Programming • Programming Observations & Amdahl’s Law • Relevant Microsoft Windows® API • Licensing • Demo
It’s About Threads (1) … • To understand why AMD Dual-Core technology matters, we have to review some notions about a “thread” • Thread = one sequential set of program steps • Typical program has just 1 sequential set of steps => so typical program is “single-threaded”, e.g: • Thread 0: for (i=0; i<3*N; ++i){a[i]=b[i]*c[i];} • Multi-threading = rewrite program into set of independent steps that can execute in parallel, e.g: • Thread 0: for (i=0; i<N; ++i){a[i]=b[i]*c[i];} • Thread 1: for (i=N; i<2*N; ++i){a[i]=b[i]*c[i];} • Thread 2: for (i=2*N; i<3*N; ++i){a[i]=b[i]*c[i];} • A Process (main program) can have 1 or more threads: • Each thread is independent set of program steps • Has private program counter, stack, and storage • Shares the process address space & attributes with other threads
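The split of one loop into per-thread index ranges can be sketched in C as follows. This is a minimal illustration, not code from the presentation: `chunk_range`, `chunk_for_thread`, and `multiply_chunk` are hypothetical names, and thread creation itself is left out so the sketch stays portable. Each returned range is what one thread would be handed.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) handled by one thread. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_range;

/* Split `total` iterations into `num_threads` contiguous chunks and
 * return the chunk for `thread_index`. Any remainder is spread over the
 * first chunks, so chunk sizes differ by at most one iteration. */
chunk_range chunk_for_thread(size_t total, size_t num_threads,
                             size_t thread_index)
{
    assert(num_threads > 0 && thread_index < num_threads);
    size_t base = total / num_threads;
    size_t extra = total % num_threads;
    chunk_range r;
    r.begin = thread_index * base +
              (thread_index < extra ? thread_index : extra);
    r.end = r.begin + base + (thread_index < extra ? 1 : 0);
    return r;
}

/* The loop body from the slide, restricted to one chunk. The chunks do
 * not overlap, so separate threads never write the same element. */
void multiply_chunk(double *a, const double *b, const double *c,
                    chunk_range r)
{
    for (size_t i = r.begin; i < r.end; ++i) {
        a[i] = b[i] * c[i];
    }
}
```

With `total = 3*N` and `num_threads = 3`, the three ranges are `[0, N)`, `[N, 2*N)`, and `[2*N, 3*N)`, matching Thread 0, Thread 1, and Thread 2 on the slide.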
It’s About Threads (2) … [Figure: multi-tasking (time-slicing) on a single CPU. Over time the CPU alternates between Thread A software code (slices A1 to A4) and Thread B software code (slices B1 to B4); while one thread executes, the other is not executing.] • When an operating system time-slices between programs, it’s actually time-slicing between threads of many programs at once • It doesn’t really matter whether the time-sliced threads are from the same program or not
Improving Thread Performance (1) [Figure: AMD 64-bit processor core with L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, and HyperTransport™, annotated with the tricks tried so far: memory bandwidth, pipeline length, frequency, cache size and hierarchies, more execution units, I/O bandwidth.] • To improve performance of programs, improve performance of threads • Lots of tricks have been tried … • Each trick eventually hit a brick wall due to a combination of physics, pricing, and perplexity • Especially want to improve performance on single-processor systems • What’s next? Run more threads in parallel …
Improving Thread Performance (2) • Improve performance of the system and application by running more threads in parallel • i.e., increase “Thread-Level Parallelism” (TLP) • Across a system, this improves throughput (number of jobs per unit time) • Across an application, this reduces application time • Improving TLP requires Software co-operation: • Windows already has lots of threads to run across programs to use these hardware tricks • To benefit even more from TLP, applications need to be re-written to be multi-threaded • We’ll review the issues, later …
Improving Thread Performance (3) • Improving TLP requires CPU (hardware) innovation. Some techniques we’ll review: • SMP: Symmetric Multi-Processing • SMT: Simultaneous Multi-Threading (Hyperthreading) • CMP: Core Multi-Processing (Dual and Multi Core) • AMD is introducing Dual-Core technology as a hardware method to improve TLP • Allows AMD to: • Keep offering logical, evolutionary performance improvements • Meet the system architecture demands of our customers.
Symmetric Multi-Processing [Figure: single-core multi-processor system. Processor 0 executes Thread A software code (A1 to A4) while Processor 1 executes Thread B software code (B1 to B4) at the same time.] • Run threads on traditional, single-core multiprocessor system • Each thread can use full CPU resources when executing. • Well-known technique; has been used for decades. • Well supported by Windows • Example: 2-way AMD Opteron™ processor • Great for Servers and Workstations.
Simultaneous Multi-Threading [Figure: one physical processor split into Logical CPU 0 and Logical CPU 1. Each logical CPU has dedicated SW registers, 0.5 of the L1 I-cache, and an ITLB; the execution units, D-cache, DTLB, and L2 cache are shared.] • Divides hardware resources of one physical processor core into “N” number of mostly independent logical processor “cores” • Operating system views logical cores as “real” cores; schedules threads on each • Uses a slightly-modified SMP model for OS scheduling • So N threads can run in parallel • A “logical” core is implemented as a combination of dedicated and shared “real” hardware in a physical core. • Tries to capitalize on the fact that for some applications, a single thread under-utilizes available processor resources. • Under utilization made worse due to long memory latency • Non-optimal for many real-world applications
Intel HyperThreading [Figure: same diagram as the previous slide: one physical processor split into Logical CPU 0 and Logical CPU 1, each with dedicated SW registers, 0.5 of the L1 I-cache, and an ITLB, sharing the execution units, D-cache, DTLB, and L2 cache.] • Intel HyperThreading on Pentium® 4 is SMT: One Physical Processor core partitioned into two logical cores • Each logical core has its own copy of software-visible registers & Instruction TLBs. Each logical CPU gets half of L1 I-Cache • Logical CPUs share DCache, L2, DTLBs, integer & FP units • Sharing can lead to cache-thrashing between threads • Sharing can lead to resource contention for arithmetic units
Core Multi-Processing [Figure: single processor with two cores. Core0 executes Thread A software code (A1 to A4) while Core1 executes Thread B software code (B1 to B4) at the same time.] • Put multiple “real” cores in one processor package • Each thread can use full CPU resources when executing • Operating System schedules threads on each like SMP • Uses a slightly-modified SMP model for OS scheduling • Great For Desktop! • Brings the benefits of SMP at the cost and form-factor of uni-processor desktop • Great for Multiprocessor Servers and Workstations • Brings the benefit of an N-way system at the cost and form factor of an (N/2)-way system • Example: AMD Dual-Core processor
Introducing AMD64 Dual Core Processor [Figure: die layout showing Core 0 with its 1-MB L2, the Northbridge, and Core 1 with its 1-MB L2.] • Two AMD Opteron™ CPU cores on one single die, each with 1MB L2 cache • 90nm, ~205 million transistors* • Approximately same die size as 130nm single-core AMD Opteron processor* • 95 watt power envelope fits into 90nm power infrastructure • 940 Socket compatible • AMD expects to be first to introduce dual-core for the one- to eight-processor server and workstation market in mid-2005 • Dual-core processors for client market are expected to follow *Based on current revisions of the design
Designed From The Start To Add Second Core [Figure: existing AMD64 processor design. Core 0 and Core 1, each with a 1MB L2 cache, connect through the SRI and X-bar to the DDR1 DRAM interface and HyperTransport™ links 0, 1, 2.] • Shared Northbridge • 3 HyperTransport™ technology links • Dual-channel (128 bit) DDR i/f • AMD Opteron CPU with Direct Connect Architecture was designed as CMP from the start • 2nd port on SRI, request management, 2 APICs • Two complete CPU cores • SMP model • Simpler, less restrictive programming model than ‘logical core’ approach • no need to “pause” one core to give other exclusive use of shared resources
Processor versus Core • Processor: physical packaged die that plugs into a socket on the motherboard and contains 1 or more cores. • Core: 1 complete private set of registers, execution units, and retirement queues needed to execute x86 programs; managed & scheduled as single x86 processing resource by the OS. • CPU Numbering scheme uses LSBs of Initial APIC ID to distinguish cores in one processor package. • High-order bits distinguish packages • Initial APIC ID provided by CPUID (eax=1) • Example: 2-Processors/4-Core system means 2 processors populate 2 sockets with 2 cores per processor [Figure: four cores labeled CORE 000, CORE 001, CORE 010, and CORE 011.]
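The numbering scheme above can be sketched as simple bit manipulation. This is an illustrative sketch, not code from the presentation: the function names are hypothetical, and `core_bits` (how many low-order bits identify the core) is an assumed parameter that would be 1 for the dual-core example. The one fact taken from outside the slide is that CPUID function 1 reports the Initial APIC ID in EBX[31:24].

```c
#include <assert.h>
#include <stdint.h>

/* Extract the Initial APIC ID from the EBX value returned by
 * CPUID (eax=1); it occupies bits 31:24. */
uint32_t initial_apic_id_from_ebx(uint32_t cpuid1_ebx)
{
    return (cpuid1_ebx >> 24) & 0xFFu;
}

/* Core index within the package: the low `core_bits` bits of the ID. */
uint32_t apic_core_index(uint32_t initial_apic_id, unsigned core_bits)
{
    assert(core_bits < 32);
    return initial_apic_id & ((1u << core_bits) - 1u);
}

/* Package (socket) index: the remaining high-order bits of the ID. */
uint32_t apic_package_index(uint32_t initial_apic_id, unsigned core_bits)
{
    assert(core_bits < 32);
    return initial_apic_id >> core_bits;
}
```

For the 2-processor/4-core example with `core_bits = 1`, ID `011` decodes to package 1, core 1, and ID `000` decodes to package 0, core 0.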
AMD Direct Connect Architecture + Dual-Core [Figure: four AMD Opteron 800 processors, each with CORE 0 and CORE 1, joined by 16x16 links, with DDR1 memory, PCI-E, a South Bridge, and a chipset attached.] • AMD Direct Connect Architecture • Everything connected directly to processor • Reduces system architecture bottlenecks • Further reduces latency by directly connecting two cores on same die • Demo of AMD Opteron™ dual-core processor-based systems on August 31, 2004 • World’s first demonstration of x86-class dual-core processor • 4 processor/8 core systems running Windows®
Traditional FSB System Architecture [Figure: server processor attached to a North Bridge with DDR memory and a PCI-X bridge; a South Bridge provides PCI, IDE, FDC, USB, etc.] • Memory access delayed by passing through northbridge • I/O & memory compete for CPU’s FSB B/W • B/W bottlenecks: link B/W < I/O device B/W • More chips needed for basic server
AMD64 Processor with Direct Connect Architecture [Figure: AMD Opteron™ processor with directly attached DDR memory, connected to a PCI-X / PCIe bridge and the AMD-8111™ I/O hub (PCI, IDE, FDC, USB, etc.).] • HyperTransport™ technology for glueless I/O or CPU expansion • HyperTransport bus has ample bandwidth for I/O devices • Separate memory and I/O paths eliminate most bus contention • Fewer chips needed for basic server
Traditional FSB System Architecture [Figure: four processors sharing one North Bridge, which connects to two memory expanders with DDR, three PCI-X bridges, and a South Bridge (PCI, IDE, FDC, USB, etc.).] • System scalability limited by Northbridge • Max of 4 processors • Processors compete for FSB bandwidth • Memory size and bandwidth are limited • Max of 3 PCI-X bridges • Many more chips required
800-Series AMD Opteron™ Processor-based Server [Figure: block diagram of a 4-processor server. Four AMD Opteron™ processors with attached DDR memory (144-bit registered DDR) are linked by coherent HyperTransport (cHT) links [1]; HyperTransport links [2], [3], and [4] attach two AMD-8131™ devices (PCI-X at 64 bits @ 133MHz and 64 bits @ 66MHz, Gbit Ethernet, SCSI) and an AMD-8111™ device (PCI-33, VGA, LPC, flash, USB, Ethernet, AC97, IDE, BMC, SIO).] • Idle Latencies to First Data* • 1P System: <59ns • 0-Hop in 4P System: ~85ns • 1-Hop in 4P System: <95ns • 2-Hop in 4P System: <127ns • Link legend: • [1] = 16x16 Coherent HyperTransport™ @ 2000MT/s • [2] = 16x16 HyperTransport @ 2000MT/s • [3] = 8x8 HyperTransport @ 400MT/s • [4] = 8x8 HyperTransport @ 1600MT/s *(2.8GHz CPU, 200MHz PC3200 DRAM (closed page), 1000MHz HT)
SSE3 Support • AMD dual-core processors are designed to support SSE3 • Supports SSE3 instructions reported by CPUID.SSE3 feature flag • 10 new SSE instructions and 1 new x87 instruction (13 total opcodes) • No Monitor or Mwait for Hyperthreading which have separate CPUID flag anyway • ADDSUB[PD,PS] xmm1, xmm2/m128: Provides interleaved packed add and subtract • FISTTP m16int/m32int/m64int: Like FISTP but with forced Truncation • HADD[PD,PS] xmm1, xmm2/m128: Horizontal Adds • HSUB[PD,PS] xmm1, xmm2/m128: Horizontal Subtracts • LDDQU xmm, m128: Special 128-bit Unaligned Load • MOV[DD,SHD,SLD]UP xmm1, xmm2/m64: Move and Duplicate some elements
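A feature-flag check for SSE3 might look like the sketch below. The slide does not give bit positions; the ones used here (CPUID function 1, ECX bit 0 for SSE3 and ECX bit 3 for MONITOR/MWAIT) come from the published CPUID documentation. The function names are hypothetical, and the caller is assumed to have already executed CPUID (eax=1) and captured ECX.

```c
#include <assert.h>
#include <stdint.h>

/* SSE3 support: CPUID (eax=1), ECX bit 0. */
int cpuid1_has_sse3(uint32_t cpuid1_ecx)
{
    return (int)(cpuid1_ecx & 1u);
}

/* MONITOR/MWAIT support is a separate flag: CPUID (eax=1), ECX bit 3.
 * Software must not infer it from the SSE3 flag. */
int cpuid1_has_monitor(uint32_t cpuid1_ecx)
{
    return (int)((cpuid1_ecx >> 3) & 1u);
}
```

Keeping the two checks separate mirrors the point on the slide: a processor can report SSE3 without reporting MONITOR/MWAIT.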
How Can Software Detect AMD Dual-Core? • Same steps as detecting SMP or SMT on x86/AMD64 • OS Kernel uses information from BIOS or reads special hardware registers to get number of CPUs (cores) in system • Each core has unique APIC ID assigned by BIOS • BIOS records CPU info in ACPI-MADT and MPS tables • BIOS records MP topology info in ACPI-SRAT and ACPI-SLIT • OS and App code also need to determine the number of physical (or logical) cores per processor • Information is key for efficient thread scheduling and memory allocation • Existing (legacy) software only expects logical cores. It uses the x86 “CPUID” instruction to get that number • So, AMD reports physical cores as logical cores for this form of CPUID. Lets legacy software exploit physical cores w/o change.
Legacy CMP/CPUID Support • Legacy software uses CPUID (eax=1) to get number of logical cores. AMD’s CPUID reports physical cores in same way: • CPUID.HTT=1 (edx[28]) • CPUID.logical_number_of_processors = 2 (ebx[23:16]) • Legacy software support for 2-logical cores, while more restrictive, appears to work equally fine for 2-physical cores • Hyperthreading scheduling rules work fine for multi-core • AMD has tested this model heavily with legacy software and expects no major problems • Migrating from hyperthreading rules to less restrictive multi-core rules becomes an optimization, not a requirement • New extended CPUID Feature bit, LEGACY_CMP, tells new software if the HTT fields above report Hyperthreading • LEGACY_CMP will be ‘1’ on AMD dual core indicating no HTT support
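The legacy fields can be decoded as in the following sketch. The EDX[28] and EBX[23:16] positions come from the slide. The location of LEGACY_CMP (extended function 8000_0001, ECX bit 1, named CmpLegacy in AMD's CPUID documentation) is not stated on the slide and is supplied here as an outside fact. The type and function names are hypothetical, register values are assumed to have been captured by the caller, and the three-way classification is a simplification for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    TOPOLOGY_SINGLE,        /* one core, one thread per processor     */
    TOPOLOGY_MULTI_CORE,    /* HTT fields describe physical cores     */
    TOPOLOGY_HYPERTHREADED  /* HTT fields describe logical processors */
} package_topology;

/* CPUID (eax=1): HTT flag is EDX bit 28. */
int cpuid1_htt_flag(uint32_t cpuid1_edx)
{
    return (int)((cpuid1_edx >> 28) & 1u);
}

/* CPUID (eax=1): logical processor count is EBX[23:16]. */
unsigned cpuid1_logical_count(uint32_t cpuid1_ebx)
{
    return (cpuid1_ebx >> 16) & 0xFFu;
}

/* CPUID (eax=8000_0001): LEGACY_CMP (CmpLegacy) is ECX bit 1. */
int cmp_legacy_flag(uint32_t ext1_ecx)
{
    return (int)((ext1_ecx >> 1) & 1u);
}

/* Combine the fields the way the slide describes: LEGACY_CMP = 1 means
 * the HTT fields report physical cores rather than Hyperthreading. */
package_topology classify_package(uint32_t cpuid1_edx, uint32_t cpuid1_ebx,
                                  uint32_t ext1_ecx)
{
    if (!cpuid1_htt_flag(cpuid1_edx) ||
        cpuid1_logical_count(cpuid1_ebx) < 2) {
        return TOPOLOGY_SINGLE;
    }
    return cmp_legacy_flag(ext1_ecx) ? TOPOLOGY_MULTI_CORE
                                     : TOPOLOGY_HYPERTHREADED;
}
```

Legacy software would stop after reading the first two fields and simply schedule two threads; only new software needs the LEGACY_CMP check to tell the two cases apart.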
Operating System Support for Dual-Core • Windows XP Home, Windows XP Pro, Windows 2003 (32-bit and 64-bit) support AMD Dual-Core using CMP Legacy Mode • First AMD dual-core silicon using this model booted Windows® within hours • Recent OS distributions that support Hyperthreading are expected to work well • New extended CPUID function (eax=8000_0008) returns on any core the number of physical cores per processor • Correct way for future OS and application software
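Decoding the extended function might be sketched as below. The slide names the function (eax=8000_0008) but not the register layout; per AMD's CPUID documentation the core count is reported in ECX[7:0] as the number of cores minus one, and function 8000_0000 returns the highest supported extended function in EAX. Both details are supplied here as outside facts, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The extended function is usable only if the maximum extended function
 * number (EAX from CPUID eax=8000_0000) reaches 8000_0008. */
int has_core_count_function(uint32_t max_extended_eax)
{
    return max_extended_eax >= 0x80000008u;
}

/* CPUID (eax=8000_0008): ECX[7:0] holds (cores per processor - 1). */
unsigned cores_per_processor(uint32_t ext8_ecx)
{
    unsigned cores = (ext8_ecx & 0xFFu) + 1u;
    assert(cores >= 1u && cores <= 256u);
    return cores;
}
```

A dual-core part would report 1 in ECX[7:0], giving `cores_per_processor` a result of 2.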
To Thread or Not To Thread? • Good Dual-Core processor technology needs good software to exploit it to the fullest • Modern OS software already understands SMP and will run more programs in parallel on Dual-Core • Most desktop OS software derives from SMP-capable server OS software. • Leads to higher throughput across multiple programs for Desktop • SMP (multi-threading) programming model is well understood in server/workstation markets • Lots of “embarrassingly parallel” problems and plenty of programmer experience • So Dual-Core will be exploited naturally here
Multi-threading Challenges for Desktop (1) • SMP/multi-threading programming models now become relevant to consumer desktop software • Higher performance for a single-program that “decomposes” • But SMP is a new realm for desktop software • Desktop apps were once not suitable for multi-threading. Now they are proliferating on the desktop: • Much of the complexity of multi-threading is hidden by Microsoft CLR, C#, VB.NET, and Java
Multi-threading Challenges for Desktop (2) • Desktop benefits from multi-cores by being able to run more applications/tasks in parallel • A single desktop application can also use multi-threading on a multi-core to do: • Multimedia CODECs • Games (through double-buffering strategies) • Productivity apps (background threads do complex processing while waiting for user input) • Speech and handwriting recognition • Prosumer digital content creation apps • Anti-virus software • GUI to give appearance of responsiveness
Multi-threading Challenges for Desktop (3) • The inhibitor: traditional desktop apps are complex, non-threaded legacy code; code needs to be totally re-written • Desktop Software Developers have little experience yet writing multi-threaded code • Problem domain for scientific applications is often regular: • Compilers and tools can often find cases to generate parallel threads automatically • These tools also support a rich set of directives for the programmer to explicitly create threads in a language-specific way
Multi-threading Challenges for Desktop (4) • Desktop developers often have to manually decompose their complex application into threads • The mechanics of threading are being made easier by “new” language environments • The decomposition analysis is the hard part • Traditional “Data-Parallelism” approach doesn’t fit desktop apps as well • Task-Level Parallelism likely will fit better • E.g., a game can have separate threads for physics, audio, graphics, and strategy
Amdahl’s Law (1) [Figure: timeline comparing the original program (Code0 (S) followed by Code1) with the threaded version (Code0 (S) followed by Code1 p0 and Code1 p1 running in parallel).] $\text{Speed Up} = \left(S + \frac{1-S}{P}\right)^{-1}$ • Decomposing non-threaded program produces “serial” piece & some “parallel” pieces • Parallel pieces can be threaded • Use Amdahl’s Law to estimate benefit of threading • Let “S” be the fraction of execution time spent in the serial parts of the serial version of the program (e.g., 0.25 for 25%) • Let “P” be number of threads issued on same number of CPU cores • Example: • program spends 25% of execution time in serial portion of serial version, then Speed Up on a dual-core could be 1.6x
Amdahl’s Law (2) [Figure: timeline comparing the original program (Code0 (S) followed by Code1) with the threaded version (Code0 (S) followed by Code1 p0 and Code1 p1 running in parallel).] $\text{Speed Up} = \left(S + \frac{1-S}{P}\right)^{-1}$ • Example: • program spends 75% of execution time in serial portion of serial version, then Speed Up on a dual-core could be 1.14x • Know Thy Program … • Amdahl’s law assumes overall structure of program doesn’t change going from serial to serial+parallel • If structure of serial+parallel is drastically different (and more efficient) you may do better than Amdahl’s law
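The two worked examples can be reproduced with a few lines of C. This sketch evaluates Speed Up = 1 / (S + (1 - S) / P), with S given as a fraction between 0 and 1 and P as the number of cores; the function name `amdahl_speedup` is hypothetical.

```c
#include <assert.h>

/* Estimated speed-up from Amdahl's Law.
 * serial_fraction: share of the serial version's run time that cannot
 *                  be threaded (0.0 to 1.0).
 * cores:           number of threads, each on its own CPU core. */
double amdahl_speedup(double serial_fraction, unsigned cores)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(cores > 0);
    return 1.0 / (serial_fraction +
                  (1.0 - serial_fraction) / (double)cores);
}
```

For S = 0.25 and P = 2 the denominator is 0.25 + 0.375 = 0.625, giving 1.6x; for S = 0.75 it is 0.75 + 0.125 = 0.875, giving about 1.14x. A fully serial program (S = 1) gets no benefit regardless of core count.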
General Programming Observations (1) • Experiment, but, for K-core system, schedule N threads, where N is: K-2 <= N <= K+2 • Use OS APIs (or CPUID functions) to test for number of cores • Using CPUID Hyperthreading fields to determine presence and number of cores works well • Works well for logical and physical cores • AMD follows generally accepted CPUID standard • Therefore, no need to test for “CPU vendor” before using CPUID Hyperthreading fields • Avoid heavy threading on single-core system • Threading has some small OS/Program overhead • May lead to threaded program running slower than serial version • Application should test at install time or run-time to determine if the number of available processors allow threading
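The thread-count guideline can be expressed as a small clamp. This is a sketch only: `clamp_thread_count` is a hypothetical helper, and raising the lower bound to 1 when K - 2 would be zero or negative is an assumption added here, not something the slide states.

```c
#include <assert.h>

/* Clamp a requested thread count N to the range K-2 <= N <= K+2 for a
 * K-core system. The lower bound never drops below 1 (an assumption
 * made for this sketch so that small systems still get one thread). */
unsigned clamp_thread_count(unsigned cores, unsigned requested)
{
    assert(cores > 0);
    unsigned low = cores > 2 ? cores - 2 : 1;
    unsigned high = cores + 2;
    if (requested < low) {
        return low;
    }
    if (requested > high) {
        return high;
    }
    return requested;
}
```

On a dual-core system a request for 16 threads would be reduced to 4, which is a starting point for experimentation rather than a tuned value.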
General Programming Observations (2) • Don’t bother trying to do explicit binding of threads to cores, Windows does a fine job automatically • If you are in High-Performance-Computing, feel free to ignore the above advice … • Windows memory allocation mechanism tries to optimize memory affinity even in large NUMA multi-core systems
VirtualAlloc( Address, Size, AllocationType, Protect ) • Standard Win32 API to allocate process virtual memory • Reserves (Allocates) or Reserves-and-Commits Virtual Memory • Does the right thing for NUMA and multi-core on Microsoft Windows 2003 SP1 despite the MSDN documentation … • Per MSDN: AllocationType = MEM_RESERVE • Just inserts a node into process’ VirtualAddressDescriptor tree • Does not map the virtual page to any physical page; this requires process to follow up with a commit later to the same region • Per MSDN: AllocationType = MEM_COMMIT • According to MSDN, reserves as above and commits • Commitment implies mapping Virtual Pages to zero’d-out physical pages
VirtualAlloc( … ) Continued • What really happens for AllocationType = MEM_COMMIT • Just inserts a node (entry) into the process VirtualAddressDescriptor tree • Commits page(s) only in that no explicit MEM_COMMIT is required afterwards • Later when (if) the thread accesses the page(s), page fault handler finds the node (entry) in VAD tree and tries to allocate physical memory • Attempts to allocate physical memory from the NUMA node the thread is currently on at the time of the page fault • If not available on that NUMA node, grabs it from another • Means a parent can allocate memory for child threads and still maintain desired thread-to-memory affinity • Affinitizing your threads increases chances that at page fault time, thread will be on the desired processor
Global Heap • The global heap operates under the same rules (calls VirtualAlloc too) • the only difference may be when they decide to access the virtual addresses (preload or wait till later) • Avoid Global Heap by using Process-Private heaps: standard recommendation for threading • HeapAlloc() and friends also use VirtualAlloc() • Use Local, Low Fragmentation Heap • Global/Local Heap Algorithms are too complex to predict how various heap memory is affinitized for default Heap
Low Fragmentation Heap (LFH) • Added to Windows XP and Windows Server 2003 • An application can use LFH for private heaps • Alleviates a lot of global heap unknown affinity issues on a NUMA Multi-core system • Use HeapCreate() followed by HeapSetInformation() to specify LFH • LFH Allocations are grouped in buckets and are local to a virtual slot. Assignment of one thread to a particular slot is done automatically when contention is detected. • Note that the C runtime heap is LFH by default on 64-bit platforms starting with Windows 2003 SP1 • malloc() and free() should show significant improvement on 64-bit NUMA, multi-core systems • LFH has to be explicitly set for 32-bit Windows
CodeAnalyst Thread Analysis • Identifies threads in the target application • Shows thread creation and termination • Monitors CPU affinity of each thread • Identifies non-local memory access • Graphs thread activity on each CPU
Miscellaneous Windows NUMA APIs • GetNumaHighestNodeNumber() = retrieves highest numbered node in system, but doesn’t guarantee that nodes are sequentially numbered. • GetProcessAffinityMask() = retrieves the process affinity mask and the system affinity mask (the processors configured in the system). • GetNumaProcessorNode() = returns NUMA node for specified processor. • GetNumaNodeProcessorMask() = retrieves list of all processors for a node. • SetProcessAffinityMask() = sets affinity for all threads in process. • SetThreadAffinityMask() = sets affinity for individual thread. • GetNumaAvailableMemoryNode() = retrieves the amount of memory available to a node.
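Several of these calls report processors as a bit mask in which bit n stands for processor n. The sketch below shows how such a mask could be interpreted. To keep it portable, the Windows calls themselves are not made: a plain `uint64_t` stands in for the mask value an application would obtain from, for example, GetProcessAffinityMask() or GetNumaNodeProcessorMask(), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many processors are present in an affinity mask. */
unsigned count_processors_in_mask(uint64_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Test whether a given processor number is included in the mask. */
int mask_contains_processor(uint64_t mask, unsigned processor)
{
    assert(processor < 64);
    return (int)((mask >> processor) & 1u);
}

/* Build a mask that selects exactly one processor, in the form that
 * would be passed to a call such as SetThreadAffinityMask(). */
uint64_t single_processor_mask(unsigned processor)
{
    assert(processor < 64);
    return (uint64_t)1 << processor;
}
```

An application could use the count from the process mask to decide at run time whether enough processors are available to justify threading.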
Dual-Core & Software Licensing Trends • Multi-core processors - Do you count cores? • Virtualization - License by virtual machine? • Hosted computing - Calculate time and resources? • The software industry is facing a shift in pricing & licensing models that rely on a traditional view of hardware technology. • Customers are demanding software licensing to reflect the amount of work done, not the characteristics of the processor. • AMD recommends that ISVs that license by processor continue to do so, rather than switching to licensing by processor core. • This helps ensure seamless software compatibility with existing x86 and AMD64 operating systems and apps -- whether they are single-threaded or multi-threaded.
Software Licensing Trends • Recently, Microsoft has clarified its software licensing position with respect to processor and cores. • Microsoft server software that is currently licensed by number of processors on the server will continue to be licensed in that model for server hardware that contains dual-core and multi-core processors. • This policy helps ensure that customers will not incur additional software licensing requirements or fees when they choose to adopt multi-core processor technology. • AMD applauds this decision which will help make this new enterprise computing technology affordable to customers, including mid-size and small businesses.
Some References for Multi-Threading • Programming Windows, Fifth Edition, by Charles Petzold • Microsoft Press, ISBN: 157231995X • Multithreading Applications in Win32: The Complete Guide to Threads, by Beveridge & Wiener • Addison-Wesley Professional; ISBN: 0201442345 • Multithreaded Programming with Win32, by Pham & Garg • Prentice Hall PTR; ISBN: 0130109126
AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Pentium is a registered trademark of Intel Corporation in the U.S. and/or other jurisdictions. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. HyperTransport is a trademark of the HyperTransport Technology Consortium. Other licensed names are for informational purposes only and may be trademarks of their respective owners.