NSYSU p595 Introduction March, 2008
Contents
• POWER5/POWER6 Architecture
• Compiler Options
• Large / Medium Page
• Libraries
• OpenMP usage
• MPI & PE usage
• LoadLeveler usage
• Performance Tuning
POWER Processor Roadmap
[Figure: chip diagrams per generation showing cores, caches, shared L2 and distributed switch]
• 2001-4: POWER4 / 4+ (180 nm, 130 nm): chip multiprocessing, distributed switch, shared L2, dynamic LPARs (32)
• 2004-6: POWER5 / 5+ (130 nm, 90 nm): enhanced scaling, Simultaneous Multi-Threading (SMT), enhanced distributed switch, enhanced core parallelism, improved FP performance, increased memory bandwidth, reduced memory latencies, virtualization
• 2007-9: POWER6 / 6+ (65 nm): ≥3.5 GHz cores with AltiVec, enhanced virtualization, advanced memory subsystem, decimal floating point, checkpoint restart, enhanced architecture for higher frequencies
• 2010-13: POWER7 (45 nm): design phase, requirements definition
• Binary compatibility across generations
[Figure: POWER5 packaging hierarchy. Processor (CPU, core) → chip (socket) → module (Dual Chip Module (DCM) in the p5-575, Multi Chip Module (MCM) in the p5-595) → pSeries server (system, node, box) → e1600 cluster]
POWER5 Systems
• POWER5 processors
  • Single or dual processors on each chip
• Modules
  • Dual Chip Modules (DCM)
  • Multi Chip Modules (MCM)
• Nodes
  • Multiple modules
  • p5-575, p5-595
  • SMP within a node
• Cluster
  • Multiple nodes
  • Connected with the High Performance Switch (HPS)
POWER Processor Progression
POWER = Performance Optimization With Enhanced RISC
POWER5 Memory Hierarchy
• Registers: immediately usable
• Private L1 cache, on processor: ~1 cycle delay
• Shared L2 cache, on chip: 12 cycle delay
• Shared L3 cache, on module: ~80 cycle delay
• Interleaved memory: ~220 cycle delay
• Hardware prefetch
• Multiple page size support: small / medium / large pages
• Example in daily life: go shopping at Taipei 101
[Figure: two processors (P), each with its own registers and L1, sharing L2, L3 and memory]
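To see why the hierarchy matters, the latencies on this slide can be folded into an average access cost. The following is a rough sketch rather than a measurement: only the cycle counts (1, 12, 80, 220) come from the slide, while the access mixes and the function name are invented for illustration.

```python
# Latencies in cycles, taken from the slide above.
LATENCY = {"L1": 1, "L2": 12, "L3": 80, "memory": 220}

def average_access_cycles(served_by):
    """Weighted average latency.

    served_by maps a level name to the fraction of accesses served there.
    """
    total = sum(served_by.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY[level] * frac for level, frac in served_by.items())

# Hypothetical access mixes (assumptions, not POWER5 measurements).
cache_friendly = {"L1": 0.95, "L2": 0.04, "L3": 0.008, "memory": 0.002}
cache_hostile = {"L1": 0.70, "L2": 0.15, "L3": 0.10, "memory": 0.05}

print(round(average_access_cycles(cache_friendly), 2))
print(round(average_access_cycles(cache_hostile), 2))
```

Under these assumed mixes the cache-friendly case averages about 2.5 cycles per access and the cache-hostile case about 21.5, which is the gap that blocking, prefetching and larger pages try to close.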
Caches and Memory Hierarchy * if all memory DIMM slots occupied FIFO = First in first out, LRU = Least Recently Used
POWER5 Processor Characteristics
• High frequency clocks
• Deep pipelines
• High asymptotic rates
• Superscalar
• Speculative out-of-order instructions
• Up to 8 outstanding cache line misses
• Large number of instructions in flight
• Branch prediction
• Hardware prefetching
Multiple Functional Units
• Symmetric functional units
• Two Floating Point Units (FPU)
• Three Fixed Point Units (FXU)
  • Two Integer
  • One Control
• Two Load/Store Units (LSU)
• One Branch Processing Unit (BPU)
[Figure: functional unit blocks: Control, FMA (×2), Fixed Pt. (×2), Load/Store (×2), Branch]
POWER5 Design: Summary
• More gates
  • From 170 million to 260 million
• Enhancements
  • Increased cache associativity
  • Increased number of rename registers
  • Reduced L3 and cache latency
• New features
  • Simultaneous Multi-Threading
  • Dynamic power management
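POWER6* Core
• The POWER6 processor clock is roughly twice that of POWER5 (4-5 GHz)
• The POWER6 instruction pipeline depth is comparable to that of POWER5
• "We doubled the frequency while keeping the pipeline depth unchanged, so that each logic stage does more work. The goal is to get more logic function out of each transistor."
• Higher frequency brings a corresponding gain in performance
• Sam Naffziger, director of Itanium circuits and technology at Intel, said, "IBM has done very capable work tuning the circuits to get a higher frequency out of the same pipeline." He continued, "Frequency does matter. If you can keep the same pipeline depth and power, a higher frequency will of course improve performance." (http://www.eetchina.com/ART_8800414475_617693_d30f1183200604.HTM, April 14, 2006)
[Figure: pipeline stages (Instruction Fetch, Instruction Buffer/Decode, Instruction Dispatch/Issue, Data Fetch/Execute) with FXU dependent execution and load dependent execution; labels ~6ns/instr and ~3ns/instr]
• POWER6 extends the functionality of the POWER5 core
  • Decimal Unit
  • VMX Unit (AltiVec vector unit)
  • Recovery Unit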
Chip enhancement highlights: POWER6 vs. POWER5
• Cache
  • 64K I-cache, 64K D-cache (POWER5: 64K I-cache, 32K D-cache)
  • 4 MB private L2 cache per core (POWER5: 1.92 MB shared by two cores)
  • 32 MB L3 cache per chip (POWER5: 36 MB)
• Memory controller
  • Dual memory controllers (POWER5: single memory controller)
• Interconnect fabric
  • 3 intra-node SMP buses (POWER5: 2 intra-node SMP buses)
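POWER6* scales chip capabilities with core performance
• POWER6 offers very high bandwidth to the processor. At 5 GHz, each processor chip has about 300 GB/s of bandwidth:
  • 80 GB/s from the L3 cache
  • 75 GB/s from memory
  • 80 GB/s from the intra-MCM buses
  • 50 GB/s from remote processors
  • 20 GB/s from local I/O
• POWER6 bandwidth is generally double that of POWER5+ systems
  • This comes from the higher frequency and from some newly added interfaces
  • The I/O frequency was raised from one third to one half of the CPU frequency
• The non-core functions of POWER6 all run at half the core frequency, between 2 GHz and 2.5 GHz, whereas on the various POWER5+ processors they run at roughly 0.8 GHz to 1.15 GHz
[Figure: POWER5 p5-570 at 2.2 GHz compared with POWER6 at 5 GHz; blocks: memory, L3, L3 directory, shared L2 (1.92 MB), memory controller, GX bus, chip to chip; bandwidth labels: 35.2 GB/s, 25.58 GB/s, 4.4 GB/s]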
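As a quick sanity check on the per-chip figure, the five contributions quoted on this slide can simply be added up. This is only a small arithmetic sketch using the slide's own numbers; the dictionary keys are descriptive labels chosen here.

```python
# Per-chip bandwidth contributions at 5 GHz, in GB/s, as listed on the slide.
sources = {
    "L3 cache": 80,
    "memory": 75,
    "intra-MCM buses": 80,
    "remote processors": 50,
    "local I/O": 20,
}

total = sum(sources.values())

# Print each contribution with its share of the total.
for name, gbs in sources.items():
    print(f"{name:18s} {gbs:3d} GB/s  {100 * gbs / total:5.1f}%")
print("total", total, "GB/s")
```

The items add up to 305 GB/s, consistent with the rounded 300 GB/s quoted for the chip.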
Features and functions
• In-core hardware accelerators: decimal floating point and AltiVec™ (VMX)
• Virtualization enhancements
• 3rd-generation multi-threading
Quick Reference Page: Cheat Sheet
• Which Fortran compiler to use
• Compiler options for performance
  • -O3, -qarch=pwr5, -qtune=pwr5 (use these at minimum)
  • -qhot (high-order transformation)
  • -pg (profiling)
  • -qstrict (do not alter the semantics of the program)
  • -qipa (interprocedural analysis)
IBM Compiler Names There are a lot more, including fort77, cc99_128, xlc128_r7…
C Compiler Invocations
• Two compilers: C and C++
• C is largely a subset of C++
Fortran Compiler Invocations
One Fortran compiler, multiple invocations.
xlf_r and mpxlf Example: Hello, World
      program hello
      print *, 'Hello, World'
      end
% xlf_r hello.f -o hello        <<< using xlf_r
% hello
Hello, World
% mpxlf hello.f -o hello        <<< using mpxlf
% hello
ERROR: 0031-808 Hostfile or pool must be used to request nodes
% hello -procs 4 -hostfile hostfile
Hello, World
Hello, World
Hello, World
Hello, World
mpxlf enables the binary to run in SPMD mode across multiple CPUs.
xlf Version 10.1 • Traditional allowable extensions: • .f • .F (will pass through cpp before compiling) • New allowable extensions: • .f77 • .f90 • .f95
Address Mode: -q{32,64}
• Available application modes:
  • -q32 (default)
  • -q64
• Also settable through the environment variable OBJECT_MODE:
  • export OBJECT_MODE={32,64}
• -q32 objects cannot be mixed with -q64 objects
• Be aware of the AIX kernel modes:
  • 32-bit
  • 64-bit
• An application's address mode is independent of the AIX kernel mode
One more thing about 64-bit…
• If you use -q64:
  • Your job can use much more memory than with -q32
  • INTEGER*8 or long long operations are faster
• If you use -q32:
  • You may run a few percent (~10%) faster
    • Fewer bytes are used storing and moving pointers
  • You will have to learn the AIX link option -bmaxdata
    • -bmaxdata:0x10000000 = 256 MB = default
    • -bmaxdata:0x80000000 = 2 GB
    • -bmaxdata:0xC0000000 = not widely publicized trick to use more than 2 GB with -q32
    • "C" is the maximum
• With -q64:
  • -bmaxdata:0 = default = unlimited
  • Other -bmaxdata values will be enforced if set
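The -bmaxdata values on this slide are byte counts written in hexadecimal, and each step of 0x10000000 corresponds to one 256 MB segment of the 32-bit address space. The sketch below only does that conversion for the three values quoted above; the helper name is made up for illustration and nothing here calls the linker.

```python
SEGMENT = 0x10000000  # one 256 MB segment of the 32-bit address space

def describe_maxdata(value):
    """Return (bytes, MiB, number of 256 MB segments) for a -bmaxdata hex string."""
    nbytes = int(value, 16)
    return nbytes, nbytes // 2**20, nbytes // SEGMENT

# The three settings mentioned on the slide.
for flag in ("0x10000000", "0x80000000", "0xC0000000"):
    nbytes, mib, segments = describe_maxdata(flag)
    print(f"-bmaxdata:{flag} -> {mib} MiB ({segments} segments)")
```

Read this way, 0x80000000 is 8 segments (2 GB) and 0xC0000000 is 12 segments (3 GB), which is why the leading hex digit is the part worth remembering.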
Even more on 64-bit... (because it is so often confused)
• 64-bit floating point representation means higher precision
  • Fortran: REAL*8, DOUBLE PRECISION
  • C/C++: double
  • You can use 64-bit floating point with -q32 or -q64
• 64-bit addressing is totally different. It refers to how many bits are used to store memory addresses and, ultimately, how much memory one can access.
  • Compile and link with -q64
  • Use "file a.out myobj.o" to query the addressing mode
• The AIX kernel can be a build that uses either 32-bit or 64-bit addressing for kernel operations, but that does not affect an application's addressability.
  • Use "ls -l /unix" to find out which kernel is used
  • Certain system limits depend on the kernel chosen
Suggested Fortran Compiler Usage: xlf90_r -q64
• Fortran 90 is the most portable standard
  • Consistent storage
  • Dynamic
• Reentrant code (..._r). Required for:
  • pthreads
  • Many other programming utilities
• 64-bit addressing:
  • Memory management
Suggested C Compiler Usage: xlc_r -q64
• Reentrant code (..._r). Required for:
  • pthreads
  • Many other programming utilities
• 64-bit addressing:
  • Better memory management
C and C++ Data Type Sizes
long and pointer types change size with -q{32,64}
Fortran Data Type Sizes
Pointers change size with -q{32,64}
Memory Management
• -bmaxdata: extends addressability to 2 GB in 32-bit mode, e.g. "-bmaxdata:0x80000000" (0x70000000 for MPI)
• -bmaxstack: similar to -bmaxdata, but for the stack
Memory Allocation: Summary • Programming advice: • Fortran ALLOCATE • C malloc
Summary: Commonly Used Options
• -q32, -q64
• -O0, -O2, -O3, -O4, -O5
• Large/medium memory page setup
• -qmaxmem=-1 (allow maximum memory for compiling)
• -qarch=, -qtune=
• -qhot (high-order transformation)
• -g (debugging)
• -p, -pg (profiling)
• -qstrict (do not alter the semantics of the program)
• -qstatic
• -qipa (interprocedural analysis)
• -qieee
• -qlist (assembly language report)
• -qsmp
• -qreport (SMP listing when -qsmp is also used)
Profiling Your Code
• Compile the code with -p (or -pg); the compiler sets up the object file for profiling (or call-graph profiling)
• Execute the program; a mon.out (or gmon.out) file is created
• Use the prof (or gprof) command to generate a profile
• Or run xprofiler a.out gmon.out (if you can open an X window)
Example with -p
• xlf95 -p needs_tuning.f
• a.out (mon.out is created)
• prof
Example with -pg
• xlf95 -pg needs_tuning.f
• a.out (gmon.out is created)
• gprof a.out gmon.out
• xprofiler a.out gmon.out
Large Pages Option: -qlargepage
• This option DOES NOT enable large pages
  • See the software chapter for the detailed large page discussion
• It instructs the compiler to exploit the large page heaps available on POWER4 and POWER5 systems
• It is a HINT to the compiler:
  • Heap data will be allocated from the large page pool
  • Actual control comes from the loader option -blpdata or LDR_CNTRL=LARGE_PAGE_DATA
  • The compiler will (might) divert large data from the stack to the heap
  • The compiler may bias optimization of heap or static data references
Quick Reference Page: Cheat Sheet. Enable an application for medium and large pages
To use 64K medium pages:
  ldedit -bnolpdata a.out
  ldedit -bdatapsize:64K -bstackpsize:64K -btextpsize:64K a.out
To use 16M large pages:
  ldedit -bdatapsize:0 -bstackpsize:0 -btextpsize:0 a.out
  ldedit -blpdata a.out
To check large page usage:
  vmstat -l 5
To set up memory affinity:
  export MEMORY_AFFINITY=MCM
Unit of Memory = page
• Various page sizes are supported by AIX
  • Small page size = 4096 bytes (default)
  • Large page size = 16 MB, introduced for POWER4
  • Medium page size = 64 KB, introduced for POWER5+
• Benefits of medium and large pages:
  • Enhanced memory bandwidth
    • Prefetch performance is limited by page size
  • Enhanced Translation Lookaside Buffer (TLB) coverage
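The TLB coverage point can be made concrete with a little arithmetic: the memory reachable without a TLB miss is the number of TLB entries times the page size. The sketch below assumes a hypothetical 1024-entry TLB purely for illustration; only the three page sizes come from the slide.

```python
TLB_ENTRIES = 1024  # assumption for illustration, not a quoted hardware figure

# Page sizes from the slide, in bytes.
PAGE_SIZES = {
    "small": 4 * 1024,
    "medium": 64 * 1024,
    "large": 16 * 1024**2,
}

def tlb_coverage_bytes(page_size, entries=TLB_ENTRIES):
    """Bytes of memory addressable through the TLB without a miss."""
    return page_size * entries

for name, size in PAGE_SIZES.items():
    mib = tlb_coverage_bytes(size) // 2**20
    print(f"{name:6s} pages: {mib} MiB covered")
```

With that assumed entry count, coverage grows from 4 MiB with small pages to 64 MiB with medium pages and 16 GiB with large pages, so a working set of a few hundred megabytes stops thrashing the TLB only in the last case.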
Large Page
• To allocate large pages:
  • Statically at boot time
  • Or dynamically with AIX 5.3
• Large pages are needed for /gpfs support
• Small pages are needed for AIX and other jobs
• Recommendation:
  • No more than 85% of memory in large pages
• A major catch:
  • A large-paged application can use small pages
  • A small-paged application cannot use large pages
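The 85% recommendation translates into a concrete upper bound on the number of 16 MB pages an administrator would reserve. The following sketch works that out for a hypothetical node with 64 GiB of memory; the node size and the function name are assumptions, and the actual reservation commands are not shown.

```python
LARGE_PAGE_BYTES = 16 * 1024**2  # 16 MB large page

def max_large_pages(total_bytes, fraction=0.85):
    """Largest whole number of 16 MB pages within the given fraction of memory."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(total_bytes * fraction) // LARGE_PAGE_BYTES

# Hypothetical node with 64 GiB of memory.
node_memory = 64 * 2**30
pages = max_large_pages(node_memory)
print(pages, "large pages,", pages * 16 // 1024, "GiB (rounded down)")
```

For the assumed 64 GiB node this gives 3481 large pages, leaving the remaining memory in small pages for AIX and for jobs that are not marked for large pages.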
New: Medium Page Summary
• Performance is the same as with large pages for most applications
• Allocated dynamically by the system as needed
• Different ldedit options:
  • ldedit -bdatapsize:64K a.out
  • ldedit -bstackpsize:64K a.out
  • ldedit -btextpsize:4K a.out
• None of the administrative headaches that large pages can cause
How To Mark a Binary (Executable) to Use Large or Medium Pages
• Large pages: applicable to POWER4, POWER5, and POWER5+ systems running AIX 5.1, 5.2 and 5.3.
  ldedit -bdatapsize:0 -bstackpsize:0 -btextpsize:0 a.out
  ldedit -blpdata a.out
• Medium pages: applicable to POWER5+ systems running AIX 5.3.
  ldedit -bnolpdata a.out
  ldedit -btextpsize:64K -bdatapsize:64K -bstackpsize:64K a.out
How To Mark a Binary (Executable) to Use Large or Medium Pages (continued)
• Removing all large and medium page marking:
  ldedit -bnolpdata a.out
  ldedit -bdatapsize:0 -bstackpsize:0 -btextpsize:0 a.out
• Checking for large or medium page marking:
  /vol/local/bin/lsxcoff a.out
Advantage of Large Pages: Bandwidth Enhancement POWER5 1.45 GHz
Large Page Usage
• Enable your code for large pages using the loader or ldedit:
  • $ xlf .... -blpdata -o a.out
  • $ /usr/bin/ldedit -blpdata a.out
• Environment variable (NOT RECOMMENDED!!):
  • $ LDR_CNTRL=LARGE_PAGE_DATA={Y,N,M}
  • Y: Yes ("advisory") mode
    • Use large pages if available
    • This mode is used by the loader and ldedit
  • N: No large pages
  • M: Mandatory
    • Do not run if large pages are not available
• Run "env LDR_CNTRL=LARGE_PAGE_DATA=M date" to see whether large pages are working for you
Large Page Summary
• Set up memory pools
  • Decide the amount of memory in small and in large pages
• Administrator:
  • vmo ….
  • chuser capabilities= …
• Tag the binary:
  • {xlf,xlc} … -blpdata -o a.out
  • ldedit … -blpdata a.out
• Run (ksh): $ LDR_CNTRL=LARGE_PAGE_DATA=M a.out
$ vmstat -l
kthr   memory               …  cpu           large-page
r  b   avm       fre           us  sy id wa   alp flp
1  0   18411462  14344396      296 2  0  98   12  4084
0  0   18411463  14344395      2   0  98 0    12  4084
alp: allocated large pages
flp: free large pages
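The alp and flp columns are the ones to watch when checking whether a job really picked up large pages. As a minimal sketch, the snippet below pulls them out of one line of the output shown on this slide; it assumes alp and flp are the last two fields, as in that sample, and the function name is invented for illustration.

```python
LARGE_PAGE_BYTES = 16 * 1024**2  # 16 MB large page

def large_page_stats(vmstat_line):
    """Return (alp, flp, pool size in GiB) from one data line of `vmstat -l`.

    Assumes alp and flp are the final two whitespace-separated fields.
    """
    fields = vmstat_line.split()
    alp, flp = int(fields[-2]), int(fields[-1])
    pool_gib = (alp + flp) * LARGE_PAGE_BYTES / 2**30
    return alp, flp, pool_gib

# Second data line from the slide's sample output.
sample = "0 0 18411463 14344395 2 0 98 0 12 4084"
alp, flp, pool_gib = large_page_stats(sample)
print(f"allocated={alp} free={flp} pool={pool_gib:.0f} GiB")
```

In the sample, 12 allocated plus 4084 free pages is 4096 pages, i.e. a 64 GiB large page pool of which almost nothing is in use, which would suggest the running jobs are not marked for large pages.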
CompChem App Performance Comparison: p575, 64 CPUs. David Chen, Dec 10, 2006
YH: submitted results; S, M, L: small, medium, large page; US: US window
Using identical executable binary