ASE112: Adaptive Server Enterprise Performance Tuning on Next Generation Architecture
Prasanta Ghosh
Sr. Manager, Performance & Development
pghosh@sybase.com
August 15-19, 2004
The Enterprise. Unwired.
Industry and Cross Platform Solutions: Manage Information • Unwire Information • Unwire People
• Adaptive Server Enterprise
• Adaptive Server Anywhere
• Sybase IQ
• Dynamic Archive
• Dynamic ODS
• Replication Server
• OpenSwitch
• Mirror Activator
• PowerDesigner
• Connectivity Options
• EAServer
• Industry Warehouse Studio
• Unwired Accelerator
• Unwired Orchestrator
• Unwired Toolkit
• Enterprise Portal
• Real Time Data Services
• SQL Anywhere Studio
• M-Business Anywhere
• Pylon Family (Mobile Email)
• Mobile Sales
• XcelleNet Frontline Solutions
• PocketBuilder
• PowerBuilder Family
• AvantGo
Sybase Workspace
What will we learn?
• Processor Trends
  • relevant to the database world
  • present architectural issues
  • compiler technology
• ASE Architecture
  • adapting to new processors
  • keeping up with OLTP performance
• Discuss some of the hot performance-related topics
• Questions • Discussions • Interactive
Processor: CISC, RISC and EPIC
• CISC (Complex Instruction Set Computing)
  • Intel's and AMD's x86 processor families
• RISC (Reduced Instruction Set Computing)
  • goal: optimize performance with simpler instructions
• EPIC (Explicitly Parallel Instruction Computing)
  • goal: move beyond RISC performance bounds with explicitly parallel instruction streams
Processor Speed
• Is a higher clock speed always better?
  • Not always
  • e.g., 3.0GHz Xeon vs. 1.5GHz Itanium 2
Processor Speed: ASE behavior
• Obviously
  • faster processing
  • better response time
• Plus
  • more context switches: e.g., 112,296 vs. 522,115 per minute
  • but not when the engines are idling
  • demands more from disk IO performance
Processor Architecture: 64-bit Processing
• 64-bit data and addressing: better performance
• A must for large database environments
• Two versions of the OS kernel and ASE for the same platform
• Do I need 64-bit if I don't need to address more than 4GB of memory?
ASE: Network and Engine Affinity
• Network Affinity
  • user connection to ASE
  • an idling or least-loaded engine picks up the incoming connection
  • network IO for the user task is performed by that engine
• Engine Affinity
  • related to process scheduling
  • soft binding is automatic (user transparent)
  • application partitioning can be used for hard binding
  • a task runs on its engine as long as it can
• Network affinity remains unchanged
  • unless that engine is taken offline
• Engine affinity changes due to
  • the stealing algorithm
  • critical resource contention
ASE: Engine Affinity
• Scheduling
  • engine-local runnable queue
  • global runnable queue
• Tasks are mostly in the engine-local runnable queue
  • occasionally in the global runnable queue
• Engine stealing (see the sketch below)
[Diagram: per-engine runnable queues (Engine 0 queue, Engine 1 queue) alongside a shared kernel queue]
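A minimal C sketch of the scheduling order described above: local queue first, then the global queue, then stealing from a peer engine. All names and the queue layout are illustrative, not ASE's actual internals, and locking is omitted for brevity.

    #include <stddef.h>

    /* Hypothetical task and engine structures, for illustration only */
    typedef struct task {
        struct task *next;
    } task_t;

    typedef struct engine {
        task_t *runq;            /* engine-local runnable queue */
    } engine_t;

    static task_t *global_runq;  /* global runnable queue shared by all engines */

    static task_t *pop(task_t **q)
    {
        task_t *t = *q;
        if (t)
            *q = t->next;
        return t;
    }

    /* Pick the next task to run on this engine */
    static task_t *next_task(engine_t *self, engine_t *peers, int npeers)
    {
        task_t *t = pop(&self->runq);       /* common case: preserves affinity */
        if (!t)
            t = pop(&global_runq);          /* occasionally: global queue */
        for (int i = 0; !t && i < npeers; i++)
            t = pop(&peers[i].runq);        /* "engine stealing": affinity changes */
        return t;
    }

Preferring the local queue keeps a task on the same engine, which is what keeps its data warm in that CPU's cache; stealing only happens when an engine would otherwise idle.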
Processor Architecture: some more
• Hyper-Threading
  • Intel Xeon processor
• HyperTransport
  • high-speed, low-latency, point-to-point link
  • data throughput 22.4GB/sec
• Dual core
  • PA-8800, POWER5
• Chip Multithreading Technology (CMT)
  • Sun UltraSPARC IV
• Non-Uniform Memory Access (NUMA)
  • critical for large database applications using huge memory
• Large register set
  • Itanium 2 has 128 registers
Hyper-Threading and ASE
• Should I enable Hyper-Threading for ASE?
• Our experience:
  • on a single-CPU system, Hyper-Threading helps
  • on SMP systems, Hyper-Threading does not always help
    • Linux AS 2.1 has a scheduling issue, fixed in RHEL 3.0
  • it does not help on a highly active system where engines are fully utilized
  • we have not seen the often-quoted 30% gain in any ASE configuration
Processor Architecture Limits and EPIC Solutions
• Problem: memory/CPU latency is already large and growing
  • Solution: speculative loads for data and instructions
• Problem: increasing amount of conditional and/or unpredictable branches in code
  • Solution: predication and prediction of branches and conditionals, orchestrated by the compiler to use the EPIC architecture
• Problem: complexity of multiple pipelines is too great for effective on-chip scheduling
  • Solution: the compiler handles scheduling and produces code that takes advantage of on-chip resources
• Problem: register and chip resource availability limits parallelism
  • Solution: increase the number of registers by 4x (32 to 128)
Traditional Architecture Limiters
[Diagram: the compiler turns original source code into sequential machine code; the hardware must re-parallelize it at run time across multiple functional units, so the available execution units are used inefficiently — today's processors are often 60% idle]
Explicit Parallelism
• Instruction-Level Parallelism (ILP) is the ability to execute multiple instructions at the same time
• Explicitly Parallel Instruction Computing (EPIC) allows the compiler or assembler to specify the parallelism
• The compiler specifies instruction groups: lists of instructions with no dependencies that can be executed in parallel
  • a stop bit or a taken branch marks an instruction-group boundary
• Instructions are packed in bundles of 3 instructions each
  • bundle format: a 5-bit template plus three 41-bit instructions (128 bits total), with stops marking group boundaries
  • the template field directly maps each instruction to an execution unit, allowing easy parallel dispatch of the instructions
Processor Architecture: TLB miss
• Translation Lookaside Buffer (TLB)
  • a fixed-size table
  • the processor uses it to translate virtual addresses to physical addresses on every memory access
• Large memory configurations
  • common for database applications
  • more chances of a TLB miss
• Locking the shared memory
• Variable OS page size
  • 4KB vs. 8MB or 16MB: larger pages mean each TLB entry covers far more memory, so fewer misses (see the sketch below)
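A sketch of the two ideas on this slide — large OS pages and locked shared memory — using Linux System V shared memory. SHM_HUGETLB and SHM_LOCK are Linux-specific; the segment size is illustrative.

    #define _GNU_SOURCE          /* for SHM_HUGETLB on Linux */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void)
    {
        size_t sz = 256UL * 1024 * 1024;   /* 256MB segment, size illustrative */

        /* SHM_HUGETLB backs the segment with large pages, so one TLB
           entry covers megabytes instead of 4KB */
        int id = shmget(IPC_PRIVATE, sz, IPC_CREAT | SHM_HUGETLB | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        void *addr = shmat(id, NULL, 0);
        if (addr == (void *)-1) { perror("shmat"); return 1; }

        /* "Locking the shared memory": pin it so it is never paged out */
        if (shmctl(id, SHM_LOCK, NULL) < 0)
            perror("shmctl(SHM_LOCK)");

        shmdt(addr);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }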
Processor speed vs. Memory Access
• CPU speed doubles every 1.5 years
• Memory speed doubles every 10 years
• High-speed CPUs are mostly underutilized
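A back-of-the-envelope consequence of those two rates: over one decade, CPU speed grows by about 2^(10/1.5) ≈ 100x while memory speed merely doubles, so the processor/memory gap widens by roughly 50x. That widening gap is why the cache and TLB techniques on the surrounding slides matter so much.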
Reduce Memory Latency
• Internal cache: L1, L2, L3
  • memory closer to the processor
  • on-chip or off-chip
  • may be shared by CPUs
  • separate data and instruction caches
Internal Cache
• ASE is optimized to make use of the L1/L2/L3 caches
• Database applications are memory intensive
• New systems: what to watch for?
  • higher clock speed
  • higher front-side bus speed
  • large L1/L2/L3 caches
  • lower memory latency
• Follow OEM guidelines
  • e.g., use same-speed memory DIMMs
Internal Cache: Shared L2/L3 Cache
• Level 2 cache boosts performance
  • the size (32MB) and proximity of the L2 cache to the processors increases performance for many workloads
• More than the CPUs inside the processor module: an on-chip cache controller speeds access and protects data
  • on-chip tags help the cache controller quickly locate and send data to the CPU
  • ECC protection for data tags, cached data, and in-flight data
[Diagram: PA-8800 dual CPUs sharing an L2 cache over the L2 cache bus, attached to the system bus]
Internal Cache: ASE optimizations
• Smaller memory footprint
• Avoid random access of memory
• Only a few OS processes
• Structure alignment
• Minimize cross-engine data access
• Compiler optimizations to pre-fetch data
• Better branch prediction (see the sketch below)
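A small C sketch of the last two bullets using GCC builtins (__builtin_prefetch and __builtin_expect, both real GCC intrinsics). The list-walking code itself is hypothetical, not ASE source.

    #include <stddef.h>

    /* Hint to the compiler that the condition is almost always true */
    #define likely(x)  __builtin_expect(!!(x), 1)

    typedef struct node { struct node *next; int key; } node_t;

    int find(node_t *head, int key)
    {
        for (node_t *n = head; n != NULL; n = n->next) {
            /* pre-fetch the next node into cache while examining this one */
            if (n->next)
                __builtin_prefetch(n->next, 0 /* read */, 1 /* low locality */);
            if (likely(n->key != key))
                continue;        /* predicted-taken path stays in the loop */
            return 1;
        }
        return 0;
    }

The prefetch overlaps the memory latency of the next node with useful work on the current one, and the likely() hint lets the compiler lay out the hot path fall-through, improving branch prediction.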
ASE FBO Server: Speculation
• Allows the compiler to issue an operation early, before a dependency
• Removes the latency of the operation from the critical path
• Helps hide long-latency memory operations
• Two types of speculation:
  • control speculation: execution of an operation before the branch that guards it
  • data speculation: execution of a memory load prior to a preceding store that may alias with it
ASE FBO Server: Predication
• Allows instructions to be conditionally executed
• A predicate register operand controls execution
• Removes branches and their associated mispredict penalties
• Creates larger basic blocks and simplifies compiler optimizations
Example:
         cmp.eq p1,p2 = r1,r2 ;;
    (p1) add    r1 = r2, 4
    (p2) ld8.sa r7 = [r8], 8
If p1 is true, the add is performed; otherwise it acts as a nop.
If p2 is true, the ld8 is performed; otherwise it acts as a nop.
ASE FBO Server: Optimizations!
• Profile-guided optimization
  • also known as FBO or PBO
• Run a typical load using an instrumented server
• Collect execution-profile data
• Recompile to generate highly optimized code!
• Anywhere between 10-40% performance gain
ASE architecture: High level view
[Diagram: ASE engines (Engine 0, Engine 1, ... Engine N) run as OS processes on the CPUs, each with its own registers and file descriptors, handling its own network and disk I/O; they share the executable (program memory) and a shared memory region holding the procedure cache, run queues, sleep queues, lock chains, and pending I/Os, all on top of the operating system and disks]
Cacheline and Data structure
• Main-memory-to-internal-cache transfers happen in cache-line-sized chunks
  • 32 bytes, 64 bytes, or 128 bytes
• In database applications, load misses consume almost 90% of CPU cycles
• Avoid load misses by rearranging the fields in a structure (a rearranged version is sketched below)
  • write-only fields
  • read-only fields
  • fields accessed simultaneously

    struct Process {
        int  id;
        char name[200];
        int  state;
    };
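In the slide's layout, the frequently accessed id and state fields sit on opposite sides of a 200-byte name, so touching both drags extra cache lines in. A rearranged sketch (field grouping and 64-byte line size are illustrative assumptions):

    /* Group fields by access pattern: the hot pair shares one cache
       line; the cold, read-mostly name lives on its own lines */
    struct Process {
        /* hot: read and written together on every schedule */
        int  id;
        int  state;
        char pad[56];    /* pad hot fields out to a full 64-byte line */

        /* cold: touched only for monitoring/display */
        char name[200];
    };

Now a scheduler that reads id and state takes a single load miss at worst, instead of two, and updates to the hot fields never dirty the line holding name.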
Spin lock optimizations
• Lightweight synchronization mechanism
• Effective only when running with more than one ASE engine
• An inefficient algorithm can waste CPU cycles
• Must fit within one cache-line boundary
  • cache-line size varies from platform to platform
• Multiple spin-lock structures in a single cache line
  • cause too many unnecessary dirty flushes among the CPUs
• Hyper-Threading and Intel's pause instruction (see the sketch below)
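A sketch of these points in C for x86/GCC; this is not ASE's actual spinlock. It uses the real GCC builtins __sync_lock_test_and_set / __sync_lock_release and Intel's pause via _mm_pause, and pads each lock to its own cache line.

    #include <xmmintrin.h>   /* _mm_pause(); x86-specific */

    /* One lock per 64-byte cache line, so two locks never share a line
       and cause needless dirty-line flushes between CPUs */
    typedef struct {
        volatile int locked;
    } __attribute__((aligned(64))) spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        while (__sync_lock_test_and_set(&l->locked, 1)) {
            /* read-only wait: spin on the cached value, and execute
               Intel's pause so a hyper-threaded sibling gets the core */
            while (l->locked)
                _mm_pause();
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        __sync_lock_release(&l->locked);
    }

Spinning read-only on the cached value (rather than hammering the atomic) keeps the line shared until the holder releases it, and the pause both saves power and frees execution resources for the sibling hyper-thread.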
ASE Architecture: Storage Device
• Efficient IO is critical for system performance
• Process scheduling and interrupt handling are important
• SCSI or Fibre Channel
• Disk spindle RPM
• Controller cache
• RAID 0, RAID 1, or RAID 5
• Synchronous vs. asynchronous IO (see the sketch below)
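A minimal sketch of the asynchronous-IO pattern using POSIX AIO (aio_read, aio_error, aio_return; link with -lrt on Linux). The file name and sizes are illustrative; a real engine would run other tasks between submission and completion instead of busy-polling.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static char buf[2048];        /* one 2KB page */
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);

        cb.aio_fildes = open("datafile", O_RDONLY);  /* path illustrative */
        if (cb.aio_fildes < 0) { perror("open"); return 1; }
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }
        /* aio_read returns immediately: the engine can run other user
           tasks here instead of blocking on the disk */

        while (aio_error(&cb) == EINPROGRESS)
            ;   /* poll for completion between tasks */

        printf("read %zd bytes\n", aio_return(&cb));
        return 0;
    }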
ASE Architecture: File System vs. Raw
• Raw devices for log IO
• RAID 0 for data devices
• RAID 0+1 for log devices
• File system devices for better manageability
• For 32-bit platforms, where memory is limited, file system for data devices is recommended
  • the server has a 4GB memory-access limit
  • the OS uses the rest of the memory for the file system cache
  • data access is mostly read intensive, not heavily write intensive
  • this results in better read response for applications
ASE Architecture: File system vs. Raw devices
• Use a mix of file system and raw devices
[Benchmark chart: 60%]
ASE Architecture: Journaling on or off?
• EXT3 with journaling disabled
[Benchmark chart: 24%]
ASE Architecture: Large Memory Support
• Xeon has the PAE (Physical Address Extension) architecture
  • allows applications to address up to 64GB of memory
• ASE on Linux, as of the 12.5.2 release, can support up to 64GB of memory
• Easy configuration to set up the large memory feature
Large Memory Support on Linux 32-bit
• Intel's PAE architecture allows applications to address up to 64GB of memory
• Memory usage in ASE
  • most of the memory on a given system is used for data caches
  • this avoids expensive disk reads and writes
• File system devices cache data in the OS/FS cache
  • double copying wastes memory
  • writes are very expensive
• Without large memory support, the increased CPU bandwidth of Xeon is underutilized
• Most production environments use raw devices for ASE
  • without large memory support, this under-utilizes the system memory
Myth: ASE Engines vs. # of CPUs
• Can I have more engines than the number of CPUs?
  • single-server installation: no need for more engines
  • multiple ASE servers on a single system: the total number of engines may exceed the number of CPUs
• No simple yes/no answer
Myth: ASE taking most of the CPU cycles
• ASE always looks for work
  • it consumes CPU cycles when idling, but only for a fraction of a millisecond
• With increasing CPU clock speed, the problem seems more severe
• ASE is being improved to release CPU cycles as soon as possible
  • while ensuring that users' response time is not affected
• Typical ASE tuning (see the sketch below)
  • number of spins before releasing the CPU
  • active IO and idling
  • network and disk IO checks
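An illustrative C sketch of the idle-loop behavior described above: the engine polls for work a bounded number of times before yielding the CPU back to the OS. The helper functions are hypothetical placeholders, not ASE internals.

    #include <sched.h>

    extern int runnable_tasks(void);     /* hypothetical: any task ready?  */
    extern int poll_net_and_disk(void);  /* hypothetical: completed I/Os?  */

    /* max_spins corresponds to the "number of spins before releasing
       the CPU" tuning knob mentioned on the slide */
    void idle_loop(int max_spins)
    {
        int spins = 0;
        for (;;) {
            if (runnable_tasks() || poll_net_and_disk())
                return;                  /* found work: go run it */
            if (++spins >= max_spins) {
                sched_yield();           /* release the CPU to other processes */
                spins = 0;
            }
        }
    }

A larger spin count keeps response time low when work arrives in bursts (the engine is already on-CPU); a smaller one hands cycles back sooner on a lightly loaded system, which is the trade-off this slide describes.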
Summary
• Processor technology continues to improve
  • higher clock speeds
  • dual-core chips
  • EPIC architecture
• Expect a lot more improvement in memory latency
  • more internal cache
  • parallel execution engines
  • parallelism pushed to compiler technology
• The ASE architecture makes use of new technology
  • best OLTP engine
  • new optimizer and execution engine
  • efficient handling of large data sets