ASE112: Adaptive Server Enterprise Performance Tuning on Next Generation Architecture
Prasanta Ghosh
Sr. Manager, Performance & Development
pghosh@sybase.com
August 15-19, 2004
The Enterprise. Unwired.
Industry and Cross Platform Solutions: Manage Information • Unwire Information • Unwire People
• Adaptive Server Enterprise
• Adaptive Server Anywhere
• Sybase IQ
• Dynamic Archive
• Dynamic ODS
• Replication Server
• OpenSwitch
• Mirror Activator
• PowerDesigner
• Connectivity Options
• EAServer
• Industry Warehouse Studio
• Unwired Accelerator
• Unwired Orchestrator
• Unwired Toolkit
• Enterprise Portal
• Real Time Data Services
• SQL Anywhere Studio
• M-Business Anywhere
• Pylon Family (Mobile Email)
• Mobile Sales
• XcelleNet Frontline Solutions
• PocketBuilder
• PowerBuilder Family
• AvantGo
Sybase Workspace
What will we learn?
• Processor Trends
  • relevant to the database world
  • present architectural issues
  • compiler technology
• ASE Architecture
  • adapting to new processors
  • keeping up with OLTP performance
• Discuss some of the hot performance-related topics
• Questions • Discussions • Interactive
Processor: CISC, RISC and EPIC
• CISC (Complex Instruction Set Computing)
  • Intel's and AMD's x86 processor families
• RISC (Reduced Instruction Set Computing)
  • goal: optimize performance with simpler instructions
• EPIC (Explicitly Parallel Instruction Computing)
  • goal: move beyond RISC performance bounds with explicitly parallel instruction streams
Processor Speed
• Is a higher clock speed always better?
  • Not always
  • e.g., 3.0GHz Xeon vs. 1.5GHz Itanium 2
Processor Speed: ASE behavior
• Obviously
  • faster processing
  • better response time
• Plus
  • more context switches: e.g., 112,296 vs. 522,115 per minute
  • but not when the engines are idling
  • demands more from disk IO performance
Processor Architecture: 64-bit Processing
• 64-bit data and addressing: better performance
• A must for large database environments
• Two versions of the OS kernel and ASE for the same platform
• Do I need 64-bit if I don't need to address more than 4GB of memory?
ASE: Network and Engine Affinity
• Network Affinity
  • user connection to ASE
  • an idling or least-loaded engine picks up the incoming connection
  • network IO for the user task is performed by that engine
• Engine Affinity
  • related to process scheduling
  • soft binding is automatic (user transparent)
  • application partitioning can be used for hard binding
  • a task runs on its engine as long as it can
• Network affinity remains unchanged
  • unless that engine is taken offline
• Engine affinity changes due to
  • the stealing algorithm
  • critical resource contention
ASE: Engine Affinity
• Scheduling
  • engine-local runnable queue
  • global runnable queue
• Tasks are mostly in the engine-local runnable queue
  • occasionally in the global runnable queue
• Engine stealing (see the sketch below)
[Diagram: per-engine runnable queues (Engine 0 queue, Engine 1 queue) alongside a shared kernel queue]
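A minimal C sketch of the scheduling order described above: local queue first, then the global queue, then stealing from a peer engine. All names and the queue layout are illustrative, not ASE's actual internals, and locking is omitted for brevity.

    #include <stddef.h>

    /* Hypothetical task and engine structures, for illustration only */
    typedef struct task {
        struct task *next;
    } task_t;

    typedef struct engine {
        task_t *runq;            /* engine-local runnable queue */
    } engine_t;

    static task_t *global_runq;  /* global runnable queue shared by all engines */

    static task_t *pop(task_t **q)
    {
        task_t *t = *q;
        if (t)
            *q = t->next;
        return t;
    }

    /* Pick the next task to run on this engine */
    static task_t *next_task(engine_t *self, engine_t *peers, int npeers)
    {
        task_t *t = pop(&self->runq);       /* common case: preserves affinity */
        if (!t)
            t = pop(&global_runq);          /* occasionally: global queue */
        for (int i = 0; !t && i < npeers; i++)
            t = pop(&peers[i].runq);        /* "engine stealing": affinity changes */
        return t;
    }

Preferring the local queue keeps a task on the same engine, which is what keeps its data warm in that CPU's cache; stealing only happens when an engine would otherwise idle.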
Processor Architecture: some more
• Hyper-Threading
  • Intel Xeon processor
• HyperTransport
  • high-speed, low-latency, point-to-point link
  • data throughput 22.4GB/sec
• Dual core
  • PA-8800, POWER5
• Chip Multithreading Technology (CMT)
  • Sun UltraSPARC IV
• Non-Uniform Memory Access (NUMA)
  • critical for large database applications using huge memory
• Large register set
  • Itanium 2 has 128 registers
Hyper-Threading and ASE
• Should I enable Hyper-Threading for ASE?
• Our experience:
  • on a single-CPU system, Hyper-Threading helps
  • on SMP systems, Hyper-Threading does not always help
    • Linux AS 2.1 has a scheduling issue, fixed in RHEL 3.0
  • it does not help on a highly active system where engines are fully utilized
  • we have not seen the often-quoted 30% gain in any ASE configuration
Processor Architecture Limits and EPIC Solutions
• Problem: memory/CPU latency is already large and growing
  • Solution: speculative loads for data and instructions
• Problem: increasing amount of conditional and/or unpredictable branches in code
  • Solution: predication and prediction of branches and conditionals, orchestrated by the compiler to use the EPIC architecture
• Problem: complexity of multiple pipelines is too great for effective on-chip scheduling
  • Solution: the compiler handles scheduling and produces code that takes advantage of on-chip resources
• Problem: register and chip resource availability limits parallelism
  • Solution: increase the number of registers by 4x (32 to 128)
Traditional Architecture Limiters
[Diagram: the compiler turns original source code into sequential machine code; the hardware must re-parallelize it at run time across multiple functional units, so the available execution units are used inefficiently — today's processors are often 60% idle]
Explicit Parallelism
• Instruction-Level Parallelism (ILP) is the ability to execute multiple instructions at the same time
• Explicitly Parallel Instruction Computing (EPIC) allows the compiler or assembler to specify the parallelism
• The compiler specifies instruction groups: lists of instructions with no dependencies that can be executed in parallel
  • a stop bit or a taken branch marks an instruction-group boundary
• Instructions are packed in bundles of 3 instructions each
  • bundle format: a 5-bit template plus three 41-bit instructions (128 bits total), with stops marking group boundaries
  • the template field directly maps each instruction to an execution unit, allowing easy parallel dispatch of the instructions
Processor Architecture: TLB miss
• Translation Lookaside Buffer (TLB)
  • a fixed-size table
  • the processor uses it to translate virtual addresses to physical addresses on every memory access
• Large memory configurations
  • common for database applications
  • more chances of a TLB miss
• Locking the shared memory
• Variable OS page size
  • 4KB vs. 8MB or 16MB: larger pages mean each TLB entry covers far more memory, so fewer misses (see the sketch below)
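A sketch of the two ideas on this slide — large OS pages and locked shared memory — using Linux System V shared memory. SHM_HUGETLB and SHM_LOCK are Linux-specific; the segment size is illustrative.

    #define _GNU_SOURCE          /* for SHM_HUGETLB on Linux */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void)
    {
        size_t sz = 256UL * 1024 * 1024;   /* 256MB segment, size illustrative */

        /* SHM_HUGETLB backs the segment with large pages, so one TLB
           entry covers megabytes instead of 4KB */
        int id = shmget(IPC_PRIVATE, sz, IPC_CREAT | SHM_HUGETLB | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        void *addr = shmat(id, NULL, 0);
        if (addr == (void *)-1) { perror("shmat"); return 1; }

        /* "Locking the shared memory": pin it so it is never paged out */
        if (shmctl(id, SHM_LOCK, NULL) < 0)
            perror("shmctl(SHM_LOCK)");

        shmdt(addr);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }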
Processor speed vs. Memory Access
• CPU speed doubles every 1.5 years
• Memory speed doubles every 10 years
• High-speed CPUs are mostly underutilized
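A back-of-the-envelope consequence of those two rates: over one decade, CPU speed grows by about 2^(10/1.5) ≈ 100x while memory speed merely doubles, so the processor/memory gap widens by roughly 50x. That widening gap is why the cache and TLB techniques on the surrounding slides matter so much.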
Reduce Memory Latency
• Internal cache: L1, L2, L3
  • memory closer to the processor
  • on-chip or off-chip
  • may be shared by CPUs
  • separate data and instruction caches
Internal Cache
• ASE is optimized to make use of the L1/L2/L3 caches
• Database applications are memory intensive
• New systems: what to watch for?
  • higher clock speed
  • higher front-side bus speed
  • large L1/L2/L3 caches
  • lower memory latency
• Follow OEM guidelines
  • e.g., use same-speed memory DIMMs
Internal Cache: Shared L2/L3 Cache
• Level 2 cache boosts performance
  • the size (32MB) and proximity of the L2 cache to the processors increases performance for many workloads
• More than the CPUs inside the processor module: an on-chip cache controller speeds access and protects data
  • on-chip tags help the cache controller quickly locate and send data to the CPU
  • ECC protection for data tags, cached data, and in-flight data
[Diagram: PA-8800 dual CPUs sharing an L2 cache over the L2 cache bus, attached to the system bus]
Internal Cache: ASE optimizations
• Smaller memory footprint
• Avoid random access of memory
• Only a few OS processes
• Structure alignment
• Minimize cross-engine data access
• Compiler optimizations to pre-fetch data
• Better branch prediction (see the sketch below)
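A small C sketch of the last two bullets using GCC builtins (__builtin_prefetch and __builtin_expect, both real GCC intrinsics). The list-walking code itself is hypothetical, not ASE source.

    #include <stddef.h>

    /* Hint to the compiler that the condition is almost always true */
    #define likely(x)  __builtin_expect(!!(x), 1)

    typedef struct node { struct node *next; int key; } node_t;

    int find(node_t *head, int key)
    {
        for (node_t *n = head; n != NULL; n = n->next) {
            /* pre-fetch the next node into cache while examining this one */
            if (n->next)
                __builtin_prefetch(n->next, 0 /* read */, 1 /* low locality */);
            if (likely(n->key != key))
                continue;        /* predicted-taken path stays in the loop */
            return 1;
        }
        return 0;
    }

The prefetch overlaps the memory latency of the next node with useful work on the current one, and the likely() hint lets the compiler lay out the hot path fall-through, improving branch prediction.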
ASE FBO Server: Speculation
• Allows the compiler to issue an operation early, before a dependency
• Removes the latency of the operation from the critical path
• Helps hide long-latency memory operations
• Two types of speculation:
  • control speculation: execution of an operation before the branch that guards it
  • data speculation: execution of a memory load prior to a preceding store that may alias with it
ASE FBO Server: Predication
• Allows instructions to be conditionally executed
• A predicate register operand controls execution
• Removes branches and their associated mispredict penalties
• Creates larger basic blocks and simplifies compiler optimizations
Example:
         cmp.eq p1,p2 = r1,r2 ;;
    (p1) add    r1 = r2, 4
    (p2) ld8.sa r7 = [r8], 8
If p1 is true, the add is performed; otherwise it acts as a nop.
If p2 is true, the ld8 is performed; otherwise it acts as a nop.
ASE FBO Server: Optimizations!
• Profile-guided optimization
  • also known as FBO or PBO
• Run a typical load using an instrumented server
• Collect execution-profile data
• Recompile to generate highly optimized code!
• Anywhere between 10-40% performance gain
ASE architecture: High level view
[Diagram: ASE engines (Engine 0, Engine 1, ... Engine N) run as OS processes on the CPUs, each with its own registers and file descriptors, handling its own network and disk I/O; they share the executable (program memory) and a shared memory region holding the procedure cache, run queues, sleep queues, lock chains, and pending I/Os, all on top of the operating system and disks]
Cacheline and Data structure
• Main-memory-to-internal-cache transfers happen in cache-line-sized chunks
  • 32 bytes, 64 bytes, or 128 bytes
• In database applications, load misses consume almost 90% of CPU cycles
• Avoid load misses by rearranging the fields in a structure (a rearranged version is sketched below)
  • write-only fields
  • read-only fields
  • fields accessed simultaneously

    struct Process {
        int  id;
        char name[200];
        int  state;
    };
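In the slide's layout, the frequently accessed id and state fields sit on opposite sides of a 200-byte name, so touching both drags extra cache lines in. A rearranged sketch (field grouping and 64-byte line size are illustrative assumptions):

    /* Group fields by access pattern: the hot pair shares one cache
       line; the cold, read-mostly name lives on its own lines */
    struct Process {
        /* hot: read and written together on every schedule */
        int  id;
        int  state;
        char pad[56];    /* pad hot fields out to a full 64-byte line */

        /* cold: touched only for monitoring/display */
        char name[200];
    };

Now a scheduler that reads id and state takes a single load miss at worst, instead of two, and updates to the hot fields never dirty the line holding name.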
Spin lock optimizations
• Lightweight synchronization mechanism
• Effective only when running with more than one ASE engine
• An inefficient algorithm can waste CPU cycles
• Must fit within one cache-line boundary
  • cache-line size varies from platform to platform
• Multiple spin-lock structures in a single cache line
  • cause too many unnecessary dirty flushes among the CPUs
• Hyper-Threading and Intel's pause instruction (see the sketch below)
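A sketch of these points in C for x86/GCC; this is not ASE's actual spinlock. It uses the real GCC builtins __sync_lock_test_and_set / __sync_lock_release and Intel's pause via _mm_pause, and pads each lock to its own cache line.

    #include <xmmintrin.h>   /* _mm_pause(); x86-specific */

    /* One lock per 64-byte cache line, so two locks never share a line
       and cause needless dirty-line flushes between CPUs */
    typedef struct {
        volatile int locked;
    } __attribute__((aligned(64))) spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        while (__sync_lock_test_and_set(&l->locked, 1)) {
            /* read-only wait: spin on the cached value, and execute
               Intel's pause so a hyper-threaded sibling gets the core */
            while (l->locked)
                _mm_pause();
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        __sync_lock_release(&l->locked);
    }

Spinning read-only on the cached value (rather than hammering the atomic) keeps the line shared until the holder releases it, and the pause both saves power and frees execution resources for the sibling hyper-thread.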
ASE Architecture: Storage Device
• Efficient IO is critical for system performance
• Process scheduling and interrupt handling are important
• SCSI or Fibre Channel
• Disk spindle RPM
• Controller cache
• RAID 0, RAID 1, or RAID 5
• Synchronous vs. asynchronous IO (see the sketch below)
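A minimal sketch of the asynchronous-IO pattern using POSIX AIO (aio_read, aio_error, aio_return; link with -lrt on Linux). The file name and sizes are illustrative; a real engine would run other tasks between submission and completion instead of busy-polling.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static char buf[2048];        /* one 2KB page */
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);

        cb.aio_fildes = open("datafile", O_RDONLY);  /* path illustrative */
        if (cb.aio_fildes < 0) { perror("open"); return 1; }
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }
        /* aio_read returns immediately: the engine can run other user
           tasks here instead of blocking on the disk */

        while (aio_error(&cb) == EINPROGRESS)
            ;   /* poll for completion between tasks */

        printf("read %zd bytes\n", aio_return(&cb));
        return 0;
    }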
ASE Architecture: File System vs. Raw
• Raw devices for log IO
• RAID 0 for data devices
• RAID 0+1 for log devices
• File system devices for better manageability
• For 32-bit platforms, where memory is limited, file system for data devices is recommended
  • the server has a 4GB memory-access limit
  • the OS uses the rest of the memory for the file system cache
  • data access is mostly read intensive, not heavily write intensive
  • this results in better read response for applications
ASE Architecture: File system vs. Raw devices
• Use a mix of file system and raw devices
[Benchmark chart: 60%]
ASE Architecture: Journaling on or off?
• EXT3 with journaling disabled
[Benchmark chart: 24%]
ASE Architecture: Large Memory Support
• Xeon has the PAE (Physical Address Extension) architecture
  • allows applications to address up to 64GB of memory
• ASE on Linux, as of the 12.5.2 release, can support up to 64GB of memory
• Easy configuration to set up the large memory feature
Large Memory Support on Linux 32-bit
• Intel's PAE architecture allows applications to address up to 64GB of memory
• Memory usage in ASE
  • most of the memory on a given system is used for data caches
  • this avoids expensive disk reads and writes
• File system devices cache data in the OS/FS cache
  • double copying wastes memory
  • writes are very expensive
• Without large memory support, the increased CPU bandwidth of Xeon is underutilized
• Most production environments use raw devices for ASE
  • without large memory support, this under-utilizes the system memory
Myth: ASE Engines vs. # of CPUs
• Can I have more engines than the number of CPUs?
  • single-server installation: no need for more engines
  • multiple ASE servers on a single system: the total number of engines may exceed the number of CPUs
• No simple yes/no answer
Myth: ASE taking most of the CPU cycles
• ASE always looks for work
  • it consumes CPU cycles when idling, but only for a fraction of a millisecond
• With increasing CPU clock speed, the problem seems more severe
• ASE is being improved to release CPU cycles as soon as possible
  • while ensuring that users' response time is not affected
• Typical ASE tuning (see the sketch below)
  • number of spins before releasing the CPU
  • active IO and idling
  • network and disk IO checks
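An illustrative C sketch of the idle-loop behavior described above: the engine polls for work a bounded number of times before yielding the CPU back to the OS. The helper functions are hypothetical placeholders, not ASE internals.

    #include <sched.h>

    extern int runnable_tasks(void);     /* hypothetical: any task ready?  */
    extern int poll_net_and_disk(void);  /* hypothetical: completed I/Os?  */

    /* max_spins corresponds to the "number of spins before releasing
       the CPU" tuning knob mentioned on the slide */
    void idle_loop(int max_spins)
    {
        int spins = 0;
        for (;;) {
            if (runnable_tasks() || poll_net_and_disk())
                return;                  /* found work: go run it */
            if (++spins >= max_spins) {
                sched_yield();           /* release the CPU to other processes */
                spins = 0;
            }
        }
    }

A larger spin count keeps response time low when work arrives in bursts (the engine is already on-CPU); a smaller one hands cycles back sooner on a lightly loaded system, which is the trade-off this slide describes.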
Summary
• Processor technology continues to improve
  • higher clock speeds
  • dual-core chips
  • EPIC architecture
• Expect a lot more improvement in memory latency
  • more internal cache
  • parallel execution engines
  • parallelism pushed to compiler technology
• The ASE architecture makes use of new technology
  • best OLTP engine
  • new optimizer and execution engine
  • efficient handling of large data sets