
ASE112: Adaptive Server Enterprise Performance Tuning on Next Generation Architecture


Presentation Transcript


  1. ASE112: Adaptive Server Enterprise Performance Tuning on Next Generation Architecture Prasanta Ghosh Sr. Manager- Performance & Development pghosh@sybase.com August 15-19, 2004

  2. The Enterprise. Unwired.

  3. The Enterprise. Unwired. Industry and Cross Platform Solutions Manage Information Unwire Information Unwire People • Adaptive Server Enterprise • Adaptive Server Anywhere • Sybase IQ • Dynamic Archive • Dynamic ODS • Replication Server • OpenSwitch • Mirror Activator • PowerDesigner • Connectivity Options • EAServer • Industry Warehouse Studio • Unwired Accelerator • Unwired Orchestrator • Unwired Toolkit • Enterprise Portal • Real Time Data Services • SQL Anywhere Studio • M-Business Anywhere • Pylon Family (Mobile Email) • Mobile Sales • XcelleNet Frontline Solutions • PocketBuilder • PowerBuilder Family • AvantGo Sybase Workspace

  4. What will we learn? • Processor Trends • Relevant to the Database world • Present architectural issues • Compiler technology • ASE Architecture • Adapting to new processors • Keeping up with OLTP performance • Discuss some of the hot performance related topics • Questions • Discussions • Interactive

  5. Processor: CISC – RISC and EPIC CISC (Complex Instruction Set Computing) • Intel and AMD’s x86 processor set RISC (Reduced Instruction Set Computing) • Goal to optimize performance with simpler instructions EPIC (Explicitly Parallel Instruction Computing) • Goal to move beyond RISC performance bounds with explicit parallel instruction streams

  6. Processor Speed • Is higher clock speed better? • Not always • 3.0GHz Xeon vs. 1.5GHz Itanium2

  7. Processor Speed: ASE behavior • Obviously • faster processing • better response time • Plus • more context switches: e.g. 112,296 vs. 522,115 per minute • not when the engines are idling • demands more from disk IO performance

  8. Processor Architecture: 64bit Processing • 64bit data and addresses: better performance • A must for large database environments • Two versions of the OS kernel and ASE for the same platform • Do I need to use 64bit if I don't need > 4GB of memory?

  9. ASE: Network and Engine Affinity • Network Affinity • User connection to ASE • Idling or least loaded engine picks up the incoming connection • Network IO for the user task is performed by that engine • Engine Affinity • Related to process scheduling • soft binding is automatic (user transparent) • Can use application partitioning to do hard binding • Runs on that engine as long as it can • Network affinity remains unchanged • unless that engine is made offline • Engine affinity changes • Due to stealing algorithm • Critical resource contention

  10. ASE: Engine Affinity • Scheduling • Engine-local runnable queue • Global runnable queue • Tasks are mostly in the engine runnable queue • Occasionally in the global runnable queue • Engine stealing (see the sketch below) [diagram: Engine 0 queue and Engine 1 queue alongside the kernel (global) queue]
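A simplified sketch of this scheduling idea (all names hypothetical; ASE's real scheduler is more involved): an engine dequeues from its local run queue first, falls back to the global (kernel) queue, and finally steals from a peer engine, which is one way a task's engine affinity changes.

    typedef struct Task Task;
    typedef struct { Task *head; } RunQueue;   /* simplistic FIFO queue */

    extern Task *dequeue(RunQueue *q);         /* hypothetical helper */
    extern RunQueue global_queue;              /* the kernel/global queue */

    Task *next_task(RunQueue *local, RunQueue *engines[], int n, int self) {
        Task *t;
        if ((t = dequeue(local)))              /* 1. engine-local queue first */
            return t;
        if ((t = dequeue(&global_queue)))      /* 2. then the global queue */
            return t;
        for (int i = 0; i < n; i++)            /* 3. finally steal from a peer; */
            if (i != self && (t = dequeue(engines[i])))
                return t;                      /*    the task's affinity moves */
        return 0;                              /* nothing runnable: idle */
    }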

  11. Processor Architecture: some more • Hyper-Threading • Intel Xeon processors • HyperTransport • high-speed, low-latency, point-to-point link • Data throughput 22.4GB/sec • Dual core • PA-8800, POWER5 • Chip Multithreading Technology (CMT) • Sun UltraSPARC IV • Non-Uniform Memory Access (NUMA) • Critical for large database applications using huge memory • Large register set • Itanium2 has 128 registers

  12. Hyper-Threading and ASE • Should I enable Hyper-Threading for ASE? • Our experience: • On a single-CPU system, Hyper-Threading helps • On SMP systems, Hyper-Threading does not always help • Linux AS 2.1 has a scheduling issue, which is fixed in RHEL 3.0 • Does not help on a highly active system where engines are fully utilized • We haven't seen a 30% gain for ASE configurations

  13. Processor Architecture Limits and EPIC Solutions • Problem: memory/CPU latency is already large and growing. Solution: speculative loads for data and instructions • Problem: an increasing amount of conditional and/or unpredictable branches in code. Solution: predication and prediction of branches and conditionals, orchestrated by the compiler to use the EPIC architecture • Problem: the complexity of multiple pipelines is too great for effective on-chip scheduling. Solution: the compiler handles scheduling and produces code to take advantage of the on-chip resources • Problem: register and chip resource availability limits parallelism. Solution: increase the number of registers by 4X (32 to 128)

  14. Traditional Architecture Limiters [diagram: the compiler turns original source code into sequential machine code, which the hardware must re-parallelize at run time across multiple functional units; the available execution units are used inefficiently, leaving today's processors often 60% idle]

  15. Explicit Parallelism • Instruction Level Parallelism (ILP) is the ability to execute multiple instructions at the same time • Explicitly Parallel Instruction Computing (EPIC) allows the compiler or assembler to specify the parallelism • The compiler specifies instruction groups, lists of instructions with no dependencies that can be executed in parallel • A stop bit or taken branch indicates an instruction group boundary • Instructions are packed in bundles of 3 instructions each • A template field directly maps each instruction to an execution unit, allowing easy parallel dispatch of the instructions [bundle layout: a 5-bit template followed by three 41-bit instructions, with stop bits marking group boundaries; see the sketch below]
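To make the bundle layout concrete, here is a minimal C sketch (illustrative only; the Bundle type and accessor names are hypothetical) that extracts the 5-bit template and the three 41-bit instruction slots from a 128-bit bundle:

    #include <stdint.h>

    /* An IA-64 bundle is 128 bits: a 5-bit template followed by
       three 41-bit instruction slots. */
    typedef struct {
        uint64_t lo;    /* bits  0..63  */
        uint64_t hi;    /* bits 64..127 */
    } Bundle;

    static uint8_t template_of(Bundle b) {
        return (uint8_t)(b.lo & 0x1f);          /* bits 0..4 */
    }

    static uint64_t slot(Bundle b, int n) {     /* n = 0, 1, 2 */
        /* slot 0: bits 5..45, slot 1: bits 46..86, slot 2: bits 87..127 */
        int start = 5 + 41 * n;
        uint64_t mask = (1ULL << 41) - 1;
        if (start + 41 <= 64)                   /* fits in the low word */
            return (b.lo >> start) & mask;
        if (start >= 64)                        /* fits in the high word */
            return (b.hi >> (start - 64)) & mask;
        /* straddles the 64-bit boundary: combine both halves */
        return ((b.lo >> start) | (b.hi << (64 - start))) & mask;
    }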

  16. Processor Architecture: TLB miss • Translation Lookaside Buffer (TLB) • A fixed-size table • The processor uses it to cache virtual-to-physical address translations • Large memory configurations • Common for database applications • More chances of a TLB miss • Locking the shared memory • Variable OS page size • 4KB vs. 8MB or 16MB (see the huge-page sketch below)
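One hedged illustration of the page-size point: on Linux, a shared memory segment can be backed by huge pages via the SHM_HUGETLB flag, so far fewer TLB entries are needed to map a large region (a minimal sketch; the size is an assumption, and the OS huge-page pool must be configured first):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000   /* Linux-specific: back the segment with huge pages */
    #endif

    int main(void) {
        /* 256MB; must be a multiple of the huge page size */
        size_t size = 256UL * 1024 * 1024;
        /* Larger pages mean fewer TLB entries map the same memory */
        int id = shmget(IPC_PRIVATE, size, IPC_CREAT | SHM_HUGETLB | SHM_R | SHM_W);
        if (id < 0) { perror("shmget"); return 1; }
        void *addr = shmat(id, NULL, 0);
        if (addr == (void *)-1) { perror("shmat"); return 1; }
        /* ... use the segment, e.g. as a buffer cache ... */
        shmdt(addr);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }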

  17. Processor speed vs. Memory Access • CPU speed doubles every 1.5 years • Memory speed doubles every 10 years • High-speed CPUs are mostly underutilized

  18. Reduce Memory Latency • Internal cache: • L1, L2, L3 cache • memory closer to processor • On chip or off chip • Shared by CPUs • Data/Instruction cache separate

  19. Internal Cache • ASE is optimized to make use of the L1/L2/L3 caches • Database applications are memory intensive • New systems: what to watch for? • Higher clock speed • Higher front-side bus speed • Large L1/L2/L3 caches • Lower memory latency • Follow OEM guidelines • e.g., same-speed memory DIMMs

  20. Internal Cache: Separate L1/L2 Cache

  21. Internal Cache: Shared L2/L3 Cache • Level 2 cache boosts performance • The size (32MB) and proximity of the L2 cache to the processors increases performance for many workloads • More than the CPUs inside the processor module: the on-chip cache controller speeds access and protects data • On-chip tags help the cache controller quickly locate and send data to the CPUs • ECC protection for data tags, cached data, and in-flight data [diagram: PA-8800 module with dual CPUs and L2 caches on a shared L2 cache bus, attached to the system bus]

  22. Internal Cache: ASE optimizations • Smaller footprint • Avoid random access of memory • Only a few OS processes • Structure alignments • Minimize cross-engine data access • Compiler optimization to pre-fetch data (sketched below) • Better branch prediction
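For instance, with GCC-style compilers a hot scan loop can request data a few elements ahead using __builtin_prefetch (a sketch; the prefetch distance of 16 is an assumption to be tuned per platform):

    #include <stddef.h>

    /* Sum fixed-size records while prefetching ahead, hiding part
       of the main-memory latency behind useful work. */
    long sum_ids(const int *ids, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&ids[i + 16], 0, 1);  /* read, low locality */
            total += ids[i];
        }
        return total;
    }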

  23. ASE FBO Server: Speculation • Allows the compiler to issue an operation early, before a dependency • Removes the latency of the operation from the critical path • Helps hide long-latency memory operations • Two types of speculation: • Control speculation, the execution of an operation before the branch which guards it (sketched below) • Data speculation, the execution of a memory load prior to a preceding store which may alias with it
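A hedged C-level picture of control speculation; this shows the compiler's transformation by hand, while on real EPIC hardware the hoisted load is a speculative ld8.s whose exception is deferred until a chk.s at the original load site, something plain C cannot express:

    /* Before: the load waits behind the branch that guards it. */
    int guarded(int *p, int use_p) {
        if (use_p)
            return *p;   /* load sits on the critical path */
        return 0;
    }

    /* After (conceptually): the load is hoisted above the branch so
       its latency overlaps the rest of the schedule. */
    int guarded_speculative(int *p, int use_p) {
        int v = *p;      /* speculative: issued early to hide latency */
        if (use_p)
            return v;    /* the guard decides whether the value is used */
        return 0;
    }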

  24. ASE FBO Server: Predication • Allows instructions to be conditionally executed • A predicate register operand controls execution • Removes branches and associated mispredict penalties • Creates larger basic blocks and simplifies compiler optimizations • Example:

    cmp.eq p1,p2 = r1,r2 ;;
    (p1) add r1 = r2, 4
    (p2) ld8.sa r7 = [r8], 8

If p1 is true, the add is performed, else it acts as a nop. If p2 is true, the ld8 is performed, else it acts as a nop.

  25. ASE FBO Server: Optimizations! • Profile-guided optimizations • Also known as FBO or PBO • Run a typical load using an instrumented server • Collect data by profiling the execution • Generate highly optimized code! • Anywhere between a 10-40% performance gain

  26. ASE architecture: High level view [diagram: engines 0 through N run as OS processes on the CPUs, each with its own registers and file descriptors; all engines share the executable (program memory) and shared memory holding the lock chains, procedure cache, sleep queue, hash tables, and run queues; engines poll the pending network and disk I/Os and move tasks between the run queues, sleep queue, and running state]

  27. Cacheline and Data structure • Main memory to internal cache transfers happen in chunks • 32-byte, 64-byte or 128-byte • In database applications, load misses consume almost 90% of the cpu cycles • Avoid load misses by rearranging the fields in a structure (see the sketch below) • Write-only fields • Read-only fields • Fields accessed simultaneously • Example: struct Process { int id; char name[200]; int state; };
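A minimal sketch of the rearrangement (C11 alignment syntax, and the 64-byte line size is an assumption): grouping the frequently written state away from the read-mostly fields keeps stores from evicting the cache line that holds the hot read-only data.

    /* Fields grouped by access pattern rather than declaration order. */
    struct Process {
        /* read-mostly: set at creation, scanned often */
        int  id;
        char name[200];

        /* written frequently: isolated on its own cache line so
           stores here do not dirty the read-mostly lines above */
        _Alignas(64) int state;
    };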

  28. Spin lock optimizations • A lightweight synchronization mechanism • Effective only when running with more than one ASE engine • An inefficient algorithm can waste cpu cycles • Must fit within one cache line boundary • Varies from platform to platform • Multiple spin lock structures in a single cache line • Too many unnecessary dirty flushes among the cpus • Hyper-Threading and Intel's pause instruction • See the sketch below
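A hedged sketch of such a spin lock in C11 (names and the 64-byte line size are assumptions; ASE's real implementation is platform-specific): the lock is padded to a full cache line so neighboring locks do not share a line, and the inner loop spins on a plain read to avoid repeatedly dirtying the line across cpus.

    #include <stdatomic.h>

    #define CACHE_LINE 64   /* assumed line size; varies by platform */

    typedef struct {
        atomic_int locked;
        char pad[CACHE_LINE - sizeof(atomic_int)];  /* one lock per line */
    } Spinlock;

    static void spin_lock(Spinlock *s) {
        for (;;) {
            if (!atomic_exchange_explicit(&s->locked, 1, memory_order_acquire))
                return;   /* got the lock */
            /* test-and-test-and-set: spin on a read so the line is not
               bounced between cpus; on Hyper-Threaded x86 a pause
               instruction belongs in this loop */
            while (atomic_load_explicit(&s->locked, memory_order_relaxed))
                ;
        }
    }

    static void spin_unlock(Spinlock *s) {
        atomic_store_explicit(&s->locked, 0, memory_order_release);
    }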

  29. Spin lock decomposition [chart: spin lock decomposition results, 79%]

  30. ASE Architecture: Storage Device • Efficient IO is critical for system performance • Process scheduling and interrupt handling are important • SCSI or Fibre Channel • Disk spindle RPM • Controller cache • RAID 0, RAID 1 or RAID 5 • Synchronous vs. asynchronous IO

  31. ASE Architecture: File System vs. Raw • Raw devices for log IO • RAID 0 for data devices • RAID 0+1 for log devices • File system devices for better manageability • On 32bit platforms, where memory is limited, file system devices are recommended for data devices • 4GB memory access limit • The OS allocates the rest of the memory to the file system cache • Mostly read intensive and not heavily write intensive • This results in better read response for applications

  32. ASE Architecture: File system vs. Raw devices • Use a mix of file system and raw devices [chart: 60%]

  33. ASE Architecture: Journaling on or off? • EXT3 with journaling disabled [chart: 24%]

  34. ASE Architecture: Large Memory Support • Xeon has the PAE architecture • Allows applications to address up to 64GB of memory • ASE on Linux, as of the 12.5.2 release, can support up to 64GB of memory • Easy configuration to set up the large memory feature

  35. Large Memory Support on Linux 32bit • Intel has the PAE architecture • Allows applications to address up to 64GB of memory • Memory usage in ASE • Most of the memory on a given system is used for data caches • Avoids expensive disk reads and writes • File system devices cache the data in the OS/FS cache • A double-copying problem that wastes memory • Writes are very expensive • The increased CPU bandwidth on Xeon is underutilized without large memory support • Most production environments use raw devices for ASE • which causes underutilization of the system memory

  36. ASE Architecture: Large Memory Support

  37. Myth: ASE Engines vs. # of CPUs • Can I have more engines than the # of cpus? • Single-server installation • no need to have more engines • Multiple ASE servers on a single system • the total number of engines may exceed the # of CPUs • No simple 'Yes/No' answer

  38. Myth: ASE taking most of the cpu cycles • ASE always looks for work • Consumes cpu cycles when idling, but only for fractions of a millisecond • With increasing CPU clock speed the problem seems more severe • ASE is being improved to release cpu cycles as soon as possible • while ensuring that users' response time is not affected • Typical ASE tuning (sketched below) • Number of spins before releasing the cpu • Active IO and idling • Network and disk IO checks
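A hedged sketch of the idle-loop tuning being described (function names are hypothetical, not ASE's actual code; ASE exposes a comparable knob as the 'runnable process search count' configuration parameter): spin looking for work a bounded number of times, polling network and disk completions along the way, then yield the cpu.

    #include <sched.h>

    #define SPIN_BEFORE_YIELD 2000   /* cf. "runnable process search count" */

    extern int  find_runnable_task(void);   /* hypothetical */
    extern void poll_network_io(void);      /* hypothetical */
    extern void poll_disk_io(void);         /* hypothetical */

    void engine_idle_loop(void) {
        int spins = 0;
        while (!find_runnable_task()) {
            poll_network_io();              /* keep response time low while spinning */
            poll_disk_io();
            if (++spins >= SPIN_BEFORE_YIELD) {
                sched_yield();              /* release the cpu instead of burning it */
                spins = 0;
            }
        }
        /* dispatch the task ... */
    }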

  39. Summary • Processor technology continues to improve • Higher clock speeds • Dual-core chips • EPIC architecture • A lot more improvement to expect in memory latency • More internal cache • Parallel execution engines • Parallelism pushed to the compiler technology • The ASE architecture makes use of new technology • Best OLTP engine • New optimizer and execution engine • Efficient handling of large data sets

  40. Questions
