ASE104: How ASE Works - SMP Support, Scheduling, and Task Management


Presentation Transcript


  1. ASE104: How ASE Works - SMP Support, Scheduling, and Task Management David Wein Principal Product Support Engineer david.wein@sybase.com August 15-19, 2004

  2. The Enterprise. Unwired.

  3. The Enterprise. Unwired. Industry and Cross Platform Solutions Manage Information Unwire Information Unwire People • Adaptive Server Enterprise • Adaptive Server Anywhere • Sybase IQ • Dynamic Archive • Dynamic ODS • Replication Server • OpenSwitch • Mirror Activator • PowerDesigner • Connectivity Options • EAServer • Industry Warehouse Studio • Unwired Accelerator • Unwired Orchestrator • Unwired Toolkit • Enterprise Portal • Real Time Data Services • SQL Anywhere Studio • M-Business Anywhere • Pylon Family (Mobile Email) • Mobile Sales • XcelleNet Frontline Solutions • PocketBuilder • PowerBuilder Family • AvantGo Sybase Workspace

  4. Goals and Agenda • Session Goal • Provide details on ASE’s internal task and scheduling architecture that will allow DBAs to more efficiently tune, manage, and monitor their ASE installations. • Session Agenda • ASE Architectural Overview • Task Scheduling • Disk and Network I/O Polling • Configuration and Monitoring

  5. ASE Architectural Overview [Diagram: three layers] • ASE Server Layer (the RDBMS Engine): • Databases • Xacts • Logging • Locks • Users • Queries • ASE Kernel Layer (a platform-independent “O/S”): • Tasks • Engines • Scheduler • Spinlocks • Alarms • OS Services • Operating System (physical resources): • CPU • Memory • Disks • Network • Signals

  6. ASE Kernel Services • ASE Kernel Provides a Generic Run-Time Environment for the RDBMS Engine • Task management / scheduling • SMP support (Engines and spinlocks) • Disk I/O • Network I/O • Memory management • System clock / time services / alarms • Signal handling • Native thread interface • Error log services • Security Integration (SSL, LDAP, PAM, Kerberos) • JVM host environment

  7. ASE Process Model ASE Implements a Highly Efficient Internal Threading Model • Overview • ASE creates customized internal “threads” or “tasks” for connections. • No O/S processes are spawned for incoming connections • Native O/S threads are not used for core processing • Windows platform is an exception • Benefits • Small footprint: no need for data segments, heaps, etc. • No context switching at the O/S • Consistent behavior across platforms • ASE has complete and total control over task scheduling

  8. Key Concepts - Engines • Internal ASE representation of an O/S process • Engines provide actual CPU time to internal ASE tasks • ASE supports up to 128 engines • Engines communicate via shared memory • An “ENGINE” itself is a C data structure that exists in ASE shared memory. • On NT, an O/S thread is spawned for each ENGINE, not a separate process • Engines contain: • Prioritized run queues • Deferred event queues • Lists of disk I/Os • Relevance for network I/Os • Housekeeper queues • Lots and lots of status information

  9. Key Concepts - Spinlock • Low-level synchronization mechanism • access to many internal data structures is protected with spinlocks • prevents multiple engines from stepping on each other • ex: one engine reading a page buffer while another engine is writing a new page into that buffer • implemented in assembly language for speed and efficiency • Spinlocks vs. logical locks (row, page, table, etc.) • logical locks: • blocked process goes to sleep • engine can run other tasks while lock is being held • lock holder wakes blocked sleepers when it is done • held for short or long duration • spinlocks: • blocked process “spins” continuously checking if the lock is available • engine can’t do anything else until it gets the spinlock • held for very short duration • Spinlock Contention • how often did I need to wait for a spinlock
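
To make the “spin” concrete, here is a minimal test-and-set spinlock sketch in C11, analogous in spirit to the assembly-level spinlocks described above; all names here are hypothetical, not ASE source.

    #include <stdatomic.h>

    /* Minimal spinlock sketch. Initialize with:
     *   spinlock_t sl = { ATOMIC_FLAG_INIT };
     * A production lock would add backoff and contention counters. */
    typedef struct { atomic_flag locked; } spinlock_t;

    static void spin_grab(spinlock_t *sl)
    {
        /* Busy-wait until the flag is acquired: the engine does no
         * other work while spinning, which is why spinlocks must be
         * held only for very short durations. */
        while (atomic_flag_test_and_set_explicit(&sl->locked,
                                                 memory_order_acquire))
            ;  /* spin */
    }

    static void spin_release(spinlock_t *sl)
    {
        atomic_flag_clear_explicit(&sl->locked, memory_order_release);
    }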

  10. Key Concepts – Kernel Process • An internal ASE task / thread • Also called a KPROC, identified by a KPID…not a SPID. • Represented by a process context structure • Has a stack • The kernel process is what is actually run on an engine. • Fiber API is used for ASE tasks on NT. • KPROCs contain: • Process state and priority • Engine affinity data • Pointer to its stack • Connection with a server level process (SPID) • A jump buffer for scheduling • Monitoring data for MDA tables • Other status information

  11. Kernel Process vs. Server Process [Diagram: a KPROC (KPID, engine mask, jump buffer, stack, MDA stats) linked to a PSS (SPID, user & login, open tables, query data, user log cache, MDA stats, send/recv network buffers)] • Kernel processes may have an associated server process • All server processes are associated with a kernel process • Server process is represented by a PSS, identified by a SPID

  12. Key Concepts - Scheduler • The scheduler is a special kernel process • decides which kernel process gets CPU time on an engine • one scheduler process is dedicated to each engine • polling for network and disk I/O completions occurs within the scheduler context. Most of the presentation focuses on what the scheduler does and why.

  13. Agenda • ASE Architectural Overview • Task Scheduling • Disk and Network I/O Polling • Configuration and Monitoring

  14. Cooperative Processing Model ASE uses non-preemptive scheduling • A running task is responsible for scheduling itself out • Ensures that a task is only scheduled out when we want it to be • Two fundamental reasons a task schedules itself out: • It has to wait for a resource: • physical I/O to bring a page into cache • page lock on a table • a command to run from the client • etc. • considered to be an automatic yield • It has exceeded its time quantum • tasks periodically check how long they have been running • this is called a yield point • if this exceeds a configured limit, they yield. • this is counted as a “voluntary yield”
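
A minimal sketch of what a yield point might look like, assuming a per-task counter that the clock interrupt decrements (hypothetical names, not ASE source):

    extern volatile int timeslice_counter;   /* ticked down by the clock handler */
    extern void yield_to_scheduler(void);    /* hypothetical: switch to the scheduler KPROC */

    void yield_point(void)
    {
        /* If the counter has gone negative, the task has used up its
         * time quantum and schedules itself out ("voluntary yield"). */
        if (timeslice_counter < 0)
            yield_to_scheduler();
    }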

  15. Execution Time Slice • time slice • time, in milliseconds, a task can run without yielding • if a task crosses a yield point after it has exceeded the time slice, it yields • internally tracked in clock ticks, rounded up • default is 100 ms • cpu grace time • number of clock ticks a task is allowed to run after it exceeds its time slice • allows a longer running task to hit a yield point or go to sleep • tasks that exceed their grace time are killed. • default is 500 ticks
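
Plugging in the defaults (and the default clock tick length from slide 19) gives a feel for the scale:

    time slice     = 100 ms ≈ 1 clock tick (at the default 100,000 µs tick)
    cpu grace time = 500 ticks × 100 ms/tick = 50,000 ms = 50 s

So a well-behaved task is expected to yield within roughly one tick, and is killed only after running about 50 seconds past its time slice without yielding.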

  16. [Diagram] Tasks Yielding to the Scheduler: timelines show the time slice counter ticking down from its starting value through 0 to -1, -2, … A task that checks the counter at a yield point and finds it negative performs a voluntary yield; a task that starts a disk I/O (or otherwise sleeps on a resource) performs an automatic yield.

  17. CPU Grace Time • tasks that exceed their time slice + cpu grace time are killed • prevents a run-away task from running forever • possible causes: • system call did not return (most common) • bug in code causing a loop (rare) • lack of sufficient yield points (very rare in modern ASE) [Diagram: the counter runs down from the time slice through 0, -1, -2, … to -grace time, at which point a timeslice error occurs]

  18. Dealing with Time Slice Errors • Recurring time slice errors should be pursued. • Try to find the root cause instead of covering them up • Don’t adjust the time slice parameter • ineffective solution • will affect all tasks, not only those that exceed grace time • cpu grace time should be a temporary solution • consider increasing grace time while the root cause is being investigated • default value of 500 ticks = 50 seconds. This is an eternity! • increasing this will simply cause a longer hang in many cases • Pursue the root cause!

  19. Clock Ticks Ticks are based on alarm signals delivered by the O/S • Clock interrupt handler runs when the alarm arrives • Performs very important accounting work • records if engine is idle, I/O busy, or CPU busy • decrements time slice counter for running process • kills a task that has exceeded its grace time • engine 0 keeps track of alarms • Frequency of clock tick based on “sql server clock tick length” • specified in microseconds, default is 1/10th of a second • generally not recommended to adjust this • on very fast CPUs a lower value may improve CPU accounting
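
A sketch of the per-tick accounting just described, under assumed names (the real handler does considerably more):

    struct kproc { int timeslice; };             /* simplified task context      */
    extern struct kproc *running_task;           /* task on this engine, if any  */
    extern int engine_state;                     /* IDLE, IO_BUSY, or CPU_BUSY   */
    extern int cpu_grace_time;                   /* in ticks, default 500        */
    extern void record_engine_state(int state);
    extern void kill_task(struct kproc *kp);     /* raises a timeslice error     */

    void clock_tick_handler(void)
    {
        record_engine_state(engine_state);       /* idle / I/O busy / CPU busy   */
        if (running_task != NULL) {
            running_task->timeslice--;           /* checked at yield points      */
            if (running_task->timeslice < -cpu_grace_time)
                kill_task(running_task);         /* exceeded time slice + grace  */
        }
    }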

  20. Finer Points of CPU Accounting [Diagram: over three clock ticks, kpid 1 is running at the first tick (counted busy); at the second and third ticks the engine happens to be idle, even though kpids 2 and 3 ran between ticks; cpu busy = 33%] • Accounting is performed during clock tick • ASE doesn’t know how busy it was between ticks • The faster the CPU, the more we can “miss” • Implications of the diagram: • sp_sysmon will report 33% busy, 67% idle • kpids 2 and 3 accumulate zero CPU time • shouldn’t be significant in a 5 minute sp_sysmon • shorter clock tick length could improve accuracy, but the tradeoff is additional overhead for accounting

  21. Run Queue Overview • When a task becomes runnable it is placed on a run queue • Each engine has a set of prioritized run queues • There is also a global run queue • An engine can run a task in its own queues, the global queues, or another engine’s queues • Task to engine affinity may limit this behavior. • Tasks have a priority attribute that determines their run queue • Eight priority slots • Slot 0 is the HIGHEST priority, i.e. first considered for CPU time • Slot 7 is the LOWEST priority, i.e. last considered for CPU time

  22. Prioritized Run Queues (decreasing priority, slot 0 first): • 0 – Kernel Service Tasks: Network Handler / Listeners • 1 – Unused • 2 – Deadlock Tune • 3 – Logins, Site Handler, Async log tasks, wakes from lock or latch sleep, HA takeover • 4 – High Priority User Tasks • 5 – Default User Tasks, HK GC, misc server service tasks • 6 – Low Priority User Tasks • 7 – House-keeper (Wash & Chores only in 12.5.0.3)
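
A minimal sketch of per-engine prioritized run queues as described above, with slot 0 searched first (hypothetical layout, not ASE’s actual structures):

    #define NUM_PRIORITIES 8

    struct kproc {
        int           priority;   /* 0 (highest) .. 7 (lowest) */
        struct kproc *next;       /* run-queue linkage         */
    };

    struct runqueue {
        struct kproc *head[NUM_PRIORITIES];
        struct kproc *tail[NUM_PRIORITIES];
    };

    /* Dequeue the highest-priority runnable task, or NULL if all
     * eight slots are empty. */
    struct kproc *runq_get(struct runqueue *rq)
    {
        for (int pri = 0; pri < NUM_PRIORITIES; pri++) {
            struct kproc *kp = rq->head[pri];
            if (kp != NULL) {
                rq->head[pri] = kp->next;
                if (rq->head[pri] == NULL)
                    rq->tail[pri] = NULL;
                return kp;
            }
        }
        return NULL;
    }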

  23. Priority Changes • Tasks may change their priority • Strategic priority adjustment helps ASE run smoothly • Lock Sleep and Latch Sleep • tasks change to priority 3 prior to sleeping on a lock or latch • revert to original priority upon wake-up • high priority helps waking task get scheduled, reduces contention • Housekeeper queues filling • HK chore task may be “accelerated” to priority 5 if its queues reach > 80% full. • Login processing is done at priority 3 • helps newly spawning user process get CPU time to speed login • priority reset after login completes • Priority may be altered using the Logical Process Manager • discussed later…

  24. Global Queues and Task Stealing • Tasks may migrate between engines • Engines can run any task from local run queues, global run queues, or may steal tasks from another engine’s run queues. • Local tasks preferred due to locality of reference • local tasks at a given priority always have preference • Global queues help important tasks on SMP systems • global queue checked after local queue in priority order • local pri 0, global pri 0, local pri 1, global pri 1, etc. • waking priority 3 tasks are placed in the global queue • ensures they are run by first available engine • Tasks may be stolen from another engine • “idle” engines may check priority 5 queue of other engines • improves load balancing

  25. Basic Scheduling Algorithm We now know enough to give a basic look at the scheduling algorithm [Diagram: engines 0–3, each with local run queues at priorities 0–7, plus a set of global queues 0–7] • Alternate local and global queues by priority • Cycle through priority 5 queues on other engines • Why only priority 5? < 5 will likely be run locally, > 5 is low priority • Keep looping until a runnable task is found
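
The selection order on this slide can be sketched as a loop (hypothetical helpers; exception checks and the I/O polling covered later are omitted):

    struct kproc;   /* opaque task context */
    extern int num_engines;
    extern struct kproc *local_queue_get(int engine, int pri);
    extern struct kproc *global_queue_get(int pri);
    extern struct kproc *steal_from_engine(int engine, int pri);

    struct kproc *pick_runnable(int my_engine)
    {
        for (;;) {
            struct kproc *kp;
            /* local pri 0, global pri 0, local pri 1, global pri 1, ... */
            for (int pri = 0; pri < 8; pri++) {
                if ((kp = local_queue_get(my_engine, pri)) != NULL)
                    return kp;
                if ((kp = global_queue_get(pri)) != NULL)
                    return kp;
            }
            /* Only other engines' priority 5 queues are candidates for
             * stealing: < 5 will likely be run locally, > 5 is low priority. */
            for (int e = 0; e < num_engines; e++)
                if (e != my_engine &&
                    (kp = steal_from_engine(e, 5)) != NULL)
                    return kp;
            /* Nothing runnable anywhere: keep looping (I/O is polled
             * inside this loop, as covered later). */
        }
    }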

  26. Task-to-Engine Affinity • Tasks may have affinity with an engine • Every task has an affinity stack and an engine mask that determines where it can run • Affinity stack specifies engine the task must run on • ASE may “push” and “pop” the affinity stack to force tasks to run on specific engines • Engine mask specifies engines the task can run on • may prevent some engines from pulling from global queues • limits task stealing between engines • default includes all engines in the mask • controlled with the Logical Process Manager

  27. Run Affinity vs. Network Affinity • This discussion of affinity only applies to run engine • Tasks have both a network engine and a run engine • Network engine affinity is hard and is fixed at login • New connections assigned to engine with fewest active • All network I/O happens on that engine • Network affinity does not migrate with run affinity • Task running on engine 3 may be doing network I/O on engine 2 • task on eng 3 writes data to internal buffer, goes to sleep • network handler on engine 2 wakes up, sends the data • task wakes up and is rescheduled, probably on eng 3 • Tasks running on network engine can write directly to the socket

  28. Network Affinity Assignment • Load balancing determines network affinity • ASE tries to evenly distribute network load • Multiple Network Engines or MNE • Each engine keeps count of tasks w/ network affinity • Engine with lowest count will receive next incoming connections • When new connections arrive: • Network handler runs on chosen engine • All pending connections are drained to that engine • could be more than 1 • Least loaded engine is recalculated • Network handler migrates to new least loaded engine, awaits connections
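
The least-loaded choice amounts to a simple minimum over per-engine counts; a sketch under assumed names:

    #define MAX_ENGINES 128

    extern int num_engines;
    extern int net_task_count[MAX_ENGINES];   /* tasks with network affinity per engine */

    /* Pick the engine that should accept the next incoming connections;
     * the network handler migrates there and awaits them. */
    int least_loaded_network_engine(void)
    {
        int best = 0;
        for (int e = 1; e < num_engines; e++)
            if (net_task_count[e] < net_task_count[best])
                best = e;
        return best;
    }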

  29. Engine-to-CPU Affinity • ASE can affinity an engine to a specific CPU • Aspect of O/S scheduling and not ASE scheduling • Toggled via dbcc tune(“cpuaffinity”,cpu_id,”on | off”) • must be run after each boot • Not supported on HP-UX or Linux • Use available OS tools • Can be used to segregate ASE engines • May reduce some scheduling overhead at the OS • Best to work with OS vendor / admin on recommendations

  30. Soft vs. Hard Run Affinity • Soft run affinity is a scheduling preference • refers to desire to schedule a task on the engine where it last ran • reflects “local tasks first” scheduling algorithm • Hard run affinity is a scheduling requirement • task must run on affinity engine • most common reason is CIS • tasks using CIS will have hard affinity for life • access to MDA tables uses CIS • internal maintenance may require temporary hard affinity • loop through engines, pushing and popping affinity • ex: dynamically allocating additional shared memory • push affinity with eng 0, allocate shared memory • pop affinity back to original engine • for each engine: • push affinity (forces rescheduling on affinity engine) • attach to new shared memory • pop affinity
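
The shared-memory example above might look like this as code, assuming a hypothetical push/pop affinity API:

    extern int  num_engines;
    extern void push_affinity(int engine);    /* forces rescheduling on that engine */
    extern void pop_affinity(void);           /* reverts to the previous affinity   */
    extern void allocate_shared_memory(void);
    extern void attach_shared_memory(void);

    void grow_shared_memory(void)
    {
        push_affinity(0);            /* reschedule onto engine 0         */
        allocate_shared_memory();    /* engine 0 creates the new segment */
        pop_affinity();              /* return to the original engine    */

        for (int e = 0; e < num_engines; e++) {
            push_affinity(e);        /* run on engine e                  */
            attach_shared_memory();  /* engine e maps the new segment    */
            pop_affinity();
        }
    }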

  31. How Affinity and Masking Affect Scheduling • Affinity and mask may prevent a task from being stolen (run by non-local engine) • engine mask compatibility • can’t run if I’m not in the mask • affinity to a different engine • can’t run if task has current affinity elsewhere • high priority global task may be skipped in favor of low priority local task • affinity example: task with hard affinity waking from lock sleep won’t be placed in global queue. • mask example: eng 1 finds priority 4 task in global queue, but task does not have eng 1 in its mask…eng 1 runs local priority 5 task instead.

  32. Final Points on Selection of Runnable Task • Only steal task if its engine has > 1 runnable tasks • assume task would be quickly scheduled on its local engine • maintain locality of reference • Don’t schedule task that just yielded if others are waiting • applies when: • scheduling algorithm has picked from a local queue the task that just ran and • at least one task in the global run queues or • at least one other task in the engine local run queues • back-off mechanism • Prevents a single task from monopolizing engine • runnable queues on other engines not considered here

  33. Almost Complete Scheduling Algorithm [Flowchart, with nodes grouped as Engine Local Tasks, Global Tasks, Non-Local Engine (task stealing), and Non-Task Related Activities: starting from a check for I/O, the scheduler processes deferred events, walks the engine-local priority queues in order (skipping the task that just ran if other local or global tasks are waiting), alternates with the global queues (checking engine mask), and finally cycles through other engines looking for a stealable priority 5 task (checking mask/affinity and that the engine has more than one runnable task), checking I/O again and looping until it can run a task] Note: additional decision points exist for exception checking and are therefore not significant to the algorithm

  34. Illustration: Scheduling a Local Task [Flowchart: highlights the local path through the previous slide’s algorithm: check I/O, process deferred events, walk the engine-local queues in priority order, and run the first eligible task]

  35. Illustration: Task Stealing [Flowchart: highlights the stealing path: finding no local or global tasks, the scheduler processes deferred events, checks I/O, cycles to another engine with more than one runnable task, verifies mask/affinity on a priority 5 task, and runs it]

  36. Agenda • ASE Architectural Overview • Task Scheduling • Disk and Network I/O Polling • Configuration and Monitoring

  37. Scheduler’s Non-Task Activities • Deferred Events • ASE may queue events for later processing • scheduler checks deferred event queue and takes appropriate action • this is lightweight and seldom relevant • Checking I/O • look for new connections, pending reads, pending writes on the network • look for completed disk I/O • this is a significant matter

  38. Deferred Events • Alarm handling is done on a deferred basis • Alarms allow a task to sleep for a specified time • Ex: “waitfor delay”, checkpoint, site handler timeout, etc. • Clock interrupt routine places alarm processor on deferred queue • Only engine 0 processes alarms • Next time the scheduler runs it executes the alarm processor from the deferred queue • alarms are ticked down and “expired” functions are run • typically wakes a sleeping task for normal scheduling • Out-Of-Band network traffic is deferred • Some disk I/Os are deferred

  39. Checking I/O • Checking for I/O is a primary job of the scheduler • Also referred to as “polling” • Network I/O • All open sockets for a given network type (TLI, TCP, etc.) are checked • Requests dispatched for pending connections, reads, and writes • Network types processed round-robin, one per I/O check • Disk I/O • Outstanding async I/Os issued by the engine are checked for completion • Completed disk I/Os are processed and sleeping tasks woken • Ct-Lib I/O • Ct-Lib I/O may be async • Scheduler polls for completed Ct-Lib I/O and invokes callback function
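
As an illustration of what “checking outstanding async I/Os for completion” can look like, here is a sketch using POSIX AIO; ASE’s real code uses platform-specific async I/O interfaces and its own structures.

    #include <aio.h>
    #include <errno.h>
    #include <sys/types.h>

    struct kproc;                                              /* opaque task context */
    extern void wake_task(struct kproc *kp, ssize_t result);   /* hypothetical wakeup */

    /* Returns 1 and wakes the sleeping task if the I/O has finished. */
    int poll_disk_io(struct aiocb *cb, struct kproc *sleeper)
    {
        if (aio_error(cb) == EINPROGRESS)
            return 0;                    /* still outstanding         */
        ssize_t n = aio_return(cb);      /* reap the completed I/O    */
        wake_task(sleeper, n);           /* normal scheduling resumes */
        return 1;
    }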

  40. Poll Based vs. Signal Based I/O • ASE uses poll based I/O for improved throughput • No signals are sent by O/S • ASE periodically checks for completion of I/O • Completed I/Os are processed • Improves the overall throughput of ASE • Lower ASE CPU consumption, higher O/S CPU consumption • Signal based I/O was used in older versions • O/S sends SIGIO interrupt upon I/O completion • ASE interrupt handler runs, sets I/O completion flag • Scheduler processes I/O completion • Higher ASE CPU consumption, lower O/S CPU consumption

  41. Illustration: Signal Based I/O [Sequence diagram: the task issues an I/O and the O/S call returns immediately; when the I/O completes the O/S sends a SIGIO interrupt to ASE; the scheduler processes the I/O completion (which may require an O/S call) and the task completes]

  42. Illustration: Poll Based I/O [Sequence diagram: the task issues an I/O and the O/S call returns immediately; the I/O completes with no signal sent; the scheduler polls for the I/O completion and the task completes]

  43. HP-UX I/O Model Recently Changed • 64-bit HP-UX did signal based network I/O until recently • ASE on HP-UX did signal based disk and network I/O on 11.0.3 • disk I/O poll based in 11.5.1 • network I/O poll based in 11.9.2 • 64-bit HP-UX still did signal based network I/O • 11.9.3 & 12.0 -> signal based (including current 12.0 EBFs) • 12.5, 12.5.0.1, 12.5.0.2 -> signal based • 12.5.0.3, 12.5.1, etc. -> poll based • O/S CPU consumption may increase when going from pre-12.5.0.3 HP-UX 64-bit to 12.5.0.3 and later HP-UX 64-bit • This is due to additional polling for network I/O • ASE should see improved throughput for environments with high rates of network I/O

  44. When to Poll for I/O • Scheduler checks for I/O at two points • Upon entry if either: • A clock tick has arrived since the last I/O check, or • I/O polling process count has been exceeded • # of processes an engine can schedule between I/O checks • Inside the “scheduling loop” • check I/O if no local or global runnable tasks • this point is continually crossed until the scheduler finds a task to run
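
A sketch of the entry-point decision just described, with hypothetical counters maintained by the clock handler and the scheduler:

    extern int tick_since_last_io_check;   /* set when a clock tick arrives  */
    extern int tasks_run_since_io_check;   /* incremented per scheduled task */
    extern int io_polling_process_count;   /* sp_configure value, default 10 */
    extern void check_network_io(void);
    extern void check_disk_io_completions(void);

    void maybe_check_io_on_entry(void)
    {
        if (tick_since_last_io_check ||
            tasks_run_since_io_check >= io_polling_process_count) {
            check_network_io();
            check_disk_io_completions();
            tick_since_last_io_check = 0;
            tasks_run_since_io_check = 0;
        }
    }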

  45. I/O Polling Process Count • Configurable via sp_configure, sets the max number of tasks an engine can run between I/O checks • default is 10 • can be used to balance CPU vs. I/O intensive jobs • has little effect on network polling due to back-off algorithm • only check once per tick when engine is not idle • impact is on disk I/O • reduction may improve response time for disk I/O operations • increase may improve throughput • note that this value is rarely tuned

  46. Blocking vs. Non-Blocking Polling • Non-blocking polling improves throughput • Blocking calls put a process to sleep • O/S schedules process out until call returns • CPU is not consumed by the blocked process • ASE occasionally performs blocking network checks to reduce CPU consumption. • Non-blocking calls maintain CPU • calls return immediately • O/S does not schedule process out due to call • of course, O/S can schedule a process out whenever it decides to. • Application continues processing • CPU is consumed • Most ASE network checks are non-blocking • ASE normally does non-blocking disk checks (exception to be covered)
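
The blocking / non-blocking distinction maps directly onto the timeout argument of a call like poll(2); this sketch only illustrates the O/S-level difference, not ASE’s actual polling code.

    #include <poll.h>

    /* Non-blocking check: timeout 0 returns immediately whether or not
     * any socket is ready, so the engine keeps the CPU. */
    int check_sockets_nonblocking(struct pollfd *fds, nfds_t nfds)
    {
        return poll(fds, nfds, 0);
    }

    /* Blocking check: the process sleeps in the O/S until a socket is
     * ready or the timeout expires, giving up the CPU. */
    int check_sockets_blocking(struct pollfd *fds, nfds_t nfds, int timeout_ms)
    {
        return poll(fds, nfds, timeout_ms);
    }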

  47. Illustration: Blocking Network Check [Sequence diagram: ASE issues a blocking call and is scheduled out by the O/S; when a socket becomes ready or the timeout expires, the blocking call returns and ASE is scheduled back in]

  48. Illustration: Non-Blocking Network Check [Sequence diagram: ASE issues non-blocking poll calls that each return immediately; when a socket becomes ready, there may be some I/O latency before the next poll call picks it up]

  49. Performing Blocking Network Checks • ASE performs blocking network checks at idle time • Lowers O/S CPU consumption by the ASE engine • Trick is determining “idle” • not recognizing idle times wastes CPU • ASE will needlessly perform non-blocking polling • deciding busy time is idle introduces latency • ASE will perform blocking polling when it shouldn’t • Tunable via “runnable process search count” (RPSC) • Count of scheduler loops before it decides to block • considered a “CPU yield” • Use sp_configure

  50. Runnable Process Search Count • The scheduler runs in a loop • Refers to continuously looking for runnable tasks and polling the network when the engine is idle • Provides quickest response / lowest latency to new events • Consumes (wastes) a lot of CPU cycles • not an issue if the CPU is dedicated to ASE • a big issue if ASE has to share the CPU, especially with other ASEs • runnable process search count is the number of loops before a yield • after “rpsc” loops the engine yields, and every network check will block until ASE finds a task to run • higher value = fewer yields = more CPU consumption • next time into the scheduler we loop “rpsc” times again • default value is 2,000 (was 3 on AIX until 12.0.0.4, 12.5.0.1) • value of zero means never yield • some caveats w.r.t disk I/Os…covered later
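
Putting the pieces together, the idle-time back-off might be sketched like this (hypothetical names; the disk I/O caveats mentioned above are ignored):

    extern int  runnable_process_search_count;   /* default 2,000 */
    extern int  no_runnable_task(void);
    extern void check_network_nonblocking(void);
    extern void check_disk_io_completions(void);
    extern void check_network_blocking(void);    /* sleeps until socket activity */

    void scheduler_idle_loop(void)
    {
        int loops = 0;
        while (no_runnable_task()) {
            check_network_nonblocking();
            check_disk_io_completions();
            if (++loops >= runnable_process_search_count) {
                check_network_blocking();   /* the "CPU yield": engine sleeps */
                loops = 0;                  /* loop "rpsc" times again        */
            }
        }
    }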
