ASE104: How ASE Works - SMP Support, Scheduling, and Task Management


Presentation Transcript


  1. ASE104: How ASE Works - SMP Support, Scheduling, and Task Management David Wein Principal Product Support Engineer david.wein@sybase.com August 15-19, 2004

  2. The Enterprise. Unwired.

  3. The Enterprise. Unwired. Industry and Cross Platform Solutions Manage Information Unwire Information Unwire People • Adaptive Server Enterprise • Adaptive Server Anywhere • Sybase IQ • Dynamic Archive • Dynamic ODS • Replication Server • OpenSwitch • Mirror Activator • PowerDesigner • Connectivity Options • EAServer • Industry Warehouse Studio • Unwired Accelerator • Unwired Orchestrator • Unwired Toolkit • Enterprise Portal • Real Time Data Services • SQL Anywhere Studio • M-Business Anywhere • Pylon Family (Mobile Email) • Mobile Sales • XcelleNet Frontline Solutions • PocketBuilder • PowerBuilder Family • AvantGo Sybase Workspace

  4. Goals and Agenda • Session Goal • Provide details on ASE’s internal task and scheduling architecture that will allow DBAs to more efficiently tune, manage, and monitor their ASE installations. • Session Agenda • ASE Architectural Overview • Task Scheduling • Disk and Network I/O Polling • Configuration and Monitoring

  5. ASE Architectural Overview [Diagram: three layers] • ASE Server Layer (the RDBMS Engine): • Databases • Xacts • Logging • Locks • Users • Queries • ASE Kernel Layer (a platform-independent “O/S”): • Tasks • Engines • Scheduler • Spinlocks • Alarms • OS Services • Operating System (physical resources): • CPU • Memory • Disks • Network • Signals

  6. ASE Kernel Services • ASE Kernel Provides a Generic Run-Time Environment for the RDBMS Engine • Task management / scheduling • SMP support (Engines and spinlocks) • Disk I/O • Network I/O • Memory management • System clock / time services / alarms • Signal handling • Native thread interface • Error log services • Security Integration (SSL, LDAP, PAM, Kerberos) • JVM host environment

  7. ASE Process Model ASE Implements a Highly Efficient Internal Threading Model • Overview • ASE creates customized internal “threads” or “tasks” for connections. • No O/S processes are spawned for incoming connections • Native O/S threads are not used for core processing • Windows platform is an exception • Benefits • Small footprint: no need for data segments, heaps, etc. • No context switching at the O/S • Consistent behavior across platforms • ASE has complete and total control over task scheduling

  8. Key Concepts - Engines • Internal ASE representation of an O/S process • Engines provide actual CPU time to internal ASE tasks • ASE supports up to 128 engines • Engines communicate via shared memory • An “ENGINE” itself is a C data structure that exists in ASE shared memory. • On NT, an O/S thread is spawned for each ENGINE, not a separate process • Engines contain: • Prioritized run queues • Deferred event queues • Lists of disk I/Os • Relevance for network I/Os • Housekeeper queues • Lots and lots of status information

  9. Key Concepts - Spinlock • Low-level synchronization mechanism • access to many internal data structures is protected with spinlocks • prevents multiple engines from stepping on each other • ex: one engine reading a page buffer while another engine is writing a new page into that buffer • implemented in assembly language for speed and efficiency • Spinlocks vs. logical locks (row, page, table, etc.) • logical locks: • blocked process goes to sleep • engine can run other tasks while lock is being held • lock holder wakes blocked sleepers when it is done • held for short or long duration • spinlocks: • blocked process “spins” continuously checking if the lock is available • engine can’t do anything else until it gets the spinlock • held for very short duration • Spinlock Contention • how often did I need to wait for a spinlock
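
To make the “spin” concrete, here is a minimal test-and-set spinlock sketch in C11, analogous in spirit to the assembly-level spinlocks described above; all names here are hypothetical, not ASE source.

    #include <stdatomic.h>

    /* Minimal spinlock sketch. Initialize with:
     *   spinlock_t sl = { ATOMIC_FLAG_INIT };
     * A production lock would add backoff and contention counters. */
    typedef struct { atomic_flag locked; } spinlock_t;

    static void spin_grab(spinlock_t *sl)
    {
        /* Busy-wait until the flag is acquired: the engine does no
         * other work while spinning, which is why spinlocks must be
         * held only for very short durations. */
        while (atomic_flag_test_and_set_explicit(&sl->locked,
                                                 memory_order_acquire))
            ;  /* spin */
    }

    static void spin_release(spinlock_t *sl)
    {
        atomic_flag_clear_explicit(&sl->locked, memory_order_release);
    }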

  10. Key Concepts – Kernel Process • An internal ASE task / thread • Also called a KPROC, identified by a KPID…not a SPID. • Represented by a process context structure • Has a stack • The kernel process is what is actually run on an engine. • Fiber API is used for ASE tasks on NT. • KPROCs contain: • Process state and priority • Engine affinity data • Pointer to its stack • Connection with a server level process (SPID) • A jump buffer for scheduling • Monitoring data for MDA tables • Other status information

  11. Kernel Process vs. Server Process [Diagram: a KPROC (KPID, engine mask, jump buffer, stack, MDA stats) linked to a PSS (SPID, user & login, open tables, query data, user log cache, MDA stats, send/recv network buffers)] • Kernel processes may have an associated server process • All server processes are associated with a kernel process • Server process is represented by a PSS, identified by a SPID

  12. Key Concepts - Scheduler • The scheduler is a special kernel process • decides which kernel process gets CPU time on an engine • one scheduler process is dedicated to each engine • polling for network and disk I/O completions occurs within the scheduler context. Most of the presentation focuses on what the scheduler does and why.

  13. Agenda • ASE Architectural Overview • Task Scheduling • Disk and Network I/O Polling • Configuration and Monitoring

  14. Cooperative Processing Model ASE uses non-preemptive scheduling • A running task is responsible for scheduling itself out • Ensures that a task is only scheduled out when we want it to be • Two fundamental reasons a task schedules itself out: • It has to wait for a resource: • physical I/O to bring a page into cache • page lock on a table • a command to run from the client • etc. • considered to be an automatic yield • It has exceeded its time quantum • tasks periodically check how long they have been running • this is called a yield point • if this exceeds a configured limit, they yield. • this is counted as a “voluntary yield”
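
A minimal sketch of what a yield point might look like, assuming a per-task counter that the clock interrupt decrements (hypothetical names, not ASE source):

    extern volatile int timeslice_counter;   /* ticked down by the clock handler */
    extern void yield_to_scheduler(void);    /* hypothetical: switch to the scheduler KPROC */

    void yield_point(void)
    {
        /* If the counter has gone negative, the task has used up its
         * time quantum and schedules itself out ("voluntary yield"). */
        if (timeslice_counter < 0)
            yield_to_scheduler();
    }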

  15. Execution Time Slice • time slice • time, in milliseconds, a task can run without yielding • if a task crosses a yield point after it has exceeded the time slice, it yields • internally tracked in clock ticks, rounded up • default is 100 ms • cpu grace time • number of clock ticks a task is allowed to run after it exceeds its time slice • allows a longer running task to hit a yield point or go to sleep • tasks that exceed their grace time are killed. • default is 500 ticks
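
Plugging in the defaults (and the default clock tick length from slide 19) gives a feel for the scale:

    time slice     = 100 ms ≈ 1 clock tick (at the default 100,000 µs tick)
    cpu grace time = 500 ticks × 100 ms/tick = 50,000 ms = 50 s

So a well-behaved task is expected to yield within roughly one tick, and is killed only after running about 50 seconds past its time slice without yielding.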

  16. [Diagram] Tasks Yielding to the Scheduler: timelines show the time slice counter ticking down from its starting value through 0 to -1, -2, … A task that checks the counter at a yield point and finds it negative performs a voluntary yield; a task that starts a disk I/O (or otherwise sleeps on a resource) performs an automatic yield.

  17. CPU Grace Time • tasks that exceed their time slice + cpu grace time are killed • prevents a run-away task from running forever • possible causes: • system call did not return (most common) • bug in code causing a loop (rare) • lack of sufficient yield points (very rare in modern ASE) [Diagram: the counter runs down from the time slice through 0, -1, -2, … to -grace time, at which point a timeslice error occurs]

  18. Dealing with Time Slice Errors • Recurring time slice errors should be pursued. • Try to find the root cause instead of covering them up • Don’t adjust the time slice parameter • ineffective solution • will affect all tasks, not only those that exceed grace time • cpu grace time should be a temporary solution • consider increasing grace time while the root cause is being investigated • default value of 500 ticks = 50 seconds. This is an eternity! • increasing this will simply cause a longer hang in many cases • Pursue the root cause!

  19. Clock Ticks Ticks are based on alarm signals delivered by the O/S • Clock interrupt handler runs when the alarm arrives • Performs very important accounting work • records if engine is idle, I/O busy, or CPU busy • decrements time slice counter for running process • kills a task that has exceeded its grace time • engine 0 keeps track of alarms • Frequency of clock tick based on “sql server clock tick length” • specified in microseconds, default is 1/10th of a second • generally not recommended to adjust this • on very fast CPUs a lower value may improve CPU accounting
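
A sketch of the per-tick accounting just described, under assumed names (the real handler does considerably more):

    struct kproc { int timeslice; };             /* simplified task context      */
    extern struct kproc *running_task;           /* task on this engine, if any  */
    extern int engine_state;                     /* IDLE, IO_BUSY, or CPU_BUSY   */
    extern int cpu_grace_time;                   /* in ticks, default 500        */
    extern void record_engine_state(int state);
    extern void kill_task(struct kproc *kp);     /* raises a timeslice error     */

    void clock_tick_handler(void)
    {
        record_engine_state(engine_state);       /* idle / I/O busy / CPU busy   */
        if (running_task != NULL) {
            running_task->timeslice--;           /* checked at yield points      */
            if (running_task->timeslice < -cpu_grace_time)
                kill_task(running_task);         /* exceeded time slice + grace  */
        }
    }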

  20. Finer Points of CPU Accounting [Diagram: over three clock ticks, kpid 1 is running at the first tick (counted busy); at the second and third ticks the engine happens to be idle, even though kpids 2 and 3 ran between ticks; cpu busy = 33%] • Accounting is performed during clock tick • ASE doesn’t know how busy it was between ticks • The faster the CPU, the more we can “miss” • Implications of the diagram: • sp_sysmon will report 33% busy, 67% idle • kpids 2 and 3 accumulate zero CPU time • shouldn’t be significant in a 5 minute sp_sysmon • shorter clock tick length could improve accuracy, but the tradeoff is additional overhead for accounting

  21. Run Queue Overview • When a task becomes runnable it is placed on a run queue • Each engine has a set of prioritized run queues • There is also a global run queue • An engine can run a task in its own queues, the global queues, or another engine’s queues • Task to engine affinity may limit this behavior. • Tasks have a priority attribute that determines their run queue • Eight priority slots • Slot 0 is the HIGHEST priority, i.e. first considered for CPU time • Slot 7 is the LOWEST priority, i.e. last considered for CPU time

  22. Prioritized Run Queues (decreasing priority, slot 0 first): • 0 – Kernel Service Tasks: Network Handler / Listeners • 1 – Unused • 2 – Deadlock Tune • 3 – Logins, Site Handler, Async log tasks, wakes from lock or latch sleep, HA takeover • 4 – High Priority User Tasks • 5 – Default User Tasks, HK GC, misc server service tasks • 6 – Low Priority User Tasks • 7 – House-keeper (Wash & Chores only in 12.5.0.3)
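
A minimal sketch of per-engine prioritized run queues as described above, with slot 0 searched first (hypothetical layout, not ASE’s actual structures):

    #define NUM_PRIORITIES 8

    struct kproc {
        int           priority;   /* 0 (highest) .. 7 (lowest) */
        struct kproc *next;       /* run-queue linkage         */
    };

    struct runqueue {
        struct kproc *head[NUM_PRIORITIES];
        struct kproc *tail[NUM_PRIORITIES];
    };

    /* Dequeue the highest-priority runnable task, or NULL if all
     * eight slots are empty. */
    struct kproc *runq_get(struct runqueue *rq)
    {
        for (int pri = 0; pri < NUM_PRIORITIES; pri++) {
            struct kproc *kp = rq->head[pri];
            if (kp != NULL) {
                rq->head[pri] = kp->next;
                if (rq->head[pri] == NULL)
                    rq->tail[pri] = NULL;
                return kp;
            }
        }
        return NULL;
    }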

  23. Priority Changes • Tasks may change their priority • Strategic priority adjustment helps ASE run smoothly • Lock Sleep and Latch Sleep • tasks change to priority 3 prior to sleeping on a lock or latch • revert to original priority upon wake-up • high priority helps waking task get scheduled, reduces contention • Housekeeper queues filling • HK chore task may be “accelerated” to priority 5 if its queues reach > 80% full. • Login processing is done at priority 3 • helps newly spawning user process get CPU time to speed login • priority reset after login completes • Priority may be altered using the Logical Process Manager • discussed later…

  24. Global Queues and Task Stealing • Tasks may migrate between engines • Engines can run any task from local run queues, global run queues, or may steal tasks from another engine’s run queues. • Local tasks preferred due to locality of reference • local tasks at a given priority always have preference • Global queues help important tasks on SMP systems • global queue checked after local queue in priority order • local pri 0, global pri 0, local pri 1, global pri 1, etc. • waking priority 3 tasks are placed in the global queue • ensures they are run by first available engine • Tasks may be stolen from another engine • “idle” engines may check priority 5 queue of other engines • improves load balancing

  25. Basic Scheduling Algorithm We now know enough to give a basic look at the scheduling algorithm [Diagram: engines 0–3, each with local run queues at priorities 0–7, plus a set of global queues 0–7] • Alternate local and global queues by priority • Cycle through priority 5 queues on other engines • Why only priority 5? < 5 will likely be run locally, > 5 is low priority • Keep looping until a runnable task is found
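
The selection order on this slide can be sketched as a loop (hypothetical helpers; exception checks and the I/O polling covered later are omitted):

    struct kproc;   /* opaque task context */
    extern int num_engines;
    extern struct kproc *local_queue_get(int engine, int pri);
    extern struct kproc *global_queue_get(int pri);
    extern struct kproc *steal_from_engine(int engine, int pri);

    struct kproc *pick_runnable(int my_engine)
    {
        for (;;) {
            struct kproc *kp;
            /* local pri 0, global pri 0, local pri 1, global pri 1, ... */
            for (int pri = 0; pri < 8; pri++) {
                if ((kp = local_queue_get(my_engine, pri)) != NULL)
                    return kp;
                if ((kp = global_queue_get(pri)) != NULL)
                    return kp;
            }
            /* Only other engines' priority 5 queues are candidates for
             * stealing: < 5 will likely be run locally, > 5 is low priority. */
            for (int e = 0; e < num_engines; e++)
                if (e != my_engine &&
                    (kp = steal_from_engine(e, 5)) != NULL)
                    return kp;
            /* Nothing runnable anywhere: keep looping (I/O is polled
             * inside this loop, as covered later). */
        }
    }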

  26. Task-to-Engine Affinity • Tasks may have affinity with an engine • Every task has an affinity stack and an engine mask that determines where it can run • Affinity stack specifies engine the task must run on • ASE may “push” and “pop” the affinity stack to force tasks to run on specific engines • Engine mask specifies engines the task can run on • may prevent some engines from pulling from global queues • limits task stealing between engines • default includes all engines in the mask • controlled with the Logical Process Manager

  27. Run Affinity vs. Network Affinity • This discussion of affinity only applies to run engine • Tasks have both a network engine and a run engine • Network engine affinity is hard and is fixed at login • New connections assigned to engine with fewest active • All network I/O happens on that engine • Network affinity does not migrate with run affinity • Task running on engine 3 may be doing network I/O on engine 2 • task on eng 3 writes data to internal buffer, goes to sleep • network handler on engine 2 wakes up, sends the data • task wakes up and is rescheduled, probably on eng 3 • Tasks running on network engine can write directly to the socket

  28. Network Affinity Assignment • Load balancing determines network affinity • ASE tries to evenly distribute network load • Multiple Network Engines or MNE • Each engine keeps count of tasks w/ network affinity • Engine with lowest count will receive next incoming connections • When new connections arrive: • Network handler runs on chosen engine • All pending connections are drained to that engine • could be more than 1 • Least loaded engine is recalculated • Network handler migrates to new least loaded engine, awaits connections
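
The least-loaded choice amounts to a simple minimum over per-engine counts; a sketch under assumed names:

    #define MAX_ENGINES 128

    extern int num_engines;
    extern int net_task_count[MAX_ENGINES];   /* tasks with network affinity per engine */

    /* Pick the engine that should accept the next incoming connections;
     * the network handler migrates there and awaits them. */
    int least_loaded_network_engine(void)
    {
        int best = 0;
        for (int e = 1; e < num_engines; e++)
            if (net_task_count[e] < net_task_count[best])
                best = e;
        return best;
    }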

  29. Engine-to-CPU Affinity • ASE can affinity an engine to a specific CPU • Aspect of O/S scheduling and not ASE scheduling • Toggled via dbcc tune(“cpuaffinity”,cpu_id,”on | off”) • must be run after each boot • Not supported on HP-UX or Linux • Use available OS tools • Can be used to segregate ASE engines • May reduce some scheduling overhead at the OS • Best to work with OS vendor / admin on recommendations

  30. Soft vs. Hard Run Affinity • Soft run affinity is a scheduling preference • refers to desire to schedule a task on the engine where it last ran • reflects “local tasks first” scheduling algorithm • Hard run affinity is a scheduling requirement • task must run on affinity engine • most common reason is CIS • tasks using CIS will have hard affinity for life • access to MDA tables uses CIS • internal maintenance may require temporary hard affinity • loop through engines, pushing and popping affinity • ex: dynamically allocating additional shared memory • push affinity with eng 0, allocate shared memory • pop affinity back to original engine • for each engine: • push affinity (forces rescheduling on affinity engine) • attach to new shared memory • pop affinity
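
The shared-memory example above might look like this as code, assuming a hypothetical push/pop affinity API:

    extern int  num_engines;
    extern void push_affinity(int engine);    /* forces rescheduling on that engine */
    extern void pop_affinity(void);           /* reverts to the previous affinity   */
    extern void allocate_shared_memory(void);
    extern void attach_shared_memory(void);

    void grow_shared_memory(void)
    {
        push_affinity(0);            /* reschedule onto engine 0         */
        allocate_shared_memory();    /* engine 0 creates the new segment */
        pop_affinity();              /* return to the original engine    */

        for (int e = 0; e < num_engines; e++) {
            push_affinity(e);        /* run on engine e                  */
            attach_shared_memory();  /* engine e maps the new segment    */
            pop_affinity();
        }
    }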

  31. How Affinity and Masking Affect Scheduling • Affinity and mask may prevent a task from being stolen (run by non-local engine) • engine mask compatibility • can’t run if I’m not in the mask • affinity to a different engine • can’t run if task has current affinity elsewhere • high priority global task may be skipped in favor of low priority local task • affinity example: task with hard affinity waking from lock sleep won’t be placed in global queue. • mask example: eng 1 finds priority 4 task in global queue, but task does not have eng 1 in its mask…eng 1 runs local priority 5 task instead.

  32. Final Points on Selection of Runnable Task • Only steal task if its engine has > 1 runnable tasks • assume task would be quickly scheduled on its local engine • maintain locality of reference • Don’t schedule task that just yielded if others are waiting • applies when: • scheduling algorithm has picked from a local queue the task that just ran and • at least one task in the global run queues or • at least one other task in the engine local run queues • back-off mechanism • Prevents a single task from monopolizing engine • runnable queues on other engines not considered here

  33. Almost Complete Scheduling Algorithm [Flowchart, with nodes grouped as Engine Local Tasks, Global Tasks, Non-Local Engine (task stealing), and Non-Task Related Activities: starting from a check for I/O, the scheduler processes deferred events, walks the engine-local priority queues in order (skipping the task that just ran if other local or global tasks are waiting), alternates with the global queues (checking engine mask), and finally cycles through other engines looking for a stealable priority 5 task (checking mask/affinity and that the engine has more than one runnable task), checking I/O again and looping until it can run a task] Note: additional decision points exist for exception checking and are therefore not significant to the algorithm

  34. Illustration: Scheduling a Local Task [Flowchart: highlights the local path through the previous slide’s algorithm: check I/O, process deferred events, walk the engine-local queues in priority order, and run the first eligible task]

  35. Illustration: Task Stealing [Flowchart: highlights the stealing path: finding no local or global tasks, the scheduler processes deferred events, checks I/O, cycles to another engine with more than one runnable task, verifies mask/affinity on a priority 5 task, and runs it]

  36. Agenda • ASE Architectural Overview • Task Scheduling • Disk and Network I/O Polling • Configuration and Monitoring

  37. Scheduler’s Non-Task Activities • Deferred Events • ASE may queue events for later processing • scheduler checks deferred event queue and takes appropriate action • this is lightweight and seldom relevant • Checking I/O • look for new connections, pending reads, pending writes on the network • look for completed disk I/O • this is a significant matter

  38. Deferred Events • Alarm handling is done on a deferred basis • Alarms allow a task to sleep for a specified time • Ex: “waitfor delay”, checkpoint, site handler timeout, etc. • Clock interrupt routine places alarm processor on deferred queue • Only engine 0 processes alarms • Next time the scheduler runs it executes the alarm processor from the deferred queue • alarms are ticked down and “expired” functions are run • typically wakes a sleeping task for normal scheduling • Out-Of-Band network traffic is deferred • Some disk I/Os are deferred

  39. Checking I/O • Checking for I/O is a primary job of the scheduler • Also referred to as “polling” • Network I/O • All open sockets for a given network type (TLI, TCP, etc.) are checked • Requests dispatched for pending connections, reads, and writes • Network types processed round-robin, one per I/O check • Disk I/O • Outstanding async I/Os issued by the engine are checked for completion • Completed disk I/Os are processed and sleeping tasks woken • Ct-Lib I/O • Ct-Lib I/O may be async • Scheduler polls for completed Ct-Lib I/O and invokes callback function
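
As an illustration of what “checking outstanding async I/Os for completion” can look like, here is a sketch using POSIX AIO; ASE’s real code uses platform-specific async I/O interfaces and its own structures.

    #include <aio.h>
    #include <errno.h>
    #include <sys/types.h>

    struct kproc;                                              /* opaque task context */
    extern void wake_task(struct kproc *kp, ssize_t result);   /* hypothetical wakeup */

    /* Returns 1 and wakes the sleeping task if the I/O has finished. */
    int poll_disk_io(struct aiocb *cb, struct kproc *sleeper)
    {
        if (aio_error(cb) == EINPROGRESS)
            return 0;                    /* still outstanding         */
        ssize_t n = aio_return(cb);      /* reap the completed I/O    */
        wake_task(sleeper, n);           /* normal scheduling resumes */
        return 1;
    }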

  40. Poll Based vs. Signal Based I/O • ASE uses poll based I/O for improved throughput • No signals are sent by O/S • ASE periodically checks for completion of I/O • Completed I/Os are processed • Improves the overall throughput of ASE • Lower ASE CPU consumption, higher O/S CPU consumption • Signal based I/O was used in older versions • O/S sends SIGIO interrupt upon I/O completion • ASE interrupt handler runs, sets I/O completion flag • Scheduler processes I/O completion • Higher ASE CPU consumption, lower O/S CPU consumption

  41. Illustration: Signal Based I/O [Sequence diagram: the task issues an I/O and the O/S call returns immediately; when the I/O completes the O/S sends a SIGIO interrupt to ASE; the scheduler processes the I/O completion (which may require an O/S call) and the task completes]

  42. Illustration: Poll Based I/O [Sequence diagram: the task issues an I/O and the O/S call returns immediately; the I/O completes with no signal sent; the scheduler polls for the I/O completion and the task completes]

  43. HP-UX I/O Model Recently Changed • 64-bit HP-UX did signal based network I/O until recently • ASE on HP-UX did signal based disk and network I/O on 11.0.3 • disk I/O poll based in 11.5.1 • network I/O poll based in 11.9.2 • 64-bit HP-UX still did signal based network I/O • 11.9.3 & 12.0 -> signal based (including current 12.0 EBFs) • 12.5, 12.5.0.1, 12.5.0.2 -> signal based • 12.5.0.3, 12.5.1, etc. -> poll based • O/S CPU consumption may increase when going from pre-12.5.0.3 HP-UX 64-bit to 12.5.0.3 and later HP-UX 64-bit • This is due to additional polling for network I/O • ASE should see improved throughput for environments with high rates of network I/O

  44. When to Poll for I/O • Scheduler checks for I/O at two points • Upon entry if either: • A clock tick has arrived since the last I/O check, or • I/O polling process count has been exceeded • # of processes an engine can schedule between I/O checks • Inside the “scheduling loop” • check I/O if no local or global runnable tasks • this point is continually crossed until the scheduler finds a task to run
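
A sketch of the entry-point decision just described, with hypothetical counters maintained by the clock handler and the scheduler:

    extern int tick_since_last_io_check;   /* set when a clock tick arrives  */
    extern int tasks_run_since_io_check;   /* incremented per scheduled task */
    extern int io_polling_process_count;   /* sp_configure value, default 10 */
    extern void check_network_io(void);
    extern void check_disk_io_completions(void);

    void maybe_check_io_on_entry(void)
    {
        if (tick_since_last_io_check ||
            tasks_run_since_io_check >= io_polling_process_count) {
            check_network_io();
            check_disk_io_completions();
            tick_since_last_io_check = 0;
            tasks_run_since_io_check = 0;
        }
    }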

  45. I/O Polling Process Count • Configurable via sp_configure, sets the max number of tasks an engine can run between I/O checks • default is 10 • can be used to balance CPU vs. I/O intensive jobs • has little effect on network polling due to back-off algorithm • only check once per tick when engine is not idle • impact is on disk I/O • reduction may improve response time for disk I/O operations • increase may improve throughput • note that this value is rarely tuned

  46. Blocking vs. Non-Blocking Polling • Non-blocking polling improves throughput • Blocking calls put a process to sleep • O/S schedules process out until call returns • CPU is not consumed by the blocked process • ASE occasionally performs blocking network checks to reduce CPU consumption. • Non-blocking calls maintain CPU • calls return immediately • O/S does not schedule process out due to call • of course, O/S can schedule a process out whenever it decides to. • Application continues processing • CPU is consumed • Most ASE network checks are non-blocking • ASE normally does non-blocking disk checks (exception to be covered)
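
The blocking / non-blocking distinction maps directly onto the timeout argument of a call like poll(2); this sketch only illustrates the O/S-level difference, not ASE’s actual polling code.

    #include <poll.h>

    /* Non-blocking check: timeout 0 returns immediately whether or not
     * any socket is ready, so the engine keeps the CPU. */
    int check_sockets_nonblocking(struct pollfd *fds, nfds_t nfds)
    {
        return poll(fds, nfds, 0);
    }

    /* Blocking check: the process sleeps in the O/S until a socket is
     * ready or the timeout expires, giving up the CPU. */
    int check_sockets_blocking(struct pollfd *fds, nfds_t nfds, int timeout_ms)
    {
        return poll(fds, nfds, timeout_ms);
    }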

  47. Illustration: Blocking Network Check [Sequence diagram: ASE issues a blocking call and is scheduled out by the O/S; when a socket becomes ready or the timeout expires, the blocking call returns and ASE is scheduled back in]

  48. Illustration: Non-Blocking Network Check [Sequence diagram: ASE issues non-blocking poll calls that each return immediately; when a socket becomes ready, there may be some I/O latency before the next poll call picks it up]

  49. Performing Blocking Network Checks • ASE performs blocking network checks at idle time • Lowers O/S CPU consumption by the ASE engine • Trick is determining “idle” • not recognizing idle times wastes CPU • ASE will needlessly perform non-blocking polling • deciding busy time is idle introduces latency • ASE will perform blocking polling when it shouldn’t • Tunable via “runnable process search count” (RPSC) • Count of scheduler loops before it decides to block • considered a “CPU yield” • Use sp_configure

  50. Runnable Process Search Count • The scheduler runs in a loop • Refers to continuously looking for runnable tasks and polling the network when the engine is idle • Provides quickest response / lowest latency to new events • Consumes (wastes) a lot of CPU cycles • not an issue if the CPU is dedicated to ASE • a big issue if ASE has to share the CPU, especially with other ASEs • runnable process search count is the number of loops before a yield • after “rpsc” loops the engine yields, and every network check will block until ASE finds a task to run • higher value = fewer yields = more CPU consumption • next time into the scheduler we loop “rpsc” times again • default value is 2,000 (was 3 on AIX until 12.0.0.4, 12.5.0.1) • value of zero means never yield • some caveats w.r.t disk I/Os…covered later
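
Putting the pieces together, the idle-time back-off might be sketched like this (hypothetical names; the disk I/O caveats mentioned above are ignored):

    extern int  runnable_process_search_count;   /* default 2,000 */
    extern int  no_runnable_task(void);
    extern void check_network_nonblocking(void);
    extern void check_disk_io_completions(void);
    extern void check_network_blocking(void);    /* sleeps until socket activity */

    void scheduler_idle_loop(void)
    {
        int loops = 0;
        while (no_runnable_task()) {
            check_network_nonblocking();
            check_disk_io_completions();
            if (++loops >= runnable_process_search_count) {
                check_network_blocking();   /* the "CPU yield": engine sleeps */
                loops = 0;                  /* loop "rpsc" times again        */
            }
        }
    }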
