Optimizing Sharing Patterns and Locality via Thread Migration
Vadim Gleizer
Supervisor: Prof. Assaf Schuster
Contributions of this research
• Internal Distributed Shared Memory (DSM) Mechanisms
• Thread Migration (TM) in DSM Systems
• Load Balancing in DSM Systems
Internal DSM Mechanisms
• An internal DSM mechanism, or DSM handler, is responsible for guaranteeing a consistent memory view on each workstation. It works as follows (see the sketch below):
  • when a DSM region becomes invalid, it is protected
  • each access to the protected area then causes an exception
  • the internal DSM mechanism catches and handles these exceptions
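The slides do not show the protection step itself; as a minimal sketch, assuming a hypothetical helper name, invalidating a DSM region on Win32 could look like this:

    /* Hedged sketch: invalidating a DSM region on Win32. The function name
       dsm_invalidate is an assumption, not actual Millipede code. */
    #include <windows.h>

    void dsm_invalidate(void *region, SIZE_T size)
    {
        DWORD oldProtect;
        /* Revoke all access rights: the next read or write of this region
           raises an access-violation exception for the DSM handler. */
        VirtualProtect(region, size, PAGE_NOACCESS, &oldProtect);
    }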
Implementation of DSM Handlers
• An exception handling service provided by the operating system significantly simplifies this task
• Consider the Win32 Structured Exception Handling (SEH) service of Windows NT:
  • a block of code that is allowed to use DSM is wrapped in an exception block using the Win32 __try/__except keywords, similarly to try/catch blocks in C++ (an expanded sketch follows):

      __try { user_main(); } __except (DSM_handler()) { }

• Let us see how such services work and the drawbacks of using them
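Expanding the one-liner above into a fuller sketch: the filter expression can inspect the exception through GetExceptionInformation(); for access violations, ExceptionInformation[1] holds the faulting address. The dsm_* helpers below are assumptions, not Millipede's actual API:

    /* Hedged sketch of an SEH-based DSM entry point. */
    #include <windows.h>

    int  dsm_is_dsm_address(void *addr);   /* hypothetical helpers */
    void dsm_fetch_page(void *addr);
    void user_main(void);

    static int dsm_filter(EXCEPTION_POINTERS *ep)
    {
        EXCEPTION_RECORD *rec = ep->ExceptionRecord;
        if (rec->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
            void *addr = (void *)rec->ExceptionInformation[1];
            if (dsm_is_dsm_address(addr)) {
                dsm_fetch_page(addr);     /* bring the page, fix protection */
                return EXCEPTION_CONTINUE_EXECUTION;  /* retry the access */
            }
        }
        return EXCEPTION_CONTINUE_SEARCH; /* not a DSM fault: keep looking */
    }

    int main(void)
    {
        __try {
            user_main();
        } __except (dsm_filter(GetExceptionInformation())) {
            /* unreached: the filter either resumes or continues the search */
        }
        return 0;
    }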
Inside the SEH Service
• For each type of exception the CPU generates a code, e.g., division by zero has code 0, a page fault has code 0xE, and a GPF (General Protection Fault) exception has code 0xD
• In the case of a page fault exception, the kernel routine _KiTrap0E is called
Inside the SEH Service (cont.)
• The following sequence of calls occurs before control is passed to the DSM_handler:
  1. _KiTrap0E
  2. KiUserExceptionDispatcher
  3. RtlDispatchException
  4. RtlpExecuteHandlerForException
  5. ExecuteHandler
  6. __except_handler3
  7. DSM_handler
Drawbacks of using SEH in DSM Systems
• Performance:
  • the SEH service is highly time-consuming, while most of its functionality is unnecessary for the DSM handler
• Transparency:
  • user exception handlers are called before the DSM handler, so the programmer may accidentally intercept a DSM exception
  • the internal DSM handler should work transparently to the programmer
  • thus, if the programmer does not know that the DSM handler uses SEH, he/she may accidentally intercept a DSM exception
User-Mode First-Chance Exception Handling
• UMFC-EH:
  • only the kernel-level part of SEH is used, i.e., the DSM_handler is called directly by _KiTrap0E
  • thus, exceptions are intercepted before any of the SEH user-mode functions is called:
    • _KiTrap0E → DSM_handler
    • instead of _KiTrap0E → KiUserExceptionDispatcher → RtlDispatchException → RtlpExecuteHandlerForException → ExecuteHandler → __except_handler3 → DSM_handler
  • the Detours library may be used to implement this scheme (a sketch follows)
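The slides do not include the hooking code. As a hedged sketch, Detours could redirect ntdll's KiUserExceptionDispatcher, the first user-mode function on the exception path, to a DSM-first dispatcher; dsm_handle_fault is an assumed DSM entry point, and a production hook would also have to respect KiUserExceptionDispatcher's unusual, non-returning calling convention:

    /* Hedged sketch: UMFC-EH via a Detours hook on KiUserExceptionDispatcher. */
    #include <windows.h>
    #include <detours.h>

    typedef VOID (NTAPI *KiUserExcDisp_t)(EXCEPTION_RECORD *, CONTEXT *);
    typedef LONG (NTAPI *NtContinue_t)(CONTEXT *, BOOLEAN);

    static KiUserExcDisp_t RealDispatcher;
    static NtContinue_t    RealNtContinue;

    int dsm_handle_fault(EXCEPTION_RECORD *rec, CONTEXT *ctx); /* assumed */

    static VOID NTAPI HookedDispatcher(EXCEPTION_RECORD *rec, CONTEXT *ctx)
    {
        /* First chance: let the DSM handler try to satisfy the fault. */
        if (rec->ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&
            dsm_handle_fault(rec, ctx))
            RealNtContinue(ctx, FALSE);   /* does not return on success */
        RealDispatcher(rec, ctx);         /* otherwise the normal SEH path */
    }

    void install_umfc_hook(void)
    {
        HMODULE ntdll = GetModuleHandleA("ntdll.dll");
        RealDispatcher = (KiUserExcDisp_t)
            GetProcAddress(ntdll, "KiUserExceptionDispatcher");
        RealNtContinue = (NtContinue_t)
            GetProcAddress(ntdll, "NtContinue");

        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach((PVOID *)&RealDispatcher, (PVOID)HookedDispatcher);
        DetourTransactionCommit();
    }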
UMFC-EH (cont.)
• Advantages:
  • solves both drawbacks of the SEH service
  • no __try/__except blocks are needed
• Drawbacks:
  • the kernel-level part of SEH is still used
  • all exceptions are intercepted, e.g., division by zero
Kernel-Mode First-Chance Exception Handling
• KMFC-EH:
  • exceptions are intercepted in kernel mode by a special supervisor-level device driver, which we call DSM_filter (a sketch of one possible interception technique follows)
  • the DSM_filter informs the DSM_handler about DSM exceptions
  • thus, the SEH service is not used
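The presentation does not show how DSM_filter is attached to the page-fault path. One classical technique on single-processor x86 NT is for the driver to patch the page-fault entry (vector 0xE) in the Interrupt Descriptor Table; the following is only a hedged sketch of that idea, with the assembly stub that filters faults and chains to the original handler omitted:

    /* Hedged sketch: redirecting IDT vector 0xE to a driver-supplied stub
       (x86, uniprocessor). The stub, not shown, must chain to the original
       handler for page faults that do not belong to the DSM. */
    #include <ntddk.h>

    #pragma pack(push, 1)
    typedef struct { USHORT Limit; ULONG Base; } IDTR;
    typedef struct {
        USHORT OffsetLow;
        USHORT Selector;
        UCHAR  Reserved;
        UCHAR  Type;
        USHORT OffsetHigh;
    } IDT_ENTRY;
    #pragma pack(pop)

    ULONG OriginalPageFaultHandler;  /* used by the stub to chain */

    void HookPageFaultVector(ULONG NewHandler)
    {
        IDTR idtr;
        IDT_ENTRY *pf;

        __asm sidt idtr                          /* locate the IDT */
        pf = (IDT_ENTRY *)idtr.Base + 0x0E;      /* page-fault vector */
        OriginalPageFaultHandler =
            ((ULONG)pf->OffsetHigh << 16) | pf->OffsetLow;

        _disable();                              /* no interrupts while patching */
        pf->OffsetLow  = (USHORT)(NewHandler & 0xFFFF);
        pf->OffsetHigh = (USHORT)(NewHandler >> 16);
        _enable();
    }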
KMFC-EH (cont.)
• Advantages:
  • preserves all the advantages of the UMFC-EH scheme
  • SEH is not used at all, i.e., the CPU directly informs the DSM_filter about page fault exceptions
  • only page fault exceptions are intercepted
• Drawbacks:
  • all page fault exceptions are intercepted by the DSM_filter, including those of other processes
  • fortunately, the overhead this introduces is low
Performance Evaluation
• Our experimental environment consists of the Millipede 4.0 DSM system running on a cluster of eight uniprocessor workstations interconnected by a switched Myrinet LAN
• Each workstation is equipped with:
  • a Pentium-II 300MHz processor
  • 128MB of RAM
  • 512KB of L2 cache
  • the Windows NT 4.0 SP6 operating system
• We have tested our DSM handlers on several benchmarks commonly used for DSM, as well as on microbenchmarks
Performance Evaluation (cont.)
• Microbenchmarks (100,000 page faults): [results table not preserved]
• Related results from the Brazos DSM system: [results table not preserved]
Thread Migration (TM) in DSM Systems
• Introduction:
  • a thread can be stopped at almost any point of its execution and relaunched on another machine from the point where it was stopped
  • applications of this facility:
    • load balancing
    • communication reduction
    • fault tolerance
    • cluster management
    • a powerful programming primitive
Designing a TM Mechanism
• Restrictions on TM – there are situations in which migration makes no sense:
  • the thread owns a local operating system resource, e.g., a synchronization object
  • the thread executes a location-dependent operation, e.g., prints a message
• Therefore the programmer should be aware of thread migration and explicitly mark situations in which a thread cannot migrate (a sketch of such marking follows)
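The slides do not name Millipede's marking primitives; a hypothetical sketch of what such bracketing could look like (both function names are illustrative only):

    /* Hypothetical sketch: bracketing a non-migratable section. */
    #include <stdio.h>

    void dsm_disable_migration(void);  /* pin the thread to this host */
    void dsm_enable_migration(void);   /* make it migratable again    */

    void report_progress(int step)
    {
        dsm_disable_migration();
        /* printing is location-dependent: the output must stay on this host */
        printf("step %d done\n", step);
        dsm_enable_migration();
    }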
Designing a TM Mechanism (cont.)
• The state of a thread consists of:
  • code
  • global data
  • heap data
  • stack data
  • the processor's register set
  • other thread-specific data
Designing a TM Mechanism (cont.)
[Figure: the same stack variable A occupies addresses 1000–1004 on Host 1 but 2000–2004 on Host 2, so pointers into the stack become invalid after migration]
Designing a TM Mechanism (cont.)
• Stack address translation
• Drawbacks:
  • register values and stack values have to be inspected and possibly updated (very inefficient for large stacks)
  • identification of pointers (correctness: a value may merely resemble a pointer); possible solutions:
    • special compiler or hardware support – more complex compiled code, often prevents compiler optimizations
    • special programming primitives that register all pointers – harm efficiency and simplicity of programming, and limit the free use of pointers
  • the whole stack has to be copied at migration time
Designing a TM Mechanism (cont.)
• Creating all mobile threads at DSM initialization time
• Advantages:
  • no pointer inspection and modification
• Drawbacks:
  • lack of scalability – the maximum number of threads is created on each host
  • lack of portability – may not work in future versions of the same operating system and cannot be used for heterogeneous systems
  • the whole stack has to be copied at migration time
Designing a TM Mechanism (cont.)
• Placement of stacks in a predefined memory region
• Advantages:
  • no pointer inspection and modification
  • scalability – threads are created on application demand or at migration time
  • portability
• Drawbacks:
  • the whole stack has to be copied at migration time
Designing a TM Mechanism (cont.)
• Placement of stacks in a DSM region
• Advantages:
  • preserves all the advantages of the previous approach
  • the stack does not have to be copied at migration
Implementation of TM
• Placement of stacks in a predefined memory region, or the default stack approach (see the sketch below):
  • the same address region is reserved at DSM initialization time on each host
  • at creation, each thread receives a slot for its stack according to its id
  • UNIX-like operating systems provide, inside their thread creation API, an option to control the stack location
  • this approach is difficult to implement in Windows NT since there is no conventional way to control the stack location
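A minimal sketch of the slot arithmetic, assuming a hypothetical region base address and a fixed per-thread slot size (neither is specified in the slides):

    /* Hedged sketch: reserving the shared stack region and computing the
       per-thread slot. Base address and slot size are assumptions. */
    #include <windows.h>

    #define STACK_REGION_BASE ((void *)0x30000000)  /* same on every host */
    #define STACK_SLOT_SIZE   (256 * 1024)          /* per-thread stack  */
    #define MAX_THREADS       1024

    void *reserve_stack_region(void)
    {
        /* MEM_RESERVE only: physical pages are committed on demand. */
        return VirtualAlloc(STACK_REGION_BASE,
                            (SIZE_T)MAX_THREADS * STACK_SLOT_SIZE,
                            MEM_RESERVE, PAGE_READWRITE);
    }

    void *stack_slot_for(int thread_id)
    {
        /* Identical ids map to identical addresses on all hosts. */
        return (char *)STACK_REGION_BASE
               + (SIZE_T)thread_id * STACK_SLOT_SIZE;
    }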
Implementation of TM (cont.)
• Stack location control in Windows NT (see the sketch below):
  1. an application asks the DSM system to create a thread
  2. the thread is created in the suspended state (the initial stack is empty)
  3. the address of the initial stack is obtained through the thread's ESP register, and that stack is freed
  4. the value of the ESP register is changed to the new stack location
  5. a pointer to the Win32 data structure – the Thread Information Block (TIB) – is obtained through the FS register
  6. two fields inside the TIB are modified accordingly: pvStackUserTop and pvStackUserBase
  7. the thread is resumed
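A hedged sketch of these steps using the documented Win32 calls; freeing the original stack and patching the TIB (which must be reached through FS from inside the target thread) are only outlined in comments:

    /* Hedged sketch: relocating a new thread's stack on x86 Windows NT. */
    #include <windows.h>

    HANDLE create_thread_on_slot(LPTHREAD_START_ROUTINE fn, void *arg,
                                 void *slot_top /* top of the stack slot */)
    {
        DWORD tid;
        CONTEXT ctx;
        HANDLE h;

        /* 1-2. Create the thread suspended; its default stack is empty. */
        h = CreateThread(NULL, 0, fn, arg, CREATE_SUSPENDED, &tid);

        /* 3. Read ESP to locate (and later VirtualFree) the default stack. */
        ctx.ContextFlags = CONTEXT_FULL;
        GetThreadContext(h, &ctx);
        /* ... free the region containing ctx.Esp ... */

        /* 4. Point ESP into the new slot. */
        ctx.Esp = (DWORD)slot_top;
        SetThreadContext(h, &ctx);

        /* 5-6. The TIB fields pvStackUserTop/pvStackUserBase must also be
           patched so that NT's stack-limit checks see the new slot; the
           TIB is reachable through FS from inside the thread (not shown). */

        /* 7. Run on the relocated stack. */
        ResumeThread(h);
        return h;
    }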
Implementation of TM (cont.)
• Placement of stacks in a DSM region:
  • a separate region is added to the DSM
  • the stack location of a thread is changed to a slot inside the new DSM region, similarly to the previous approach
  • however, the stack cannot be handled as a regular DSM region
Implementation of TM (cont.)
• Why can't a thread's stack be handled as a regular DSM region? Consider an example [figure: thread A migrates from host 1 to host 2]:
  • thread A migrates from host 1 to host 2
  • the stack of thread A remains on host 1 since it is placed in DSM; therefore the first access to the stack causes a page fault exception
  • the DSM_handler should be called in order to bring the missing part of the stack
  • however, the stack is protected, so the DSM_handler cannot be called in the regular way (it would itself need to run on the protected stack) ...
Implementation of TM (cont.)
• The auxiliary stack approach (sketched below):
  • this approach is based on the KMFC-EH technique
  • a memory region, called the auxiliary stacks region, is allocated at DSM initialization time on each host
  • page fault exceptions are intercepted by the DSM_filter (driver) at kernel level
  • when an exception occurs on a stack, the DSM_filter changes the stack location of the thread to a slot inside the auxiliary stacks region and calls the DSM_handler
  • the DSM_handler brings the page of the original stack, sets the appropriate protection, switches the stack back, and transfers control to the thread
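A heavily simplified, hedged sketch of this control flow; all names are hypothetical, and the real DSM_filter operates on the kernel trap frame of the faulting thread:

    /* Hedged sketch of the auxiliary-stack switch in DSM_filter. */
    typedef struct { unsigned long Esp; /* ...other saved registers... */ } TrapFrame;

    int   InStackRegion(void *addr);             /* assumed helpers */
    void *AuxSlotFor(int threadId);
    void  DsmHandlerFetchStackPage(void *addr);
    void  ChainToOriginalHandler(TrapFrame *tf);
    void  ResumeFaultingThread(TrapFrame *tf);

    void DsmFilterOnPageFault(TrapFrame *tf, void *faultAddr, int threadId)
    {
        unsigned long savedEsp;
        if (!InStackRegion(faultAddr)) {
            ChainToOriginalHandler(tf);          /* not a DSM stack fault */
            return;
        }
        /* The handler cannot run on the faulting thread's own, still-
           protected stack, so switch to a per-thread auxiliary slot. */
        savedEsp = tf->Esp;
        tf->Esp  = (unsigned long)AuxSlotFor(threadId);

        DsmHandlerFetchStackPage(faultAddr);     /* bring page, unprotect */

        tf->Esp = savedEsp;                      /* switch the stack back */
        ResumeFaultingThread(tf);                /* retry the access */
    }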
TM in the Millipede 4.0 DSM System
• In sum, our TM mechanism has the following powerful features:
  • two TM approaches
  • kernel-level threads are migrated
  • SEH support
  • the FastMessages service is used to transfer migrating threads efficiently
  • thread suspension and resumption are location independent and may be recursive
  • all API functions provided by Millipede 4.0 are migration-safe
  • a statistics tool
Performance Evaluation (cont.)
• The cost of Win32 calls used in TM (averaged over 1,000,000 instances of each call): [results table not preserved]
• Performance of TM in Millipede 4.0 (averaged over 1,000,000 TMs with a stack size of 176B): [results table not preserved]
Performance Evaluation (cont.)
• Migration time on various systems as a function of stack size (sec): [figure not preserved]
Load Balancing (LB) in DSM Systems
• Introduction:
  • Definition of load in DSM systems:
    • the CPU time that a computational thread consumes
    • the amount of communication that the thread causes during its work
  • Dynamic load sharing computes a less precise placement of threads, but due to the relaxed requirements it can often be as efficient as dynamic load balancing
Introduction (cont.)
[Figure: threads 1–15 being redistributed among the hosts; diagram not preserved]
Designing an LS Mechanism
• The goals of load sharing:
  • a uniform distribution of threads among the stations
  • minimization of communication overheads
  • improving the locality of accesses
  • avoiding page ping-pong situations, in which a page is transferred frequently among several hosts
Designing an LS Mechanism (cont.)
• We propose a load sharing mechanism that works as a separate module, called the Load Sharing Module (LS-Module)
• The LS-Module performs the following tasks:
  • load imbalance detection
  • load imbalance treatment
  • ping-pong detection
  • ping-pong treatment
Designing an LS Mechanism (cont.)
• The Load Imbalance Detection protocol has a centralized entity called the Load Sharing Server (LS-Server) that:
  • knows the power parameter of each host
  • is notified by an external module about each change in the load
  • for each change in the load, calculates two threshold values l and h for the host, thereby determining whether the host is normally loaded (see the sketch below)
  • begins the load imbalance treatment protocol when a load imbalance is detected
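The slides do not give the threshold formulas; a hypothetical sketch of the classification step, assuming l and h scale with each host's power parameter and with assumed margins:

    /* Hypothetical sketch of the LS-Server's load classification. The
       scaling of l and h by the host's power parameter is an assumption. */
    typedef enum { UNDERLOADED, NORMAL, OVERLOADED } LoadState;

    LoadState classify_host(double load, double power,
                            double avg_load_per_power)
    {
        /* Thresholds bracket the host's fair share of the total load. */
        double fair = power * avg_load_per_power;
        double l = 0.8 * fair;      /* assumed margin */
        double h = 1.2 * fair;      /* assumed margin */
        if (load < l) return UNDERLOADED;
        if (load > h) return OVERLOADED;
        return NORMAL;
    }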
Designing an LS Mechanism (cont.)
• The Load Imbalance Treatment protocol is performed by the LS-Server, which decides how many threads, say n, should be migrated from an overloaded host, say H1, to balance its load
• An entity called the Load Sharing Client (LS-Client), which runs on each host, is responsible for selecting the n threads whose migration will best minimize future communication
Designing an LS Mechanism (cont.)
• The Ping-Pong Detection protocol is performed by the Ping-Pong Client (PP-Client) entity
• Each time there is an access to a remote page, the PP-Client (one per host) is invoked
• A ping-pong situation exists when the following two conditions are met:
  • local threads attempt to access a page a short time after it leaves the host
  • the page leaves the host a short time after it has arrived
Designing an LS Mechanism (cont.)
• The Ping-Pong Treatment protocol is performed by a centralized Ping-Pong Server (PP-Server) entity
• The PP-Server determines which group of threads participates in a ping-pong, then chooses a destination host and migrates the threads to this host
• If too many threads participate in a ping-pong, or a ping-pong is detected a short time after it has been resolved, the PP-Server decides to treat the ping-pong using delays
LS in the Millipede 4.0 DSM System
• We have implemented the load sharing mechanism in the Millipede 4.0 DSM system
• Millipede 4.0 architecture:
  • the Thread-Server module
  • the TM module
  • the LS module:
    • one centralized LS-Server
    • LS-Clients (one per host)
    • PP-Clients (one per host)
LS in the Millipede 4.0 DSM System (cont.)
• Access History
  • In order to select the threads for migration, we keep an access history for each thread
  • The access history contains at most one entry for each page that was referenced by the local threads in the last T_epoch time units
  • Obviously, the access history should be updated as time passes
  • The access history also keeps an old history, or prehistory, which summarizes the old access history of a thread
LS in the Millipede 4.0 DSM System (cont.)
• Access History Structure
[Figure: per-thread access-history lists, with entries such as page 0x0DCC with access times 0:12:00, 0:12:01, 0:12:13 (thread 0) and page 0xACDC (thread 7); each history ends with a prehistory record]
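Interpreting the figure, a hypothetical sketch of the data structure; all field names and sizes are assumptions:

    /* Hypothetical sketch of the per-thread access history. */
    #define MAX_TIMES  8    /* recent access times kept per page (assumed) */
    #define MAX_PAGES  64   /* entries kept per thread (assumed) */

    typedef struct {
        void         *page;              /* page referenced by the thread */
        unsigned long times[MAX_TIMES];  /* access times within T_epoch   */
        int           ntimes;
    } AccessEntry;

    typedef struct {
        AccessEntry entries[MAX_PAGES];  /* at most one entry per page    */
        int         nentries;
        double      prehistory;          /* summary of accesses older than
                                            T_epoch (assumed scalar form) */
    } AccessHistory;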
LS in the Millipede 4.0 DSM System (cont.)
• Thread Selection Algorithm (a sketch of the greedy loop follows)
  • A heuristic value h(j) is calculated for each thread j on the local host L. It takes into account the following characteristics:
    • maximal frequency of remote references to pages on the remote host R
    • minimal access frequency of the threads remaining on L to the pages used by the selected threads
    • minimal access frequency to local pages
    • maximal frequency of any remote references
  • Until enough threads are selected, the following procedure is performed:
    • the thread j having the maximal value h(j) is chosen
    • the heuristic value of each thread i that has not yet been selected is revised, taking the migration of j into account
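A hedged sketch of the greedy selection loop; the heuristic itself is left abstract because its exact weighting is not given in the slides, and both helper functions are assumptions:

    /* Hedged sketch of the greedy thread-selection loop. */
    #define MAX_THREADS 1024   /* must be >= ncand (assumed bound) */

    double heuristic_value(int thread);                     /* h(j); assumed */
    void   account_for_migration(int migrated, int other);  /* revise h(i)   */

    void select_threads(int candidates[], int ncand,
                        int chosen[], int n /* threads to migrate */)
    {
        int taken[MAX_THREADS] = {0};
        int k, i, best;

        for (k = 0; k < n; k++) {
            /* Pick the unselected thread with the maximal h(j). */
            best = -1;
            for (i = 0; i < ncand; i++)
                if (!taken[i] && (best < 0 ||
                    heuristic_value(candidates[i]) >
                    heuristic_value(candidates[best])))
                    best = i;
            taken[best] = 1;
            chosen[k] = candidates[best];
            /* Revise h(i) of the rest, assuming the chosen thread migrates. */
            for (i = 0; i < ncand; i++)
                if (!taken[i])
                    account_for_migration(candidates[best], candidates[i]);
        }
    }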
LS in the Millipede 4.0 DSM System (cont.)
• Ping-Pong Detection
  • The lifetime of page P on a host (receive page P from Hj; send page P to Hi; a local access brings it back from Hj; send page P to Hk; ...) is divided into three intervals: T_useful (the page is present and being used), T_unused (the page is present but idle), and T_waiting (local threads wait for the absent page)
  • The page ping-pong condition is:

      PPRatio = (T_unused + T_useful) / T_waiting < S

  • (S is called the sensitivity of the ping-pong)
LS in the Millipede 4.0 DSM System (cont.)
• Dynamic calculation of the sensitivity S_P for page P
  • The value of S_P depends on the number of threads that are using the page and on their behavior:

      S_P = c · Nth_pp · f(Nth)

  • where c is a constant, Nth_pp is the number of threads involved in the ping-pong residing on the local host, Nth is the total number of threads residing on the local host, and f(Nth) is a function of that number
Performance Evaluation
• We have tested the LS module on several benchmarks that are common in DSM systems, as well as on synthetic microbenchmarks specially designed for this purpose
• We refer to the version of Millipede 4.0 with the LS module as the "LS version" and to the version without it as the "no LS version"
Performance Evaluation (cont.)
• Microbenchmark applications were designed to simulate various load imbalance situations
• Using the microbenchmark applications, we have measured the individual performance of each part of the load sharing protocol:
  • load imbalance treatment
  • ping-pong treatment:
    • locality optimization part
    • stabilization part
Performance Evaluation (cont.)
• Locality optimization protocol: [results figure not preserved]