Multiprocessors and Multithreading – classroom slides
Example use of threads - 1: (a) Sequential process — compute, issue an I/O request, wait for the I/O to complete, resume computing once the I/O result is needed. (b) Multithreaded process — a compute thread and an I/O thread overlap; the compute thread blocks only at the point where the I/O result is needed.
Example use of threads - 2: a video-surveillance pipeline with Digitizer, Tracker, and Alarm threads.
Programming Support for Threads
(Figure: (a) before thread creation only the main thread runs; (b) after thread_create(foo, args) a foo thread runs alongside the main thread.)
• creation
  • pthread_create(top-level procedure, args)
• termination
  • return from top-level procedure
  • explicit kill
• rendezvous
  • creator can wait for children
  • pthread_join(child_tid)
• synchronization
  • mutex
  • condition variables
Sample program – thread create/join

int foo(int n)
{
  .....
  return 0;
}

int main()
{
  int f;
  thread_type child_tid;
  .....
  child_tid = thread_create(foo, &f);
  .....
  thread_join(child_tid);
}
Programming with Threads
(Figure: a producer and a consumer sharing a buffer.)
• synchronization
  • for coordination of the threads
• communication
  • for inter-thread sharing of data
• threads can be in different processors
  • how to achieve sharing in an SMP?
  • software: the OS keeps all threads of a process in the same address space
  • hardware: shared memory and coherent caches
Need for Synchronization

digitizer()
{
  image_type dig_image;
  int tail = 0;

  loop {
    if (bufavail > 0) {
      grab(dig_image);
      frame_buf[tail mod MAX] = dig_image;
      tail = tail + 1;
      bufavail = bufavail - 1;
    }
  }
}

tracker()
{
  image_type track_image;
  int head = 0;

  loop {
    if (bufavail < MAX) {
      track_image = frame_buf[head mod MAX];
      head = head + 1;
      bufavail = bufavail + 1;
      analyze(track_image);
    }
  }
}

Problem?
Shared data structure: frame_buf[0..99], with tail pointing to the first empty spot and head to the first valid filled frame. Both the digitizer (bufavail = bufavail – 1) and the tracker (bufavail = bufavail + 1) update the shared variable bufavail.
Synchronization Primitives • lock and unlock • mutual exclusion among threads • busy-waiting vs. blocking • pthread_mutex_trylock: no blocking • pthread_mutex_lock: blocking • pthread_mutex_unlock
Fix number 1 – with locks

digitizer()
{
  image_type dig_image;
  int tail = 0;

  loop {
    thread_mutex_lock(buflock);
    if (bufavail > 0) {
      grab(dig_image);
      frame_buf[tail mod MAX] = dig_image;
      tail = tail + 1;
      bufavail = bufavail - 1;
    }
    thread_mutex_unlock(buflock);
  }
}

tracker()
{
  image_type track_image;
  int head = 0;

  loop {
    thread_mutex_lock(buflock);
    if (bufavail < MAX) {
      track_image = frame_buf[head mod MAX];
      head = head + 1;
      bufavail = bufavail + 1;
      analyze(track_image);
    }
    thread_mutex_unlock(buflock);
  }
}

Problem?
Fix number 2

digitizer()
{
  image_type dig_image;
  int tail = 0;

  loop {
    grab(dig_image);
    thread_mutex_lock(buflock);
    while (bufavail == 0)
      do nothing;
    thread_mutex_unlock(buflock);
    frame_buf[tail mod MAX] = dig_image;
    tail = tail + 1;
    thread_mutex_lock(buflock);
    bufavail = bufavail - 1;
    thread_mutex_unlock(buflock);
  }
}

tracker()
{
  image_type track_image;
  int head = 0;

  loop {
    thread_mutex_lock(buflock);
    while (bufavail == MAX)
      do nothing;
    thread_mutex_unlock(buflock);
    track_image = frame_buf[head mod MAX];
    head = head + 1;
    thread_mutex_lock(buflock);
    bufavail = bufavail + 1;
    thread_mutex_unlock(buflock);
    analyze(track_image);
  }
}

Problem?
Fix number 3

digitizer()
{
  image_type dig_image;
  int tail = 0;

  loop {
    grab(dig_image);
    while (bufavail == 0)
      do nothing;
    frame_buf[tail mod MAX] = dig_image;
    tail = tail + 1;
    thread_mutex_lock(buflock);
    bufavail = bufavail - 1;
    thread_mutex_unlock(buflock);
  }
}

tracker()
{
  image_type track_image;
  int head = 0;

  loop {
    while (bufavail == MAX)
      do nothing;
    track_image = frame_buf[head mod MAX];
    head = head + 1;
    thread_mutex_lock(buflock);
    bufavail = bufavail + 1;
    thread_mutex_unlock(buflock);
    analyze(track_image);
  }
}

Problem?
condition variables • pthread_cond_wait: block for a signal • pthread_cond_signal: signal one waiting thread • pthread_cond_broadcast: signal all waiting threads
Wait and signal with cond vars: (a) wait before signal — T1 calls cond_wait(c, m) and blocks; when T2 later calls cond_signal(c), T1 is resumed. (b) Wait after signal — T2 calls cond_signal(c) first; the signal is lost, so T1's subsequent cond_wait(c, m) blocks forever.
Fix number 4 – cond var

digitizer()
{
  image_type dig_image;
  int tail = 0;

  loop {
    grab(dig_image);
    thread_mutex_lock(buflock);
    if (bufavail == 0)
      thread_cond_wait(buf_not_full, buflock);
    thread_mutex_unlock(buflock);
    frame_buf[tail mod MAX] = dig_image;
    tail = tail + 1;
    thread_mutex_lock(buflock);
    bufavail = bufavail - 1;
    thread_cond_signal(buf_not_empty);
    thread_mutex_unlock(buflock);
  }
}

tracker()
{
  image_type track_image;
  int head = 0;

  loop {
    thread_mutex_lock(buflock);
    if (bufavail == MAX)
      thread_cond_wait(buf_not_empty, buflock);
    thread_mutex_unlock(buflock);
    track_image = frame_buf[head mod MAX];
    head = head + 1;
    thread_mutex_lock(buflock);
    bufavail = bufavail + 1;
    thread_cond_signal(buf_not_full);
    thread_mutex_unlock(buflock);
    analyze(track_image);
  }
}

This solution is correct so long as there is exactly one producer and one consumer.
Gotchas in programming with cond vars

acquire_shared_resource()
{
  thread_mutex_lock(cs_mutex);                  /* T3 is here */
  if (res_state == BUSY)
    thread_cond_wait(res_not_busy, cs_mutex);   /* T2 is here */
  res_state = BUSY;
  thread_mutex_unlock(cs_mutex);
}

release_shared_resource()
{
  thread_mutex_lock(cs_mutex);
  res_state = NOT_BUSY;                         /* T1 is here */
  thread_cond_signal(res_not_busy);
  thread_mutex_unlock(cs_mutex);
}
State of waiting queues: (a) before T1 signals — T3 waits on cs_mutex, T2 waits on res_not_busy; (b) after T1 signals — both T2 and T3 wait on cs_mutex, and res_not_busy is empty.
Defensive programming – retest predicate

acquire_shared_resource()
{
  thread_mutex_lock(cs_mutex);                  /* T3 is here */
  while (res_state == BUSY)
    thread_cond_wait(res_not_busy, cs_mutex);   /* T2 is here */
  res_state = BUSY;
  thread_mutex_unlock(cs_mutex);
}

release_shared_resource()
{
  thread_mutex_lock(cs_mutex);
  res_state = NOT_BUSY;                         /* T1 is here */
  thread_cond_signal(res_not_busy);
  thread_mutex_unlock(cs_mutex);
}
Threads as a software structuring abstraction: (a) dispatcher model — a dispatcher takes requests from a mail box and hands them to workers; (b) team model — all team members service the mail box directly; (c) pipelined model — work flows through a series of stages connected by mail boxes.
Threads and OS
Traditional OS:
• DOS
  • memory layout: user program code and data sit alongside the DOS kernel code and data
  • protection between user and kernel?
• Unix
  • memory layout: each process (P1, P2) has its own code and data in user space; the kernel holds its own code and data plus a PCB per process
  • protection between user and kernel?
  • PCB?
• programs in these traditional OSes are single threaded
  • one PC per program (process), one stack, one set of CPU registers
  • if a process blocks (say on disk I/O, network communication, etc.), then there is no progress for the program as a whole
MT Operating Systems
How widespread is support for threads in OSes?
• Digital Unix, Sun Solaris, Win95, Win NT, Win XP
Process vs. thread?
• in a single-threaded program, the state of the executing program is contained in a process
• in an MT program, the state of the executing program is contained in several ‘concurrent’ threads
Process vs. Thread
• computational state (PC, regs, …) for each thread
• how is this different from process state?
(Figure: process P1 contains threads T1–T3 and P2 contains T1; each process has its own user code, data, and PCB, while kernel code and data are shared.)
(a) An ST program has a single stack above the heap, globals, and code. (b) An MT program has one stack per thread (stack1–stack4), all sharing the same heap, globals, and code.
• threads
  • share the address space of the process
  • cooperate to get the job done
• are threads concurrent?
  • maybe, if the box is a true multiprocessor
  • they share the same CPU on a uniprocessor
• is threaded code different from non-threaded?
  • protection for data shared among threads
  • synchronization among threads
Threads Implementation • user level threads • OS independent • scheduler is part of the runtime system • thread switch is cheap (save PC, SP, regs) • scheduling customizable, i.e., more app control • blocking call by thread blocks process
(Figure: user-level threads. Each process (P1, P2) links in its own threads library, holding a thread ready_q and the mutex/cond_var wait queues; the kernel sees only the processes P1–P3 on its process ready_q.)
(Figure: the currently executing thread T1 of P1 makes a blocking call to the OS; the kernel makes an upcall to the threads library so it can schedule another thread of the process.)
• solution to the blocking problem in user-level threads
  • non-blocking version of all system calls
  • polling wrapper in the scheduler for such calls
• switching among user-level threads
  • yield voluntarily
  • how to make it preemptive?
    • timer interrupt from the kernel to switch
Kernel-level threads
• expensive thread switch
• makes sense for blocking calls by threads
• kernel becomes complicated: process vs. thread scheduling
• thread packages become non-portable
• problem common to user- and kernel-level threads: non-thread-safe libraries
  • solution is to have thread-safe wrappers for such library calls
(Figure: kernel-level threads. The kernel runs a thread-level scheduler over the threads of P1 and P2, and a process-level scheduler over the process ready_q holding P1–P3.)
Solaris threads: user-level threads are multiplexed onto lightweight processes (lwp), which are what the kernel schedules.
Thread safe libraries

/* original version */
void *malloc(size_t size)
{
  ......
  ......
  return (memory_pointer);
}

/* thread safe version */
mutex_lock_type cs_mutex;

void *malloc(size_t size)
{
  thread_mutex_lock(cs_mutex);
  ......
  ......
  thread_mutex_unlock(cs_mutex);
  return (memory_pointer);
}
Synchronization support • lock • test-and-set instruction
SMP: several CPUs connected over a shared bus to shared memory and input/output.
SMP with per-processor caches: each CPU has a private cache between it and the shared bus/shared memory.
Cache consistency problem: threads T1–T3 on processors P1–P3 each have a cached copy of X; once one processor writes X, the other copies are stale.
Two possible solutions: (b) write-invalidate protocol — when P1 writes X → X', the bus transaction invalidates the copies of X in the other caches (X → inv); (c) write-update protocol — the write is broadcast and every cached copy is updated (X → X').
Given the following details about an SMP (symmetric multiprocessor):
• cache coherence protocol: write-invalidate
• cache-to-memory policy: write-back
• initially the caches are empty; memory location A contains 10 and B contains 5
Consider the following timeline of memory accesses from processors P1, P2, and P3. What are the contents of the caches and memory?
What is multithreading?
• a technique allowing a program to do multiple tasks
• is it a new technique?
  • no — it has existed since the 70’s (concurrent Pascal, Ada tasks, etc.)
• why now?
  • emergence of SMPs in particular
  • “time has come for this technology”
• threads in a uniprocessor?
  • allow concurrency between I/O and user processing even in a uniprocessor box, keeping the process active while I/O is in flight
Multiprocessor: First Principles
• processors, memories, interconnection network
• classification: SISD, SIMD, MIMD, MISD
• message-passing MPs: e.g. IBM SP2
• shared address space MPs
  • cache coherent (CC)
    • SMP: a bus-based CC MIMD machine
      • several vendors: Sun, Compaq, Intel, ...
    • CC-NUMA: SGI Origin 2000
  • non-cache coherent (NCC)
    • Cray T3D/T3E
What is an SMP?
• multiple CPUs in a single box sharing all the resources, such as memory and I/O
• is an SMP more cost-effective than two uniprocessor boxes?
  • yes (a dual-processor SMP costs roughly 20% more than a uniprocessor)
  • so even a modest speedup for a program on a dual-processor SMP over a uniprocessor makes it worthwhile