Multiprocessors and Multithreading – classroom slides

rcuellar

Presentation Transcript


  1. Multiprocessors and Multithreading – classroom slides

  2. Example use of threads - 1
     [Figure: timelines for (a) a sequential process and (b) a multithreaded process with a compute thread and an I/O thread. Labels: compute, I/O request, I/O, I/O complete, I/O result needed.]

  3. Example use of threads - 2
     [Figure: Digitizer, Tracker, Alarm.]

  4. Programming Support for Threads
     • creation
       • pthread_create(top-level procedure, args)
     • termination
       • return from top-level procedure
       • explicit kill
     • rendezvous
       • creator can wait for children
       • pthread_join(child_tid)
     • synchronization
       • mutex
       • condition variables
     [Figure: (a) before thread creation: main thread only; (b) after thread_create(foo, args): main thread and foo thread.]

  5. Sample program – thread create/join

         int foo(int n)
         {
           .....
           return 0;
         }

         int main()
         {
           int f;
           thread_type child_tid;
           .....
           child_tid = thread_create (foo, &f);
           .....
           thread_join(child_tid);
         }
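
A hedged POSIX rendering of the sample above. It assumes the slide's generic thread_create and thread_join correspond to pthread_create and pthread_join. The doubling body of foo and the helper name run_create_join are invented here for illustration only.

```c
#include <assert.h>
#include <pthread.h>

/* Top-level procedure of the child thread. POSIX requires the
   signature void *(*)(void *); the doubling is a placeholder body. */
static void *foo(void *arg)
{
    int *n = arg;
    *n = *n * 2;
    return NULL;
}

/* Hypothetical helper: create one child, wait for it, return its result. */
int run_create_join(int input)
{
    int f = input;
    pthread_t child_tid;

    int rc = pthread_create(&child_tid, NULL, foo, &f);
    assert(rc == 0);

    rc = pthread_join(child_tid, NULL);   /* rendezvous with the child */
    assert(rc == 0);

    /* Safe to read f here: the join orders the child's write before it. */
    return f;
}
```

Calling run_create_join(21) from main returns 42. Note that f must stay alive until the join, because the child holds a pointer to it.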

  6. Programming with Threads
     • synchronization
       • for coordination of the threads
     • communication
       • for inter-thread sharing of data
       • threads can be in different processors
       • how to achieve sharing in SMP?
         • software: accomplished by keeping all threads in the same address space by the OS
         • hardware: accomplished by hardware shared memory and coherent caches
     [Figure: producer and consumer sharing a buffer.]

  7. Need for Synchronization

         digitizer()
         {
           image_type dig_image;
           int tail = 0;

           loop {
             if (bufavail > 0) {
               grab(dig_image);
               frame_buf[tail mod MAX] = dig_image;
               tail = tail + 1;
               bufavail = bufavail - 1;
             }
           }
         }

         tracker()
         {
           image_type track_image;
           int head = 0;

           loop {
             if (bufavail < MAX) {
               track_image = frame_buf[head mod MAX];
               head = head + 1;
               bufavail = bufavail + 1;
               analyze(track_image);
             }
           }
         }

     Problem?

  8. Shared data structure
     [Figure: frame_buf with slots 0 to 99. tail marks the first empty spot in frame_buf; head marks the first valid filled frame in frame_buf. The digitizer executes bufavail = bufavail - 1; the tracker executes bufavail = bufavail + 1; bufavail is shared by both.]

  9. Synchronization Primitives
     • lock and unlock
       • mutual exclusion among threads
       • busy-waiting vs. blocking
       • pthread_mutex_trylock: no blocking
       • pthread_mutex_lock: blocking
       • pthread_mutex_unlock
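
The difference between the blocking and non-blocking lock calls can be sketched with the real POSIX functions. The counter workload and the helper names try_update, update, and try_while_held are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t buflock = PTHREAD_MUTEX_INITIALIZER;

/* Non-blocking attempt: returns 1 if the lock was free and the
   update ran, 0 if the lock was busy. */
int try_update(int *counter)
{
    int rc = pthread_mutex_trylock(&buflock);   /* never blocks */
    if (rc == EBUSY)
        return 0;            /* caller can do other work and retry later */
    assert(rc == 0);
    (*counter)++;
    pthread_mutex_unlock(&buflock);
    return 1;
}

/* Blocking version: waits until the lock is available. */
void update(int *counter)
{
    pthread_mutex_lock(&buflock);
    (*counter)++;
    pthread_mutex_unlock(&buflock);
}

/* Hypothetical helper: exercises the busy path by trying
   while the lock is already held. */
int try_while_held(int *counter)
{
    pthread_mutex_lock(&buflock);
    int got_it = try_update(counter);   /* trylock reports EBUSY */
    pthread_mutex_unlock(&buflock);
    return got_it;
}
```

A thread that has useful work to do while the lock is busy can use the trylock form; a thread that cannot proceed without the lock should block instead of spinning on trylock.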

  10. Fix number 1 – with locks

          digitizer()
          {
            image_type dig_image;
            int tail = 0;

            loop {
              thread_mutex_lock(buflock);
              if (bufavail > 0) {
                grab(dig_image);
                frame_buf[tail mod MAX] = dig_image;
                tail = tail + 1;
                bufavail = bufavail - 1;
              }
              thread_mutex_unlock(buflock);
            }
          }

          tracker()
          {
            image_type track_image;
            int head = 0;

            loop {
              thread_mutex_lock(buflock);
              if (bufavail < MAX) {
                track_image = frame_buf[head mod MAX];
                head = head + 1;
                bufavail = bufavail + 1;
                analyze(track_image);
              }
              thread_mutex_unlock(buflock);
            }
          }

      Problem?

  11. Fix number 2

          digitizer()
          {
            image_type dig_image;
            int tail = 0;

            loop {
              grab(dig_image);
              thread_mutex_lock(buflock);
              while (bufavail == 0) do nothing;
              thread_mutex_unlock(buflock);
              frame_buf[tail mod MAX] = dig_image;
              tail = tail + 1;
              thread_mutex_lock(buflock);
              bufavail = bufavail - 1;
              thread_mutex_unlock(buflock);
            }
          }

          tracker()
          {
            image_type track_image;
            int head = 0;

            loop {
              thread_mutex_lock(buflock);
              while (bufavail == MAX) do nothing;
              thread_mutex_unlock(buflock);
              track_image = frame_buf[head mod MAX];
              head = head + 1;
              thread_mutex_lock(buflock);
              bufavail = bufavail + 1;
              thread_mutex_unlock(buflock);
              analyze(track_image);
            }
          }

      Problem?

  12. Fix number 3

          digitizer()
          {
            image_type dig_image;
            int tail = 0;

            loop {
              grab(dig_image);
              while (bufavail == 0) do nothing;
              frame_buf[tail mod MAX] = dig_image;
              tail = tail + 1;
              thread_mutex_lock(buflock);
              bufavail = bufavail - 1;
              thread_mutex_unlock(buflock);
            }
          }

          tracker()
          {
            image_type track_image;
            int head = 0;

            loop {
              while (bufavail == MAX) do nothing;
              track_image = frame_buf[head mod MAX];
              head = head + 1;
              thread_mutex_lock(buflock);
              bufavail = bufavail + 1;
              thread_mutex_unlock(buflock);
              analyze(track_image);
            }
          }

      Problem?

  13. Condition variables
      • pthread_cond_wait: block for a signal
      • pthread_cond_signal: signal one waiting thread
      • pthread_cond_broadcast: signal all waiting threads

  14. Wait and signal with cond vars
      [Figure: (a) Wait before signal: T1 calls cond_wait (c, m) and is blocked; T2 then calls cond_signal (c) and T1 is resumed. (b) Wait after signal: T2 calls cond_signal (c) first; T1 then calls cond_wait (c, m) and is blocked forever.]
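
Case (b) arises because a condition variable does not remember signals. One common remedy, sketched here with POSIX calls, is to pair the condition variable with a flag protected by the same mutex. The flag event_happened and the function names are assumptions introduced for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static bool event_happened = false;   /* predicate, guarded by m */

/* T2: record the event under the mutex, then signal. */
static void *signaller(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    event_happened = true;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
    return NULL;
}

/* T1: block only if the event has not happened yet. */
static void wait_for_event(void)
{
    pthread_mutex_lock(&m);
    while (!event_happened)
        pthread_cond_wait(&c, &m);   /* releases m while blocked */
    pthread_mutex_unlock(&m);
}

/* Hypothetical demo: forces the "signal before wait" order of case (b). */
int signal_then_wait(void)
{
    pthread_t t2;
    int rc = pthread_create(&t2, NULL, signaller, NULL);
    assert(rc == 0);
    pthread_join(t2, NULL);   /* the signal has been sent by now */
    wait_for_event();         /* returns at once: the flag is already set */
    return 1;
}
```

Without the flag, the same ordering would leave T1 blocked forever, exactly as in panel (b).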

  15. Fix number 4 – cond var

          digitizer()
          {
            image_type dig_image;
            int tail = 0;

            loop {
              grab(dig_image);
              thread_mutex_lock(buflock);
              if (bufavail == 0)
                thread_cond_wait(buf_not_full, buflock);
              thread_mutex_unlock(buflock);
              frame_buf[tail mod MAX] = dig_image;
              tail = tail + 1;
              thread_mutex_lock(buflock);
              bufavail = bufavail - 1;
              thread_cond_signal(buf_not_empty);
              thread_mutex_unlock(buflock);
            }
          }

          tracker()
          {
            image_type track_image;
            int head = 0;

            loop {
              thread_mutex_lock(buflock);
              if (bufavail == MAX)
                thread_cond_wait(buf_not_empty, buflock);
              thread_mutex_unlock(buflock);
              track_image = frame_buf[head mod MAX];
              head = head + 1;
              thread_mutex_lock(buflock);
              bufavail = bufavail + 1;
              thread_cond_signal(buf_not_full);
              thread_mutex_unlock(buflock);
              analyze(track_image);
            }
          }

      This solution is correct so long as there is exactly one producer and one consumer.
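
One way to lift the one-producer, one-consumer restriction is sketched below with POSIX threads. This is not the slides' solution. It assumes that head and tail become shared variables, that the buffer copy moves inside the critical section, and that each wait retests its predicate in a while loop. Frames are plain ints, and MAX, the thread counts, and the function names are chosen only for the demo.

```c
#include <assert.h>
#include <pthread.h>

#define MAX 4   /* small on purpose so the buffer fills up often */

static int frame_buf[MAX];
static int head = 0, tail = 0;
static int bufavail = MAX;   /* number of empty slots, as on the slides */
static pthread_mutex_t buflock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t buf_not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t buf_not_empty = PTHREAD_COND_INITIALIZER;

void put_frame(int frame)
{
    pthread_mutex_lock(&buflock);
    while (bufavail == 0)                       /* retest after wakeup */
        pthread_cond_wait(&buf_not_full, &buflock);
    frame_buf[tail % MAX] = frame;
    tail = tail + 1;
    bufavail = bufavail - 1;
    pthread_cond_signal(&buf_not_empty);
    pthread_mutex_unlock(&buflock);
}

int get_frame(void)
{
    pthread_mutex_lock(&buflock);
    while (bufavail == MAX)                     /* retest after wakeup */
        pthread_cond_wait(&buf_not_empty, &buflock);
    int frame = frame_buf[head % MAX];
    head = head + 1;
    bufavail = bufavail + 1;
    pthread_cond_signal(&buf_not_full);
    pthread_mutex_unlock(&buflock);
    return frame;
}

enum { FRAMES_PER_THREAD = 1000 };

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= FRAMES_PER_THREAD; i++)
        put_frame(i);
    return NULL;
}

static void *consumer(void *sum_out)
{
    long sum = 0;
    for (int i = 0; i < FRAMES_PER_THREAD; i++)
        sum += get_frame();
    *(long *)sum_out = sum;
    return NULL;
}

/* Hypothetical demo: two producers and two consumers;
   returns the sum of every frame value consumed. */
long run_two_by_two(void)
{
    pthread_t p[2], c[2];
    long sums[2] = {0, 0};
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&p[i], NULL, producer, NULL);
        assert(rc == 0);
        rc = pthread_create(&c[i], NULL, consumer, &sums[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(p[i], NULL);
        pthread_join(c[i], NULL);
    }
    return sums[0] + sums[1];
}
```

The trade-off is less overlap than in Fix number 4: the copy into and out of frame_buf now happens while holding buflock. Expensive steps such as grab and analyze would still stay outside the lock.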

  16. Gotchas in programming with cond vars

          acquire_shared_resource()
          {
            thread_mutex_lock(cs_mutex);
            /* T3 is here */
            if (res_state == BUSY)
              thread_cond_wait (res_not_busy, cs_mutex);
            /* T2 is here */
            res_state = BUSY;
            thread_mutex_unlock(cs_mutex);
          }

          release_shared_resource()
          {
            thread_mutex_lock(cs_mutex);
            res_state = NOT_BUSY;
            /* T1 is here */
            thread_cond_signal(res_not_busy);
            thread_mutex_unlock(cs_mutex);
          }

  17. State of waiting queues
      [Figure: (a) Waiting queues before T1 signals: T3 waits on cs_mutex and T2 waits on res_not_busy. (b) Waiting queues after T1 signals: T2 and T3 both wait on cs_mutex and the res_not_busy queue is empty.]

  18. Defensive programming – retest predicate

          acquire_shared_resource()
          {
            thread_mutex_lock(cs_mutex);
            /* T3 is here */
            while (res_state == BUSY)
              thread_cond_wait (res_not_busy, cs_mutex);
            /* T2 is here */
            res_state = BUSY;
            thread_mutex_unlock(cs_mutex);
          }

          release_shared_resource()
          {
            thread_mutex_lock(cs_mutex);
            res_state = NOT_BUSY;
            /* T1 is here */
            thread_cond_signal(res_not_busy);
            thread_mutex_unlock(cs_mutex);
          }
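
The retest pattern maps directly onto POSIX threads, where the while loop is needed in any case because pthread_cond_wait is allowed to wake up spuriously. The sketch below keeps the slide's names. The worker loop, the in_use and violations counters, and count_violations are assumptions added to check that no two threads ever hold the resource at once.

```c
#include <assert.h>
#include <pthread.h>

enum { NOT_BUSY, BUSY };

static int res_state = NOT_BUSY;
static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t res_not_busy = PTHREAD_COND_INITIALIZER;

void acquire_shared_resource(void)
{
    pthread_mutex_lock(&cs_mutex);
    while (res_state == BUSY)            /* retest after every wakeup */
        pthread_cond_wait(&res_not_busy, &cs_mutex);
    res_state = BUSY;
    pthread_mutex_unlock(&cs_mutex);
}

void release_shared_resource(void)
{
    pthread_mutex_lock(&cs_mutex);
    res_state = NOT_BUSY;
    pthread_cond_signal(&res_not_busy);
    pthread_mutex_unlock(&cs_mutex);
}

static int in_use = 0;       /* threads currently holding the resource */
static int violations = 0;   /* times more than one thread held it */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        acquire_shared_resource();
        /* in_use is only touched while holding the resource */
        if (++in_use != 1)
            violations++;
        --in_use;
        release_shared_resource();
    }
    return NULL;
}

/* Hypothetical stress test with three threads, like T1..T3 on the slide. */
int count_violations(void)
{
    pthread_t t[3];
    for (int i = 0; i < 3; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return violations;
}
```

With if in place of while, a thread woken from the wait could proceed after another thread had already set res_state back to BUSY, which is the scenario the previous two slides illustrate.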

  19. Threads as software structuring abstraction
      [Figure: (a) Dispatcher model: mail box, dispatcher, workers. (b) Team model: mail box. (c) Pipelined model: mail box, stages.]

  20. Threads and OS
      Traditional OS
      • DOS
        • memory layout
        • protection between user and kernel?
      [Figure: memory layout with the labels User, Program, data, DOS, code, Kernel, data.]

  21. Traditional OS (continued)
      • Unix
        • memory layout
        • protection between user and kernel?
        • PCB?
      [Figure: user level holds processes P1 and P2, each with its own process code and data; the kernel holds one PCB per process plus kernel code and data.]

  22. Programs in these traditional OS are single threaded
      • one PC per program (process), one stack, one set of CPU registers
      • if a process blocks (say disk I/O, network communication, etc.) then no progress for the program as a whole

  23. MT Operating Systems
      How widespread is support for threads in OS?
      • Digital Unix, Sun Solaris, Win95, Win NT, Win XP
      Process vs. Thread?
      • in a single threaded program, the state of the executing program is contained in a process
      • in a MT program, the state of the executing program is contained in several ‘concurrent’ threads

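  24. Process vs. Thread
      • computational state (PC, regs, …) for each thread
      • how different from process state?
      [Figure: user level holds processes P1 and P2, each with code and data; one process contains threads T1, T2, T3 and the other contains T1. The kernel holds one PCB per process plus kernel code and data.]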

  25. [Figure: memory layout of (a) an ST program: stack, heap, global, code; and (b) an MT program: stack1, stack2, stack3, stack4, heap, global, code.]

  26. Threads
      • share address space of process
      • cooperate to get job done
      • threads concurrent?
        • maybe, if the box is a true multiprocessor
        • share the same CPU on a uniprocessor
      • threaded code different from non-threaded?
        • protection for data shared among threads
        • synchronization among threads

  27. Threads Implementation
      • user level threads
        • OS independent
        • scheduler is part of the runtime system
        • thread switch is cheap (save PC, SP, regs)
        • scheduling customizable, i.e., more app control
        • blocking call by thread blocks process

  28. [Figure: user-level threads. User level: processes P1, P2, P3 with their threads; two threads libraries are shown, each holding a thread ready_q (T1, T2, T3) and mutex, cond_var state. Kernel: a process ready_q (P1, P2, P3).]

  29. [Figure: process P1 at user level with threads T1, T2, T3 and the threads library above the kernel. Labels: currently executing thread; blocking call to the OS; upcall to the threads library.]

  30. Solution to blocking problem in user level threads
      • non-blocking version of all system calls
      • polling wrapper in scheduler for such calls
      Switching among user level threads
      • yield voluntarily
      • how to make preemptive?
        • timer interrupt from kernel to switch

  31. Kernel level
      • expensive thread switch
      • makes sense for blocking calls by threads
      • kernel becomes complicated: process vs. threads scheduling
      • thread packages become non-portable
      Problems common to user and kernel level threads
      • libraries
      • solution is to have thread-safe wrappers to such library calls

  32. [Figure: kernel-level threads. User level: processes P1, P2, P3 with their threads. Kernel: a thread level scheduler holding each process's threads, a process level scheduler, and a process ready_q (P1, P2, P3).]

  33. Solaris threads
      [Figure: user level holds processes P1, P2, P3 with their threads; lwp entities sit between the user-level threads and the kernel.]

  34. Thread safe libraries

          /* original version */
          void *malloc(size_t size)
          {
            ......
            ......
            return(memory_pointer);
          }

          /* thread safe version */
          mutex_lock_type cs_mutex;

          void *malloc(size_t size)
          {
            thread_mutex_lock(cs_mutex);
            ......
            ......
            thread_mutex_unlock(cs_mutex);
            return (memory_pointer);
          }
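
The same wrapper idea can be shown on a small stand-in routine. In this sketch unsafe_next_id plays the role of a library call with hidden static state; it and the other names are hypothetical and do not belong to any real library.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for a library routine that is not thread safe:
   it updates hidden static state without any locking. */
static long unsafe_next_id(void)
{
    static long last_id = 0;
    last_id = last_id + 1;
    return last_id;
}

static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe wrapper: every call goes through one mutex. */
long safe_next_id(void)
{
    pthread_mutex_lock(&cs_mutex);
    long id = unsafe_next_id();
    pthread_mutex_unlock(&cs_mutex);
    return id;
}

static void *caller(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++)
        safe_next_id();
    return NULL;
}

/* Hypothetical demo: 4 threads make 10000 calls each, then the
   function returns the next id handed out. */
long run_wrapper_demo(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&t[i], NULL, caller, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return safe_next_id();
}
```

On its first call run_wrapper_demo returns 40001, since no increment is lost. The wrapper only helps if every caller uses it; a direct call to the unsafe routine bypasses the lock.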

  35. Synchronization support
      • Lock
        • Test and set instruction
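
A lock can be built from test-and-set as in the sketch below. It assumes C11, whose atomic_flag_test_and_set stands in for the hardware instruction; the memory orders, the counter workload, and the function names are choices made for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* test-and-set returns the previous value:
       true means another thread already holds the lock. */
    while (atomic_flag_test_and_set_explicit(&lock_word,
                                             memory_order_acquire))
        ;   /* busy-wait */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}

static long shared_counter = 0;

static void *incrementer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        shared_counter++;        /* protected by the spin lock */
        spin_unlock();
    }
    return NULL;
}

/* Hypothetical demo: two threads increment a shared counter. */
long run_spinlock_demo(void)
{
    pthread_t t[2];
    shared_counter = 0;
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, incrementer, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return shared_counter;
}
```

This is the busy-waiting style of lock mentioned earlier in the deck. It suits very short critical sections; for longer waits a blocking mutex avoids burning CPU time.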

  36. SMP
      [Figure: several CPUs and input/output connected through a shared bus to shared memory.]

  37. SMP with per-processor caches
      [Figure: several CPUs, each with its own cache, connected through a shared bus to shared memory.]

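  38. Cache consistency problem
      [Figure: threads T1, T2, T3 run on processors P1, P2, P3; each processor's cache holds a copy of X; the caches connect through a shared bus to shared memory.]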

  39. Two possible solutions
      [Figure: (b) write-invalidate protocol: an invalidate goes out on the shared bus; the writer's cached copy changes X -> X’ and the other two cached copies change X -> inv. (c) write-update protocol: an update goes out on the shared bus and all three cached copies change X -> X’.]
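
The effect of the two protocols on the cached copies can be mimicked with a toy model. This sketch tracks a single location X in three caches and leaves out the bus, main memory, and the write-back policy; the struct and function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 3

/* One cached copy of location X. */
struct cache_line {
    bool valid;
    int  value;
};

/* Write-invalidate: the writer keeps the new value and
   every other cached copy is marked invalid. */
void write_invalidate(struct cache_line caches[NCPUS],
                      int writer, int new_value)
{
    for (int p = 0; p < NCPUS; p++) {
        if (p == writer) {
            caches[p].valid = true;
            caches[p].value = new_value;
        } else {
            caches[p].valid = false;
        }
    }
}

/* Write-update: the new value is broadcast and every
   valid cached copy is refreshed in place. */
void write_update(struct cache_line caches[NCPUS],
                  int writer, int new_value)
{
    for (int p = 0; p < NCPUS; p++) {
        if (p == writer || caches[p].valid) {
            caches[p].valid = true;
            caches[p].value = new_value;
        }
    }
}
```

Starting from three valid copies of X, write_invalidate leaves one valid copy (the writer's), so the next read on another processor misses. write_update leaves three valid copies at the cost of sending the new value to every cache on each write.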

  40. Given the following details about an SMP (symmetric multiprocessor):
      • Cache coherence protocol: write-invalidate
      • Cache to memory policy: write-back
      • Initially: the caches are empty
      • Memory locations: A contains 10, B contains 5
      Consider the following timeline of memory accesses from processors P1, P2, and P3. Contents of caches and memory?

  41. What is multithreading?
      • technique allowing program to do multiple tasks
      • is it a new technique?
        • has existed since the 70’s (Concurrent Pascal, Ada tasks, etc.)
      • why now?
        • emergence of SMPs in particular
        • “time has come for this technology”

  42. Threads in a uniprocessor?
      • allows concurrency between I/O and user processing even in a uniprocessor box
      [Figure label: process active.]

  43. Multiprocessor: First Principles
      • processors, memories, interconnection network
      • Classification: SISD, SIMD, MIMD, MISD
      • message passing MPs: e.g. IBM SP2
      • shared address space MPs
        • cache coherent (CC)
          • SMP: a bus-based CC MIMD machine
            • several vendors: Sun, Compaq, Intel, ...
          • CC-NUMA: SGI Origin 2000
        • non-cache coherent (NCC)
          • Cray T3D/T3E

  44. What is an SMP?
      • multiple CPUs in a single box sharing all the resources such as memory and I/O
      • Is an SMP more cost effective than two uniprocessor boxes?
        • yes (a dual-processor SMP costs roughly 20% more than a uni)
        • a modest speedup for a program on a dual-processor SMP over a uni will make it worthwhile
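
The "modest speedup" can be made concrete with a short calculation. The symbols are introduced here for illustration: C is the cost of a uniprocessor box, P its performance on the program, and S the speedup of the same program on the dual-processor SMP, which is assumed to cost 1.2C (the slide's rough 20% figure).

\[
\frac{S \cdot P}{1.2\,C} > \frac{P}{C}
\quad\Longleftrightarrow\quad
S > 1.2
\]

Under these assumptions the SMP delivers more performance per unit cost as soon as the program runs more than about 1.2 times faster on it, well short of the ideal speedup of 2.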
