520 likes | 1.24k Views
Chapter 8 -1 : Multiple Processor Systems. Multiple Processor Systems Multiprocessor Hardware UMA Multiprocessors NUMA Multiprocessors Multicore Chips Multiprocessor Operating Systems Types of Operating Systems Multiprocessor Syncronization Multiprocessor Scheduling. More Power Problems.
E N D
Chapter 8-1 : Multiple Processor Systems • Multiple Processor Systems • Multiprocessor Hardware • UMA Multiprocessors • NUMA Multiprocessors • Multicore Chips • Multiprocessor Operating Systems • Types of Operating Systems • Multiprocessor Syncronization • Multiprocessor Scheduling
More Power Problems • Run the clock faster • Electrical signals travel 20cm/nsec in copper (Einstein’s special theory of relativity) • Signals can not travel for more than 2 cm for a 10 GHz clock, or 2 mm for a 100 GHz computer • Making computers this small may be possible, but we have a heat dissipation problem • We have limitations for the time being, so the solution to more power is through using multiple and paralel CPU’s
Multiple Processor Systems Figure 8-1. (a) A shared-memory multiprocessor. (b) A message-passing multicomputer. (c) A wide area distributed system.
Multiple Processor Systems Shared-memory multiprocessors: Every CPU has equal access to the entire physical memory Message-passing multicomputers: Each CPU has it’s own memory. The CPU’s communicate with each other using messages over the interconnection structure Wide area distributed system: Computer systems connected over a network. Communication is again by messages but there is a delay due to the network
Multiprocessor Hardware UMA Multiprocessors NUMA Multiprocessors Multicore Chips
UMA (Uniform Memory Access) Multiprocessors with Bus-Based Architectures (1) Figure 8-2. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories. UMA : Uniform access to the entire memory, same access times for all CPU’s
UMA Multiprocessors with Bus-Based Architectures (2) • Each CPU has to wait for the bus to be idle to read or write to the memory • For 2 or 3 computers, bus contention is manageable (Figure 8.2 (a)) • For larger number of CPU’s, a cache is added to the CPU. Since reads can be satisfied by cache contents, there will be less traffic on the bus (Figure 8.2 (b)) • Writing has to be managed! • Some systems have private and shared memories (Figure 8.3 (c)) • Mostly private memory is used. Shared memory is for shared variables between CPUs • Needs carefull programming!
UMA Multiprocessors Using Crossbar Switches (1) Figure 8-3. (a) An 8 × 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.
UMA Multiprocessors Using Crossbar Switches (2) Use of a single bus limits (even with caches) the number of CPUs to about 16 or 32 CPUs A crossbar switch connecting n CPUs to k memories may solve this problem A crosspoint is a small electronic switch Contention for memory is still possible if k < n. Partitioning the memory into n units may reduce the contention
UMA Multiprocessors Using Multistage Switching Networks (1) Figure 8-4. (a) A 2 × 2 switch with two input lines, A and B, and two output lines, X and Y. (b) A message format. • Module : memory unit • Address : an address within a module • Opcode : Read or Write • Value : value to be written
UMA Multiprocessors Using Multistage Switching Networks (2) Figure 8-5. An omega switching network.
NUMA (Nonuniform) Multiprocessors (1) Characteristics of NUMA machines: • There is a single address space visible to all CPUs. • Access to remote memory is via LOAD and STORE instructions. • Access to remote memory is slower than access to local memory.
NUMA Multiprocessors (2) Figure 8-6. (a) A 256-node directory-based multiprocessor.(b)Division of a 32-bit memory address intofields. (c) The directory at node 36.
NUMA Multiprocessors (3) Let us assume that each node has one CPU, 16 MB of ram and a cache The total memory is 232 bytes, divided up into 226 cache lines (blocks) of 64 bytes each The total memory is allocated among nodes, with 0-16 MB in node 0, 16-32 MB in node 1, and so on Each node has a directory containing an entry for each of the 218 (262,144) 64-byte cache lines Each directory entry is 9 bits (cache presence bit + 8 bits for a node number), so the total directory size is 218 * 9 = 2,359,296 bits = 294,912 bytes We will assume that a cache line (memory block) is held in the cache of one node only (single copy)
NUMA Multiprocessors (4) The directory of each node is kept in an extremely fast special-purpose hardware, since directory must be queried on every instruction that references memory (so expensive) Let us assume that CPU 20 references the address 0x 24000108. This address corresponds node 36, block 4, offset 8 in decimal Node 20 sends a request message to node 36 to find whether block 4 is cached ot not (NOT from Figure 8-6 (c)) Node 36 fetches block 4 from it’s local ram, sends it back to the to node 20, and updates the directory entry to indicate that the line is now cached at node 20 Now let us assume that node 20 makes a reference to line 2 of node 36. This line is cached in node 82 (Figure 8-6 (c)). Node 82 passes the line to node 20
Multicore Chips (1) Moore's Law: The number of transistors that can be placed on a chip increases exponentially, doubling approximately every two years. Multicore (dual-core, quad-core) means more than one complete CPU on the same chip The CPU’s share the same main memory but they may have private (AMD) or shared (Intel) cache memory Snooping : special hardware circuitry makes sure that if a word is present in two or more caches and one of the CPUs modifies the word, it is automatically and atomically removed from all caches in order to maintain consistency
Multiprocessor Operating Systems Each CPU has its own operating system Master-Slave multiprocessors Symmetric Multiprocessors
Each CPU Has Its Own Operating System (1) Figure 8-7. Partitioning multiprocessor memory among four CPUs, but sharing a single copy of the operating system code (The boxes marked Data are the operatingsystem’s private data for each CPU).
Each CPU Has Its Own Operating System (2) CPUs share the OS code System calls handled by individual CPUs No sharing of processes since each CPU has OS tables of its own Each CPU schedules its own processes and may be idle if there are no processes Since memory allocation is fixed, pages are not shared among CPUs Since each OS maintains its own cache of disk blocks, there may be inconsistency if blocks are modified by different Oss This model was used in the early days of multiprocessors and rarely used these days
Master-Slave Multiprocessors Figure 8-8. A master-slave multiprocessor model. One master OS in one CPU distributes work among other slave CPUs Problems of previous arcitecture (no page sharing, CPUs idle, buffer cache inconsistency) are solved Master CPU may be overloaded since it has to cater for all others
Symmetric Multiprocessors (1) Figure 8-9. The SMP multiprocessor model. Each CPU runs a single but shared copy of OS independently
Symmetric Multiprocessors (2) • When a system call is made, the CPU on which the call is made traps the kernel and processes the call • This model eliminates the asymmetry of master-slave configuration • One copy of OS executing on different CPUs • One set of OS tables • No master CPU bootleneck
Symmetric Multiprocessors (3) Problem: What will happen if two CPUs try to claim the same free page? This is one example out of many Locks (mutex) are provided to solve these problems The OS is splitted into several critical regions (each controlled by different locks) that can be executed concurrently by different CPUs without interfering with each other
Multiprocessor Synchronization (1) Figure 8-10. The TSL (Test and Set Lock) instruction can fail if the bus cannot be locked. These four steps show a sequence of events where the failure is demonstrated.
Multiprocessor Synchronization (2) • The synchronization problem of previous slide can be prevented • By locking the bus so that other CPUs can not access it • Execute the TSL instruction • Unlock the bus • This can be done preferably by hardware locking or by software using spin locks (process executes a tight loop testing its status)
Multiprocessor Scheduling • Uniprocessor scheduling : scheduler chooses the thread to run next • Multiprocessor scheduling : scheduler has to choose a thread and a CPU • Another complication factor is the thread relations • Unrelated threads as in multi-user timesharing environments • Related threads as in the threads of some application working together (such as compilers make command)
Thread Scheduling Algorithms Time-Sharing : unrelated threads Space Sharing : related threads Gang Scheduling : related threads
Timesharing Figure 8-12. Using a single data structure for scheduling a multiprocessor. 16 CPUs all busy, CPU 4 becomes idle, locks scheduling queues and selects thread A. Next CPU 12 goes idle and chooses thread B CPUs are time-shared and we have load balancing (no overload, but work is distributed)
Space Sharing (1) Figure 8-13. A set of 32 CPUs split into four partitions, with two CPUs available. Partitioning is based on related (grouped) threads
Space Sharing (2) Scheduling multiple related threads at the same time across multiple CPUs is called spacesharing Some applications benefit from this approach such as compilers make command Each thread in a group is given its dedicated CPU. This thread holds on to the CPU until it terminates. If a thread blocks on I/O, it continues to hold the CPU If there are not enough CPUs to start all threads of a group at the same time, the whole group waits until there are
Space Sharing (3) Space sharing does not use multiprogramming so we do not have context switching overhead Time is wasted when a thread blocks for I/O or for some other event and CPU is idle Consequently, people have looked for algorithms that attempt to schedule in both time and space together, especially for threads that create multiple threads, which usually need to communicate with one another
Gang Scheduling (1) Figure 8-14. Communication between two threads belonging to thread A that are running out of phase. Consider process A with threads A0 and A1, process B with threads B0 and B1. A0 and B0 execute on CPU 0, A1 and B1 on CPU 1 A0 sends a message to A1, A1 sends a reply back repeately A0 sends the message but gets the reply after a delay of 200 msec because of B’s threads
Gang Scheduling (2) The three parts of gang scheduling: • Groups of related threads are scheduled as a unit, a gang. • All members of a gang run simultaneously, on different timeshared CPUs. • All gang members start and end their time slices together.
Gang Scheduling (3) Time is divided into time slices. At the start of a time slice, all CPUs are rescheduled with a new thread in each. At the start of the next time slice another scheduling is done. No scheduling is done in between If a thread blocks, its CPU stays idle until the end of the quantum
Gang Scheduling (4) Figure 8-15. Gang scheduling (6 CPUs, 5 processes and 24 threads in total). In gang scheduling all threads of a process run together, so that if one of them sends a request to another one, it will get the message and reply immediately