Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) Alejandro Salinger Cheriton School of Computer Science University of Waterloo Joint work with Alejandro López-Ortiz and Reza Dorrigiv
Multicore Challenge
• The RAM model will no longer accurately reflect the architectures on which algorithms are executed.
• The PRAM facilitates design and analysis; however:
  • It is unrealistic.
  • It is difficult to derive work-optimal algorithms for Θ(n) processors.
• Current chips have 2, 4, or 8 cores: low-degree parallelism.
• Parallelism is thread-based.
Multicore Challenge
• Goal: design a model that:
  • Reflects the available degree of parallelism.
  • Is multi-threaded.
  • Admits easy theoretical analysis.
  • Is easy to program.
“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law.” [Gartner]
The LoPRAM Model
• The number of cores is not a constant: it is modeled as O(log n).
• This is similar to bit-level parallelism, where the word size is w = O(log n) bits.
LoPRAM:
• A PRAM with p = O(log n) processors running in MIMD mode.
• Concurrent Read, Exclusive Write (CREW).
• In its simplest form: high-level, thread-based parallelism.
• Semaphores and automatic serialization are available and transparent to the programmer.
• Note: p = O(log n), but not p = Θ(log n) — algorithms must work for any p up to that bound.
PAL-threads (sequential merge sort, before parallelization)

```c
void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}
```
PAL-threads (parallel version)

```c
void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {  // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
        }  // implicit join
        merge(numbers, temp, left, mid + 1, right);
    }
}
```

[Figure: thread states — pending, active, waiting]
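The palthreads construct is not standard C. As a rough sketch of its semantics (the depth cutoff, task struct, and constant names below are mine, not from the talk), the two recursive calls can be mapped onto POSIX threads with a join, forking only near the top of the recursion so that roughly p threads are ever active:

```c
#include <pthread.h>
#include <string.h>

#define PAR_DEPTH 2   /* fork only while shallow: ~2^PAR_DEPTH parallel tasks */

typedef struct { int *a, *tmp; int left, right, depth; } task_t;

/* Merge a[left..mid-1] and a[mid..right] (mid is the start of the right half). */
static void merge(int a[], int tmp[], int left, int mid, int right) {
    int i = left, j = mid, k = left;
    while (i < mid && j <= right) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)   tmp[k++] = a[i++];
    while (j <= right) tmp[k++] = a[j++];
    memcpy(a + left, tmp + left, (size_t)(right - left + 1) * sizeof(int));
}

static void *m_sort(void *arg) {
    task_t *t = (task_t *)arg;
    if (t->right > t->left) {
        int mid = (t->left + t->right) / 2;
        task_t lo = { t->a, t->tmp, t->left,  mid,      t->depth + 1 };
        task_t hi = { t->a, t->tmp, mid + 1,  t->right, t->depth + 1 };
        if (t->depth < PAR_DEPTH) {       /* "palthreads": fork while shallow */
            pthread_t th;
            pthread_create(&th, NULL, m_sort, &lo);
            m_sort(&hi);
            pthread_join(th, NULL);       /* implicit join */
        } else {                          /* deep levels run sequentially */
            m_sort(&lo);
            m_sort(&hi);
        }
        merge(t->a, t->tmp, t->left, mid + 1, t->right);
    }
    return NULL;
}

void mergeSort(int numbers[], int temp[], int array_size) {
    task_t root = { numbers, temp, 0, array_size - 1, 0 };
    m_sort(&root);
}
```

Setting the cutoff to roughly log₂ p levels keeps the number of live threads at O(p) while leaving the deep, fine-grained recursion sequential.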
Work-Optimal Algorithms: Divide & Conquer
• Consider recursive divide-and-conquer algorithms whose running time is given by the recurrence
  T(n) = a T(n/b) + f(n),  with a ≥ 1, b > 1.
• By the master theorem:
  • T(n) = Θ(n^(log_b a))        if f(n) = O(n^(log_b a − ε)),
  • T(n) = Θ(n^(log_b a) log n)  if f(n) = Θ(n^(log_b a)),
  • T(n) = Θ(f(n))               if f(n) = Ω(n^(log_b a + ε)) and a·f(n/b) ≤ c·f(n) for some c < 1.
Divide & Conquer
• Parallel master theorem on the LoPRAM: if we assume parallel merging, the third case becomes Tp(n) = Θ(f(n)/p).
• Optimal speedup [i.e., Tp(n) = Θ(T(n)/p)] holds so long as p = O(log n).
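A quick worked instance (standard merge sort, not a new result from the talk): here a = b = 2 and f(n) = Θ(n) = Θ(n^(log₂ 2)), so the second case of the master theorem applies, and dividing the work across p cores gives

```latex
T(n) = 2\,T(n/2) + \Theta(n) \;\Rightarrow\; T(n) = \Theta(n \log n),
\qquad
T_p(n) = \Theta\!\left(\frac{n \log n}{p}\right) \text{ for } p = O(\log n).
```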
Matrix Multiplication (Strassen)
• T(n) = 7 T(n/2) + O(n²)
• T(n) = O(n^2.8)  [log₂ 7 ≈ 2.807]
• Tp(n) = O(n^2.8 / p)
Dynamic Programming
• A generic parallel algorithm exploits the parallelism available in the DAG of table-entry dependencies: entries whose dependencies are all resolved can be computed concurrently.
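As a hedged illustration of this idea (this particular example is mine, not from the talk): in the classic longest-common-subsequence table, cell (i, j) depends only on (i−1, j), (i, j−1), and (i−1, j−1), so all cells on one anti-diagonal of the DAG are independent and could be split among the p cores. The sketch below walks the table diagonal by diagonal; the inner loop is the part a LoPRAM would parallelize.

```c
#include <string.h>

#define MAXN 64

/* LCS length via anti-diagonal ("wavefront") order.  Cells on the same
 * diagonal d = i + j have no dependencies on each other, so the inner
 * loop could be distributed across p = O(log n) threads. */
int lcs_wavefront(const char *x, const char *y) {
    int n = (int)strlen(x), m = (int)strlen(y);
    static int c[MAXN + 1][MAXN + 1];
    memset(c, 0, sizeof c);
    for (int d = 2; d <= n + m; d++) {          /* diagonals, in DAG order */
        int lo = (d - m < 1) ? 1 : d - m;
        int hi = (d - 1 < n) ? d - 1 : n;
        for (int i = lo; i <= hi; i++) {        /* independent cells       */
            int j = d - i;
            if (x[i - 1] == y[j - 1])
                c[i][j] = c[i - 1][j - 1] + 1;
            else
                c[i][j] = (c[i - 1][j] > c[i][j - 1]) ? c[i - 1][j] : c[i][j - 1];
        }
    }
    return c[n][m];
}
```

Since each diagonal has up to min(n, m) independent cells and p is only O(log n), every core stays busy on all but the shortest diagonals, which is what makes optimal speedup plausible for this DAG shape.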
Conclusions
• Today’s computers have a small number of processors.
• The assumption that p = O(log n), or even O(log² n), will remain realistic for a while.
• Designing work-optimal algorithms for a small number of processors is easy.