Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) Alejandro Salinger Cheriton School of Computer Science University of Waterloo Joint work with Alejandro López-Ortiz and Reza Dorrigiv
Multicore Challenge
• The RAM model will no longer accurately reflect the architectures on which algorithms are executed.
• The PRAM facilitates design and analysis; however:
  • It is unrealistic.
  • It is difficult to derive work-optimal algorithms for Θ(n) processors.
• Current chips have 2, 4, or 8 cores: low-degree parallelism.
• Parallelism is thread-based.
Multicore Challenge
• Goal: design a model that:
  • Reflects the available degree of parallelism.
  • Is multi-threaded.
  • Admits easy theoretical analysis.
  • Is easy to program.
“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law.” [Gartner]
The LoPRAM Model
• The number of cores is not a constant: it is modeled as O(log n).
• This is similar to bit-level parallelism, where the word size is w = O(log n) bits.
LoPRAM:
• A PRAM with p = O(log n) processors running in MIMD mode.
• Concurrent Read, Exclusive Write (CREW).
• In its simplest form: high-level, thread-based parallelism.
• Semaphores and automatic serialization are available and transparent to the programmer.
• Note: p = O(log n), but not p = Θ(log n) — algorithms must work for any p up to that bound.
PAL-threads (sequential merge sort, before parallelization)

```c
void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}
```
PAL-threads (parallel version)

```c
void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {  // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
        }  // implicit join
        merge(numbers, temp, left, mid + 1, right);
    }
}
```

[Figure: thread states — pending, active, waiting]
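The palthreads construct is not standard C. As a rough sketch of its semantics (the depth cutoff, task struct, and constant names below are mine, not from the talk), the two recursive calls can be mapped onto POSIX threads with a join, forking only near the top of the recursion so that roughly p threads are ever active:

```c
#include <pthread.h>
#include <string.h>

#define PAR_DEPTH 2   /* fork only while shallow: ~2^PAR_DEPTH parallel tasks */

typedef struct { int *a, *tmp; int left, right, depth; } task_t;

/* Merge a[left..mid-1] and a[mid..right] (mid is the start of the right half). */
static void merge(int a[], int tmp[], int left, int mid, int right) {
    int i = left, j = mid, k = left;
    while (i < mid && j <= right) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)   tmp[k++] = a[i++];
    while (j <= right) tmp[k++] = a[j++];
    memcpy(a + left, tmp + left, (size_t)(right - left + 1) * sizeof(int));
}

static void *m_sort(void *arg) {
    task_t *t = (task_t *)arg;
    if (t->right > t->left) {
        int mid = (t->left + t->right) / 2;
        task_t lo = { t->a, t->tmp, t->left,  mid,      t->depth + 1 };
        task_t hi = { t->a, t->tmp, mid + 1,  t->right, t->depth + 1 };
        if (t->depth < PAR_DEPTH) {       /* "palthreads": fork while shallow */
            pthread_t th;
            pthread_create(&th, NULL, m_sort, &lo);
            m_sort(&hi);
            pthread_join(th, NULL);       /* implicit join */
        } else {                          /* deep levels run sequentially */
            m_sort(&lo);
            m_sort(&hi);
        }
        merge(t->a, t->tmp, t->left, mid + 1, t->right);
    }
    return NULL;
}

void mergeSort(int numbers[], int temp[], int array_size) {
    task_t root = { numbers, temp, 0, array_size - 1, 0 };
    m_sort(&root);
}
```

Setting the cutoff to roughly log₂ p levels keeps the number of live threads at O(p) while leaving the deep, fine-grained recursion sequential.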
Work-Optimal Algorithms: Divide & Conquer
• Consider recursive divide-and-conquer algorithms whose running time is given by the recurrence
  T(n) = a T(n/b) + f(n),  with a ≥ 1, b > 1.
• By the master theorem:
  • T(n) = Θ(n^(log_b a))        if f(n) = O(n^(log_b a − ε)),
  • T(n) = Θ(n^(log_b a) log n)  if f(n) = Θ(n^(log_b a)),
  • T(n) = Θ(f(n))               if f(n) = Ω(n^(log_b a + ε)) and a·f(n/b) ≤ c·f(n) for some c < 1.
Divide & Conquer
• Parallel master theorem on the LoPRAM: if we assume parallel merging, the third case becomes Tp(n) = Θ(f(n)/p).
• Optimal speedup [i.e., Tp(n) = Θ(T(n)/p)] holds so long as p = O(log n).
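A quick worked instance (standard merge sort, not a new result from the talk): here a = b = 2 and f(n) = Θ(n) = Θ(n^(log₂ 2)), so the second case of the master theorem applies, and dividing the work across p cores gives

```latex
T(n) = 2\,T(n/2) + \Theta(n) \;\Rightarrow\; T(n) = \Theta(n \log n),
\qquad
T_p(n) = \Theta\!\left(\frac{n \log n}{p}\right) \text{ for } p = O(\log n).
```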
Matrix Multiplication (Strassen)
• T(n) = 7 T(n/2) + O(n²)
• T(n) = O(n^2.8)  [log₂ 7 ≈ 2.807]
• Tp(n) = O(n^2.8 / p)
Dynamic Programming
• A generic parallel algorithm exploits the parallelism available in the DAG of table-entry dependencies: entries whose dependencies are all resolved can be computed concurrently.
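As a hedged illustration of this idea (this particular example is mine, not from the talk): in the classic longest-common-subsequence table, cell (i, j) depends only on (i−1, j), (i, j−1), and (i−1, j−1), so all cells on one anti-diagonal of the DAG are independent and could be split among the p cores. The sketch below walks the table diagonal by diagonal; the inner loop is the part a LoPRAM would parallelize.

```c
#include <string.h>

#define MAXN 64

/* LCS length via anti-diagonal ("wavefront") order.  Cells on the same
 * diagonal d = i + j have no dependencies on each other, so the inner
 * loop could be distributed across p = O(log n) threads. */
int lcs_wavefront(const char *x, const char *y) {
    int n = (int)strlen(x), m = (int)strlen(y);
    static int c[MAXN + 1][MAXN + 1];
    memset(c, 0, sizeof c);
    for (int d = 2; d <= n + m; d++) {          /* diagonals, in DAG order */
        int lo = (d - m < 1) ? 1 : d - m;
        int hi = (d - 1 < n) ? d - 1 : n;
        for (int i = lo; i <= hi; i++) {        /* independent cells       */
            int j = d - i;
            if (x[i - 1] == y[j - 1])
                c[i][j] = c[i - 1][j - 1] + 1;
            else
                c[i][j] = (c[i - 1][j] > c[i][j - 1]) ? c[i - 1][j] : c[i][j - 1];
        }
    }
    return c[n][m];
}
```

Since each diagonal has up to min(n, m) independent cells and p is only O(log n), every core stays busy on all but the shortest diagonals, which is what makes optimal speedup plausible for this DAG shape.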
Conclusions
• Today’s computers have a small number of processors.
• The assumption that p = O(log n), or even O(log² n), will remain realistic for a while.
• Designing work-optimal algorithms for a small number of processors is easy.