Introduction
Why parallel? Because a large amount of computation is needed.
Example: 3D imaging used in surgery requires about 10^15 operations per second. A typical computer performs about 10^7 operations per second.
10^15 / 10^7 = 10^8 seconds; 10^8 seconds / ~10^5 seconds per day ≈ 10^3 days ≈ 3 years.
Other fields that require massive computation: Aircraft Testing, New Drug Development, Oil Exploration, Modeling Fusion Reactors, Economic Planning, Cryptanalysis, Managing Large Databases, Astronomy, Biomedical Analysis, …
Computer speed doubles roughly every 3 to 4 years. Unfortunately, the speed of light is 3×10^8 m/s, and components placed too close together interfere with each other. It is evident that this trend will soon come to an end — by a simple law of physics.
Parallelism: throw more people (CPUs) at the problem. Many computers today contain more than one CPU.
Models of Computation
Flynn's classification:
1. SISD (Single Instruction, Single Data)
2. MISD (Multiple Instruction, Single Data)
3. SIMD (Single Instruction, Multiple Data)
4. MIMD (Multiple Instruction, Multiple Data)
SISD Computers
Sequential (serial) computation
Example: summation of n numbers:
Sum = 0
For I = 1 to n
  Sum = Sum + I
Next
needs n-1 additions; O(n) time
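The loop above can be written as a runnable sketch (plain Python; the function name is ours, not from the slides):

```python
def sequential_sum(numbers):
    """Sum n numbers one at a time: n-1 additions, O(n) time."""
    total = numbers[0]
    for x in numbers[1:]:
        total = total + x  # one addition per remaining number
    return total

print(sequential_sum(list(range(1, 9))))  # 1+2+...+8 = 36
```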
Example: test whether a positive integer n is prime or not.
n is prime: it has no divisors except 1 and n; otherwise n is composite.
Each processor tests one candidate divisor between 1 and n. Each processor needs O(1) operations, so the whole test takes O(1) time.
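A minimal sketch of this idea, simulating the processors with a loop (on a real SIMD machine all divisor tests run at once):

```python
def is_prime_parallel(n):
    """Simulate n processors: processor i tests whether i divides n.
    On a real SIMD machine all tests run in parallel, giving O(1) time;
    here the processors are simulated serially."""
    if n < 2:
        return False
    # processor i (2 <= i <= n-1) reports True if it finds a divisor
    found_divisor = [n % i == 0 for i in range(2, n)]
    return not any(found_divisor)

print(is_prime_parallel(13))  # True
print(is_prime_parallel(15))  # False
```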
Shared-Memory (SM) SIMD computers
PRAM (Parallel Random Access Machine)
How does processor i pass a number to another processor j? Processor i writes it to a shared-memory location and processor j reads it: O(1) time.
Memory-access modes:
1. EREW (Exclusive Read, Exclusive Write)
2. CREW (Concurrent Read, Exclusive Write)
3. ERCW (Exclusive Read, Concurrent Write)
4. CRCW (Concurrent Read, Concurrent Write)
How to resolve write conflicts in CRCW?
(a) The smallest-numbered processor is allowed to write; access is denied to all other processors.
(b) Writing succeeds only if the values to be written by all processors are equal; otherwise access is denied to all processors.
(c) The sum of the values to be written is stored.
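The three conflict-resolution rules can be sketched as follows (function and rule names are ours, for illustration):

```python
def crcw_write(requests, rule):
    """Resolve concurrent writes to one memory cell.
    requests: list of (processor_id, value) pairs.
    rule: 'priority' | 'common' | 'sum' -- rules (a), (b), (c) above.
    Returns the value stored, or None if access is denied."""
    if rule == "priority":            # (a) smallest-numbered processor wins
        return min(requests)[1]
    if rule == "common":              # (b) succeed only if all values agree
        values = {v for _, v in requests}
        return values.pop() if len(values) == 1 else None
    if rule == "sum":                 # (c) store the sum of all values
        return sum(v for _, v in requests)

reqs = [(3, 5), (1, 7), (2, 5)]
print(crcw_write(reqs, "priority"))  # 7  (processor 1 wins)
print(crcw_write(reqs, "common"))    # None (values differ)
print(crcw_write(reqs, "sum"))       # 17
```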
Simulating multiple accesses on an EREW computer:
(i) N multiple accesses (read): broadcasting procedure:
1. P1 → P2
2. P1, P2 → P3, P4
3. P1, P2, P3, P4 → P5, P6, P7, P8
…
O(log N) steps
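The doubling procedure above can be simulated in plain Python (the informed set doubles each step, so the step count is ⌈log2 N⌉):

```python
def erew_broadcast(value, n_procs):
    """Broadcast one value to n_procs processors on an EREW PRAM.
    In each step, every processor that already knows the value passes
    it to one new processor, so the informed set doubles: O(log N) steps."""
    known = [None] * n_procs
    known[0] = value          # P1 starts with the value
    steps = 0
    informed = 1
    while informed < n_procs:
        # each informed processor writes to one distinct new processor
        for i in range(min(informed, n_procs - informed)):
            known[informed + i] = value
        informed = min(2 * informed, n_procs)
        steps += 1
    return steps

print(erew_broadcast(5, 8))  # 3 steps = log2(8)
```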
(Figure: the datum 5 stored at memory location D is broadcast, in doubling steps, to processors P1–P8 via the array A.)
(ii) m out of N multiple accesses (read), m ≤ N: each memory location is associated with a binary tree of processors.
(Figures: processors P1, P2, P4, P7 read memory location d; their requests ascend the binary tree associated with d, and the datum d then descends the tree back to the requesting processors.)
Interconnection-network SIMD computers
The memory is distributed among the N processors. A pair of processors can communicate with each other if they are connected by a line (link). Several pairs of processors may communicate simultaneously.
(1) fully connected network: every pair of processors is joined by a link, so there are N(N-1)/2 links.
(2) linear array (1-dimensional array): Pi is connected to Pi-1 and Pi+1, for 2 ≤ i ≤ N-1.
(3) two-dimensional array (2-D mesh-connected)
P(j,k) has links connecting to P(j+1,k), P(j-1,k), P(j,k+1) and P(j,k-1).
(4) tree connection: the sons of Pi are P2i and P2i+1; the parent of Pi is P⌊i/2⌋.
(5) shuffle-exchange connection
shuffle links: Pi → Pj, where j = 2i if 0 ≤ i ≤ N/2 - 1, and j = 2i - (N-1) if N/2 ≤ i ≤ N-1
exchange links: Pi ↔ Pi+1 if i is even
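The two link types can be computed directly (a shuffle is a left cyclic shift of the processor index in binary; function names are ours):

```python
def shuffle_link(i, n):
    """Perfect-shuffle partner of processor i among n processors:
    j = 2i for i < n/2, j = 2i - (n-1) otherwise -- equivalently,
    a left cyclic shift of i's binary representation."""
    return 2 * i if i < n // 2 else 2 * i - (n - 1)

def exchange_link(i):
    """Exchange partner: flip the lowest bit (pairs 0-1, 2-3, ...)."""
    return i ^ 1

n = 8
print([shuffle_link(i, n) for i in range(n)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```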
(6) cube connection (n-cube, hypercube)
(Figure: a 3-cube.)
An n-cube connected network consists of 2^n nodes (N = 2^n). The binary representation of node i is i(n-1) i(n-2) … i(j) … i(1) i(0); node i connects to each node whose representation differs from it in exactly one bit position j. Each processor therefore has n links.
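The neighbor rule is a one-bit flip, which is easy to compute (function name is ours):

```python
def hypercube_neighbors(i, n):
    """Neighbors of node i in an n-cube: flip each of the n bits."""
    return [i ^ (1 << j) for j in range(n)]

# 3-cube: node 0 (binary 000) connects to 001, 010, 100
print(hypercube_neighbors(0, 3))  # [1, 2, 4]
print(hypercube_neighbors(5, 3))  # [4, 7, 1]
```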
This concept can be implemented on a tree machine. Time: O(log n), where n is the number of inputs. For m sets, each of n numbers, pipelining gives log n + (m-1) steps. The concept can also be implemented on other models.
MIMD computers sharing a common memory: multiprocessors (tightly coupled machines).
MIMD computers with an interconnection network: multicomputers (loosely coupled machines, distributed systems).
An engineering analogy:
● the time until the job is finished
● compared with the time one person would take
● person-years (or person-months)
● person-years of one worker / person-years of n workers
Analyzing Algorithms:
● running time ● speedup ● cost ● efficiency
Running Time: the time from when the first processor begins computing to when the last processor finishes computing.
Running Time
Sum = 0            executed 1 time
For I = 1 to n     executed N+1 times
  Sum = Sum + I    executed N times
Next               executed N+1 times
Total: 3N+3 operations = O(N)
Upper Bound
g(n) = O(f(n)): g grows at most as fast as f(n); there exist c > 0 and n0 such that g(n) ≤ c·f(n) for all n ≥ n0.
Example: f(n) = n, g(n) = 3n + 3. Then g(n) = O(f(n)), since 3n + 3 ≤ 5n when n ≥ 2 (c = 5, n0 = 2).
(Figure: c·f(n) dominating g(n) beyond n0.)
Lower Bound
g(n) = Ω(f(n)): g grows at least as fast as f(n); there exist c > 0 and n0 such that g(n) ≥ c·f(n) for all n ≥ n0.
Example: f(n) = n, g(n) = 3n + 3. Then g(n) = Ω(f(n)), since 3n + 3 ≥ 2n when n ≥ 1 (c = 2, n0 = 1).
(Figure: g(n) dominating c·f(n) beyond n0.)
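Both bounds for g(n) = 3n + 3 can be checked numerically with the constants from the slides (c = 5, n0 = 2 for the upper bound; c = 2, n0 = 1 for the lower bound):

```python
def g(n):
    return 3 * n + 3

# Upper bound: g(n) = O(n), witnessed by c = 5, n0 = 2
assert all(g(n) <= 5 * n for n in range(2, 1000))

# Lower bound: g(n) = Omega(n), witnessed by c = 2, n0 = 1
assert all(g(n) >= 2 * n for n in range(1, 1000))

print("both bounds hold")
```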
Adding 8 numbers with a binary tree of processors: time O(log n).
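A sketch of the tree addition, simulating the parallel rounds serially (assumes the input length is a power of two):

```python
def tree_sum(numbers):
    """Pairwise (tree) addition: in each round, adjacent pairs are
    added -- all at once on a parallel machine -- halving the list.
    For n numbers (n a power of two) this takes log2(n) rounds: O(log n)."""
    level = list(numbers)
    rounds = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

total, rounds = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)  # 36 3
```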
The lower bound of sorting is Ω(n log n). (why?) The time complexity of heapsort is O(n log n). Thus, heapsort is optimal. The upper bound of sorting is O(n log n).
For matrix multiplication, the lower bound is Ω(n^2) and the upper bound is O(n^2.5) (an algorithm with that running time exists).
speedup = (time taken by a single CPU) / (time taken by the parallel computation)
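A worked example of the ratio, reusing the tree-sum timings from the slides (the n = 1024 instance is our own illustration):

```python
def speedup(sequential_time, parallel_time):
    """Speedup = time on one processor / time on the parallel machine."""
    return sequential_time / parallel_time

# Example: summing n = 1024 numbers
n = 1024
t_seq = n - 1                 # n-1 additions on one processor
t_par = 10                    # log2(1024) rounds of tree addition
print(speedup(t_seq, t_par))  # 102.3
```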