Introduction
Why parallel? Because a large amount of computation is needed.
Example: 3D imaging used in surgery requires about 10^15 operations per second. A typical computer performs about 10^7 operations per second.
10^15 / 10^7 = 10^8 seconds; 10^8 seconds / ~10^5 seconds per day ≈ 10^3 days ≈ 3 years.
Other fields that require massive computation: Aircraft Testing, New Drug Development, Oil Exploration, Modeling Fusion Reactors, Economic Planning, Cryptanalysis, Managing Large Databases, Astronomy, Biomedical Analysis, …
Computer speed doubles roughly every 3 to 4 years. Unfortunately, the speed of light is 3×10^8 m/s, and components placed too close together interfere with each other. It is evident that this trend will soon come to an end — by a simple law of physics.
Parallelism: throw more people (CPUs) at the problem. Many computers today contain more than one CPU.
Models of Computation
Flynn's classification:
1. SISD (Single Instruction, Single Data)
2. MISD (Multiple Instruction, Single Data)
3. SIMD (Single Instruction, Multiple Data)
4. MIMD (Multiple Instruction, Multiple Data)
SISD Computers
Sequential (serial) computation
Example: summation of n numbers:
Sum = 0
For I = 1 to n
  Sum = Sum + I
Next
needs n-1 additions; O(n) time
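The loop above can be written as a runnable sketch (plain Python; the function name is ours, not from the slides):

```python
def sequential_sum(numbers):
    """Sum n numbers one at a time: n-1 additions, O(n) time."""
    total = numbers[0]
    for x in numbers[1:]:
        total = total + x  # one addition per remaining number
    return total

print(sequential_sum(list(range(1, 9))))  # 1+2+...+8 = 36
```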
Example: test whether a positive integer n is prime or not.
n is prime: it has no divisors except 1 and n; otherwise n is composite.
Each processor tests one candidate divisor between 1 and n. Each processor needs O(1) operations, so the whole test takes O(1) time.
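A minimal sketch of this idea, simulating the processors with a loop (on a real SIMD machine all divisor tests run at once):

```python
def is_prime_parallel(n):
    """Simulate n processors: processor i tests whether i divides n.
    On a real SIMD machine all tests run in parallel, giving O(1) time;
    here the processors are simulated serially."""
    if n < 2:
        return False
    # processor i (2 <= i <= n-1) reports True if it finds a divisor
    found_divisor = [n % i == 0 for i in range(2, n)]
    return not any(found_divisor)

print(is_prime_parallel(13))  # True
print(is_prime_parallel(15))  # False
```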
Shared-Memory (SM) SIMD computers
PRAM (Parallel Random Access Machine)
How does processor i pass a number to another processor j? Processor i writes it to a shared-memory location and processor j reads it: O(1) time.
Memory-access modes:
1. EREW (Exclusive Read, Exclusive Write)
2. CREW (Concurrent Read, Exclusive Write)
3. ERCW (Exclusive Read, Concurrent Write)
4. CRCW (Concurrent Read, Concurrent Write)
How to resolve write conflicts in CRCW?
(a) The smallest-numbered processor is allowed to write; access is denied to all other processors.
(b) Writing succeeds only if the values to be written by all processors are equal; otherwise access is denied to all processors.
(c) The sum of the values to be written is stored.
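The three conflict-resolution rules can be sketched as follows (function and rule names are ours, for illustration):

```python
def crcw_write(requests, rule):
    """Resolve concurrent writes to one memory cell.
    requests: list of (processor_id, value) pairs.
    rule: 'priority' | 'common' | 'sum' -- rules (a), (b), (c) above.
    Returns the value stored, or None if access is denied."""
    if rule == "priority":            # (a) smallest-numbered processor wins
        return min(requests)[1]
    if rule == "common":              # (b) succeed only if all values agree
        values = {v for _, v in requests}
        return values.pop() if len(values) == 1 else None
    if rule == "sum":                 # (c) store the sum of all values
        return sum(v for _, v in requests)

reqs = [(3, 5), (1, 7), (2, 5)]
print(crcw_write(reqs, "priority"))  # 7  (processor 1 wins)
print(crcw_write(reqs, "common"))    # None (values differ)
print(crcw_write(reqs, "sum"))       # 17
```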
Simulating multiple accesses on an EREW computer:
(i) N multiple accesses (read): broadcasting procedure:
1. P1 → P2
2. P1, P2 → P3, P4
3. P1, P2, P3, P4 → P5, P6, P7, P8
…
O(log N) steps
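The doubling procedure above can be simulated in plain Python (the informed set doubles each step, so the step count is ⌈log2 N⌉):

```python
def erew_broadcast(value, n_procs):
    """Broadcast one value to n_procs processors on an EREW PRAM.
    In each step, every processor that already knows the value passes
    it to one new processor, so the informed set doubles: O(log N) steps."""
    known = [None] * n_procs
    known[0] = value          # P1 starts with the value
    steps = 0
    informed = 1
    while informed < n_procs:
        # each informed processor writes to one distinct new processor
        for i in range(min(informed, n_procs - informed)):
            known[informed + i] = value
        informed = min(2 * informed, n_procs)
        steps += 1
    return steps

print(erew_broadcast(5, 8))  # 3 steps = log2(8)
```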
(Figure: the datum 5 stored at memory location D is broadcast, in doubling steps, to processors P1–P8 via the array A.)
(ii) m out of N multiple accesses (read), m ≤ N: each memory location is associated with a binary tree of processors.
(Figures: processors P1, P2, P4, P7 read memory location d; their requests ascend the binary tree associated with d, and the datum d then descends the tree back to the requesting processors.)
Interconnection-network SIMD computers
The memory is distributed among the N processors. A pair of processors can communicate with each other if they are connected by a line (link). Several pairs of processors may communicate simultaneously.
(1) fully connected network: every pair of processors is joined by a link, so there are N(N-1)/2 links.
(2) linear array (1-dimensional array): Pi is connected to Pi-1 and Pi+1, for 2 ≤ i ≤ N-1.
(3) two-dimensional array (2-D mesh-connected)
P(j,k) has links connecting to P(j+1,k), P(j-1,k), P(j,k+1) and P(j,k-1).
(4) tree connection: the sons of Pi are P2i and P2i+1; the parent of Pi is P⌊i/2⌋.
(5) shuffle-exchange connection
shuffle links: Pi → Pj, where j = 2i if 0 ≤ i ≤ N/2 - 1, and j = 2i - (N-1) if N/2 ≤ i ≤ N-1
exchange links: Pi ↔ Pi+1 if i is even
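The two link types can be computed directly (a shuffle is a left cyclic shift of the processor index in binary; function names are ours):

```python
def shuffle_link(i, n):
    """Perfect-shuffle partner of processor i among n processors:
    j = 2i for i < n/2, j = 2i - (n-1) otherwise -- equivalently,
    a left cyclic shift of i's binary representation."""
    return 2 * i if i < n // 2 else 2 * i - (n - 1)

def exchange_link(i):
    """Exchange partner: flip the lowest bit (pairs 0-1, 2-3, ...)."""
    return i ^ 1

n = 8
print([shuffle_link(i, n) for i in range(n)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```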
(6) cube connection (n-cube, hypercube)
(Figure: a 3-cube.)
An n-cube connected network consists of 2^n nodes (N = 2^n). The binary representation of node i is i(n-1) i(n-2) … i(j) … i(1) i(0); node i connects to each node whose representation differs from it in exactly one bit position j. Each processor therefore has n links.
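The neighbor rule is a one-bit flip, which is easy to compute (function name is ours):

```python
def hypercube_neighbors(i, n):
    """Neighbors of node i in an n-cube: flip each of the n bits."""
    return [i ^ (1 << j) for j in range(n)]

# 3-cube: node 0 (binary 000) connects to 001, 010, 100
print(hypercube_neighbors(0, 3))  # [1, 2, 4]
print(hypercube_neighbors(5, 3))  # [4, 7, 1]
```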
This concept can be implemented on a tree machine. Time: O(log n), where n is the number of inputs. For m sets, each of n numbers, pipelining gives log n + (m-1) steps. The concept can also be implemented on other models.
MIMD computers sharing a common memory: multiprocessors (tightly coupled machines).
MIMD computers with an interconnection network: multicomputers (loosely coupled machines, distributed systems).
An engineering analogy:
● the time until the job is finished
● compared with the time one person would take
● person-years (or person-months)
● person-years of one worker / person-years of n workers
Analyzing Algorithms:
● running time ● speedup ● cost ● efficiency
Running Time: the time from when the first processor begins computing to when the last processor finishes computing.
Running Time
Sum = 0            executed 1 time
For I = 1 to n     executed N+1 times
  Sum = Sum + I    executed N times
Next               executed N+1 times
Total: 3N+3 operations = O(N)
Upper Bound
g(n) = O(f(n)): g grows at most as fast as f(n); there exist c > 0 and n0 such that g(n) ≤ c·f(n) for all n ≥ n0.
Example: f(n) = n, g(n) = 3n + 3. Then g(n) = O(f(n)), since 3n + 3 ≤ 5n when n ≥ 2 (c = 5, n0 = 2).
(Figure: c·f(n) dominating g(n) beyond n0.)
Lower Bound
g(n) = Ω(f(n)): g grows at least as fast as f(n); there exist c > 0 and n0 such that g(n) ≥ c·f(n) for all n ≥ n0.
Example: f(n) = n, g(n) = 3n + 3. Then g(n) = Ω(f(n)), since 3n + 3 ≥ 2n when n ≥ 1 (c = 2, n0 = 1).
(Figure: g(n) dominating c·f(n) beyond n0.)
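Both bounds for g(n) = 3n + 3 can be checked numerically with the constants from the slides (c = 5, n0 = 2 for the upper bound; c = 2, n0 = 1 for the lower bound):

```python
def g(n):
    return 3 * n + 3

# Upper bound: g(n) = O(n), witnessed by c = 5, n0 = 2
assert all(g(n) <= 5 * n for n in range(2, 1000))

# Lower bound: g(n) = Omega(n), witnessed by c = 2, n0 = 1
assert all(g(n) >= 2 * n for n in range(1, 1000))

print("both bounds hold")
```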
Adding 8 numbers with a binary tree of processors: time O(log n).
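A sketch of the tree addition, simulating the parallel rounds serially (assumes the input length is a power of two):

```python
def tree_sum(numbers):
    """Pairwise (tree) addition: in each round, adjacent pairs are
    added -- all at once on a parallel machine -- halving the list.
    For n numbers (n a power of two) this takes log2(n) rounds: O(log n)."""
    level = list(numbers)
    rounds = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

total, rounds = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)  # 36 3
```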
The lower bound of sorting is Ω(n log n). (why?) The time complexity of heapsort is O(n log n). Thus, heapsort is optimal. The upper bound of sorting is O(n log n).
For matrix multiplication, the lower bound is Ω(n^2) and the upper bound is O(n^2.5) (an algorithm with that running time exists).
speedup = (time taken by a single CPU) / (time taken by the parallel computation)
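A worked example of the ratio, reusing the tree-sum timings from the slides (the n = 1024 instance is our own illustration):

```python
def speedup(sequential_time, parallel_time):
    """Speedup = time on one processor / time on the parallel machine."""
    return sequential_time / parallel_time

# Example: summing n = 1024 numbers
n = 1024
t_seq = n - 1                 # n-1 additions on one processor
t_par = 10                    # log2(1024) rounds of tree addition
print(speedup(t_seq, t_par))  # 102.3
```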