140 likes | 326 Views
ECE 669 Parallel Computer Architecture Lecture 12 Interconnection Network Performance. Performance Analysis of Interconnection Networks. Bandwidth Latency Proportional to diameter Latency with contention Processor utilization + Physical constraints. (. k. 1. ). -. n. torus. 1.
E N D
ECE 669Parallel Computer ArchitectureLecture 12Interconnection Network Performance
Performance Analysis of Interconnection Networks • Bandwidth • Latency • Proportional to diameter • Latency with contention • Processor utilization • + Physical constraints
( k 1 ) - n torus 1 dir channels - - 2 Direct nk Avg DIST = 3 Msg size B = Bandwidth & Latency Latency • Average latency kd: • Bandwidth per node - more complex • if all near neighbor messages • If average travel dist then? • Let worst case nk - k k ~ n torus 2 dir channels - - 4 k ~ n no end conns 2 dir channels - - 3 1 B
3 kB Analogy • If each student takes 8 years to graduate • And if a Prof. can support 10 students max at any time How many new students can Prof. take on in a year? Each year take on • Similarly: • Network has channels • # of flits it can sustain = • # of msgs it can concurrently sustain= • Each msg flit uses channels to dest • So Nn Nn B nk 3 nk Nn = x N 3 B or = x = BW per node
Another way of getting BW is: Nn • Max # msgs in net at any time • These take cycles to get delivered, during which time no new mgs can get in • I.e. we can inject or # injected per node per cycle • Note: We have not considered contention thus far. • In practice, latency shoots up much before we achieve the theoretical due to contention. = B kn = 3 Nn kn msgs every cycles B 3 Nn 3 1 = · · B kn N 3 = Bk
What’s a good performance metric to analyze networks • Possibilities: • Bandwidth • Latency • Discuss • Bandwidth - good when latency not an issue. I.e., if we can overlap all communication with computation • Estimate of how much info we can transport • Latency - good when not much parallelism exists (critical sections, e.g.), or when communication cannot be overlapped with computation. • Effective processor utilization best metric - but always not easy to derive. L 1 BW U = 1 req rate * L (U) +
Mem Latency T = (1+c) hops + M + B Performance Analysis • First, analysis based on latency alone • Taking contention into account, but ignoring technological constraints Such analysis is o.k., as long as we do not compare workloads with different traffic rates With no contention, With contention, Average contention delay cycles at each switch Latency T = hops + M + B Delay M Network Hops B
( ( ) ) k k 1 1 r r - - B B d d æ æ 1 1 ö ö ç ç ÷ ÷ 1 1 + + ( ( ) ) r r 2 2 1 1 è è n n ø ø - - kd kd é ù T 1 nk M B = + + + \ ê ú d ë û Direct k-ary n-cube (Agarwal paper) c = hops nkd = r Fraction of bandwidth = consumed, mBkd = Distance traveled in a dim. in a unidirectional torus k 1 - kd = 2 m traffic rate, [probability of request] = k-ary, n-cube latency
1 U = 1 mT + A better metric • Processor utilization U • Or, how network delay impacts overall performance • U = Fraction of time spent doing useful work • Ideally, • Each instruction takes 1 cycle • If probability of message (remote mem req) on each useful cycle is m and corresponding latency is T Each instruction takes 1+mT cycles or M Net T cycles P
T U T0 T1 U0 m1=U0m0 1 U = 1 mT + m1 T1 T0 m0 BW T m m2 U1 m2 Deriving U is not easy • Notice m = Probability of a message on a useful processor cycle meff = m• U = probability of msg on any cycle T = f (meff)= network delay as a function of m or # of useful processor cycles depends on T and meff • Cyclic dependence!
( ) k 1 r - B d æ 1 ö ç ÷ 1 + ( ) r 2 1 è n ø - kd r mBkdU = 1 1 U U = = 1 1 mT mT + + Processor utilization • k-ary n-cube m= Probability of msg on useful cycle • Channel utilization • Latency • Processor utilization • Two equations [1] and [2] • Two unknowns: T, U • Solving, • Where • and é ù [1] T 1 nk M B = + + + ê ú d ë û [2] T0 Bkd Bkd 2 1 1 1 æ ö ( ) ( ) ç ÷ T 2 B2 k 1 n 1 T = + - + - + + - + 0 d 2 4 2 m 2 2 m è ø nkd M B T = + + 0 unloaded network latency =
How do technological constraints impact network design, performance? Technology Constraints • Constraints • Wire lengths limit maximum speed of clock • # pins limit size of nodes • Bisection width limits # of wires per channel • Software • Can we compile to the architecture? • Can users specify programs? • Is it scalable?
B é ù [ ] ( ) ( ) T T n 1 c hops T = + + + switch wire ê ú W ( n ) ë û 1 n The performance equation becomes • Latency • Recall, • Now, • See paper for details. • Summary: Constraints favor small n. T = [ (1 + c) hops + B] cycle time msg length switch delay wire delay f(n) n channel width: function of bisection width - 1 k 2 E . g µ 1 - k E . g µ N 2 µ 1 n µ N