Performance Measurement • Assignment? • Timing:

#include <sys/time.h>

/* Returns the current wall-clock time in seconds. */
double When()
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1e-6);
}
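A quick usage sketch (not from the slides): wrap the region of interest between two calls to When() and take the difference; work() and its iteration count are hypothetical stand-ins for the code being measured.

#include <stdio.h>
#include <sys/time.h>

double When()                      /* same wall-clock timer as above */
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1e-6);
}

static double work(long n)         /* hypothetical workload */
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += (double)i * 0.5;
    return sum;
}

int main(void)
{
    double start = When();
    double result = work(10000000L);
    double elapsed = When() - start;
    printf("result = %f, elapsed = %f seconds\n", result, elapsed);
    return 0;
}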
A Quantitative Basis for Design • Parallel programming is an optimization problem. • Must take into account several factors: • execution time • scalability • efficiency
A Quantitative Basis for Design • Parallel programming is an optimization problem. • Must take into account several factors: • Also must take into account the costs: • memory requirements • implementation costs • maintenance costs, etc.
A Quantitative Basis for Design • Parallel programming is an optimization problem. • Must take into account several factors: • Also must take into account the costs: • Mathematical performance models are used to assess these costs and predict performance.
Defining Performance • How do you define parallel performance? • What do you define it in terms of? • Consider • Distributed databases • Image processing pipeline • Nuclear weapons testbed
Amdahl's Law • Every algorithm has a sequential component. • The sequential component limits speedup • If the sequential fraction is s, the maximum speedup is 1/s
Amdahl's Law [figure: speedup plotted against the sequential fraction s]
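For reference, a standard statement of Amdahl's law (the slides only quote the 1/s bound): with sequential fraction s, speedup on P processors is S(P) = 1 / (s + (1 - s)/P), which approaches 1/s as P grows. A minimal C sketch, with s = 0.05 assumed purely for illustration:

#include <stdio.h>

/* Amdahl's law: speedup on P processors when a fraction s of the work
   is strictly sequential. */
static double amdahl(double s, int P)
{
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void)
{
    double s = 0.05;                        /* assumed 5% sequential fraction */
    for (int P = 1; P <= 1024; P *= 4)
        printf("P = %4d  speedup = %7.2f\n", P, amdahl(s, P));
    printf("limit as P -> infinity: 1/s = %.2f\n", 1.0 / s);
    return 0;
}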
What's wrong? • Works fine for a given algorithm. • But what if we change the algorithm? • We may change algorithms to increase parallelism and thus, ultimately, performance. • Doing so may introduce inefficiency.
Metrics for Performance • Efficiency • Speedup • Scalability • Others
Efficiency • The fraction of time a processor spends doing useful work • E = T1 / (p · Tp) • What about when p · Tp < T1? • Does cache make a processor work at 110%?
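A quick worked example with assumed numbers (not from the slides): if T1 = 10 s and p = 4 processors finish in Tp = 3 s, then E = 10 / (4 · 3) ≈ 0.83, i.e. each processor is usefully busy about 83% of the time. The case p · Tp < T1 gives E > 1 (superlinear), which usually points to cache effects — the "110%" question above.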
Speedup • S = SpeedP / Speed1 • What is Speed? • What algorithm for Speed1? • What is the work performed? How much work?
Two kinds of Speedup • Relative • Uses parallel algorithm on 1 processor • Most common • Absolute • Uses best known serial algorithm • Eliminates overheads in calculation.
Speedup • Algorithm A • Serial execution time is 10 sec. • Parallel execution time is 2 sec. • Algorithm B • Serial execution time is 2 sec. • Parallel execution time is 1 sec. • What if I told you A = B?
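Working the numbers above through the conventional definition: A's speedup is 10/2 = 5, while B's is only 2/1 = 2 — yet B's parallel run (1 s) finishes twice as fast as A's (2 s). If A and B solve the same problem, the "better" speedup belongs to the slower solution.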
Logic The art of thinking and reasoning in strict accordance with the limitations and incapacities of the human misunderstanding. The basis of logic is the syllogism, consisting of a major and minor premise and a conclusion.
Example • Major Premise: Sixty men can do a piece of work sixty times as quickly as one man. • Minor Premise: One man can dig a post-hole in sixty seconds. • Conclusion: Sixty men can dig a post-hole in one second.
Performance Analysis Statements • There is always a trade-off between time and solution quality. • We should compare the quality of the answer for a given execution time. • For any performance reporting, find and clearly state the quality measure.
Speedup • Conventional speedup is defined as the reduction in execution time. • Consider running a problem on a slow parallel computer and on a faster one. • Same serial component • Speedup will be lower on the faster computer.
Speedup and Amdahl's Law • Conventional speedup penalizes faster absolute speed. • Assumption that task size is constant as the computing power increases results in an exaggeration of task overhead. • Scaling the problem size reduces these distortion effects.
Solution • Gustafson introduces scaled speedup. • Scale the problem size as you increase the number of processors. • Calculated in two ways • Experimentally • Analytical models
Traditional Speedup • Speedup = T1(N) / TP(N) • T1 is the time taken on a single processor • TP is the time taken on P processors
Scaled Speedup • Speedup = T1(PN) / TP(PN) • T1 is the time taken on a single processor • TP is the time taken on P processors
Traditional Speedup [figure: speedup vs. number of processors, showing the ideal line and the measured curve]
Scaled Speedup [figure: speedup vs. number of processors for small, medium, and large problems, compared with the ideal line]
Performance Measurement • There is no perfect way to measure and report performance. • Wall-clock time seems to be the best. • But how much work did you actually do? • Best bet: develop a model that fits the experimental results.
A Parallel Programming Model • Goal: Define an equation that predicts execution time as a function of • Problem size • Number of processors • Number of tasks • Etc. • T = f(N, P, ...)
A Parallel Programming Model • Execution time can be broken up into • Computing • Communicating • Idling • T = (1/P) · ( Σ Ti_comp + Σ Ti_comm + Σ Ti_idle ), with each sum taken over i = 0 … P−1
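A minimal sketch of that bookkeeping in C (the processor count and the per-processor times are assumed values for illustration):

#include <stdio.h>

#define P 4   /* assumed processor count for the example */

/* Predicted execution time: (1/P) * sum over i of (comp + comm + idle). */
double model_time(const double comp[], const double comm[], const double idle[], int p)
{
    double total = 0.0;
    for (int i = 0; i < p; i++)
        total += comp[i] + comm[i] + idle[i];
    return total / p;
}

int main(void)
{
    /* Hypothetical per-processor times, in seconds. */
    double comp[P] = { 2.0, 2.1, 1.9, 2.0 };
    double comm[P] = { 0.3, 0.3, 0.3, 0.3 };
    double idle[P] = { 0.1, 0.0, 0.2, 0.1 };
    printf("predicted T = %f s\n", model_time(comp, comm, idle, P));
    return 0;
}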
Computation Time • Normally depends on problem size • Also depends on machine characteristics • Processor speed • Memory system • Etc. • Often, experimentally obtained
Communication Time • The amount of time spent sending & receiving messages • Most often is calculated as • Cost of sending a single message * #messages • Single message cost • T = startuptime + time_to_send_one_word * #words
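A small sketch of that cost model (the latency and per-word time are assumed values, not measurements):

#include <stdio.h>

/* Cost of one message: startup time + time to send one word * #words. */
double message_cost(double startup, double per_word, int nwords)
{
    return startup + per_word * nwords;
}

int main(void)
{
    double startup  = 1e-3;   /* assumed 1 ms message startup   */
    double per_word = 1e-6;   /* assumed 1 us to send one word  */
    int nmessages = 100, nwords = 1024;

    /* Total communication time = cost of a single message * #messages. */
    double tcomm = nmessages * message_cost(startup, per_word, nwords);
    printf("Tcomm = %f s\n", tcomm);
    return 0;
}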
Idle Time • Difficult to determine • Often this is time spent waiting for a message to arrive. • Can be avoided by overlapping communication and computation.
Finite Difference Example • Finite difference code on an n × n × z grid • 512 × 512 × 5 elements • Nine-point stencil • Row-wise decomposition • Each processor gets (n/p) × n × z elements • 16 IBM RS6000 workstations • Connected via Ethernet
Finite Difference Model • Execution Time (per iteration) • ExTime = (Tcomp + Tcomm)/P • Communication Time (per iteration) • Tcomm = 2 (lat + 2*n*z*bw) • Computation Time • Estimate using some sample code
What was wrong? • Ethernet • Shared bus • Change the computation of Tcomm • Reduce the bandwidth • Scale the message volume by the number of processors sending concurrently. • Tcomm = 2 (lat + 2*n*z*bw * P/2)
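A sketch of the per-iteration model with the example's sizes plugged in (lat, bw, and Tcomp are assumed values; bw is treated here as the time to send one word):

#include <stdio.h>

int main(void)
{
    int n = 512, z = 5, P = 16;
    double lat   = 1e-3;   /* assumed message startup time, seconds                 */
    double bw    = 1e-6;   /* assumed time to send one word, seconds                */
    double tcomp = 2.0;    /* assumed computation time, estimated from sample code  */

    /* Original model: two exchanges of 2*n*z words per iteration. */
    double tcomm_orig = 2.0 * (lat + 2.0 * n * z * bw);

    /* Corrected model: the shared Ethernet bus serializes concurrent senders,
       so the message volume is scaled by P/2. */
    double tcomm_bus = 2.0 * (lat + 2.0 * n * z * bw * P / 2.0);

    printf("original:    Tcomm = %f s, ExTime = %f s\n",
           tcomm_orig, (tcomp + tcomm_orig) / P);
    printf("shared bus:  Tcomm = %f s, ExTime = %f s\n",
           tcomm_bus, (tcomp + tcomm_bus) / P);
    return 0;
}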
Using analytical models • Examine the control flow of the algorithm • Find a general algebraic form for the complexity (execution time). • Fit the curve with experimental data. • If the fit is poor, find the missing terms and repeat. • Calculate the scaled speedup using the formula.
Example • Serial time = 2 + 12N seconds • Parallel time = 4 + 12N/P + 5P seconds • Let N/P = 128 • Scaled speedup for 4 processors:

C1(PN) / CP(PN) = (2 + 12(4 · 128)) / (4 + 12(4 · 128)/4 + 5(4)) = 6146 / 1560 ≈ 3.94
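The same calculation in a short C sketch (the model coefficients come from the example; sweeping P is just for illustration):

#include <stdio.h>

/* Analytical model from the example. */
double serial_time(double N)          { return 2.0 + 12.0 * N; }
double parallel_time(double N, int P) { return 4.0 + 12.0 * N / P + 5.0 * P; }

int main(void)
{
    double grain = 128.0;   /* N/P held fixed at 128, as in the example */
    for (int P = 1; P <= 16; P *= 2) {
        double N = grain * P;             /* scale the problem with P */
        double speedup = serial_time(N) / parallel_time(N, P);
        printf("P = %2d  N = %5.0f  scaled speedup = %5.2f\n", P, N, speedup);
    }
    return 0;
}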
Performance Evaluation • Identify the data • Design the experiments to obtain the data • Report data
Performance Evaluation • Identify the data • Execution time • Be sure to examine a range of data points • Design the experiments to obtain the data • Report data
Performance Evaluation • Identify the data • Design the experiments to obtain the data • Make sure the experiment measures what you intend to measure. • Remember: Execution time is max time taken. • Repeat your experiments many times • Validate data by designing a model • Report data
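One possible shape for such an experiment, sketched with MPI (the slides do not prescribe a library; MPI_Wtime, the trial count, and compute_step() are assumptions): each trial records the maximum time across processors, and the whole measurement is repeated several times.

#include <stdio.h>
#include <mpi.h>

static void compute_step(void)
{
    /* Stand-in workload; replace with the code actually being measured. */
    volatile double x = 0.0;
    for (long i = 0; i < 1000000L; i++)
        x += (double)i * 0.5;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int trials = 10;                     /* repeat the experiment many times */
    for (int t = 0; t < trials; t++) {
        MPI_Barrier(MPI_COMM_WORLD);           /* start all processors together */
        double start = MPI_Wtime();
        compute_step();
        double local = MPI_Wtime() - start;

        /* Execution time is the maximum time taken by any processor. */
        double max_time;
        MPI_Reduce(&local, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("trial %d: %f s\n", t, max_time);
    }

    MPI_Finalize();
    return 0;
}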
Performance Evaluation • Identify the data • Design the experiments to obtain the data • Report data • Report all information that affects execution • Results should be separate from Conclusions • Present the data in an easily understandable format.