Dynamic Tuning of Master/Worker Applications
Paradyn/Condor Week 2005, March 2005, UAB
Anna Morajko, Paola Caymes Scutari, Tomàs Margalef, Eduardo Cesar, Joan Sorribes and Emilio Luque
Universitat Autònoma de Barcelona
Outline • Introduction • MATE • Number of workers • Data distribution • Conclusions
Introduction: Application performance
• The main goal of a parallel/distributed application is to solve a given problem as fast as possible
• Performance is therefore one of its most important requirements
• Developers must optimize application performance to provide efficient and useful applications
Introduction (II)
• Finding bottlenecks and determining their solutions is difficult in parallel/distributed applications
  • Many tasks cooperate with each other
  • Application behavior may change depending on the input data or the environment
• This is an especially difficult task for non-expert users
Outline • Introduction • MATE • Number of workers • Data distribution • Conclusions
MATE
• Monitoring, Analysis and Tuning Environment
• Dynamic automatic tuning of parallel/distributed applications
• Closed loop: the tool monitors the running application (events), performs performance analysis, and applies modifications via DynInst instrumentation
[Figure: tuning cycle - application development (source, user), execution, performance data collected by the monitoring/tuning tool, performance analysis, and modifications applied through DynInst instrumentation]
MATE (II)
• Application Controller - AC
• Dynamic Monitoring Library - DMLib
• Analyzer
[Figure: MATE architecture - application tasks linked with DMLib running under pvmd on Machines 1 and 2; an AC on each machine instruments the tasks and applies modifications; events flow to the Analyzer on Machine 3]
MATE (II)
• Analyzer
  • Carries out the application performance analysis
  • Detects problems "on the fly" and requests changes
MATE (II)
• Application Controller (AC)
  • Controls the execution of the application
  • Has a Monitor module to manage instrumentation via DynInst and gather execution information
  • Has a Tuner module to perform tuning via DynInst
MATE (II)
• Dynamic Monitoring Library (DMLib)
  • Facilitates instrumentation and data collection
  • Responsible for the registration of events
MATE (III)
• Automatic performance analysis on the fly
  • Finds bottlenecks in the collected events by applying a performance model
  • Finds solutions that overcome those bottlenecks
• The Analyzer is provided with application knowledge about performance problems
  • The information related to one problem is called a tuning technique
  • A tuning technique describes a complete performance optimization scenario
MATE (IV)
• Each tuning technique is implemented in MATE as a "tunlet"
• A tunlet is a C/C++ library dynamically loaded into the Analyzer process; it specifies:
  • Measure points - what events are needed
  • Performance model - how to determine bottlenecks and solutions
  • Tuning actions/points/synchronization - what to change, where, and when
MATE (V)
[Figure: Analyzer internals - an Event Collector thread receives events from the DMLibs and metadata from the ACs via TCP/IP into an Event Repository; tunlets attach to the application model through the DTAPI; a Controller with an AC Proxy sends instrumentation requests to the Monitor and tuning requests to the Tuner via TCP/IP]
Outline • Introduction • MATE • Number of workers • Data distribution • Conclusions
Number of Workers
• Master/Worker paradigm
  • The concept is easy to understand, but it has some bottlenecks
  • Example: an inadequate number of workers
    • Too few workers - the master sits idle waiting for results
    • Too many workers - communication overhead grows
[Figure: a master connected to a pool of workers]
Number of Workers (II)
• Execution trace of a homogeneous Master/Worker application (homogeneous in message size and in worker execution time)
[Figure: trace of the master and workers - serialized sends of tl + λ·vi to each worker, parallel computation tci, and results of size vm returned to the master in tl + λ·vm]
Where:
  tl = network latency
  λ = inverse bandwidth
  vi = size of the task sent to worker i, in bytes
  vm = size of the results sent back to the master
  tci = time that worker i spends processing a task
  n = current number of workers in the application
Number of Workers: Tunlet
• Measure points:
  • The amount of data sent to the workers and received by the master
  • The total computational time of the workers
  • The network overhead (latency) and bandwidth
[Figure: send/receive event timeline between Machine A (master) and Machine B (worker), with entry/exit events instrumented on each send and receive]
Number of Workers: Tunlet (II)
• Performance function [formula shown on the slide]
• Calculation of the optimal number of workers [formula shown on the slide]
• Tuning action:
  • Change the value of "numworkers" to add or remove as many workers as are needed
Experimentation
• Example application
  • Forest Fire Propagation simulator - Xfire
  • Compute-intensive Master/Worker application
  • Simulates the propagation of the fireline
  • Calculates the next position of the fireline considering the current fireline position, weather factors, vegetation, etc.
• Platform
  • Cluster of Pentium 4 machines, 1.8 GHz, SuSE Linux 8.0, connected by a 100 Mb/s network
Experimentation (II)
• Load in the system
  • We designed different external load patterns
  • They simulate the system's time sharing
  • They allow us to reproduce experiments
• Case studies
  • Xfire executed with different fixed numbers of workers, without any tuning, introducing external loads
  • Xfire executed under MATE, introducing external loads
Experimentation (III)
[Figure: execution time (s) for case studies with fixed numbers of workers from 1 to 26, versus Xfire+MATE, which starts with 1 worker and adapts the number dynamically]
• Note that:
  • The execution time of Xfire under MATE is close to the best execution times obtained
  • Using MATE, resources devoted to the application are used only when they are really needed
Experimentation (IV)
• Statically, the model fits
• Dynamically, there are some problems
  • Nopt could be extremely high
  • The computational power added or removed may not be significant compared to the previous computational power
• Solution
  • Find a "reasonable" number of workers that defines a trade-off between resource utilization and execution time
Outline • Introduction • MATE • Number of workers • Data distribution • Conclusions
Data Distribution
• Imbalance problem:
  • Heterogeneous computing and communication powers
  • Varying amount of distributed work
[Figure: master/worker traces of an unbalanced iteration versus a balanced iteration]
Data Distribution (II)
• Goal:
  • Minimize the idle time by balancing the work among the processes, considering the efficiency of the machines
• Performance model
  • Factoring scheduling method
  • Work is divided into different-size tuples according to the factor
Data Distribution: Tunlet
• Measure points:
  • The work-unit processing time
  • The latency and bandwidth
• Performance function:
  • Calculation of the factor
  • The Analyzer simulates the execution considering different factors and then selects the best one
  • We are currently working on an analytical model to determine the factor
• Tuning action:
  • Change the value of "TheFactorF"
Experimentation
• Example application
  • Forest Fire Propagation simulator - Xfire
• Platform
  • Cluster of Pentium 4 machines, 1.8 GHz, SuSE Linux 8.0, connected by a 100 Mb/s network
Experimentation (II)
• Load in the system
  • We designed different external load patterns
  • They simulate the system's time sharing
  • They allow us to reproduce experiments
• Case studies
  • Xfire executed without any tuning
  • Xfire, introducing controlled variable external loads
  • Xfire executed under MATE, introducing variable external loads
Experimentation (III)
[Figure: execution time (s) versus number of workers (1 to 30) for Xfire, Xfire+Load, and Xfire+Load+MATE]
• Note that:
  • Introducing an extra load increases the execution time
  • Executing under MATE corrects the factor value and improves the execution time
Outline • Introduction • MATE • Number of workers • Data distribution • Conclusions
Conclusions and open lines
• Conclusions
  • The prototype environment - MATE - automatically monitors, analyses and tunes running applications
  • Practical experiments conducted with MATE on parallel/distributed applications show that it automatically adapts application behavior to the conditions existing at run time
  • In particular, MATE is able to tune Master/Worker applications and overcome two possible bottlenecks: the number of workers and the data distribution
  • Dynamic tuning works, and is applicable, effective and useful under certain conditions
Conclusions and open lines
• Open lines
  • Determining the "reasonable" number of workers
  • Considering the interaction between different tunlets
  • Providing the system with other tuning techniques