Instrumenting Folding@Work Badi Abdul-Wahid, RJ Nowling CSE 60641 Operating Systems Professor Striegel
Overview • Problem Description • Experimental Structure • Folding@Work Workflow • Benchmarks • Results • Weak Scaling (ns / day) • Server Capacity • Available Workers Over Time • Variability of Computation Time • Conclusions
Benchmarks • Tasks: 1 ns generations (approx. 2 hr each on the test machine) • 10 consecutive generations per simulation • Weak Scaling • 10 simulations / 10 workers • 100 simulations / 100 workers • 1,000 simulations / 1,000 workers • Condor; SGE jobs added later • 1 trial of each; took ~2 days to run
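The ns / day metric behind the weak-scaling benchmark can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, and the defaults assume that one task yields 1 ns in roughly 2 hours, as quoted for the test machine.

```python
def ideal_ns_per_day(workers, ns_per_task=1.0, hours_per_task=2.0):
    """Aggregate ns/day if every worker runs tasks back to back with no overhead.

    The defaults (1 ns per task, ~2 hr per task) are taken from the benchmark
    slide; real workers will differ.
    """
    tasks_per_day = 24.0 / hours_per_task
    return workers * ns_per_task * tasks_per_day


def weak_scaling_efficiency(measured_ns_per_day, workers, **kwargs):
    """Fraction of the ideal aggregate throughput that was actually achieved."""
    return measured_ns_per_day / ideal_ns_per_day(workers, **kwargs)


if __name__ == "__main__":
    # Ideal throughput for the three benchmark sizes on the slide.
    for n in (10, 100, 1000):
        print(n, "workers -> ideal", ideal_ns_per_day(n), "ns/day")
```

Under weak scaling the problem size grows with the worker count, so an efficiency near 1.0 at every size would indicate that the master keeps up as workers are added.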
Identified Areas of Improvement • Availability of Resources • Benchmarks limited by the number of sustained workers available through Condor • New feature: WorkQueue Worker Pool can be used to start new workers • WorkQueue Limits Number of Workers • Increasing the number of file descriptors allowed up to 2,500 workers to connect • Bad behavior occurring in calls to select() • Working with WorkQueue developers to switch to poll() • Long-Running Work Units Delay Completion of Trajectories • Some work units are not returned or take a very long time • Prevents trajectories from finishing • Use the fast abort feature to re-assign work units that take longer than a specified time
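For context on the select() to poll() switch: select() tracks descriptors in a fixed-size set (FD_SETSIZE, commonly 1024), while poll() takes a list of descriptors and has no such fixed cap. Whether this is the exact cause of the behavior seen here is an assumption. The sketch below illustrates poll()-based readiness checking in Python; it is not WorkQueue's own code, the helper name `wait_readable` is hypothetical, and `select.poll` is available on Unix-like systems only.

```python
import select
import socket


def wait_readable(socks, timeout_ms=1000):
    """Return the sockets in `socks` that have data to read, using poll()."""
    poller = select.poll()
    by_fd = {}
    for s in socks:
        by_fd[s.fileno()] = s
        poller.register(s, select.POLLIN)
    # poll() returns (fd, event mask) pairs for descriptors that are ready.
    return [by_fd[fd] for fd, event in poller.poll(timeout_ms)
            if event & select.POLLIN]


if __name__ == "__main__":
    # A connected socket pair stands in for a master/worker connection.
    master_side, worker_side = socket.socketpair()
    worker_side.sendall(b"result")
    ready = wait_readable([master_side])
    print("master side readable:", ready == [master_side])
    master_side.close()
    worker_side.close()
```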
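The idea behind re-assigning long-running work units can be sketched in a few lines. This is an illustration of the general technique under a simple fixed time limit, not WorkQueue's fast abort implementation or API; the names `find_overdue` and `reassign_overdue` and the data layout are hypothetical.

```python
def find_overdue(running, now, limit_s):
    """Return IDs of work units that have been running longer than limit_s.

    `running` maps a work-unit ID to the time (in seconds) it was dispatched.
    """
    return [wu for wu, started in running.items() if now - started > limit_s]


def reassign_overdue(running, queue, now, limit_s):
    """Move overdue work units from `running` back onto the pending `queue`."""
    overdue = find_overdue(running, now, limit_s)
    for wu in overdue:
        del running[wu]      # stop waiting on the slow or lost worker
        queue.append(wu)     # make the unit available to another worker
    return overdue


if __name__ == "__main__":
    running = {"traj1-gen3": 0.0, "traj2-gen5": 90.0}
    queue = []
    print(reassign_overdue(running, queue, now=100.0, limit_s=50.0))
```

A fixed limit is the simplest policy; a limit derived from observed task run times would adapt better when workers differ widely in speed.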
Conclusion • Accomplished • Identified key metrics (ns / day, wait time) • Developed scaling model • Tested model • Conclusions • Real scientific applications scale well • Forcing short work units adds load to the Master • Performance model validated • “Self-correcting” behavior