
Instrumenting Folding@Work

Badi Abdul-Wahid, RJ Nowling. CSE 60641 Operating Systems, Professor Striegel.


Presentation Transcript


  1. Instrumenting Folding@Work. Badi Abdul-Wahid, RJ Nowling. CSE 60641 Operating Systems, Professor Striegel

  2. Overview
     • Problem Description
     • Experimental Structure
     • Folding@Work Workflow
     • Benchmarks
     • Results
       • Weak Scaling (ns / day)
       • Server Capacity
       • Available Workers Over Time
       • Variability of Computation Time
     • Conclusions

  3. Experimental Structure

  4. Folding@Work Workflow

  5. Benchmarks
     • Tasks: 1 ns generations (approx. 2 hr on test machine)
     • 10 consecutive generations per simulation
     • Weak Scaling
       • 10 simulations on 10 workers
       • 100 simulations on 100 workers
       • 1,000 simulations on 1,000 workers
     • Workers started through Condor; SGE jobs added later
     • 1 trial of each; took ~2 days to run
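The benchmark arithmetic can be sketched as follows. This is a minimal illustration, assuming the approximate 2 hr per 1 ns task from the slide and an idealized setting with no transfer or wait time; the function names are hypothetical and are not part of Folding@Work.

```python
# Values taken from the benchmark slide; the 2 hr figure is approximate.
NS_PER_TASK = 1.0       # each task simulates one 1 ns generation
HOURS_PER_TASK = 2.0    # approximate run time on the test machine
GENERATIONS = 10        # consecutive generations per simulation


def ideal_ns_per_day(workers: int) -> float:
    """Aggregate ns/day if every worker runs tasks back to back (no overhead)."""
    tasks_per_worker_per_day = 24.0 / HOURS_PER_TASK
    return workers * tasks_per_worker_per_day * NS_PER_TASK


def ideal_makespan_hours(simulations: int, workers: int) -> float:
    """Lower bound on wall-clock time; generations within a simulation are serial."""
    waves = -(-simulations // workers)  # ceiling division
    return waves * GENERATIONS * HOURS_PER_TASK


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, ideal_ns_per_day(n), ideal_makespan_hours(n, n))
```

Under weak scaling the ideal makespan stays constant as workers and simulations grow together, while the ideal aggregate ns/day grows linearly with the worker count.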

  6. Weak Scaling of F@W

  7. Server Capacity (Wait Time)

  8. Available Workers over Time

  9. Transfer Times

  10. Variability of Computation Time

  11. Example Execution Timeline

  12. Performance Model

  13. Weak Scaling (updated)

  14. Wait Times

  15. Tasks Waiting

  16. Identified Areas of Improvement
     • Availability of Resources
       • Benchmarks limited by the number of sustained workers available through Condor
       • New feature: WorkQueue Worker Pool can be used to start new workers
     • WorkQueue Limits Number of Workers
       • Increasing the number of file descriptors allowed up to 2,500 workers to connect
       • Bad behavior occurring in calls to select()
       • Working with WorkQueue developers to switch to poll()
     • Long-Running Work Units Delay Completion of Trajectories
       • Some work units not returned or taking a very long time
       • Prevents trajectories from finishing
       • Use fast abort feature to re-assign work units that take longer than a specified time
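The fast abort idea in the last item can be sketched as below. This is only an illustration of the policy, assuming a fixed time limit; `find_overdue`, the work unit ids, and the 6 hr limit are hypothetical and do not show the WorkQueue API.

```python
def find_overdue(running, now, max_seconds):
    """Return ids of work units that have run longer than max_seconds.

    running maps a work unit id to its start time in seconds.
    The caller would cancel these units and re-assign them to other workers.
    """
    return sorted(
        unit_id
        for unit_id, started in running.items()
        if now - started > max_seconds
    )


if __name__ == "__main__":
    # Hypothetical limit: three times the approx. 2 hr task time.
    limit = 3 * 2 * 3600
    running = {"wu-1": 0.0, "wu-2": 20000.0, "wu-3": 29000.0}
    print(find_overdue(running, now=30000.0, max_seconds=limit))
```

Here only `wu-1` exceeds the limit, so it alone would be re-assigned; the other two work units keep running.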

  17. Conclusion
     • Accomplished
       • Identified key metrics (ns / day, wait time)
       • Developed scaling model
       • Tested model
     • Conclusions
       • Real scientific applications scale well
       • Forcing short work units adds load to Master
       • Performance model validated
       • "Self-correcting" behavior
