Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications





Presentation Transcript


  1. Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications Piyush Shivam, Shivnath Babu, Jeffrey Chase Duke University

  2. Networked Computing Utility
     [Diagram: a task workflow is submitted to a task scheduler, which assigns tasks across Sites A, B, and C, each holding compute resources such as C1, C2, C3]
     • A network of clusters or grid sites
     • Each site is a pool of heterogeneous resources
     • Jobs are task workflows
     • Challenge: choose good resource assignments for the jobs

  3. Example: Assigning Resources to Run Tasks
     [Diagram: candidate plans P1, P2, P3 map the task onto resources C3 (Site A), C1 (Site C), and C2 (Site B); the input data resides on the home file server]
     • A workflow with a single task
     • Task input data at Site A
     • Execution plan ≡ Resource assignment

  4. Plan Selection Problem
     Task workflow → Plan Enumeration → Choose Best Plan
     • Cost: plan execution time
     • Challenge: need cost models to estimate plan execution time

  5. Generating Cost Models is Hard
     • Non-declarative
       • Scientific workflow tasks are usually scripts (Matlab, Perl), not database operators like join or select
       • Hence a task is a black box with no prior knowledge
     • Heterogeneous resources
       • Computational grid setting; performance varies widely across resource assignments
     • Data dependency
       • Performance can vary significantly with the properties of the input data and the parameters passed to scripts

  6. Problem Setting
     • Scientific workflows at the DSCR (Duke Shared Cluster Resource)
     • Important scientific workflows are run repeatedly
       • Opportunity to observe and learn task behavior
       • Better plan selection for subsequent runs
     • Sequential scientific workflows: each task runs on a single node
       • >90% of workflows at the DSCR are sequential

  7. NIMO System: NonInvasive Modeling for Optimization
     [Diagram: NIMO supplies cost models to the scheduler, which assigns tasks across Sites A, B, and C (resources C1, C2, C3)]
     NIMO learns cost models for task workflows:
     • End-to-end cost models: incorporate properties of tasks, resources, and data
     • Non-invasive: no changes to tasks
     • Automated and active: automatically collects training data for learning cost models

  8. NIMO Fills a Gap
     • Workflow Management Systems (WFMSs): use database technology for managing all aspects of scientific workflows [Liu '04, Shankar '05]
     • Batch scheduling systems: knowledge of plan execution time is assumed when optimizing resource assignments [Casanova '00, Phan '05, Kelly '03]
     NIMO generates cost models for these systems.

  9. Roadmap
     • Cost models
     • NIMO: active learning of cost models
     • Experimental evaluation
     • Related work
     • Conclusions
     • Future work

  10. Cost Model for a Task
      [Diagram: the task's cost model maps a resource assignment and input data to a predicted execution time]
      Total workflow execution time can be derived using the cost models for the individual tasks.

  11. Task Cost Model
      T = D × (Oa + Os) = D × (Oa + On + Od)
      • T: task execution time; D: total data processed
      • Oa (compute occupancy): the compute phase, when the compute resource is busy
      • Os (stall occupancy) = On (network occupancy) + Od (storage occupancy): the stall phase, when the compute resource is stalled on I/O
      • Occupancy: average time spent per unit of data
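
A minimal sketch of this model in Python; the formula is from the slide, but the occupancy numbers in the example call are made-up for illustration:

```python
# Minimal sketch of the task cost model above: T = D * (Oa + On + Od).
# The occupancy values in the example call are illustrative, not measured.

def predicted_execution_time(total_data, o_compute, o_network, o_storage):
    """Predicted execution time: total data times the summed occupancies
    (average time spent per unit of data in each phase)."""
    return total_data * (o_compute + o_network + o_storage)

# Example: 1e6 data units; 2 us compute, 1 us network, 0.5 us storage per unit.
print(predicted_execution_time(1e6, 2e-6, 1e-6, 0.5e-6))  # -> 3.5 seconds
```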

  12. Cost Model with Task, Resource, and Data Profiles
      T = D × (Oa + On + Od)
      [Diagram: the resource profile (from the resource assignment), the data profile (from the input data), and the task profile feed the cost model, which outputs the predicted execution time]

  13. Learning Cost Models
      Learning the cost model = Learning profiles + Learning predictors

  14. Statistical Learning of Predictors
      • Independent variables: the resource profile and data profile attributes
      • Dependent variables: the quantities each predictor estimates (the occupancies and data measures of the cost model)
      • Ex: learn each predictor as a regression model from the training data
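
A minimal sketch of the regression example, assuming scikit-learn and made-up profile attributes and occupancy measurements (the slide does not fix a library or a specific regression technique):

```python
# Sketch: learn one occupancy predictor (say, Oa) as a regression model
# over resource/data profile attributes. All names and values are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Training rows: profile attributes observed on past runs
# [cpu_ghz, mem_mb, input_mb].
X_train = np.array([[1.0, 512, 256],
                    [2.0, 1024, 256],
                    [1.5, 512, 512],
                    [2.5, 2048, 1024]])
# Targets: measured compute occupancy Oa (seconds per unit of data).
y_train = np.array([2.1e-6, 1.1e-6, 1.5e-6, 0.9e-6])

f_a = LinearRegression().fit(X_train, y_train)
print(f_a.predict(np.array([[2.0, 512, 512]])))  # Oa for a new assignment
```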

  15. Challenges in Learning
      • Cost of sample acquisition vs. coverage of the system operating range: the curse of dimensionality
      • Suppose 10 profile attributes × 10 values per attribute, and 5 minutes per task run (sample)
      • Then sampling even 1% of the space to build the cost model takes 951 years!
      [Plot: accuracy of the current best model vs. elapsed time; active & accelerated learning approaches the best accuracy possible far sooner than passive learning]
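
For reference, the 951-year figure follows directly from the slide's numbers:

$$
10 \text{ attributes} \times 10 \text{ values each} \;\Rightarrow\; 10^{10} \text{ configurations};\quad
1\% \text{ of the space} = 10^{8} \text{ runs};\quad
10^{8} \times 5\,\text{min} = 5 \times 10^{8}\,\text{min} \approx 951 \text{ years}.
$$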

  16. Active (and Accelerated) Learning
      Decide, for the resource and data profiles:
      • Which predictors are important?
      • Which profile attributes should each predictor have?
      • What values to consider for each profile attribute during training?

  17. NIMO System
      [Diagram: the NIMO workbench comprises a task profiler, a resource profiler, a data profiler, a WAN emulator (nistnet), a training-set database, and the active & accelerated learning module; it runs standard benchmarks and connects to the scheduler for Sites A, B, and C (resources C1, C2, C3)]

  18. Active Learning Algorithm (skeleton)
      Initialization
      While ( ) {
      }
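
The condition and body of this skeleton are developed on slides 19-22. As an end-to-end illustration, here is a self-contained toy in Python; the simulated task, the coverage-based choice of the next run, and all constants are assumptions, not NIMO's actual method:

```python
# Toy of the active-learning loop: seed with a few runs, then repeatedly
# pick a new assignment, run the task, and relearn the predictor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Plan space: [cpu_ghz, net_latency_ms, mem_gb] (cf. the 5 x 6 x 5 grid on slide 23).
plans = np.array([[c, l, m] for c in (1.0, 1.5, 2.0, 2.5, 3.0)
                            for l in (1, 5, 10, 25, 50, 100)
                            for m in (0.5, 1.0, 2.0, 4.0, 8.0)])

def run_task(plan):
    """Simulated task run: 'measured' execution time for one assignment."""
    cpu, lat, mem = plan
    return 120.0 / cpu + 1.5 * lat + 30.0 / mem + rng.normal(0.0, 2.0)

# Initialization: seed the training set with a few runs.
sampled = list(rng.choice(len(plans), size=4, replace=False))
times = [run_task(plans[i]) for i in sampled]

model = LinearRegression()
for _ in range(25):                          # while (sample budget remains)
    model.fit(plans[sampled], times)         # relearn predictors (slide 19)
    # Pick a new assignment: the plan farthest from any sampled plan, a
    # generic coverage heuristic standing in for the choices on slides 20-22.
    remaining = [i for i in range(len(plans)) if i not in sampled]
    gaps = [min(np.linalg.norm(plans[i] - plans[j]) for j in sampled)
            for i in remaining]
    nxt = remaining[int(np.argmax(gaps))]
    times.append(run_task(plans[nxt]))       # run task on chosen assignment
    sampled.append(nxt)

model.fit(plans[sampled], times)             # final relearn
truth = np.array([120.0 / c + 1.5 * l + 30.0 / m for c, l, m in plans])
errors = np.abs(model.predict(plans) - truth) / truth
print(f"mean absolute % error after {len(sampled)} runs: {100 * errors.mean():.1f}%")
```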

  19. Active Learning Algorithm: Relearn Predictors
      Initialization
      While ( ) {
        Pick a new assignment
        Run task on chosen assignment
        Relearn predictors
      }
      • Relearn the predictors with the new set of training samples
      • Compute the current prediction error of each predictor, using either a fixed test set or cross-validation
      [The slide annotates the loop with sample profile values, e.g., 1 GHz, 512 MB, 10 ms]
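
A minimal sketch of the cross-validation option for computing a predictor's current error; scikit-learn and the synthetic data are assumptions for illustration:

```python
# Sketch: estimate one predictor's current prediction error by k-fold
# cross-validation over the training samples gathered so far.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 3))                # profile attributes (illustrative)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.05, size=30)  # occupancies

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_absolute_error", cv=5)
print(f"cross-validated MAE: {-scores.mean():.3f}")
```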

  20. Active Learning Algorithm: Predictor Choice
      While ( ) {
        Choose a predictor to refine
        Choose attributes for the predictor
        Choose attribute values for the run
        Run task on chosen assignment
        Relearn predictors
      }
      • Predictors: fa, fn, fd, fD
      • Order the predictors, then traverse that order
        • Ex: relevance-based order (Plackett-Burman)
        • Ex: choose the predictor with the maximum current error
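
The second option reduces to an argmax over the current per-predictor errors; a tiny sketch with illustrative error values:

```python
# Sketch of dynamic predictor choice: refine whichever predictor currently
# shows the largest prediction error. The error values are illustrative.
current_errors = {"fa": 0.12, "fn": 0.31, "fd": 0.08, "fD": 0.19}
to_refine = max(current_errors, key=current_errors.get)
print(to_refine)  # -> fn
```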

  21. Active Learning Algorithm: Attribute Choice (same loop as slide 20; this step: choose attributes for the predictor)
      • Each predictor takes profile attributes as input
      • Not all attributes are equally relevant
      • Order the attributes, then traverse that order

  22. Active Learning Algorithm: Value Choice (same loop; this step: choose attribute values for the run)
      • Cover the operating range of each attribute
      • Expose main interactions with other attributes
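
One simple way to cover a range while exposing main interactions is a two-level (low/high) design over all attributes, the building block behind Plackett-Burman-style screening; the attribute names and ranges below are illustrative assumptions:

```python
# Sketch: enumerate low/high corner settings of each attribute so that runs
# exercise the extremes of the operating range and pair attribute levels
# against each other.
from itertools import product

ranges = {"cpu_ghz": (1.0, 3.0), "latency_ms": (1, 100), "mem_gb": (0.5, 8.0)}
for combo in product(*ranges.values()):      # 2^3 = 8 candidate runs
    print(dict(zip(ranges, combo)))
```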

  23. Experimental Results
      • Biomedical workflows (from the DSCR): BLAST, fMRI, NAMD, CardioWave
      • Single-task workflows
      • Plan space in the heterogeneous networked utility: 5 CPU speeds × 6 network latencies × 5 memory sizes = 150 resource plans
      • Goal: converge quickly to a fairly accurate cost model
      • We use regression models for the predictors
      • Model validation details appear in previous work (ICAC 2005)

  24. Performance Summary
      • Error: mean absolute % error in predicted execution time
      • A separate test set is used for evaluating the error

  25. BLAST Application: Predictor Choice

  26. BLAST Application: Attribute Choice

  27. Related Work
      • Workflow Management Systems (WFMSs) [Shankar '05, Liu '04, etc.]
      • Performance prediction in scientific applications [Carrington '05, Rosti '02, etc.]
      • Learning cost models using statistical techniques [Zhang '05, Zhu '96, etc.]
      • NIMO is end-to-end, noninvasive, and active (acquires model-learning data automatically)

  28. Conclusions
      NIMO:
      • Learns cost models for scientific workflows
      • Is noninvasive and end-to-end
      • Uses active and accelerated learning to learn accurate cost models quickly
      • Fills a gap in Workflow Management Systems

  29. Future Work
      • NIMO + SHIRAKO: a policy-based resource-leasing system that can slice-and-dice virtualized resources
      • NIMO + Fa: processing system-management queries (e.g., root-cause diagnosis, forecasting performance problems, capacity planning)
      [Diagram: NIMO and the scheduler over Sites A, B, and C (resources C1, C2, C3)]

  30. Backup Slides for Explanation

  31. See Paper for Details of Steps
      • Each algorithm step has sub-algorithms
      • Example: choosing the predictor to refine in the current step
        • Goal: learn the most relevant predictors first
        • Static vs. dynamic ordering
          • Static: define a total order a priori or via estimates of influence (Plackett-Burman); traverse the order round-robin or based on an improvement threshold
          • Dynamic: choose the predictor with the maximum current prediction error
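
A sketch of the improvement-threshold-based traversal of a static order; the refine step and its error trace are made-up stand-ins for an actual relearning step:

```python
# Sketch: traverse a static predictor order, moving to the next predictor
# once an extra refinement improves its error by less than a threshold.
errors = {"fD": [0.40, 0.20, 0.18, 0.17], "fa": [0.30, 0.12, 0.11]}

def refine(pred, step):
    """Fake relearning step: error of `pred` after `step` refinements."""
    trace = errors[pred]
    return trace[min(step, len(trace) - 1)]

THRESHOLD = 0.05
for pred in ("fD", "fa"):                    # static order: most relevant first
    prev, step = refine(pred, 0), 1
    while True:
        cur = refine(pred, step)
        if prev - cur < THRESHOLD:           # improvement below threshold
            break
        prev, step = cur, step + 1
    print(f"{pred}: stopped after {step} refinement steps")
```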

  32. Active and Accelerated Learning

  33. Latency hiding

  34. Saturation
