
PanDA Status Report

PanDA Status Report. Kaushik De, Univ. of Texas at Arlington. ANSE Meeting, Nashville, May 13, 2014. Overview: We are nearing the end of the ANSE project (~6 months left). Review goals/scope of PanDA work in ANSE. Assess progress so far. PanDA work started ~1 year ago. Plans for completion of current work.



Presentation Transcript


  1. PanDA Status Report
     • Kaushik De, Univ. of Texas at Arlington
     • ANSE Meeting, Nashville, May 13, 2014

  2. Overview
     • We are nearing the end of the ANSE project (~6 months left)
     • Review goals/scope of PanDA work in ANSE
     • Assess progress so far
     • PanDA work started ~1 year ago
     • Plans for completion of current work
     • Plans for new work
     • Discuss tomorrow
     • Synergy with other projects
     • Artem is co-funded by DOE-ASCR BigPanDA project
     • BigPanDA continues for ~9 months after ANSE ends
     • What happens after 2015?

  3. PanDA Goals
     • Explicit integration of networking with PanDA
     • Never before attempted for any WMS
     • PanDA has many implicit assumptions about networking
     • Goal 1: Use network information directly in PanDA workflow
     • Goal 2: Attempt direct control (provisioning) through PanDA
     • ANSE + DOE-ASCR
     • Picked a few well-defined topics
     • Set up infrastructure and interactions with other projects
     • Develop and deploy software
     • Evaluation metrics
     • Deliver new capabilities for LHC experiments
     • This is not only R&D: use in production environment

  4. PanDA Steps
     • Collect network information
     • Storage and access
     • Using network information
     • Using dynamic circuits

  5. Sources of Network Information
     • DDM Sonar measurements
     • Actual transfer rates for files between all sites (Tier 1 and Tier 2)
     • This information is normally used for site white/blacklisting
     • Measurements available for small, medium, and large files
     • perfSonar (PS) measurements
     • perfSonar provides dedicated network monitoring data
     • All WLCG sites are being instrumented with PS boxes
     • US sites are already instrumented and monitored
     • Federated XRootD (FAX) measurements
     • Read times of remote files are measured for pairs of sites
     • This is not an exclusive list, just a starting point
     http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false

  6. DDM Sonar

  7. perfSonar

  8. FAX

  9. Kaushik De

  10. Data Repositories
     • Three levels of data storage and access
     • Native data repositories
     • Historical data stored from collectors
     • SSB: site status board for sonar and perfSonar data
     • FAX data is kept independently and uploaded
     • AGIS (ATLAS Grid Information System)
     • Most recent / processed data only, updated periodically
     • Mixture of push/pull, moving to JSON API (pushed only)
     • schedConfigDB
     • Internal Oracle DB used by PanDA for fast access
     • Uses standard ATLAS collector

  11. Kaushik De

  12. Using Network Information
     • Pick a few use cases
     • Important to PanDA users
     • Enhance workload management through use of network
     • Should provide clear metrics for success/failure
     • Case 1: Improve User Analysis workflow
     • Case 2: Improve Tier 1 to Tier 2 workflow

  13. Improving User Analysis
     • In PanDA, user jobs go to data
     • Typically, user jobs are IO intensive, hence jobs are constrained to data
     • Note: almost any user payload is allowed by PanDA
     • User analysis jobs are routed automatically to T1/T2 sites
     • For popular data, bottlenecks develop
     • If data is only at a few sites, user jobs have long wait times
     • PD2P was implemented 3 years ago to solve this problem
     • Additional copies are made asynchronously by PanDA
     • Waiting jobs are automatically re-brokered to new sites
     • But bottlenecks still take time to clear up
     • Can we do something else using network information?
     • Why not use FAX?
     • First we need to develop network metrics for efficient use of FAX

  14. Faster User Analysis through FAX
     • First use case for network integration with PanDA
     • PanDA brokerage will use the concept of ‘nearby’ sites
     • Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)
     • Add network transfer cost to brokerage weight
     • Jobs will be sent to the site with the best weight, not necessarily the site with local data
     • If a nearby site has a shorter wait time, access the data through FAX
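
The brokerage idea on this slide can be illustrated with a minimal Python sketch. It is not PanDA's actual brokerage code: the `Site` fields, the `cost_penalty` parameter, and the way the network cost is folded into the weight (dividing the usual weight by one plus a scaled cost) are all assumptions made for illustration. The slide only says that a network transfer cost is added to the usual brokerage weight.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    base_weight: float     # hypothetical score from the usual criteria (CPU, release, pilot rate)
    has_local_data: bool   # True if the input data is already at this site
    transfer_cost: float   # hypothetical cost of reading the data remotely (e.g. via FAX)


def network_aware_weight(site: Site, cost_penalty: float = 1.0) -> float:
    """Usual weight for local data; weight reduced by network cost for remote access."""
    if site.has_local_data:
        return site.base_weight
    # Assumed combination rule: higher transfer cost lowers the weight.
    return site.base_weight / (1.0 + cost_penalty * site.transfer_cost)


def choose_site(sites):
    """Pick the site with the best (highest) network-aware weight."""
    return max(sites, key=network_aware_weight)


sites = [
    Site("T1_LOCAL", base_weight=2.0, has_local_data=True, transfer_cost=0.0),    # congested, holds data
    Site("T2_NEARBY", base_weight=10.0, has_local_data=False, transfer_cost=0.5),
    Site("T2_FAR", base_weight=10.0, has_local_data=False, transfer_cost=9.0),
]
best = choose_site(sites)
print(best.name)
```

In this toy setup the congested site holding the data loses to a well-connected idle site, while an equally idle but poorly connected site is ranked last, which is the behaviour the ‘nearby’ site concept is meant to produce.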

  15. First Tests
     • Tested in production for ~1 day in March 2014
     • Useful for debugging and tuning direct access infrastructure
     • We got first results on network-aware brokerage
     • Job distribution
     • 4748 jobs from 20 user tasks that required data from a congested U.S. Tier 1 site were automatically brokered to U.S. Tier 1/2 sites

  16. Brokerage Results

  17. Conclusions for Case 1
     • Network data collection working well
     • Additional algorithms to combine network data will be tried
     • HC tests working well, but PS data not robust yet
     • PanDA brokerage worked well
     • Achieved goal of reducing wait time
     • Well balanced local vs remote access
     • Will fine tune after more data on performance
     • Waiting for final implementation
     • But we have no data on actual performance of successful jobs
     • Need to test and validate sites for this mode of data access
     • First tests in March had 100% failure rate (FAX deployment related)
     • Second test 1 week ago also did not go well
     • Expect third test soon

  18. Managing Data Rates
     • Tests have shown direct access rates need to be managed
     • Parameters for WAN throttling implemented in PanDA
     • Throttling at brokerage level is easy (e.g., ratio of FAX jobs to non-FAX jobs), but does not guarantee throttling during execution
     • Throttling during dispatch is not scalable when a million jobs are dispatched daily (scale may be higher in the future)
     • Throttling may also be done at pilot level
     • PanDA has implemented a mixed approach to throttling, being tested now
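
As a rough illustration of the brokerage-level option only, the following Python sketch caps the fraction of brokered jobs that may read data over the WAN. The class name, the `max_fax_ratio` parameter, and the counting scheme are hypothetical and do not describe PanDA's actual throttling parameters or its mixed approach.

```python
class FaxBrokerageThrottle:
    """Caps the fraction of brokered jobs that may use remote (FAX) access.

    Hypothetical helper: it counts jobs at brokerage time only, so it says
    nothing about how many FAX jobs are actually running at any moment.
    """

    def __init__(self, max_fax_ratio: float):
        self.max_fax_ratio = max_fax_ratio
        self.fax_jobs = 0
        self.total_jobs = 0

    def allow_fax(self) -> bool:
        # Would brokering one more FAX job keep the FAX share within the cap?
        return (self.fax_jobs + 1) / (self.total_jobs + 1) <= self.max_fax_ratio

    def record(self, used_fax: bool) -> None:
        # Update the counters after a brokerage decision has been made.
        self.total_jobs += 1
        if used_fax:
            self.fax_jobs += 1


throttle = FaxBrokerageThrottle(max_fax_ratio=0.25)
decisions = []
for _ in range(8):
    use_fax = throttle.allow_fax()
    throttle.record(use_fax)
    decisions.append(use_fax)

print(throttle.fax_jobs, "of", throttle.total_jobs, "jobs brokered with FAX access")
```

The sketch also shows the limitation noted on the slide: the cap applies to brokerage decisions, so jobs that start together can still exceed the intended WAN load during execution.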

  19. Cloud Selection
     • Second use case for network integration with PanDA
     • Optimize choice of T1-T2 pairings (cloud selection)
     • In ATLAS, production tasks are assigned to Tier 1s
     • Tier 2s are attached to a Tier 1 cloud for data processing
     • Any T2 may be attached to multiple T1s
     • Currently, the operations team makes this assignment manually
     • This could/should be automated using network information
     • For example, each T2 could be assigned to a native cloud by the operations team, and PanDA would assign it to other clouds based on network performance metrics
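
The example in the last bullet could look roughly like the Python sketch below. Everything concrete in it is an assumption for illustration: the site names, the rate table keyed by (T1, T2) pairs, the `min_rate_mbps` threshold, and the `max_extra` limit are invented, and the real selection criteria are not given in the slides.

```python
def assign_clouds(native_cloud, rates_mbps, min_rate_mbps, max_extra=2):
    """Return {T2: [T1 clouds]}: the native cloud first, then the best-connected other T1s.

    native_cloud:  {T2 name: T1 name}, assumed to be set manually by operations
    rates_mbps:    {(T1 name, T2 name): measured transfer rate}, hypothetical units
    min_rate_mbps: T1s with a rate below this threshold are not attached
    max_extra:     maximum number of additional clouds per T2
    """
    assignments = {}
    for t2, native in native_cloud.items():
        candidates = [
            (rate, t1)
            for (t1, site), rate in rates_mbps.items()
            if site == t2 and t1 != native and rate >= min_rate_mbps
        ]
        candidates.sort(reverse=True)  # fastest links first
        assignments[t2] = [native] + [t1 for _, t1 in candidates[:max_extra]]
    return assignments


native_cloud = {"T2_A": "T1_US", "T2_B": "T1_DE"}
rates_mbps = {
    ("T1_US", "T2_A"): 80.0,
    ("T1_DE", "T2_A"): 45.0,
    ("T1_FR", "T2_A"): 5.0,
    ("T1_US", "T2_B"): 30.0,
    ("T1_DE", "T2_B"): 60.0,
    ("T1_FR", "T2_B"): 50.0,
}
result = assign_clouds(native_cloud, rates_mbps, min_rate_mbps=20.0)
print(result)
```

Keeping the native cloud fixed and only adding extra clouds from measurements mirrors the split on the slide between the manual assignment and the automated part.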

  20. DDM Sonar Data
     http://aipanda021.cern.ch/networking/t1tot2d_matrix/

  21. Tier 1 View

  22. More T1 Information

  23. Tier 2 View

  24. Improving Site Association

  25. More T2 Information

  26. Conclusion for Case 2
     • Working well in real time
     • Currently implementing archival information
     • Keep data for last ‘n’ Tier 1 to Tier 2 associations
     • Necessary to check robustness of approach
     • Algorithm may use the historical information in the future
     • Expect to deploy this summer
     • Hopefully ~1 month
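
One simple way to keep the last ‘n’ associations per Tier 2 is a bounded queue, sketched below in Python. The class and method names are hypothetical, and the `is_stable` check (all of the last n associations identical) is only one assumed way to look at robustness. The slides do not say how the archive is stored or evaluated.

```python
from collections import deque


class AssociationHistory:
    """Keeps the last n Tier 1 association lists recorded for each Tier 2 (hypothetical helper)."""

    def __init__(self, n: int):
        self._n = n
        self._history = {}

    def record(self, t2: str, t1_list) -> None:
        # deque(maxlen=n) silently drops the oldest entry once n entries are stored.
        self._history.setdefault(t2, deque(maxlen=self._n)).append(tuple(t1_list))

    def recent(self, t2: str):
        """Return the stored associations for t2, oldest first."""
        return list(self._history.get(t2, ()))

    def is_stable(self, t2: str) -> bool:
        # Assumed robustness check: n entries are stored and all of them agree.
        entries = self.recent(t2)
        return len(entries) == self._n and len(set(entries)) == 1


history = AssociationHistory(n=3)
history.record("T2_A", ["T1_US", "T1_DE"])
history.record("T2_A", ["T1_US"])
stable_early = history.is_stable("T2_A")
for _ in range(3):
    history.record("T2_A", ["T1_US", "T1_DE"])
stable_late = history.is_stable("T2_A")
print(stable_early, stable_late, history.recent("T2_A"))
```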

  27. Summary
     • First 2 use cases for network integration with PanDA working well
     • Work will be completed this summer
     • Metrics showing usefulness of approach will be available in Fall
     • On track for timely final report to ANSE
