PanDA Status Report
Kaushik De, Univ. of Texas at Arlington
ANSE Meeting, Nashville, May 13, 2014
Overview
• We are nearing the end of the ANSE project (~6 months)
  • Review goals/scope of PanDA work in ANSE
  • Assess progress so far
    • PanDA work started ~1 year ago
  • Plans for completion of current work
  • Plans for new work
    • Discuss tomorrow
• Synergy with other projects
  • Artem is co-funded by DOE-ASCR BigPanDA project
  • BigPanDA continues for ~9 months after ANSE ends
  • What happens after 2015?
PanDA Goals
• Explicit integration of networking with PanDA
  • Never before attempted for any WMS
  • PanDA has many implicit assumptions about networking
  • Goal 1: Use network information directly in PanDA workflow
  • Goal 2: Attempt direct control (provisioning) through PanDA
• ANSE + DOE-ASCR
  • Picked a few well-defined topics
  • Set up infrastructure and interactions with other projects
  • Develop and deploy software
  • Evaluation metrics
• Deliver new capabilities for LHC experiments
  • This is not only R&D – use in production environment
PanDA Steps
• Collect network information
• Storage and access
• Using network information
• Using dynamic circuits
Sources of Network Information
• DDM Sonar measurements
  • Actual transfer rates for files between all sites (Tier 1 and Tier 2)
  • This information is normally used for site white/blacklisting
  • Measurements available for small, medium, and large files
• perfSonar (PS) measurements
  • perfSonar provides dedicated network monitoring data
  • All WLCG sites are being instrumented with PS boxes
  • US sites are already instrumented and monitored
• Federated XRootD (FAX) measurements
  • Read times of remote files are measured for pairs of sites
• This is not an exclusive list – just a starting point
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false
DDM Sonar
perfSonar
FAX
Data Repositories
• Three levels of data storage and access
• Native data repositories
  • Historical data stored from collectors
  • SSB – site status board for sonar and perfSonar data
  • FAX data is kept independently and uploaded
• AGIS (ATLAS Grid Information System)
  • Most recent / processed data only – updated periodically
  • Mixture of push/pull – moving to JSON API (pushed only)
• schedConfigDB
  • Internal Oracle DB used by PanDA for fast access
  • Uses standard ATLAS collector
Using Network Information
• Pick a few use cases
  • Important to PanDA users
  • Enhance workload management through use of network
  • Should provide clear metrics for success/failure
• Case 1: Improve User Analysis workflow
• Case 2: Improve Tier 1 to Tier 2 workflow
Improving User Analysis
• In PanDA, user jobs go to data
  • Typically, user jobs are IO intensive – hence constrain jobs to data
  • Note: almost any user payload is allowed by PanDA
  • User analysis jobs are routed automatically to T1/T2 sites
• For popular data, bottlenecks develop
  • If data is only at a few sites, user jobs have long wait times
  • PD2P was implemented 3 years ago to solve this problem
    • Additional copies are made asynchronously by PanDA
    • Waiting jobs are automatically re-brokered to new sites
  • But bottlenecks still take time to clear up
• Can we do something else using network information?
  • Why not use FAX?
  • First we need to develop network metrics for efficient use of FAX
Faster User Analysis through FAX
• First use case for network integration with PanDA
• PanDA brokerage will use concept of ‘nearby’ sites
  • Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)
  • Add network transfer cost to brokerage weight
• Jobs will be sent to the site with best weight – not necessarily the site with local data
  • If nearby site has less wait time, access the data through FAX
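The weighting idea on this slide can be sketched as follows. This is a minimal illustration, not PanDA's actual brokerage code: the `Site` fields, the penalty formula in `brokerage_weight`, and all site names and numbers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    base_weight: float    # hypothetical score from the usual criteria (CPU, release, pilot rate)
    has_local_data: bool
    network_cost: float   # hypothetical cost of reading the input remotely; higher is worse


def brokerage_weight(site: Site) -> float:
    """Combine the usual weight with a network penalty (assumed form)."""
    if site.has_local_data:
        return site.base_weight          # local read: no network penalty
    return site.base_weight / (1.0 + site.network_cost)


def choose_site(sites):
    """Pick the site with the best combined weight."""
    return max(sites, key=brokerage_weight)


sites = [
    Site("T1_congested", base_weight=1.0, has_local_data=True, network_cost=0.0),
    Site("T2_nearby", base_weight=6.0, has_local_data=False, network_cost=1.0),
    Site("T2_far", base_weight=8.0, has_local_data=False, network_cost=9.0),
]
best = choose_site(sites)
print(best.name, "local read" if best.has_local_data else "remote read (e.g. via FAX)")
```

In this toy data the nearby site wins even though it holds no local copy, because its low network cost outweighs the congestion at the site that has the data.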
First Tests
• Tested in production for ~1 day in March 2014
  • Useful for debugging and tuning direct access infrastructure
  • We got first results on network-aware brokerage
• Job distribution
  • 4748 jobs from 20 user tasks that required data from a congested U.S. Tier 1 site were automatically brokered to U.S. Tier 1/2 sites
Brokerage Results
Conclusions for Case 1
• Network data collection working well
  • Additional algorithms to combine network data will be tried
  • HC tests working well – but PS data not robust yet
• PanDA brokerage worked well
  • Achieved goal of reducing wait time
  • Well-balanced local vs remote access
  • Will fine-tune after more data on performance
• Waiting for final implementation
  • But we have no data on actual performance of successful jobs
  • Need to test and validate sites for this mode of data access
  • First tests in March had 100% failure rate (FAX deployment related)
  • Second test 1 week ago also did not go well
  • Expect third test soon
Managing Data Rates
• Tests have shown direct access rates need to be managed
• Parameters for WAN throttling implemented in PanDA
• Throttling at brokerage level is easy (e.g. ratio of FAX jobs to non-FAX jobs), but does not guarantee throttling during execution
• Throttling during dispatch is not scalable when a million jobs are dispatched daily (scale may be higher in the future)
• Throttling may also be done at pilot level
• PanDA has implemented a mixed approach to throttling, being tested now
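Brokerage-level throttling could look roughly like the sketch below. It is an assumption-laden illustration, not PanDA's implementation: the `FaxRatioThrottle` class is hypothetical, and it caps remote jobs as a fraction of all brokered jobs rather than the FAX/non-FAX ratio mentioned on the slide, which is an equivalent but simpler bookkeeping choice.

```python
class FaxRatioThrottle:
    """Cap the fraction of brokered jobs that may read input remotely (hypothetical helper)."""

    def __init__(self, max_remote_fraction: float):
        self.max_remote_fraction = max_remote_fraction
        self.remote_jobs = 0
        self.total_jobs = 0

    def allow_remote(self) -> bool:
        """True if brokering one more remote job keeps the fraction within the cap."""
        return (self.remote_jobs + 1) <= self.max_remote_fraction * (self.total_jobs + 1)

    def record(self, remote: bool) -> None:
        """Count a brokered job, remote or local."""
        self.total_jobs += 1
        if remote:
            self.remote_jobs += 1


throttle = FaxRatioThrottle(max_remote_fraction=0.25)
decisions = []
for _ in range(8):
    # Every job here would prefer remote access; the throttle decides.
    remote = throttle.allow_remote()
    throttle.record(remote)
    decisions.append(remote)

print(throttle.remote_jobs, "of", throttle.total_jobs, "jobs brokered for remote access")
```

As the slide notes, a cap applied at brokerage time only limits how many remote jobs are assigned; it says nothing about how many run at the same moment, which is why dispatch-level or pilot-level controls may still be needed.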
Cloud Selection
• Second use case for network integration with PanDA
• Optimize choice of T1-T2 pairings (cloud selection)
  • In ATLAS, production tasks are assigned to Tier 1s
  • Tier 2s are attached to a Tier 1 cloud for data processing
  • Any T2 may be attached to multiple T1s
  • Currently, the operations team makes this assignment manually
  • This could/should be automated using network information
  • For example, each T2 could be assigned to a native cloud by the operations team, and PanDA would assign it to other clouds based on network performance metrics
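The example in the last bullet can be sketched as below. This is only an illustration under stated assumptions: `assign_clouds`, the rate threshold, the limit on extra clouds, and all site names and transfer rates are invented for the example and do not describe PanDA's actual algorithm.

```python
def assign_clouds(native_cloud, rates_mbps, min_rate_mbps, max_extra=2):
    """Return {T2: [clouds]}: the native cloud first, then the best-connected other T1s.

    native_cloud: {T2 name: T1 name} set manually by operations.
    rates_mbps:   {(T1 name, T2 name): measured transfer rate} (hypothetical metric).
    """
    assignment = {}
    for t2, native in native_cloud.items():
        candidates = [
            (rate, t1)
            for (t1, site), rate in rates_mbps.items()
            if site == t2 and t1 != native and rate >= min_rate_mbps
        ]
        candidates.sort(reverse=True)  # fastest links first
        assignment[t2] = [native] + [t1 for _, t1 in candidates[:max_extra]]
    return assignment


# Made-up sites and rates, for illustration only.
native_cloud = {"T2_X": "T1_A", "T2_Y": "T1_B"}
rates_mbps = {
    ("T1_A", "T2_X"): 50.0, ("T1_B", "T2_X"): 30.0, ("T1_C", "T2_X"): 5.0,
    ("T1_A", "T2_Y"): 8.0, ("T1_B", "T2_Y"): 40.0, ("T1_C", "T2_Y"): 20.0,
}
clouds = assign_clouds(native_cloud, rates_mbps, min_rate_mbps=10.0)
for t2, t1_list in clouds.items():
    print(t2, "->", t1_list)
```

Keeping the native cloud first preserves the manual assignment as a fallback, so a gap in the network measurements never leaves a T2 without a cloud.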
DDM Sonar Data
http://aipanda021.cern.ch/networking/t1tot2d_matrix/
Tier 1 View
More T1 Information
Tier 2 View
Improving Site Association
More T2 Information
Conclusion for Case 2
• Working well in real time
• Currently implementing archival information
  • Keep data for last ‘n’ Tier 1 – Tier 2 associations
  • Necessary to check robustness of approach
  • Algorithm may use the historical information in the future
• Expect to deploy this summer
  • Hopefully ~1 month
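Keeping the last ‘n’ associations per Tier 2 could be sketched with a bounded queue, as below. The `AssociationHistory` class and its stability check are hypothetical; the slides do not say how the archive is stored or how robustness is judged.

```python
from collections import defaultdict, deque


class AssociationHistory:
    """Retain the last n T1 association lists for each T2 (hypothetical helper)."""

    def __init__(self, n: int):
        # deque(maxlen=n) silently drops the oldest entry once n are stored.
        self._history = defaultdict(lambda: deque(maxlen=n))

    def record(self, t2: str, t1_clouds: list) -> None:
        self._history[t2].append(tuple(t1_clouds))

    def last(self, t2: str) -> list:
        return list(self._history[t2])

    def is_stable(self, t2: str) -> bool:
        """True if every retained association for this T2 is identical."""
        return len(set(self._history[t2])) <= 1


history = AssociationHistory(n=3)
history.record("T2_X", ["T1_A"])
history.record("T2_X", ["T1_A", "T1_B"])
history.record("T2_X", ["T1_A", "T1_B"])
stable_before = history.is_stable("T2_X")   # oldest entry still differs
history.record("T2_X", ["T1_A", "T1_B"])    # pushes the oldest entry out
stable_after = history.is_stable("T2_X")
print(stable_before, stable_after, history.last("T2_X"))
```

A check like `is_stable` is one simple way to see whether the automatic assignment keeps flipping between clouds, which is the kind of robustness question the archive is meant to answer.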
Summary
• First 2 use cases for network integration with PanDA working well
• Work will be completed this summer
• Metrics showing usefulness of approach will be available in Fall
• On track for timely final report to ANSE