Status of the EGI O-E-12 Task: Coordination of Network Support for EGI. Mario Reale IGI / GARR mario.reale@garr.it. Contents. O-E-12 definitions and goals O-E-12 status Wrap up of the migration (final phase of EGEE III) Current task tools
Status of the EGI O-E-12 Task: Coordination of Network Support for EGI Mario Reale IGI / GARR mario.reale@garr.it
Contents • O-E-12 definitions and goals • O-E-12 status • Wrap up of the migration (final phase of EGEE III) • Current task tools • Overview of networking support within individual NGIs • Summary of the EGEE III questionnaire for NGIs for Network Support • Next steps and challenges ahead
O-E-12 Definition and Goals • O-E-12 is the coordination of the network support for EGI • Its goal is to provide network support to EGI by • proposing useful synergies and promoting cooperation among EGI.eu, the national NGI efforts and the NREN community • encouraging the definition and adoption of best practices • proposing common solutions and tools • liaising with the NREN community and GEANT (DANTE) • Provisioned through the EGI-Inspire tasks TSA1.7 (Support Teams) and TSA1.4 (Grid Management Infrastructure) • Provided with a manpower of 0.5 FTE within the EGI-Inspire project, and an additional contribution from IGI • Collaboration from NGIs and NRENs will be fundamental
Summary of the original workplan • Perform an initial assessment of the adopted model for network support within each NGI • Follow up further on the development of pS-Lite_TSS for on-demand troubleshooting and grid-specific tests on the network • Support its deployment on the EGI/NGIs infrastructure • Possibly exploit further monitoring tools • Define, jointly with the user community, a subset of the Grid sites belonging to the EGI global infrastructure to be periodically monitored • Excluding a priori a full mesh spanning all sites • Put in place a workflow for the exchange of information about network faults and scheduled downtimes • Organize the structure of a global PERT support for EGI
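The reason for excluding a full mesh a priori can be illustrated with a little arithmetic: the number of site-to-site paths grows quadratically with the number of sites. The following is a minimal sketch; the function name and the site counts are illustrative assumptions, not figures from the EGI infrastructure.

```python
def full_mesh_paths(n_sites: int, directed: bool = True) -> int:
    """Number of site-to-site paths in a full mesh of n_sites.

    directed=True counts A->B and B->A separately, which matters for
    measurements such as bandwidth that can differ per direction.
    """
    pairs = n_sites * (n_sites - 1)
    return pairs if directed else pairs // 2


# Illustrative site counts only (assumed, not taken from the slides).
for n in (8, 50, 300):
    print(n, "sites ->", full_mesh_paths(n), "directed paths")
```

Under these assumptions, a periodic test on every path quickly becomes impractical, which is why a subset of sites agreed with the user community is the more realistic target.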
Current Status of O-E-12: Summary of the EGEE to EGI transition phase • Transition from EGEE SA2 to EGI O-E-12 implied close collaboration and discussions, especially among GARR, EGEE ENOC in Lyon (CC IN2P3), CNRS UREC in Paris • We identified 2 main tools to keep among the ones provided by ENOC and SA2, plus an additional tool to keep following up for possible future adoption: • PerfSONAR-Lite_TSS • On-demand network monitoring and troubleshooting tool based on perfSONAR • The DownCollector • A central tool to check Grid services registered in the GOC DB on their specific TCP ports • The Grid Job based approach for network monitoring • A system not requiring any local deployment by sysadmins
DownCollector • The DownCollector is a polling tool reporting on the reachability of the services registered in the GOC DB • Star-based architecture, central tool • All tests start from the same initial point • It checks that services are reachable on their corresponding TCP ports • Available at https://ccenoc.in2p3.fr/DownCollector/ • Migrated to https://perfsonarlitetss.dir.garr.it/DownCollector/ • It will be accessible through a new portal dedicated to the O-E-12 task, which will be available at the URL http://eginet.garr.it • This is NOT YET available. It will be set up in the coming days • High Availability currently not available • Might be implemented in future if operation proves HA to be useful • Originally developed by IN2P3 CC-Lyon within EGEE SA2 • In future, endorsed by GARR
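The kind of check described above (a single central point testing whether each registered service answers on its TCP port) can be sketched as follows. This is only an illustration under stated assumptions: the names `Service`, `is_reachable` and `poll` are hypothetical, the service list would in practice come from the GOC DB, and this is not the DownCollector's actual code.

```python
import socket
from typing import Dict, Iterable, NamedTuple


class Service(NamedTuple):
    """A registered service endpoint (hypothetical structure)."""
    host: str
    port: int


def is_reachable(service: Service, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((service.host, service.port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def poll(services: Iterable[Service], timeout: float = 5.0) -> Dict[Service, bool]:
    """Check every service from this single (central) vantage point."""
    return {s: is_reachable(s, timeout) for s in services}
```

Because all tests start from one point, a failure here says only that the service is unreachable from the central server, not from everywhere; that limitation is inherent to the star-based design.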
perfSONAR-lite TroubleShooting Services [Diagram: Users, central server (ENOC), Probe A at Site A, Probe B at Site B. Steps: 1 - Request, 2 - Request, 3 - e2e measurement, 4 - Result, 5 - Result] • Started in EGEE-III, entirely designed by SA2 • Developments led by DFN/Erlangen as an SA2 partner • Central server orchestrating on-demand e2e measurements between light probes hosted by Grid sites • EGEE-driven improvements of the standard perfSONAR framework • Authentication & Authorisation mapped from GOCDB’s roles
PerfSONAR-Lite_TSS Focus on on-demand troubleshooting • Users: ENOC supervisor, ROC members, site administrators, subject to an authentication and authorization process • Launch tests on demand from a Grid site under central server control: • Bandwidth measurements, DNS lookup, Traceroute, Port testing, Ping • Easy to use for Grid administrators • Can be used quickly by a site admin without having to contact the remote site involved in the problem each time [Diagram: Grid site A and Grid site B, each with a local light perfSONAR probe, and the central ENOC monitoring server; numbered steps 1 to 7] Networking Support – Xavier Jeannin - EGEE-III First Review 23-24 June 2010
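How a central server might decide whether a given user may launch a test from a given site can be sketched as below. The three roles come from the slide, but the scoping policy (supervisor: any site; ROC member: sites of their ROC; site administrator: their own site) is purely an assumption made for illustration, as are all class and function names; the actual rules of pS-Lite_TSS are not described here.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ENOC_SUPERVISOR = "enoc_supervisor"
    ROC_MEMBER = "roc_member"
    SITE_ADMIN = "site_admin"


@dataclass(frozen=True)
class User:
    name: str
    role: Role
    scope: str = ""  # ROC name for ROC members, site name for site admins


@dataclass(frozen=True)
class Site:
    name: str
    roc: str


def may_launch(user: User, source: Site) -> bool:
    """Hypothetical policy: may `user` start a test from `source`?"""
    if user.role is Role.ENOC_SUPERVISOR:
        return True                       # assumed: unrestricted
    if user.role is Role.ROC_MEMBER:
        return source.roc == user.scope   # assumed: own ROC only
    if user.role is Role.SITE_ADMIN:
        return source.name == user.scope  # assumed: own site only
    return False
```

Keeping the decision on the central server is what would let a site administrator test towards a remote site without first contacting its administrators.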
PerfSONAR-Lite_TSS • First version was released and installed on 6 sites • Installation guide and procedure: http://www.dfn.de/en/enhome/x-win/download-of-perfsonar-lite-tss/ • FAQ, tutorial, new features (users, sites, ROC management) • Software authorization schema was adapted to fit the hierarchical EGI/NGI model • Difficult to deploy the software during the transition phase toward EGI
perfSONAR-lite TSS: outlook • Expected users: Sites, ROCs, ENOC... • Status: Tool basically ready, but still lacking a maturation phase • Suffered some staff movements and licensing issues • Not yet fully in production, but a distributed testbed is in place • First production release issued at the end of March • Future: • A wrap-up of the current status and an initial deployment strategy within EGI are required • O-E-12 will follow up and organize dedicated pre-production deployment campaigns in the next weeks • Future developments to further improve the security of available-bandwidth tests and to simplify AA • May be followed and used outside EGI • DFN and CNRS declared their interest in following up the tool
Grid Job based approach to monitoring • Within EGEE SA2 a development started to exploit an approach to Network Monitoring for the Grid based on Grid Jobs • “Monitor the Grid using the Grid” • The main advantage of this approach is that Grid site administrators don’t have to deploy anything • They only need to accept 2 jobs permanently running from a specific VO • This approach was conceived especially with the small and medium-size EGEE sites in mind, with limited resources and attendance/manpower • EGEE SA2 produced a prototype deployed on a testbed of 8 sites in France and Italy • Main developers are Etienne Double / CNRS UREC and Alfredo Pagano / GARR • Structure, example, issues, options will be further described in another presentation by O-E-12
Job-based Network monitoring for Grid [Architecture diagram: Grid network monitoring jobs; monitoring servers at UREC CNRS with databases DB 1 and DB 2; front-end at GARR handling www requests. Possible evolutions: monitoring servers at ROC1 (Server A, Server B) with a ROC1 DB] • Frontend: Apache Tomcat, Ajax, Google Web Toolkit (GWT) • Backend: PostgreSQL • Implementation languages: Python, bash script
Assessment of the current model for network support within the NGIs • EGEE SA2 contributed 3 questions to the Questionnaire for the NGIs (operations): • Do you expect to nominate a network representative who can be the contact point for the collaboration with the Network Support task at EGI level? • Could you briefly describe your current operational model for network related tickets and issues? • Have you contacted your NREN to participate in the Network Support task? (if yes, provide details) • As expected, the answers varied widely, both in content and in the amount of information provided
First highlights from the Questionnaire • 32 organizations (31 NGIs + CERN) answered • As of today: • 13 provided the email address of a contact person/team for the Network Support task • 14 answered they will appoint someone (or will possibly do it) • 5 answered they will not, or they have not decided yet, or they did not answer • In 28 cases the NGI and the NREN are interconnected with already established workflows for network related issues (1 not applicable: CERN) • We will analyse the outcome in more detail and provide a summary document to SA1 and O-E-12/Network Support contacts
Challenges ahead • Get the Network Support task fully supported by all NGIs, to • Involve NGIs in a reasonable roadmap towards achieving the O-E-12/Network Support goals • The real challenge of the task is multi-domain/cross-domain e2e network support • People should discuss, agree and act on common goals • We consider this the first major achievement for the O-E-12 task • What shall we focus on? • We proposed something: • Sharing of information on scheduled downtimes and observed faults • 3 tools to keep working on, exploiting them on a larger set of sites • Organize a general workflow for observed or perceived performance issues, setting up at the EGI level a single entry point for PERT support, able to properly handle, route and escalate the issues • Defining, together with the VRC/VOs, a subset of relevant sites for which periodic and systematic NM measurements could be set up
The fundamental O-E-12 Trade Off • There is a general trade off to keep in mind: • Doing essentially nothing • “The network works... and anyhow, if I have a problem, I know myself whom to call...” • Doing too much, trying to provide too much information, which normally ends up meaning no useful information • Shooting iperf-like tests everywhere, across the full mesh of sites • Providing all tickets related to all NRENs in all possible languages to a single unfortunate team, in charge of informing everyone that the Institute for Submarine Research of the University of Nowherecity, in the country of TheresEvenMe-Land, will possibly have an electric power cut next Thursday, after having been able to translate and understand the original ticket
Challenges ahead / brainstorming • We would like to see useful tools agreed upon and adopted by the NGIs • We would like to provide useful information, only useful information, only when required, to essentially everyone in need of it • Can we envisage a general tool, and the corresponding required level of standardization, able to provide to “everyone” the binary (1/0) information about the network reachability of a specific Grid site? • Going beyond “modelling specific workflows for specific Grid projects”? • In other words: would it be possible to provide Grid managers with information on possible network problems related to a specific site?
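One way to picture the binary (1/0) reachability information raised above is as an aggregation of per-service checks into one flag per site. The sketch below is a thought experiment only: the function name, the input format and, above all, the rule "a site counts as reachable if at least one of its services answers" are assumptions, not an agreed standard.

```python
from typing import Dict, Iterable, Tuple


def site_reachability(results: Iterable[Tuple[str, bool]]) -> Dict[str, int]:
    """Collapse (site, service_reachable) pairs into a 1/0 flag per site.

    Assumed rule: a site is reachable (1) if at least one of its
    checked services answered, otherwise 0.
    """
    status: Dict[str, int] = {}
    for site, ok in results:
        status[site] = 1 if (ok or status.get(site, 0)) else 0
    return status


# Hypothetical site names and results, for illustration only.
checks = [("SITE-A", True), ("SITE-A", False), ("SITE-B", False)]
print(site_reachability(checks))
```

Whatever aggregation rule were chosen, agreeing on it across NGIs is exactly the standardization question the slide poses.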
What has been achieved so far • Migration plans successfully completed: • pS-Lite_TSS server, DownCollector, BugZilla already in place • Started liaising with GEANT3/DANTE to formalize the EGI-GEANT collaboration • Discussed at the GN3 MB on May 5, 2010 • Identified areas for collaboration: • Security • AAI • perfSONAR and interdomain tools • Got an initial set of contacts and a very sketchy, draft idea of what the various NGIs are doing internally w.r.t. network support • But this requires further work and peer-to-peer communication
Still Missing / Next steps • Create a full-fledged portal for network support • Including contacts / wiki / documents / access to the tools • Possibly hosting it under the domain netsup.egi.eu? • For the moment we will start with eginet.garr.it • Plan the further development and deployment strategies for • PerfSONAR-Lite_TSS • Grid Job based approach to Network Monitoring for Grids • Get new NGIs and new sites involved in them • Organize established communication channels / fora between NRENs and NGIs, aimed at defining an agreed multi-domain strategy and the concrete tools / steps / workflows the NRENs will provide to EGI/NGIs: • NRENs & NGIs event? • Periodic videoconferences involving NGIs and NRENs?
Thank you. Questions ?