This document discusses the assessment of network needs and metrics for monitoring the EGEE project, including troubleshooting, identifying bottlenecks, and improving application performance.
EGEE-Monitoring
2nd EGEE Technical Network Liaison Committee
Xavier Jeannin (CNRS/UREC Paris, FR)
24 February 2009
Contents
SA2 projects
• Assessment of the network needs, RedIRIS, topology
• PerfSONAR-Lite TSS
EGEE-III monitoring
• Why monitoring
• SA2 approach during EGEE-III
• EGEE constraints
• What will the monitoring be used for?
• Metrics
• Open problems
• Approach
• NRENs participation
• Next step
EGEE monitoring — X. Jeannin — TNLC Feb. 2009
Assessment of the network needs for a Grid site
[Figure: "Spanish TIER2 Architecture - Beta 1", a RedIRIS topology diagram. Legend: TIER 1, TIER 2, Regional Network, RedIRIS Router, RedIRIS Switch. Sites and networks labelled: CIEMAT, IFCA, USC, UNICAN, UMA, CESGA, UV, PIC, UB, UAB, UAM, CAM, Anella. RedIRIS nodes labelled: EB-Santander0, EB-Bilbao0, EB-Santiago0, EB-Iris2, EB-Iris4, EB-Barcelona0, EB-Madrid0, EB-Tenerife0, GW-Barcelona0, GW-Nacional1, GW-Nacional2, GW-Madrid0, GW-Valencia0, GW-Sevilla0, SW-Tenerife2.]
PerfSONAR-Lite TSS
• Network troubleshooting tool
• Launches tests on demand from the Grid site under central server control: ping, traceroute, DNS lookup, nmap and bandwidth measurements
[Figure: workflow in six numbered steps between the ENOC supervisor or site administrator, the ENOC administrator, the central ENOC monitoring server, and the lightweight PerfSONAR sensors at local Grid sites A and B.]
Why monitoring
• Foundation of the relationship between the applications (the users), the middleware and the network
• Troubleshooting
• Identify site and application needs and network performance
• Increase application performance
• Identify bottlenecks
• Identify QoS needs
• “Applications see the network as only able to deliver best effort, and grid applications are designed to bear it”
SA2 approach during EGEE-III
• Investigate and try to design a solution by the end of EGEE-III, within the strong constraints of EGEE (300 sites, etc.), to be implemented during EGI-NGI
• A working group: applications (users), sites, NRENs, SA2
• A lot of work has already been done and should be reused:
  • In DataGrid, EGEE (https://edms.cern.ch/document/695235/2)
  • Project DORII (DSA1.1)
  • GEANT (http://monstera.man.poznan.pl/jra1-wiki/images/5/50/GN2-05-265_Deliverable_DJ1.2.3-v2.0.doc)
  • GGF (http://www.gridforum.org/documents/GWD-R/GFD-R.023.pdf)
EGEE constraints
Constraints from Grid sites:
• A lightweight product
• A well-known product
• As non-invasive as possible:
  • Minimal continuous measurement
• Access to the measurements and to the tools
• A simple dedicated box (PC) should be enough
• Monitoring box deployed for each scientific project
Constraints from the project:
• Sustainability of the software
• Do not develop a handmade tool; rely on a set of well-known tools
What will the monitoring be used for?
• Check e2e link state
  • Troubleshooting for difficult problems
• Identify bottlenecks
• Identify QoS needs and check the effect of QoS
• Check on SLAs
• Ensure the network is working properly for the Grid
• Application performance and application requirements
• Provide feedback to the project:
  • Link with grid operation
  • Link with grid management
• Not, in the first place, to provide input to the Grid workflow
  • The Grid scheduler (WMS)
Metrics: first thought (1)
• Metrics to be investigated (first assessment of what each metric will be used for and by whom)
  • OWD (One-Way Delay) (NOC/GOC; real-time applications)
    • Accuracy, NTP
  • RTT (Round-Trip Time) (NOC/gLite/GOC; estimation of OWD? used for TCP)
    • Simple
  • IPDV (IP Packet Delay Variation) (NOC; real time (voice), identification of network load)
  • Packet loss (NOC/gLite/GOC; reliability, stability, TCP performance; voice and video do not bear more than 1%)
    • TCP
  • Capacity (NOC/GOC; topology)
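As an illustration of how some of these metrics relate, the sketch below derives packet loss, mean RTT, an RTT/2 estimate of OWD, and a delay-variation figure from one series of probes. It is a minimal sketch under stated assumptions, not a tool from the slides: the function name `summarize_probes` and the sample values are hypothetical, RTT/2 only approximates OWD when the path is symmetric, and the delay variation here is computed on RTTs as a stand-in for true one-way IPDV, which needs synchronized clocks (hence the NTP accuracy concern).

```python
from statistics import mean

def summarize_probes(rtts_ms):
    """Summarize a probe series. rtts_ms holds one RTT in milliseconds
    per probe sent, with None marking a lost probe (hypothetical input format)."""
    received = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(received)
    loss_ratio = lost / len(rtts_ms)

    mean_rtt = mean(received) if received else None
    # Assumption: symmetric path, so one-way delay is about half the RTT.
    owd_estimate = mean_rtt / 2 if mean_rtt is not None else None

    # Delay variation between consecutive probes, skipping any pair
    # in which one of the two probes was lost.
    variations = [b - a for a, b in zip(rtts_ms, rtts_ms[1:])
                  if a is not None and b is not None]
    mean_abs_variation = mean([abs(v) for v in variations]) if variations else None

    return {
        "loss_ratio": loss_ratio,
        "mean_rtt_ms": mean_rtt,
        "owd_estimate_ms": owd_estimate,
        "mean_abs_delay_variation_ms": mean_abs_variation,
        # The slide's rule of thumb: voice and video do not bear more than 1% loss.
        "exceeds_1_percent_loss": loss_ratio > 0.01,
    }

# Made-up sample: five probes, the third one lost.
summary = summarize_probes([10.0, 12.0, None, 11.0, 15.0])
```

A real deployment would take these samples from an established measurement tool rather than from hand-built lists, in line with the constraint of relying on well-known tools.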
Metrics: first thought (2)
• Available bandwidth (NOC/gLite/GOC; bottleneck identification; the NOC may also need available bandwidth per hop; SLA; tools: Pchar, Iperf? frequency?)
• Achievable bandwidth (gLite/GOC; reliability, stability, TCP performance, SLA)
  • Difficult to measure
• PMTU (Path Maximum Transmission Unit) (NOC/GOC; troubleshooting transit problems with large packet sizes; segmentation problems)
• SNMP: packets in/out, dropped frames, errors
• Data access/data storage
• The measurements MUST be available for both IPv4 and IPv6.
• Access to these metrics should be provided from a single point that abstracts all the network domains and technologies involved.
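Since achievable bandwidth is difficult to measure directly, one rough way to see why RTT and packet loss matter for TCP performance is the well-known Mathis et al. approximation, rate ≈ (MSS / RTT) · (C / √p), with C ≈ √(3/2). This model is not part of the slides; the sketch below uses it only as an illustration, and the function name and example values (1460-byte MSS, 50 ms RTT, 1% loss) are assumptions.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_ratio):
    """Approximate upper bound on steady-state TCP throughput in bit/s,
    following the Mathis et al. model: (MSS / RTT) * (C / sqrt(p)).
    Only meaningful for a loss ratio strictly between 0 and 1."""
    if not 0 < loss_ratio < 1:
        raise ValueError("loss_ratio must be strictly between 0 and 1")
    c = sqrt(3 / 2)  # constant for periodic loss without delayed ACKs
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_ratio))

# Assumed example values: 1460-byte MSS, 50 ms RTT, 1% packet loss.
estimate = mathis_throughput_bps(1460, 0.05, 0.01)
```

With these assumed inputs the bound is roughly 2.9 Mbit/s, and multiplying the loss ratio by four halves it, which shows how strongly loss limits TCP on long paths.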
Open problems
• E2E measurements versus aggregated metrics within the network provider (GEANT DJ1.2.3)
  • Use NREN or GEANT metrics for sites that are not monitored
• Size of the infrastructure to be monitored:
  • Reduce the scale of the measurements: Virtual Organization by Virtual Organization
  • What measurement frequency is required?
• Correlate E2E measurements with aggregated metrics
  • Automate correlation with the network provider
• Data storage
• Being able to compare with a reference state
• Problem: a monitoring box deployed at the end site for each scientific project
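The scale problem can be made concrete with a short calculation. A full mesh of end-to-end measurements between n sites needs n(n-1)/2 site pairs, which for the 300 sites quoted for EGEE is 44,850 pairs. The sketch below compares that with measuring only between sites that share a Virtual Organization. The VO names and site memberships in it are hypothetical and chosen only to show the counting.

```python
from itertools import combinations
from math import comb

def full_mesh_pairs(n_sites):
    """Number of unordered site pairs in a full measurement mesh."""
    return comb(n_sites, 2)

def per_vo_pairs(vo_sites):
    """Number of distinct unordered site pairs when measurements run
    only between sites belonging to the same Virtual Organization.
    vo_sites maps a VO name to the set of sites supporting it."""
    pairs = set()
    for sites in vo_sites.values():
        # Sorting makes (a, b) and (b, a) count as the same pair.
        pairs.update(combinations(sorted(sites), 2))
    return len(pairs)

# Full mesh for the 300 sites mentioned for EGEE.
egee_full_mesh = full_mesh_pairs(300)

# Hypothetical membership: two VOs sharing one of four sites.
example_vos = {
    "vo_a": {"site1", "site2", "site3"},
    "vo_b": {"site3", "site4"},
}
vo_mesh = per_vo_pairs(example_vos)   # pairs measured VO by VO
all_mesh = full_mesh_pairs(4)         # full mesh over the same four sites
```

In this toy case the VO-by-VO approach needs 4 pairs instead of 6; the saving on a real infrastructure depends entirely on how much the VOs overlap.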
Approach (1)
• Specify the requirements in detail
  • Identify the requirements of the Grid Operation Center (GOC)
  • Identify the requirements of the applications
  • Check whether anything has changed since the EGEE-II/SA2 deliverable
  • SLA
• Establish a list of relevant metrics
  • According to EGEE constraints and specific characteristics (type of traffic), using the results of the RedIRIS study
  • Limit metrics to the bare necessities
• Time accuracy
  • NTP
• Correlation with other sources
  • GEANT/NREN input
  • Use performance measurements from applications / Grid Observatory
Approach (2)
• Define the organization of the monitoring
  • Data storage organization
  • Scheduling/synchronization
  • Frequency of the measurements
  • Type of measurement according to the different constraints (T2/T3)
• Establish a list of tools that can
  • Cover the needs
  • Be easily deployed
• Sketch a solution, with a testbed if possible
NRENs participation
• Take part in and review the monitoring design, drawing on their operational experience
• NREN involvement is necessary and will provide important added value
• Investigate how NRENs can provide network metrics to the ENOC for correlation
  • Identify which metrics could be useful
  • Data exchange format and tools
  • Privacy issues
• Sketch a solution, if possible, in a second step?
Next step
• First, collect feedback from applications (users), GOC, sites (SA1), NRENs, monitoring experts, SA2
  • Specify EGEE requirements in detail
• NRENs investigate:
  • Whether they could take part in the working group
  • What kind of metrics they can share with the NRENs
  • How these metrics could be shared
• Setting up the working group
• First meeting to define the working plan (end of April or May)
  • Topics to be investigated
  • Sharing out the work
  • Working plan