This update provides an overview of the status, tools, and plans for perfSONAR Monitoring of LHCONE/LHCOPN. Issues and changes are discussed, along with updates on the MaDDash and OMD monitoring pages. The WLCG Network and Transfers Metrics WG's monitoring of perfSONAR instances is also mentioned.
perfSONAR for LHCOPN/LHCONE Update Shawn McKee/University of Michigan LHCONE/LHCOPN Meeting LBNL/Berkeley, CA June 1st, 2015
Overview of Talk
• Overview of Status (Changes, Issues)
• Tools and plans for perfSONAR Monitoring of LHCONE/LHCOPN
• Discussion
LHCONE/OPN-Berkeley-Shawn McKee
LHCONE MaDDash – 09 Feb 2015
A couple of hosts had issues: ps01-nl.geant.nl (labelled perfSONAR-latency) and the Internet2 host at ManLan (labelled Internet2 perfSONAR).
NOTE: labels are generated from Mesh registration information.
LHCONE MaDDash – 01 Jun 2015
We are looking much better. The residual “orange” still needs investigating. A couple of instances have not been upgraded to 3.4.2. Some hosts may be underpowered.
NOTE: MaDDash settings are still the defaults and are not yet settable per row/cell.
LHCOPN MaDDash – 09 Feb 2015
We had a couple of hosts with issues:
• Kisti had firewall issues: updated today
• Still LOTS of orange on the latency mesh
• BW mesh much better, but “red” throughput is worth examining
NOTE: labels are generated from Mesh registration information.
LHCOPN MaDDash – 01 Jun 2015
Latency is much better but there are still some issues: RAL shows signs of continuing network problems. The BW mesh is similar: Kisti still has problems, and “red” throughput is worth examining.
perfSONAR Coverage Monitoring
• The WLCG Network and Transfers Metrics WG monitors the number of registered WLCG/OSG perfSONAR instances, updated daily: https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics
• Versions are tracked and summaries of the problems are listed
OSG Network Service
• Open Science Grid (OSG) has deployed a network service for WLCG (and LHCONE). It consists of:
  • A datastore based upon Esmond (new MA in perfSONAR v3.4)
  • A GUI using MaDDash
  • A service monitoring component built on OMD
  • A “mesh-creation-configuration” utility built on registered information in OIM and GOCDB
• Status:
  • Datastore is “almost” production (targeting July)
  • MaDDash and OMD are working well
  • Mesh-config: the plan is to package it so that others can use it
OSG Network Datastore
• A critical component is the datastore to organize and store the network metrics and associated metadata
• OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances
• This data will be available via an API, must be visualized and must be organized to provide the “OSG Networking Service”
• Operating now
• Targeting a production service by end of July
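As an illustration of what access "via an API" could look like, here is a minimal sketch that only builds a query URL and does not contact any server. It assumes the datastore exposes the usual Esmond perfSONAR archive path (`/esmond/perfsonar/archive/`) and its common filter parameters; the slides do not confirm the exact path, and the host names passed in are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumption: the datastore uses the standard Esmond perfSONAR archive path.
ARCHIVE_PATH = "/esmond/perfsonar/archive/"


def build_metadata_query(base_url, source, destination, event_type, time_range_s):
    """Return a URL asking an Esmond archive for matching test metadata."""
    params = {
        "source": source,
        "destination": destination,
        "event-type": event_type,
        "time-range": time_range_s,  # look-back window in seconds
        "format": "json",
    }
    return base_url.rstrip("/") + ARCHIVE_PATH + "?" + urlencode(params)


# Hypothetical hosts; only the datastore base URL comes from the slides.
url = build_metadata_query(
    "http://pfds.grid.iu.edu",
    source="hostA.example.org",
    destination="hostB.example.org",
    event_type="throughput",
    time_range_s=86400,
)
print(url)
```

Keeping URL construction separate from the HTTP call makes it easy to check the query before pointing it at a production service.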
perfSONAR Monitoring Pages
• We have 3 versions of our perfSONAR monitoring pages
  • Prototype at maddash.aglt2.org
  • Testing at OSG’s ITB instance
  • Production at OSG’s production instance
• Main monitoring types are MaDDash and OMD/Check_MK
  • Prototype:
    http://maddash.aglt2.org/maddash-webui
    https://maddash.aglt2.org/WLCGperfSONAR/check_mk
  • Testing:
    http://perfsonar-itb.grid.iu.edu/maddash-webui/
    https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
  • Production:
    http://pfmad.grid.iu.edu/maddash-webui/
    https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk
• Notes:
  • OSG instances rely upon the OSG Datastore: http://pfds.grid.iu.edu
  • An X509 cert is needed to view the check_mk/OMD pages (any IGTF cert)
OMD for LHCONE/LHCOPN perfSONARs
https://maddash.aglt2.org/WLCGperfSONAR/check_mk
• Since February we have added new tests
  • “Expected” test coverage
  • NDT/NPAD running?
  • Memory on hosts (<4GB)
  • New “version” test
• Access requires an X509 credential from an IGTF CA
• Gives us a good view into where problems still exist
OMD Hostgroup Summary LHCOPN/LHCONE
WLCG-wide Mesh
• The Network & Transfer Metrics WG is focusing on the top WLCG sites (by storage) in a new WLCG mesh
  • Plan is to create a full mesh of latency and bandwidth tests
  • We will turn down other meshes while we debug
  • BW tests run for 30 seconds, once per week, between ALL sites
  • Currently the mesh comprises the top 43 sites
  • We will add sites until we hit issues
• Ultimately we will restore regional/VO meshes based upon the use-case needs we have gathered from the VOs
• What about our meshes?
  • LHCOPN Mesh should stay active?
  • LHCONE Mesh should stay active? (Let’s discuss this)
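To get a feel for what a full mesh of 43 sites implies, the following sketch works out the weekly bandwidth-test load. It assumes one 30-second test per ordered (source, destination) pair per week, which is one reading of "between ALL sites"; the function name and its parameters are illustrative only.

```python
def full_mesh_load(n_sites, test_seconds, tests_per_week=1):
    """Weekly bandwidth-test load for a full mesh of n_sites hosts.

    Assumes every ordered (source, destination) pair is tested separately.
    """
    directed_pairs = n_sites * (n_sites - 1)
    total_seconds = directed_pairs * test_seconds * tests_per_week
    # Each host is the source of (n - 1) tests and the destination of (n - 1).
    per_host_seconds = 2 * (n_sites - 1) * test_seconds * tests_per_week
    return directed_pairs, total_seconds, per_host_seconds


pairs, total_s, per_host_s = full_mesh_load(n_sites=43, test_seconds=30)
print(pairs, "ordered pairs")
print(total_s / 3600, "hours of testing per week across the mesh")
print(per_host_s / 60, "minutes per week with any one host involved")
```

Under these assumptions the mesh has 1806 ordered pairs, about 15 hours of total test time per week, and each host is busy with bandwidth tests for 42 minutes per week. The quadratic growth in pairs is the reason for adding sites gradually.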
WLCG Latency Mesh
Other Changes
• IPv6 use is increasing. A dual-stack testing mesh is in place
• Since our last meeting we have changed the default tests for bandwidth and traceroute in the mesh-config
  • Iperf -> Iperf3 (provides better BW estimation)
  • traceroute -> tracepath (provides MTU details)
• The traceroute to tracepath change caused problems in our coverage
  • Prior to the change we had 83% of the theoretical number of paths covered, and only 53% after
  • Debugging with Andy Lake/ESnet showed the problem seemed to be firewall related in many cases
  • We will likely revert the default topology test to traceroute and ADD a new (low frequency) tracepath test
• As I will discuss next, topology is VERY important
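The coverage figures above can be read as the fraction of host pairs for which a path measurement exists. A minimal sketch of that calculation follows, assuming the "theoretical number of paths" means every ordered pair of distinct hosts; the host names and result sets are made up for illustration and are not the real mesh data.

```python
def path_coverage(hosts, observed_pairs):
    """Fraction of ordered host pairs (src != dst) that have path results."""
    expected = {(s, d) for s in hosts for d in hosts if s != d}
    covered = expected & set(observed_pairs)
    return len(covered) / len(expected)


# Hypothetical three-host mesh: six ordered pairs are possible.
hosts = ["a", "b", "c"]
traceroute_pairs = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c")]
tracepath_pairs = [("a", "b"), ("b", "a"), ("a", "c")]

print("traceroute coverage:", path_coverage(hosts, traceroute_pairs))
print("tracepath coverage:", path_coverage(hosts, tracepath_pairs))
```

Computing both numbers from the same host list makes a drop after a tool change easy to spot, and the set difference between the two result sets points at the pairs worth checking for firewall problems.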
Future Work: Using Our Data
• Host A is getting poor performance to Host B and seeing 3% packet loss
• Normally we would start to investigate partial paths to isolate the problem
• However we also see that Host D to Host C is having problems, with 2% packet loss
• And there is a third pair (Hosts E and F) with 1% packet loss
• Let’s correlate these paths
[Figure: the three host-to-host paths drawn as chains of numbered hops (016, 079, 109, 128, 340, 419, 481, 592, 613, 746, 772, 835, 907); hops 613, 481 and 772 appear on more than one path.]
Topology View
• Solution: 2% loss from 613-481, 1% loss from 481-772
• Contact these link owners!
[Figure: the same paths merged into a single topology graph connecting Hosts A to F through the numbered hops.]
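The correlation step in this example can be sketched in a few lines. The hop ordering below is hypothetical: only the hop labels, the per-pair loss figures, and the two shared links come from the slides. The sketch also assumes that small loss rates add up along a path, which is an approximation, and it only handles the simple case where a path crosses exactly one shared link.

```python
from collections import defaultdict


def links(path):
    """Undirected links between consecutive hops of a path."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def shared_links(paths):
    """Map each link used by more than one lossy path to those path names."""
    users = defaultdict(set)
    for name, hops in paths.items():
        for link in links(hops):
            users[link].add(name)
    return {link: names for link, names in users.items() if len(names) > 1}


def attribute_loss(paths, loss):
    """Estimate per-link loss from paths that cross exactly one shared link."""
    shared = shared_links(paths)
    estimate = {}
    for name, hops in paths.items():
        on_path = [link for link in links(hops) if link in shared]
        if len(on_path) == 1:
            estimate[on_path[0]] = loss[name]
    return estimate


# Hypothetical hop ordering; loss values are in percent, as on the slide.
paths = {
    "A-B": ["A", "128", "016", "613", "481", "772", "835", "B"],
    "D-C": ["D", "340", "907", "613", "481", "746", "592", "C"],
    "E-F": ["E", "419", "109", "481", "772", "079", "F"],
}
loss = {"A-B": 3, "D-C": 2, "E-F": 1}

estimate = attribute_loss(paths, loss)
for link, pct in estimate.items():
    print("-".join(sorted(link)), f"{pct}% loss")

# Cross-check: the links blamed so far should account for the A-B loss.
explained = sum(estimate.get(link, 0) for link in links(paths["A-B"]))
print("A-B explained:", explained, "of", loss["A-B"])
```

With these inputs the two shared links receive 2% and 1%, and together they account for the 3% seen between Hosts A and B. A graph database such as Neo4j could hold the same link and path data at scale; plain dictionaries are used here only to keep the sketch self-contained.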
Understanding Network Topology
• Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
• Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
• Even without the ability to perform complicated data analysis and correlation, basic tools for network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
Possible Tool - Graph Databases: Neo4j
New WLCG Support Unit
• We have established a new GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
  • Experiments can report potential network performance incidents
  • WLCG perfSONAR support investigates and confirms whether this is a network-related issue
  • Once confirmed, it will notify the relevant sites and try to assist in narrowing down the problem to particular link(s). Ongoing incidents will be tracked via the WG page
• Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider
  • If confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, sites should escalate to WLCG operations coordination
• https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
• LHCOPN/LHCONE experts are very important in this coordinated activity
Next Steps
• We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
  • We have hints that some perfSONAR services stop or hang under some circumstances. We are working with the developers to isolate/fix this
  • Some hosts are underpowered (latency hosts with <4GB of memory) or broken
• New perfSONAR version 3.5 is due out by end of summer, along with MaDDash 2.0
• There are some known bugs in the data acquisition chain that need fixing. This is an ongoing effort
• As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics
Discussion/Questions/Comments?
Discussion Topics
• What would you like to see as next steps?
• Are the tools sufficient to help us with our goals?
• Should we keep both the LHCOPN and LHCONE meshes while the WLCG mesh is being developed?
• Should we push forward on using our metrics to start addressing issues in the network? (Who is “we”?)
Useful URLs
• Open Science Grid Networking URL
  • https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
• LHCOPN instructions for perfSONAR-PS (needs update):
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/PerfsonarPS
• MaDDash Monitoring
  • http://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=LHCONE%20testing%20sites
  • http://pfmad.grid.iu.edu/maddash-webui/index.cgi?dashboard=LHCONE%20Mesh%20Config
• OMD Monitoring
  • https://maddash.aglt2.org/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE
  • https://pfomd.grid.iu.edu/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhostgroup%26hostgroup%3DLHCONE