This presentation provides an overview of the changes and updates in perfSONAR, as well as the infrastructure overview of WLCG, LHCONE, and LHCOPN. It also discusses the status and changes in the meshes, introduces new tools like ElasticSearch, MadAlert, and topology explorations, and concludes with a summary and discussion.
LHCOPN/LHCONE perfSONAR Update Ian Collier/RAL Presenting for Shawn McKee/UM LHCONE/LHCOPN Meeting Taipei, Taiwan March 13th, 2016
Overview of Talk • perfSONAR Changes and updates • WLCG, LHCONE and LHCOPN infrastructure overview • Status and changes in our meshes • Some new tools • ElasticSearch, MadAlert and topology explorations for our data • Summary and Discussion LHCONE-Taipei
Importance of LHCONE perfSONAR • As we start this presentation, it is important to note the usefulness of having LHCONE perfSONAR instances in place. • Just within the last 2 months we have used instances in the US and Europe to help diagnose network issues. • We see a gap in coverage for Asia and it would be very good to get additional instances in place, especially in the regional R&E networks. • We are hoping this LHCONE/LHCOPN meeting will be a chance to encourage additional instances in Asia to join the LHCONE monitoring mesh. • Contact Shawn McKee and Marian Babik if you are interested!
perfSONAR v3.5.1 Toolkit • perfSONAR v3.5.1 released on the 4th of March 2016 • Main themes for this release: • A new web interface for creating/managing your regular tests • Normalized package names, configuration files and paths • Upgrade to Esmond (backward incompatibilities for writing data) • Improved support for Debian 7 and 8 • See release notes http://www.perfsonar.net/release-notes/version-3-5-1 • In addition v3.5.1 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness. • WLCG/OSG Deployment status as of today (great progress): • 3.4.1: 6 • 3.4.2: 8 • 3.5: 2 • 3.5.0: 37 • 3.5.1: 169 • Unknown: 25 (These nodes are either down or hung)
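The per-version counts above can be tallied to see what share of nodes is on the 3.5 series. This is a minimal sketch that uses only the numbers on the slide; the variable and function names are illustrative.

```python
# Version counts copied from the deployment-status slide.
deployed = {
    "3.4.1": 6,
    "3.4.2": 8,
    "3.5": 2,
    "3.5.0": 37,
    "3.5.1": 169,
    "Unknown": 25,  # nodes that are down or hung
}


def version_tuple(version):
    """Turn '3.5.1' into (3, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


total = sum(deployed.values())
on_35_series = sum(
    count
    for version, count in deployed.items()
    if version != "Unknown" and version_tuple(version) >= (3, 5)
)
on_latest = deployed["3.5.1"]

print(f"nodes counted: {total}")
print(f"3.5 series:    {on_35_series} ({on_35_series / total:.1%})")
print(f"3.5.1:         {on_latest} ({on_latest / total:.1%})")
```

The 3.5-series sum comes to 208, the same figure quoted as "Running latest version (3.5+)" on the deployment map slide.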
Review perfSONAR Deployment Options • Configuration managed deployments via bundles (see http://docs.perfsonar.net/install_options.html) • perfSONAR Tools (just tools) • perfSONAR TestPoint (passive, no MA) • perfSONAR Core (+MA) • perfSONAR Complete (+Web and Toolkit Configuration) • perfSONAR Central Management (MaDDash, Auto-config, Centralized config service) • Low-cost nodes to support large-scale deployment (http://docs.perfsonar.net/low_cost_nodes.html ) • $100-200 range should enable broad deployment • Small form factor enables more locations • Some limitations in capabilities due to hardware • VMs - Still not recommended but possible • Target: whole node VMs, VMs with dedicated physical NICs • Main use “end-to-end” infrastructure testing (not network) • What about Docker? • http://www.perfsonar.net/deploy/installation-and-configuration/
Map of perfSONAR Deployment • See http://grid-monitoring.cern.ch/perfsonar_report.txt for stats • 278 perfSONAR instances registered in GOCDB/OIM • 248 active perfSONAR instances • 208 running the latest version (3.5+) • Map: https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3 • Initial deployment coordinated by WLCG perfSONAR TF • Commissioning of the network followed by WLCG Network and Transfer Metrics WG
Gathering & Storing Metrics • OSG is providing network metric data for its members and WLCG via the Network Datastore • The data is gathered from all WLCG/OSG perfSONAR instances • Stored indefinitely on OSG hardware • Data available via Esmond API • In production since September 14th 2015 • The primary use-cases • Network problem identification and localization • Network-related decision support • Network baseline: set expectations and identify weak points for upgrading
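As an illustration of "Data available via Esmond API", the sketch below only builds a metadata query URL against the OSG datastore address listed in the references; it does not contact the server. The filter names (`source`, `destination`, `event-type`, `time-range`) reflect my reading of the Esmond REST client guide and should be checked against it, and the host names are hypothetical.

```python
from urllib.parse import urlencode

# Archive endpoint as given on the References slide.
ARCHIVE_URL = "http://psds.grid.iu.edu/esmond/perfsonar/archive/"


def build_metadata_query(source=None, destination=None,
                         event_type=None, time_range=None):
    """Return a URL that asks the archive for matching measurement metadata.

    The filter names are assumptions taken from the Esmond REST guide;
    filters left as None are omitted from the query string.
    """
    params = {"format": "json"}
    if source is not None:
        params["source"] = source
    if destination is not None:
        params["destination"] = destination
    if event_type is not None:
        params["event-type"] = event_type
    if time_range is not None:
        params["time-range"] = time_range  # seconds back from now
    return ARCHIVE_URL + "?" + urlencode(params)


# Hypothetical hosts: packet-loss metadata for the last 24 hours.
print(build_metadata_query(
    source="ps-latency.site-a.example.org",
    destination="ps-latency.site-b.example.org",
    event_type="packet-loss-rate",
    time_range=86400,
))
```

The resulting URL can be fetched with any HTTP client; the JSON response would then be walked to reach the time-series data for each matching test.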
Review of perfSONAR Pipeline The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use. End users can monitor the data via the OSG MaDDash instance, grab the data directly from the OSG datastore or subscribe to the ActiveMQ bus at CERN.
Configuration for LHCOPN/LHCONE • We have changed to use uni-directional tests for OWAMP to reduce the load • Source host is responsible for initiating and recording test results to each destination • We are using iperf3 as the baseline for bandwidth measurements (adds retry information) • A fix for NDT made last fall ensures the TCP congestion control algorithm is ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use, which improves bandwidth results. • We are sending all the LHCOPN and LHCONE data into ElasticSearch (ongoing)
Existing Test Coverage • Current perfSONAR measurement coverage for WLCG/OSG: • Full latency (one-direction only, 10Hz, OWAMP, IPv4) • Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6) • Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6) • Regional meshes still disabled, need to discuss how to evolve • We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using same params) • We could move from regional to bigger meshes (European, Asia/Pacific, US) • We can create new bandwidth meshes as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes) • We re-enabled project meshes • Belle II – both latency and bandwidth • Dual-stack – just bandwidth (both IPv4 and IPv6) • LHCONE/LHCOPN – These are separately tracked
perfSONAR Monitoring Pages • We have 3 versions of our perfSONAR monitoring pages • Prototype at maddash.aglt2.org (intending to phase this out soon) • Testing at OSG’s ITB instance • Production at OSG’s production instance • Main monitoring types are MaDDash and OMD/Check_MK • Prototype: http://maddash.aglt2.org/maddash-webui https://maddash.aglt2.org/WLCGperfSONAR/check_mk • Testing: http://perfsonar-itb.grid.iu.edu/maddash-webui/ https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/ • Production: http://psmad.grid.iu.edu/maddash-webui/ https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk • Notes: • OSG instances rely upon OSG Datastore: http://psds.grid.iu.edu • X509 cert needed to view check_mk/OMD pages (any IGTF cert)
Check_mk for LHCONE/LHCOPN perfSONARs • https://maddash.aglt2.org/WLCGperfSONAR/check_mk/ (Prototype) • https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ (Production) • Access requires X509 credential from IGTF CA • Gives us a good view into where problems still exist • We monitor: • “Expected” test coverage • NDT/NPAD running? • Memory on hosts (<4GB) • New “version” test
Monitoring Metrics • Use MaDDash to view metric summaries • Provides a quick view of how networks are working • OSG hosts production instance http://psmad.grid.iu.edu/maddash-webui/ • Metrics are displayed via source-destination matrix • Multiple dashboards (meshes) can be selected • Custom menus link to relevant resources • New release (2.0) will incorporate MadAlert http://maddash.aglt2.org/madalert.html
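The source-destination matrix idea can be sketched in a few lines: each cell holds a status derived from the latest measurement for that ordered pair. This is not MaDDash code; the thresholds, status labels and host names below are illustrative assumptions.

```python
def classify(loss, warn=0.001, crit=0.01):
    """Map a packet-loss fraction to a status (hypothetical thresholds)."""
    if loss is None:
        return "UNKNOWN"  # no recent data for this pair
    if loss >= crit:
        return "CRITICAL"
    if loss >= warn:
        return "WARNING"
    return "OK"


def build_matrix(hosts, loss_by_pair):
    """Return {source: {destination: status}} for every ordered host pair."""
    return {
        src: {
            dst: classify(loss_by_pair.get((src, dst)))
            for dst in hosts
            if dst != src
        }
        for src in hosts
    }


hosts = ["site-a", "site-b", "site-c"]  # hypothetical mesh members
loss_by_pair = {
    ("site-a", "site-b"): 0.0,
    ("site-b", "site-a"): 0.02,
    ("site-a", "site-c"): 0.005,
    # pairs without an entry are reported as UNKNOWN
}

for src, row in build_matrix(hosts, loss_by_pair).items():
    print(src, row)
```

Reading along a row shows problems outbound from a source; reading down a column shows problems inbound to a destination.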
Evolution of LHCOPN/LHCONE Monitoring • As usual we will show how the monitoring in MaDDash has changed since the last meeting • We have two known problems with LHCONE instances from GEANT and Internet2 • GEANT instance in Amsterdam was recently upgraded to perfSONAR v3.5.1 BUT there is a problem writing to the updated Esmond • The Internet2 instances are “multi-purpose” and have an MA which uses a different FQDN/IP than the LHCONE measurement interface. The current mesh-config isn’t set up to handle this configuration. • Additionally there may be some problems with these v3.4.1 instances
LHCONE MaDDash – 27 Oct 2015 Some issues getting data from the Internet2/GEANT instances that we need to look into.
LHCONE MaDDash – 11 Mar 2016 Things are looking a bit worse. We have known issues with the AMS_GEANT and Internet2 instances that are being worked on. Real issues into IN2P3 as well as problems outbound? Should be investigated.
LHCOPN MaDDash – 27 Oct 2015 Some firewall problems for the OSG collector from FNAL. Setup being examined at INFN and PIC. Patches to address IPv6/IPv4 and http/https were put in yesterday and things are broken. Should be fixed later today.
LHCOPN MaDDash – 11 Mar 2016 RAL and TRIUMF showing signs of continuing network problems. Latency mesh improved. BW mesh still shows many issues. KISTI still has BW problems.
Existing Tools • We have a number of tools available to help debug and understand network problems. • There are very good presentations on these tools in the training materials provided by perfSONAR: http://www.perfsonar.net/about/training-materials/ • While I don’t have time to cover all the details (see http://www.perfsonar.net/about/training-materials/201507-ps-training/ and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski) I do want to note that command line tools exist to allow you to create on-demand 3rd party tests (between two remote instances) for bandwidth, latency and traceroute. • Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities • As for new tools….
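To make the on-demand 3rd party tests concrete, the sketch below assembles command lines for the BWCTL family (`bwctl`, `bwping`, `bwtraceroute`) without running them. It assumes these tools are installed on the machine issuing the request, that `-s` names the sending host and `-c` the receiving host, and that `-T` and `-t` select the tool and duration for `bwctl`; check the man pages for the installed version. The host names are hypothetical.

```python
import shlex


def third_party_commands(sender, receiver, seconds=20):
    """Build (but do not run) command lines for tests between two remote hosts.

    Flag meanings are assumptions based on the BWCTL man pages:
    -s sending host, -c receiving host, -T tool, -t duration in seconds.
    """
    return {
        "bandwidth": ["bwctl", "-T", "iperf3", "-t", str(seconds),
                      "-s", sender, "-c", receiver],
        "latency": ["bwping", "-s", sender, "-c", receiver],
        "traceroute": ["bwtraceroute", "-s", sender, "-c", receiver],
    }


commands = third_party_commands("ps-bw.site-a.example.org",
                                "ps-bw.site-b.example.org")
for name, argv in commands.items():
    # Quote each argument so the line can be pasted into a shell.
    print(f"{name:10s}", " ".join(shlex.quote(arg) for arg in argv))
```

Keeping the commands as argument lists means they can later be handed to `subprocess.run` unchanged once the flags have been verified.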
ATLAS Network Metrics Pipeline • Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA • See Ilija’s presentation at http://tinyurl.com/gt92zwb • Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA • Prototype working and analytics being performed in ElasticSearch to validate data (see following slide) • Working on a network source-destination cost-matrix PANDA can use to evaluate options • Interface details being discussed with PANDA team • Could also be used to analyze LHCONE/LHCOPN data!
perfSONAR Data into ElasticSearch [Plots: average source loss % and average destination loss %] See http://tinyurl.com/z4dnfs8 for example plots using WLCG data
MadAlert: A project to analyze meshes • Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert • See details at http://madalert.aglt2.org/madalert/index.html • You can see meshes and reports from the page • Reports find both infrastructure and network problems • We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0) • Now testing a “diff” to allow us to compare meshes; e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2) • http://madalert.aglt2.org/madalert/testDiff.html • Could be really helpful for understanding new software versions or changes in time. Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.
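The mesh "diff" idea can be illustrated with a small comparison of two status snapshots. This is a sketch of the concept only, not MadAlert's implementation; the data layout (a dict keyed by ordered source-destination pair) and the status strings are assumptions.

```python
def diff_meshes(mesh_a, mesh_b):
    """Return the cells whose status differs between two mesh snapshots.

    Each mesh maps (source, destination) to a status string. A cell present
    in only one mesh is reported with "MISSING" on the other side.
    """
    changes = {}
    for cell in sorted(set(mesh_a) | set(mesh_b)):
        before = mesh_a.get(cell, "MISSING")
        after = mesh_b.get(cell, "MISSING")
        if before != after:
            changes[cell] = (before, after)
    return changes


# Hypothetical snapshots, e.g. IPv4 vs IPv6 or mesh(t1) vs mesh(t2).
ipv4 = {("site-a", "site-b"): "OK", ("site-b", "site-a"): "OK",
        ("site-a", "site-c"): "WARNING"}
ipv6 = {("site-a", "site-b"): "OK", ("site-b", "site-a"): "CRITICAL"}

for (src, dst), (before, after) in diff_meshes(ipv4, ipv6).items():
    print(f"{src} -> {dst}: {before} => {after}")
```

Only the changed cells are returned, which keeps the report short when two large meshes mostly agree.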
Understanding Network Topology • Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents? • Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate? • Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks. • This area is under active investigation in various projects. Lots of work to do here.
Exploring Path Analysis • We can correlate paths with packet-loss/latency information (PuNDIT) • We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging) [Figure: path graph; node labels: GEANT, Aachen, RAL, DFN, JANET, QMUL, ITEP; metrics: latency, packet-loss, throughput]
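Aggregating nodes that belong to the same NREN can be sketched as collapsing consecutive traceroute hops that map to the same network. The hop names and the hop-to-network lookup below are hypothetical; in practice the mapping might come from AS numbers or address prefixes.

```python
from itertools import groupby


def collapse_by_network(path, network_of):
    """Collapse runs of consecutive hops that belong to the same network.

    Hops with no entry in network_of are kept under their own name.
    """
    labels = [network_of.get(hop, hop) for hop in path]
    # groupby merges adjacent equal labels into a single graph node.
    return [label for label, _ in groupby(labels)]


# Hypothetical traceroute between two end hosts.
path = ["aachen-host", "dfn-r1", "dfn-r2", "geant-r1", "geant-r2",
        "janet-r1", "ral-host"]
network_of = {
    "dfn-r1": "DFN", "dfn-r2": "DFN",
    "geant-r1": "GEANT", "geant-r2": "GEANT",
    "janet-r1": "JANET",
}

print(" -> ".join(collapse_by_network(path, network_of)))
```

Only adjacent hops are merged, so a path that leaves a network and later re-enters it still shows both visits.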
WLCG Support Unit • Reminder: We have a GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch) • Experiments can report potential network performance incidents. • WLCG perfSONAR support investigates and confirms whether this is a network-related issue. • Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s). Tracking of ongoing incidents will be via the WG page. • Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and if necessary escalate to their network provider. • If confirmed to be WAN related, WLCG perfSONAR support unit can assist in further debugging. For the non-technical (policy) issues, sites should escalate to the WLCG operations coordination. • https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents • LHCOPN/LHCONE experts are very important in this coordinated activity.
Next Steps • We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured • We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix. • Some hosts are underpowered (<4GB of memory on latency nodes) or broken • As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than the framework that gets us network metrics. • We need to plan for a campaign to clear up remaining LHCONE/LHCOPN problems. • Currently working on the LHCONE issues we noted previously. • Need more instances in Asia in the regional R&E networks!!
Discussion/Questions/Comments?
References • Network Documentation https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG • Deployment documentation for OSG and WLCG hosted in OSG https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR • New MA guide http://software.es.net/esmond/perfsonar_client_rest.html • Modular Dashboard and OMD Prototypes • http://maddash.aglt2.org/maddash-webui • https://maddash.aglt2.org/WLCGperfSONAR/check_mk • OSG Production instances for OMD, MaDDash and Datastore • http://psmad.grid.iu.edu/maddash-webui/ • https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ • http://psds.grid.iu.edu/esmond/perfsonar/archive/?format=json • Mesh-config in OSG https://oim.grid.iu.edu/oim/meshconfig • New mesh config info: http://soichi7.ppa.iu.edu/pdoc/mca.html • Send feedback to Soichi • Use-cases document for experiments and middleware https://docs.google.com/document/d/1ceiNlTUJCwSuOuvbEHZnZp0XkWkwdkPQTQic0VbH1mc/edit