The State of Networking in OSG

This article discusses the state of networking in the Open Science Grid (OSG) and the challenges faced in maintaining and optimizing networks for distributed infrastructures and scientific research. It also highlights the efforts of OSG in monitoring, managing, and diagnosing network problems for the benefit of its users.

maureeng

Presentation Transcript


  1. The State of Networking in OSG March 8, 2017 Shawn McKee

  2. Networks Supporting Science • While some of us are interested in (or worried about!) networks, it is fair to say most scientists would rather not have to think about them. • Ideally networks are “transparent” and always do the right thing, allowing data to move as fast as possible, from anywhere to anywhere at any time. • The challenge is twofold: • Networks underlie all our distributed infrastructures and must work well for us to use our grids, clouds and HPC resources. • Problems in the network can be very hard to identify, isolate and fix. • OSG is working to better monitor, manage and diagnose our networks for all our benefit. OSG AHM

  3. OSG Networking Area Mission • OSG Networking was added at the beginning of OSG’s second 5-year period in 2012. • The “Mission” is to have OSG become the network service data source for its constituents. • Information about network performance, bottlenecks and problems should be easily available. • It should help our VOs, users and site admins find network problems and bottlenecks. • Provide network metrics to higher-level services so they can make informed decisions about their use of the network (which sources and destinations for jobs or data are most effective?). • The GOAL: to make the most out of the bandwidth (network) we have! How?

  4. Components of OSG Networking • Network Monitoring via perfSONAR • Having perfSONAR fully deployed with a global dashboard is giving us powerful options for better management and use of our network. • A network datastore hosts all network metrics. • Tools to manage and maintain our infrastructure • A Modular Dashboard (MaDDash); critical for “visibility” into networks. We can’t manage, fix or respond to problems if we can’t “see” them. • OMD/Check_MK (used to monitor and verify the state of many globally distributed perfSONAR services); required to maintain the overall proper functioning of the monitoring infrastructure. • The development of the “mesh-configuration” and corresponding GUI; critical to creating a scalable, manageable deployment for WLCG/OSG. • Documentation: installation, debugging, how-tos • Outreach and Support • With the network R&E community, VOs, software developers • OSG Support provides network ticket triage and routing

  5. A Global Monitoring Infrastructure • We enabled a global deployment of perfSONAR to instrument our networks (both IPv4 & IPv6) • 278 perfSONAR instances registered in GOCDB/OIM • 248 active perfSONAR instances • 208 running the latest version (3.5+) • See http://grid-monitoring.cern.ch/perfsonar_report.txt for stats (updated daily)
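
To put the counts on this slide in proportion, here is a small calculation. It assumes the 208 up-to-date instances are a subset of the 248 active ones, which the slide does not state explicitly; the variable and function names are illustrative only.

```python
# Counts quoted on the slide (March 2017).
registered = 278   # perfSONAR instances registered in GOCDB/OIM
active = 248       # active perfSONAR instances
latest = 208       # instances running the latest version (3.5+)


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Assumption: "latest" is counted among the active instances.
active_share = share(active, registered)
latest_share = share(latest, active)

print(f"{active_share}% of registered instances are active")
print(f"{latest_share}% of active instances run 3.5+")
```

Under that assumption, roughly nine in ten registered instances were active and about five in six of those were up to date.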

  6. OSG Network Datastore • A critical component is the datastore to organize and store the network metrics and associated metadata • OSG is gathering relevant metrics from the complete set of OSG and WLCG perfSONAR instances • We have successfully collected more than 1 year of comprehensive network data • Recent updates: • Doubling our storage with disk upgrades • Supports gathering and publishing all types of perfSONAR metrics [Figure: perfSONAR v3.5 sites]

  7. Monitoring Metrics • Use MaDDash to view metric summaries • Provides a quick view of how networks are working • OSG hosts a production instance at: http://psmad.grid.iu.edu/maddash-webui/ • Metrics are displayed in a source-destination matrix • Multiple dashboards (meshes) can be selected • Custom menus link to relevant resources • New release (2.0) incorporates MadAlert: http://maddash.aglt2.org/madalert.html
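
The source-destination matrix mentioned on this slide can be illustrated with a short sketch. This is not MaDDash code and does not use its API; the hostnames and loss values are invented, and `build_matrix` and `render` are hypothetical helpers that only show how per-pair results map onto a grid.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical measurements: (source, destination) -> packet-loss fraction.
Measurements = Dict[Tuple[str, str], float]
Matrix = List[List[Optional[float]]]


def build_matrix(hosts: List[str], data: Measurements) -> Matrix:
    """Rows are sources, columns are destinations; None marks a missing result."""
    return [
        [data.get((src, dst)) if src != dst else None for dst in hosts]
        for src in hosts
    ]


def render(hosts: List[str], matrix: Matrix) -> str:
    """Format the matrix as plain text, using '-' where no measurement exists."""
    width = max(len(h) for h in hosts) + 2
    lines = ["".ljust(width) + "".join(h.ljust(width) for h in hosts)]
    for src, row in zip(hosts, matrix):
        cells = ["-" if v is None else f"{v:.1%}" for v in row]
        lines.append(src.ljust(width) + "".join(c.ljust(width) for c in cells))
    return "\n".join(lines)


# Invented example hosts and values.
hosts = ["ps-a.example.org", "ps-b.example.org", "ps-c.example.org"]
data = {
    ("ps-a.example.org", "ps-b.example.org"): 0.0,
    ("ps-b.example.org", "ps-a.example.org"): 0.021,
    ("ps-a.example.org", "ps-c.example.org"): 0.0005,
}
matrix = build_matrix(hosts, data)
print(render(hosts, matrix))
```

Keeping the two directions of a pair in separate cells matters because loss is often asymmetric, as in the invented a/b pair above.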

  8. OMD/Check_MK Service Monitoring • We are using OMD and Check_MK to monitor our perfSONAR hosts and services. • This provides a useful overview of status and problems: https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ [Requires an x509 certificate in your browser; an update for InCommon authentication is planned]

  9. Detailed Service Checks

  10. Managing perfSONAR Deployments • OSG originally developed a “mesh-config” GUI built within the OIM/MyOSG framework • We provided a GUI to define and organize the regularly scheduled tests between specific sets of perfSONAR instances. • The mesh-config was a huge benefit; we no longer needed to email hundreds of system admins to make changes to network tests and their organization. The GUI made changes easy and consistent. • Problem: it could not easily be made available to others within or outside OSG, for example: • Campuses deploying many perfSONARs • Science VOs wanting to organize/customize their perfSONARs • Solution: Soichi Hayashi produced a draft of a standalone package which provides an even more feature-rich mesh-configuration GUI • However, Soichi left OSG • The draft (beta) version needed a number of fixes and enhancements to meet our needs
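
As a rough illustration of what "regularly scheduled tests between specific sets of perfSONAR instances" implies, the sketch below enumerates the directed test pairs of a full mesh. It does not reproduce the real mesh-config file format or any GUI behavior; `full_mesh` and the hostnames are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def full_mesh(hosts: List[str]) -> List[Tuple[str, str]]:
    """Return every ordered (source, destination) pair among the given hosts."""
    return list(permutations(hosts, 2))


# Invented example group of three instances.
group = ["ps-a.example.org", "ps-b.example.org", "ps-c.example.org"]
pairs = full_mesh(group)

for src, dst in pairs:
    print(f"test {src} -> {dst}")

# A full mesh of n hosts needs n * (n - 1) directed tests.
print(len(pairs))
```

The quadratic growth in test pairs is one reason a central definition of meshes scales better than coordinating changes host by host.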

  11. Standalone Mesh-config (MCA) • Soichi was pulled back into efforts to finish up the MCA (Mesh-Configuration Admin) GUI • 20% effort from December-March, 10% April-May funded by perfSONAR/IU • Documentation at http://docs.perfsonar.net/mca and at https://github.com/soichih/meshconfig-admin • Issues tracked at https://github.com/soichih/meshconfig-admin/issues (14 open, 8 closed) • OSG instance running at https://meshconfig-itb.grid.iu.edu/ (create an account to play with this) • Now 258 hosts imported from OIM/GOCDB • New API available https://meshconfig-itb.grid.iu.edu/apidoc/ • Close to being ready to use for production Shawn McKee - OSG Networking

  12. MCA Screenshot

  13. Primary Features of MCA • Gathers and organizes information on hosts from a combination of sources • Can import perfSONAR global lookup service entries • Able to gather information from registration databases like GOCDB and OIM • Auto-completion / entry of values • Context-dependent user interface (see Testspecs) • Can be easily installed outside of OSG • Provides a RESTful interface to allow easy monitoring and software-controlled config • https://meshconfig-itb.grid.iu.edu/apidoc/ • Supports filtering and dynamic host groups • Can now build dynamic meshes, e.g., all CentOS6 hosts that are members of the ATLAS community • Able to support both perfSONAR 3.x and 4.x
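
The dynamic host group idea (for example, all CentOS6 hosts that are members of the ATLAS community) amounts to evaluating a filter over host metadata each time the group is used. The sketch below shows that idea only; the `Host` fields, the `dynamic_group` helper and the sample records are assumptions and do not reflect MCA's data model or API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Host:
    """Hypothetical host record; field names are illustrative."""
    hostname: str
    os: str
    communities: FrozenSet[str]


def dynamic_group(hosts: List[Host], predicate: Callable[[Host], bool]) -> List[Host]:
    """Select the hosts matching the predicate at the time of the call."""
    return [h for h in hosts if predicate(h)]


# Invented inventory.
inventory = [
    Host("ps1.example.org", "CentOS6", frozenset({"ATLAS"})),
    Host("ps2.example.org", "CentOS7", frozenset({"ATLAS", "CMS"})),
    Host("ps3.example.org", "CentOS6", frozenset({"CMS"})),
    Host("ps4.example.org", "CentOS6", frozenset({"ATLAS", "OSG"})),
]

atlas_centos6 = dynamic_group(
    inventory,
    lambda h: h.os == "CentOS6" and "ATLAS" in h.communities,
)
names = [h.hostname for h in atlas_centos6]
print(names)
```

Because membership is computed from metadata rather than listed by hand, a newly registered host that matches the filter joins the group without any manual edit.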

  14. MCA Dynamic HostGroup

  15. Enabling Alarming • We have a longer-term goal of alerting and alarming on network issues. • Milestone completed: technical design of a suitable analysis system based upon existing time-series technologies • The current operating implementation gathers all perfSONAR data OSG sends to CERN and puts it in Elasticsearch. • A Jupyter instance regularly runs cron tasks to analyze the data • Near-term goal: anyone can subscribe to simple alert emails. • Not “production” yet; the interface needs further tweaking to make it easy for users
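
A periodic analysis task of the kind described here could, in its simplest form, flag source-destination pairs whose recent packet loss exceeds a threshold. The sketch below is an assumption-laden illustration, not the prototype's actual logic: the 2% threshold, the `find_alerts` function, the site names and the sample values are all invented, and data retrieval from the datastore is left out.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Assumed threshold: alert when mean packet loss in the window exceeds 2%.
LOSS_THRESHOLD = 0.02


def find_alerts(
    samples: Dict[Tuple[str, str], List[float]],
    threshold: float = LOSS_THRESHOLD,
) -> List[str]:
    """Return one alert message per (source, destination) pair above threshold."""
    alerts = []
    for (src, dst), losses in sorted(samples.items()):
        if not losses:
            continue  # no measurements in this window, nothing to judge
        avg = mean(losses)
        if avg > threshold:
            alerts.append(f"{src} -> {dst}: mean packet loss {avg:.1%}")
    return alerts


# Invented packet-loss fractions for one analysis window.
window = {
    ("site-a", "site-b"): [0.0, 0.001, 0.0],
    ("site-a", "site-c"): [0.03, 0.05, 0.04],
    ("site-b", "site-c"): [],
}
alerts = find_alerts(window)
for line in alerts:
    print(line)
```

Averaging over a window rather than reacting to single samples is one simple way to avoid alerting on isolated spikes; a production system would likely need more careful baselining.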

  16. Prototype Alarm/Alert System

  17. Examples of Network Analytics • Using the ELK setup from Ilija Vukotic/UC we can look at some of the network data results • This link shows the last 6 months of significant packet loss results by source/destination: http://tiny.cc/PktLossNoUnknown

  18. We can analyze in the context of a specific site or link. Example: http://tiny.cc/pSLink

  19. This example shows measurements being made and captured by OSG: http://tiny.cc/pSDash

  20. perfSONAR v4.0 / MaDDash 2.0 • The perfSONAR v4.0 release was delayed from the nominal Dec 1, 2016 date • An RC3 release was needed to follow up on more issues found in RC2…still not out yet • Targeting ~March 2017 for RC3 • Release in April? • MaDDash 2.0 is close to ready. • Once these are released we will want to update ITB and then Production • Will need a global campaign to get sites updated • New versions fix a number of resiliency issues

  21. Near-term Plans and Future Work • Updates to our network services • perfSONAR v4.0 • MaDDash v2.0 • MCA into production and available for download • Debug, tune and optimize our set of network-related alarms • Enable user subscriptions to alerting based upon specific criteria • Development of new ways to use our networking metrics • For identification of network problems • To support problem diagnosis and localization • Improving user interfaces for network data exploration and use • Exploration of SDN (Software Defined Networking) capabilities as they become production ready • Integration of network data with higher-level services

  22. Summary • We have made significant strides in making the network “visible” and easier to diagnose • ~250 OSG/WLCG perfSONAR deployments globally • All monitored, managed and orchestrated by OSG • Tools to manage and maintain our infrastructure are in place • We have a production datastore providing long-term access to all metrics • New opportunities to improve our networks and their use are possible because of the unique set of data we have • Work to exploit this rich dataset is underway. • Automating identification of network problems and providing “targeted alerting” on them is a high priority. • So is informing and enabling higher-level services, researchers and users

  23. Questions or Comments? Thanks!

  24. Extra Slides

  25. Recent Talks and Papers (Outreach) • At least 5 CHEP 2016 submissions reference OSG Networking • The OSG Network Service • Scaling the PuNDIT project for wide area deployments • Networks in ATLAS • Networking – the view from HEP (Plenary) • Big Data Analytics Tools as Applied to ATLAS Event Data • January 10th was the Pre-GDB on Networking at CERN • OSG’s role in gathering network metrics and alerting and alarming are central items for near-term work • The 2017 ICFA-SCIC Network Monitoring report was submitted: http://icfa-scic.web.cern.ch/ICFA-SCIC/meetings.html • Upcoming presentations on networking • OSG AHM, March 8 • ATLAS Software and Computing week, March 13-17

  26. URLs of Relevance • OSG Network Datastore Documents • Operations: https://docs.google.com/document/d/1l144BSo-88M0cLMMjKcKMIE-Q5s21X-w3lYl-0Pn_08/edit# • SLA: https://twiki.grid.iu.edu/bin/view/Operations/PSServiceLevelAgreement • Data lifecycle: https://docs.google.com/document/d/1mJ1kf43nZf6gvKoNtiTOc0g0MYDv_wSfSm7YdiMs3Lo/edit# • Current OSG network documentation: https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG • OSG networking year-5 goals and milestones: https://docs.google.com/document/d/1FzmXZinO4Pb8NAfd5SWUzaAFYOL23dt66hQsDmaP-WI/edit • perfSONAR adoption tracking: http://grid-monitoring.cern.ch/perfsonar_coverage.txt • Deployment documentation for both OSG and WLCG hosted in OSG (migrated from CERN): https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR • ATLAS Analytics: • Packet-loss: http://tiny.cc/PktLossNoUnknown (6 month view) • perfSONAR dashboard: http://tiny.cc/pSDash • perfSONAR link details: http://tiny.cc/pSLink • Mesh-config in OSG: https://oim.grid.iu.edu/oim/meshconfig • Pre-Production Meshconfig: https://meshconfig-itb.grid.iu.edu/meshconfig/ • MadAlert: http://madalert.aglt2.org/madalert/diff.html • perfSONAR homepage: http://www.perfsonar.net/

  27. Status of OSG Net Services Config Changes

  28. OSG Networking and End-to-end • Most scientists just care about the end-to-end results: • How well does their infrastructure support them in doing their science? • Network metrics allow OSG to differentiate end-site issues from network issues. • There is an opportunity to do this better by having access to end-to-end metrics to compare & contrast with network-specific metrics. • What end-to-end data can OSG regularly collect for such a purpose? • Is there some kind of common instrumentation that can be added to some data-transfer tools? (NetLogger in GridFTP, having transfers "report" results to the nearest perfSONAR-PS instance?, etc.) • Let’s try to put all the information we have together…

  29. Understanding Network Topology • One thing that may not be obvious is that all the network measurements are only really useful if we know the path being measured…topology is very important to find/fix problems. • OSG measures topology via traceroute and tracepath • Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents? • Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate? • Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks. • This area is under active investigation in various projects. Lots of work to do here.
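
One of the simplest comparisons between two recorded paths is to locate the first hop at which they differ. The sketch below assumes each traceroute or tracepath result has already been reduced to an ordered list of hop addresses; `first_divergence` is a hypothetical helper and the addresses come from the reserved documentation ranges.

```python
from typing import List, Optional


def first_divergence(old: List[str], new: List[str]) -> Optional[int]:
    """Return the 0-based hop index where two paths first differ.

    Returns None when the paths are identical. If one path is a strict
    prefix of the other, the index just past the shorter path is returned.
    """
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    if len(old) != len(new):
        return min(len(old), len(new))
    return None


# Invented hop lists for the same source/destination at two different times.
yesterday = ["192.0.2.1", "192.0.2.9", "198.51.100.4", "203.0.113.7"]
today = ["192.0.2.1", "192.0.2.9", "198.51.100.20", "203.0.113.7"]

hop = first_divergence(yesterday, today)
print(hop)
```

Knowing where a route changed narrows down which segment to examine when a change in throughput or loss coincides with it.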

  30. Exploring Path Analysis • We can correlate paths with packet-loss/latency information (PuNDIT) • We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging) • Latency, packet-loss and throughput are all available [Figure: path graph with nodes labeled GEANT, Aachen, RAL, DFN, JANET, QMUL, ITEP]
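
Aggregating nodes that belong to the same NREN can be sketched as collapsing consecutive hops that map to the same network label. The hop names, the hop-to-network mapping and the example path below are invented for illustration; only the network and site labels are taken from the slide, and the path shown is not claimed to be a real route.

```python
from itertools import groupby
from typing import Dict, List


def aggregate_path(path: List[str], owner: Dict[str, str]) -> List[str]:
    """Collapse runs of consecutive hops owned by the same network.

    Hops missing from the owner mapping are kept under their own name.
    """
    labels = (owner.get(hop, hop) for hop in path)
    return [label for label, _ in groupby(labels)]


# Hypothetical mapping from hop name to site or NREN label.
owner = {
    "aachen-ps": "Aachen",
    "dfn-r1": "DFN",
    "dfn-r2": "DFN",
    "geant-r1": "GEANT",
    "geant-r2": "GEANT",
    "janet-r1": "JANET",
    "qmul-ps": "QMUL",
}

# Invented hop-by-hop path between two sites.
path = ["aachen-ps", "dfn-r1", "dfn-r2", "geant-r1", "geant-r2", "janet-r1", "qmul-ps"]
simplified = aggregate_path(path, owner)
print(" -> ".join(simplified))
```

Only consecutive hops are merged, so a path that leaves a network and later re-enters it still shows both visits.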

  31. Software Defined Networks and OSG • Within the next few years evolving technology in the area of Software Defined Networking (SDN) may be able to provide OSG researchers with the ability to construct their own wide-area networks with specified characteristics. • What will OSG be able to do to integrate this type of capability with the rest of the OSG infrastructure? • We are planning for how best to enable evolving capabilities in the network for OSG users and admins. We need to address: • What is the impact on the OSG software stack? • What strategic modifications/additions/tools are useful?
