perfSONAR use at US LHC Facilities • March 9th 2010, LHCOPN • Eric Boyd, Deputy Technology Officer
Outline • Problem Statement: Distributed monitoring of US based LHC sites. • Our Solution: pS Performance Toolkit • User Expectations and Use Cases • Challenges • Supporting the infrastructure • Specifics • Operational recommendations • Measurement Best Common Practices • Success Stories • Future Directions
Problem Statement • US based LHC sites (e.g. US ATLAS Tier2s) wanted a monitoring solution to ensure peak performance between each other and the Tier 1 (BNL). • Connectivity varies by location (e.g. ESnet, Internet2, NLR, Ultralight, RENs), and campus infrastructure must also be considered • Simple requirements: • ‘Available Bandwidth’ testing – scheduled and on demand • ‘Latency’ testing – scheduled and on demand • ‘Passive’ monitoring (e.g. SNMP) • Installation, configuration, and maintenance should be minimal • Homogeneous hardware and software – eliminate systematic errors in the results by keeping the platform the same • Solution should be nimble enough to adapt to other facilities (e.g. Tier 3s, US CMS, integration with MDM)
Possible Solutions – ‘Composite’ Approach • Recommend a suite of tools to install, provide guides/workshops • Would require dedicated hardware – configuration and management would be up to local sites • Could recommend a platform • Could rely on local staff to choose a machine that meets criteria • Edge Solution vs. Network Core • Exchange Points, Backbone, Regional, and Campus networks are working to make perfSONAR data available within the network core • Edge facilities, e.g. where the science data lives and is processed, can get involved in the same manner
Possible Solutions – ‘Composite’ Approach • Positive Features: • Relative cost is low – existing hardware can be used to run most measurement tools • Integration with existing tools on other networks • Drawbacks: • Labor intensive for local staff to maintain system • Hardware may differ from site to site • Development team may spend a lot of time in a ‘support’ role (e.g. getting the tools installed and configured).
Possible Solutions – ‘Appliance’ Approach • Prepare the entire environment for the target use • Uniform system makeup/hardware/versions • Pre-installed and pre-configured • Centrally Managed Solution: • Central facility monitors the health and updates of the framework • Support available for software and hardware • Locally Managed Solution • Local institutions monitor daily activities • Software support (e.g. configuration, bug and security patches) is available • Regular updates anticipated to address bugs/enhancements • Edge Solution vs. Network Core • As in the ‘Composite’ approach, the edges can still participate at a protocol level with all perfSONAR products
Possible Solutions – ‘Appliance’ Approach • Positive Features: • Integration with existing tools on other networks • Homogeneous software and hardware • Easy maintenance and upgrade path • Drawbacks: • Costs associated with management • Hardware support • Potentially contracts to manage the software functionality and operation
Solution – pS Performance Toolkit (pSPT) • In short, this is a Locally Managed Appliance • The pSPT is a bootable CD • Contains all necessary software in a single package • Wizard interface to configure aspects of the system • Upgrade path is simple: burn a new CD and reboot! • Hardware is similar at all US ATLAS locations; each site has two: • ‘KOI’ 1U Server • Pentium 2.2 GHz, Dual Core • 2 GB RAM • 160 GB Hard Drive • Daily operations (e.g. system maintenance and monitoring) done by the local facility • Software support (e.g. updates, interim bug fixes, mailing list for questions) provided by the development team
Solution – User Expectations • Installation • Must be ‘easy’. Given the variability in what is easy for a sysadmin vs. a physicist vs. an administrator, we aimed very low • Burning a CD and rebooting a machine is as simple as it gets • Configuration • Also must be ‘easy’. Step-by-step instructions guide the user through the process of configuring the system and tests • Ability for power users to skip the guided approach • Status and feedback on the process • Operation • System should work without human intervention • Reboots or system halts should not result in a loss of data or configuration; the system should resume operation when back up and running • System should alert when in distress
Solution – User Expectations – cont. • Maintenance • Once again: ‘easy’. Most maintenance tasks, e.g. checking the disk and software, can be automated • Integration into alert systems (Nagios) is in progress (a sketch of such a check follows this slide) • Upgrades should be the same as installation • Data Use • Way to access collected data – either through GUIs or web services • Easy interpretation of results, e.g. make sure all the measurements we are doing are actually useful (!) • Support • Security patches and bug fixes must be made available in a timely manner • Method must exist to ask questions on installation, configuration, and upkeep • Community (e.g. US ATLAS) can self-support along with help from the development team over time
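As an aside, the kind of alert integration mentioned above usually ends up as a small Nagios-style plugin that compares the latest measurement against thresholds. The sketch below is illustrative only: how the value is fetched from the toolkit's measurement archive is left abstract, and the threshold numbers are placeholders, not recommendations.

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios-style threshold check for measured throughput.
# The measured value would come from the pSPT measurement archive in a real
# deployment; here it is passed in directly so the sketch is runnable.
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios exit codes

def check_throughput(measured_mbps, warn_mbps=500.0, crit_mbps=200.0):
    """Return a (status, message) pair for the most recent measurement."""
    if measured_mbps is None:
        return UNKNOWN, "UNKNOWN: no recent measurement available"
    if measured_mbps < crit_mbps:
        return CRITICAL, "CRITICAL: throughput %.1f Mbps" % measured_mbps
    if measured_mbps < warn_mbps:
        return WARNING, "WARNING: throughput %.1f Mbps" % measured_mbps
    return OK, "OK: throughput %.1f Mbps" % measured_mbps

if __name__ == "__main__":
    # Placeholder value; a real plugin would query the archive here.
    status, message = check_throughput(measured_mbps=350.0)
    print(message)
    sys.exit(status)
```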
Solution – Use Cases • US ATLAS (Scientific VO) Use Case: • 2 servers per facility • Bandwidth testing • Latency testing (sensitive – isolated from other measurements) • Configuration is a one-time (initial) step • Configure system information (network settings, location) • GUIs to guide regular test set up • State and measurement data saved on local disk • Maintenance consists of examining data and upgrading the CD when required • Testing is designed to occur without intervention • Data consumption • On-board GUIs to visualize the results • Built on the perfSONAR platform – data can be easily shared (and accessed from other locations) to construct new GUIs
Solution – Use Cases – cont. • Regional/Campus Network Use Case: • Simple deployment within the core and edges of the network • Integrates into perfSONAR deployment on a backbone or exchange point • General Diagnostic Use Case: • Instant availability of a testing point anywhere in the network • Will not harm the operating system of a non-dedicated resource (e.g. one-time use) • Remote Facility Use Case: • Non-technical staff can easily deploy for diagnostic purposes • Interval and magnitude of testing can be adjusted to account for network availability
Solution – Supporting the Infrastructure • perfSONAR-PS Development Team • ESnet, Fermilab, Indiana University, Internet2, SLAC, University of Delaware • Collaboration with perfSONAR-MDM to ensure protocol compatibility of all software and services • US ATLAS Support: • Regular release schedule (~4 per year), on demand releases if something goes wrong • Alerts on vulnerabilities, patches made in a timely manner • Feedback mechanism for bug reports and enhancements • Mailing list for discussion • Maintained by developers • Encourages building a community to answer questions and solve non-software related problems
Solution Details – Operations • Installation of 2 hosts per facility (Tier 2s) • Latency and bandwidth hosts • Position near the ‘edge’ of the facility • Optional: add another host near the storage/compute nodes • Installation of 1 ‘large’ host (Tier 3s, RENs) • Position near the edge • Can run both bandwidth and latency tests, but results may be tainted • ‘Edge’ institutional deployment options: • Border is good for testing connectivity to the outside world and path decomposition for problem diagnosis • Co-located with compute/storage is good for testing what the application will see (e.g. whether traffic passes through a firewall)
Solution Details – Measurement BCP • Bandwidth • TCP tests every 4 hours, 20 seconds in length. Test to all Tier2s and the Tier1 • Latency • One Way – constant stream of 10 packets per second. Test to and from all Tier2s and the corresponding Tier1 • Round Trip – 10 packets every 5 minutes. Test to all Tier2s and the Tier1 • Passive Monitoring • No official stance – interest in making border router data available as well as any links of interest • Currently do not use circuit status monitoring (e.g. E2Emon) • (A command-line sketch of these test parameters follows this slide)
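For readers unfamiliar with the underlying tools, the sketch below shows roughly what these scheduled tests amount to when driven by hand. It assumes bwctl, owping, and ping are installed and that the flags shown match the local man pages (worth verifying); host names are placeholders, not real US ATLAS hosts.

```python
#!/usr/bin/env python
# Rough sketch of the measurement BCP above, driven from Python.
# Destinations and flags are illustrative assumptions, not a pSPT config.
import subprocess

TIER1 = "tier1.example.gov"           # placeholder Tier1 destination
TIER2S = ["tier2-a.example.edu",      # placeholder Tier2 peers
          "tier2-b.example.edu"]

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.call(cmd)

for dest in TIER2S + [TIER1]:
    # Bandwidth BCP: a 20-second TCP test (scheduled every 4 hours).
    run(["bwctl", "-c", dest, "-t", "20"])
    # Round-trip latency BCP: 10 packets (scheduled every 5 minutes).
    run(["ping", "-c", "10", dest])
    # One-way latency BCP: ~10 packets/second as a continuous stream,
    # shown here as a one-minute sample (600 packets, 0.1 s spacing).
    run(["owping", "-c", "600", "-i", "0.1", dest])
```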
Success Stories – Ultralight • University of Michigan to BNL • Poor performance in a single direction • Traversed 5 networks (UofM, Ultralight, Internet2, ESnet, BNL) • perfSONAR available on all parts of the path (e.g. demarcation points) – diagnosis was simplified because the infrastructure was already in place • Process (a toy sketch of this localization step follows this slide): • Test from BNL to each intermediate point • Test from UofM to each intermediate point • Isolated the problem to a single section of the Ultralight network • Once we knew where to look, we had to figure out what to do: • Physical infrastructure – no damage found; cleaning performed for good measure • Hardware – line cards properly seated, no errors found • Software – router operating systems up to date? Any unchecked alarms?
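The localization step is essentially divide-and-conquer over the demarcation points: measure from one end to each intermediate point and look for the first segment where throughput collapses. The toy sketch below illustrates that logic only; the hop names and numbers are made up and do not reproduce the actual Ultralight measurements.

```python
# Toy sketch of the divide-and-conquer localization described above.
# Input: throughput measured from the source to each demarcation point.
# Output: the first segment where performance drops off sharply.
# All hop names and values are illustrative placeholders.
hops = [("network A border", 920.0),   # Mbps to each successive point
        ("network B edge",   915.0),
        ("network B core",    40.0),   # performance collapses here
        ("network C peering",  38.0),
        ("destination border", 37.0)]

prev_name, prev_mbps = "source", None
for name, mbps in hops:
    if prev_mbps is not None and mbps < 0.5 * prev_mbps:
        print("suspect segment: %s -> %s (%.0f -> %.0f Mbps)"
              % (prev_name, name, prev_mbps, mbps))
        break
    prev_name, prev_mbps = name, mbps
```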
Success Stories – Ultralight – cont. • Soft Failure: • A fault or situation that doesn’t cause loss of connectivity, but will impact performance • May go unnoticed for long periods of time • May impact a select set of users • The Ultralight switch was flooded with a global routing table from a peer – this tripped an unchecked warning flag in the configuration • Buffer sizes were limited, even though the switch was configured to make them large • Performance tools (NDT) noticed the discrepancy • The fix was to upgrade the software and reboot • Publication: http://www.internet2.edu/performance/200904-CS-UL.pdf
Success Stories – REDDnet • REDDnet (e.g. the US CMS Tier3 at Vanderbilt University) • Distributed data storage – equipment co-located at LHC schools and positioned near other resources (compute resources, core infrastructure) • Observed that between many facilities, data transfer activities were taking much longer than expected (orders of magnitude slower) • Solution was two parts: • Diagnostics: • Install tools to get a baseline of performance • Helped to identify where effort should be spent first • Regular monitoring: • Regular bandwidth and latency testing • Establish patterns – e.g. is congestion heavily influencing the performance, or is something wrong in the design?
Success Stories – REDDnet – Cont. • Problem Breakdown: • Campus network design • Most facilities featured firewalls where scientific traffic was treated the same as enterprise traffic • Performance tools were able to spot excessive queuing and dropped traffic • A death sentence for large data transfers – explaining the situation to campus administrators cleared this up immediately • Hardware limitations • Local administrators occasionally assigned under-capable unmanaged switches • Performance tools were able to spot buffering bottlenecks immediately • Replacing hardware is the best solution; adjusting settings where applicable will also work
Success Stories – REDDnet – Cont. • Problem Breakdown (cont.): • Unchecked physical infrastructure errors • Latency tools detected a steady stream of loss on a given link • Network staff on a downstream network were contacted to view passive monitoring data (it was not available through perfSONAR) • CRC errors were found on a dirty fiber at the demarcation between the networks • Hardware failure • Bursty loss observed on a given link; tools were able to isolate it to a single device • Observing the device over a 2-day period showed that processing load would spike every couple of minutes • The device had two power units (primary and backup). The secondary unit was not completely plugged into the wall – the power management software was constantly flapping and affecting the routing ability
Success Stories – REDDnet – Cont. • Problem Breakdown (cont.): • Host Tuning: • Storage and compute nodes were not performing as well as the monitoring units predicted • Same old story: TCP settings on the resources were not appropriate for the job at hand (a sketch of a buffer-size check follows this slide) • See also: http://fasterdata.es.net/TCP-tuning/background.html • Epilogue: • Regular monitoring now in place – Nagios configured to give alarms when expected performance drops below a threshold • Data transfers working at expected levels – REDDnet is now prepared for LHC turn-on
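The usual first step in this kind of host tuning is to compare the kernel's maximum TCP socket buffers against the bandwidth-delay product of the path. The sketch below (Linux-only, read-only) illustrates that check; the example path speed and RTT are assumptions, and the fasterdata.es.net page cited above remains the authoritative tuning reference.

```python
#!/usr/bin/env python
# Sketch: compare Linux TCP buffer limits against a path's
# bandwidth-delay product. Path parameters are illustrative.

def read_sysctl(path):
    with open(path) as f:
        return f.read().split()

# Example path: 1 Gbit/s with a 70 ms round-trip time.
bdp_bytes = int((1e9 / 8) * 0.070)   # bandwidth-delay product ~ 8.75 MB

rmem_max = int(read_sysctl("/proc/sys/net/core/rmem_max")[0])
wmem_max = int(read_sysctl("/proc/sys/net/core/wmem_max")[0])
tcp_rmem_max = int(read_sysctl("/proc/sys/net/ipv4/tcp_rmem")[2])
tcp_wmem_max = int(read_sysctl("/proc/sys/net/ipv4/tcp_wmem")[2])

for name, value in [("net.core.rmem_max", rmem_max),
                    ("net.core.wmem_max", wmem_max),
                    ("net.ipv4.tcp_rmem (max)", tcp_rmem_max),
                    ("net.ipv4.tcp_wmem (max)", tcp_wmem_max)]:
    verdict = "ok" if value >= bdp_bytes else "too small for this path"
    print("%-28s %12d bytes  (%s)" % (name, value, verdict))
```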
Future Directions • Further pSPT development • Expanding to other scientific communities • Circuit monitoring • Upcoming event: • NSF CISE (and OCI) sponsoring a workshop, Summer 2010 • Bring together researchers and R&E network operators to broaden perfSONAR deployment and build lasting relationships that serve as conduits between researchers and R&E operators
Future Directions – pSPT Enhancement • Common complaints from users: • GUIs focused around user diagnosis, not specific tool metric display • Need more guidance on what must be vs. what could be configured in the wizard GUIs • Want to be able to install directly to a host (eliminate the CD) • Support for a wider range of 10G hardware (e.g. the development team currently has limited testing access) • Tighter integration of tooling – e.g. coordinate latency testing with bandwidth testing so measurements do not overlap • Integration with Nagios (a sketch of the process-monitoring piece follows this slide): • Process monitoring and alerting • Data monitoring (e.g. expected value drops below a threshold) • Integration into logging infrastructure (syslog-ng) • Integration of circuit (static and dynamic) monitoring
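The process-monitoring half of this Nagios integration boils down to checking that the measurement daemons are still alive. A minimal sketch follows; the daemon names reflect the usual owamp/bwctl services (owampd, bwctld) but should be checked against the local install, and pgrep is assumed to be available on the host.

```python
#!/usr/bin/env python
# Sketch of a Nagios-style process check for the measurement daemons.
# Process names are assumptions about a typical pSPT install.
import subprocess
import sys

DAEMONS = ["owampd", "bwctld"]   # assumed service process names

missing = [d for d in DAEMONS
           if subprocess.call(["pgrep", "-x", d],
                              stdout=subprocess.DEVNULL) != 0]

if missing:
    print("CRITICAL: not running: %s" % ", ".join(missing))
    sys.exit(2)                  # Nagios CRITICAL
print("OK: all measurement daemons running")
sys.exit(0)                      # Nagios OK
```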
Future Directions – pSPT Enhancement • Roadmap for 3.2 series (Late Summer of 2010): • Migration to Red Hat/CentOS Live CD platform • Mirrors software infrastructure of most LHC facilities • Re-design of Wizard interfaces • Nagios/Logging Upgrades • Circuit Monitoring Integration • Testing on a wider variety of hardware
Future Directions – Other VOs • Currently working with other VOs that anticipate performance monitoring needs: • LSST – Telescopes • NEES – Earthquake simulation • Other physics communities (Daya Bay, LIGO) • Overall concept of the pSPT won’t change • Other VOs have different operational requirements and capabilities • Specific aspects of performance may matter more (e.g. stability vs. raw bandwidth)
Future Directions – Circuit Monitoring • Circuit Monitoring is extremely important, both in terms of static links and dynamic circuit networks. • Recent Demonstrations in GLIF focused on operational aspects (difficulty of setup, what can be shown): • Multi-domain circuit monitoring (Fall 2008) • Granularity of circuits, e.g. identifying domain specific components via information services (Fall 2009) • Recent work in the OGF: • Standardizing the methods used to name and locate circuits and segments • Push to define dynamic circuit architecture and protocols
Future Directions – Circuit Monitoring • Desirable goals: • Define a succinct system that meets the needs of both static links and dynamic circuits • Methods to share information with related systems, e.g. circuit identification must be tied to monitoring status and performance • The perfSONAR-PS consortium is looking at these problems currently: • Desire to integrate into dynamic circuit (e.g. IDC protocol) operations • Desire to distribute this functionality in a future release of the pSPT
Conclusion • Different ways to approach monitoring a loosely coupled VO • Approach is dictated by resources • Scalability is a large factor • Not only of the software/hardware architecture but of the human resources available for installation, testing, and maintenance • Development will address the needs of the community and advance the usefulness of the tools • Accepting feedback to make the product better • Offering limited but clearly stated support • Educating the users is a priority – much of the confidence in a tool comes from comfort with it • Questions?
perfSONAR use at US LHC Facilities March 9th 2010, LHCOPN Eric Boyd, Deputy Technology Officer For more information, visit http://psps.perfsonar.net/toolkit