perfSONAR use at US LHC Facilities • March 9th 2010, LHCOPN • Eric Boyd, Deputy Technology Officer
Outline • Problem Statement: Distributed monitoring of US based LHC sites. • Our Solution: pS Performance Toolkit • User Expectations and Use Cases • Challenges • Supporting the infrastructure • Specifics • Operational recommendations • Measurement Best Common Practices • Success Stories • Future Directions
Problem Statement • US based LHC sites (e.g. US ATLAS Tier2s) wanted a monitoring solution to ensure peak performance between each other and the Tier 1 (BNL). • Connectivity varies by location (e.g. ESnet, Internet2, NLR, Ultralight, RENs), and campus infrastructure must also be considered • Simple requirements: • ‘Available Bandwidth’ testing – scheduled and on demand • ‘Latency’ testing – scheduled and on demand • ‘Passive’ monitoring (e.g. SNMP) • Installation, configuration, and maintenance should be minimal • Homogeneous hardware and software – eliminate systematic errors in the results by keeping the platform the same • Solution should be nimble enough to adapt to other facilities (e.g. Tier 3s, US CMS, integration with MDM)
Possible Solutions – ‘Composite’ Approach • Recommend a suite of tools to install, provide guides/workshops • Would require dedicated hardware – configuration and management would be up to local sites • Could recommend a platform • Could rely on local staff to choose a machine that meets criteria • Edge Solution vs. Network Core • Exchange Points, Backbone, Regional, and Campus networks are working to make perfSONAR data available within the network core • Edge facilities, e.g. where the science data lives and is processed, can get involved in the same manner
Possible Solutions – ‘Composite’ Approach • Positive Features: • Relative cost is low – existing hardware can be used to run most measurement tools • Integration with existing tools on other networks • Drawbacks: • Labor intensive for local staff to maintain system • Hardware may differ from site to site • Development team may spend a lot of time in a ‘support’ role (e.g. getting the tools installed and configured).
Possible Solutions – ‘Appliance’ Approach • Prepare the entire environment for the target use • Uniform system makeup/hardware/versions • Pre-installed and pre-configured • Centrally Managed Solution: • Central facility monitors the health and updates of the framework • Support available for software and hardware • Locally Managed Solution • Local institutions monitor daily activities • Software support (e.g. configuration, bug and security patches) is available • Regular updates anticipated to address bugs/enhancements • Edge Solution vs. Network Core • As in the ‘Composite’ approach, the edges can still participate at a protocol level with all perfSONAR products
Possible Solutions – ‘Appliance’ Approach • Positive Features: • Integration with existing tools on other networks • Homogeneous software and hardware • Easy maintenance and upgrade path • Drawbacks: • Costs associated with management • Hardware support • Potentially contracts to manage the software functionality and operation
Solution – pS Performance Toolkit (pSPT) • In short, this is a Locally Managed Appliance • The pSPT is a bootable CD • Contains all necessary software in a single package • Wizard interface to configure aspects of the system • Upgrade path is simple: burn a new CD and reboot! • Hardware is similar at all US ATLAS locations; each site has two: • ‘KOI’ 1U Server • Pentium 2.2 GHz, Dual Core • 2 GB RAM • 160 GB Hard Drive • Daily operations (e.g. system maintenance and monitoring) done by the local facility • Software support (e.g. updates, interim bug fixes, mailing list for questions) provided by the development team
Solution – User Expectations • Installation • Must be ‘easy’. Given the variability in what is easy for a sysadmin vs. a physicist vs. an administrator, we aimed very low • Burning a CD and rebooting a machine is as simple as it gets • Configuration • Also must be ‘easy’. Step-by-step instructions guide the user through the process of configuring the system and tests • Ability for power users to skip the guided approach • Status and feedback on the process • Operation • System should work without human intervention • Reboots or system halts should not result in a loss of data or configuration; the system should resume operation when back up and running • System should alert when in distress
Solution – User Expectations – cont. • Maintenance • Once again: ‘easy’. Most maintenance tasks, e.g. checking the disk and software, can be automated • Integration into alert systems (Nagios) is in progress (a sketch of such a check follows this slide) • Upgrades should be the same as installation • Data Use • Way to access collected data – either through GUIs or web services • Easy interpretation of results, e.g. make sure all the measurements we are doing are actually useful (!) • Support • Security patches and bug fixes must be made available in a timely manner • Method must exist to ask questions on installation, configuration, and upkeep • Community (e.g. US ATLAS) can self-support along with help from the development team over time
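As an aside, the kind of alert integration mentioned above usually ends up as a small Nagios-style plugin that compares the latest measurement against thresholds. The sketch below is illustrative only: how the value is fetched from the toolkit's measurement archive is left abstract, and the threshold numbers are placeholders, not recommendations.

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios-style threshold check for measured throughput.
# The measured value would come from the pSPT measurement archive in a real
# deployment; here it is passed in directly so the sketch is runnable.
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios exit codes

def check_throughput(measured_mbps, warn_mbps=500.0, crit_mbps=200.0):
    """Return a (status, message) pair for the most recent measurement."""
    if measured_mbps is None:
        return UNKNOWN, "UNKNOWN: no recent measurement available"
    if measured_mbps < crit_mbps:
        return CRITICAL, "CRITICAL: throughput %.1f Mbps" % measured_mbps
    if measured_mbps < warn_mbps:
        return WARNING, "WARNING: throughput %.1f Mbps" % measured_mbps
    return OK, "OK: throughput %.1f Mbps" % measured_mbps

if __name__ == "__main__":
    # Placeholder value; a real plugin would query the archive here.
    status, message = check_throughput(measured_mbps=350.0)
    print(message)
    sys.exit(status)
```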
Solution – Use Cases • US ATLAS (Scientific VO) Use Case: • 2 servers per facility • Bandwidth testing • Latency testing (sensitive – isolated from other measurements) • Configuration is a one-time (initial) step • Configure system information (network settings, location) • GUIs to guide regular test set up • State and measurement data saved on local disk • Maintenance consists of examining data and upgrading the CD when required • Testing is designed to occur without intervention • Data consumption • On-board GUIs to visualize the results • Built on the perfSONAR platform – data can be easily shared (and accessed from other locations) to construct new GUIs
Solution – Use Cases – cont. • Regional/Campus Network Use Case: • Simple deployment within the core and edges of the network • Integrates into perfSONAR deployment on a backbone or exchange point • General Diagnostic Use Case: • Instant availability of a testing point anywhere in the network • Will not harm the operating system of a non-dedicated resource (e.g. one-time use) • Remote Facility Use Case: • Non-technical staff can easily deploy for diagnostic purposes • Interval and magnitude of testing can be adjusted to account for network availability
Solution – Supporting the Infrastructure • perfSONAR-PS Development Team • ESnet, Fermilab, Indiana University, Internet2, SLAC, University of Delaware • Collaboration with perfSONAR-MDM to ensure protocol compatibility of all software and services • US ATLAS Support: • Regular release schedule (~4 per year), on demand releases if something goes wrong • Alerts on vulnerabilities, patches made in a timely manner • Feedback mechanism for bug reports and enhancements • Mailing list for discussion • Maintained by developers • Encourages building a community to answer questions and solve non-software related problems
Solution Details – Operations • Installation of 2 hosts per facility (Tier 2s) • Latency and bandwidth hosts • Position near the ‘edge’ of the facility • Optional: add another host near the storage/compute nodes • Installation of 1 ‘large’ host (Tier 3s, RENs) • Position near the edge • Can run both bandwidth and latency tests, but results may be tainted • ‘Edge’ institutional deployment options: • Border is good for testing connectivity to the outside world and path decomposition for problem diagnosis • Co-located with compute/storage is good for testing what the application will see (e.g. whether traffic passes through a firewall)
Solution Details – Measurement BCP • Bandwidth • TCP tests every 4 hours, 20 seconds in length. Test to all Tier2s and the Tier1 • Latency • One Way – constant stream of 10 packets per second. Test to and from all Tier2s and the corresponding Tier1 • Round Trip – 10 packets every 5 minutes. Test to all Tier2s and the Tier1 • Passive Monitoring • No official stance – interest in making border router data available as well as any links of interest • Currently do not use circuit status monitoring (e.g. E2Emon) • (A command-line sketch of these test parameters follows this slide)
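For readers unfamiliar with the underlying tools, the sketch below shows roughly what these scheduled tests amount to when driven by hand. It assumes bwctl, owping, and ping are installed and that the flags shown match the local man pages (worth verifying); host names are placeholders, not real US ATLAS hosts.

```python
#!/usr/bin/env python
# Rough sketch of the measurement BCP above, driven from Python.
# Destinations and flags are illustrative assumptions, not a pSPT config.
import subprocess

TIER1 = "tier1.example.gov"           # placeholder Tier1 destination
TIER2S = ["tier2-a.example.edu",      # placeholder Tier2 peers
          "tier2-b.example.edu"]

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.call(cmd)

for dest in TIER2S + [TIER1]:
    # Bandwidth BCP: a 20-second TCP test (scheduled every 4 hours).
    run(["bwctl", "-c", dest, "-t", "20"])
    # Round-trip latency BCP: 10 packets (scheduled every 5 minutes).
    run(["ping", "-c", "10", dest])
    # One-way latency BCP: ~10 packets/second as a continuous stream,
    # shown here as a one-minute sample (600 packets, 0.1 s spacing).
    run(["owping", "-c", "600", "-i", "0.1", dest])
```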
Success Stories – Ultralight • University of Michigan to BNL • Poor performance in a single direction • Traversed 5 networks (UofM, Ultralight, Internet2, ESnet, BNL) • perfSONAR available on all parts of the path (e.g. demarcation points) – diagnosis was simplified because the infrastructure was already in place • Process (a toy sketch of this localization step follows this slide): • Test from BNL to each intermediate point • Test from UofM to each intermediate point • Isolated the problem to a single section of the Ultralight network • Once we knew where to look, we had to figure out what to do: • Physical infrastructure – no damage found; cleaning performed for good measure • Hardware – line cards properly seated, no errors found • Software – router operating systems up to date? Any unchecked alarms?
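The localization step is essentially divide-and-conquer over the demarcation points: measure from one end to each intermediate point and look for the first segment where throughput collapses. The toy sketch below illustrates that logic only; the hop names and numbers are made up and do not reproduce the actual Ultralight measurements.

```python
# Toy sketch of the divide-and-conquer localization described above.
# Input: throughput measured from the source to each demarcation point.
# Output: the first segment where performance drops off sharply.
# All hop names and values are illustrative placeholders.
hops = [("network A border", 920.0),   # Mbps to each successive point
        ("network B edge",   915.0),
        ("network B core",    40.0),   # performance collapses here
        ("network C peering",  38.0),
        ("destination border", 37.0)]

prev_name, prev_mbps = "source", None
for name, mbps in hops:
    if prev_mbps is not None and mbps < 0.5 * prev_mbps:
        print("suspect segment: %s -> %s (%.0f -> %.0f Mbps)"
              % (prev_name, name, prev_mbps, mbps))
        break
    prev_name, prev_mbps = name, mbps
```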
Success Stories – Ultralight – cont. • Soft Failure: • A fault or situation that doesn’t cause loss of connectivity, but will impact performance • May go unnoticed for long periods of time • May impact a select set of users • The Ultralight switch was flooded with a global routing table from a peer – this tripped an unchecked warning flag in the configuration • Buffer sizes were limited, even though the switch was configured to make them large • Performance tools (NDT) noticed the discrepancy • The fix was to upgrade the software and reboot • Publication: http://www.internet2.edu/performance/200904-CS-UL.pdf
Success Stories – REDDnet • REDDnet (e.g. the US CMS Tier3 at Vanderbilt University) • Distributed data storage – equipment co-located at LHC schools and positioned near other resources (compute resources, core infrastructure) • Observed that between many facilities, data transfer activities were taking much longer than expected (orders of magnitude slower) • Solution was two parts: • Diagnostics: • Install tools to get a baseline of performance • Helped to identify where effort should be spent first • Regular monitoring: • Regular bandwidth and latency testing • Establish patterns – e.g. is congestion heavily influencing the performance, or is something wrong in the design?
Success Stories – REDDnet – Cont. • Problem Breakdown: • Campus network design • Most facilities featured firewalls where scientific traffic was treated the same as enterprise traffic • Performance tools were able to spot excessive queuing and dropped traffic • A death sentence for large data transfers – explaining the situation to campus administrators cleared this up immediately • Hardware limitations • Local administrators occasionally assigned under-capable unmanaged switches • Performance tools were able to spot buffering bottlenecks immediately • Replacing hardware is the best solution; adjusting settings where applicable will also work
Success Stories – REDDnet – Cont. • Problem Breakdown (cont.): • Unchecked physical infrastructure errors • Latency tools detected a steady stream of loss on a given link • Network staff on a downstream network were contacted to view passive monitoring data (it was not available through perfSONAR) • CRC errors were found on a dirty fiber at the demarcation between the networks • Hardware failure • Bursty loss observed on a given link; tools were able to isolate it to a single device • Observing the device over a 2-day period showed that processing load would spike every couple of minutes • The device had two power units (primary and backup). The secondary unit was not completely plugged into the wall – the power management software was constantly flapping and affecting the routing ability
Success Stories – REDDnet – Cont. • Problem Breakdown (cont.): • Host Tuning: • Storage and compute nodes were not performing as well as the monitoring units predicted • Same old story: TCP settings on the resources were not appropriate for the job at hand (a sketch of a buffer-size check follows this slide) • See also: http://fasterdata.es.net/TCP-tuning/background.html • Epilogue: • Regular monitoring now in place – Nagios configured to give alarms when expected performance drops below a threshold • Data transfers working at expected levels – REDDnet is now prepared for LHC turn-on
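The usual first step in this kind of host tuning is to compare the kernel's maximum TCP socket buffers against the bandwidth-delay product of the path. The sketch below (Linux-only, read-only) illustrates that check; the example path speed and RTT are assumptions, and the fasterdata.es.net page cited above remains the authoritative tuning reference.

```python
#!/usr/bin/env python
# Sketch: compare Linux TCP buffer limits against a path's
# bandwidth-delay product. Path parameters are illustrative.

def read_sysctl(path):
    with open(path) as f:
        return f.read().split()

# Example path: 1 Gbit/s with a 70 ms round-trip time.
bdp_bytes = int((1e9 / 8) * 0.070)   # bandwidth-delay product ~ 8.75 MB

rmem_max = int(read_sysctl("/proc/sys/net/core/rmem_max")[0])
wmem_max = int(read_sysctl("/proc/sys/net/core/wmem_max")[0])
tcp_rmem_max = int(read_sysctl("/proc/sys/net/ipv4/tcp_rmem")[2])
tcp_wmem_max = int(read_sysctl("/proc/sys/net/ipv4/tcp_wmem")[2])

for name, value in [("net.core.rmem_max", rmem_max),
                    ("net.core.wmem_max", wmem_max),
                    ("net.ipv4.tcp_rmem (max)", tcp_rmem_max),
                    ("net.ipv4.tcp_wmem (max)", tcp_wmem_max)]:
    verdict = "ok" if value >= bdp_bytes else "too small for this path"
    print("%-28s %12d bytes  (%s)" % (name, value, verdict))
```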
Future Directions • Further pSPT development • Expanding to other scientific communities • Circuit monitoring • Upcoming event: • NSF CISE (and OCI) sponsoring a workshop, Summer 2010 • Bring together researchers and R&E network operators to broaden perfSONAR deployment and build lasting relationships that serve as conduits between researchers and R&E operators
Future Directions – pSPT Enhancement • Common complaints from users: • GUIs focused around user diagnosis, not specific tool metric display • Need more guidance on what must be vs. what could be configured in the wizard GUIs • Want to be able to install directly to a host (eliminate the CD) • Support for a wider range of 10G hardware (e.g. the development team currently has limited testing access) • Tighter integration of tooling – e.g. coordinate latency testing with bandwidth testing so measurements do not overlap • Integration with Nagios (a sketch of the process-monitoring piece follows this slide): • Process monitoring and alerting • Data monitoring (e.g. expected value drops below a threshold) • Integration into logging infrastructure (syslog-ng) • Integration of circuit (static and dynamic) monitoring
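The process-monitoring half of this Nagios integration boils down to checking that the measurement daemons are still alive. A minimal sketch follows; the daemon names reflect the usual owamp/bwctl services (owampd, bwctld) but should be checked against the local install, and pgrep is assumed to be available on the host.

```python
#!/usr/bin/env python
# Sketch of a Nagios-style process check for the measurement daemons.
# Process names are assumptions about a typical pSPT install.
import subprocess
import sys

DAEMONS = ["owampd", "bwctld"]   # assumed service process names

missing = [d for d in DAEMONS
           if subprocess.call(["pgrep", "-x", d],
                              stdout=subprocess.DEVNULL) != 0]

if missing:
    print("CRITICAL: not running: %s" % ", ".join(missing))
    sys.exit(2)                  # Nagios CRITICAL
print("OK: all measurement daemons running")
sys.exit(0)                      # Nagios OK
```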
Future Directions – pSPT Enhancement • Roadmap for 3.2 series (Late Summer of 2010): • Migration to Red Hat/CentOS Live CD platform • Mirrors software infrastructure of most LHC facilities • Re-design of Wizard interfaces • Nagios/Logging Upgrades • Circuit Monitoring Integration • Testing on a wider variety of hardware
Future Directions – Other VOs • Currently working with other VOs that anticipate performance monitoring needs: • LSST – Telescopes • NEES – Earthquake simulation • Other physics communities (Daya Bay, LIGO) • Overall concept of the pSPT won’t change • Other VOs have different operational requirements and capabilities • Specific aspects of performance may matter more (e.g. stability vs. raw bandwidth)
Future Directions – Circuit Monitoring • Circuit Monitoring is extremely important, both in terms of static links and dynamic circuit networks. • Recent Demonstrations in GLIF focused on operational aspects (difficulty of setup, what can be shown): • Multi-domain circuit monitoring (Fall 2008) • Granularity of circuits, e.g. identifying domain specific components via information services (Fall 2009) • Recent work in the OGF: • Standardizing the methods used to name and locate circuits and segments • Push to define dynamic circuit architecture and protocols
Future Directions – Circuit Monitoring • Desirable goals: • Define a succinct system that meets the needs of both static links and dynamic circuits • Methods to share information with related systems, e.g. circuit identification must be tied to monitoring status and performance • The perfSONAR-PS consortium is looking at these problems currently: • Desire to integrate into dynamic circuit (e.g. IDC protocol) operations • Desire to distribute this functionality in a future release of the pSPT
Conclusion • Different ways to approach monitoring a loosely coupled VO • Approach is dictated by resources • Scalability is a large factor • Not only of the software/hardware architecture but of the human resources available for installation, testing, and maintenance • Development will address the needs of the community and advance the usefulness of the tools • Accepting feedback to make the product better • Offering limited but clearly stated support • Educating the users is a priority – much of the confidence in a tool comes from comfort with it • Questions?
perfSONAR use at US LHC Facilities March 9th 2010, LHCOPN Eric Boyd, Deputy Technology Officer For more information, visit http://psps.perfsonar.net/toolkit