Monitoring and operational management in USLHCNet Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa CHEP09 - March 2009 Prague
Outline
• MonALISA Framework
  • Architecture
  • Data handling
  • Automatic actions
• USLHCNet
  • Network topology
  • Monitoring modules
  • Reliable monitoring & accounting
  • Alarms & triggers
• Conclusions
The MonALISA Architecture
• HL services: regional or global high-level services, repositories & clients
• Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
• Agents / MonALISA services: distributed system for gathering and analyzing information based on mobile agents (customized aggregation, triggers, actions)
• Network of JINI Lookup Services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
Fully distributed system with no single point of failure
MonALISA Service & Data Handling
[Diagram labels]
• Monitoring Modules: collect any type of information; push and pull
• Data Cache Service & DB, with a Postgres data store
• Agents, Filters / Triggers
• Dynamic (re)loading
• Lookup Services: registration and discovery
• Web Service (WSDL, SOAP) for WS clients and services
• Data (via ML Proxy): predicates & agents, for clients or higher-level services
• Configuration control (SSL)
• Applications
Local and Global Decision Framework
• Two levels of decisions: local (autonomous) and global (correlations).
• Actions triggered by:
  • values above/below given thresholds
  • absence/presence of values
  • correlations between any values
• Action types:
  • alerts (emails, instant messages, Atom feeds)
  • running an external command
  • automatic chart annotations in the repository
  • running custom code, such as securely ordering an ML service to (re)start a site service
[Diagram] Local decisions: sensors (temperature, humidity, A/C power, …) feed an ML service, which takes actions based on local information. Global decisions: ML services (traffic, jobs, hosts, apps) feed global ML services, which take actions based on global information.
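The threshold and absence conditions listed on this slide can be illustrated with a minimal Python sketch. It is not MonALISA code: the class `ThresholdTrigger`, its fields and the returned reason strings are hypothetical names chosen for illustration, and the real framework's trigger API is not described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdTrigger:
    """Hypothetical trigger: fires on out-of-range or missing values."""
    low: Optional[float] = None          # fire if value < low
    high: Optional[float] = None         # fire if value > high
    max_silence: Optional[float] = None  # fire if no value for this many seconds

    def evaluate(self, value: Optional[float], last_seen: float,
                 now: float) -> Optional[str]:
        """Return a reason string if an action should fire, else None."""
        if value is None:
            # Absence of values: only a problem after max_silence seconds.
            if self.max_silence is not None and now - last_seen > self.max_silence:
                return "absent"
            return None
        if self.high is not None and value > self.high:
            return "above"
        if self.low is not None and value < self.low:
            return "below"
        return None


# Example: a temperature sensor with an upper limit and a silence timeout.
temperature = ThresholdTrigger(high=30.0, max_silence=120.0)
reason = temperature.evaluate(35.2, last_seen=1000.0, now=1000.0)
```

A returned reason would then be mapped to one of the action types above (an alert, an external command, and so on); correlations between several values would need a trigger that sees more than one parameter at a time.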
Monitoring architecture in ALICE
[Diagram] ApMon instances embedded in the AliEn components (CE, Cluster Monitor, IS, Optimizers, Brokers, TQ, SE, Job Agents, MySQL Servers, CastorGrid Scripts, API Services) report to MonALISA services (MonALISA @Site, MonALISA LCG Site, MonALISA @CERN). Aggregated data goes to the MonaLisa Repository (long history DB), which drives alerts and actions. LCG Tools also appear as a data source.
Monitored parameters shown: job slots, net in/out, run time, cpu time, free space, processes, load, jobs status, vsz, rss, sockets, migrated mbytes, active sessions, nr. of files, open files, queued JobAgents, job status, cpu ksi2k, disk used, MyProxy status.
See Costin Grigoras’ poster (067): Automated agents for management and control of the ALICE Computing Grid
USLHCNet
• USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia.
• Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
• The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits offer the functionality needed to develop efficient data transfer services with support for QoS and priorities.
• Hybrid network: uses both Ciena CD and Force10 routers
• 4 transatlantic 10G links at the moment (6 links in the second part of this year)*
* See Harvey Newman's talk [502] from Monday: “Status and outlook of the HEP network”
USLHCnet ML weather map
Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
• Force10 (SNMP & sFlow)
  • Traffic per interface
  • sFlow traffic
  • Link status monitoring
• Ciena Core Director (TL1: Transaction Language 1)
  • ETTP (Ethernet Termination Point) traffic
  • EFLOW (Ethernet Flow) traffic
  • OSRP (routing protocol) topology
  • Dynamic circuits inside the optical core of the network
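Per-interface traffic from SNMP is normally derived from two successive readings of a monotonically increasing octet counter. The Python sketch below shows that arithmetic only; it is an illustration, not the USLHCNet module. The function name `counter_rate_bps` is hypothetical, and the assumption that the modules poll octet counters (for example the standard IF-MIB `ifHCInOctets` object) is ours, since the slides do not name the polled objects.

```python
def counter_rate_bps(prev_octets: int, curr_octets: int,
                     interval_s: float, counter_bits: int = 64) -> float:
    """Average bit rate between two octet-counter readings.

    The modulo handles a single counter wrap between the two polls;
    more than one wrap in the interval cannot be detected this way.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


# Example: 1000 octets transferred in 10 seconds.
rate = counter_rate_bps(prev_octets=1000, curr_octets=2000, interval_s=10)
```

With 32-bit counters a 10G link can wrap in a few seconds, which is why 64-bit counters are the default assumed here.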
USLHCnet monitoring
[Diagram: MonALISA services at GVA, AMS, NYC and CHI monitor the network devices via SNMP and TL1]
USLHCnet redundant monitoring
Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.
[Diagram: MonALISA services at GVA, AMS, NYC and CHI]
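One way to picture the aggregation of redundant measurements is to group the samples reported by different services into time bins and keep a single value per bin. The Python sketch below does this with a median. The slides do not state which rule the repository filters apply, so the median, the 60-second bin, the function name `merge_redundant` and the sample layout are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, float, float]  # (service name, timestamp in s, value)


def merge_redundant(samples: Iterable[Sample],
                    bin_s: int = 60) -> Dict[int, float]:
    """Merge samples of one parameter reported by several services.

    Returns {bin start time: median of the values seen in that bin}.
    A bin still gets a value when only one service reported, which is
    the point of monitoring each circuit redundantly.
    """
    bins = defaultdict(list)
    for _service, timestamp, value in samples:
        bins[int(timestamp // bin_s) * bin_s].append(value)
    return {start: median(values) for start, values in sorted(bins.items())}


# Example: two services report in the first minute, one in the second.
merged = merge_redundant([
    ("GVA", 0, 10.0),
    ("AMS", 5, 12.0),
    ("GVA", 60, 20.0),
])
```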
Local and global filters
• Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
• The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
• The link status is computed as a logical “AND” between both end points of a link. This also cross-checks the status reported by the hardware equipment.
• We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
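The link-status rule above (a logical AND of both end points, with a cross-check between them) can be written down in a few lines of Python. This is a sketch under stated assumptions: the function name `aggregate_link_status`, the input layout and the link name `GVA-NYC` are hypothetical and do not come from the USLHCNet filters.

```python
from typing import Dict, Tuple


def aggregate_link_status(
    reports: Dict[str, Tuple[bool, bool]]
) -> Dict[str, Tuple[bool, bool]]:
    """Combine the status seen at both ends of each link.

    reports maps a link name to (end A up, end B up).
    The result maps the link name to (link up, ends agree):
    the link counts as up only if both ends report it up, and a
    disagreement between the ends flags a reading worth checking.
    """
    return {
        link: (end_a and end_b, end_a == end_b)
        for link, (end_a, end_b) in reports.items()
    }


# Example: one healthy link and one where the two ends disagree.
status = aggregate_link_status({
    "GVA-NYC": (True, True),
    "AMS-CHI": (True, False),
})
```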
USLHCnet: Precise measurements for the Operational Status on the WAN Link
• Operations & management assisted by agent-based software
• Applied to the new Ciena equipment used for network management
USLHCnet: Traffic on different segments
USLHCnet: Accounting for Integrated Traffic
USLHCnet: Ciena alarms monitoring
The Need for Planning and Scheduling for Large Data Transfers
[Figure: two reading tasks performed in parallel vs. sequentially]
It is 2.5x faster to perform the two reading tasks sequentially.
Monitoring Optical Switches
• Dynamic restoration of the lightpath if a segment has problems
Controlling Optical Planes: Automatic Path Recovery
[Diagram: sites and networks shown are CERN Geneva, USLHCnet, Internet2, Starlight, CALTECH Pasadena and Manlan]
• FDT transfer: 200+ MBytes/sec from a 1U node
• “Fiber cut” simulations: the traffic moves from one transatlantic line to the other one
• The FDT transfer (CERN – CALTECH) continues uninterrupted
• TCP fully recovers in ~20 s
[Plot: 4 fiber cut emulations, labeled 1 to 4]
For more details, see Iosif Legrand’s poster (054): A High Performance Data Transfer Service
Conclusions
• The MonALISA framework provides a flexible and reliable monitoring infrastructure
  • 350+ installed services, 1.5M+ unique parameters, 25 kHz value updates
  • Truly distributed architecture with no single point of failure
  • Highly modular platform
  • Automatic decision-making capability at both local and global levels
• USLHCNet provides a state-of-the-art hybrid network with support for circuit-oriented network services
• Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime
• We are investigating dynamic provisioning of circuits from collaborating agents
http://monalisa.caltech.edu
http://repository.uslhcnet.org