MonALISA capabilities for the LHCOPN
MonALISA Team: Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan
USLHCNet Team: Harvey Newman, Artur Barczyk, Ramiro Voicu, Azher Mughal, Sandor Rozsa
LHCOPN meeting, March 2010, London
Outline
• MonALISA Framework
  • Architecture
  • Data handling
  • Automatic actions
• USLHCNet
  • Network topology
  • Monitoring modules
  • Reliable monitoring & accounting
  • Alarms & triggers
• Conclusions
Ramiro Voicu LHCOPN London March 2010
The MonALISA Architecture
• HL services: regional or global high-level services, repositories & clients
• Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
• MonALISA services (agents): distributed system for gathering and analyzing information based on mobile agents; customized aggregation, triggers, actions
• Network of JINI lookup services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
• Fully distributed system with no single point of failure
MonALISA Service & Data Handling
[Diagram labels: Monitoring modules (push and pull, dynamic (re)loading) collect any type of information from applications; agents and filters/triggers; data cache service & DB with a Postgres data store; lookup services for registration and discovery; web service (WSDL, SOAP) for WS clients and services; data (via ML Proxy) delivered through predicates & agents to clients or higher-level services; configuration control (SSL).]
Local and Global Decision Framework
• Two levels of decisions: local (autonomous) and global (correlations).
• Actions are triggered by: values above/below given thresholds, absence/presence of values, correlations between any values.
• Action types: alerts (emails, instant messages, Atom feeds), running an external command, automatic chart annotations in the repository, running custom code such as securely ordering an ML service to (re)start a site service.
[Diagram: ML services take local decisions, with actions based on local information from sensors (temperature, humidity, A/C power, …); global ML services take global decisions, with actions based on global information (traffic, jobs, hosts, apps).]
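The two simplest trigger conditions listed on this slide (a value crossing a threshold, and the absence of values) can be illustrated with a small sketch. This is not the MonALISA actions API: the `Trigger` class, its field names and the sample numbers are all hypothetical, and the "action" is simply a callable that receives a message.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trigger:
    """Hypothetical local trigger for one monitored parameter."""
    name: str
    threshold: float                 # fire when a value exceeds this
    max_silence_s: float             # fire when no value arrives for this long
    action: Callable[[str], None]    # e.g. send an email or run a command
    last_seen: Optional[float] = None

    def on_value(self, value: float, now: float) -> None:
        """Record a new sample and fire if it is above the threshold."""
        self.last_seen = now
        if value > self.threshold:
            self.action(f"{self.name}: {value} above threshold {self.threshold}")

    def check_absence(self, now: float) -> None:
        """Fire if the parameter has never been seen or has gone silent."""
        if self.last_seen is None or now - self.last_seen > self.max_silence_s:
            self.action(f"{self.name}: no value for more than {self.max_silence_s} s")


alerts = []  # stands in for an email/IM alert channel
temperature = Trigger("temperature", threshold=30.0, max_silence_s=60.0,
                      action=alerts.append)
temperature.on_value(25.0, now=0.0)     # below threshold, no alert
temperature.on_value(35.0, now=10.0)    # above threshold, first alert
temperature.check_absence(now=30.0)     # last sample 20 s ago, no alert
temperature.check_absence(now=100.0)    # silent for 90 s, second alert
```

A global decision would follow the same pattern, except that the condition would correlate values received from several services rather than one local parameter.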
USLHCNet
• USLHCNet provides transatlantic connections between the Tier1 computing facilities at Fermilab and Brookhaven and the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia.
• Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
• The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits offer the functionality needed to develop efficient data transfer services with support for QoS and priorities.
• Hybrid network: uses both Ciena Core Director devices and Force10 routers
• 6 transatlantic 10G links at the moment
USLHCNet ML weather map
Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
• Force10 (SNMP & sFlow)
  • Traffic per interface
  • sFlow traffic
  • Link status monitoring
• Ciena Core Director (TL1: Transaction Language 1)
  • ETTP (Ethernet Termination Point) traffic
  • EFLOW (Ethernet Flow) traffic
  • OSRP (routing protocol) topology
  • VCG provisioned / available bandwidth
  • Dynamic circuits inside the optical core of the network
• Ping module / MLPing trigger, which sends alarms in case of packet loss
USLHCNet monitoring
[Diagram: MonALISA services at GVA, AMS, NYC and CHI monitor the network devices via SNMP and TL1.]
USLHCNet redundant monitoring
• Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.
[Diagram: MonALISA services at GVA, AMS, NYC and CHI.]
Local and global filters
• Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
• The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
• The link status is computed as a logical “AND” of the status at both end points of a link. This also cross-checks the status reported by the hardware equipment.
• We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
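The link-status rule on this slide can be sketched as follows. The logical AND between the two end points comes from the slide; everything else is an assumption made for illustration: the `Report` record, the service and endpoint names, and in particular the choice to resolve redundant reports for one end point by taking the most recent sample, which the slides do not specify.

```python
from typing import Iterable, NamedTuple, Optional


class Report(NamedTuple):
    """Hypothetical link-status sample sent by one monitoring service."""
    service: str      # which MonALISA service sent the sample
    endpoint: str     # which end of the link it describes
    timestamp: float  # seconds, any common clock
    up: bool          # status seen at that end point


def endpoint_status(reports: Iterable[Report], endpoint: str) -> Optional[bool]:
    """Most recent status reported for one end point (assumed merge policy)."""
    samples = [r for r in reports if r.endpoint == endpoint]
    if not samples:
        return None  # nobody reported on this end point
    return max(samples, key=lambda r: r.timestamp).up


def link_status(reports: Iterable[Report], end_a: str, end_b: str) -> Optional[bool]:
    """Link is up only if both end points are up (logical AND)."""
    reports = list(reports)
    a = endpoint_status(reports, end_a)
    b = endpoint_status(reports, end_b)
    if a is None or b is None:
        return None  # cannot decide without data from both ends
    return a and b


reports = [
    Report("ML-GVA", "GVA", 100.0, True),
    Report("ML-AMS", "GVA", 105.0, True),
    Report("ML-NYC", "NYC", 101.0, True),
    Report("ML-CHI", "NYC", 106.0, False),  # newest sample says NYC end is down
]
status = link_status(reports, "GVA", "NYC")
```

Returning `None` when one end has no data keeps "unknown" distinct from "down", so a failure of the monitoring itself is not reported as a link failure.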
USLHCNet: Precise measurements for the Operational Status on the WAN Link
• Operations & management assisted by agent-based software
• Deployed on the new Ciena equipment used for network management
USLHCNet: All EFLOW traffic, last 2 months
USLHCNet: Accounting for Integrated Traffic
USLHCNet: Ciena alarms monitoring
Real Time Topology Discovery & Display
• Topology monitoring and discovery
[Figure labels: networks, routers, AS.]
Storage discovery in ALICE
• distance(IP, IP)
  • Same IP-class network
  • Common domain name
  • Same AS
  • Same country (+ function of RTT between the respective AS-es if known)
  • If distance between the AS-es is known, use it
  • Same continent
  • Far away
• distance(IP, Set<IP>): client's public IP to all known IPs for the storage
[Figure labels: France, Nordic Countries, Italy, Russia, USA.]
C. Grigoras (ALICE), ACAT 2010
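The ordered criteria above can be read as a ranking function, sketched below. Only the order of the checks comes from the slide. The `Host` record, the numeric ranks, the use of a /24 prefix for "same IP-class network", the normalization of inter-AS distances to values below 1, and taking the minimum for `distance(IP, Set<IP>)` are all assumptions made for illustration, not the ALICE implementation.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical description of a client or storage address."""
    ip: str
    domain: str
    asn: int
    country: str
    continent: str


def same_network(a: str, b: str, prefix: int = 24) -> bool:
    """True if both addresses fall in the same /prefix network (assumed /24)."""
    return ip_address(b) in ip_network(f"{a}/{prefix}", strict=False)


def distance(a: Host, b: Host,
             as_distance: Optional[Dict[FrozenSet[int], float]] = None) -> float:
    """Smaller is closer; checks run in the order listed on the slide."""
    if same_network(a.ip, b.ip):
        return 0.0
    if a.domain == b.domain:
        return 1.0
    if a.asn == b.asn:
        return 2.0
    # Optional inter-AS distance (e.g. derived from RTT), assumed to be in [0, 1).
    known = as_distance.get(frozenset((a.asn, b.asn))) if as_distance else None
    if a.country == b.country:
        return 3.0 + (known if known is not None else 0.5)
    if known is not None:
        return 4.0 + known
    if a.continent == b.continent:
        return 5.0
    return 6.0  # far away


def distance_to_set(client: Host, storage_ips: Iterable[Host]) -> float:
    """Distance from the client to a storage known by several addresses."""
    return min(distance(client, s) for s in storage_ips)


client = Host("192.0.2.10", "example.org", 64500, "CH", "EU")
same_lan = Host("192.0.2.77", "other.net", 64501, "FR", "EU")
same_domain = Host("198.51.100.5", "example.org", 64502, "US", "NA")
same_country = Host("203.0.113.9", "x.net", 64503, "CH", "EU")
same_continent = Host("203.0.113.9", "y.net", 64504, "DE", "EU")
far = Host("203.0.113.9", "z.net", 64505, "JP", "AS")
```

Sorting the known storage elements by this value would give a client a list of replicas ordered from nearest to farthest.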
FDT Bandwidth tests in ALICE (E2E avbw)
http://monalisa.cern.ch/FDT/
[Figure annotations: newer kernel, tuned TCP buffers; 1 Gbps network card; default kernels, default TCP buffers; 100 Mbps network card; different trends = different kernels.]
Conclusions
http://monalisa.caltech.edu
http://repository.uslhcnet.org
• The MonALISA framework provides a flexible and reliable monitoring infrastructure
  • 350+ installed services, 1.5M+ unique parameters, 25 kHz value updates
  • Truly distributed architecture with no single points of failure
  • Highly modular platform
  • Automatic decision-making capability at both local and global levels
• USLHCNet provides a hybrid network with support for circuit-oriented network services
  • Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime (100% in the last 6 months)
  • We are investigating dynamic provisioning of circuits from collaborating agents
Monitoring Optical Switches
• Dynamic restoration of a lightpath if a segment has problems
Controlling Optical Planes: Automatic Path Recovery
• FDT transfer (CERN to CALTECH): 200+ MBytes/sec from a 1U node
• “Fiber cut” simulations: the traffic moves from one transatlantic line to the other
• The FDT transfer continues uninterrupted; TCP fully recovers in ~20 s
[Diagram: CERN Geneva, USLHCNet, Internet2, Starlight, Manlan, CALTECH Pasadena; 4 fiber cut emulations, numbered 1 to 4.]