Improving ENOC’s support for CODs. COD-18, Abingdon, UK. Guillaume Cessieux (CNRS, IN2P3-CC / EGEE SA2), 2008-12-03. Outline: ENOC and COD interactions; status of work around network trouble tickets; DownCollector assessment; review of last 12 months.
Improving ENOC’s support for CODs
COD-18, Abingdon, UK
Guillaume Cessieux (CNRS, IN2P3-CC / EGEE SA2)
2008-12-03
Outline
ENOC and COD interactions
Status of work around network trouble tickets
DownCollector
• Assessment
• Review of last 12 months
Proposal to handle DownCollector’s troubles
• Processes
• Tools’ improvements
COD18 2008-12-03
EGEE Network Operating Centre
ENOC
• Aiming to provide support for:
  • Sites
  • ROCs
  • CODs
• Hard to get feedback and requirements from SA1
  • “Two different worlds”...
• Now a real-life background with a better vision
0.5 FTE in EGI, main changes MUST happen before
• Drop unnecessary things, focus on what is useful
• Network support has a wider role than the ENOC in EGI
Current status with COD
Only DownCollector now seems to be used by CODs [ https://ccenoc.in2p3.fr/DownCollector/ ]
• Very efficient integration in the COD dashboard
SA2 would like to know how to better serve CODs on network support
• Regarding processes
  • Balance between “wait and see” and over-engineering
• Regarding tools and integration
  • DownCollector, other tools, CIC dashboard, alarms …
Use this background to sketch wise, realistic and useful processes and tools
Around network trouble tickets (1/2)
Currently TTdrawlight [ https://ccenoc.in2p3.fr/TTdrawlight/ ]
• Repository of network trouble tickets
• Not accurate enough and hard to use efficiently
Network trouble tickets are not a panacea
• « Главным образом сеть вниз. Будет вверх скоро » ~ « Main router is down. Will be up soon. »
• Targeted at a local community
• But often the only operational information available…
Strong privacy issues in sharing network trouble tickets
• No filtering of the sensitive information delivered (school, military…)
• Fear of comparison and competition
• Knowledge database of network trouble tickets compromised?
Around network trouble tickets (2/2)
19 NRENs currently send their tickets to the ENOC
• EGEE relies on networks from ~50 NRENs + GÉANT2
• We cover ~80% of European Grid sites
• 2800 e-mails for 900 tickets/month
• Really hard to deal with the meaning of tickets (location, duration...)
Standardisation of network TT?
• Can enable painless, accurate and automatic management of TT
• Strong advances in this domain but hard to promote to NRENs
Situation to be sorted out between NRENs & SA2
• Solve centralisation, accuracy and exposure of TT
• Then tools will easily follow
Around network monitoring
Connectivity addressed with DownCollector
• But not performance
Hard to get information on end-to-end performance
• Requires going into the details of network paths and devices
• 300 certified sites, 50 NRENs... Inhomogeneous domains
• Network is shared, should be monitored once and not at project level
• Slowly converging toward perfSONAR, which is not yet mature
EGEE network troubleshooting tool upcoming
• Lightweight package from SA2
• Prototype around January 2009
DownCollector (1/3)
Now a key tool reporting TCP listening of Grid nodes
2 minutes accuracy
• 2600 nodes polled
• Often first to detect some failures
GOCDB scheduled downtimes are managed
• Troubles not reported for sites in scheduled downtimes
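A minimal sketch of what one polling round could look like, assuming that "TCP listening" means a plain TCP connect to a service port. The function names, the `(site, host, port)` tuple layout and the set of sites in scheduled downtime are hypothetical illustrations, not DownCollector's actual code or the GOCDB interface.

```python
import socket


def tcp_listening(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable name, unreachable network...
        return False


def poll_once(nodes, sites_in_scheduled_downtime, probe=tcp_listening):
    """Run one polling round and return the unreachable nodes.

    nodes: iterable of (site, host, port) tuples (hypothetical layout).
    sites_in_scheduled_downtime: set of site names to skip, standing in
    for whatever GOCDB reports as scheduled downtime.
    probe: callable(host, port) -> bool, replaceable for testing.
    """
    unreachable = []
    for site, host, port in nodes:
        if site in sites_in_scheduled_downtime:
            continue  # troubles are not reported during scheduled downtime
        if not probe(host, port):
            unreachable.append((site, host, port))
    return unreachable
```

Running `poll_once` every two minutes (for example from a scheduler) would give the 2 minutes accuracy quoted above; passing a fake `probe` keeps the logic testable without touching the network.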
DownCollector (2/3)
[Diagram labels: GÉANT2, NREN X, checkpoint, OFF-SITE, ON-SITE]
A trouble = all Grid hosts of a site unreached
• To avoid measuring host availability
Network checkpoint = border router
• Demarcation point for ENOC’s responsibility
• Checked during trouble
Three kinds of troubles
• OFF-SITE: Network checkpoint NOT reached
  • Fault in: WAN, MAN, NREN, GÉANT2, ISP...
• ON-SITE: Network checkpoint reached
  • LAN, power, software ...
• UNKNOWN: No clear and reliable checkpoint, but site in trouble
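The three kinds of troubles can be expressed as a small decision function. This is only a sketch of the rules stated on the slide: the names `Trouble` and `classify_trouble`, the use of `None` for "no reliable checkpoint", and the handling of a site with no hosts are assumptions.

```python
from enum import Enum


class Trouble(Enum):
    NONE = "no trouble"
    OFF_SITE = "OFF-SITE"
    ON_SITE = "ON-SITE"
    UNKNOWN = "UNKNOWN"


def classify_trouble(host_reached, checkpoint_reached):
    """Classify the state of one site.

    host_reached: dict mapping each Grid host of the site to True/False.
    checkpoint_reached: True or False for the border router, or None when
    the site has no clear and reliable checkpoint (assumed convention).
    """
    # A trouble exists only when ALL Grid hosts of the site are unreached;
    # a single unreachable host is host availability, not a network trouble.
    if not host_reached or any(host_reached.values()):
        return Trouble.NONE
    if checkpoint_reached is None:
        return Trouble.UNKNOWN
    # Checkpoint reached: the fault lies behind the border router (ON-SITE).
    # Checkpoint not reached: the fault lies in the WAN, MAN, NREN... (OFF-SITE).
    return Trouble.ON_SITE if checkpoint_reached else Trouble.OFF_SITE
```

For instance, `classify_trouble({"ce": False, "se": False}, True)` yields `Trouble.ON_SITE`, while the same hosts with an unreached checkpoint yield `Trouble.OFF_SITE`.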
DownCollector (3/3)
[Diagram labels: ENOC, RENATER, GÉANT2, NREN X, Router A, Router B, French site, Foreign site 1, Checkpoint for site 1, Foreign site 2]
Is it trustable or biased?
• Is a failure reported from the ENOC a failure for the entire infrastructure?
  • For ON-SITE troubles: ~YES
  • What about French sites reached without using GÉANT2? ~NO
    • Remote probes?
    • 2 instances of DownCollector?
Troubles detected by DownCollector
Scope
• 300 certified sites
• Last 12 months
[Charts: number of troubles per month; number of troubles]
Troubles are not concentrated on a few sites!
54% of detected problems are ON-SITE
Troubles’ durations
[Chart: distribution of troubles’ durations over the last 12 months]
80% solved within 30 min
• Pareto’s law
The others
• OFF-SITE
  • Avg 45 troubles/month
• ON-SITE
  • Avg 85 troubles/month
Yearly sum of downtimes per site
[Chart: total downtime per site over the last 12 months; annotation “46 sites”]
164 sites have less than 1 d of downtime during the last 12 months
Last 12 months total downtime for site 46: 4 d OFF-SITE, 17 d ON-SITE
85% of sites have < 4 d of downtime/year = 98.90% reachability/year
N.B.: unscheduled downtime
Best: 4 minutes down
Worst: 64 days (PPS…)
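The 98.90% figure follows from 4 days of downtime out of a year, assuming a 365-day year. A quick check, with the hypothetical helper name `reachability_percent`:

```python
def reachability_percent(downtime_days, period_days=365.0):
    """Share of the period during which the site was reachable, in percent."""
    return 100.0 * (1.0 - downtime_days / period_days)


# 4 days of unscheduled downtime over an assumed 365-day year
print(f"{reachability_percent(4):.2f}%")  # prints 98.90%
```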
First assessment
Networks are quite reliable
• Few long outages on resilient transit networks
• ON-SITE troubles are significant
• 30 minutes seems a wise threshold
• DownCollector seems reliable and trustworthy enough
Automatic management of network TT is currently not reliable
Currently few interactions between SA2 and CODs
This was discussed with pole1 for improvements
• Thanks to them for their feedback; the results follow
Proposal for troubles handling
Map troubles handling onto the three kinds of problem reported by DownCollector
OFF-SITE troubles handling
ENOC’s responsibility: devolving trouble resolution to NRENs/GÉANT2
Targeted key information: expected end date
• Hard to get…
Enable marking of particular outages
• « ENOC please follow that »
• Maybe then automatically create a ticket in ENOC’s helpdesk (GGUS) to exchange information with COD
Proposal for tools (1/2)
ENOC to work with sites to improve some network checkpoints
• Reduce number of unknown troubles (~12%, ~106/month)
• 351 sites in database: 32 (9%) without usable checkpoint
• [ https://ccenoc.in2p3.fr/DownCollector/?v=list_headnodes ]
ENOC’s bar in COD dashboard
[Figure labels: Trouble, ON-SITE, OFF-SITE, UNKNOWN; time axis from -5h to Now]
Proposal for tools (2/2)
[Figure label: 1.5 - select threshold]
Notification from DownCollector to site admins for long-standing outages (15 or 30 minutes?)
• Integration with Nagios not sufficient?
• Existing DownCollector feature: subscribe to troubles
  • [ https://ccenoc.in2p3.fr/DownCollector/?v=subscription ]
  • Released with EGEE broadcast on 2008-07-16
  • 34 sites, 26 distinct e-mail addresses currently registered
  • Noticed problem: e-mails do not reach disconnected sites…
  • No threshold implemented yet
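Since no threshold is implemented yet, the following is only a sketch of how a 15 or 30 minute threshold could gate notifications. The function names, the `open_troubles` mapping and the `already_notified` set are hypothetical and do not describe the existing subscription feature.

```python
from datetime import datetime, timedelta


def outage_exceeds_threshold(started_at, now, threshold_minutes=30):
    """True once a trouble has lasted at least threshold_minutes."""
    return now - started_at >= timedelta(minutes=threshold_minutes)


def sites_to_notify(open_troubles, now, threshold_minutes=30,
                    already_notified=frozenset()):
    """Return the sites whose admins should be notified now.

    open_troubles: dict mapping site name -> datetime the trouble started.
    already_notified: sites already warned about their current trouble,
    so that each long-standing outage triggers a single notification.
    """
    return sorted(
        site
        for site, started_at in open_troubles.items()
        if site not in already_notified
        and outage_exceeds_threshold(started_at, now, threshold_minutes)
    )
```

With 80% of troubles solved within 30 minutes, a 30 minute threshold would suppress most notifications, whereas 15 minutes trades more e-mails for earlier warning. The sketch does not address the noted problem that e-mails cannot reach a disconnected site.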
Actions list for tools
ENOC
• DownCollector
  • Improve checkpoints
  • Add threshold to subscribe feature?
  • Allow flagging important network outages and study a scheme to exchange information about them (GGUS ENOC’s helpdesk...)
  • Provide ENOC’s bar
CIC portal
• Manage network alarms & alarm masking
• Integrate ENOC’s bar
Conclusion
It’s really going ahead
Some implementation details to sort out
• Scalability, regionalisation
  • Right now, or waiting for your next model (alarm DB, R-COD etc.)?
• CIC portal & ENOC
  • Priorities, manpower and roadmap
Other ideas, feedback etc. always welcome
• Help design the network support you need
Questions?