LHCOPN: Operations report. Guillaume.Cessieux @ cc.in2p3.fr, Network team, FR-CCIN2P3. LHCOPN meeting, CERN, 2010-10-08.
From last LHCOPN meeting, 2010-06-29, Barcelona
• Conclusion on Operations
  • Sites follow the processes unevenly, because they see no clear usefulness and no evidence of network failures
  • WLCG relationships are weak
  • Monitoring and SLD are required to really assess Operations
• Items not solved
  • LHCOPN representatives
  • How to push efficiently for proper solving of some issues/administrative tasks
    • In clear words: stress sites and escalate frozen issues
  • Merging the LHCOPN helpdesk with standard GGUS
Outline
• Operation status
  • TTS stats
  • Long-standing issues & Ops phoneconf report
• Operational exchanges with WLCG
  • Post-mortem analysis of some issues
  • Ease exchanges with WLCG
• AOB
Number of tickets put in the LHCOPN TTS per month (chart)
• Average: 23 tickets/month
Ticket ownership during [2010-07-01, 2010-09-30] (chart)
• Joy of terminating 6 LHCOPN links
Conclusion from TTS stats
• Workflow is stable, but it is unclear whether this is good
  • SLD & monitoring are missing to correlate and focus on service-impacting events
• Lots of L2 events (80%), well handled
  • Often clear-cut, easy to spot
• Not used to complex issues
  • Often turning into a long story
  • Packet loss, MTU...
Long-standing issues
• Only administrative!
  • Validate prefix acceptance etc.
  • Waiting for the GGUS feature “clone this ticket and assign it to all impacted sitename” to follow this on a per-site basis
• Followed during the LHCOPN Ops phoneconf, every 3 months
• Recurrent issue: hard to get administrative issues solved
Issues highlighted by WLCG (1/4)
• Painful to spot, and many are not related to the LHCOPN at all
• #GGUS-54473: transfer error from PIC_DATADISK to SARA-MATRIX_DATADISK
  • Child issues: #GGUS-54416, #GGUS-54474, #GGUS-54500
  • “The two LHCOPN routers at CERN were connected via a VLAN, and VLAN tagging adds 4 bytes to a packet. The MTU between these routers has been increased”
  • Opened 2010-01-05 12:17, closed 2010-01-08 16:16
  • No related LHCOPN tickets
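The 4-byte VLAN overhead quoted in #GGUS-54473 can be checked with simple frame arithmetic. The sketch below uses standard Ethernet and IEEE 802.1Q sizes; the 9000-byte IP MTU is an assumption for illustration only, since the slides do not state the MTU actually configured on the CERN routers, and how a given router counts the tag against its MTU is vendor-specific.

```python
# Sizes in bytes: standard Ethernet / IEEE 802.1Q constants.
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
ETH_FCS = 4       # frame check sequence
DOT1Q_TAG = 4     # IEEE 802.1Q VLAN tag


def frame_size(ip_mtu: int, vlan_tagged: bool) -> int:
    """On-wire frame size (without preamble) for a full-size IP packet."""
    return ETH_HEADER + (DOT1Q_TAG if vlan_tagged else 0) + ip_mtu + ETH_FCS


def max_ip_packet(max_frame: int, vlan_tagged: bool) -> int:
    """Largest IP packet that fits when the link caps the total frame size."""
    return max_frame - ETH_HEADER - ETH_FCS - (DOT1Q_TAG if vlan_tagged else 0)


IP_MTU = 9000  # assumed jumbo-frame MTU, NOT taken from the ticket

untagged = frame_size(IP_MTU, vlan_tagged=False)
tagged = frame_size(IP_MTU, vlan_tagged=True)

# A link sized for untagged frames leaves 4 bytes less for tagged traffic.
print(untagged, tagged, max_ip_packet(untagged, vlan_tagged=True))
# prints: 9018 9022 8996
```

Under these assumptions, a link dimensioned for untagged 9000-byte packets would drop full-size packets once a tag is added, which is consistent with the fix of raising the MTU between the two routers.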
Issues highlighted by WLCG (2/4)
• #LHCOPN-58197: Poor performance between CERN and ASGC
  • Opened 2010-05-12, closed 2010-05-17
  • Never updated, only opened/closed for the record
  • Only a communication problem; the issue was managed
  • Network staff movement at TW-ASGC, solved
  • SIR filed: https://twiki.cern.ch/twiki/bin/view/LCG/SIRCernAsgcLinkMay2010
• #GGUS-59791: Transfer problem from INFN-T1_DATADISK to PIC_DATADISK
  • Child issue: #GGUS-59697 T0 export to INFN-T1_DATADISK failures: No valid space tokens
  • Opened 2010-07-07 00:06, closed 2010-07-14 18:05
  • “Network issue of MTU black hole + route asymmetry at CNAF/GARR”
  • No LHCOPN tickets
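An MTU black hole, as named in #GGUS-59791, is a hop that silently drops packets above some size without returning the ICMP "fragmentation needed" message, so the sender never learns to shrink its packets. One common way to locate the limit is to probe with packets of varying size and search for the largest one that gets through. The sketch below is a minimal illustration of that search only: `make_simulated_path` is a hypothetical stand-in for a real probe (for example a ping with the don't-fragment bit set), and the hop MTUs are invented, not taken from the CNAF/GARR incident.

```python
from typing import Callable, Sequence


def largest_passing_size(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest packet size in [low, high] that probe() delivers.

    Assumes probe(low) succeeds and that delivery is monotonic in size
    (if a size is dropped, every larger size is dropped too).
    """
    if not probe(low):
        raise ValueError("even the smallest probe is lost")
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid got through: the limit is at least mid
        else:
            high = mid - 1   # mid was dropped: the limit is below mid
    return low


def make_simulated_path(hop_mtus: Sequence[int]) -> Callable[[int], bool]:
    """Hypothetical probe: a packet is delivered only if it fits every hop."""
    bottleneck = min(hop_mtus)
    return lambda size: size <= bottleneck


# Invented hop MTUs: the middle hop silently drops anything above 1500 bytes.
probe = make_simulated_path([9000, 1500, 9000])
print(largest_passing_size(probe, low=576, high=9000))  # prints: 1500
```

With route asymmetry, as in the ticket, the forward and return paths can have different limits, so such a search would have to be run from both ends.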
Issues highlighted by WLCG (3/4)
• #GGUS-61306: Functional test transfer errors to RAL-LCG2_DATADISK
  • Related to
    • #GGUS-61942 “NDGF-T1 transfer error from RAL-LCG2 and to BNL-OSG2”
    • #GGUS-61835 “Transfer errors from NDGF-T1_DATADISK to RAL-LCG2_DATADISK”
    • #GGUS-62287 “Transfer errors at NDGF-T1_SCRATCHDISK”
  • Opened 2010-08-19 17:41, closed 2010-09-17 15:09
  • #LHCOPN-62228, opened/closed 2010-09-17
    • Symbolic, for the record only; it contains no information
  • “The linecard terminating the RAL primary link on the CERN router was replaced and the issue was definitely solved”
Issues highlighted by WLCG (4/4)
• 4 LHCOPN issues this year
  • Nothing particularly wrong
  • The problem is mainly around communication
• The main mistake is forgetting to create a ticket in the LHCOPN helpdesk
  • This was the agreed process
• Not aware of any other LHCOPN-related issue from WLCG
  • But there are other network issues (LAN, generic IP...)
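To put the "long story" aspect in numbers, the time to close can be computed from the opening and closing timestamps quoted for the three GGUS tickets above. This is only a small sketch using those timestamps as given; #LHCOPN-58197 is left out because only its dates, not its times, are quoted.

```python
from datetime import datetime, timedelta

# (ticket, opened, closed) exactly as quoted on the WLCG issue slides
TICKETS = [
    ("GGUS-54473", "2010-01-05 12:17", "2010-01-08 16:16"),
    ("GGUS-59791", "2010-07-07 00:06", "2010-07-14 18:05"),
    ("GGUS-61306", "2010-08-19 17:41", "2010-09-17 15:09"),
]
FMT = "%Y-%m-%d %H:%M"


def time_to_close(opened: str, closed: str) -> timedelta:
    """Elapsed time between the opening and closing timestamps."""
    return datetime.strptime(closed, FMT) - datetime.strptime(opened, FMT)


durations = {ticket: time_to_close(o, c) for ticket, o, c in TICKETS}
for ticket, d in durations.items():
    hours, minutes = d.seconds // 3600, d.seconds % 3600 // 60
    print(f"{ticket}: {d.days} d {hours} h {minutes} min")
# prints:
# GGUS-54473: 3 d 3 h 59 min
# GGUS-59791: 7 d 17 h 59 min
# GGUS-61306: 28 d 21 h 28 min
```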
Separated LHCOPN helpdesk in GGUS, why? (1/3)
• Key requirement, 2008-03
  • Not doing user support, but coordinating network teams
  • Match the operational model, particularly the responsibility and notification scheme
  • Network issue ≠ Grid issue: many non-service-impacting events have to be registered
    • Avoid disturbing or misleading people
  • Network teams have no access to standard GGUS
    • And did not want it
  • Centralize anything related to LHCOPN Ops
• Clear desire to be isolated/protected
  • “If we use standard GGUS this will be a mess”
  • Real fear of enquiries about anything
  • Did not want to be considered a catch-all networking support; we should accept only selected LHCOPN-related enquiries going through storage teams
• So we ended up with the LHCOPN helpdesk
Separated LHCOPN helpdesk in GGUS, why? (2/3)
• Now
  • The general workflow is agreed; the discussion is about how to implement it
  • Many things have evolved
    • GGUS support scheme, experience in applying processes etc.
• Several problems/concerns experienced
  • Problems cannot be solved independently by the network team?
    • Lots of interaction with storage, system etc.
    • Aren’t iperf tests or monitoring sufficient?
  • We miss a clear bridge with WLCG Ops
    • Hope was put in the awaited parent/child relationship feature for GGUS tickets
    • Cross-helpdesk access and exchanges required?
  • Enquiries often already have a standard GGUS ticket
    • “Why create an LHCOPN TT if there is already a GGUS one?”
    • Competition between the LHCOPN helpdesk and standard GGUS
    • Tickets turn out to be network related only after some time and investigation
  • LHCOPN tickets: overhead or true advantage?
    • Notification, responsibility, tracking etc.
Separated LHCOPN helpdesk in GGUS, why? (3/3)
• So create 12 related support units in the standard GGUS?
  • LHCOPN_CA-TRIUMF etc.
  • Will this add happy interactions with everybody?
• Can we keep our set of particular features and be smartly integrated into the current GGUS workflow?
  • Particular view, non-service-impacting events hidden, categories, tickets for maintenance, notification and assignment scheme?
  • Transparent for us? Can a standard ticket be turned into an LHCOPN one?
• Aren’t we doing more than user support?
AOB (1/3)
• Routing policies
  • To be documented accurately through a routing matrix
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies
• Escalation process
  • Exists, but has never been used
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#Escalated_incident_management_pr
  • Give this privilege to WLCG people on LHCOPN tickets?
• Scheme of responsibilities to be improved?
  • Set on a per-link basis, so who is responsible for an IT-INFN-CNAF ↔ US-T1-BNL issue?
  • Can this really happen without problems between IT-INFN-CNAF ↔ CERN or US-T1-BNL ↔ CERN?
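The gap in a per-link responsibility scheme can be illustrated with a small lookup. Everything in this sketch is hypothetical: the table and the choice of the Tier-1 as the responsible party for each link are assumptions made for the example, not the actual LHCOPN assignments.

```python
from typing import Dict, FrozenSet, List, Optional

Link = FrozenSet[str]

# Hypothetical table: one responsible site per link (assumed to be the Tier-1).
RESPONSIBLE: Dict[Link, str] = {
    frozenset({"CERN", "IT-INFN-CNAF"}): "IT-INFN-CNAF",
    frozenset({"CERN", "US-T1-BNL"}): "US-T1-BNL",
}


def responsible_for(a: str, b: str) -> Optional[str]:
    """Responsible site for the direct link a <-> b, or None if no such link is listed."""
    return RESPONSIBLE.get(frozenset({a, b}))


def responsible_along(path: List[str]) -> List[Optional[str]]:
    """Responsible site for each consecutive link of a multi-hop path."""
    return [responsible_for(a, b) for a, b in zip(path, path[1:])]


# No direct link is listed between the two Tier-1s, so nobody owns the issue...
print(responsible_for("IT-INFN-CNAF", "US-T1-BNL"))
# prints: None

# ...while the path through CERN involves two links and two responsible sites.
print(responsible_along(["IT-INFN-CNAF", "CERN", "US-T1-BNL"]))
# prints: ['IT-INFN-CNAF', 'US-T1-BNL']
```

Under these assumptions, an end-to-end issue between two Tier-1s either has no owner or has two, which is the ambiguity the question on the slide points at.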
AOB (2/3)
• Issues/requests related to MDM
  • Must be visible, tracked and centralised like any other LHCOPN issue
  • Must be in the LHCOPN TTS
    • Maybe new problem categories etc. to support this
    • How far? Track software bugs or only site implementations?
  • DANTE/GN3 could have a login/password to GGUS if they have no certificate
    • Any concerns about this?
• Documentation about MDM boxes available?
  • Should be on the LHCOPN twiki, even if very brief
  • Is a list of boxes with their IP addresses enough?
  • Hard to solve problems knowing only the local boxes
  • DANTE/GN3 should have R/W access to the LHCOPN twiki
AOB (3/3)
• Too many off-the-record e-mail exchanges about LHCOPN issues
  • They MUST be in the LHCOPN TTS
    • Visible, followed, timestamped etc.
  • Tickets in the LHCOPN TTS have a clear scheme of responsibilities… unlike an e-mail sleeping in an inbox
• If no LHCOPN ticket, no LHCOPN issue
Conclusion
• Awaiting monitoring to revitalise Ops
  • And SLD to really know what matters
• Main weakness of LHCOPN Ops: relationship with WLCG
• GGUS merging: to be investigated/discussed further
  • Why not, if this solves issues
  • Be careful with the scope of our model
    • LHCOPN only
    • Key reason for having this so specific?
  • But be careful before changing something that works
  • Also wait for EGI networking support and Tier-2 networking to converge