LHCOPN operational model - 4 use-cases. Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG LHCOPN meeting, 2009-01-15, Berlin. Agenda. Focus on 4 use-cases: Incident Management L3: Power outage at DE-KIT leading to routers down
LHCOPN operational model - 4 use-cases
Guillaume Cessieux (FR-CCIN2P3 / EGEE networking support) on behalf of the Ops WG
LHCOPN meeting, 2009-01-15, Berlin
Agenda
Focus on 4 use-cases:
• Incident Management
  • L3: Power outage at DE-KIT leading to routers down
  • L2: Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
• Change Management
  • L3: New IP prefix for ES-PIC
• Maintenance Management
  • L2: USLHCNET's scheduled power cut for devices in Chicago
GCX - LHCOPN meeting - 2009-01-15
Tools used
• CERN’s twiki
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsModelUseCases
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsContacts
• GGUS
  • Public release 2009-02-01
• Monitoring
  • MDM, e2emon, ASPDrawer...
L3 incident management
Power outage at DE-KIT leading to routers down
Scope
• 2 routers unexpectedly down
• Affected:
  • NL-T1, CH-CERN, IT-INFN-CNAF, FR-CCIN2P3, DE-KIT
  • 5 links
L3 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_incident_management_process
Scope: Router down, BGP filtering, bad routing...
The source site is the site where the problem lies.
1.1 A ticket is created on the LHCOPN Helpdesk by the router operator of the source site to report the problem. It is assigned to the source site itself.
1.2 The router operator contacts his counterpart at the distant site (site-site communication) to find out whether something has gone wrong there (power outage...). If the problem is on the distant site, the distant site starts this process (the ticket is then re-assigned to the distant site).
1.3 If the problem is related to an underlying layer (L2: dark fibre outage...), the router operator starts the L2 incident management process. The router operator is responsible for managing the trouble with the L2NOC (opening and following the NOC's ticket...). He remains responsible for the LHCOPN ticket in GGUS.
1.4 Otherwise the router operator owns the problem and contacts his local Grid Data contact to report the impact. The distant router operator is also informed.
2 The LHCOPN TTS notifies all impacted sites about the incident.
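The branching in steps 1.1 to 2 above can be sketched in a few lines of Python. This is only an illustration of the decision flow: the `Ticket` class, its fields and the `l3_incident` function are hypothetical names, not the GGUS data model or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    """Hypothetical stand-in for an LHCOPN ticket; not the real GGUS schema."""
    assigned_to: str
    affected_sites: List[str]
    actions: List[str] = field(default_factory=list)


def l3_incident(source_site: str,
                affected_sites: List[str],
                distant_site_at_fault: Optional[str] = None,
                l2_cause: bool = False) -> Ticket:
    """Walk through the L3 incident steps and record what happens."""
    # 1.1 The source site's router operator opens a ticket assigned to its own site.
    ticket = Ticket(assigned_to=source_site, affected_sites=list(affected_sites))
    ticket.actions.append(f"1.1 ticket opened by {source_site}")

    # 1.2 If the counterpart reports that the problem is on its side,
    #     the ticket is re-assigned to the distant site.
    if distant_site_at_fault is not None:
        ticket.assigned_to = distant_site_at_fault
        ticket.actions.append(f"1.2 ticket re-assigned to {distant_site_at_fault}")

    owner = ticket.assigned_to
    if l2_cause:
        # 1.3 Underlying layer problem: hand over to the L2 incident process.
        ticket.actions.append(f"1.3 {owner} starts L2 incident management")
    else:
        # 1.4 The router operator owns the problem and reports the impact.
        ticket.actions.append(f"1.4 {owner} warns its Grid data contact")

    # 2 The ticketing system notifies every impacted site.
    for site in ticket.affected_sites:
        ticket.actions.append(f"2 notify {site}")
    return ticket
```

Calling `l3_incident("DE-KIT", [...])` with the five affected sites of this use-case and no other argument follows the short path: ticket opened, Grid data contact warned, sites notified.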
L3 Incident management process (diagram, V0.5 20081215 gcx)
Actors: router operators of the source site and of the sites involved, Grid Data contact, LHCOPN TTS (GGUS), affected sites, L2 incident management process. Arrows are labelled with steps 1.1, 1.2, (1.3), 1.4 and 2.
Legend: A interacts with B; A notifies B; A goes to process B; A reads and writes B.
Ticket opening
Diagram: DE-KIT router operators, LHCOPN TTS (GGUS), step 1.1.
1.1: A DE-KIT router operator opens a trouble ticket in GGUS
GGUS submit interface
Ticket opened
Other steps
• Outage is localised and noticed by source site
  • No need to perform 1.2: Contact counterpart on distant site
• This is a power cut, not a real L2 problem
  • No need to go further on 1.3: L2 incident management process
Grid interaction
Diagram: DE-KIT router operators, Grid Data contact, LHCOPN TTS (GGUS), steps 1.1 and 1.4.
1.4: The Grid data contact at DE-KIT is warned about the outage (GGUS TTid provided)
• He will assess the impact on the Grid
• He will warn the Grid
Automatic broadcasting
Diagram: DE-KIT router operators, Grid Data contact, LHCOPN TTS (GGUS), affected sites CH-CERN, FR-CCIN2P3, IT-INFN-CNAF, NL-T1, DE-KIT, steps 1.1, 1.4 and 2.
2: The GGUS TTS will warn all affected sites
• This is done when the ticket is submitted
Following/Closure
• Incident registration and broadcasting are complete
• The DE-KIT router operator is in charge of updating and closing the GGUS ticket
  • Affected sites will be notified
  • The local Grid data contact also has to be warned
History
Conclusion for first use-case
• Shortcut, as the incident is quickly localised
  • Otherwise more interactions between sites
• Strongly organised around GGUS tickets
  • Could be opened by another site and assigned to DE-KIT
  • Change status from « assigned » to « in progress » to acknowledge
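The acknowledgement rule mentioned above (moving a ticket from « assigned » to « in progress ») can be pictured as a small transition table. This is a sketch under assumptions: only the two statuses named on the slide come from the source, the « closed » state is added for illustration, and none of this reflects the actual GGUS status list.

```python
# Hypothetical status graph: "assigned" and "in progress" come from the slide;
# "closed" is an assumed terminal state used only for illustration.
ALLOWED_TRANSITIONS = {
    "assigned": {"in progress"},
    "in progress": {"closed"},
    "closed": set(),
}


def change_status(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current!r} to {new!r}")
    return new


def acknowledge(current: str) -> str:
    """Acknowledge a ticket by moving it from 'assigned' to 'in progress'."""
    return change_status(current, "in progress")
```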
L2 Incident management
Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001
Scope
• The router operator at UK-T1-RAL noticed that the link is down thanks to their monitoring system
• Affected:
  • 1 link: CERN-RAL-LHCOPN-001
  • 2 sites: CH-CERN and UK-T1-RAL
• No clear idea of what and where the problem is
  • Router down at CH-CERN, fibre cut…
Global problem management process started
Quick investigation
1- Nothing seems to be occurring on site
2- Take an overview of the LHCOPN
• The e2emon monitoring system indicates that the L2 link is down in segment “UKERNA”
  • Now tracking a fibre cut
• Nothing seems to be registered in GGUS about it
• Unscheduled event = Incident
  • Going to L2 incident management
L2 incident management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_incident_management_process
Scope: Dark fibre outages...
1.1 An L2NOC or a router operator may notice an L2 incident. They interact to confirm it or not. A router operator may also be warned by the L3 incident management process through an LHCOPN ticket assigned to his site.
1.2 If confirmed, the router operator of a linked site puts a ticket on the LHCOPN TTS. The router operator is in charge of dealing with the L2 network providers involved and of reflecting the ongoing resolution in the LHCOPN TTS.
1.3 It is the responsibility of linked and affected sites to warn their Grid data contact.
2 All impacted sites are notified by the TTS.
3 If nothing is found at L2, the Escalated incident management process is started.
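The same process can be written as a short function returning the ordered steps. This is a sketch only: the function name, its parameters and the reading that an unconfirmed L2 incident leads straight to step 3 are assumptions made for illustration.

```python
from typing import List


def l2_incident_steps(reporting_site: str,
                      affected_sites: List[str],
                      confirmed_at_l2: bool) -> List[str]:
    """List the L2 incident steps for a site that suspects a link outage."""
    # 1.1 The router operator and the L2NOC check the suspected incident together.
    steps = [f"1.1 {reporting_site} router operator and L2NOC check the suspected incident"]

    if not confirmed_at_l2:
        # 3 Nothing found at L2: escalate instead of opening an L2 ticket.
        steps.append("3 nothing found at L2: start escalated incident management")
        return steps

    # 1.2 A linked site registers the incident and follows it with the provider.
    steps.append(f"1.2 {reporting_site} opens a ticket on the LHCOPN TTS")

    # 1.3 Each linked or affected site warns its own Grid data contact.
    steps += [f"1.3 {site} warns its Grid data contact" for site in affected_sites]

    # 2 The ticketing system notifies all impacted sites.
    steps += [f"2 TTS notifies {site}" for site in affected_sites]
    return steps
```

For the fibre cut of this use-case, the call would be `l2_incident_steps("UK-T1-RAL", ["CH-CERN", "UK-T1-RAL"], confirmed_at_l2=True)`.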
L2 Incident management process (diagram, V0.5 20081215 gcx)
Actors: L2 NOC, router operators and Grid Data contact of the linked sites, LHCOPN TTS (GGUS), affected sites. Entry point from the end of L3 incident management; exit to escalated incident management. Arrows are labelled with steps 1.1, 1.2, 1.3, 2 and (3).
Legend: A interacts with B; A notifies B; A reads and writes B.
Incident registration
Diagram: UK-T1-RAL router operators, JANET NOC, LHCOPN TTS (GGUS), steps 1.1 and 1.2.
1.1: The router operator at UK-T1-RAL will open a ticket with JANET for the outage
1.2: UK-T1-RAL noticed the outage, so it will open a ticket in GGUS for the LHCOPN community
• Self-assigned, because the link is under their responsibility (T0-T1)
GGUS ticket submitted
Broadcasting
Diagram: linked sites UK-T1-RAL (router operators, Grid Data contact) and CH-CERN, JANET NOC, LHCOPN TTS (GGUS), steps 1.1, 1.2, 1.3 and 2.
1.3: Grid interaction
• Local Grid data contact warned (+ #GGUS-TTid)
2: Other affected sites automatically notified by GGUS
Following/Closure
• UK-T1-RAL will update the GGUS ticket with information from JANET
  • Grid data contact and affected sites are kept updated
• The ticket will be closed by UK-T1-RAL
Conclusion for second use-case
• Accurate and reliable monitoring is required to really shortcut investigations
• Communication between network provider and customer is key
  • We did not change the way this currently works
L3 Change management
New IP prefix for ES-PIC
Scope
• ES-PIC has a new IP prefix that must be included within the LHCOPN
• Affected:
  • All sites: Filters to update…
  • And monitoring systems
L3 change management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L3_change_management_process
Scope: IP address change, new prefix propagated, new filtering
The source actors for these changes are router operators.
1.1 The router operator explains the change to his Grid data contact (change in performance, new resiliency possibility...).
1.2 The router operator explains the change to affected sites (e.g. linked sites).
2.1 The change is fully documented on the global web repository, and the related technical information is also updated.
2.2 An informational ticket summarizing the change is put into the LHCOPN TTS. It contains a link to the full documentation of the change (e.g. URL to the global web repository).
2.3 The L3 monitoring infrastructure may be adapted if needed (new p2p IPs to be watched...).
3 The LHCOPN TTS notifies all impacted sites.
4 If the change has an impact, an L3 maintenance management process is started to commit it. Otherwise the change can be made directly.
If an L3 change impacts the L2 (L3 VPN for instance), the L2 change management process should be started.
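The documentation steps are always the same; only step 4 and the L2 hand-over depend on the change. A minimal sketch of that branching, with a hypothetical function name and parameters chosen for illustration:

```python
from typing import List


def l3_change_steps(site: str,
                    has_impact: bool,
                    touches_l2: bool = False) -> List[str]:
    """List the L3 change management steps for a change announced by `site`."""
    steps = [
        f"1.1 {site} router operator explains the change to its Grid data contact",
        f"1.2 {site} router operator explains the change to affected sites",
        "2.1 document the change on the global web repository",
        "2.2 put an informational ticket with a link to the documentation in the LHCOPN TTS",
        "2.3 adapt the L3 monitoring infrastructure if needed",
        "3 LHCOPN TTS notifies all impacted sites",
    ]
    if has_impact:
        # 4 An impacting change is committed through a maintenance.
        steps.append("4 start the L3 maintenance management process")
    else:
        # 4 A non-impacting change can be made directly.
        steps.append("4 make the change directly")
    if touches_l2:
        # L3 changes that affect L2 (an L3 VPN, for instance) also go through L2.
        steps.append("start the L2 change management process")
    return steps
```

For a new prefix with no impact on existing service, `l3_change_steps("ES-PIC", has_impact=False)` ends with the direct change.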
L3 Change Management (diagram, V0.5 20081215 gcx)
Actors: router operators and Grid Data contact of the source site, Global web repository (Twiki), LHCOPN TTS (GGUS), monitoring, affected sites, router operators of the linked sites, L3 maintenance management process. Arrows are labelled with steps 1.1, 1.2, 2.1, 2.2, (2.3), 3 and (4).
Legend: A interacts with B; A notifies B; A reads and writes B.
Change registration
Diagram: ES-PIC router operators, Grid Data contact, step 1.1.
1.1: The Grid data contact is warned about the change
• Will new hosts benefit from the LHCOPN?
1.2: This change is common and has no deep impact on others
• No need to discuss with impacted sites
Documentation and tool update
Diagram: ES-PIC router operators, Grid Data contact (1.1), Global web repository (Twiki) holding the Change management DB and the Technical information (2.1).
2.1:
• The change will be documented in the change management database
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase
• Technical information will be updated
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnIpAddresses
  • https://twiki.cern.ch/twiki/bin/view/LHCOPN/OverallNetworkMaps
Broadcasting
2.2: An « informational » GGUS ticket will be created
• With a link to the change management database entry
• With a link to the updated technical information
3: All sites will be notified
3: DANTE Operation + ENOC are put in copy
• New prefixes might also need to be monitored by MDM + ASPDrawer
GGUS submit interface
Dante.ops@dante.net + ENOC
Summary
Diagram: ES-PIC router operators and Grid Data contact (1.1), Global web repository (Twiki) with Technical information and Change management DB (2.1), LHCOPN TTS (GGUS) (2.2), notification (3) to all sites, DANTE Operation and ENOC, monitoring (2.3): MDM, BGP.
Committing the change (1/2)
• The change is documented and advertised but not yet committed
• Does the change, or committing it, have an impact on existing service?
  • No, so no need to commit it within a “true” maintenance
Committing the change (2/2)
• The change will be silently implemented by ES-PIC and reported with a GGUS ticket
  • Kind: Maintenance L3
  • To track implementation + statistics
Conclusion for third use-case
• Documenting and implementing are separated
  • 2 tickets: Informational & Maintenance
• Third-party tools might need to be updated
  • MDM, e2emon, ASPDrawer, GGUS…
• Lighter process for non-impacting changes
L2 maintenance management
USLHCNET's scheduled power cut for devices in Chicago
Scope (1/2)
USLHCNET will have a power cut in Chicago
Scope (2/2)
• Fictional impact:
  • US-FNAL-CMS will be fully disconnected
L2 maintenance management
https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel#L2_maintenance_management_proces
Sources of L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...). Often there is no negotiation phase with L2 network providers for L2 maintenance, but if an event is really disturbing, negotiation should be attempted.
1.1 The L2NOC sends its maintenance notice to connected or affected router operators. The first router operator notified starts this process.
1.2 The router operator warns his Grid data contact (and may check with him that the date is OK).
1.3 The router operator may check with distant affected sites, off the record, that the date is suitable.
1.4 If a disturbing overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.
2 All impacted sites are notified.
3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.
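The date check in steps 1.2 to 1.4 amounts to testing a proposed maintenance window against events already known at the affected sites, and asking for another date when they collide. A minimal sketch, assuming windows are plain `(start, end)` pairs; the names `overlaps` and `pick_window` are hypothetical and no real calendar or GGUS interface is involved:

```python
from datetime import datetime
from typing import List, Optional, Tuple

# A window is a (start, end) pair; this representation is an assumption.
Window = Tuple[datetime, datetime]


def overlaps(a: Window, b: Window) -> bool:
    """True if the two time windows share any instant."""
    return a[0] < b[1] and b[0] < a[1]


def pick_window(proposals: List[Window],
                known_events: List[Window]) -> Optional[Window]:
    """Return the first proposed window that collides with no known event.

    Each rejected proposal corresponds to one round of renegotiation with
    the network provider (step 1.4, back to 1.1). None means that every
    proposal collided and no date could be agreed.
    """
    for window in proposals:
        if not any(overlaps(window, event) for event in known_events):
            return window
    return None
```

The accepted window is the one the router operator would then post in the LHCOPN TTS.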
L2 Maintenance management process (diagram, V0.5 20081215 gcx)
Actors: L2 NOC, router operators and Grid Data contact of the linked sites, LHCOPN TTS (GGUS), affected sites. Arrows are labelled with steps 1.1, 1.2, 1.3, 1.4 and 2.
Legend: A interacts with B; A notifies B; A reads and writes B.
Registering maintenance (1/2)
Diagram: USLHCNET NOC, US-FNAL-CMS router operators and Grid Data contact, linked site CH-CERN, steps 1.1, 1.2 and 1.3.
1.1: USLHCNET warns at least site US-FNAL-CMS
• Not the Grid, not all LHCOPN sites, etc.
1.2: US-FNAL-CMS will warn its local Grid data contact
• And may check with him that the date is OK
1.3: Ideally, also avoid overlap with CH-CERN’s events
Registering maintenance (2/2)
• Affected sites:
  • US-FNAL-CMS, CH-CERN
• US-FNAL-CMS is responsible for following this event
• 1.4: An FNAL router operator will put the maintenance into GGUS
GGUS submit interface
Summary
Diagram: USLHCNET NOC, US-FNAL-CMS router operators and Grid Data contact, LHCOPN TTS (GGUS), linked site CH-CERN and its router operators, steps 1.1, 1.2, 1.3, 1.4 and 2.
Following
• US-FNAL-CMS updates the ticket according to USLHCNET reports
• US-FNAL-CMS is in charge of closing the ticket once the maintenance is finished
Ticket’s handling