EGEE Operation Procedures
Alexandre Duarte
CERN IT-GD-OPS
COD
• COD is Operator on Duty
• global LCG/EGEE GRID monitoring
• 1 (2) ROCs responsible for all GRID operations at a time
• 12 ROCs involved
• weekly rotation
• weekly WLCG-OSG-EGEE Operations meeting
  • ROCs, Tier1, experiments
  • all sites invited
Lost Island, First EELA ROC-on-Duty Tutorial, 29.11.2006
COD Procedures
• https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
• Look at the monitoring tools
  • SAM, gstat, Certificate Monitoring pages
• Open tickets using the COD Dashboard
• Escalate expired tickets
• Process site responses (update tickets accordingly)
• End of duty: hand-over notes
COD Dashboard
• summary of necessary monitoring information + tools for ticket processing
• tickets linked to GGUS
• GOCDB information
• SAM + gstat results
• ticket creation and management tool
• tools for related e-mail
COD Dashboard [slide image]
Escalation Procedure
• defines the steps to be taken during the lifetime of a ticket
• available on the CIC Operations Portal
  • (https://edms.cern.ch/document/701575)
• distinction between sites depending on the amount of resources
Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID
• before step 4:
  • mail to ROC
  • weekly operations meeting call
  • mail to OMC for validation
Escalation Procedure
• site categories
  • low: CPU < 20
  • normal: 20 < CPU < 100
  • high: 100 < CPU
• time between steps 2 and 3, and between steps 3 and 4
  • low + normal: 3 days
  • high: 1 day
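The site categories and escalation intervals above can be sketched in a few lines of Python. This is only an illustration, not part of the COD tooling: the function names are hypothetical, the slide does not say which category a site with exactly 20 or exactly 100 CPUs belongs to (here both are assumed to be "normal"), and the intervals are assumed to be calendar days.

```python
from datetime import date, timedelta


def site_category(cpu_count: int) -> str:
    """Classify a site by its CPU count (thresholds taken from the slide)."""
    if cpu_count < 20:
        return "low"
    if cpu_count > 100:
        return "high"
    # Assumption: the boundary values 20 and 100 are treated as "normal";
    # the slide leaves them unassigned.
    return "normal"


def escalation_interval_days(category: str) -> int:
    """Days allowed between escalation steps 2-3 and 3-4."""
    # low + normal: 3 days, high: 1 day (from the slide).
    return 1 if category == "high" else 3


def next_deadline(mail_sent_on: date, cpu_count: int) -> date:
    """Date by which the site is expected to react to an escalation mail."""
    # Assumption: calendar days; the slide does not mention working days.
    days = escalation_interval_days(site_category(cpu_count))
    return mail_sent_on + timedelta(days=days)
```

For example, `next_deadline(date(2006, 11, 29), 500)` gives the following day for a high-category site, while a site with 10 CPUs would get three days.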
Escalation Procedure [flowchart]
• Create ticket (mail)
• When deadline reached: Problem solved?
  • yes: Close ticket
  • site responds: Extend deadline
  • no: Last escalation?
    • no: Escalate (mail)
    • otherwise: Suspend site (mail)
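One possible reading of the flowchart above, written as a small decision function. This is a hedged sketch: the `Action` enum and `on_deadline` are hypothetical names, the arrows of the original figure did not survive, and the order in which "site responds" and "last escalation" are checked is an assumption. The pre-suspension steps (mail to ROC, weekly operations meeting call, OMC validation) are not modeled.

```python
from enum import Enum


class Action(Enum):
    CLOSE_TICKET = "close ticket"
    EXTEND_DEADLINE = "extend deadline"
    ESCALATE = "escalate"
    SUSPEND_SITE = "suspend site"


def on_deadline(problem_solved: bool,
                site_responded: bool,
                last_escalation: bool) -> Action:
    """Decide what to do with a ticket when its deadline is reached."""
    if problem_solved:
        return Action.CLOSE_TICKET
    # Assumption: a site response is considered before escalating further.
    if site_responded:
        return Action.EXTEND_DEADLINE
    if last_escalation:
        return Action.SUSPEND_SITE
    # Not solved, no response, more escalation steps left: send the next mail.
    return Action.ESCALATE
```

Under these assumptions, an unsolved ticket with no site response moves one escalation step further at each deadline until the last step, where the site is suspended.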
What a site should do
• Look at the monitoring tools (SAM)
  • try to notice & fix failures before the CODs
• COD notification about a failure
  • fix it ASAP
• Scheduled downtime
  • announce it in advance
  • announce when it's finished
• problems → contact the ROC
  • best way: create a ticket
• question → ask the ROC