Glenn Patrick, Rutherford Appleton Laboratory. GridPP22, 1st April 2009
Key Words: Disaster Planning, Resilience & Performance, Response
ALICE Workflows (c/o Kors Bos, CHEP 2009). Workflow diagram: Tier-0 (CAF, CASTOR) – calibration & alignment, express stream analysis, prompt reconstruction; storage hypervisor – xrootd global redirector. Tier-1 centres (T1 AF) – RAW re-processing; simulation and analysis if free resources. Tier-2 centres (T2 AF) – simulation, analysis.
ALICE • Loss of custodial data and T2 data. Both would be handled by restoration from existing replicas. In general, loss of T2 data is less critical, as the ALICE Computing Model keeps 3 replicas of ESDs and AODs – mainly a loss of analysis resources until the data are restored. Loss of custodial data (RAW) is more critical, as only the original + 1 replica is kept, and would need higher priority. • Compute/storage loss. The affected services would be excluded from production activities at the level of the central AliEn services. Response to the incident and all remedial actions would be co-ordinated by the ALICE Grid team in collaboration with the technical and management groups of the affected centre. c/o Cristina Lazzeroni
ALICE • Massive procurement failure. Fair-share use of Grid computing resources is maintained through an internal prioritisation system at the central AliEn level, so the loss of a fraction of computing resources would not be directly visible to end users. • Extended outage of Tier 1 (> 5 days). Short-term changes – stop the replication of RAW and divert traffic to other T1 centres; stop the processing of RAW and discontinue using the centre as a target for custodial storage of ESDs/AODs; discontinue T1-to-T2 data replication (may affect availability of ESD/AOD at T2). Changes are made at the level of the AliEn central services, so users are not directly affected, although processing capacity will be reduced. Highest restoration priority = MSS and replication services. Users informed through mailing lists. A sketch of how such re-replication might be prioritised after a site loss follows below.
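A minimal sketch of the kind of re-replication prioritisation described above, assuming a hypothetical catalogue dump that lists each file's type and replica sites; the function and data layout are illustrative only and are not the actual AliEn interfaces.

```python
# Illustrative sketch only: prioritise re-replication after a site loss.
# The catalogue format and helper names are hypothetical, not AliEn APIs.

def rereplication_queue(catalogue, lost_site):
    """Order files for re-replication: RAW (custodial, original + 1 replica)
    before ESD/AOD (3 replicas), and fewest surviving copies first."""
    queue = []
    for entry in catalogue:                      # e.g. {"lfn": ..., "type": "RAW", "replicas": [...]}
        surviving = [s for s in entry["replicas"] if s != lost_site]
        if len(surviving) == len(entry["replicas"]):
            continue                             # file did not live at the lost site
        # RAW gets higher priority (0) than ESD/AOD (1); fewer copies sorts first.
        priority = (0 if entry["type"] == "RAW" else 1, len(surviving))
        queue.append((priority, entry["lfn"], surviving))
    queue.sort()
    return queue

if __name__ == "__main__":
    demo = [
        {"lfn": "/alice/raw/run1234/file1", "type": "RAW", "replicas": ["CERN", "RAL"]},
        {"lfn": "/alice/esd/run1234/file1", "type": "ESD", "replicas": ["CERN", "RAL", "CNAF"]},
    ]
    for prio, lfn, sites in rereplication_queue(demo, lost_site="RAL"):
        print(prio, lfn, "remaining at", sites)
```

The ordering simply encodes the policy above: RAW (original + 1 replica) is recovered before ESD/AOD (3 replicas), and files with the fewest surviving copies go first.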
MINOS • Loss of custodial data and T2 data. Use of the Tier 1 is limited to MC production – little user analysis. MC data are shipped directly to FNAL, which also holds the master copies of the software and of the data input to the MC. A data loss at the UK T1 would only lose the small amount of data awaiting transfer, along with about 200 GB of input data which would be retransferred from FNAL. • Compute/storage loss. For a short-term loss, would just wait for the system to come back up – one MC production run takes of order months. For a longer-term loss, would look to move production elsewhere. • Massive procurement failure. Would look into alternative facilities. c/o Philip Rodrigues
MINOS • Extended outage of Tier 1 (> 5 days). Again, an outage of days would not change much, but once the outage moved into weeks MINOS would start to consider alternative facilities. The small number of users makes it easy to communicate changes.
MICE MICE resilience plan in preparation. • Loss of CPU at Tier 1 would interfere with the ability to tune the beam and could reduce efficiency by as much as 20-30% during the beam-tuning phase. • Loss of the ability to store data on tape would mean data taking coming to a halt once local storage was exhausted (about 4 days). This could be countered by copying data to multiple Tier 2 sites, unless the disaster takes out the network (a fallback sketch follows below). • Network loss would mean inability to analyse data (the T1 is not used for analysis) and inability to store data at Tier 2 centres. • Hence network access is the highest priority, followed by the ability to write to tape. c/o Paul Kyberd
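A rough sketch of the fallback order implied by the MICE bullets above, assuming a hypothetical copy_to() transfer helper and placeholder site names; none of these are real MICE tools or endpoints.

```python
# Sketch of the data-taking fallback described above; copy_to() is a
# hypothetical stand-in for whatever transfer tool the experiment uses.

T1_TAPE = "castor://ral-tier1/mice/raw"                 # placeholder endpoint
T2_SITES = ["t2-site-a", "t2-site-b", "t2-site-c"]      # placeholder names

def store_raw_file(path, copy_to):
    """Try the Tier-1 tape store first; if it is unavailable, spread copies
    over Tier-2 sites so data taking can continue past the ~4 days of local disk."""
    if copy_to(path, T1_TAPE):
        return "tape"
    stored_at = [site for site in T2_SITES if copy_to(path, site)]
    if stored_at:
        return "tier2:" + ",".join(stored_at)
    # Neither route worked (e.g. the network itself is down): the file stays
    # on local disk and the shift crew is alerted that storage is filling up.
    raise RuntimeError("no remote copy made for %s" % path)
```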
SiD Detector for ILC. LOI studies at T1: 41M events – Simulation, Reconstruction, LeptonID, Vertexing.
SiD • Loss of custodial data and T2 data. At the moment SiD only has two T1 centres, SLAC and RAL, and no real T2 structure yet. The recovery strategy would be to copy all data from SLAC to RAL, which is bandwidth limited (see the estimate sketched below). • Compute/storage loss. As there are not enough resources for other centres to take over, SiD would simply lose compute power; at this point it would probably take longer to co-ordinate a backup strategy than to wait for recovery of services. In the case of a foreseeable long-term loss, the response would be co-ordinated locally, assuming SLAC has no free reserves to absorb the extra demand. The highest priority would be to recover storage, as this is the bottleneck for the VO. c/o Jan Strube
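To see why the SLAC-to-RAL copy is bandwidth limited, a back-of-envelope estimate can help; the dataset size and link speed below are illustrative placeholders, not SiD figures.

```python
# Back-of-envelope transfer-time estimate; the numbers are illustrative only.

def transfer_days(dataset_tb, usable_gbit_per_s):
    """Days needed to copy dataset_tb terabytes over a sustained usable
    bandwidth of usable_gbit_per_s gigabits per second."""
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (usable_gbit_per_s * 1e9)
    return seconds / 86400.0

# Example with placeholder values: 100 TB over a sustained 1 Gbit/s
# takes roughly 9 days, i.e. any recovery plan is dominated by the copy.
print("%.1f days" % transfer_days(100, 1.0))
```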
SiD • Massive procurement failure. Excellent question, but no useful response at the moment. • Extended outage of Tier 1 (> 5 days). An extended outage would be pretty devastating: SiD would have to recover data from the tape store at SLAC and ship it to computing centres with enough resources on conventional farms. At the moment this is limited only by storage throughput. Loss of the UK T1 would cause a considerable delay in work – taking the recent LOI effort as an example, roughly half of the benchmarking analyses would not have finished before the deadline. The SiD Collaboration wishes to thank the RAL T1 team for all their help in the recent studies.
SuperNEMO • Loss of custodial data and T2 data. Currently not using the T1 and mainly using T2s for cache storage, so data loss would have little impact on VO operations. • Compute/storage loss. Response channelled through the VO Admin. The implication is a massive slowdown of activities, as SuperNEMO would need to fall back on local clusters. • Massive procurement failure. Not yet thought about this scenario. • Extended outage of Tier 1 (> 5 days). Currently relying only on the WMS at the UK T1. Short term – fall back to WMSes in France for immediate activities (a failover sketch follows below). Long term – establish support from WMSes at other centres. The main communication channel is the GridPP/TB Support mailing lists, then the VO Admin, then users; some users may also be subscribed to the GridPP-USERS list. c/o Gianfranco Sciacca
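A minimal sketch of the WMS fallback described above, assuming a hypothetical submit_via() wrapper around the real submission client; the endpoint URLs are placeholders only.

```python
# Sketch of falling back through an ordered list of WMS endpoints.
# submit_via() is a hypothetical wrapper around the real submission client;
# the endpoint URLs below are placeholders, not actual service addresses.

WMS_ENDPOINTS = [
    "https://wms.uk-t1.example:7443/wmproxy",   # UK T1 (primary, placeholder)
    "https://wms.fr.example:7443/wmproxy",      # French fallback (placeholder)
]

def submit_with_fallback(jdl_file, submit_via):
    """Try each WMS in order; return (endpoint, job_id) for the first success."""
    last_error = None
    for endpoint in WMS_ENDPOINTS:
        try:
            job_id = submit_via(jdl_file, endpoint)
            return endpoint, job_id
        except Exception as err:                # endpoint down or refusing jobs
            last_error = err
    raise RuntimeError("all WMS endpoints failed: %s" % last_error)
```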
H1 Provided a list of UK site problems. • Scheduling time is too long for H1 (Oxford). After 6 hours in the scheduled state, jobs are cancelled and resubmitted; concern over VO priority. • Specific scheduling of jobs into the running state (RAL, Birmingham). Certain queues show “one-by-one” or “two-by-two” submission. • Bad sites list (Brunel, Birmingham, RAL). Includes missing libraries, jobs “forever running”, etc. • SRM/LFC catalogue problem (QMUL, Oxford, IC). The LFC entry exists, but the physical file does not; this only seems to be a problem at UK sites (a consistency-scan sketch follows below). After a few such cases, the site is deleted from the experiment’s list of queues. c/o Dave Sankey
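The SRM/LFC problem above (a catalogue entry with no physical file behind it) could in principle be caught by a periodic consistency scan. A sketch follows, assuming hypothetical lfc_list_replicas() and srm_file_exists() helpers rather than any real client bindings.

```python
# Sketch of a catalogue-vs-storage consistency scan for the "LFC entry
# exists but physical file does not" problem. lfc_list_replicas() and
# srm_file_exists() are hypothetical helpers, not real client bindings.

def find_dark_entries(lfns, lfc_list_replicas, srm_file_exists):
    """Return catalogue entries whose every registered replica is missing on storage."""
    dark = []
    for lfn in lfns:
        replicas = lfc_list_replicas(lfn)            # list of SURLs for this LFN
        if replicas and not any(srm_file_exists(surl) for surl in replicas):
            dark.append(lfn)                         # candidate for cleanup or re-transfer
    return dark
```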
Final Comments (CMS, ATLAS ... and one of “The Others”). Although LHC has priority, it is important to remember that “The Others” actually exist... “The Others” have very limited manpower resources to deal with disasters and to “fire-fight”...