ATLAS Site Status Board Automatic queue exclusion based on downtimes. ATLAS site topology Site exclusion algorithm Test results First real exclusion and recovery. C. Borrego, S. Campana, S. Gayazov, A. DiGirolamo, X. Espinal, E. Magradze, L. Rinaldi, J. Schovancova, G. Stewart, M. Wrigth
ATLAS Site Status Board
Automatic queue exclusion based on downtimes
• ATLAS site topology
• Site exclusion algorithm
• Test results
• First real exclusion and recovery
C. Borrego, S. Campana, S. Gayazov, A. DiGirolamo, X. Espinal, E. Magradze, L. Rinaldi, J. Schovancova, G. Stewart, M. Wrigth
atlas-adc-ssb-devs@cern.ch
21st Feb 2012

ATLAS site topology
• Based on information from AGIS and Schedconfig
• Mapping between the various ATLAS site naming conventions
  • AGIS (based on GOCDB/OIM), Panda, DDM
  • Populated “exception file”
• ATLAS site-oriented topology
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
• ATLAS Panda queue-oriented topology
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues.json
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues_dict.json
• In touch with the Pilot factory monitoring developers to get the mapping between queues and resources as the Pilot factories see it
  • Will enable us to map ANALY queues to CE downtimes

Site exclusion
• Queue exclusion based on the downtime of an SE, CE (, LFC)
• The exclusion tool underwent thorough testing before it was put into production for the first queues
[Diagram: GOCDB and OIMDB feed site downtime information into AGIS. The DDM exclusion collector fetches SE downtimes from AGIS; the site exclusion collector fetches SE/CE/LFC downtimes from AGIS. Example events: Site A SE downtime starts → Site A SE excluded; Site B SE downtime over → Site B SE recovered; Site C SE downtime starts → Site C CEs excluded; Site E CE(s) downtime starts → Site E CE(s) excluded; Site D LFC downtime starts → Site D SE(s) and CE(s) excluded. Legend: in production / in testing phase. Diagram dated 18 Oct 2011.]

Site exclusion algorithm
• Fetch ongoing and future downtimes from AGIS
• Map downtimes from sites to queues (topology!)
  • SRM downtime: act on every queue type (ANALY, prod)
  • CE downtime: act only on prod queues
• Decide the exclusion/recovery action, considering
  • time of downtime
  • queue type (production, analysis, “special”)
  • current queue status
  • current queue comment
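
The mapping step above can be sketched as follows. This is a minimal illustration, not the SSB code: the `Downtime` class, the `TOPOLOGY` dictionary layout, the site and queue names, and the service labels `"SRM"` and `"CE"` are all hypothetical. The slides do not state a rule for LFC downtimes, so the sketch takes no action for them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Downtime:
    site: str         # site name as used in the topology (hypothetical)
    service: str      # "SRM" or "CE" (hypothetical labels)
    start: datetime
    end: datetime


# Hypothetical topology: site -> list of (queue name, queue type).
TOPOLOGY = {
    "SITE_A": [("SITE_A-prod", "production"), ("ANALY_SITE_A", "analysis")],
}


def affected_queues(downtime, topology):
    """Return the names of the queues a downtime should act on.

    SRM downtime: every queue type. CE downtime: production queues only.
    Any other service: no queues (assumption, not stated in the slides).
    """
    queues = topology.get(downtime.site, [])
    if downtime.service == "SRM":
        return [name for name, _ in queues]
    if downtime.service == "CE":
        return [name for name, qtype in queues if qtype == "production"]
    return []
```

Keeping the topology lookup separate from the exclusion decision means the per-queue rules only ever see a queue and the downtime that applies to it.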

Exclusion of a production queue
• 12 hr in advance of a downtime:
  • setoffline with comment “set.offline.by.SSB” if the queue is:
    • Online with any possible comment
    • Brokeroff with comment “set.brokeroff.by.SSB”
    • Test with comment “HC.Test.Me”
  • Otherwise do not touch that queue!
• When the downtime starts:
  • Make sure that the queue is set offline when appropriate
  • See the rules above for the interval T-12h .. T
• End of downtime, or downtime disappears (recovery):
  • settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
  • Otherwise do not touch that queue!
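
The production-queue rules above can be written as a single decision function. This is a sketch under stated assumptions: the function name, its signature, the `(start, end)` tuple for the downtime, and exact string matching on the comment are all hypothetical choices, not taken from the SSB implementation.

```python
from datetime import datetime, timedelta, timezone

PROD_LEAD = timedelta(hours=12)  # exclusion starts 12 h before the downtime


def production_action(now, downtime, status, comment):
    """Return (command, new_comment), or None to leave the queue untouched.

    downtime is a (start, end) tuple, or None if no downtime is known.
    Comments are compared for exact equality (assumption).
    """
    if downtime is not None:
        start, end = downtime
        if start - PROD_LEAD <= now < end:
            eligible = (
                status == "online"
                or (status == "brokeroff" and comment == "set.brokeroff.by.SSB")
                or (status == "test" and comment == "HC.Test.Me")
            )
            return ("setoffline", "set.offline.by.SSB") if eligible else None
    # No downtime within the window: recover only queues that SSB set offline.
    if status == "offline" and comment == "set.offline.by.SSB":
        return ("settest", "HC.Test.Me")
    return None
```

Because recovery keys on the comment “set.offline.by.SSB”, a queue that somebody set offline by hand with a different comment is never touched.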

Exclusion of an analysis queue
• 6 hr in advance of a downtime:
  • setbrokeroff with comment “set.brokeroff.by.SSB” if the queue is:
    • Online with any possible comment
    • Brokeroff with comment “set.brokeroff.by.SSB”
    • Offline with comment “set.offline.by.SSB”
  • Otherwise do not touch that queue!
• 2 hr in advance of a downtime and during the downtime:
  • setoffline with comment “set.offline.by.SSB” if the queue is:
    • Online with any possible comment
    • Brokeroff with comment “set.brokeroff.by.SSB”
    • Test with comment “HC.Test.Me”
  • Otherwise do not touch that queue!
• End of downtime, or downtime disappears (recovery):
  • settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
  • Otherwise do not touch that queue!
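
The two-stage analysis-queue rules lend themselves to a table-driven form, sketched below. The rule tables mirror the bullets above literally; the names `analysis_action`, `matches`, the `ANY` wildcard, and the `(start, end)` downtime tuple are hypothetical, and comments are again assumed to match exactly.

```python
from datetime import datetime, timedelta, timezone

BROKEROFF_LEAD = timedelta(hours=6)  # brokeroff from T-6h
OFFLINE_LEAD = timedelta(hours=2)    # offline from T-2h until the downtime ends
ANY = None                           # wildcard: any comment

# (status, comment) pairs from which each action may be taken.
BROKEROFF_FROM = {("online", ANY), ("brokeroff", "set.brokeroff.by.SSB"),
                  ("offline", "set.offline.by.SSB")}
OFFLINE_FROM = {("online", ANY), ("brokeroff", "set.brokeroff.by.SSB"),
                ("test", "HC.Test.Me")}
RECOVER_FROM = {("offline", "set.offline.by.SSB")}


def matches(rules, status, comment):
    """True if the (status, comment) pair is covered by the rule set."""
    return (status, ANY) in rules or (status, comment) in rules


def analysis_action(now, downtime, status, comment):
    """Return (command, new_comment), or None to leave the queue untouched."""
    if downtime is not None:
        start, end = downtime
        if start - OFFLINE_LEAD <= now < end:
            ok = matches(OFFLINE_FROM, status, comment)
            return ("setoffline", "set.offline.by.SSB") if ok else None
        if start - BROKEROFF_LEAD <= now < start - OFFLINE_LEAD:
            ok = matches(BROKEROFF_FROM, status, comment)
            return ("setbrokeroff", "set.brokeroff.by.SSB") if ok else None
    # Outside any exclusion window: recover queues that SSB set offline.
    if matches(RECOVER_FROM, status, comment):
        return ("settest", "HC.Test.Me")
    return None
```

Taken literally, the rules re-issue setbrokeroff for a queue that is already brokeroff with the SSB comment; the sketch keeps that behaviour rather than guessing at an optimisation.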

Testing the exclusion idea - 1
• Assembled test data:
  • 2 flavours of production queues (only 1 enabled),
  • 2 flavours of analysis queues (only 1 enabled)
• The phase space of queue status contains every possible combination of [queue type, queue status, queue comment]:
  • FAKE_QUEUE_TYPES (x) FAKE_QUEUE_PREFIXES (x) FAKE_STATES (x) FAKE_COMMENTS, where
  • FAKE_QUEUE_TYPES=[DEFAULT_QUEUE_TYPE_PRODUCTION, DEFAULT_QUEUE_TYPE_ANALYSIS, DEFAULT_QUEUE_TYPE_SPECIAL]
  • FAKE_QUEUE_PREFIXES={DEFAULT_QUEUE_TYPE_PRODUCTION: ['testsite-testsitece02-at2testsite-pbs_test', 'testsite-testsitece03-at2testsite-pbs_test'], DEFAULT_QUEUE_TYPE_ANALYSIS: ['ANALY', 'ANALY2'], DEFAULT_QUEUE_TYPE_SPECIAL: ['SPECIAL1', 'SPECIAL2']}
  • FAKE_STATES=['online', 'offline', 'test', 'brokeroff']
  • FAKE_COMMENTS=['', 'dummy', 'set.offline.by.SSB', 'set.offline.by.SSB.dummy', 'set.brokeroff.by.SSB', 'set.brokeroff.by.SSB.dummy', 'set.online.by.SSB', 'set.online.by.SSB.dummy', 'HC.Test.Me', 'HC.Test.Me.dummy']
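
The phase space above can be enumerated with a Cartesian product. In this sketch the list and dictionary contents are copied from the slide, while the string values given to the three `DEFAULT_QUEUE_TYPE_*` constants and the `phase_space` helper are assumptions. It enumerates both flavours of every queue type and ignores the "only 1 enabled" restriction.

```python
from itertools import product

# Assumed values; the slides only show the constant names.
DEFAULT_QUEUE_TYPE_PRODUCTION = "production"
DEFAULT_QUEUE_TYPE_ANALYSIS = "analysis"
DEFAULT_QUEUE_TYPE_SPECIAL = "special"

FAKE_QUEUE_TYPES = [DEFAULT_QUEUE_TYPE_PRODUCTION,
                    DEFAULT_QUEUE_TYPE_ANALYSIS,
                    DEFAULT_QUEUE_TYPE_SPECIAL]
FAKE_QUEUE_PREFIXES = {
    DEFAULT_QUEUE_TYPE_PRODUCTION: ['testsite-testsitece02-at2testsite-pbs_test',
                                    'testsite-testsitece03-at2testsite-pbs_test'],
    DEFAULT_QUEUE_TYPE_ANALYSIS: ['ANALY', 'ANALY2'],
    DEFAULT_QUEUE_TYPE_SPECIAL: ['SPECIAL1', 'SPECIAL2'],
}
FAKE_STATES = ['online', 'offline', 'test', 'brokeroff']
FAKE_COMMENTS = ['', 'dummy',
                 'set.offline.by.SSB', 'set.offline.by.SSB.dummy',
                 'set.brokeroff.by.SSB', 'set.brokeroff.by.SSB.dummy',
                 'set.online.by.SSB', 'set.online.by.SSB.dummy',
                 'HC.Test.Me', 'HC.Test.Me.dummy']


def phase_space():
    """Yield every (queue type, prefix, state, comment) combination."""
    for qtype in FAKE_QUEUE_TYPES:
        # Prefixes depend on the queue type, so the product is taken per type.
        for prefix, state, comment in product(
                FAKE_QUEUE_PREFIXES[qtype], FAKE_STATES, FAKE_COMMENTS):
            yield qtype, prefix, state, comment


cases = list(phase_space())
print(len(cases))  # 3 types x 2 prefixes x 4 states x 10 comments = 240
```

The ".dummy" comment variants are what make such a test useful: they check that a near-match of an SSB comment is not mistaken for the real one.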

Testing the exclusion idea - 2
• “Dashboard” with the timeline for each queue class from the phase space
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.html
• Log with a detailed description of the actions
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.log
• Test downtimes:
  • SRM: from 2012-02-05 23:30 UTC to 2012-02-06 02:00 UTC
  • SRM: from 2012-02-06 04:30 UTC to 2012-02-06 06:00 UTC
  • SRM: from 2012-02-07 04:30 UTC to 2012-02-07 06:00 UTC
  • CE: for each queue, 2012-02-06 from 08:00 to 09:00 UTC
• The exclusion algorithm does what is expected and when it is expected!

Real actions
• After thorough testing and improvements to the logging and debugging features for operations, we started taking real actions for several queues
  • https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/33952
• The exclusion tool does what is expected and when it is expected!
  • Tested with ifae and UKI-SCOTGRID-DURHAM, which have downtimes today.
  • Next in the pipeline is SFU-LCG2.

Operational experience - 1
• Every action is logged, so it is easier to debug what went wrong if a problem occurs.
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher.log
• Found a few minor issues along the way
  • Fetched only future downtimes from AGIS.
    • Fixed. Now fetching ongoing and future downtimes.
  • Disabled all real queues for the past night
    • Fixed. Now all queues from elog:33952 are enabled again.
• The exclusion tool takes only the actions we intend it to take!
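
The first issue above, fetching only future downtimes, can be illustrated with two filters over a list of `(start, end)` pairs. This is not the AGIS query itself: the function names and the data layout are hypothetical, and the sample data are the three SRM test downtimes listed in this talk.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# The three SRM test downtimes from the test slides, as (start, end) in UTC.
DOWNTIMES = [
    (datetime(2012, 2, 5, 23, 30, tzinfo=UTC), datetime(2012, 2, 6, 2, 0, tzinfo=UTC)),
    (datetime(2012, 2, 6, 4, 30, tzinfo=UTC), datetime(2012, 2, 6, 6, 0, tzinfo=UTC)),
    (datetime(2012, 2, 7, 4, 30, tzinfo=UTC), datetime(2012, 2, 7, 6, 0, tzinfo=UTC)),
]


def future_only(downtimes, now):
    """Select downtimes that have not started yet; misses ongoing ones."""
    return [(s, e) for s, e in downtimes if s > now]


def ongoing_and_future(downtimes, now):
    """Select downtimes that have not ended yet, whether started or not."""
    return [(s, e) for s, e in downtimes if e > now]


now = datetime(2012, 2, 6, 5, 0, tzinfo=UTC)  # in the middle of the second downtime
print(len(future_only(DOWNTIMES, now)))         # 1: only the 2012-02-07 downtime
print(len(ongoing_and_future(DOWNTIMES, now)))  # 2: the ongoing one is kept too
```

With the first filter, a downtime drops out of the list the moment it starts, which a recovery rule would read as "downtime disappears".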

Operational experience - 2
• Found a few minor issues along the way
  • Fetched only future downtimes from AGIS.
    • Fixed. Now fetching ongoing and future downtimes.
  • Disabled all real queues for the past night
    • Fixed. Now all queues from elog:33952 are enabled again.
• The exclusion tool takes only the actions we intend it to take!

Summary
• Using ATLAS site topology
  • http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
• First real exclusions and recoveries successful!
• Next steps:
  • Add more queues to real actions
  • Add more configurability (now: system-wide)
Questions? atlas-adc-ssb-devs@cern.ch