Unix Watchdog in EPICS
Ken Brobeck, Jingchen Zhou, Ron Chestnut
EPICS Collaboration Mtg June 18-20, 2003

Motivation
• Many Unix hosts supporting production
• System folk and application folk in different worlds
• Some failures require immediate attention
• Some less critical failures go unnoticed for days

Requirements
• One set of configuration files only (CPU, MEMORY, DISK, QUOTA, PING, PROCESS)
• Application scripts create
  • EPICS database templates
  • EPICS alarm handler configurations
• System scripts use configuration files to drive monitoring

Piece of PROCESS configuration
    TEMPLATE microname:PROC:@@@:PSSTATUS
    #
    # monitoring processes running on production nodes
    #
    # HOST      Node:Process  Account  REGEX
    opi00gtw00  XVFB          ROOT     'Xvfb'
    opi00gtw00  CMLOG         CMLOG    'cmlogServer'
    opi00gtw00  CMD           MCCOPS   'cmdSrv'
    opi00gtw01  XVFB          ROOT     'Xvfb'
    opi00gtw01  CMD           MCCOPS   'cmdSrv'
    opi00gtw01  SLCW45        MCCOPS   'CUD_LUM_SLCW45'
    opi00gtw02  XVFB          ROOT     'Xvfb'
    opi00gtw02  CMD           MCCOPS   'cmdSrv'
    opi00gtw02  GTWPUB        CDDEV    'gateway.*pep2pubpvs'
    opi00gtw02  GTWPEP2       CDDEV    'gateway.*pep2peppvs'
    opi00gtw02  CWP2RF        CDDEV    'ChannelWatcher.*P2RF'
    opi00gtw02  CWPEP         CDDEV    'ChannelWatcher.*PEPII'
    opi00gtw02  CWNLC         CDDEV    'ChannelWatcher.*NLCTA'
    opi00gtw02  CWPACK        CDDEV    'ChannelWatcher.*PACK'
    opi00gtw02  ALHPEP        CDDEV    'alhSLAC.*pepii.*pepii'
    opi00gtw02  ALHP2RF       CDDEV    'alhSLAC.*pepii.*p2rf'
    opi00gtw02  ALHNLC        CDDEV    'alhSLAC.*pepii.*tarf'
    opi00gtw04  CMD           CDDEV    'cmdSrv'
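
To illustrate how a system script might use this file to drive monitoring, here is a minimal Python sketch. It is not the original SLAC script: the function names, the shape of the `ps_lines` input (pairs of account and command line) and the case-insensitive account comparison are all assumptions made for the example.

```python
import re
import shlex

# Two lines copied from the PROCESS configuration, plus its header.
SAMPLE_CONFIG = """\
TEMPLATE microname:PROC:@@@:PSSTATUS
#
# HOST Node:Process Account REGEX
opi00gtw00 XVFB ROOT 'Xvfb'
opi00gtw02 GTWPUB CDDEV 'gateway.*pep2pubpvs'
"""


def parse_process_config(text):
    """Return (template, entries); each entry is (host, name, account, regex)."""
    template = None
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # also strips the quotes around the regex
        if fields[0] == "TEMPLATE":
            template = fields[1]
        elif len(fields) == 4:
            entries.append(tuple(fields))
    return template, entries


def process_status(entries, host, ps_lines):
    """Map process name -> True/False for one host.

    ps_lines is a list of (account, command_line) pairs, e.g. gathered
    from `ps` output (hypothetical input format).  Accounts are compared
    case-insensitively, which is an assumption of this sketch.
    """
    status = {}
    for entry_host, name, account, regex in entries:
        if entry_host != host:
            continue
        pattern = re.compile(regex)
        status[name] = any(
            user.lower() == account.lower() and pattern.search(command)
            for user, command in ps_lines
        )
    return status
```

A script on each host could call `process_status` periodically and write each result to the PV built from the TEMPLATE line.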

Piece of db template
    file gpbApp/Db/uwd_process.db {
    pattern { HN FN PN VN }
    { CS00 PROC XVFB PSSTATUS }
    { CS00 PROC CMLOG PSSTATUS }
    { CS00 PROC CMD PSSTATUS }
    { CS01 PROC XVFB PSSTATUS }
    { CS01 PROC CMD PSSTATUS }
    { CS01 PROC SLCW45 PSSTATUS }
    { CS02 PROC XVFB PSSTATUS }
    { CS02 PROC CMD PSSTATUS }
    { CS02 PROC GTWPUB PSSTATUS }
    { CS02 PROC GTWPEP2 PSSTATUS }
    { CS02 PROC CWP2RF PSSTATUS }
    { CS02 PROC CWPEP PSSTATUS }
    { CS02 PROC CWNLC PSSTATUS }
    { CS02 PROC CWPACK PSSTATUS }
    { CS02 PROC ALHPEP PSSTATUS }
    { CS02 PROC ALHP2RF PSSTATUS }
    { CS02 PROC ALHNLC PSSTATUS }
    { CS02 PROC CMD PSSTATUS }
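
The rows of this substitution file line up one-to-one with the PROCESS configuration entries, so an application script could plausibly generate them as sketched below. The functions are hypothetical, and the host-to-micro mapping (for example `opi00gtw02` to `CS02`) is inferred from comparing the two slides rather than taken from a documented rule, so it is passed in explicitly.

```python
def substitution_rows(entries, host_to_micro, function="PROC", variable="PSSTATUS"):
    """Build one pattern row per (host, name, account, regex) entry.

    host_to_micro is a caller-supplied dict; the mapping itself is an
    assumption of this sketch, not something the slides define.
    """
    rows = []
    for host, name, _account, _regex in entries:
        fields = [host_to_micro[host], function, name, variable]
        rows.append("{ " + " ".join(fields) + " }")
    return rows


def substitution_file(db_path, rows):
    """Wrap the rows in a `file ... { pattern {...} ... }` block."""
    lines = ["file %s {" % db_path, "pattern { HN FN PN VN }"]
    lines.extend(rows)
    lines.append("}")
    return "\n".join(lines) + "\n"
```

For example, `substitution_rows([("opi00gtw00", "XVFB", "ROOT", "Xvfb")], {"opi00gtw00": "CS00"})` yields the first row shown on the slide.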

Piece of ALH configuration
    GROUP Gateway2 PROC
    $GUIDANCE
    This group monitors individual processes
    $END
    CHANNEL PROC CS02:PROC:CHK:STAT ---T-
    $ALIAS Process Monitor Deadman
    $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "PROCESS script not running on Gateway2" ${UWD_SYSTEMS}
    $GUIDANCE
    This channel checks the script for monitoring this function
    $END
    CHANNEL PROC CS02:PROC:XVFB:PSSTATUS ---T-
    $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "XVFB process not running on Gateway2" ${UWD_RECIPIENTS}
    $GUIDANCE http://www.slac.stanford.edu/grp/cd/soft/unix/xvfb.html
    CHANNEL PROC CS02:PROC:CMD:PSSTATUS ---T-
    $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "CMD process not running on Gateway2" ${UWD_RECIPIENTS}
    $GUIDANCE http://www.slac.stanford.edu/grp/cd/soft/share/cmdSrv/index.html
    $COMMAND ssh -x -f -T opi00gtw02 '$CD_APP/cmdSrv/script/st.cmdSrv.prod'
    CHANNEL PROC CS02:PROC:GTWPUB:PSSTATUS ---T-
    $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "GTWPUB process not running on Gateway2" ${UWD_RECIPIENTS}
    $GUIDANCE http://www.slac.stanford.edu/grp/cd/soft/share/slaconly/gateway/index.html
    $COMMAND ssh -x -f -T opi00gtw02 'cd /nfs/nas1/log/gateway/pub; source gateway.restart'
    CHANNEL PROC CS02:PUB:ACTIVE ---T-
    $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "PUB gateway running amok on Gateway2" ${UWD_RECIPIENTS}
    $GUIDANCE
    Gateway is running amok - too many active PVs
    $END
    CHANNEL PROC CS02:PUB:EXIST_TR ---T-
    $GUIDANCE
    Gateway is suspicious - existence test rate high
    $END
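
The per-process CHANNEL entries above follow a regular pattern, which suggests how an application script might emit them. The sketch below reproduces that pattern for one process. The function name and parameters are hypothetical, and only the three line types seen on the slide (CHANNEL, $SEVRCOMMAND, single-line $GUIDANCE) are produced.

```python
def alh_channel(micro, name, group_label,
                recipients_var="UWD_RECIPIENTS", guidance_url=None):
    """Return the ALH config lines for one monitored process.

    micro, name and group_label correspond to values such as "CS02",
    "XVFB" and "Gateway2" on the slide; the parameter names themselves
    are assumptions of this sketch.
    """
    pv = f"{micro}:PROC:{name}:PSSTATUS"
    lines = [
        f"CHANNEL PROC {pv} ---T-",
        # On a transition up to MAJOR severity, mail the recipient list.
        '$SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s '
        f'"{name} process not running on {group_label}" ${{{recipients_var}}}',
    ]
    if guidance_url:
        lines.append(f"$GUIDANCE {guidance_url}")
    return lines


if __name__ == "__main__":
    for line in alh_channel(
        "CS02", "XVFB", "Gateway2",
        guidance_url="http://www.slac.stanford.edu/grp/cd/soft/unix/xvfb.html",
    ):
        print(line)
```

Keeping the recipient list as an environment variable reference, as the slide does, lets the mail list change without regenerating the configuration.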

Alarm Handler screen
[screenshot]

Other features
• Special DB to avoid alarming on short CPU usage peaks
• Heartbeats from each monitoring process verify rest of data
• ALH Process buttons restart processes on any host if necessary
• Flexible mail lists for different classes of problems (Utilities, System, EPICS tools)
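
The first two features can be illustrated with a small Python sketch. The slides implement the CPU peak filtering in an EPICS database whose logic is not shown, so what follows is only one possible reading of the idea: alarm when usage stays high for several consecutive samples, and distrust the data when a heartbeat is too old. The function names, thresholds and sample counts are all hypothetical.

```python
def sustained_high(samples, threshold, min_consecutive):
    """True only if the latest min_consecutive samples all exceed threshold.

    A single short peak therefore does not raise an alarm.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def heartbeat_stale(last_beat, now, max_age):
    """True if the monitoring script's last heartbeat is older than max_age.

    Times are in seconds; a stale heartbeat means the other values
    published by that script should not be trusted.
    """
    return (now - last_beat) > max_age
```

With a threshold of 90 and `min_consecutive=2`, the samples `[10, 95, 20]` do not alarm, while `[10, 95, 97]` do.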

Successes
• More people understand more of the infrastructure
• Less “us and them” between system and applications folk
• Several people get mail on critical items
• Less critical items do not go unnoticed
• People can’t get away with cowboy releases (all see when processes are restarted)