Unix Watchdog in EPICS

Presentation Transcript


  1. Unix Watchdog in EPICS
     Ken Brobeck, Jingchen Zhou, Ron Chestnut
     EPICS Collaboration Mtg June 18-20, 2003

  2. Motivation
     • Many Unix hosts supporting production
     • System folk and application folk in different worlds
     • Some failures require immediate attention
     • Some less critical failures go unnoticed for days

  3. Requirements
     • One set of configuration files only (CPU, MEMORY, DISK, QUOTA, PING, PROCESS)
     • Application scripts create:
       • EPICS database templates
       • EPICS alarm handler configurations
     • System scripts use configuration files to drive monitoring

  4. Piece of PROCESS configuration

     TEMPLATE microname:PROC:@@@:PSSTATUS
     #
     # monitoring processes running on production nodes
     #
     # HOST     Node:Process Account REGEX
     opi00gtw00 XVFB         ROOT    'Xvfb'
     opi00gtw00 CMLOG        CMLOG   'cmlogServer'
     opi00gtw00 CMD          MCCOPS  'cmdSrv'
     opi00gtw01 XVFB         ROOT    'Xvfb'
     opi00gtw01 CMD          MCCOPS  'cmdSrv'
     opi00gtw01 SLCW45       MCCOPS  'CUD_LUM_SLCW45'
     opi00gtw02 XVFB         ROOT    'Xvfb'
     opi00gtw02 CMD          MCCOPS  'cmdSrv'
     opi00gtw02 GTWPUB       CDDEV   'gateway.*pep2pubpvs'
     opi00gtw02 GTWPEP2      CDDEV   'gateway.*pep2peppvs'
     opi00gtw02 CWP2RF       CDDEV   'ChannelWatcher.*P2RF'
     opi00gtw02 CWPEP        CDDEV   'ChannelWatcher.*PEPII'
     opi00gtw02 CWNLC        CDDEV   'ChannelWatcher.*NLCTA'
     opi00gtw02 CWPACK       CDDEV   'ChannelWatcher.*PACK'
     opi00gtw02 ALHPEP       CDDEV   'alhSLAC.*pepii.*pepii'
     opi00gtw02 ALHP2RF      CDDEV   'alhSLAC.*pepii.*p2rf'
     opi00gtw02 ALHNLC       CDDEV   'alhSLAC.*pepii.*tarf'
     opi00gtw04 CMD          CDDEV   'cmdSrv'
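
The PROCESS configuration on this slide is what the system scripts read to drive monitoring. The Python below is only a rough sketch of that idea, not the scripts used at SLAC: the names `ProcessEntry`, `parse_process_config` and `missing_processes` are hypothetical, the `ps` lines are assumed to look like "<user> <command line>", and the account comparison is assumed to be case-insensitive.

```python
import re
from typing import NamedTuple


class ProcessEntry(NamedTuple):
    host: str     # node name, e.g. opi00gtw02
    name: str     # short process tag that ends up in the PV name
    account: str  # account expected to own the process
    pattern: str  # regular expression matched against the command line


def parse_process_config(text):
    """Return (template, entries) parsed from PROCESS-style configuration text."""
    template = ""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("TEMPLATE"):
            template = line.split(None, 1)[1]
            continue
        host, name, account, quoted = line.split(None, 3)
        entries.append(ProcessEntry(host, name, account, quoted.strip("'")))
    return template, entries


def missing_processes(entries, host, ps_lines):
    """Names of processes configured for `host` that no ps line matches."""
    missing = []
    for entry in entries:
        if entry.host != host:
            continue
        found = False
        for ps_line in ps_lines:
            user, _, command = ps_line.partition(" ")
            # Assumption: accounts are compared case-insensitively.
            if user.upper() == entry.account and re.search(entry.pattern, command):
                found = True
                break
        if not found:
            missing.append(entry.name)
    return missing


SAMPLE = """\
TEMPLATE microname:PROC:@@@:PSSTATUS
#
# HOST Node:Process Account REGEX
opi00gtw02 XVFB ROOT 'Xvfb'
opi00gtw02 CMD MCCOPS 'cmdSrv'
opi00gtw02 GTWPUB CDDEV 'gateway.*pep2pubpvs'
"""

template, entries = parse_process_config(SAMPLE)
# Made-up process listing: cmdSrv is deliberately absent.
ps_output = ["root /usr/bin/Xvfb :1", "cddev gateway -prefix pep2pubpvs"]
missing = missing_processes(entries, "opi00gtw02", ps_output)
print(missing)  # ['CMD']
```

In a real deployment the result would be written to the corresponding PSSTATUS channel rather than printed; how that write is done is not shown in the talk.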

  5. Piece of db template

     file gpbApp/Db/uwd_process.db {
     pattern { HN   FN   PN      VN }
             { CS00 PROC XVFB    PSSTATUS }
             { CS00 PROC CMLOG   PSSTATUS }
             { CS00 PROC CMD     PSSTATUS }
             { CS01 PROC XVFB    PSSTATUS }
             { CS01 PROC CMD     PSSTATUS }
             { CS01 PROC SLCW45  PSSTATUS }
             { CS02 PROC XVFB    PSSTATUS }
             { CS02 PROC CMD     PSSTATUS }
             { CS02 PROC GTWPUB  PSSTATUS }
             { CS02 PROC GTWPEP2 PSSTATUS }
             { CS02 PROC CWP2RF  PSSTATUS }
             { CS02 PROC CWPEP   PSSTATUS }
             { CS02 PROC CWNLC   PSSTATUS }
             { CS02 PROC CWPACK  PSSTATUS }
             { CS02 PROC ALHPEP  PSSTATUS }
             { CS02 PROC ALHP2RF PSSTATUS }
             { CS02 PROC ALHNLC  PSSTATUS }
             { CS04 PROC CMD     PSSTATUS }
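
The db template on this slide is one of the files the application scripts create from the configuration. As a hedged illustration, the Python below builds a substitution file of the same shape from (host, process) pairs. The functions `host_to_hn` and `substitution_file` are hypothetical, and the rule that the last two digits of the node name give the HN macro (opi00gtw02 becomes CS02) is inferred from the two slides, not stated in the talk.

```python
def host_to_hn(host):
    """Map a node name to the HN macro, e.g. opi00gtw02 -> CS02 (assumed rule)."""
    return "CS" + host[-2:]


def substitution_file(db_path, entries, fn="PROC", vn="PSSTATUS"):
    """Build substitution-file text with one pattern row per (host, process)."""
    lines = [f"file {db_path} {{", "pattern { HN FN PN VN }"]
    for host, name in entries:
        lines.append(f"{{ {host_to_hn(host)} {fn} {name} {vn} }}")
    lines.append("}")  # close the file block
    return "\n".join(lines)


entries = [("opi00gtw00", "XVFB"), ("opi00gtw02", "GTWPUB")]
text = substitution_file("gpbApp/Db/uwd_process.db", entries)
print(text)
```

Generating the rows from the same entries that drive the monitoring scripts is what keeps the database and the monitors from drifting apart.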

  6. Piece of ALH configuration

     GROUP Gateway2 PROC
     $GUIDANCE
     This group monitors individual processes
     $END
     CHANNEL PROC CS02:PROC:CHK:STAT ---T-
     $ALIAS Process Monitor Deadman
     $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "PROCESS script not running on Gateway2" ${UWD_SYSTEMS}
     $GUIDANCE
     This channel checks the script for monitoring this function
     $END
     CHANNEL PROC CS02:PROC:XVFB:PSSTATUS ---T-
     $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "XVFB process not running on Gateway2" ${UWD_RECIPIENTS}
     $GUIDANCE http://www.slac.stanford.edu/grp/cd/soft/unix/xvfb.html
     CHANNEL PROC CS02:PROC:CMD:PSSTATUS ---T-
     $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "CMD process not running on Gateway2" ${UWD_RECIPIENTS}
     $GUIDANCE http://www.slac.stanford.edu/grp/cd/soft/share/cmdSrv/index.html
     $COMMAND ssh -x -f -T opi00gtw02 '$CD_APP/cmdSrv/script/st.cmdSrv.prod'
     CHANNEL PROC CS02:PROC:GTWPUB:PSSTATUS ---T-
     $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "GTWPUB process not running on Gateway2" ${UWD_RECIPIENTS}
     $GUIDANCE http://www.slac.stanford.edu/grp/cd/soft/share/slaconly/gateway/index.html
     $COMMAND ssh -x -f -T opi00gtw02 'cd /nfs/nas1/log/gateway/pub; source gateway.restart'
     CHANNEL PROC CS02:PUB:ACTIVE ---T-
     $SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | mailx -s "PUB gateway running amok on Gateway2" ${UWD_RECIPIENTS}
     $GUIDANCE
     Gateway is running amok - too many active PVs
     $END
     CHANNEL PROC CS02:PUB:EXIST_TR ---T-
     $GUIDANCE
     Gateway is suspicious - existence test rate high
     $END
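
The per-process CHANNEL entries on this slide all follow one pattern, which is what makes them easy to generate from the configuration. The Python below is a minimal sketch of such a generator. The function `alh_channel` and its parameters are hypothetical; the output format is copied from the slide, and the `$GUIDANCE` and `$COMMAND` lines, which differ per process, are left out.

```python
def alh_channel(group, pv, label, host_label,
                mask="---T-", recipients="${UWD_RECIPIENTS}"):
    """Build the CHANNEL and $SEVRCOMMAND lines for one monitored process."""
    subject = f"{label} process not running on {host_label}"
    return "\n".join([
        f"CHANNEL {group} {pv} {mask}",
        '$SEVRCOMMAND UP_MAJOR echo "UWD Major alarm" | '
        f'mailx -s "{subject}" {recipients}',
    ])


entry = alh_channel("PROC", "CS02:PROC:XVFB:PSSTATUS", "XVFB", "Gateway2")
print(entry)
```

Passing a different `recipients` value (for example `${UWD_SYSTEMS}`, as the deadman channel on the slide does) is one way the per-class mail lists could be selected.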

  7. Alarm Handler screen

  8. Other features
     • Special DB to avoid alarming on short CPU usage peaks
     • Heartbeats from each monitoring process verify rest of data
     • ALH Process buttons restart processes on any host if necessary
     • Flexible mail lists for different classes of problems (Utilities, System, EPICS tools)
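
The first two features on this slide can be illustrated with a small sketch. The talk implements the CPU peak filter as an EPICS database, which is not shown; the Python below only demonstrates the underlying idea under assumed numbers. `SustainedAlarm`, `heartbeat_stale`, the 90% limit, the three-sample hold and the 60-second maximum age are all hypothetical.

```python
class SustainedAlarm:
    """Report an alarm only after `hold` consecutive samples above `limit`."""

    def __init__(self, limit, hold):
        self.limit = limit
        self.hold = hold
        self.count = 0  # consecutive samples seen above the limit

    def update(self, value):
        # Any sample at or below the limit resets the run, so short peaks
        # never reach the hold count.
        self.count = self.count + 1 if value > self.limit else 0
        return self.count >= self.hold


def heartbeat_stale(last_beat, now, max_age):
    """True when the monitoring script has not updated its heartbeat in time."""
    return (now - last_beat) > max_age


cpu = SustainedAlarm(limit=90.0, hold=3)
samples = [95, 50, 92, 93, 97, 40]
states = [cpu.update(s) for s in samples]
print(states)  # [False, False, False, False, True, False]

print(heartbeat_stale(last_beat=100.0, now=130.0, max_age=60.0))  # False
print(heartbeat_stale(last_beat=100.0, now=200.0, max_age=60.0))  # True
```

When the heartbeat is stale, the other values published by that monitoring script should not be trusted, which is why the heartbeat is checked before the rest of the data.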

  9. Successes
     • More people understand more of the infrastructure
     • Less “us and them” between system and applications folk
     • Several people get mail on critical items
     • Less critical items do not go unnoticed
     • People can’t get away with cowboy releases (all see when processes are restarted)
