NICI CDR Software Issues
Ref Section 15 • Concurrency Issues and Error Handling • Porting issues for DHS libraries • Version Control • Network Topology
Concurrency Issues: Terminology • Tightly coupled system: communicates by way of shared memory. • Loosely coupled system: no shared memory available -- must use I/O capability to communicate. • P and V: semaphore operations defined by E. Dijkstra (railroad analogy).
Concurrency Issues and Error Handling: Pitfalls of Concurrent or Threaded Systems (real time, tightly coupled, and MVC GUI) • Shared memory communication: races, dumb data, out-of-sync #include <file.h>. • P and V semaphores: must all refer to the same data structure -- bad scouts. • Partial restart is an elusive goal in tightly coupled systems -- too much shared state!
Concurrency Issues: Equivalent Tools for Synchronization • Primitives: semaphores (P and V -- Dijkstra); guards (monitors -- P. B. Hansen or N. Wirth); message queues (Coffman & Denning). • Message queues are not only the easiest to use but also the best choice for loosely coupled systems! • Unix pipes and TCP/IP sockets are queues!
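As an illustration of the message-queue style recommended above, here is a minimal Python sketch in which two threads share nothing except a queue. The function name `run_pipeline` and the `DONE` sentinel are assumptions made for this example and are not part of the NICI software.

```python
import queue
import threading


def run_pipeline(items):
    """Pass items from a producer thread to a consumer thread via a queue."""
    q = queue.Queue()      # the only channel between the two threads
    received = []          # written by the consumer only, read after join()
    DONE = object()        # sentinel marking the end of the stream

    def producer():
        for item in items:
            q.put(item)
        q.put(DONE)

    def consumer():
        while True:
            item = q.get()             # blocks until a message is available
            if item is DONE:
                break
            received.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

No explicit semaphore appears in the sketch: the queue does the locking internally, which is the sense in which queues are easier to use than P and V.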
Error Types in Loosely Coupled Multi-Processor Systems • Lost messages: detected by numbering the messages; use TCP, not UDP. • Dead process: it stops communicating, which looks like lost communication; it can be harvested and restarted by a watchdog.
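Detecting lost messages by numbering them can be sketched as follows. This is a minimal illustration that assumes each message carries an integer sequence number starting at 0; the class name `SequenceChecker` is hypothetical.

```python
class SequenceChecker:
    """Track the next expected sequence number and report gaps."""

    def __init__(self):
        self.expected = 0

    def check(self, seq):
        """Return the sequence numbers that were skipped before `seq`."""
        missing = list(range(self.expected, seq))   # empty if nothing skipped
        if seq >= self.expected:
            self.expected = seq + 1
        # A number below `expected` is a late or duplicate message:
        # nothing is reported and the expectation is left unchanged.
        return missing
```

For example, after receiving messages 0 and 1, a call to `check(4)` reports that messages 2 and 3 are missing.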
Error Types in Loosely Coupled Multi-Processor Systems (continued) • Socket failure: close, re-open, and on re-connect check whether the other side has re-booted. • Reboot or power fail: in a single-processor OS this makes all sockets fail, so it looks like a socket failure.
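A rough sketch of the close, re-open and check sequence is shown below. It assumes a hypothetical protocol in which the peer reports a boot identifier as the first thing after each connect; the slides do not say how NICI detects a re-boot, so `peer_rebooted` and the boot-identifier idea are illustrative assumptions only.

```python
import socket
import time


def reconnect(address, retries=5, delay=1.0, connect=socket.create_connection):
    """Try to open a fresh connection, retrying after each failure."""
    last_error = None
    for _ in range(retries):
        try:
            return connect(address)
        except OSError as err:          # refused, unreachable, timed out, ...
            last_error = err
            time.sleep(delay)
    raise ConnectionError(f"could not reconnect to {address}") from last_error


def peer_rebooted(previous_boot_id, current_boot_id):
    """Assumed check: the peer rebooted if its boot identifier changed."""
    return previous_boot_id is not None and previous_boot_id != current_boot_id
```

The `connect` parameter exists only so that the retry loop can be exercised without a real network; in normal use the default `socket.create_connection` is called.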
Error Types in Loosely Coupled Multi-Processor Systems • Inability to maintain throughput: exceeding the rated throughput is not a system failure but a system behavior. • Throughput capacity needs to be measured against contractual needs. • NICI's internal buffers are very large and would tolerate periods (10 - 30 minutes) of overload of the back-end storage capacity.
Concurrency Issues and Error Handling -- Lessons Learned • Avoid shared memory communication: dumb data is at the mercy of any code that can see it. • P and V is too low level (like GOTO). • Complete restart is more thorough and reliable than partial restart. • Keep-alive (dummy) communication shortens the time to detect errors but does not increase reliability, and may add overhead to an already highly loaded system.
Successful paradigms for reliability in loosely coupled systems • Reduce the amount of shared state: less data to pass over the socket, less to recover on error, less accounting. • Use client-server where possible: clean separation of function, no duplication of effort as with peer-to-peer. • Treat all errors as equivalent to a communication failure: if the system can recover from the gravest error, it can recover from anything. • Restart should look like reboot/warm boot: simple and reliable recovery of missing state.
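One way to read "force all errors as equivalent to communication" is to funnel every exception into a single error class, so that only one recovery path has to exist. The decorator below is a hedged sketch of that idea; the names `CommunicationFailure` and `as_communication_failure` are invented for this example.

```python
import functools


class CommunicationFailure(Exception):
    """The single error class that the recovery logic has to handle."""


def as_communication_failure(func):
    """Re-raise any exception from `func` as a CommunicationFailure."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except CommunicationFailure:
            raise                               # already in the common form
        except Exception as err:
            # Keep the original error as the cause so it can still be logged.
            raise CommunicationFailure(str(err)) from err
    return wrapper
```

A caller then needs a single `except CommunicationFailure` clause, and the handler can always perform the same restart.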
Error Recovery and Restart for Minimal Down Time -- IC • Detect error condition or command; log it. • If Instrument Level error then shut down AO, Pixel Servers with abort commands. • Enter Error State, wait for recover command from IS (or engineer) • Make the slaves re-start: looking like catastrophic failure.
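The IC recovery steps above can be sketched as a two-state controller. This is an illustration under stated assumptions: the `RUNNING` state name, the class layout, and the `abort()`/`restart()` methods on the slave objects (standing in for the AO and Pixel Servers) are all hypothetical.

```python
import logging
from enum import Enum

log = logging.getLogger("ic")


class State(Enum):
    RUNNING = "running"   # assumed name for normal operation
    ERROR = "error"


class InstrumentController:
    def __init__(self, slaves):
        self.slaves = slaves          # objects offering abort() and restart()
        self.state = State.RUNNING

    def on_error(self, description, instrument_level=True):
        """Log the error, abort the slaves if needed, enter the error state."""
        log.error("error detected: %s", description)
        if instrument_level:
            for slave in self.slaves:
                slave.abort()
        self.state = State.ERROR

    def on_recover(self):
        """Handle a recover command; it is only meaningful in the error state."""
        if self.state is not State.ERROR:
            return False
        for slave in self.slaves:
            slave.restart()           # treated like a catastrophic failure
        self.state = State.RUNNING
        return True
```

Because every recovery restarts the slaves in full, the controller never has to reason about which part of their state survived the error.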
[Figure: Command Object States. State diagram with a command side and an action side. Labels: idle, parser, accept, reject, activity request, active, activity steps, complete, finis, miscommunication, recovered; commands: configure, report or activate; any state, error, recovery, recover command.]
Error Detection Placement • System Calls -- Is the OS performing as desired? • Input messages and Events -- Is the peer process performing as expected? • Watchdogs and Time-outs -- Is it a “Failure to Communicate?” • Internal Assertions -- Did the programmer remember all the boundary conditions?
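The four placements above can be shown together in one small receive routine. This is a sketch, not the NICI code: the function `receive_command`, the result tuples, and the 1024-byte limit are assumptions, and the command names are taken from the Command Object States slide.

```python
import socket

VALID_COMMANDS = {"configure", "report", "activate"}


def receive_command(conn, timeout=2.0):
    """Read one command from a socket and classify what went wrong, if anything."""
    conn.settimeout(timeout)
    try:
        data = conn.recv(1024)
    except socket.timeout:
        return ("timeout", None)        # time-out: failure to communicate
    except OSError as err:
        return ("os_error", str(err))   # system call: the OS reported a failure
    if not data:
        return ("peer_closed", None)    # the peer closed the connection
    # Internal assertion: a boundary condition the programmer relies on.
    assert len(data) <= 1024, "recv returned more than requested"
    command = data.decode("ascii", errors="replace").strip()
    if command not in VALID_COMMANDS:
        return ("bad_message", command)  # input message: peer misbehaving
    return ("ok", command)
```

Each failure comes back as a distinct tag so the caller can log it before taking the common recovery path.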
Error Detection -- IC & Pixel Server • System Calls (S) • Input messages (I) [Figure: block diagram marking where S and I checks are placed. Components: Instrument Controller client, multiple clients, Clocker, PCI device driver, image buffer router, image routers, Pixel Creators, DHS & Quick Look, disk FITS file.]
Error Detection -- IC & Pixel Server • Time-outs (T) • Internal Assertions (I) [Figure: block diagram marking where T and I checks are placed. Components: Instrument Controller client, multiple clients, Clocker, PCI device driver, image buffer router, image routers, Pixel Creators, DHS & Quick Look, disk FITS file.]
Error Recovery and Restart for Minimal Down Time: AO or PS • Detect error condition or abort command; log it. • Stop all I/O and clear all buffers. • Kill self, allow restart of process (use INIT). • New process will enter idle state • If in initial idle state and IC asks for abort, just say OK, reboot not needed
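The AO / Pixel Server behaviour above can be sketched as follows. The class `Server`, its attributes, and the non-zero exit code are assumptions made for illustration; the sketch relies on an external supervisor such as init to start a new process after the exit.

```python
import sys


class Server:
    def __init__(self):
        self.initial_idle = True   # True until the first activity starts
        self.io_active = False
        self.buffers = []

    def start_activity(self):
        self.initial_idle = False
        self.io_active = True

    def on_abort(self, exit_process=sys.exit):
        """Handle an abort command (or a detected error) from the IC."""
        if self.initial_idle:
            return "OK"            # nothing to clean up, restart not needed
        self.io_active = False     # stop all I/O
        self.buffers.clear()       # clear all buffers
        exit_process(1)            # kill self; the supervisor restarts us
```

The `exit_process` parameter exists only so the sketch can be exercised without terminating the interpreter; by default it calls `sys.exit`.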
Development Method and Version Control • Version control via CVS, with Web access via CVSWEB for management visibility. • Releases tagged on a per-feature basis, e.g. TTSM, DOFF, Stress Table, Atm. Corr. • Alpha and beta tags for internal and integration testing releases.
Porting Issues for DHS libraries • DHS porting is under auspices of Gemini Staff and Management • Questions regarding porting issues should be directed to the acknowledged experts.
Network Topology • Maximum visibility of internal addresses for engineering or maintenance use. • Power supply IP addresses visible for remote power control. • Internal net switches to minimize cabling and network interference. • 1 (IS) + 14 IP addresses for NICI on the control network, 2 IP addresses on DHS.
[Figure: IP Network Topology. Diagram labels: IS, Power 1, Power 2, IC, TIC, Clocker (2 IP), E-net SW1, E-net SW2, CTL, 6net, PS1, PS2, AO, DHS. All visible as 1 (IS) + 14 IP on the control network, 2 IP on DHS.]