Experiment: Step by Step
Author: Anna Bekkerman (abekkerm@ecs.umass.edu)
Setup
[Diagram: a Server, Clients, and a Target system made of Nodes, with an LMM shown for each Node. Arrows are labeled "Control signals" and "Data".]
Configuration File
• Describes an experiment
• Nodes
  • IP addresses, types (SOCC node/radar node), etc.
• Commands to start/stop involved processes
• Collected metrics (CPU/memory utilization, etc.)
• Monitored processes
• Net control parameters
  • Delays, drop rates
• Refresh rates
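The slides do not show the actual configuration syntax, so the following is only a sketch of how the listed items could be held in memory and sanity-checked. Every key name, address, and command string here is a hypothetical placeholder, not the RAPIDS format.

```python
# Hypothetical in-memory form of an experiment description.
# All key names and values are illustrative assumptions.
EXPERIMENT = {
    "nodes": [
        {
            "ip": "192.0.2.10",            # documentation-range address
            "type": "socc",
            "start_command": "./start_socc.sh",
            "stop_command": "./stop_socc.sh",
            "metrics": ["cpu", "memory"],
            "monitored_processes": ["socc_main"],
        },
        {
            "ip": "192.0.2.11",
            "type": "radar",
            "start_command": "./start_radar.sh",
            "stop_command": "./stop_radar.sh",
            "metrics": ["cpu"],
            "monitored_processes": ["radar_main"],
        },
    ],
    "net_control": {"delay_ms": 50, "drop_rate": 0.01},
    "refresh_rate_s": 5,
}

REQUIRED_NODE_KEYS = {"ip", "type", "start_command", "stop_command"}


def validate(config):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    for i, node in enumerate(config.get("nodes", [])):
        missing = REQUIRED_NODE_KEYS - node.keys()
        if missing:
            problems.append(f"node {i}: missing {sorted(missing)}")
        if node.get("type") not in ("socc", "radar"):
            problems.append(f"node {i}: unknown type {node.get('type')!r}")
    drop_rate = config.get("net_control", {}).get("drop_rate", 0.0)
    if not 0.0 <= drop_rate <= 1.0:
        problems.append("drop_rate must be between 0 and 1")
    if config.get("refresh_rate_s", 0) <= 0:
        problems.append("refresh_rate_s must be positive")
    return problems
```

Checking the description before any LMM is started keeps a typo in the file from surfacing only after processes are already running on remote nodes.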
Start LMMs
• When started, the RAPIDS server:
  • Grabs two ports:
    • 49162 - to communicate with LMMs
    • 8888 - to communicate with RAPIDS clients
  • Reads a configuration file
  • Starts LMMs on all nodes through SSH connections
  • Waits for ack signals from all LMMs
  • Starts setting LMMs up according to the configuration file
• FIXME: The server will wait indefinitely for the acks from all LMMs. A time-out mechanism should be introduced.
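One way the time-out requested in the FIXME could look is sketched below. It is not the RAPIDS implementation: it assumes acks arrive as node names on a thread-safe queue, and the names `wait_for_acks`, `ack_queue`, and `expected_nodes` are made up for the example.

```python
import queue
import time


def wait_for_acks(ack_queue, expected_nodes, timeout_s):
    """Collect acks until every node has answered or the deadline passes.

    Returns the set of nodes that did NOT ack in time
    (an empty set means every LMM answered).
    """
    pending = set(expected_nodes)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Block only for the time left before the overall deadline.
            node = ack_queue.get(timeout=remaining)
        except queue.Empty:
            break
        pending.discard(node)
    return pending
```

Returning the missing nodes, rather than just a boolean, lets the server report which LMM failed to start instead of hanging silently.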
Set LMMs Up
• A home-made protocol is used to set up LMM parameters
• Examples of commands sent from the server to LMMs:
  • STM - set metric
  • STP - set monitored process
  • STE - set start-up command
  • STT - start
  • SPP - stop
• When a parameter is set, the LMM sends an ack signal back to the server
• At the end of each step, the server waits for acks from all LMMs
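The three-letter codes above are from the slides; how they are framed on the wire is not described. The sketch below assumes a simple "code, space, argument, newline" text framing purely for illustration, and the helper names `encode` and `setup_messages` are hypothetical.

```python
from enum import Enum


class Command(Enum):
    SET_METRIC = "STM"
    SET_MONITORED_PROCESS = "STP"
    SET_STARTUP_COMMAND = "STE"
    START = "STT"
    STOP = "SPP"


def encode(command, argument=""):
    """Build one message. The text framing is an assumption, not the real protocol."""
    return f"{command.value} {argument}".rstrip() + "\n"


def setup_messages(metrics, processes, startup_command):
    """Messages the server might send to one LMM before the start command."""
    messages = [encode(Command.SET_METRIC, m) for m in metrics]
    messages += [encode(Command.SET_MONITORED_PROCESS, p) for p in processes]
    messages.append(encode(Command.SET_STARTUP_COMMAND, startup_command))
    return messages
```

In this sketch the server would send each message, wait for the LMM's ack, and only then move on; `encode(Command.START)` would follow once all LMMs have acknowledged their parameters.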
Start Monitoring
• When LMM receives the start command:
  • If needed, the network control application is started
    • The network control application runs only if iptables is turned on.
    • iptables selects IP packets (as specified in the iptables rules) and queues them for processing by the application.
    • The application introduces delays and/or drops packets according to the settings in the configuration file.
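The per-packet decision described above can be sketched without any packet-queue library. This shows only the delay/drop logic under assumed parameter names (`delay_ms`, `drop_rate`); how the real application receives packets from iptables is not covered here.

```python
import random


def decide(delay_ms, drop_rate, rng=random):
    """Decide the fate of one queued packet.

    Returns ("drop", 0) with probability drop_rate,
    otherwise ("forward", delay_ms).
    """
    if rng.random() < drop_rate:
        return ("drop", 0)
    return ("forward", delay_ms)
```

Passing the random source in as `rng` makes the behaviour reproducible: `decide(50, 0.01, random.Random(42))` always follows the same sequence for a given seed.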
Start Monitoring
• When LMM receives the start command:
  • RAPIDS Message Queues (RMQ) are initialized
    • A mechanism used for communication between RAPIDS and monitored applications.
    • See more in the “RMQ” section.
Start Monitoring
• When LMM receives the start command:
  • Heartbeat applications are started
    • They send “I’m alive” signals from radar nodes to SOCC nodes.
    • If a signal has not been received, RAPIDS reports link failure.
    • FIXME: A timeout mechanism should be added to minimize false alarms.
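A possible shape for the timeout mentioned in the FIXME is sketched below: a link is reported as failed only after several consecutive heartbeat intervals pass in silence. The class name, the `max_missed` tolerance, and the use of explicit timestamps are assumptions made for the example.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per node and flag links that went quiet."""

    def __init__(self, interval_s, max_missed=3):
        # Tolerate max_missed silent intervals before reporting a failure.
        self.timeout_s = interval_s * max_missed
        self.last_seen = {}

    def record(self, node, now):
        """Note that a heartbeat from node arrived at time now (seconds)."""
        self.last_seen[node] = now

    def failed_links(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

With a one-second interval and `max_missed=3`, a single late or dropped heartbeat does not raise an alarm; only more than three seconds of silence does.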
Start Monitoring
• When LMM receives the start command:
  • Processes are started
    • Commands are specified by the user in the configuration file
Start Monitoring
• When LMM receives the start command:
  • “Collection sessions” are started every t seconds
    • According to the refresh rates provided by the user in the configuration file
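Launching a session every t seconds until a stop request arrives can be sketched with a stoppable loop. The function and parameter names are hypothetical, and the sketch assumes the first session runs immediately, which the slides do not specify.

```python
import threading


def run_collection_sessions(collect, period_s, stop_event):
    """Call collect() every period_s seconds until stop_event is set."""
    while not stop_event.is_set():
        collect()
        # wait() returns early as soon as a stop is requested,
        # so shutdown does not have to sit out a full period.
        stop_event.wait(period_s)
```

Using `threading.Event.wait` instead of `time.sleep` is the design choice here: another thread can call `stop_event.set()` and the loop ends promptly.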
Collection Session
• During each collection session LMM:
  • Collects metrics
  • Reads events accumulated in RMQ
  • Sends the metrics and events to the RAPIDS server
• More details in the “LMM” section
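The three steps of a session can be sketched as one function. Here a standard `queue.Queue` stands in for RMQ and JSON stands in for the report format; both are assumptions, as are the names `drain`, `read_metrics`, and `send`.

```python
import json
import queue


def drain(events):
    """Remove and return everything currently in the queue, without blocking."""
    drained = []
    while True:
        try:
            drained.append(events.get_nowait())
        except queue.Empty:
            return drained


def collection_session(read_metrics, events, send):
    """Collect metrics, read accumulated events, and send both in one report."""
    report = {"metrics": read_metrics(), "events": drain(events)}
    send(json.dumps(report))
    return report
```

Draining the queue in one go means events that arrive during the send are simply picked up by the next session.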
Stop Monitoring
• When the server is stopped, it sends stop commands to all LMMs
• Upon receiving the stop signal, LMM:
  • Stops launching collection sessions
  • Stops processes
    • Using the commands specified by the user in the configuration file
  • Stops heartbeat applications
  • Deletes RMQ
  • Stops network control applications
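The stop sequence mirrors the start sequence in reverse order. One way to get that ordering automatically is a last-in, first-out teardown stack, sketched here with `contextlib.ExitStack`; the step names are just log labels, and nothing in the slides says RAPIDS is built this way.

```python
from contextlib import ExitStack


def run_experiment(log):
    """Start the components in order; ExitStack stops them in reverse."""
    steps = [
        "network control",
        "RMQ",
        "heartbeat",
        "processes",
        "collection sessions",
    ]
    with ExitStack() as stack:
        for name in steps:
            log.append(f"start {name}")
            # Register the matching teardown right after each start.
            stack.callback(log.append, f"stop {name}")
        log.append("monitoring")
    return log
```

Because each teardown is registered immediately after its start succeeds, a failure halfway through start-up still unwinds exactly the components that were actually started.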
What Might Go Wrong?
• If an “untrappable” signal (SIGKILL or SIGSTOP) is used to kill the server, the shut-down procedure from “Stop Monitoring” will not be executed!
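For contrast, trappable signals such as SIGINT and SIGTERM can be caught so that the shut-down procedure still runs. The sketch below assumes a POSIX system and a hypothetical `shutdown` callback; it is not taken from the RAPIDS server.

```python
import signal

# SIGKILL and SIGSTOP are deliberately absent: the operating system
# does not allow a process to install handlers for them.
TRAPPABLE = (signal.SIGINT, signal.SIGTERM)


def install_shutdown_handlers(shutdown):
    """Run shutdown() when the process receives SIGINT or SIGTERM.

    Must be called from the main thread, as signal.signal requires.
    """
    def handler(signum, frame):
        shutdown()

    for sig in TRAPPABLE:
        signal.signal(sig, handler)
```

In practice this means stopping the server with Ctrl-C or a plain `kill <pid>` (which sends SIGTERM) rather than `kill -9`.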
What Might Go Wrong?
• If the commands provided by the user do not stop all processes, the LMM will hang waiting for their termination.
• While an LMM is hanging, the port used for communication with the server remains unreleased, so a new experiment cannot be started until the LMMs are stopped and all necessary clean-up procedures have been completed.
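A bounded wait is one possible mitigation for the hang described above. The sketch assumes the process was started through `subprocess.Popen`, so a handle is available; `stop_with_deadline` and `grace_s` are hypothetical names, and the slides do not say that RAPIDS does this.

```python
import subprocess
import sys


def stop_with_deadline(proc, grace_s):
    """Ask proc to terminate; force-kill it if still alive after grace_s seconds."""
    proc.terminate()                 # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                  # SIGKILL on POSIX
        proc.wait()                  # reap the child so no zombie is left
        return "killed"


if __name__ == "__main__":
    # Demo with a child that would otherwise sleep for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_with_deadline(child, grace_s=2.0))
```

Either way the function returns within a bounded time, so the port held by the LMM can be released and the next experiment can start.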
What Might Go Wrong?
• When the server is stopped, it sends stop commands to all LMMs
• Upon receiving the stop signal, LMM:
  • Stops launching collection sessions
  • Stops processes
    • Using the commands specified by the user in the configuration file
  • Stops heartbeat applications
  • Deletes RMQ
  • Stops network control applications
• FIXME:
  • These applications do not always react to the termination signal properly.
  • Symptom: sometimes a number of zombie processes appear.