Test 1: Do No Harm

Purpose: To establish that the warfighter-visible behavior of the DD(X) Total Ship Computing Environment (TSCE) is never worse using the ARMS Multi-layer Resource Manager (MLRM) than it is using the baseline DD(X) static allocation mechanisms.

Terminology: A feasible statically defined configuration is one that the baseline DD(X) system might deploy onto the currently operational hardware in the current system mode.

Test Scenario 1A: The TSCE is operating in a normal mode. An application fault occurs such that some statically defined configuration is feasible for the post-fault resources. During the operation of the MLRM to recover from the fault, a failure in the MLRM is induced artificially. Observe whether the MLRM detects the fault, automatically selects a feasible statically defined configuration, and deploys it.
Metrics:
• Does the MLRM deploy a static configuration? (Boolean)
• Time between the occurrence of the fault and restored operation using the statically defined configuration.
Threshold: The DD(X) requirement for fault recovery as defined in the TBD

Test Scenario 1B: The TSCE is operating in an MLRM-determined configuration following one or more application faults; this configuration is distinct from any statically defined configuration, but at least one statically defined configuration is feasible on the currently operational resources. A human operator signals the command to fall back to a static configuration. Observe whether the MLRM automatically selects a feasible statically defined configuration and deploys it.
Metrics:
• Does the MLRM deploy a static configuration? (Boolean)
• Time between the issuance of the command and restored operation using the statically defined configuration.
Threshold: The DD(X) requirement for fault recovery as defined in the TBD
Test 1: Do No Harm
Elaborated Scenario 1A

• The TSCE is operating in a normal mode.
  • Two resource pools (as defined in the DD(X) Release 3 System Acceptance Test Plan)
  • At least one mission critical application string is replicated across the two pools
  • At least one mission support application string is deployed without replication into the pool with the replica for the mission critical application string
    • This deployment represents an MLRM optimization
    • The static allocation placed the primary mission critical string and the mission support string in the same pool. We configure MLRM to split them across different pools as a load balancing optimization
• An application fault occurs such that some statically defined configuration is feasible for the post-fault resources.
  • The resource pool containing the mission support application string is failed catastrophically
  • The original static allocation remains valid after this failure
• During the operation of the MLRM to recover from the fault, a failure in the MLRM is induced artificially.
  • MLRM detects that the mission support string is lost and initiates a redeployment into the remaining pool
  • A Resource Allocator is artificially forced to take too long to generate a new allocation
    • This represents the cost of a really difficult allocation problem without requiring the careful configuration of all the workload generators and resource pools to construct such a problem
• Observe whether the MLRM detects the fault, automatically selects a feasible statically defined configuration, and deploys it.
  • Allocation Execution Time Condition Monitor detects that allocation time is exceeding the threshold and raises an MLRM Allocation Fault event
  • Allocation Fault event Response Coordinator terminates the dynamic allocation request and directs the “best fit” static allocation to be used
  • Resource Allocator receives the asynchronous direction, aborts dynamic allocation, and selects the best fit static allocation
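The timeout-and-fallback behavior observed in Scenario 1A can be illustrated with a minimal Python sketch. It is not the MLRM implementation: the names `STATIC_ALLOCATION`, `dynamic_allocator`, and `allocate_with_fallback`, the pool names, and the use of a thread with a cooperative abort flag are all assumptions made for illustration. The join timeout plays the role of the Allocation Execution Time Condition Monitor, and setting the abort flag stands in for the Response Coordinator's asynchronous direction.

```python
import threading

# Hypothetical static placement: appstring -> pool (names are illustrative).
STATIC_ALLOCATION = {"mission_critical": "pool_a", "mission_support": "pool_a"}


def dynamic_allocator(abort, work_s, result):
    """Stand-in Resource Allocator that needs work_s seconds to finish."""
    # Event.wait returns True if the abort direction arrives before work_s elapses.
    if abort.wait(timeout=work_s):
        return  # dynamic allocation aborted, nothing is produced
    result["allocation"] = {"mission_critical": "pool_a", "mission_support": "pool_a"}


def allocate_with_fallback(work_s, threshold_s):
    """Return (kind, allocation), where kind is 'dynamic' or 'static'."""
    abort = threading.Event()
    result = {}
    worker = threading.Thread(target=dynamic_allocator, args=(abort, work_s, result))
    worker.start()
    # Condition monitor: allow the allocator at most threshold_s seconds.
    worker.join(timeout=threshold_s)
    if worker.is_alive():
        # Allocation fault: direct the allocator to abort and use the static allocation.
        abort.set()
        worker.join()
        return "static", dict(STATIC_ALLOCATION)
    return "dynamic", result["allocation"]
```

Calling `allocate_with_fallback(work_s=2.0, threshold_s=0.05)` models the artificially slow allocator and yields the static allocation, while a fast allocator that finishes inside the threshold yields the dynamic one.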
Test 1: Do No Harm
Experiment Plan Scenario 1A

Data to Be Measured / Logged
• Deployment Completion Time (for each appstring): when an appstring is fully deployed and executing normally
• Pool Failure Time: when the failure is induced (not when detected)
• Pool Failure Detection Time: when MLRM detects the pool failure
• MLRM Allocation Failure Time: when MLRM detects that dynamic allocation has exceeded the threshold
• Allocation Decision Log (static and dynamic)
• Host Deployment Log (for each appstring)

Approach to Compute Test Metrics
• Metric 1 (boolean): Static Allocation Deployed
  • Deployment Completion Time, Allocation Decision Log, and Host Deployment Log for the restored mission support appstring and the surviving mission critical appstring will be used to prove that the static allocation was used
• Metric 2 (time): Capability Restoration Time
  • The metric is the difference between Deployment Completion Time for the restored mission support appstring and Pool Failure Time

Envisioned Test-bed Environment
• Variation of the Phase I, GM#2 Pool Failure Experiment
  • 3 nodes per pool
  • 2 pools
  • 4 application strings
• Conduct test in Emulab environment

Data Analysis
• Metrics can be directly computed from measured experimental data
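As a sketch of how Metric 2 for Scenario 1A could be computed from the logged timestamps, the Python fragment below subtracts Pool Failure Time from the Deployment Completion Time of the restored mission support appstring. The record layout (`ScenarioLog`), the appstring key, and the threshold parameter are hypothetical; the plan leaves the actual DD(X) threshold as TBD, so the caller must supply it.

```python
from dataclasses import dataclass


@dataclass
class ScenarioLog:
    """Hypothetical record of the logged times, all in seconds on one clock."""
    pool_failure_time: float            # when the failure is induced
    pool_failure_detection_time: float  # when MLRM detects the pool failure
    allocation_failure_time: float      # when dynamic allocation exceeds the threshold
    deployment_completion_time: dict    # appstring name -> completion time


def capability_restoration_time(log, appstring="mission_support"):
    """Metric 2: completion time of the restored appstring minus Pool Failure Time."""
    return log.deployment_completion_time[appstring] - log.pool_failure_time


def detection_latency(log):
    """Supporting figure: delay between inducing and detecting the pool failure."""
    return log.pool_failure_detection_time - log.pool_failure_time


def meets_threshold(log, threshold_s):
    """Compare Metric 2 with a caller-supplied threshold (the real value is TBD)."""
    return capability_restoration_time(log) <= threshold_s
```

Measuring from the induced failure rather than from detection means the metric includes detection latency, which matches the plan's note that Pool Failure Time is recorded when the failure is induced.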
Test 1: Do No Harm
Elaborated Scenario 1B

• The TSCE is operating in an MLRM-determined configuration following one or more application faults; this configuration is distinct from any statically defined configuration, but at least one statically defined configuration is feasible on the currently operational resources.
  • Two resource pools (as defined in the DD(X) Release 3 System Acceptance Test Plan)
  • At least one mission critical application string is replicated across the two pools
  • At least one mission support application string is deployed without replication into the pool with the replica for the mission critical application string
    • This deployment represents an MLRM optimization
    • The static allocation placed the primary mission critical string and the mission support string in the same pool. We configure MLRM to split them across different pools as a load balancing optimization
• A human operator signals the command to fall back to a static configuration.
  • A Static Allocation Reconfiguration Event is generated
• Observe whether the MLRM automatically selects a feasible statically defined configuration and deploys it.
  • Manual Reconfiguration Condition Monitor receives the Static Allocation Reconfiguration Event
  • Static Allocation Fallback Response Coordinator directs all affected Resource Allocators to reconfigure back to the static allocation
  • Each Resource Allocator reconfigures to the static allocation
    • Determines deviations between the current configuration and the static configuration
    • Suspends string processing
    • Terminates execution of applications that deviate from the static allocation (including cleaning up resource allocations)
    • Re-deploys terminated applications in the static configuration (including making new resource allocations)
    • Restarts string processing
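The reconfiguration sequence each Resource Allocator follows in Scenario 1B can be sketched as a small planning function. This is an illustration under stated assumptions, not the ARMS code: `plan_fallback`, the dictionary representation of a configuration (application name mapped to pool name), and the step strings are all hypothetical.

```python
def plan_fallback(current, static):
    """Return (deviating_apps, ordered_steps) for falling back to the static allocation.

    Both arguments are hypothetical mappings of application name -> pool name.
    """
    # Determine deviations between the current and the static configuration.
    deviating = sorted(app for app in static if current.get(app) != static[app])

    steps = ["suspend string processing"]
    # Terminate only applications that are actually running in the wrong place.
    steps += [f"terminate {app} on {current[app]}" for app in deviating if app in current]
    # Re-deploy every deviating application where the static allocation puts it.
    steps += [f"redeploy {app} on {static[app]}" for app in deviating]
    steps.append("restart string processing")
    return deviating, steps
```

For the elaborated scenario, where only the mission support string was moved by the MLRM optimization, the plan contains a single terminate step and a single redeploy step between the suspend and restart steps; applications already placed as in the static allocation are left untouched.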
Test 1: Do No Harm
Experiment Plan Scenario 1B

Data to Be Measured / Logged
• Deployment Completion Time (for each appstring): when an appstring is fully deployed and executing normally
• Static Allocation Reconfiguration Event Time: when the warfighter direction to fall back is given
• Allocation Decision Log (static and dynamic)
• Host Deployment Log (for each appstring)

Approach to Compute Test Metrics
• Metric 1 (boolean): Static Allocation Deployed
  • Deployment Completion Time, Allocation Decision Log, and Host Deployment Log for the restored mission support appstring and the surviving mission critical appstring will be used to prove that the static allocation was used
• Metric 2 (time): Static Allocation Reconfiguration
  • The metric is the difference between Deployment Completion Time for the restored mission support appstring and Static Allocation Reconfiguration Event Time

Envisioned Test-bed Environment
• Variation of the Phase I, GM#2 Pool Failure Experiment
  • 3 nodes per pool
  • 2 pools
  • 4 application strings
• Conduct test in Emulab environment

Data Analysis
• Metrics can be directly computed from measured experimental data
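One way the boolean Metric 1 might be derived from the logs is sketched below in Python. The log formats are assumptions: the Allocation Decision Log is taken to be a chronological list of entries with a `kind` field ("static" or "dynamic"), and the Host Deployment Log a chronological list of entries naming an `appstring` and the `pool` it was deployed to. The plan does not specify either format.

```python
def final_placement(host_log):
    """Replay the (assumed chronological) Host Deployment Log; the last entry wins."""
    placement = {}
    for entry in host_log:
        placement[entry["appstring"]] = entry["pool"]
    return placement


def static_allocation_deployed(decision_log, host_log, static_config):
    """Metric 1: True if the last decision was static and the hosts match it."""
    if not decision_log:
        return False
    used_static = decision_log[-1]["kind"] == "static"
    return used_static and final_placement(host_log) == static_config


def reconfiguration_time(completion_time, event_time):
    """Metric 2 for Scenario 1B: completion time minus reconfiguration event time."""
    return completion_time - event_time
```

Checking both logs guards against two different failures: a static decision that was never carried out on the hosts, and a host placement that happens to match the static configuration but was produced by a dynamic decision.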