IV&V Lessons Learned: Mars Exploration Rovers and the Spirit SOL-18 Anomaly: NASA IV&V Involvement
August 2004
Kenneth Costello, Senior IV&V Manager
NASA IV&V Facility, 100 University Dr, Fairmont, West Virginia 26508
Kenneth.A.Costello@nasa.gov, 304 367 8343
Introduction
• Purpose
  • This is an informational presentation that gives a quick overview of IV&V and presents some lessons learned for IV&V from the Mars Exploration Rover (MER) project
• Agenda
  • Overview of NASA IV&V
  • Background on IV&V involvement with the MER program
  • IV&V issues related to the system memory and file system
  • IV&V Lessons Learned
  • Summary
What is NASA IV&V?
• NASA IV&V is a program managed by the Safety and Mission Assurance Office
• The program is delegated to the Goddard Space Flight Center and is managed from the NASA IV&V Facility in Fairmont, West Virginia
  • Facility was dedicated in 1994
  • Focus on unmanned missions began around 2000
• The program has two main roles for the Agency
  • The first role is to provide an Independent V&V capability and ensure mission software readiness for critical projects focused around risk and safety
  • The second role is to enhance software readiness by providing IV&V domain expertise to Projects to identify issues/defects and propose possible solutions
Scope Determination: Software Integrity Level Assessment Process
• For each software component, two values are rated:
  • Criticality (factors labeled C1-C3 in the diagram), from the rating categories Human Safety, Asset Safety, and Performance; 1-5 score where 5 is highest criticality
  • Error Potential (factors labeled E1-E3 in the diagram), from the rating categories Development Organization, Development Process, and Software Characteristics; 1-5 score where 5 is highest error potential
• Each factor has an associated weight applied before being combined into a final value
• Final values are plotted on a 5x5 matrix for reference
• Values are also used to cross-reference a task matrix
• Task selection is from a standardized list of tasks
• Allocation is based on the criticality value and on the error potential value individually
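The scoring described on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not say how the weighted factors are combined, so the weighted average below is an assumption, and the weights, scores, and the names `Factor`, `combined_value`, and `matrix_cell` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    name: str
    score: int      # 1-5, where 5 is highest
    weight: float   # hypothetical weight; the slide gives no values


def combined_value(factors):
    """Weighted average of factor scores (the combination rule is an assumption)."""
    for f in factors:
        if not 1 <= f.score <= 5:
            raise ValueError(f"{f.name}: score must be between 1 and 5")
    total_weight = sum(f.weight for f in factors)
    return sum(f.score * f.weight for f in factors) / total_weight


def matrix_cell(criticality, error_potential):
    """Round each value to the nearest row/column of the 5x5 reference matrix."""
    return round(criticality), round(error_potential)


# Illustrative scores only; these are not the MER assessment results.
criticality = combined_value([
    Factor("Human Safety", 1, 0.2),
    Factor("Asset Safety", 5, 0.4),
    Factor("Performance", 4, 0.4),
])
error_potential = combined_value([
    Factor("Development Organization", 2, 1.0),
    Factor("Development Process", 2, 1.0),
    Factor("Software Characteristics", 4, 1.0),
])
cell = matrix_cell(criticality, error_potential)
print(cell)
```

The two values are kept separate rather than merged into a single number because, per the slide, task allocation looks at the criticality value and the error potential value individually.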
IV&V Lifecycle Flow
[Figure: life cycle flow diagram. Labels recoverable from the diagram: Concept Phase, System Requirements, Software Planning, Software Requirements, Design, Implementation, Validation Testing, Maintenance, and Simulator/Environment/Hardware, with Verification marked against the development phases.]
• IV&V in phase with development (not testing only)
• Focused activity at the earliest point
  • System requirements and software role important
  • Issues are introduced at lowest level
• Covers all levels of testing
  • Ensure that system meets the needs of the mission
• Later life cycle activity still important
  • Issues are still introduced at lowest level
  • Focused more on individual components
• IV&V support continues over initial operational phase
IV&V Activities for MER
• Initial assessment of the MER project was performed in June 2001
  • The assessment noted that the file system was a very critical portion of the flight software (FSW); however, the scores for the technology being used and the maturity of the software indicated low risk
  • Some portions were rated as high complexity
  • Overall, the file system software was within the IV&V scope, though at a low level
• Initial estimate of the IV&V resources was 9-10 full-time equivalents (FTEs)
  • The MER Project had not budgeted for that level of IV&V resources
  • Final IV&V resources were 4-5 FTEs
• The reduction in resources necessitated changes in the approach to IV&V
  • Goal was to cover the MER FSW to a reasonable depth so that the IV&V Team could feel comfortable supporting launch and operational readiness reviews for the project
  • Tasking was “pulled up” to a higher level than normal – analysis applied at a complete FSW level rather than at a software component level
• An additional issue was the limited number of FSW requirement artifacts
Summary of Spirit Sol-18 System Memory Consumption
• Sol 18
  • 9:00 LST – The planned DTE HGA communication session began.
  • ~9:11 LST – Event Reports were received indicating that uplink errors were occurring. Downlink was spotty.
  • ~9:16 LST – The signal was lost, ~14 minutes earlier than expected.
  • 11:20 LST – Commanded a 30-minute high priority HGA communication session. No signal was seen.
  • 12:45 LST – Commanded an LGA beep. The beep occurred as predicted (start and duration).
  • 16:18 LST – Odyssey UHF pass over Spirit; no carrier seen.
• Sol 19
  • 1:45 LST – The MGS UHF communications session lasted only 2 minutes and 20 seconds. It started at the correct time, but only a repeating pseudo-noise code was present in the data.
  • 4:39 LST – No early morning UHF communication session with the Odyssey spacecraft (no signal or data).
  • 9:00 LST – No morning HGA DTE communication session. No signal or data were detected.
  • 11:00 LST – Looked for 10 bps LGA DTE communication session initiated by a system fault protection response. No signal was seen.
  • 14:40 LST – Commanded beep at 7.8125 bps. Beep was seen!
  • 15:24 – No afternoon UHF communication session with the Odyssey spacecraft (no signal or data).
  • 15:27 – Attempted to command an LGA DTE communication session. No signal or data were received.
• A system-level fault had occurred on Sol 19 that put the rover in a degraded communication state and allowed some commanding
• Eventually, JPL was able to determine that the FSW was in a continuous delayed reset loop. The first reset seemed to occur during the Sol 18 morning DTE session, coincident with an actuator checkout
• Both commanded and autonomous shutdowns were failing, and the vehicle probably had not shut down in a while
Root Cause
• The root cause was traced to two configuration parameters in the VxWorks operating system
  • Configuration parameters of the dosFsLib module permitted the unbounded consumption of memory from the system memory heap as the FLASH file system was populated with an increasing number of files
  • The configuration parameters of the memPartLib module were set so that the logic would suspend the execution of any task that requested memory when no additional memory was available
  • This had the undesirable effect of suspending a critical task when the memory space was exhausted
• Other effects included memory corruption, inability to turn the vehicle off (due to task deadlock), and repeated system resets
• Contributing factors included the compressed development schedule, unanticipated behavior of the FSW, incomplete development (analysis of the effects of the dosFsLib parameters was never fully completed), a test program that was not equivalent to operational use, and inadequate telemetry
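The interaction between the two settings can be illustrated with a toy model. The Python sketch below is not VxWorks code and uses no VxWorks API: `MemoryPartition`, `AllocationError`, `run_scenario`, the heap size, and the request sizes are all hypothetical, chosen only to show how a suspend-on-failure policy turns heap exhaustion by one consumer into the suspension of an unrelated critical task.

```python
class AllocationError(Exception):
    """Raised when the partition reports failure to the caller instead of suspending."""


class MemoryPartition:
    """Toy model of a shared heap with a configurable out-of-memory policy."""

    def __init__(self, size_bytes, suspend_on_failure):
        self.free = size_bytes
        self.suspend_on_failure = suspend_on_failure
        self.suspended = []  # tasks halted because a request could not be met

    def alloc(self, task, nbytes):
        if nbytes <= self.free:
            self.free -= nbytes
            return True
        if self.suspend_on_failure:
            # Policy resembling the one described above: the requester stops running.
            self.suspended.append(task)
            return False
        raise AllocationError(f"{task}: {nbytes} bytes requested, {self.free} free")


def run_scenario(suspend_on_failure):
    heap = MemoryPartition(size_bytes=1000, suspend_on_failure=suspend_on_failure)
    errors = []
    # Ten files, each costing 100 bytes of heap with no upper bound enforced,
    # followed by one request from a critical task.
    requests = [("file_system", 100)] * 10 + [("critical_task", 64)]
    for task, nbytes in requests:
        try:
            heap.alloc(task, nbytes)
        except AllocationError:
            errors.append(task)  # the caller sees the failure and can react
    return heap.suspended, errors


print(run_scenario(suspend_on_failure=True))
print(run_scenario(suspend_on_failure=False))
```

In the first scenario the critical task ends up in the suspended list with no error visible to any caller; in the second, the same request surfaces as an error that the caller can handle. Bounding the file system's share of the heap would prevent the exhaustion in either case.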
IV&V Findings Related to the System Memory
• Requirement and test completeness
  • IV&V Risk #1 on Requirements (extended to include test) remained in “Significant Concern” status at the time of upload
  • Chief concern was that software requirements discovery was not complete and that the software had not been adequately tested at the time of the upload
• Specific TIMs
  • Specific TIMs were written against the insufficient unit tests for portions of the file system using the system memory
  • Project asserted testing was complete but without documentation
  • These TIMs were still in “Open” state at the time of the final upload
• Code Complexity
  • Portions of the file system using the system memory were consistently reported to be very complex
  • Modules were reported to have poor testability and poor maintainability
• Code Stability
  • File system modules were being worked on until the last release (R8.1d, 11/20/03)
  • File Meta Engine had 10% of its total code changed as late as Release 8.0, and had 9% of its total code changed for Release 8.1
• Note that the file system was not the cause of the problem, but it brought the lack of memory to light and created the task deadlock
IV&V Concerns over Requirements & Test
• Upload Readiness Review (11/25/03)
  • Plans were to upload final FSW on 12/2/03; review was to determine readiness
  • IV&V recommended further testing before upload, delaying upload past Dec 2
• Operational Readiness Review (12/5/03)
  • “Aggregate of requirement and test issues represent a risk being tracked in IV&V Risks”
  • Final Requirements Risk status was “Significant Concern” (middle of three possible levels)
  • IV&V Concern: “There remains an IV&V concern about the possibility of requirements-related surprises during operations. IV&V has a less optimistic view of the requirements discovery than does the project.”
  • Potential Consequence for Surface ops: “Possible loss of science return” (“Possible loss of science return” means the situation we are currently seeing: significant time to detect, understand, and correct problems on the surface)
  • Reiteration of 11/25/03 IV&V recommendation for further testing before upload (which by 12/5/03 had already occurred, the project having proceeded with planned upload on 12/2/03)
  • Recommendation to “Continue testing to the extent possible”
  • Recommendation to “Ensure test results are adequately reviewed”
• Project emphasis on “test as you fly” (vs. formal unit and requirements-based tests) didn’t find the problem
IV&V Lessons Learned
• Resources
  • The low level of resources being applied to such a large and complex project was not sufficient
  • The goal of analyzing the software at a depth that would allow the IV&V Team to feel confident when supporting project readiness reviews had to be maintained
  • Forced a shift from a software component approach to a more whole system approach
  • Resources for IV&V should be such that a software component approach can be maintained throughout a project SDLC
• Lack of Artifacts
  • Current IV&V Facility processes are very requirements driven
  • The lack of FSW requirements artifacts on the MER Project affected the IV&V work being performed and also helped to move the approach away from a component level analysis
  • Additionally, projects are not generally required to follow a standardized software development life cycle
  • The IV&V Facility needs to examine its requirements driven approach and generate some alternative approaches to performing IV&V on projects lacking software artifacts
IV&V Lessons Learned
• Pursuing Risks
  • Early on, the IV&V Team documented the requirements risk
  • With the IV&V Team, the project would only address specific problems that were realizations of the risk, not the risk itself
  • Otherwise, the planned testing program mitigated the risk in the project’s eyes
  • The IV&V Team was still concerned, but the lack of FSW requirements made it difficult to fully examine the consequences and likelihood of the risk
  • The IV&V Team eventually accepted the test program as a mitigation to the risk
  • However, as milestone reviews neared, the testing in some cases had not been completed
  • The project continued testing up to the last minute
  • Additionally, the lack of requirements artifacts placed the MER Project into the position of testing with incomplete requirements
  • Testing was driven more by scenarios generated by system engineers such that they felt that the system was fully exercised – IV&V had no insight into how the scenarios were developed
  • The IV&V Team needs to be more proactive in assessing mitigation efforts early in the SDLC so as to more effectively support projects
  • Additionally, projects should enforce and follow good software engineering practices that include good requirements development to support a mature test program
IV&V Contributing Factors
• The IV&V Team needs to be intimately involved with the development team
  • The MER project’s compressed schedule created a schedule risk from outside parties
  • The IV&V team was not able to work directly with the developer
  • Additionally, there was no access to the development issue database or the low level testing artifacts that would allow IV&V to perform a more in-depth analysis
  • Projects need to integrate the IV&V process into the development process in order to gain maximum advantage of the resources being offered
  • Need to monitor relationship to ensure that “independence” is not lost
• More specific attention to COTS products
  • The root cause in this case was the incorrect use of a COTS product
  • The IV&V team usually analyzes the use of and interfaces between COTS and developed code since the content of most COTS products is not visible
  • The IV&V team was not able to perform that level of analysis on this mission due to resource constraints
Summary
• The IV&V approach was modified based on various project-specific factors that caused the analysis approach to be elevated to a full system approach rather than the normal software component approach
• Even with the full system approach, the IV&V team identified potentially troubling areas involving the system memory usage: risk tracking, issue tracking, code analysis, requirements analysis, test analysis, code complexity, and code stability
• However, the lack of complete requirements documents and testing documentation, both identified by IV&V as project deficiencies, hindered finding the specific problem prior to upload
• The IV&V Facility is examining the lessons learned to determine what actions to take to ensure better service on other IV&V projects