Technology Infusion: Text-Mining and Tagging for Software Change Requests
Executive Briefing
Jane T. Malin and David R. Throop
NASA Johnson Space Center (JSC)
Project: Technology Infusion of Text-mining for Problem Trending into Software Change Reports at JSC
Software Assurance Symposium
September 2008
The Problem/NASA Relevance
• International Space Station generates ~1400 Software Change Requests (SCR) annually
• It is difficult to find trends and recurring anomalies within the large set of SCRs
  • Particularly urgent when trying to find ‘more reports similar to this one’ during flight anomalies
  • Typical “manual” analysis uses database searches
• Critical information about software changes is captured in natural-language text fields (English sentences)
  • Text is not well behaved, so keyword search or data mining approaches fail
  • Syntactic and semantic variants are used often
Approach
• Leverage Text-Mining technology used to:
  • Extract model parts for system modeling from requirements
  • Find trends in Discrepancy Reports
Semantic Text Mining and Tagging
• Analyzes sets (10,000s) of problem-report records from databases
  • Each record has multiple fields, some of which contain English-language text describing problems, causes, consequences, equipment
• Text-mining approach
  • Performs syntactic parsing of each text field in the data record
  • Uses hierarchical aerospace ontologies of concepts and nomenclature to identify problem-type or equipment-type tags to add to each record
  • Searches for word-patterns that match problems or entities of interest
  • Adds additional tag fields to records
  • Uses tags for graphs and other browsing capabilities for analysts
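The tagging step above can be illustrated with a minimal sketch. It is not the project's implementation: it substitutes plain phrase matching for the syntactic parsing the slide describes, and the ontology fragment, field names, and record contents are all made up for illustration.

```python
import re

# Hypothetical fragment of a problem-type ontology: tag -> mapping phrases.
ONTOLOGY = {
    "Memory_Error": ["memory leak", "buffer overflow", "memory error"],
    "Software_Resource_Contention": ["deadlock", "race condition"],
    "Software_Not_Responding": ["hang up", "lock up", "freeze"],
}

def phrase_to_regex(phrase):
    """Escape each word and allow any whitespace run between words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())

def compile_patterns(ontology):
    """Build one case-insensitive pattern per tag from its mapping phrases."""
    patterns = {}
    for tag, phrases in ontology.items():
        alternatives = "|".join(phrase_to_regex(p) for p in phrases)
        patterns[tag] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns

def tag_record(record, text_fields, patterns):
    """Return a copy of the record with a sorted 'tags' field added."""
    found = set()
    for field in text_fields:
        text = record.get(field, "")
        for tag, pattern in patterns.items():
            if pattern.search(text):
                found.add(tag)
    tagged = dict(record)
    tagged["tags"] = sorted(found)
    return tagged

PATTERNS = compile_patterns(ONTOLOGY)

# Made-up record with two text fields (field names are assumptions).
example = tag_record(
    {"id": "SCR-1",
     "title": "Display freeze after Deadlock",
     "description": "Suspected race  condition in handler."},
    ["title", "description"],
    PATTERNS,
)
print(example["tags"])
```

Matching on whole words avoids tagging "Deadlocked" as "deadlock"; a real system would need the parser and ontology to handle such variants deliberately.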
Current Capability
• User: ISS Robotics
• 3200 .html SCR records converted to tab-delimited format
• Text analysis and hierarchical tagging for problem types
• Capability to limit tagging scope to only software failures
• Analysis of multiple fields
• Improved bar chart formats
Bar chart: Errors co-occurring with ‘Deactivation’ in one year
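A chart such as "errors co-occurring with ‘Deactivation’ in one year" could be driven by a simple count over tagged records. The sketch below assumes each record carries a `year` field and a `tags` list; both names, and the sample records, are hypothetical.

```python
from collections import Counter

def cooccurring_tags(records, anchor_tag, year):
    """Count tags that appear on the same record as anchor_tag in a given year."""
    counts = Counter()
    for rec in records:
        if rec["year"] == year and anchor_tag in rec["tags"]:
            # set() so a tag repeated on one record is counted once
            counts.update(t for t in set(rec["tags"]) if t != anchor_tag)
    return counts

def text_bar_chart(counts):
    """Render counts as a plain-text bar chart, largest bar first."""
    return "\n".join(
        f"{tag:<30} {'#' * n} {n}" for tag, n in counts.most_common()
    )

# Made-up tagged records for illustration only.
records = [
    {"year": 2007, "tags": ["Deactivation", "Data_Error"]},
    {"year": 2007, "tags": ["Deactivation", "Data_Error", "Memory_Error"]},
    {"year": 2007, "tags": ["Memory_Error"]},
    {"year": 2006, "tags": ["Deactivation", "Corruption"]},
]
counts = cooccurring_tags(records, "Deactivation", 2007)
print(text_bar_chart(counts))
```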
Current Software Problems
• Software problem type hierarchy from Aerospace Ontology, with mapping words
  • Software_Threat: spyware, spam, virus, malware, worm, Trojan horse, Trojan, root kit, exploit, ping, brute force attack, dictionary attack, replay attack, piggybacking, denial of service, sabotage
  • Programmer_Error: programmer error, {Bad} programming practice
  • Software_or_Computer_Error (error, faulty): software error, software problem, BIT error, controller error, computer error, display error, program error, bit count error, check error, not reinitialized, compiler error, bug, phase error, exception, {Programming_Language} exception, page fault, general protection fault, halt failure, crash
  • Software_Security_Anomaly: protocol anomaly, traffic anomaly
  • Software_Sequence_Error: command sequence error, task sequence error, boot sequence error, function sequence error, sequence error
  • Software_Resource_Contention: thrashing, unwanted synchronization, multithread error, deadlock, live lock, lock error, contention, race condition, data race
  • Data_Error: data error, bit error, parity error, missing pointer, i/o error, input error, input/output error, output error, word error, divide by zero
  • Corruption: corrupted packet, corrupt file
  • Memory_Error: corrupted memory, memory write error, memory error, read error, integer overflow, buffer overflow, memory leak, {Insufficient} memory, overwritten memory, overwrite, write over, write on top of
  • Software_Vulnerability: dangling pointer, format string vulnerability, code injection, intrusion, hijack
  • Bad_Software_Structure: {Bad} {Software_Structure}
  • Missing_Software_Structure: {Missing} {Software_Structure}
  • Software_Not_Responding: crash, hang up, lock up, freeze
• Note: Brackets expand. For example, {Software_Structure} expands to: comment, code, dictionary, expression, statement, instruction, computation, algorithm, string, thread, pointer, link, hyperlink, reference, command sequence, error log, DLL, load, software load, dump data, segment, use-define chain, call graph, control flow graph, handler
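One way to read the bracket notation is as a template that expands to every combination of its class members. The sketch below shows that reading; the `{Software_Structure}` members are a subset of the list above, while the `{Bad}` members are invented here because the slides do not list them.

```python
import itertools
import re

# Matches a class reference such as {Software_Structure}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")

def expand(template, classes):
    """Expand every {Class} in template into all combinations of its members."""
    # With one capture group, re.split alternates literal text (even
    # indices) and captured class names (odd indices).
    parts = PLACEHOLDER.split(template)
    choices = [
        classes[part] if i % 2 == 1 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]

CLASSES = {
    "Bad": ["bad", "poor"],                      # assumed members, not from the slides
    "Software_Structure": ["pointer", "thread"], # subset of the list above
}

print(expand("{Bad} {Software_Structure}", CLASSES))
```

This sketch does not handle class members that themselves contain bracketed references; whether the ontology nests expansions that way is not stated in the slides.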
Technical Challenges
• Software module names often refer to system failure modes
  • E.g., don’t tag ‘Fire-in-cabin annunciation’ as ‘FIRE’
  • Handled by tagging only software-related failures
Usability Challenges
• Determining what trends are most useful
  • User interviews
  • Repeated prototypes
• Redesigning user displays to accommodate information overload
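Limiting the tagging scope to software failures can be pictured as keeping only tags that sit under a software branch of the hierarchy. The sketch below assumes a child-to-parent map; the root names `Software_Problem` and `Hazard`, and the `Fire` tag, are hypothetical stand-ins for whatever the ontology actually uses.

```python
# Hypothetical child -> parent map for a small slice of the tag hierarchy.
PARENT = {
    "Software_Problem": None,
    "Memory_Error": "Software_Problem",
    "Software_Sequence_Error": "Software_Problem",
    "Hazard": None,
    "Fire": "Hazard",
}

def is_under(tag, root, parent=PARENT):
    """Walk up the hierarchy from tag; True if root is reached."""
    while tag is not None:
        if tag == root:
            return True
        tag = parent.get(tag)
    return False

def limit_scope(tags, root, parent=PARENT):
    """Keep only the tags that fall under the given root concept."""
    return [t for t in tags if is_under(t, root, parent)]

# A record about a fire-annunciation module keeps only its software tag.
print(limit_scope(["Fire", "Memory_Error"], "Software_Problem"))
```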
Planned Capability
• Additional iteration of suggestions and refinement of requirements
• More software failure terms and concepts in the tagging ontology
• Support for identifying and eliminating false positives
• Documented user requirements and capabilities
• Proposal for wider use by many JSC organizations that search and analyze SCR database records
  • Including tighter integration with current SCR database, linking back to it