Explore the challenges of testing next-generation Cyber-Physical System (CPS) software, including growing code bases, test automation, coverage metrics, and deep learning, with a focus on verification and validation.
Testing Challenges for Next-Generation CPS Software Mike Whalen University of Minnesota
Acknowledgements • Rockwell Collins: Steven Miller, Darren Cofer, Lucas Wagner, Andrew Gacek, John Backes • University of Minnesota: Mats P. E. Heimdahl, Sanjai Rayadurgam, Matt Staats, Ajitha Rajan, Gregory Gay • Funding Sponsors: NASA, Air Force Research Labs, DARPA
Who Am I? My main aim is reducing verification and validation (V&V) cost while increasing rigor • Applied automated V&V techniques to industrial systems at Rockwell Collins for 6½ years • Proofs, bounded analyses, static analysis, automated testing • Combining several kinds of assurance artifacts • I'm interested in requirements as they pertain to V&V. Main research thrusts in testing • Factors in testing: how do we make testing experiments fair and repeatable? • Test metrics: what are reasonable metrics for testing safety-critical systems? • What does it mean for a metric to be reasonable?
Software Size Graphic: Andrea Busnelli
Networked Vehicles Currently: Bluetooth and OnStar Adaptive Cruise Control Platooning Traffic Routing Emergency Response Adaptive traffic lights What could possibly go wrong? Image courtesy of energyclub.stanford.edu
Attacks on Embedded Systems Poland Tram Stuxnet FBI - iPhone Miller: Remote Car Hack
Hypotheses CPS testers are facing enormous challenges of scale and scrutiny • Substantially larger code bases • Increased attention from attackers. Thorough use of automation is necessary to increase rigor for CPS verification • Requires understanding of the factors in testing • Common coverage metrics are not as well suited to CPS as to general-purpose software • The structure of programs and oracles is important for automated testing! Creating intelligent / adaptive systems will make the testing problem harder • Use of "deep learning" for critical functionality • We have little knowledge of how to systematically white-box test deep-learning-generated code such as neural nets
Testing Process (diagram): a Specification is implemented by a Model/Program; Test Inputs from the Test Suite are executed on the program; each result is evaluated by an Oracle as correct/incorrect; the program path is assessed against a Test Coverage Metric, which may drive the creation of additional tests.
Testing Artifacts J. Gourlay. A mathematical framework for the investigation of testing. TSE, 1983 Staats, Whalen, and Heimdahl, Programs, Tests, and Oracles: The Foundations of Testing Revisited. ICSE 2011
Complete Adequacy Criteria I.e.: is your testing regimen adequate, given the program structure, specification, oracle and test suite?
Complete Adequacy Criteria Gay, Staats, Whalen, and Heimdahl, The Risks of Coverage-Directed Test Case Generation, FASE 2012, TSE 2015.
MC/DC Effectiveness (charts for the DWM_2, DWM_3, and Vertmax_Batch case examples): code structure has a large effect! Choice of oracle has a large effect!
Goals for a “Good” Test Metric: • Effective at finding faults: better than random testing for suites of the same size, and better than other metrics; this often requires accounting for the oracle (Inozemtseva and Holmes, Coverage Is Not Strongly Correlated with Test Suite Effectiveness, ICSE 2014; Zhang and Mesbah, Assertions Are Strongly Correlated with Test Suite Effectiveness, FSE 2015) • Robust to changes in program structure • Reasonable in terms of the number of required tests and the cost of coverage analysis
Another way to look at MC/DC: masking MC/DC can be expressed as a predicate describing whether a condition is observable in a decision (i.e., not masked). Problem 1: any masking after the decision is not accounted for. Problem 2: we can rewrite programs to make decisions large or small (and MC/DC easy or hard to satisfy!). Notation: for program P, the computed value for the nth instance of expression e is replaced by value v.
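As a concrete illustration of the masking idea above, here is a small Python sketch (all names are hypothetical, not from the slides): a condition is "observable" in a decision exactly when toggling only that condition flips the decision's outcome, and masking MC/DC needs test points where that holds.

```python
from itertools import product

def observable(decision, i, assignment):
    """Condition i is observable (not masked) in `decision` when
    toggling only that condition changes the decision's outcome."""
    flipped = list(assignment)
    flipped[i] = not flipped[i]
    return decision(*assignment) != decision(*flipped)

def mcdc_points(decision, n_conds, i):
    """All condition assignments under which condition i independently
    affects the decision -- the points masking MC/DC must exercise."""
    return [a for a in product([False, True], repeat=n_conds)
            if observable(decision, i, a)]

# Example decision a && (b || c): b is masked whenever a is false
# (short-circuit) or c is true (b's value no longer matters).
dec = lambda a, b, c: a and (b or c)
b_points = mcdc_points(dec, 3, 1)
```

For this decision, `b_points` contains only the assignments with `a` true and `c` false, which is exactly where `b` determines the outcome.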
Examining Observability With Model Counting and Symbolic Evaluation

int test(int x, int y) {
  int z;
  if (y == x*10) S0; else S1;
  if (x > 3 && y > 10) S2; else S3;
  return z;
}

Path conditions: [ true ] at entry; [ Y=X*10 ] S0; [ Y!=X*10 ] S1; [ X>3 & 10<Y=X*10 ] S2; [ X>3 & 10<Y!=X*10 ] S2; [ Y=X*10 & !(X>3 & Y>10) ] S3; [ Y!=X*10 & !(X>3 & Y>10) ] S3. Test(1,10) reaches S0,S3; Test(0,1) reaches S1,S3; Test(4,11) reaches S1,S2. Work by: Willem Visser, Matt Dwyer, Jaco Geldenhuys, Corina Pasareanu, Antonio Filieri, Tevfik Bultan (ISSTA ’12, ICSE ’13, PLDI ’14, SPIN ’15, CAV ’15)
Probabilistic Symbolic Execution: over the 10⁴ inputs (x, y: 0..99), the path counts are [ Y=X*10 ] 10 vs. [ Y!=X*10 ] 9990; downstream, [ X>3 & 10<Y=X*10 ] 6, [ Y=X*10 & !(X>3 & Y>10) ] 4, [ X>3 & 10<Y!=X*10 ] 8538, and [ Y!=X*10 & !(X>3 & Y>10) ] 1452.

int test(int x, int y /* 0..99 */) {
  int z;
  if (y == x*10) S0; else z = 10;
  if (x > 3 && y > 10) z = 8; else S3;
  return z;
}

The statement z = 10 is visited in 99.9% of tests, but it only affects the outcome in about 14% of tests.
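The path counts on the slide can be reproduced by brute-force "model counting" over the small input domain; this Python sketch (names hypothetical) tallies how many of the 10⁴ inputs drive each of the four paths.

```python
from collections import Counter

def count_paths():
    """Brute-force model counting over x, y in 0..99, tallying the
    number of inputs that exercise each path of the slide's example."""
    paths = Counter()
    for x in range(100):
        for y in range(100):
            p = ('S0' if y == x * 10 else 'S1',      # first decision
                 'S2' if x > 3 and y > 10 else 'S3')  # second decision
            paths[p] += 1
    return paths

paths = count_paths()
# 'z = 10' (the S1 branch) is reached by 9990 of 10000 inputs (99.9%),
# but only the S3 continuations let its value survive to the output.
reach_s1 = paths[('S1', 'S2')] + paths[('S1', 'S3')]
```

Running this recovers exactly the counts annotated on the slide (10 / 9990 at the first branch; 6, 4, 8538, 1452 at the leaves).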
Probabilistic SE: of the 10⁴ inputs, the [ Y=X*10 ] branch (10 inputs) is hard to reach while [ Y!=X*10 ] (9990 inputs) is easy to reach; the downstream paths split 8538 / 1452 / 6 / 4, making some easy to observe and others (somewhat) hard to observe.
How hard is it to kill a mutant? Just, Jalali, Inozemtseva, Ernst, Holmes, and Fraser, Are Mutants a Valid Substitute for Real Faults in Software Testing? FSE 2014. Yao, Harman, and Jia, A Study of Equivalent and Stubborn Mutation Operators Using Human Analysis of Equivalence, ICSE 2014. W. Visser, What Makes Killing a Mutant Hard? ASE 2016. Spoiler alert: not hard at all. Location, location, location: more important than chicken or bull.
In the initial results They saw something interesting
What did they find?

public static int classify(int i, int j, int k) {
  if ((i <= 0) || (j <= 0) || (k <= 0)) return 4;
  int type = 0;
  if (i == j) type = type + 1;
  if (i == k) type = type + 2;
  if (j == k) type = type + 3;
  if (type == 0) {
    if ((i + j <= k) || (j + k <= i) || (i + k <= j)) type = 4;
    else type = 1;
    return type;
  }
  if (type > 3) type = 3;
  else if ((type == 1) && (i + j > k)) type = 2;
  else if ((type == 2) && (i + k > j)) type = 2;
  else if ((type == 3) && (j + k > i)) type = 2;
  else type = 4;
  return type;
}

Stubborn barrier: almost all mutations beyond it are stubborn (killable by <1% of inputs).
Why? In the same classify code, only 3% of inputs pass the type == 0 barrier, so almost all mutations beyond it are stubborn (killable by <1% of inputs).
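The "3% barrier" claim is easy to check empirically. This hypothetical Python sketch measures what fraction of uniformly random triples in 1..100 make it past the early return, i.e., into the region where the stubborn mutants live (any equality among i, j, k, probability 1 − (99/100)(98/100) ≈ 2.98%).

```python
import random

def reaches_barrier(i, j, k):
    """True when an input gets past the 'type == 0' early return of
    the triangle classifier -- the region holding the stubborn mutants."""
    if i <= 0 or j <= 0 or k <= 0:
        return False
    return i == j or i == k or j == k  # some equality => type != 0

random.seed(0)  # fixed seed so the estimate is reproducible
trials = 100_000
hits = sum(reaches_barrier(random.randint(1, 100),
                           random.randint(1, 100),
                           random.randint(1, 100))
           for _ in range(trials))
frac = hits / trials  # ~0.03: at most this slice of random inputs
                      # can ever kill a mutant placed behind the barrier
```

Any mutant located after the barrier is therefore reachable, let alone killable, by only about 3 in 100 random tests, which is why random testing rates these mutants as stubborn.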
A (Very) Small Experiment on Operator Mutations: arithmetic operators and relational operators, compared across a triangle calculator (“general purpose” software) and TCAS (“embedded” software).
Why is observability an issue for embedded systems? Often long tests are required to expose faults from earlier computations • Rate Limiters • Hysteresis / de-bounce • Feedback bounds • System Modes Physical systems can impede observability • Cannot observe all outputs • Or cannot observe them accurately Fault tolerance logic can impede observability • Richer oracle data than system outputs required Structure of programs can impede observability • Graphical dataflow notations (Simulink / SCADE) put conditional blocks at the end of computation flows rather than at the beginning.
Observable MC/DC Explicitly account for oracle Strength should be unaffected by simple program transformations (e.g., inlining) Idea: lift observation from decisions to programs Whalen, Gay, You, Staats, and Heimdahl. Observable Modified Condition / Decision Coverage. ICSE 2013
(Results charts for the DWM2, Latctl, Vertmax, and Microwave case studies.)
Adoption in SCADE and MathWorks Tools SCADE: generalization to all variables, called Stream Coverage. Also in discussions with The MathWorks on these ideas • Currently they support an inlining solution for MC/DC
Testing Complex Mathematics Metrics describing branching logic often miss errors in complex mathematics Errors often exist in parts of the “numerical space” rather than portions of the CFG - Overflow / underflow - Loss of precision - Divide by zero - Oscillation - Transients
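A minimal illustration of an error living in the "numerical space" rather than the CFG (function names are hypothetical): two algebraically identical formulas for (1 − cos x)/x², one of which silently collapses to 0 for tiny x through loss of precision. Neither has a single branch, so every branch/MC/DC metric is trivially satisfied by any test.

```python
import math

def half_angle_naive(x):
    """(1 - cos x) / x^2 computed directly -- no branches to cover."""
    return (1 - math.cos(x)) / (x * x)

def half_angle_stable(x):
    """Algebraically identical form via the half-angle identity
    1 - cos x = 2 sin^2(x/2); numerically stable for small x."""
    s = math.sin(x / 2)
    return 2 * s * s / (x * x)

# For ordinary inputs the two agree; for tiny x the naive form
# returns 0.0 where the true value is ~0.5 -- a fault confined to a
# region of the input space that branch coverage cannot point at.
```

Only a metric or generator that explores the numeric input space (output diversity, failure patterns) has a reason to pick the tiny-x inputs that expose the fault.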
Matinnejad, Nejati & Briand: Metrics for Complex Mathematics Use multi-objective search-based testing to maximize the diversity of output vectors, in terms of distance and the number of numerical features, and to maximize failure features in a test suite.
Output Diversity -- Vector-Based (charts: output signals 1 and 2 over time). Matinnejad, Nejati, and Briand, Automated Test Suite Generation for Time-Continuous Simulink Models, ICSE 2016. Matinnejad, Nejati, Briand, Bruckmann, and Poull, Search-Based Automated Testing of Continuous Controllers: Framework, Tool Support, and Case Studies, I&ST 57 (2015). Matinnejad, Nejati, Briand, and Bruckmann, Effective Test Suites for Mixed Discrete-Continuous Stateflow Controllers, FSE 2015.
Failure-based Test Generation Maximizing the likelihood that specific failure patterns -- instability, discontinuity -- are present in output signals.
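The two failure patterns named above can be scored directly on a sampled output signal; a search can then maximize these scores. This is a hypothetical sketch (not the authors' exact fitness functions): discontinuity as the largest jump between consecutive samples, instability as the count of direction reversals.

```python
def discontinuity(signal):
    """Largest jump between consecutive samples -- the 'discontinuity'
    failure feature a search-based generator tries to maximize."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))

def instability(signal):
    """Number of direction reversals in the signal -- a crude
    'instability' (oscillation) feature."""
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)

smooth = [0.0, 0.1, 0.2, 0.3, 0.4]   # well-behaved controller output
jumpy  = [0.0, 0.1, 5.0, 0.2, 0.3]   # spike: big jump + oscillation
```

A test input whose output scores high on either feature is a strong candidate for manual inspection, even when no oracle flags it as wrong.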
Search-Based Test Generation Procedure Start from an initial test suite, slightly modify each test input, and select using output-based heuristics.
Output Diversity: Comparison with Random • Seeded faults into mathematical software with few branches • Measured deviation from expected values • Two metrics: O_f is failure diversity; O_v is variable diversity
Output Diversity: Comparison with SLDV CAVEATS • Not much branching logic in the models (MC/DC's strength) • MC/DC is not very good at catching relational or arithmetic faults • SLDV is not designed for non-linear arithmetic and continuous time • However, this demonstrates the need for new kinds of metrics and generation tools
Example: Neural Nets • Build a network of “neurons” that map from inputs to outputs • Each node performs a summation and has a threshold to “fire” • Each connection has a weight, which can be positive or negative • As of 2017, neural networks typically have a few thousand to a few million units and millions of connections. • Neural nets are trained rather than programmed. By Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461
Machine Learning Use cases: • (Self-) diagnosis • Predictive Maintenance • Condition Monitoring • Anomaly Detection / Event Detection • Image analysis in production • Pattern recognition Increasingly proposed for use in safety-critical applications: road following, adaptive control
Neural Net Code Structure (in MATLAB)

function [y1] = simulateStandaloneNet(x1)
% Input 1
x1_step1_xoffset = 0;
x1_step1_gain = 0.200475452649894;
x1_step1_ymin = -1;
% Layer 1
b1 = [6.0358701949520981; 2.725693924978148; 0.58426771719145909; -5.1615078566382975];
IW1_1 = [-14.001919491063946; 4.90641117353245; -15.228280764533135; -5.264207948688032];
% Layer 2
b2 = -0.75620725148640833;
LW2_1 = [0.5484626432316061 -0.43580234386123884 -0.085111261420612969 -1.1367922825337915];
% Output 1
y1_step1_ymin = -1;
y1_step1_gain = 0.2;
y1_step1_xoffset = 0;

% ===== SIMULATION ========
% Dimensions
Q = size(x1,2); % samples
% Input 1
xp1 = mapminmax_apply(x1, x1_step1_gain, x1_step1_xoffset, x1_step1_ymin);
% Layer 1
a1 = tansig_apply(repmat(b1,1,Q) + IW1_1*xp1);
% Layer 2
a2 = repmat(b2,1,Q) + LW2_1*a1;
% Output 1
y1 = mapminmax_reverse(a2, y1_step1_gain, y1_step1_xoffset, y1_step1_ymin);
end
…continued

% ===== MODULE FUNCTIONS ========
% Map Minimum and Maximum Input Processing Function
function y = mapminmax_apply(x, settings_gain, settings_xoffset, settings_ymin)
  y = bsxfun(@minus, x, settings_xoffset);
  y = bsxfun(@times, y, settings_gain);
  y = bsxfun(@plus, y, settings_ymin);
end

% Sigmoid Symmetric Transfer Function
function a = tansig_apply(n)
  a = 2 ./ (1 + exp(-2*n)) - 1;
end

% Map Minimum and Maximum Output Reverse-Processing Function
function x = mapminmax_reverse(y, settings_gain, settings_xoffset, settings_ymin)
  x = bsxfun(@minus, y, settings_ymin);
  x = bsxfun(@rdivide, x, settings_gain);
  x = bsxfun(@plus, x, settings_xoffset);
end

Code observations: • No branches! • No relational operators!
So, how do we test this? • Black-box reliability testing? • How do we determine the input distributions? • How do we gain sufficient confidence for safety-critical use? • Ricky W. Butler, George B. Finelli: The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software • Mutation testing? • What do we mutate? • What is our expectation as to the output effect? • A specialized testing regime?
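One way to probe the "what do we mutate?" question: mutate individual weights and measure the output deviation. This hypothetical Python sketch mirrors the shape of the MATLAB net above (1 input, 4 tanh neurons, 1 linear output; normalization omitted, weights illustrative) and shows that some weight mutations are trivially visible while others are "stubborn" at a given input.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Tiny 1-4-1 net like the slide's MATLAB code, without the
    mapminmax normalization steps."""
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    return sum(lw * h for lw, h in zip(w2, hidden)) + b2

# Weights loosely based on the slide's example (illustrative values)
w1 = [-14.0, 4.9, -15.2, -5.3]
b1 = [6.0, 2.7, 0.58, -5.2]
w2 = [0.55, -0.44, -0.085, -1.14]
b2 = -0.76

def mutation_effect(x, layer_w, idx, delta):
    """'Mutate' one weight by delta and report the output deviation --
    the analogue of asking whether a test input kills this mutant."""
    base = forward(x, w1, b1, w2, b2)
    mutated = list(layer_w)
    mutated[idx] += delta
    if layer_w is w1:
        return abs(forward(x, mutated, b1, w2, b2) - base)
    return abs(forward(x, w1, b1, mutated, b2) - base)
```

At x = 0, mutating an output-layer weight shifts the result by roughly delta (the corresponding neuron is near saturation), while mutating an input-layer weight has exactly zero effect, since the weight multiplies x = 0. The "killability" of a weight mutation depends entirely on where in the input space you test, with no branches or conditions to guide coverage.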