Being SMART About Failures: Assessing Repairs in Smart Homes

2013.02.25 Study Group Junction Being SMART About Failures: Assessing Repairs in Smart Homes Krasimira Kapitanova, Enamul Hoque, John A. Stankovic, Kamin Whitehouse, Sang H. Son University of Virginia, DGIST, Dept. of Information and Communication Engineering

Outline • Introduction • Proposed Solution • SMART Approach Detail • Experimental Setup • Results • Discussions • Conclusions

Introduction • Smart home applications: • Home automation, energy efficiency, home security • Commercial Sensors: • Inexpensive, wireless, battery-powered • Reducing hardware and installation costs

Problems • Do-it-yourself, low-cost sensors • Homes with hundreds of sensors had one sensor failure per day on average • Suffer from many types of faults: • Fail-stop failure: the sensor breaks down or loses power • Non-fail-stop failure: the sensor does not completely fail. It continues to report values, but the meaning of the values changes or becomes invalid. • e.g. the sensor is dislodged, falls off and is re-mounted, is covered by objects, or is blocked by an open door or re-arranged furniture. • The maintenance cost of fixing all such failures is prohibitive and may negate any cost advantage of inexpensive hardware and installation

Existing solutions • Detect and report fail-stop hardware failures • Repeatedly query the nodes or check for lost data • Cannot catch non-fail-stop failures (the node still responds) • Detect non-fail-stop failures • Exploit correlations between neighboring sensors • Bottom-up approach • Use patterns in the raw sensor data • (O) Suited to homogeneous, periodic, and continuous-valued sensors • (X) Not suited to heterogeneous, binary, and event-triggered sensors

Proposed solution • Simultaneous Multi-classifier Activity Recognition Technique (SMART) • Uses top-down application-level semantics to detect, assess, and adapt to sensor failures • Detects non-fail-stop node failures: • Getting stuck at a value, node displacement, or node relocation • Runtime failure detection using multiple classifier instances • The instances are trained to recognize the same set of activities based on different subsets of sensors • One node fails -> this affects a subset of the classifiers -> the ratio of activity detection among the classifiers changes

Proposed solution • Once failure is detected • Adapts to the failure • excluding the failed node • creating a new classifier ensemble based on the remaining subset of nodes • Use data replay analysis to assess • whether the failure would have affected activity recognition in the past had the new classifier ensemble been used. • Y: dispatches a maintenance person to repair the failure • N: no maintenance is necessary

Comparison with bottom-up method • How accurately can correlation-based techniques detect non-fail-stop failures in event-driven applications? • Activity recognition • 43-day-long dataset from a two-resident home • Failure -> movement failure, where one of the kitchen motion sensors is accidentally moved to point in a different direction

Correlation-based results • Cannot achieve failure detection accuracy higher than 80% • The accuracy decreases as the number of consecutive testing days increases. • Reason: • Uses only the temporal correlation between nodes => when two nodes fire together • Does not look at which activity is being performed • The activity is an application-level feature • Top-down approach!

SMART Approach • 1. The training and use of the classifier ensemble: train classifier instances for all possible combinations of node failures by holding those nodes out of the training set. • 2. The detection of non-fail-stop failures: analyze the effect of sensor failures on the classifiers' performance. • 3. The node failure severity analysis, which decreases the number of maintenance dispatches: check whether the new classifier ensemble can maintain the detection accuracy of the application above the specified severity threshold THS. • 4. How SMART maintains high detection accuracy under failures: update the classifier ensemble to contain classifiers that are trained for the failure.

Using multiple simultaneous classifiers • Single-node failures • |S|+1 classifier instances are used • Detect the occurrence of a failure • Identify which node has failed by monitoring the relative behavior of the original classifier instance and that of the other |S| instances • Maintain high detection accuracy in the presence of failures
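The |S|+1 training sets for the single-node case can be enumerated as in the sketch below. This is a minimal illustration, not the authors' code: the function name `holdout_sensor_sets`, the `C_<sensor>` naming, and the sensor names are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable


def holdout_sensor_sets(sensors: Iterable[str]) -> Dict[str, FrozenSet[str]]:
    """Return the sensor subset each classifier instance would be trained on.

    "C0" is the original instance trained on all sensors; "C_<s>" is the
    instance trained with sensor s held out (hypothetical naming).
    """
    all_sensors = frozenset(sensors)
    subsets = {"C0": all_sensors}
    for s in sorted(all_sensors):
        subsets[f"C_{s}"] = all_sensors - {s}
    return subsets


# Three illustrative sensors give |S| + 1 = 4 classifier instances.
ensemble = holdout_sensor_sets(["kitchen_motion", "fridge_door", "stove"])
for name, subset in ensemble.items():
    print(name, sorted(subset))
```

Each subset would then be used to train one classifier instance on the same labeled activity data.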

Failure detection • Analyze the relative performance of the classifiers that had the failed node in their training sets versus the classifiers that did not. • e.g. when sensor s fails, there is a change in the relative behavior of • C ( S , S-s ) and C ( S-s , S-s ) • Calculate the F-score of each of these classifiers with respect to the original classifier C0 to measure the similarity between their outputs. • e.g. the F-score for a classifier Ci( S-si, S )

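Definition (TP, FP, FN: true positives, false positives, false negatives) • $\text{Precision} = \frac{TP}{TP + FP}$ • $\text{Recall} = \frac{TP}{TP + FN}$ • $\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$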
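A small sketch of the F-score of one classifier measured against the original classifier C0, using the standard precision and recall definitions. How the slides count true and false positives is not shown, so the per-window counting below (C0's label is treated as the reference) is an assumption, and the function name and labels are illustrative.

```python
from typing import Sequence


def f_score_vs_reference(predicted: Sequence[str],
                         reference: Sequence[str],
                         activity: str) -> float:
    """F-score of `predicted` for one activity, with `reference` (C0) as truth."""
    if len(predicted) != len(reference):
        raise ValueError("both label sequences must cover the same windows")
    pairs = list(zip(predicted, reference))
    tp = sum(p == activity and r == activity for p, r in pairs)
    fp = sum(p == activity and r != activity for p, r in pairs)
    fn = sum(p != activity and r == activity for p, r in pairs)
    if tp == 0:
        return 0.0  # no agreement on this activity; also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative per-window labels from C0 and from one hold-out classifier Ci.
c0_labels = ["cook", "cook", "sleep", "cook"]
ci_labels = ["cook", "sleep", "cook", "cook"]
score = f_score_vs_reference(ci_labels, c0_labels, "cook")
print(round(score, 3))
```

Here Ci agrees with C0 on two "cook" windows, adds one, and misses one, so precision and recall are both 2/3.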

Failure detection • Each of the |S| classifiers has an F-score associated with it; together they form an F-score vector with one entry per classifier • The vector characterizes the behavior of the system when there are no failures • If a failure occurs, then depending on its severity it might affect some or even all of the values in the F-score vector.

Failure detection • Since CA, CB, and C0 have all been trained with node C, their relative ratios will remain similar • They are affected in a similar way. • CC was trained by holding node C out and will therefore change in behavior relative to classifier C0. • SMART can thus infer a failure has occurred and identify the cause of that failure.
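The intuition on this slide can be sketched with a simple heuristic: compare the current F-score vector with the no-failure baseline and flag the hold-out classifier whose score moved most. This is only an illustration of the idea, not SMART's mechanism (SMART uses trained classifiers for this step); the function name, the `min_change` parameter, and all numbers are assumptions.

```python
from typing import Dict, Optional


def most_changed_classifier(baseline: Dict[str, float],
                            current: Dict[str, float],
                            min_change: float = 0.1) -> Optional[str]:
    """Return the held-out node whose classifier's F-score changed most.

    Keys name the node each classifier was trained without. Returns None
    when no score moved by at least `min_change` (hypothetical threshold).
    """
    changes = {node: abs(current[node] - baseline[node]) for node in baseline}
    node, change = max(changes.items(), key=lambda item: item[1])
    return node if change >= min_change else None


# Illustrative F-scores of CA, CB, CC relative to C0.
baseline = {"A": 0.95, "B": 0.93, "C": 0.90}
after_failure = {"A": 0.94, "B": 0.92, "C": 0.55}
suspect = most_changed_classifier(baseline, after_failure)
print(suspect)
```

With these numbers only CC's score shifts noticeably, so node C is the suspect.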

Failure Recognition • Step 1: Is there a failure or not? • FDC (failure detection classifier) • Trained to distinguish between non-failure F-score vectors and failure F-score vectors. • Method: use historical data to generate both failure and non-failure F-score vectors and train the FDC. Failures are introduced artificially into the historical data by modifying the readings of the “failed” nodes. • Run-time: • The system calculates the relative F-scores for the |S| classifiers and builds an F-score vector. • The FDC determines whether the F-score vector represents a system that has a failure or not. => this determines the failure detection accuracy
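A toy version of the FDC step: train on labeled F-score vectors, then classify a new vector at run-time. The slides do not say which classifier type the FDC uses, so the nearest-centroid rule below is a stand-in chosen for brevity, and every name and number is illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_fdc(normal: List[Vector], failed: List[Vector]) -> Tuple[List[float], List[float]]:
    """'Train' the stand-in FDC: one centroid per class."""
    return centroid(normal), centroid(failed)


def fdc_predict(model: Tuple[List[float], List[float]], vector: Vector) -> bool:
    """Return True when the vector is closer to the failure centroid."""
    normal_centroid, failed_centroid = model
    return squared_distance(vector, failed_centroid) < squared_distance(vector, normal_centroid)


# F-score vectors from historical data, without and with injected failures.
normal_vectors = [[0.95, 0.93, 0.90], [0.94, 0.92, 0.91]]
failed_vectors = [[0.94, 0.92, 0.55], [0.60, 0.93, 0.90], [0.95, 0.50, 0.89]]
fdc = train_fdc(normal_vectors, failed_vectors)
print(fdc_predict(fdc, [0.94, 0.93, 0.50]))
print(fdc_predict(fdc, [0.95, 0.92, 0.91]))
```

A real FDC would need far more training vectors than this; the slide on small fluctuations notes that more historical data improves detection accuracy.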

Failure Recognition • Step 2: Which node has failed? • FIC (failure identification classifier) • Determines which of the nodes has failed. • Is trained to distinguish between different node failures. • Training failures are simulated by modifying the readings of the “failed” nodes in the historical data. • Used to evaluate the failure identification accuracy.

Reaction to small fluctuations • Small fluctuations between F-score vectors might occur even without failures. • e.g. if residents of a home alter their normal activity patterns • Non-severe failures might cause very small changes to the behavior of the classifiers. • Both effects lead to FDC false positives and false negatives • To improve the accuracy of the failure detection: • Increase the size of the historical data used for training • Increase the failure detection latency

Node failure severity analysis • Node failures impact the application differently depending on the level of redundancy available in the system. • e.g. cooking activity • 2 sensors in the kitchen • the starred one is close to the sink • Their failures have different degrees of impact • More sensors?

Node failure severity analysis • Severity is measured when a failure is detected • Assume sensor s fails • Determine the effect of the failure on the application • If the failure decreases the detection accuracy below the severity threshold THS, it is severe; maintenance is dispatched every time a severe failure is detected. • THS can be specified per activity
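The dispatch decision can be sketched as a threshold check over per-activity accuracies. The sketch assumes the accuracies have already been obtained by replaying historical data through the new classifier ensemble; the function name and all accuracy values are made up for illustration.

```python
from typing import Dict


def is_severe(replay_accuracy: Dict[str, float], ths: Dict[str, float]) -> bool:
    """Return True when any activity's replayed accuracy falls below its THS."""
    return any(replay_accuracy[activity] < threshold
               for activity, threshold in ths.items())


# Per-activity thresholds (0.3 is the THS value used in the experiments).
ths = {"cooking": 0.3, "sleeping": 0.3}

# Hypothetical replay results after two different single-node failures.
after_fridge_failure = {"cooking": 0.82, "sleeping": 0.90}
after_stove_failure = {"cooking": 0.21, "sleeping": 0.90}

print(is_severe(after_fridge_failure, ths))  # redundancy covers the loss
print(is_severe(after_stove_failure, ths))   # maintenance would be dispatched
```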

Maintaining detection accuracy under failure • SMART uses a variety of classifier types • Naïve Bayesian • Hidden Markov Model • Switch to the classifier instance among all types and training sets that performed best on the training data
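Choosing the instance to switch to amounts to taking the best entry from a table of scores. In the sketch below the scores are assumed to be precomputed on the training data; the key layout (classifier type, held-out node), the function name, and the numbers are illustrative.

```python
from typing import Dict, Tuple

ClassifierKey = Tuple[str, str]  # (classifier type, held-out node)


def best_classifier(scores: Dict[ClassifierKey, float]) -> ClassifierKey:
    """Return the key of the instance with the highest training-data score."""
    return max(scores, key=scores.get)


scores = {
    ("naive_bayes", "stove"): 0.81,
    ("hmm", "stove"): 0.86,
    ("naive_bayes", "fridge_door"): 0.74,
}
chosen = best_classifier(scores)
print(chosen)
```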

Experimental Setup • Evaluate SMART on 3 houses in 2 publicly available activity recognition datasets. • 1) CASAS datasets: • 40 days of labeled data from over 60 sensors, 2-resident home • 2) two single-resident homes: House A and House B • House A: 25 days from 14 sensors • House B: 13 days from 27 sensors

Experimental Setup • Complex activities detection (more than 1 sensor) • Severity threshold THS = 0.3 • The datasets do not contain failure information • All node failures in the experiments were simulated by modifying the values reported by the “failed” node. • “stuck at” failure: set the value of the failed node to 1 • “misplacement” failure: replaced the data of the failed sensor s with data from a sensor located at s’ new position
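The two simulated failure types can be sketched on a toy data layout. The sketch assumes each sensor's data is a list of binary readings per time slot, which may differ from the datasets' actual event format; the function and sensor names are hypothetical.

```python
from typing import Dict, List

Readings = Dict[str, List[int]]  # sensor name -> binary reading per time slot


def inject_stuck_at(readings: Readings, failed: str, value: int = 1) -> Readings:
    """Copy the data with the failed sensor reporting `value` in every slot."""
    out = {sensor: list(values) for sensor, values in readings.items()}
    out[failed] = [value] * len(out[failed])
    return out


def inject_misplacement(readings: Readings, failed: str, new_position: str) -> Readings:
    """Copy the data with the failed sensor's readings replaced by those of
    the sensor located at its new position."""
    out = {sensor: list(values) for sensor, values in readings.items()}
    out[failed] = list(readings[new_position])
    return out


data = {"kitchen_motion": [0, 1, 0, 0], "living_motion": [1, 0, 0, 1]}
stuck = inject_stuck_at(data, "kitchen_motion")
moved = inject_misplacement(data, "kitchen_motion", "living_motion")
print(stuck["kitchen_motion"], moved["kitchen_motion"])
```

Both helpers return modified copies, so the original historical data stays available for training the non-failure case.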

Results • Detecting sensor node failures • Kitchen -> living room

Results • Node failure severity assessment • WSU: 1/8, HA: 3/8, HB: 3/7 significant sensors

Results • Evaluate SMART’s impact on the MTTF of the application • MTTF: number of time units after which the detection accuracy falls below THS • Unlike the baseline, our approach determines that the application has failed not when the first node fails, but when the first severe node failure occurs.
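The difference between the two failure criteria can be shown on a toy failure log. The log format (day, severity flag), the function names, and the example days are assumptions for illustration and do not come from the datasets.

```python
from typing import List, Optional, Tuple

Failure = Tuple[int, bool]  # (day of failure, is the failure severe?)


def baseline_time_to_failure(failures: List[Failure]) -> Optional[int]:
    """Baseline: the application counts as failed at the first node failure."""
    return min((day for day, _ in failures), default=None)


def smart_time_to_failure(failures: List[Failure]) -> Optional[int]:
    """SMART: the application counts as failed at the first severe failure."""
    return min((day for day, severe in failures if severe), default=None)


log = [(12, False), (30, False), (41, True)]
print(baseline_time_to_failure(log), smart_time_to_failure(log))
```

Averaging such times over many simulated failure sequences would give the MTTF for each criterion.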

Results • Detailed view of the MTTF for prepare breakfast from the WSU house.

Results • Maintaining high activity recognition accuracy under failures • Compared to a classifier trained on all nodes in the system, our approach achieves higher activity recognition accuracy in the presence of node failures


Discussion • Effect of node use on the failure detection accuracy • The failure detection accuracy of a node vs. how frequently this node is used for a particular activity • Node usage ratio per activity: the percentage of instances of that activity in which node n was used • There is a positive correlation between the importance of a node for the activity and how accurately we can detect the node’s failure
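The node usage ratio follows directly from its definition. In this sketch each activity instance is represented as the set of nodes that fired during it, which is an assumed data layout; the names are illustrative.

```python
from typing import List, Set


def node_usage_ratio(instances: List[Set[str]], node: str) -> float:
    """Fraction of activity instances in which `node` was used."""
    if not instances:
        return 0.0
    return sum(node in fired for fired in instances) / len(instances)


# Four instances of one activity, each listing the nodes that fired.
cooking_instances = [
    {"stove", "fridge_door"},
    {"stove", "kitchen_motion"},
    {"fridge_door"},
    {"stove"},
]
ratio = node_usage_ratio(cooking_instances, "stove")
print(ratio)
```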

Discussion • Effect of node use on the failure detection accuracy • The accuracy of detecting a “stuck at” failure for the important nodes for the kitchen activities in House A. • When only the important nodes are considered, the failure detection accuracy increases dramatically

Discussion • Limitations and future work • SMART cannot accurately detect the failures of sensors that are not frequently used in any of the activities. • SMART can be combined with state-of-the-art health-monitoring systems, which can accurately detect fail-stop failures experienced by the rarely used nodes. • Assumption made: only single-node failures occur, not multiple simultaneous failures • Plan to analyze how SMART’s failure detection accuracy is affected by multiple-node failures.

Conclusions • SMART: a general failure detection, assessment, and adaptation approach for smart home applications • Decreases the number of maintenance dispatches by 55% • Triples the MTTF of the application on average • Maintains sufficient activity recognition accuracy in the presence of failures by dynamically updating the classifiers at runtime with over 85% accuracy • Improves the activity recognition accuracy under node failures by more than 15% on average.
