Handling Faults in Sensor Networks
Problem Description
Faults are common in WSS deployments. A fault is anything that decreases the quality (i.e., usability) or quantity of data. This means we have to address both environmental sensor faults and "network" sensor faults, e.g.:
• Faulty sensors
• Faulty radios
• Questionable environmental context
Fault tolerance is a common technique for addressing faults in distributed and Internet systems:
• Replication & redundancy: not realistic here
• Diversity, i.e., multi-modal sensing: requires a detailed understanding of the environment, which is often the very reason we are deploying the system
• Furthermore, especially for rapid deployments, we want all of the data
=> Fault Detection & Remediation
Hard Problem for Sensor Networks
• First, the scale is much larger in sensor networks.
• Second, the phenomena being observed in many sensor network applications are far more complex and less well understood than the manufacturing and fabrication plants studied in classical process control.
• Third, while in process control the inputs provided to the plant are controlled, or at least measured, this is not the case for many phenomena that sensor networks observe (environmental phenomena; inhabited buildings or other structures).
• Lastly, these systems are deployed in harsh outdoor environments, users have limited visibility into each node, and unreliable, low-bandwidth channels mean that debugging information must be limited and is not transmitted reliably.
(Bullets largely taken from the Integrity proposal.)
Confidence
This can be viewed as a classification problem. Confidence:
1) classifies faulty (network and environmental) sensors, and
2) "labels" sensors with actions a user can take to remediate these faults.
Actions: anything that increases the quality or quantity of data, e.g.:
• Replace or recalibrate sensor
• Replace node
• Extract physical samples
System Constraints
• Operate in the absence of ground truth
• Continually learn and adapt to its environment, with no distinct training and operational phases
• Have a transparent decision-making process
• Operate on-line
System Principles
• Minimize the role of static thresholds
  Knowledge-based systems, static thresholds, and decision trees are relatively simple, transparent, and scale to large amounts of data. However, assigning an expected range is not a complete solution:
  • Faults can occur in range
  • Rigid thresholds are rarely meaningful: a threshold of 6 means 5.9999 is faulty while 6.00001 is good, which is unlikely to reflect reality
  Even if we can set the threshold accurately, threshold-based systems are not designed to adapt to dynamic environments or to incorporate user interaction and knowledge gained during the course of the deployment.
• Enable scalability (not complete autonomy)
  Most faults affecting sensor networks require some kind of human intervention, be it calibrating a sensor or moving a tree that has fallen on a node. Completely removing the human from the loop is neither practical nor possible. However, relying solely on a human to manually monitor and administer a large number of nodes and sensors is not feasible either.
System Constraints
• Operate in the absence of ground truth
• Continually learn and adapt to its environment, with no distinct training and operational phases
• Have a transparent decision-making process
• Operate on-line
Candidate approaches:
• Supervised learning
  • Adapts to a dynamic environment
  • Requires access to ground truth
• Unsupervised learning
  • Clustering is the most straightforward and transparent
  • K-means clustering is the simplest to implement and most closely meets our constraints
• Modified K-means
  • We modify K-means to create an on-line clustering algorithm
System Model Clustering Features Evaluation
System Model
• A sensor is an entity that periodically returns data to the sink
  • Network sensors return system metrics
  • Environmental sensors return sensor data
• The sink calculates features, e.g. standard deviation and gradient (between times t1 and t2)
• Feature vector F = <F1, F2, F3, F4> describes the state of a sensor
• The feature vector is used to cluster sensor data
[Figure: sensors report through the network cloud to the sink]
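A minimal sketch of how a sink might turn one window of readings into two of the features named on this slide. The exact definitions used by Confidence are not given here, so the function name `window_features`, the use of the population standard deviation, and the first-to-last-sample gradient are all assumptions made for illustration.

```python
import statistics

def window_features(samples, timestamps):
    """Return (standard deviation, gradient) for one window of readings.

    Both feature definitions are assumptions made for this sketch.
    """
    std_dev = statistics.pstdev(samples)
    # Gradient: absolute change per second between the first (t1) and last (t2) sample.
    elapsed = timestamps[-1] - timestamps[0]
    gradient = abs(samples[-1] - samples[0]) / elapsed if elapsed > 0 else 0.0
    return (std_dev, gradient)

# Hypothetical window: a reading that jumps from 20 to 26 within 30 seconds.
features = window_features([20.0, 20.0, 26.0, 26.0], [0.0, 10.0, 20.0, 30.0])
print(features)
```

In a full system these raw values would still need to be mapped into the 0-10 cluster space before clustering.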
System Intuition
• We reduce the problem of identifying and diagnosing system faults to identifying the correct feature space in which faults appear anomalous.
• Unpopular cluster => Anomalous => Faulty
• Actions that fix one point in a cluster are assumed to fix all the points in that cluster
Clustering Data using K-Means
• Closest to meeting system constraints
[Figure: K-means numerical example showing the 1st and 2nd iterations. Image from: http://people.revoledu.com/kardi/tutorial/kMean/NumericalExample.htm]
Making K-means On-line
• Only the cluster center closest to the point is updated
• Clusters and points are represented by feature vectors in cluster space
• Closest cluster => minimum Euclidean distance from the point in the feature space
• Points will not necessarily be grouped with their optimal clusters => this is not necessary, because Confidence's priority is to accurately classify the most recent points
• An EWMA is used to update the cluster center
• Nice side benefit: automatic storage of anomalous points in cluster space
[Figure: example 2-D feature space with axes F1 and F2, showing a point <F1,F2>]
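The update rule above can be sketched as follows. This is an illustrative sketch, not the Confidence implementation: the function names and the EWMA weight `ALPHA = 0.1` are assumptions.

```python
import math

ALPHA = 0.1  # EWMA weight given to the new point (assumed value)

def nearest_cluster(centers, point):
    """Index of the center with minimum Euclidean distance to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], point))

def add_point(centers, point, alpha=ALPHA):
    """Move only the closest center toward the point using an EWMA."""
    i = nearest_cluster(centers, point)
    centers[i] = [(1 - alpha) * c + alpha * p for c, p in zip(centers[i], point)]
    return i

# 2-D feature space <F1, F2> with two hypothetical cluster centers.
centers = [[0.0, 0.0], [10.0, 10.0]]
joined = add_point(centers, [2.0, 0.0], alpha=0.5)
print(joined, centers)
```

Because earlier points are never reassigned, each arrival costs one distance computation per cluster, which is what makes the approach usable on-line.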
System State: Updated upon Point Arrival
Cluster: Fc, Pc, Jc, Dc, Jc'
• Fc = <F1, F2..FN>: vector of N features representing the cluster center
• Pc: array of the most recent p points
Point: Fp
• Fp = <F1, F2..FN>: vector of N features representing the point
Cluster Space: μ, σ
Faulty Clusters are "Unpopular"
• Join rate JC: number of points that join cluster C in a fixed window W
• JC is calculated for each cluster every time a new point joins
• The mean μC and standard deviation σC are calculated over all JC
• Unpopular clusters are those with low JC
  • Faulty: JC < μC - σC
• This is not sufficient when faults are the common case
  • Penalize clusters that are farther from the origin by normalizing by their distance from the origin (DC): J'C = JC / DC
  • Features are chosen such that 0 => good and 10 => bad
  • Clusters are initially evenly distributed in this N-dimensional space, where each feature varies from 0 to 10
• Unpopular clusters are those with low J'C
  • Faulty: J'C = JC / DC < μC - σC
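A sketch of the two popularity tests, under stated assumptions. The slide does not say whether the mean and standard deviation in the second test are taken over the raw or the adjusted join rates, so this sketch computes them over whichever scores are being tested. The function names, the population standard deviation, and the small `eps` guard for a cluster sitting at the origin are also assumptions.

```python
import math
import statistics

def unpopular(scores):
    """Indices of clusters whose score is below mean - standard deviation."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    return [i for i, s in enumerate(scores) if s < mean - std]

def adjusted_join_rates(join_rates, centers, eps=1e-9):
    """Divide each join rate by its cluster's distance from the origin."""
    origin = [0.0] * len(centers[0])
    return [j / max(math.dist(c, origin), eps) for j, c in zip(join_rates, centers)]

# Hypothetical case where faults are common: the cluster far from the
# origin (index 3) is the most popular one, so the raw test misses it.
join_rates = [4, 6, 6, 12]
centers = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 12.0]]
print(unpopular(join_rates))
print(unpopular(adjusted_join_rates(join_rates, centers)))
```

The raw test flags nothing in this example, while the distance-adjusted test flags the far cluster.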
System State: Updated upon Point Arrival
Cluster: Fc, Pc, Jc, Dc, Jc'
• Fc = <F1, F2..FN>: vector of N features representing the cluster center
• Pc: array of the most recent p points
• Jc: join rate for a cluster, i.e. the number of points in Pc from unique sensors that have arrived in the last W seconds
• Dc: distance of the cluster from the origin
• Jc' = Jc / Dc: join rate adjusted to penalize clusters farther from the origin
Point: Fp
• Fp = <F1, F2..FN>: vector of N features representing the point
Cluster Space: μ, σ
• μ: mean Jc over all clusters
• σ: standard deviation of Jc over all clusters
Labeling Clusters with Actions
• Actions that fix one point in a cluster are assumed to fix all the points in that cluster
• Point joins a faulty cluster => Confidence notifies the user of the actions associated with that cluster
• The user notifies Confidence when they take an action
• If the next point from the sensor joins a good cluster, the action is associated with the faulty cluster and positively validated
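One way to sketch this feedback loop is shown below. The class `ActionLog` and its methods are hypothetical, and the slide does not say what happens when the next point still lands in a faulty cluster; this sketch simply discards the pending action in that case.

```python
from collections import defaultdict

class ActionLog:
    """Tracks user actions and validates them against the next point (hypothetical)."""

    def __init__(self):
        self.pending = {}  # sensor id -> (faulty cluster id, action taken)
        # faulty cluster id -> action -> number of positive validations
        self.validated = defaultdict(lambda: defaultdict(int))

    def user_took_action(self, sensor_id, faulty_cluster_id, action):
        self.pending[sensor_id] = (faulty_cluster_id, action)

    def point_arrived(self, sensor_id, joined_faulty_cluster):
        entry = self.pending.pop(sensor_id, None)
        if entry is not None and not joined_faulty_cluster:
            cluster_id, action = entry
            self.validated[cluster_id][action] += 1

    def suggestions(self, cluster_id):
        """Validated actions for a cluster, most frequently validated first."""
        counts = self.validated.get(cluster_id, {})
        return sorted(counts, key=counts.get, reverse=True)

log = ActionLog()
log.user_took_action("sensor-7", faulty_cluster_id=3, action="recalibrate sensor")
log.point_arrived("sensor-7", joined_faulty_cluster=False)
print(log.suggestions(3))
```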
Bootstrapping a Cluster with Actions
• If no actions are associated with a cluster, Confidence falls back on default actions
• Each feature is associated with a default action
• Example: F = <F1, F2>. If F1 > F2, then assign default action A1, else A2
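The two-feature rule can be sketched, and generalized to N features, by picking the default action of the largest feature. The generalization, the function name, and the example action strings are assumptions; ties go to the later feature so that the two-feature case matches "if F1 > F2 then A1, else A2".

```python
def default_action(center, feature_actions):
    """Return the default action of the largest feature in a cluster center.

    feature_actions[i] is the default action for feature i (hypothetical mapping).
    On a tie the later feature wins, matching "if F1 > F2 then A1, else A2".
    """
    worst = max(range(len(center)), key=lambda i: (center[i], i))
    return feature_actions[worst]

# Hypothetical default actions A1 and A2 for features F1 and F2.
actions = ["replace or recalibrate sensor", "replace node"]
print(default_action([7.5, 2.0], actions))
print(default_action([2.0, 7.5], actions))
```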
System State: Updated upon Point Arrival
Cluster: Fc, Pc, Jc, Dc, Jc', Ac
• Fc = <F1, F2..FN>: vector of N features representing the cluster center
• Pc: array of the most recent p points
• Jc: join rate for a cluster, i.e. the number of points in Pc from unique sensors that have arrived in the last W seconds
• Dc: distance of the cluster from the origin
• Jc' = Jc / Dc: join rate adjusted to penalize clusters farther from the origin
• Ac: vector of actions associated with the cluster
Point: Fp
• Fp = <F1, F2..FN>: vector of N features representing the point
Cluster Space: μ, σ
• μ: mean Jc over all clusters
• σ: standard deviation of Jc over all clusters
Feature Selection
• Features should be chosen such that, as the feature value increases, the sensor quality generally decreases
  • A direct mapping is not necessary
• Under this simple constraint, clusters that are farther from the origin are more likely to be faulty
• We have chosen features, based on deployment experience and sensor domain knowledge, that are useful in describing the general quality of the sensor
• For example, one feature for environmental sensors is the standard deviation of the data within a short window of samples. While we do not necessarily have a rigid cutoff between good and bad values for this feature, we know that as the standard deviation increases, the data is more likely to be faulty and the hardware is more likely to require action.
Normalizing Features
• Based on domain knowledge, features are approximately normalized to 0-10
  • 0 => generally good, and 10 => generally bad
• Without scaling, Euclidean distance will be unfairly biased by features with naturally larger numbers
• The clustering framework is robust to imprecise normalization (shown in evaluation)
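To illustrate the bias, the sketch below compares Euclidean distance before and after a simple linear rescaling to 0-10. The linear mapping, the clipping at both ends, the function name `scale`, and the "good" and "bad" reference values are assumptions chosen for the example.

```python
import math

def scale(value, good, bad):
    """Linearly map a raw feature to 0-10, where `good` -> 0 and `bad` -> 10.

    Values outside the assumed range are clipped.
    """
    fraction = (value - good) / (bad - good)
    return 10.0 * min(1.0, max(0.0, fraction))

# Two hypothetical points: (standard deviation, seconds since last packet).
a = (0.1, 100.0)
b = (1.9, 110.0)

# Unscaled, the 10-second difference dominates the distance.
print(math.dist(a, b))

# Scaled with assumed ranges 0-2 for the deviation and 0-1000 s for silence,
# the large change in standard deviation dominates instead.
a_scaled = (scale(a[0], 0.0, 2.0), scale(a[1], 0.0, 1000.0))
b_scaled = (scale(b[0], 0.0, 2.0), scale(b[1], 0.0, 1000.0))
print(math.dist(a_scaled, b_scaled))
```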
Non-Linearity in Features
• As a feature value increases, its significance decreases
• E.g., it is much more significant not to hear from a node for between 30 and 60 seconds than it is not to hear from it for between 1030 and 1060 seconds
• So the distance between a cluster at 30 seconds and another at 60 seconds should not be equivalent to the distance between a cluster at 1030 seconds and another at 1060 seconds
• Take log2 of most features when mapping to cluster space
  • log2 30 ≈ 4.9 (about 5)
  • log2 60 ≈ 5.9 (about 6)
  • log2 1030 ≈ 10.0 and log2 1060 ≈ 10.05 (both about 10)
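The mapping can be sketched as below. The function name and the clipping to the 0-10 range are assumptions; the clipping is what makes 1030 and 1060 seconds land on the same value.

```python
import math

def to_cluster_space(seconds):
    """Map a raw duration to cluster space with log2, clipped to 0-10 (assumed)."""
    return min(10.0, max(0.0, math.log2(max(seconds, 1.0))))

for s in (30, 60, 1030, 1060):
    print(s, round(to_cluster_space(s), 2))
```

The 30-to-60-second gap spans a full unit in cluster space, while the 1030-to-1060-second gap collapses to nothing.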
Goal of Evaluation
Demonstrate that Confidence's approach of suggesting broad actions using clustering can help users detect and remediate faults more quickly and accurately than previous approaches.
• Evaluate the detection latency and number of false positives in two scenarios: 1) single fault injection; and 2) multiple fault injection
• Evaluate the impact that our choice of parameters (EWMA, number of clusters, feature scaling) has on system performance
• Discuss our experiences in deploying Confidence in several test and real-world deployments
Contributions
• We propose Confidence, the first unified system comprising automated techniques geared towards increasing the quantity of usable, high-quality data for a large number of sensors. Confidence enables users to effectively administer large numbers of sensors and nodes by automating key tasks and intelligently guiding a user to take actions that demonstrably improve the data and network quality. Confidence has been implemented and used in multiple deployments.
• Improving latency and accuracy through less precise diagnoses: automatic clustering with Confidence metrics can detect real and simulated faults quickly and with few false positives.
• We apply this approach to improving both network and data integrity in a sensor network, demonstrating its generalizability. We also show that this approach is simple and requires minimal configuration, making it robust to human error and adaptable to changing environments, which are key elements in designing systems for WSNs.