Coordinated Statistical Modeling and Optimization for Ensuring Data Integrity and Attack-Resiliency in Networked-Embedded Systems. Farinaz Koushanfar, ECE Dept. Rice University Statistics Colloquium Oct 9, 2006.
Outline • Sensor Networks: Applications, Challenges • Coordinated Modeling-Optimization Framework • Inter-sensor models • Embedded sensing models • Optimization for data integrity • Attack Resilient Location Discovery • Problem formulation and attack models • Robust random sample consensus for attack detection
Sensor Networks: Applications, Challenges
Sensor Networks • Comprehensive monitoring and analysis of complex physical environments • Imagine…
[Images: air pollution (http://www.ucsusa.org/clean_energy/coalvswind/c02c.html); Texas wine (http://www.alamosawinecellars.com/vineyard2.htm); vibration in Abercrombie (http://dacnet.rice.edu/maps/space/index.cfm?building=abc); flood in Houston (http://www.bluishorange.com/flood/photos/10bridgecars.jpg)]
Sensor Networks, How? • Networks of embedded sensing (actuating) and computing devices • [Images: Mica2Dot, CrossBow Tech.; courtesy of Prof. Estrin, CENS, UCLA]
Challenges in Sensor Networks • System: sensors, actuators, hardware, software, communication network layers • Limited: battery, bandwidth, cost • Unique to sensor networks: Sensing • Abstract the system state, complex properties, and model physical phenomena accurately, without biases • Parametric models: a priori assumptions • Often do not capture the complex relationships • Optimization based on such models has limited effectiveness
Challenges in Sensing • Massive datasets • Structure response in USGS building: 72 channels of 24-bit data, 500 samples/sec. • Energy consumption of the wireless nodes • Motes draw 36 mW in active mode; AA batteries with a storage capacity of 1850 mWh give about 50 h of active mode • Diversity in applications • Marine biology, seismic sensing, battlefield, contaminant transport, home sensors, laboratories, hospitals, etc. • Harsh environmental conditions • Battlefield, earthquakes, automatic detection, etc. • Wireless channel data loss • Sensor cost • Sensitivity of applications • Privacy and security
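The two numeric claims on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide; the function and variable names are illustrative, and it assumes the 72 channels are sampled at 500 samples/sec each, which the slide does not state explicitly.

```python
# Back-of-the-envelope check of the slide's figures (names are illustrative).

def raw_data_rate_bps(channels: int, bits_per_sample: int, samples_per_sec: int) -> int:
    """Raw sensing data rate in bits per second, assuming every channel
    is sampled at the full rate (an assumption, not stated on the slide)."""
    return channels * bits_per_sample * samples_per_sec

def active_lifetime_hours(capacity_mwh: float, active_power_mw: float) -> float:
    """Hours of continuous active mode a battery of the given capacity supports."""
    return capacity_mwh / active_power_mw

rate_bps = raw_data_rate_bps(channels=72, bits_per_sample=24, samples_per_sec=500)
lifetime_h = active_lifetime_hours(capacity_mwh=1850.0, active_power_mw=36.0)

print(f"raw rate: {rate_bps} bit/s ({rate_bps // 8} bytes/s)")
print(f"active lifetime: {lifetime_h:.1f} h")
```

Under these assumptions the building produces roughly 0.86 Mbit/s of raw data, and the battery budget works out to a little over 51 hours, consistent with the slide's rounded 50 h.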
Inconsistencies in the Measured Sensor Data • Erroneous measurements • Noisy readings: inevitable due to power and cost constraints and environmental impact • Systematic errors: offset bias, calibration effect, etc • Partially corrupted, still useful • Faulty (corrupted) measurements • Remove faults to get a consistent picture • Can be accidental (e.g. bad link), or malicious • Missing data • May be accidental, intentional (sleeping, subsampling, compression, filtering), or malicious
Coordinated Modeling-Optimization Framework
Motivational Example • Deployments show a gap between models and reality • Example: preliminary analysis of temperature sensor traces at UCLA BG • 23 sensor nodes, sampling every 5 minutes • Question: does the locality assumption hold?
Motivational Example (Cont’d) • No consistent relation between sensing and distance • Discontinuities, exposure differences, global sources • Also, some highly correlated close-by sensors • Best previous effort: local basis functions • Need new models for simultaneous abstraction of sensing and distance • What about other properties?
Motivational Example (Cont’d) • Separation of concerns • Embedded sensing models: • Define multiple graphs $G_1, G_2, \ldots, G_M$ that share vertices • E.g., sensing, distance • $d_{ij}$: distance between $s_i$ and $s_j$ • $e_{ij}$: sensing prediction error for the model $s_j = f_{ij}(s_i)$ • The distance and sensing are not jammed into one model, but are considered simultaneously
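A minimal sketch of the idea of two graphs over the same vertices: one edge weight for distance and one for sensing prediction error. The slide does not specify the form of the pairwise model, so this sketch assumes a least-squares line and uses the mean absolute residual as the error; all names and data are hypothetical.

```python
import math
from itertools import permutations

def fit_line(x, y):
    """Least-squares line y ~ a*x + b (assumed form of the pairwise model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx

def build_graphs(positions, traces):
    """Return two edge-weight maps over the same vertex set:
    dist[(i, j)] is the distance, err[(i, j)] the error of predicting j from i."""
    dist, err = {}, {}
    for i, j in permutations(positions, 2):
        dist[(i, j)] = math.dist(positions[i], positions[j])
        a, b = fit_line(traces[i], traces[j])
        residuals = [abs(yj - (a * xi + b)) for xi, yj in zip(traces[i], traces[j])]
        err[(i, j)] = sum(residuals) / len(residuals)
    return dist, err

# Hypothetical data: B is close to A but poorly predicted by it; C is far but well predicted.
positions = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (10.0, 0.0)}
traces = {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 1.0, 4.0, 2.0], "C": [2.0, 4.0, 6.0, 8.0]}
dist, err = build_graphs(positions, traces)
print(dist[("A", "B")], err[("A", "B")])
print(dist[("A", "C")], err[("A", "C")])
```

Keeping the two weight maps separate is what lets a later optimization step look for edges that are short in one graph and long in the other.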
Motivational Example-2 • Cross-domain optimization: Sensor deployment • Objective: select up to $S$ candidate points for adding an extra sensor • For each $s_i$, a TL sensor is a Delaunay neighbor that cannot be predicted within the error bound • Denote the edges of TL sensors as candidates • Find intelligent ways to select the best set of candidate points
Motivational Example (Cont’d) • Coordinated modeling-optimization • Q1: How to do cross-domain optimization? • Q2: Can the models be of higher dimensions? • Q3: Can they help us address the data-integrity problem? • Q4: How effective are they?
Inter-sensor models
Inter-sensor Models • Intra-sensor models (autoregressive models) • Have shown the effectiveness of adding shape constraints to univariate models • Isotonicity • Unimodularity • Number of level sets • Convexity • Bijection • Transitivity • Combinatorial isotonic regression (CIR) finds the optimal nonparametric shape-constrained univariate fit for an arbitrary error norm in average linear time • Models are a precursor to subsequent optimization
Application of CIR on Temperature Sensors at Intel Berkeley* • Prediction error over all node pairs • Limiting the number of level sets * Koushanfar, Taft (Intel), Potkonjak Infocom’06
Multivariate CIR • Recent result*: • The first optimal, polynomial-time DP-based approach for multi-dimensional CIR: (1) Build the relative importance matrix R (2) Build the error matrix E (3) Build a cumulative error matrix C by using a nested DP (4) Starting at the minimum value in the last column of C, trace back a path to the first column that minimizes the cumulative error • * Thanks to Prof. D. Brillinger (UCB) and Prof. M. Potkonjak (UCLA) for the useful discussions
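To make the error-matrix, cumulative-matrix, and trace-back steps concrete, here is a sketch of the univariate case only: an isotonic (non-decreasing) fit over a finite alphabet, computed by dynamic programming. It is not the multivariate algorithm from the slide and omits the relative importance matrix; the function name, the default absolute-error norm, and the example data are assumptions.

```python
def isotonic_dp(y, alphabet, err=lambda v, a: abs(v - a)):
    """Non-decreasing fit to y (already ordered by the explanatory variable)
    with fitted values restricted to a finite alphabet. Runs in O(T * |A|)."""
    levels = sorted(alphabet)
    T, A = len(y), len(levels)
    # C[t][a]: minimum cumulative error for y[0..t] when fit[t] == levels[a]
    C = [[0] * A for _ in range(T)]
    back = [[0] * A for _ in range(T)]
    for a in range(A):
        C[0][a] = err(y[0], levels[a])
    for t in range(1, T):
        best_a, best = 0, C[t - 1][0]
        for a in range(A):
            # running minimum over all previous levels <= current level
            if C[t - 1][a] < best:
                best, best_a = C[t - 1][a], a
            C[t][a] = err(y[t], levels[a]) + best
            back[t][a] = best_a
    # start from the minimum of the last column and trace back to the first
    a = min(range(A), key=lambda k: C[T - 1][k])
    total = C[T - 1][a]
    fit = [None] * T
    for t in range(T - 1, -1, -1):
        fit[t] = levels[a]
        a = back[t][a]
    return fit, total

fit, total = isotonic_dp([1, 3, 2, 4], alphabet={1, 2, 3, 4})
print(fit, total)
```

Because the error function is a parameter, the same recurrence works for any error norm that can be evaluated per sample, which matches the slide's "arbitrary error norm" claim for the univariate setting.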
Multivariate CIR - complexity • $T$ sensor values drawn from a finite alphabet $A$ • Complexity of the univariate case is dominated by sorting, $O(T \log T)$ • $C_m(M)$: complexity of the multivariate case with $M$ explanatory variables • $C_m(M) = A^{M+1} C_m(M-1)$, pseudo-polynomial complexity
Open Questions • How to speed up the Multivariate CIR? • Pruning algorithms that exploit sparsity (?) • Is it possible to make CIR locally adaptive? • In principle, finding the min error is a global optimization that cannot be locally addressed • Can one guarantee convergence and correctness of CIR among sensors? • Is it possible to have continuous approximations to address the problem? • How can one build efficient models in presence of missing and/or faulty data?
Embedded sensing models
State-of-the-Art Sensing Models • Parametric models • Gaussian random fields, graphical models (GM), message passing, iterative message passing, belief propagation (BP) • Nonparametric models • Marginalized kernels (GM), alternating projections, distributed EM, nonparametric BP • Common thread: capture dependence among sensor data; no edge means no dependence • Need to capture the shape of field discontinuities and/or lack of correlations between adjacent nodes
Embedded Sensing Models • Principle of separation of concerns (SoC) • Example: • Geometric graph (planar, 2D): Delaunay edges (adjacency) • Sensing graph: higher dimensional embedded graph • [Figure: geometric graph and sensing graph over nodes 1 to 8] • Idea: Map the sensing graph into lower dimensions. Exploit the discrepancy between the higher dimensional topology and the lower dimensional space to identify the obstacles
Open Questions • Efficient computation and handling of embedded sensing models in higher dimensions • Joint compression of multiple entities • How can we capture dynamic topologies, i.e. mobility, dynamic time series, sleeping • Efficient structures/data formats for representing the multi-dimensional topologies
Coordinated Modeling and Optimization • Paramount importance of interface in system and software development • Create statistical models suitable for optimization • Paradigms: continuous, smooth, consistent • Small number of level sets • Convexity • Bijection: $x'_i = G(F(x_i)) = x_i$, where $y_i = F(x_i)$ and $x_i = G(y_i)$ • Transitivity: $z_i = F(x_i) = G(y_i)$ • Create optimization mechanisms resilient to statistical variability • Paradigms: randomization • Multiple validations • Constructive probabilistic • Reweighting of the objective function (OF) and constraints
Optimization for data integrity
Data Integrity: Multiple Validations • The data-integrity problems are complex due to the complex environments and uncertainties • Proof of NP-completeness (PhD’05) • Data integrity (noise reduction, calibration, fault detection, data recovery) exploits system redundancies • Coordinated modeling-optimization • Multiple validations (MV) optimization algorithms • The solutions are validated using multiple input samples • Similar in spirit to cross-validation (CV) in statistics • MV is more comprehensive than CV, since it is a generic optimization paradigm based on resampling the input space and validating the output of a complex algorithm rather than a model
Example: Missing Data • Between 40% and 50% missing data at the Intel Berkeley testbed • Limited A2D: discrete level sets
Missing Data Recovery (MSD) • Problem formulation: • Given: • $N$ sensors $s_1, \ldots, s_N$ • Sensor data at time $t$: $(d_1(t), d_2(t), \ldots, d_N(t))$ • Some sensor data are missing in an arbitrary way, i.e., there is an $i$ such that $d_i(t) = NA$ • Objective: recover the missing data in such a way that the consistency between the readings of different sensors is maximized (prediction error is minimized) • [Figure: sensor graph with nodes 1 to 9]
State-of-the-Art in MSD • MSD is a prevalent problem in many fields • Expectation maximization (EM), Dempster et al.’77 • Assumes a multivariate density • Local optimization, likely to be trapped in a local maximum of the likelihood function • Multiple imputations (MI), Rubin 1987 • Missing data replaced by multiple simulated versions • May distort variable associations due to treating the completed dataset as the actual one • Both MI and EM can be computationally intensive • MV often combines lower dimensional models
MV for Missing Data Recovery • Iteratively select a sub-sample of available nodes (the present set) and optimize for it • The remaining nodes (the holdout set) are used to validate the solution and quantify its uncertainty
• 1) Randomly assign $\kappa: \{1, \ldots, |V|\} \to \{1, \ldots, K\}$;
• 2) for ($k = 1$ to $k = K$)
• a: calculate $OF^{-k}(O)$;
• b: compute $MVC^{-k}(O)$;
• 3) $MVC(O) = G(MVC^{-k}(O))$, $k = 1, \ldots, K$;
• 4) $O_{best} = \arg\min_O MVC(O)$;
• Advantage: not only a solution, but also an uncertainty bound for the solutions
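The four steps above can be sketched as a small resampling loop. This is a minimal illustration, not the author's implementation: it assumes each candidate solution is a function that optimizes on the present set, that the validation criterion is the mean absolute error on the holdout set, and that the aggregator G is the median (an order statistic). The names `multiple_validations`, `holdout_error`, and the toy readings are hypothetical.

```python
import random
from statistics import mean, median

def multiple_validations(candidates, readings, K, criterion, aggregate=median, seed=0):
    """Pick the candidate whose aggregated holdout criterion is smallest."""
    order = list(range(len(readings)))
    random.Random(seed).shuffle(order)
    fold_of = {v: pos % K for pos, v in enumerate(order)}  # step 1: random assignment
    scores, per_fold = {}, {}
    for name, solve in candidates.items():
        per_fold[name] = []
        for k in range(K):                                  # step 2: loop over folds
            present = [readings[v] for v in order if fold_of[v] != k]
            holdout = [readings[v] for v in order if fold_of[v] == k]
            solution = solve(present)                       # 2a: optimize on the present set
            per_fold[name].append(criterion(solution, holdout))  # 2b: validate on holdout
        scores[name] = aggregate(per_fold[name])            # step 3: combine the K criteria
    best = min(scores, key=scores.get)                      # step 4: argmin
    return best, scores, per_fold

def holdout_error(estimate, holdout):
    return mean([abs(r - estimate) for r in holdout])

# Toy example: neighbor readings used to recover one missing value; one reading is faulty.
readings = [20.0] * 9 + [100.0]
candidates = {"mean": mean, "median": median}
best, scores, per_fold = multiple_validations(candidates, readings, K=5, criterion=holdout_error)
print(best, scores)
```

The per-fold values returned alongside the aggregate give a rough spread for each candidate, which is one way to read the slide's point about obtaining an uncertainty bound together with the solution.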
Open Questions • Theoretical proof of correctness of MV: which properties of CV hold for MV? • Which MV criteria (MVC) are robust to outliers, e.g., order statistics? • Which objective function (OF) to use? • Ensemble-voting of weak classifiers by boosting (exponential loss function) • Real-time implementation on sensor network testbeds • Scaling properties of the MV algorithm
Attack Resilient Location Discovery
Location Discovery* • A number of nodes have location data (beacons) • Other nodes estimate their distance to beacons to find their locations • Many distance estimation methods (e.g., AoA, ToA) • With more than three beacons, a node can estimate its location • We focus on the atomic case (one unknown) • [Figure: network of nodes s0 to s10] * Joint work with N. Kiyavash, UIUC
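As an illustration of the atomic case, the sketch below estimates one unknown position from beacon coordinates and measured distances. The slide does not name an estimator, so this assumes a standard linearization: subtract the first beacon's circle equation from the others and solve the resulting linear system in the least-squares sense. The function name and the example beacons are hypothetical.

```python
import math

def trilaterate(beacons):
    """Estimate (x, y) from a list of (x_n, y_n, d_n) tuples, at least three
    and not all collinear, via linearized least squares (an assumed method)."""
    x1, y1, d1 = beacons[0]
    saa = sab = sbb = sac = sbc = 0.0
    for xn, yn, dn in beacons[1:]:
        # circle_n minus circle_1 gives one linear equation a*x + b*y = c
        a, b = 2 * (xn - x1), 2 * (yn - y1)
        c = xn**2 - x1**2 + yn**2 - y1**2 - dn**2 + d1**2
        saa += a * a; sab += a * b; sbb += b * b
        sac += a * c; sbc += b * c
    det = saa * sbb - sab * sab  # determinant of the 2x2 normal equations
    if abs(det) < 1e-9:
        raise ValueError("need at least three non-collinear beacons")
    return (sac * sbb - sab * sbc) / det, (saa * sbc - sab * sac) / det

# Noise-free example: the unknown node sits at (3, 4).
true_pos = (3.0, 4.0)
beacons = [(bx, by, math.dist((bx, by), true_pos))
           for bx, by in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]]
x, y = trilaterate(beacons)
print(round(x, 6), round(y, 6))
```

With noisy or manipulated distances, the same routine still returns an answer, which is why a separate consistency check is needed before trusting it.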
Robust Location Discovery: Problem Formulation • Instance: • A node $s_0$ with unknown coordinates $(x_0, y_0)$ • Set $L$ of location tuples $\{(x_n, y_n, d_n)\}$ ($n$ beacons) • Consistency metric on $(s_n, s_0)$, consistency threshold $t$ • Problem: • Find an estimate for $(x_0, y_0)$ that is consistent, under the metric on $(s_n, s_0)$, with at least $t$ points in set $L$
Attack Model • The attackers can modify the distance measurement of any beacon without any limits • The network is cryptographically protected against protocol attacks, e.g., wormhole, Sybil • The measurements from each beacon are only considered once • Both independent and coalition (colluding) attacks • In coalition attacks, the attacking beacons coordinate their efforts • There is a minimum number of correct beacons; otherwise colluding beacons will mislead the target
Robust Random Sample Consensus • Initialize $i$; • While ($i < i_{max}$) • Randomly draw a subset $S_i$ of size 3 from $L$; • Use $S_i$ to estimate $\hat{s}_0$; • Calculate $K$, the number of points in $L \setminus S_i$ consistent w.r.t. $\hat{s}_0$; • If ($K > t$) • {form a new $\hat{s}_0$ from the $K$ points; Terminate;} • Increment $i$; • Terminate and output the largest consistent estimate;
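The loop above can be sketched as follows. Two pieces are assumptions, since the slides do not spell them out: the position estimate uses linearized least squares, and a beacon counts as consistent when its reported distance differs from the distance to the estimate by at most `tol`. The helper is restated here so the block is self-contained; all names, the attacker offsets, and the noise-free data are hypothetical.

```python
import math
import random

def estimate(beacons):
    """Linearized least-squares position from (x, y, d) tuples; None if degenerate."""
    if len(beacons) < 3:
        return None
    x1, y1, d1 = beacons[0]
    saa = sab = sbb = sac = sbc = 0.0
    for xn, yn, dn in beacons[1:]:
        a, b = 2 * (xn - x1), 2 * (yn - y1)
        c = xn**2 - x1**2 + yn**2 - y1**2 - dn**2 + d1**2
        saa += a * a; sab += a * b; sbb += b * b
        sac += a * c; sbc += b * c
    det = saa * sbb - sab * sab
    if abs(det) < 1e-9:
        return None
    return ((sac * sbb - sab * sbc) / det, (saa * sbc - sab * sac) / det)

def is_consistent(beacon, pos, tol):
    """Assumed consistency metric: reported distance matches the geometry within tol."""
    x, y, d = beacon
    return abs(math.dist((x, y), pos) - d) <= tol

def robust_locate(L, t, i_max, tol, seed=0):
    rng = random.Random(seed)
    best_est, best_k = None, -1
    for _ in range(i_max):
        picked = rng.sample(range(len(L)), 3)          # random subset S_i of size 3
        est = estimate([L[k] for k in picked])
        if est is None:
            continue                                    # collinear sample, draw again
        inliers = [L[k] for k in range(len(L))
                   if k not in picked and is_consistent(L[k], est, tol)]
        if len(inliers) > t:                            # K > t: refit on the K points and stop
            refined = estimate(inliers)
            return refined if refined is not None else est
        if len(inliers) > best_k:
            best_est, best_k = est, len(inliers)
    return best_est                                     # largest consistent estimate seen

# Nine honest beacons (no three collinear) and three that inflate their distances.
TRUE_POS = (3.0, 4.0)
honest = [(float(k), float(k * k)) for k in range(-4, 5)]
attackers = [((5.0, -3.0), 5.0), ((-6.0, 2.0), 9.0), ((7.0, 7.0), 14.0)]
L = [(x, y, math.dist((x, y), TRUE_POS)) for x, y in honest]
L += [(x, y, math.dist((x, y), TRUE_POS) + lie) for (x, y), lie in attackers]
est = robust_locate(L, t=5, i_max=100, tol=1e-6, seed=1)
print(est)
```

The tight tolerance only makes sense because the example distances are noise-free; with real measurements `tol` would have to reflect the ranging error.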
Selecting the parameters: $i_{max}$, $t$ • $q$: probability of correctness of a randomly drawn point • Expected number of trials: $E[i] = 1/q^3$ • $\eta$: threshold for the probability of missing a good subset, $(1-q^3)^{i_{max}} = \eta$, or $i_{max} = \ln(\eta) / \ln(1-q^3)$ • $I$: set of inliers; $\epsilon$: percentage of inliers • $\epsilon = 1 - N_a/N$ • For large datasets $q = \epsilon^3$, $E[i] = \epsilon^{-9}$ • The number of iterations is
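A small helper makes the iteration bound concrete. It follows the relation on this slide: the probability that every one of the drawn size-3 subsets is bad should not exceed a chosen threshold. The parameter name `eta` for that threshold and the function names are assumptions made for this sketch.

```python
import math

def expected_trials(q: float) -> float:
    """Expected number of draws until a size-3 subset is all correct: 1 / q^3."""
    return 1.0 / q**3

def max_iterations(q: float, eta: float) -> int:
    """Smallest integer i_max with (1 - q^3)^i_max <= eta.
    q: probability that a randomly drawn point is correct (0 < q < 1).
    eta: accepted probability of never drawing a good subset (0 < eta < 1)."""
    if not (0.0 < q < 1.0 and 0.0 < eta < 1.0):
        raise ValueError("q and eta must lie strictly between 0 and 1")
    return math.ceil(math.log(eta) / math.log(1.0 - q**3))

i_max = max_iterations(q=0.5, eta=0.01)
print(expected_trials(0.5), i_max)
```

For q = 0.5 a good subset is expected after 8 draws on average, while guaranteeing a miss probability of at most 1% takes 35 draws, which shows how quickly the bound grows as the fraction of correct points drops.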
Comparison to Other Algorithms • [Plots: colluding attackers; independent attackers]
Summary • Sensor networks: importance of sensing and data integrity (missing data, faults, noise, systematic errors) • Coordinated modeling and optimization framework • Nonparametric models, shape constraints • Multivariate CIR, optimal algorithm, slow in multiple dimensions • Embedded sensing models, separation of concerns • Projection into lower dimensions • Optimization algorithm: multiple validations (MV) • Attack-resilient location discovery • 25% more effective in the presence of coalition attackers, 35+% more effective against independent attackers