Le Gruenwald The University of Oklahoma School of Computer Science Norman, OK 73019

Estimating Missing Data in Sensor Network Databases Using Data Mining to Support Space Data Analysis
Le Gruenwald
The University of Oklahoma, School of Computer Science, Norman, OK 73019
ggruenwald@ou.edu
This research is funded in part by NASA Grant No. NNG05GA30G

Estimating Missing Data in Sensor Network Databases Using Data Mining to Support Space Data Analysis
• The Objective of this Research
• Project Status
• The Computing Environment
• The Problem Definition
• Possible Approaches
• An Overview of Association Rules
• The DSARM Framework
• The CARMA Approach
• Simulation Results
• Conclusions and Future Work

The Objective of this Research To derive an algorithm to estimate a missing, corrupted or late reading from a sensor in a sensor network environment.

Project Status • Started October 1, 2004. • Developed an algorithm for estimating data in a centralized sensor network environment (without considering spatial and temporal dimensions). • Implemented the algorithm and conducted experiments using traffic data.

The Computing Environment (1)
• Sensor Networks: Triggered by recent advances in Micro Electro Mechanical Systems (MEMS) technology, low-power analog and digital electronics, and low-power radio frequency (RF) design.
• Purpose: To monitor, combine, analyze and respond in a timely manner to the data collected by hundreds or thousands of sensors distributed in the physical world.
• Examples:
- Space Science: sensors collecting data on conditions on Mars.
- Transportation: sensors for traffic monitoring.
- Battlefield: sensors attached to soldiers or vehicles, or scattered throughout important areas.

The Computing Environment (2)
[Figure: Sensor1, Sensor2, ..., SensorN in the real world send data streams to the SERVER; the USER sends queries to the server and receives answers.]

The Computing Environment (3)
• Data Streams
- the most natural way to process data in the majority of sensor network applications;
- an append-only collection of tuples that is ordered by some increasing key value (often time) [Zdonik,02]
• Data Stream Example (Sensor X):
sens_id, time(n-4), reading
sens_id, time(n-3), reading
sens_id, time(n-2), reading
sens_id, time(n-1), reading
sens_id, time(n), reading

The Problem Definition (1)
• What is the problem?
- a tuple from a particular sensor may be late: due to unsynchronized sensor timers; due to networking issues;
- a tuple may arrive on time, but may be corrupted: due to local interference; due to power exhaustion at the sensor;
- a tuple may be lost: due to local interference; due to power exhaustion at the sensor; due to networking issues.

The Problem Definition (2) • Why is it a problem? The data gathered by the sensors is used by the queries running at the server. It can be expected that most of the queries are continuous, i.e. they are evaluated each time a new round of sensor readings arrives at the server.

The Problem Definition (3) • Why is it a problem? - In the case of a missing or corrupted tuple, some queries cannot be executed; - In the case of a late tuple, the result of some queries will be delayed.

Possible Approaches (1)
• Ignore
• Ask the sensor again
• 2 or more sensors
• Estimate the missing value:
- Averaging techniques
- Use existing relations between sensors (CARMA Approach)

Possible Approaches (2) Combining Association Rule Mining with Average window size (CARMA) • Use association rule mining to generate a set of sensors that are related to the sensor with the missing value (MS). The values of the related sensors in the current round will contribute with different weights towards the estimated missing value. • If no related sensors can be found, use an averaging technique to estimate the missing value.

Overview of Association Rules (1)
Apriori Algorithm [Agrawal 94], steps required:
• Find all frequent 1-itemsets: at least minSup% of the transactions in D contain the item (milk, bread, etc.)
• Find all frequent 2-itemsets: at least minSup% of the transactions in D contain the 2 items (milk and bread, milk and jam, etc.)
• Find all frequent 3-, 4-, … itemsets, until no larger frequent itemset can be generated (the performance bottleneck)
• Having all frequent itemsets, generate all association rules that satisfy both minSup and minConf.
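
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration of the level-wise Apriori idea, not the implementation used in this research; the function names `apriori` and `rules` and the sample baskets are hypothetical, and minSup and minConf are passed as fractions rather than percentages.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    # Level 1 candidates: every single item seen in D.
    candidates = {frozenset([item]) for t in tx for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate with one pass over D.
        level = {}
        for c in candidates:
            sup = sum(1 for t in tx if c <= t) / n
            if sup >= min_sup:
                level[c] = sup
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def rules(frequent, min_conf):
    """Return (antecedent, consequent, support, confidence) for valid rules."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(itemset, r):
                ante = frozenset(ante)
                conf = sup / frequent[ante]  # subsets of frequent sets are frequent
                if conf >= min_conf:
                    found.append((ante, itemset - ante, sup, conf))
    return found


# Hypothetical basket data, for illustration only.
baskets = [{"milk", "bread"}, {"milk", "bread", "jam"}, {"milk", "jam"}, {"bread"}]
frequent_sets = apriori(baskets, min_sup=0.5)
valid_rules = rules(frequent_sets, min_conf=0.6)
```

The repeated candidate generation and counting in the `while` loop is the part that grows quickly with itemset size, which is the bottleneck the slide refers to.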

The DSARM Framework (1)
• A standard association rule mining technique (e.g. Apriori) cannot be directly applied in a data stream environment:
- The data to be mined is updated very frequently, so all the steps would have to be performed for every round with a missing value, giving poor time performance;
- The data to be mined in a sensor network environment is of a different nature from basket data.

Differences between basket and sensor data Basket Data (Boolean) Sensor Traffic Data (Quantitative)

The DSARM Framework (2)
• Consider itemsets and association rules always w.r.t. a particular sensor state (sensor value).
• Store the sensor data in a form that will facilitate association rule mining.
• Generate only the association rules that are useful in a particular situation.

The DSARM Framework (3)
• Apriori Framework: find all valid rules, satisfying both minSup and minConf.
• DSARM Framework: find only the rules between pairs of sensors, in which the MS is a consequent, w.r.t. a particular state, satisfying both minSup and minConf.
[Figure: rules among Sensor1, Sensor2, Sensor3, ..., SensorN and the MS under the Apriori Framework and under the DSARM Framework.]

DSARM Framework (4)
The effects of the proposed modifications:
• Faster discovery of the association rules: only the frequent 1- and 2-itemsets are generated (avoiding the performance bottleneck of generating frequent itemsets of size 3 and above) => faster estimation of the missing value;
• Data structures containing the metadata (in the form of counters) about the frequent 1- and 2-itemsets are feasible in terms of memory space => faster estimation of the missing value;

DSARM Framework (4) The effects of the proposed modifications: • Considering the association rules only w.r.t. a given state leads to a better accuracy of the estimation;

DSARM Framework (4) The effects of the proposed modifications: • The calculated actSup and actConf of the association rules cannot be used for determining the weight with which each related sensor will contribute towards the estimated missing value. (why?) • Instead, to determine the weight with which each related sensor Si will contribute towards the estimated value of MS, use the distance in the history of the {Si, MS} pair.

The CARMA Approach (1)
The CARMA Approach is an implementation of the DSARM Framework plus an averaging technique that estimates the missing value in cases where association rule mining cannot produce an estimate.
The CARMA Approach is a combination of:
- the Buffer, the Cube, and the Counter: data structures that store the information received from the sensors;
- the checkBuffer(), update(), and estimateValue() algorithms, which use the data in these structures.

The CARMA Approach (2)
• The Buffer
Purpose: To store the readings from the current round.
Implemented as an array of size equal to the number of sensors in the network.
[Figure: the Buffer with one cell per sensor, S0 to S4.]

The CARMA Approach (3)
• The Cube
Purpose: To keep track of all 1- and 2-itemsets observed in the last w rounds.
Implemented as a 2D array of doubly linked lists (DLLs). The information from the newest round is stored at the front of the Cube and the information from the oldest round is stored at the back of the Cube.
[Figure: the Cube for sensors S0 to S4, with the time axis running from the newest slice at the front to the oldest slice at the back; a cell holds the common state of a sensor pair, or -1.]

The CARMA Approach (4)
• The Counter
Purpose: To speed up the estimation, counters for all possible 1- and 2-itemsets are maintained in the Counter.
Implemented as a 3D array. The observed number of 1- and 2-itemsets for a particular state of sensor readings is stored in the node of the Counter that corresponds to the sensors and the state.
[Figure: the Counter for sensors S0 to S4, one layer per state.]

The CARMA Approach (5)
• The checkBuffer() Algorithm
Purpose: To check for missing values in the current round stored in the Buffer and to initiate a proper action as a result of this check.
checkBuffer() {
  while (true) { // repeat indefinitely
    while (the time window for the current session with the sensors is still open) {
      listen to sensors and record their readings;
    }
    if (a missing value is detected in the Buffer) {
      call estimateValue(); // to estimate and also to invoke update()
    } else {
      send OK signal to Application Queries to proceed;
      call update();
    }
    clear the Buffer; // set each cell to -1
  }
}

The CARMA Approach (5)
• The update() Algorithm
Purpose: To update both the Cube and the Counter with the data in the current round, which is stored in the Buffer.
update() {
  for (each sensor Si, i = 0 .. numSens-1, reporting state s) { // i.e. form a 1-itemset w.r.t. s
    add Cube[Si][Si].slice[0], data field = s; // add new node at the head of a DLL
    discard Cube[Si][Si].slice[last];
    update Counter[Si][Si].state[all affected states];
    for (each sensor Sj, j = i+1 .. numSens-1) {
      add Cube[Si][Sj].slice[0]; // add new node at the head of a DLL
      if (Sj reports the same state as Si) { // i.e. form a 2-itemset w.r.t. s
        the data field in the new node = s;
      } else { // Si and Sj do not report the same state
        the data field in the new node = -1;
      }
      discard Cube[Si][Sj].slice[last];
      update Counter[Si][Sj].state[all affected states];
    }
  }
}
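
The Cube and Counter bookkeeping in update() can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the class name `CubeCounter` and its methods are hypothetical, a `collections.deque` stands in for each doubly linked list, states are assumed to be the integers 1..num_states, and -1 marks "no common state" as on the slides.

```python
from collections import deque

MISSING = -1  # marker used on the slides for an empty cell / no common state


class CubeCounter:
    """Sliding-window Cube plus per-state Counter for 1- and 2-itemsets."""

    def __init__(self, num_sens, num_states, window):
        self.n = num_sens
        self.w = window
        # cube[i][j]: common state of (Si, Sj) per round, newest first.
        self.cube = [[deque() for _ in range(num_sens)] for _ in range(num_sens)]
        # counter[i][j][s]: rounds in the window where Si and Sj both reported s.
        self.counter = [
            [[0] * (num_states + 1) for _ in range(num_sens)]
            for _ in range(num_sens)
        ]

    def update(self, buffer):
        """Fold one round of readings (the Buffer) into the Cube and the Counter."""
        for i in range(self.n):
            for j in range(i, self.n):  # j == i is the 1-itemset of Si
                same = buffer[i] == buffer[j] and buffer[i] != MISSING
                self._push(i, j, buffer[i] if same else MISSING)

    def _push(self, i, j, state):
        window = self.cube[i][j]
        if len(window) == self.w:  # discard the oldest slice
            old = window.pop()
            if old != MISSING:
                self.counter[i][j][old] -= 1
        window.appendleft(state)  # newest slice goes to the front
        if state != MISSING:
            self.counter[i][j][state] += 1

    def count(self, i, j, state):
        """Number of rounds in the window where Si and Sj both reported state."""
        a, b = min(i, j), max(i, j)  # only the upper triangle is stored
        return self.counter[a][b][state]
```

Because the counters are adjusted when a slice enters or leaves the window, the support of any 1- or 2-itemset can be read in constant time instead of rescanning the window.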

The CARMA Approach (6)
• The estimateValue() Algorithm
Purpose: To estimate a missing sensor reading, using the data in the Counter, the Cube, and the Buffer.
estimateValue(MS) {
  determine all eligible states "e" for the MS; // compare actSup with minSup for every possible state
  for (every eligible state e) {
    create an empty StateSet_e;
  }
  distribute all sensors without missing values in the current round to a proper StateSet_e based on their states recorded in the Buffer;
  for (every StateSet_e) { // determine the eligible sensors
    for (every sensor Si) {
      check if a rule Si → MS | e is valid; // compare actSup with minSup and actConf with minConf
      if (not valid) {
        delete Si from the StateSet_e;
      }
    }
  }

The CARMA Approach (7)
• The estimateValue() Algorithm (cont.)
  ……
  // At this point we have generated all the eligible sensors,
  // grouped in the corresponding StateSet_e collections.
  // Determine the weight with which each eligible sensor will contribute towards the
  // missing value being "e" and accumulate this weight in the StateValue_e and totalWeight variables
  for (every non-empty StateSet_e) {
    for (every sensor Si) {
      weightSi = number of occurrences in which Si and MS have reported the same state divided by the window size; // a distance in the history of the pair {Si, MS}
      StateValue_e = StateValue_e + weightSi;
      totalWeight = totalWeight + weightSi;
    }
  }
  // calculate the estimated value
  missingValue = (Σ StateValue_e * e) / totalWeight; // the sum is over every existing state set e
} // end of estimateValue()

An example of the update() Algorithm (for the very first round)
The Buffer (S0 S1 S2 S3 S4): 1 3 1 2 2
[Figures: Cube.slice[0] before and after the update; Counter.state[2] before and after the update.]

An example of the estimateValue() Algorithm
Assume:
- minimum support (minSup) = 25%
- minimum confidence (minConf) = 25%
- sliding window size = 5
The Buffer (S0 S1 S2 S3 S4): 2 -1 2 1 1, i.e. missing value for S1.

An example of the estimateValue() Algorithm
Step 1. Determine all eligible states for MS
Checking for state 1: actSup of S1 | 1 = 3 / 5 = 60% > minSup = 25% => State 1 is an eligible state for S1
Checking for state 2: actSup of S1 | 2 = 2 / 5 = 40% > minSup = 25% => State 2 is an eligible state for S1
Checking for state 3: actSup of S1 | 3 = 0 / 5 = 0% < minSup = 25% => State 3 is not an eligible state for S1
Checking for state 4: actSup of S1 | 4 = 0 / 5 = 0% < minSup = 25% => State 4 is not an eligible state for S1

An example of the estimateValue() Algorithm
Step 2. Distribute the sensors into the proper StateSets
The Buffer (S0 S1 S2 S3 S4): 2 -1 2 1 1
Created StateSets are: StateSet_1 and StateSet_2
StateSet_1 = {S0, S2}
StateSet_2 = {S3, S4}

An example of the estimateValue() Algorithm
Step 3. Determine all eligible sensors
Checking StateSet_1:
S0: actSup of S0 → S1 | 1 = 2/5 = 40% > minSup = 25%; actConf of S0 → S1 | 1 = 2/3 = 67% > minConf = 25% => S0 is an eligible sensor
S2: actSup of S2 → S1 | 1 = 3/5 = 60% > minSup = 25%; actConf of S2 → S1 | 1 = 3/3 = 100% > minConf = 25% => S2 is an eligible sensor
Checking StateSet_2:
S3: actSup of S3 → S1 | 2 = 2/5 = 40% > minSup = 25%; actConf of S3 → S1 | 2 = 2/4 = 50% > minConf = 25% => S3 is an eligible sensor
S4: actSup of S4 → S1 | 2 = 1/5 = 20% < minSup = 25% => S4 is not an eligible sensor, delete S4 from StateSet_2

An example of the estimateValue() Algorithm
Step 4. Determine the weights for all eligible sensors
Eligible sensors: S0 and S2 for state 1, S3 for state 2
StateValue_1 = 2/5 (from {S0, S1}) + 5/5 (from {S2, S1}) = 1.4
StateValue_2 = 3/5 (from {S3, S1}) = 0.6
TotalWeight = 1.4 + 0.6 = 2

An example of the estimateValue() Algorithm
Step 5. Calculate the missing value
missingValue = (Σ StateValue_e * e) / totalWeight = (1.4 * 1 + 0.6 * 2) / 2 = 2.6 / 2 = 1.3
The state of S1 is thus estimated to be equal to 1.
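
The five steps of this example can be traced with a short Python sketch. It is a simplified illustration, not the authors' implementation: `estimate_value` is a hypothetical name, counts are recomputed from a plain list of past rounds instead of being read from the Counter, and thresholds are compared with >=. The figures holding the example's history are not part of this transcript, so the `history` below is invented so that its counts match the supports, confidences and weights quoted in Steps 1 to 4, and the current readings are chosen so that the sensors fall into the StateSets listed in Step 2.

```python
def estimate_value(ms, buffer, history, states, min_sup, min_conf):
    """Estimate the state of sensor ms; returns None if no related sensor exists.

    buffer  -- current round, one state per sensor, None where a value is missing
    history -- the sliding window: a list of past rounds (lists of states)
    """
    w = len(history)

    def count(i, j, s):
        # Rounds in the window where Si and Sj both reported state s.
        return sum(1 for r in history if r[i] == s and r[j] == s)

    # Step 1: eligible states for MS (actSup of MS | e versus minSup).
    eligible = [e for e in states if count(ms, ms, e) / w >= min_sup]

    # Step 2: distribute the reporting sensors by their current state.
    state_sets = {
        e: [i for i, v in enumerate(buffer) if i != ms and v == e]
        for e in eligible
    }

    # Step 3: keep only sensors whose rule Si -> MS | e is valid.
    for e, sensors in state_sets.items():
        state_sets[e] = [
            i for i in sensors
            if count(i, ms, e) / w >= min_sup
            and count(i, i, e) > 0  # guards the division when min_sup is 0
            and count(i, ms, e) / count(i, i, e) >= min_conf
        ]

    # Step 4: weight = share of rounds in which Si and MS agreed on any state.
    state_value = {}
    total_weight = 0.0
    for e, sensors in state_sets.items():
        for i in sensors:
            weight = sum(1 for r in history if r[i] == r[ms]) / w
            state_value[e] = state_value.get(e, 0.0) + weight
            total_weight += weight

    if total_weight == 0:
        return None  # caller would fall back to the averaging technique

    # Step 5: weighted average of the candidate states.
    return sum(value * e for e, value in state_value.items()) / total_weight


# Invented window (oldest round first), columns are S0..S4.
history = [
    [1, 1, 1, 1, 4],
    [1, 1, 1, 2, 4],
    [3, 1, 1, 2, 4],
    [1, 2, 2, 2, 2],
    [3, 2, 2, 2, 4],
]
buffer = [1, None, 1, 2, 2]  # S1 is the missing sensor
estimate = estimate_value(1, buffer, history, [1, 2, 3, 4], 0.25, 0.25)
estimated_state = round(estimate)
```

With these inputs the sketch drops S4 in Step 3, accumulates weights 1.4 and 0.6, and arrives at the same 1.3 (up to floating-point rounding) and state 1 as the slides.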

Simulation Results (1) The Simulation Model
A collection of 108 sensor nodes deployed on city streets. The sensors collect and report the number of vehicles detected in a given time interval. The data is obtained from [AFIDA 03].
[Figure: the sensor nodes and the server.]

Simulation Results (2) The Simulation Model
The Dynamic Parameters:
- sliding window size (winSize): 6, 18, 30, 42
- minimum support, minimum confidence (MSMC): 0%, 1%, 2%, 4%, 7%, 10%
- number of possible sensor reading states (numStates = number of subsets): 10, 20, 40, 80
- error rate of the wireless communication link (p): 0.1%, 1%, 10%

Simulation Results (3) Simulation Experiments
Performance Measurements:
• The achieved accuracy of the estimated value.
• The total main memory access time to process a round of sensor readings (TMMAT).
• The required memory space.
• The overall power consumption by the sensor nodes (OPC).
• The percentage of cases in which the estimateValue() algorithm cannot produce an estimation alone.

Simulation Results (4) Evaluation of the Accuracy for different approaches
To evaluate the accuracy of the estimation we use the Normalized Root Mean Square Error (RMSE), where:
- Xai and Xei are the actual value and the estimated value, respectively;
- #estimations is the number of estimations performed in a simulation run;
- #states is the number of subsets in which the actual sensor readings are distributed.
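
The formula itself is not part of this transcript. One form that is consistent with the three quantities defined above takes the root of the mean squared difference between Xai and Xei over all estimations and divides it by #states. The Python sketch below implements that assumed form; both the normalization by #states and the function name `normalized_rmse` are assumptions, not the authors' stated definition.

```python
import math


def normalized_rmse(actual, estimated, num_states):
    """Assumed form: sqrt(sum((Xa_i - Xe_i)^2) / #estimations) / #states."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    mean_sq = sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual)
    return math.sqrt(mean_sq) / num_states
```

Dividing by the number of states would make errors comparable across runs with different numStates settings, which is presumably why a normalized measure is used.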

Simulation Results (5) Evaluation of the Accuracy for different approaches Compared are 4 different approaches for estimating a missing value: • The Previous Value (PV) approach • The Average Round (AR) approach • The Average Window Size (AWS) approach • The CARMA approach: a combination of the estimateValue() algorithm + the AWS approach for the cases in which association rule mining cannot generate an estimation of a missing value

Simulation Results (6) Evaluation of the Accuracy for different approaches

Simulation Results (7) Evaluation of CARMA TMMAT for different error rates p

Simulation Results (8) Evaluation of CARMA Memory Space Requirements

Simulation Results (9) Evaluation of the OPC for different error rates p
Compared are the CARMA and MultiSend Approaches

Conclusions (1)
This research has proposed an approach called CARMA (Combining Association Rule Mining with Average window size) for estimating missing values in related data streams.
• The CARMA approach achieves the best accuracy of the estimated missing values compared to alternative approaches (PV, AR, AWS).
• The response time for CARMA is more than the response time for AWS, but will be acceptable for a wide range of applications.

Conclusions (2)
• The memory space required by CARMA is feasible.
• CARMA provides a power-conscious use of a sensor network.
