Distributed Indexing and Querying in Sensor Networks using Statistical Models • Arnab Bhattacharya • arnabb@iitk.ac.in • Indian Institute of Technology (IIT), Kanpur
Wireless sensor networks • A “sensor” is a tiny, cheap communicating device with limited memory, communication bandwidth and battery life • Communication is precious • Provides monitoring of physical phenomena • Wireless sensor network (WSN): a collection of such sensors • Enables spatio-temporal monitoring of events • Inter-communication among neighboring sensors • Base station (BS) as a centralized point of entry
Semantic modeling • Uses of WSNs • How many rooms are occupied? • Is there a fire in any room? • What is the pattern of birds’ movements? • Low-level individual sensor readings do not provide semantics • Content summarization by modeling • Which models to use? • Where and when to model?
Outline • Semantic modeling • Which models to use? • Where and when to build the models? • MIST: An index structure • Query algorithms • Experiments • Conclusions
How to model? • ZebraNet • Tracks the movement of zebras using velocity sensors • Three discrete states: • Grazing (G) • Walking (W) • Fast-moving (F) • A zebra’s behavior is summarized as a state sequence, e.g., G W W W W F F G G or G G F F F W W W
Statistical models • Markov Chain (MC) • Provides inference about behavior in general • τ: transition probabilities • π: start state probabilities • Hidden Markov Model (HMM) • Tries to infer the hidden causes of such behavior • ξ: emission probabilities • Use of either model depends on the context • [Figures: zebra mobility modeled as an MC and as an HMM]
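To make the MC concrete, here is a minimal sketch (in Python, with hypothetical names not from the slides) of how a sensor might estimate π and τ from observed state sequences such as the zebra traces above:

```python
from collections import defaultdict

STATES = ["G", "W", "F"]  # grazing, walking, fast-moving

def train_mc(sequences, states=STATES):
    """Estimate MC parameters (pi, tau) by counting starts and transitions."""
    start = defaultdict(int)
    trans = defaultdict(int)
    for seq in sequences:
        start[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[(a, b)] += 1
    total = sum(start.values())
    pi = {s: start[s] / total for s in states}
    out = {s: sum(trans[(s, t)] for t in states) for s in states}
    tau = {(s, t): (trans[(s, t)] / out[s]) if out[s] else 0.0
           for s in states for t in states}
    return pi, tau

# The two behavior sequences from the slide
pi, tau = train_mc(["GWWWWFFGG", "GGFFFWWW"])
```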
When and where: Queries • Identify interesting behaviors in the network • Example: Identify all zebras (sensors) that observed the behavior pattern FFFF with likelihood > 0.8 • May denote a possible predator attack • Sequence queries • Range query: Return sensors that observed a particular behavior with likelihood > threshold • Top-1 query: Which sensor is most likely to observe a given behavior? • Model queries • 1-NN query: Which sensor is most similar to a given pattern (model)?
Centralized solution • Each sensor • Builds a model • Transmits the model to the base station (BS) • Queries arrive at the BS • The BS answers them • No query communication • Every model update at a sensor is transmitted • Huge update costs
Slack-based centralized solution • To save update costs • Introduce slack locally at each sensor • No update if the new parameter is within the slack of the old parameter • Update costs reduced • The BS knows the slack • It derives a range for the likelihood from each cached model • If a query cannot be answered from the cached models, it is forwarded to the sensor • Query communication costs are introduced
Outline • Semantic modeling • MIST: An index structure • Correlation among models • Composition of models • Hierarchical aggregation of index • Dynamic maintenance • Query algorithms • Experiments • Conclusions
MIST (Model-based Index Structure) • Overlay a tree on the network • Each sensor trains a model (MC/HMM) from its observed sequences • Child models are aggregated into the parent using correlation among models • Two types of composite models • Bottom-up aggregation of index models • Model updates handled by slack
Correlation among models • Models λ1,…,λm are (1−ε)-correlated if for all corresponding parameters σ1,…,σm: min(σ1,…,σm) / max(σ1,…,σm) ≥ 1 − ε • ε → 0: High correlation • Models are similar
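As a sketch (assuming the min/max-ratio reading of the definition above, and models flattened to dictionaries of parameters), the check looks like:

```python
def correlated(models, eps):
    """True if the models are (1-eps)-correlated: for every corresponding
    parameter, the smallest value is within a (1-eps) factor of the largest."""
    for key in models[0]:
        vals = [m[key] for m in models]
        if min(vals) < (1 - eps) * max(vals):
            return False
    return True
```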
Outline • Semantic modeling • MIST: An index structure • Correlation among models • Composition of models • Hierarchical aggregation of index • Dynamic maintenance • Query algorithms • Experiments • Conclusions
Average index model • λavg maintains • Average of all corresponding parameters: σavg = (σ1 + … + σm) / m • ε′: Correlation parameter between λavg and any λi • βmax, βmin: maximum and minimum over all parameters of the constituent models
Min-max index models • λmin and λmax maintain • Minimum and maximum of all corresponding parameters: σmin = min(σ1,…,σm), σmax = max(σ1,…,σm) • No extra parameters
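A minimal sketch of both compositions, assuming models are flat dictionaries of parameters (the exact bookkeeping of ε′, βmin, βmax in MIST may differ):

```python
def compose_average(models):
    """Average index model: per-parameter mean, plus the correlation
    parameter eps' w.r.t. the constituents and global beta bounds."""
    keys = models[0].keys()
    avg = {k: sum(m[k] for m in models) / len(models) for k in keys}
    beta_min = min(v for m in models for v in m.values())
    beta_max = max(v for m in models for v in m.values())
    # Tightest eps' such that avg is (1-eps')-correlated with every constituent
    eps_prime = 0.0
    for k in keys:
        for m in models:
            lo, hi = sorted((avg[k], m[k]))
            if hi > 0:
                eps_prime = max(eps_prime, 1 - lo / hi)
    return avg, eps_prime, beta_min, beta_max

def compose_minmax(models):
    """Min-max index models: per-parameter minimum and maximum (pseudo-models)."""
    keys = models[0].keys()
    return ({k: min(m[k] for m in models) for k in keys},
            {k: max(m[k] for m in models) for k in keys})
```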
Comparison • Statistical properties • Average: Valid statistical model • Transition and start state probabilities add up to 1 • Min-max: Pseudo-models • Probabilities, in general, do not add up to 1 • Parameters • Average: 3 extra parameters (ε′, βmax, βmin) • Total: n + 3 parameters • Min-max: no extra parameters • Total: 2n parameters
Outline • Semantic modeling • MIST: An index structure • Correlation among models • Composition of models • Hierarchical aggregation of index • Dynamic maintenance • Query algorithms • Experiments • Conclusions
Hierarchical index • Average model • Correlation parameter ε′ • Correlation decreases up the hierarchy • βmax (βmin) • Maximum (minimum) of the children’s βmax (βmin) values • Bounds become larger • Min- (max-) model • Aggregation of the children’s min- (max-) model parameters • Min (max) becomes smaller (larger)
Dynamic maintenance • Observations, and therefore models, change over time • Slack parameter δ • Models are re-built with period d • Last model update time u • No update if λ(t+d) is within (1−δ) correlation of λ(u) • The parent maintains a correlation parameter εslack, adjusted for the slack δ • Hierarchical index construction uses εslack
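A sketch of the local update test, assuming the slack means the re-built model λ(t+d) must stay (1−δ)-correlated with the last transmitted model λ(u):

```python
def needs_update(last_sent, rebuilt, delta):
    """Transmit only if some parameter of the re-built model has drifted
    outside the (1-delta) correlation slack of the last sent model."""
    for k in last_sent:
        lo, hi = sorted((last_sent[k], rebuilt[k]))
        if hi > 0 and lo < (1 - delta) * hi:
            return True
    return False
```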
Outline • Semantic modeling • MIST: An index structure • Query algorithms • Sequence queries • Model queries • Experiments • Conclusions
Queries • Sequence queries • Query sequence of symbols: q = q1q2…qk • Range query: Return sensors that have observed q with probability > χ • Top-1 query: Given q, return the sensor with the highest probability of observing q • Model queries • Query model: Q = {π, τ} • 1-NN query: Return the sensor model most similar to Q
Range query • Probability of observing q from λ • q is of length k • σi is the ith parameter appearing in P(q|λ) • For an MC λ = {π, τ}: P(q|λ) = πq1 · τq1q2 · … · τqk−1qk • For an HMM, P(q|λ) is a sum over all possible state paths, each path contributing a product of 2k factors • Idea: bound every parameter σi separately
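For the MC case the probability is just a product along the query; a minimal sketch using the π and τ dictionaries from the earlier training sketch:

```python
def mc_likelihood(q, pi, tau):
    """P(q | lambda) for a Markov chain: start probability of q[0] times
    the transition probabilities along the rest of the sequence."""
    p = pi[q[0]]
    for a, b in zip(q, q[1:]):
        p *= tau[(a, b)]
    return p

# e.g., likelihood of the predator-attack pattern from the earlier slide
# mc_likelihood("FFFF", pi, tau)
```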
Bounds • Average model • δ and εslack correct for changes after the last update • Each parameter σi is bounded above and below using σiavg, εslack, and δ; multiplying the per-parameter bounds yields bounds for P(q|λ) • Min-max model • Each σi lies in [σimin·(1−δ), σimax/(1−δ)]; the products of these per-parameter bounds bound P(q|λ), as sketched below
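For the min-max model the bounds follow directly from the per-parameter intervals above; a hedged sketch (dictionary keys are a state for π entries and a state pair for τ entries, an assumption of this sketch):

```python
def minmax_bounds(q, lam_min, lam_max, delta):
    """Bounds on P(q | lambda) for any model under a min-max index node.
    Each parameter sigma_i of a descendant is assumed to lie in
    [sigma_i_min * (1 - delta), sigma_i_max / (1 - delta)]."""
    lo = lam_min[q[0]] * (1 - delta)
    hi = lam_max[q[0]] / (1 - delta)
    for a, b in zip(q, q[1:]):
        lo *= lam_min[(a, b)] * (1 - delta)
        hi *= lam_max[(a, b)] / (1 - delta)
    return lo, min(hi, 1.0)
```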
Top-1 query • For any internal node • Each subtree has a lower and an upper bound on the probability of observing q • Prune a subtree if its upper bound is lower than the lower bound of some other subtree • Guarantees that the best answer cannot be in the pruned subtree • Requires comparison of bounds across subtrees • Pruning effectiveness depends on the dissimilarity of subtree models
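A sketch of the pruning rule (hypothetical names), given per-subtree (lower, upper) bounds on the probability of observing q:

```python
def prune_for_top1(subtree_bounds):
    """Keep only subtrees whose upper bound reaches the best lower bound;
    the pruned subtrees provably cannot contain the top-1 sensor."""
    best_lower = max(lo for lo, _ in subtree_bounds)
    return [i for i, (_, hi) in enumerate(subtree_bounds) if hi >= best_lower]
```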
Model (1-NN) query • Requires a notion of distance between models • Euclidean distance (L2 norm) • Corresponding parameters are treated as dimensions • Straightforward for MCs • For HMMs, state correspondence must first be established • Via domain knowledge • Via matching
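For MCs, where parameters already correspond one-to-one, the distance is a plain L2 norm over the flattened parameters; a minimal sketch:

```python
import math

def model_distance(m1, m2):
    """L2 norm between two models, treating each corresponding parameter
    (start or transition probability) as one dimension."""
    return math.sqrt(sum((m1[k] - m2[k]) ** 2 for k in m1))
```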
Average models • M-tree-like mechanism • 1-nearest-neighbor (1-NN) query • “Model distance” space is a metric space • Topology is the overlaid communication tree • The average model maintains a radius: the largest possible distance to any model in its subtree, obtained by bounding each parameter’s deviation from the average
Min-max models • R-tree-like mechanism • 1-nearest-neighbor (1-NN) query • “Model parameter” space is a vector space • Topology is the overlaid communication tree • For each parameter σi, there is a lower bound σimin·(1−δ) and an upper bound σimax/(1−δ) • The min-max models thus form a bounding rectangle • Similar to MBRs (minimum bounding rectangles)
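Pruning can then use an R-tree-style MINDIST from the query model to this rectangle; a sketch under the same flat-dictionary assumption as the earlier sketches:

```python
import math

def mindist(query, lam_min, lam_max, delta):
    """MINDIST from a query model to the bounding rectangle spanned by
    [sigma_min*(1-delta), sigma_max/(1-delta)] per parameter; a parameter
    contributes zero if the query value falls inside its interval."""
    s = 0.0
    for k in query:
        lo = lam_min[k] * (1 - delta)
        hi = lam_max[k] / (1 - delta)
        if query[k] < lo:
            s += (lo - query[k]) ** 2
        elif query[k] > hi:
            s += (query[k] - hi) ** 2
    return math.sqrt(s)
```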
“Curse of dimensionality” • Dimensionality = number of model parameters • No “curse” for sequence queries • Each index model yields two bounds on P(q|λ) • Pruning depends on whether χ (the threshold) falls within these bounds • Bounds are real numbers between 0 and 1 • Single-dimensional space: the probability line • “Curse” exists for model queries • R-tree- and M-tree-like pruning operates on the full parameter space
Outline • Semantic modeling • MIST: An index structure • Query algorithms • Experiments • Experimental setup • Effects of different parameters • Fault-tolerance • Conclusions
Optimal slack • A large slack minimizes updates but drives querying costs up • The reverse holds for a small slack • The optimum can be chosen by analyzing expected total costs • Non-linear optimization • Difficult for local nodes • Almost impossible over the entire network • Changes in the models require re-computation • Hence, determined experimentally
Fault-tolerance • Periodic heartbeat messages from child to parent • Extra messages • When a parent fails or a child-parent link fails • The child finds another parent • Sends its model parameters • Model, correlation, etc. are computed afresh at the new parent • When the node or link comes back up • The child switches back to its original parent • The old parent is notified • Parents update their models, correlation, etc.
Outline • Semantic modeling • MIST: An index structure • Query algorithms • Experiments • Experimental setup • Effects of different parameters • Fault-tolerance • Conclusions
Experimental setup • Two datasets • Real dataset • Laboratory sensors • Temperature readings • Readings every 30 s for 10 days • 4 rooms, each with 4 sensors • States: C (cold, <25°C), P (pleasant), H (hot, >27°C) • Synthetic dataset • Network size varied from 16 to 512 • Number of states varied from 3 to 11 • Correlation parameter ε varied from 0.001 to 0.5 • Both MCs and HMMs • Metric • Communication cost in bytes
Compared techniques • Centralized with no slack • Each node transmits all updates to the BS • Zero querying cost • Centralized with slack • Each node maintains a slack • Queries are sent to the sensor nodes if the cached models at the BS cannot answer them • MIST schemes • Average/min-max models • With/without slack
Effect of query rate • Slack-based schemes win at low query rates • The centralized scheme with no slack is best at very high query rates
Update costs • No-slack schemes incur almost double the update costs • MIST’s slack schemes are better since updates are pruned at every level of the hierarchy
Query costs • Costs increase with decreasing correlation (1−ε) • At high correlation (low ε), no-slack schemes (including centralized) perform better
Optimal slack • A minimum exists for MIST’s schemes • Centralized: due to the low query rate, update costs dominate querying costs
Network size • No-slack schemes fare better as the network grows • Querying cost increases due to looser bounds and longer path lengths to leaf nodes
Number of states: update costs • Update costs increase with the number of states • MIST schemes remain scalable due to hierarchical pruning
Number of states: query costs • Querying cost decreases • With more states, each model parameter σ decreases • Hence the probability of observing q, P(q|λ), decreases • Therefore, the bounds decrease
Number of states: total costs • For sequence queries, no “curse of dimensionality”
Number of states: model query • For model queries, the “curse of dimensionality” sets in • Still scalable up to reasonable state sizes
Fault-tolerance experiments • Costs increase moderately due to parent switching • Costs scale gracefully with the probability of failure
Outline • Semantic modeling • MIST: An index structure • Query algorithms • Experiments • Conclusions • Future work
Conclusions • A hierarchical in-network index structure for sensor networks using statistical models • Hierarchical model aggregation schemes • Average model • Min-max models • Queries • Sequence queries • Model queries • Experiments • Better than centralized schemes in terms of update, querying and total communication costs • Scales well with network size and number of states
Future work • How to overlay the tree? • Similar models should be in the same subtree • “Quality” of the tree • Distributed solutions • What happens when models are updated? • Fault-tolerance • How to find the best parent during faults? • Whether to switch back or stay after recovery • How to replicate information across siblings? • Deployment