Multidimensional Analysis of Atypical Events in Cyber-Physical Data

Multidimensional Analysis of Atypical Events in Cyber-Physical Data Lu-An Tang, Xiao Yu, Sangkyum Kim, Jiawei Han, Wen-Chih Peng, Yizhou Sun

Outline • Introduction • Backgrounds • Model Construction • Query Processing • Experiments

Introduction • Cyber-Physical System (CPS): integrates physical devices (e.g., sensors, cameras) with cyber components to form a situation-aware analytical system • Many promising applications • traffic observation • intruder/motion detection • battlefield surveillance • remote healthcare • Key task: analyze the atypical data with multi-dimensional information

Motivation Example I • Traffic Monitoring System: a typical CPS • Inductive loop sensors • Thousands, placed every few miles on highways • Operating 24 hours a day, 7 days a week • Monitoring traffic and reporting congestion

Motivation Example II • Questions from Transportation Officers • When does congestion usually happen downtown? • Where does congestion happen on weekdays? • In the past three months, which road was the most seriously congested, and how did those congestions start? • Traditional SQL queries cannot answer them

Our Contribution • Transportation officers demand results that are summarized, self-organized and succinct, delivered in a short time • Our goal • Construct a data model for atypical data in CPS • Support efficient query processing with such a model

Challenges • Massive Data • Thousands of sensors generate gigabytes, even terabytes, of data • Complex Events • An atypical event is a dynamic process influencing multiple spatial regions • How to represent such an event? – new measure/model • Effectiveness & Efficiency • If the query range is large, many events are involved • Retrieve the significant ones in a short time – new algorithm

Our Contribution • Introduce techniques to discover atypical events and summarize them as atypical micro-clusters • Integrate similar micro-clusters into macro-clusters to show the big picture • Construct the data model of the atypical cluster forest • Use a guided algorithm to retrieve the significant clusters efficiently

CPS Systems in Traffic Application • PeMS: collects data on California highways • CarWeb: collects real-time GPS data from cars • Google Traffic: toolkit on Google Maps • CubeView by Shekhar et al.: implements traditional OLAP on the traffic data • AITVS: based on CubeView, uses two more distinct views to support investigation • Most focus on SQL-based queries and lack analytical power • Built on the whole dataset – huge I/O overhead, atypical data are dwarfed

Other Spatial OLAP Techniques • Spatial Cube by Stefanovic et al.: dimension members are spatially referenced and can be represented on a map • Trajectory Cube by Giannotti et al.: includes temporal, spatial, demographic and technographic dimensions, with two kinds of measures: spatial and numerical • Flow Cube by Gonzalez et al.: analyzes item flows in RFID applications • Designed for different objects – cannot be used directly for this problem

Preliminaries • Atypical record: (s, t, f(s,t)) • s: sensor • t: reported time • f(s,t): severity measure • Analytical query Q(W, T, etc.) • W: spatial region • T: time period • There might be query conditions on other dimensions • Return total severity: the sum of f(s,t) over all sensors s in W and times t in T • Too abstract
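A minimal Python sketch of the total-severity query, under stated assumptions: W is modeled as a set of sensor ids and T as a half-open time interval. The names `AtypicalRecord` and `total_severity` and the sample values are illustrative and do not come from the slides.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class AtypicalRecord(NamedTuple):
    sensor: str       # s: sensor id
    time: int         # t: reported time (assumed: minutes since some origin)
    severity: float   # f(s, t): severity measure


def total_severity(records: Iterable[AtypicalRecord],
                   region: Set[str],
                   period: Tuple[int, int]) -> float:
    """Sum f(s, t) over records with s in W (region) and start <= t < end (period)."""
    start, end = period
    return sum(r.severity for r in records
               if r.sensor in region and start <= r.time < end)


# Illustrative data only.
records = [
    AtypicalRecord("s1", 5, 4.0),
    AtypicalRecord("s2", 10, 1.0),
    AtypicalRecord("s3", 10, 2.5),   # outside the queried region
    AtypicalRecord("s1", 20, 3.0),   # outside the queried period
]
result = total_severity(records, {"s1", "s2"}, (0, 15))
```

A single number like `result` says nothing about where or how an event evolved, which is why the slide calls this measure too abstract.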

Problem Formulation • Let R be the CPS dataset. The tasks are to retrieve the atypical events from R, design a measure to represent each event, and integrate the information of multiple events • Process the analytical query Q online • We assume the atypical criteria are given and the atypical dataset can be acquired in advance

System Overview

Atypical Event • Let us examine an atypical event, congestion in a traffic monitoring system: • starts from a single street segment • expands along the road and influences nearby roads • may cover hundreds of road segments when reaching full size • The data records in a congestion are spatially close and temporally related

Retrieve the Atypical Event • Scan the dataset, retrieve the atypical records and group them using a time threshold and a distance threshold • An atypical event is a set of atypical records • Its size is not bounded (or bounded only by the size of dataset R) • Difficult to represent and integrate • Too detailed – not a good measure
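The grouping step can be sketched as follows. This is one possible reading, not the authors' algorithm: two records are linked when they are within both thresholds, and an event is a connected group of linked records (found here with a small union-find). Sensor positions as one-dimensional mile markers and the function name `retrieve_events` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed record layout: (position in miles, time in minutes, severity).
Record = Tuple[float, float, float]


def retrieve_events(records: List[Record],
                    max_dist: float,
                    max_gap: float) -> List[List[Record]]:
    """Group records into events: connected components of the 'close in space and time' relation."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            close = abs(records[i][0] - records[j][0]) <= max_dist
            recent = abs(records[i][1] - records[j][1]) <= max_gap
            if close and recent:
                parent[find(i)] = find(j)   # union the two groups

    groups: Dict[int, List[Record]] = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


# Illustrative data: three chained records, one far away, one much later.
records = [(0.0, 0, 5.0), (1.0, 5, 5.0), (2.0, 10, 5.0), (50.0, 5, 5.0), (0.5, 300, 5.0)]
events = retrieve_events(records, max_dist=1.5, max_gap=5)
```

The first and third records are not directly within the thresholds, but both link to the second, so they end up in the same event. That chaining is what lets an event grow without a size bound.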

Atypical Micro-Cluster • Aggregate the atypical records along one dimension at a time • Summarize the total severity by sensor (spatial feature) • Summarize the total severity by time window (temporal feature) • The size is bounded by the total number of sensors and time windows • Still keeps detailed information
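A minimal sketch of building a micro-cluster from one event, assuming records are (sensor, time, severity) tuples and time windows are fixed-length buckets. The dictionary layout with `"spatial"` and `"temporal"` keys and the name `build_micro_cluster` are illustrative choices, not taken from the slides.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, float]  # assumed layout: (sensor, time in minutes, severity)


def build_micro_cluster(event: Iterable[Record], window: int) -> Dict[str, Counter]:
    """Summarize an event as total severity per sensor and per time window."""
    spatial: Counter = Counter()    # sensor -> total severity
    temporal: Counter = Counter()   # window index -> total severity
    for sensor, time, severity in event:
        spatial[sensor] += severity
        temporal[time // window] += severity
    return {"spatial": spatial, "temporal": temporal}


# Illustrative event: two sensors, records falling in two 60-minute windows.
event = [("s1", 0, 5), ("s1", 5, 5), ("s2", 5, 5), ("s2", 65, 10)]
mc = build_micro_cluster(event, window=60)
```

However many records the event contains, each feature holds at most one entry per sensor or per time window, which is the size bound the slide refers to.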

Example in Congestion Event

Integrate the Micro-clusters • Each micro-cluster represents an individual event • Atypical events may happen at similar places/times • For example, the 10E highway is congested during evening rush hours on weekdays • For analytical purposes, it is helpful to group those similar congestions as a whole • Two sub-problems: • Which ones to merge? • How to merge?

Similarity Measure for Atypical Clusters • Basic Principles • Consider the similarity on multiple dimensions – users may specify a preference weight • Weighted measure on the data themselves (e.g., if sensor s1 reports higher severities in the clusters than s2, then the weight of s1 is higher) – employ the severity as the weight
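The slides state the principles but not a formula, so the sketch below is only one way to satisfy them and should not be read as the authors' measure. It assumes that per-dimension similarity is the share of total severity carried by items common to both clusters, so high-severity sensors weigh more, and that dimensions are combined with a user preference weight. All names are hypothetical.

```python
from typing import Dict, Hashable

Feature = Dict[Hashable, float]  # item (sensor or time window) -> total severity


def feature_similarity(a: Feature, b: Feature) -> float:
    """Assumed measure: severity on shared items divided by all severity in both features."""
    common = a.keys() & b.keys()
    shared = sum(a[k] + b[k] for k in common)
    total = sum(a.values()) + sum(b.values())
    return shared / total if total else 0.0


def cluster_similarity(c1: Dict[str, Feature],
                       c2: Dict[str, Feature],
                       spatial_weight: float = 0.5) -> float:
    """Combine spatial and temporal similarity using a user preference weight in [0, 1]."""
    s = feature_similarity(c1["spatial"], c2["spatial"])
    t = feature_similarity(c1["temporal"], c2["temporal"])
    return spatial_weight * s + (1 - spatial_weight) * t


# Illustrative clusters (severity in minutes; temporal keys are hour-of-day buckets).
c1 = {"spatial": {"s1": 100, "s2": 20}, "temporal": {8: 120}}
c2 = {"spatial": {"s1": 30, "s3": 40}, "temporal": {8: 50, 9: 20}}
sim = cluster_similarity(c1, c2)
```

Setting `spatial_weight` closer to 1 favors clusters on the same sensors, and closer to 0 favors clusters in the same time windows.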

Cluster Integration • For two clusters C1 and C2, the system • carries out aggregation on the feature of each dimension • for the common items, sums up their severities • keeps the non-overlapping items • Example • C1 {s1, 100 min; s2, 20 min} • C2 {s1, 30 min; s3, 40 min} • Merged cluster {s1, 130 min; s2, 20 min; s3, 40 min} • The spatial and temporal features are algebraic – efficient to aggregate
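The merge rule and the slide's example can be reproduced with a short sketch. Clusters are assumed to be dictionaries mapping a dimension name to an item-to-severity map; `merge_feature` and `integrate` are illustrative names.

```python
from typing import Dict, Hashable

Feature = Dict[Hashable, float]  # item -> total severity


def merge_feature(a: Feature, b: Feature) -> Feature:
    """Sum severities of common items and keep the non-overlapping items."""
    merged = dict(a)
    for item, severity in b.items():
        merged[item] = merged.get(item, 0) + severity
    return merged


def integrate(c1: Dict[str, Feature], c2: Dict[str, Feature]) -> Dict[str, Feature]:
    """Aggregate the feature of each dimension independently."""
    dims = sorted(c1.keys() | c2.keys())
    return {dim: merge_feature(c1.get(dim, {}), c2.get(dim, {})) for dim in dims}


# The example from the slide (spatial feature only, severities in minutes).
c1 = {"spatial": {"s1": 100, "s2": 20}}
c2 = {"spatial": {"s1": 30, "s3": 40}}
merged = integrate(c1, c2)
```

Because merging only adds per-item totals, the result does not depend on the order in which clusters are merged, and no raw records need to be revisited.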

Macro-Clusters • The macro-clusters are generated by merging the micro-clusters • Similarities are then computed among the macro-clusters, so even larger ones can be generated

Clustering Forest • The clusters form a tree hierarchy • Different aggregation paths (preferences on dimensions) form a cluster forest

The Efficiency Problem on Online Query • Usually it is not realistic to materialize the entire forest • Only some intermediate results (i.e., the micro-clusters in lower-level cells) are pre-computed (partial materialization) • The time complexity of the cluster integration algorithm is O(n^2) • Query efficiency suffers if n is large – the analytical query Q(W, T) usually covers a large region over a long time, so n is indeed large
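To illustrate where the quadratic cost comes from, here is a simplified single-pass integration sketch. It is not the authors' algorithm: each micro-cluster is compared against the macro-clusters built so far and merged into the first one whose similarity reaches a threshold. The similarity function, the threshold, and all names are assumptions. In the worst case nothing merges and n(n-1)/2 comparisons are made.

```python
from typing import Dict, List, Tuple

Feature = Dict[str, float]  # sensor -> total severity (spatial feature only, for brevity)


def similarity(a: Feature, b: Feature) -> float:
    """Assumed measure: share of total severity carried by sensors common to both."""
    common = a.keys() & b.keys()
    total = sum(a.values()) + sum(b.values())
    return sum(a[k] + b[k] for k in common) / total if total else 0.0


def merge(a: Feature, b: Feature) -> Feature:
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out


def integrate_all(micro_clusters: List[Feature],
                  threshold: float) -> Tuple[List[Feature], int]:
    """Return the macro-clusters and the number of pairwise comparisons performed."""
    macros: List[Feature] = []
    comparisons = 0
    for mc in micro_clusters:
        for i, macro in enumerate(macros):
            comparisons += 1
            if similarity(macro, mc) >= threshold:
                macros[i] = merge(macro, mc)
                break
        else:
            # No sufficiently similar macro-cluster: start a new one.
            macros.append(dict(mc))
    return macros, comparisons


micro = [{"s1": 100, "s2": 20}, {"s1": 30, "s3": 40}, {"s9": 10}, {"s9": 5, "s8": 1}]
macros, comparisons = integrate_all(micro, threshold=0.5)
```

With dissimilar inputs the comparison count grows as n(n-1)/2, which is why reducing n before integration matters for online queries.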

The Effectiveness Problem • In the result, only a few significant macro-clusters are generated • The rest are trivial ones that cannot be aggregated with others

Pruning-beforehand Strategy • Filter out the insignificant micro-clusters • But insignificant micro-clusters may integrate together and generate significant macro-clusters • Can we foretell which micro-clusters will contribute to significant macro-clusters?

Red-Zone Guided Clustering • It is fast to compute the total severity in a specified region • Select the regions with high severities (red zones) • Filter out the micro-clusters located outside those red zones • Keep only the ones inside or intersecting red zones (where the significant macro-clusters may be located)
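A minimal sketch of the red-zone filter, under several assumptions not stated in the slides: regions are predefined zones, each sensor maps to one zone through a lookup table `zone_of`, and a zone is "red" when its total severity reaches a chosen threshold. Function names and data are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set

Feature = Dict[str, float]  # sensor -> total severity


def find_red_zones(micro_clusters: List[Feature],
                   zone_of: Dict[str, str],
                   min_severity: float) -> Set[str]:
    """Total the severity per zone and keep the zones at or above the threshold."""
    totals: Dict[str, float] = defaultdict(float)
    for mc in micro_clusters:
        for sensor, severity in mc.items():
            totals[zone_of[sensor]] += severity
    return {zone for zone, total in totals.items() if total >= min_severity}


def filter_by_red_zones(micro_clusters: List[Feature],
                        zone_of: Dict[str, str],
                        red_zones: Set[str]) -> List[Feature]:
    """Keep micro-clusters that lie inside or intersect at least one red zone."""
    return [mc for mc in micro_clusters
            if any(zone_of[s] in red_zones for s in mc)]


# Illustrative data: no single micro-cluster reaches 20, but zone A does in total.
zone_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
micro = [{"s1": 10}, {"s2": 15, "s3": 2}, {"s3": 3}, {"s4": 4}]
red = find_red_zones(micro, zone_of, min_severity=20)
kept = filter_by_red_zones(micro, zone_of, red)
```

In this example both micro-clusters in zone A are individually small, so pruning by individual severity could drop them, while the zone-level total keeps them as candidates for a significant macro-cluster.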

Red-Zone Guided Clustering Example

Experiment Setup • PeMS datasets from UC Berkeley • 1 year of traffic data • 4,076 loop detectors on 38 freeways in California • 54 GB in total • Hardware • Intel 2200 Dual CPU @ 2.20 GHz and 2.19 GHz • 1.98 GB RAM; Windows XP SP2 • All the algorithms are implemented in Java

Model Construction • Comparing Atypical Cluster (AC) with Original CubeView (OC) and Modified CubeView (MC) • AC is an order of magnitude faster than OC

Query Efficiency • All: do not prune; Pru: prune beforehand; Gui: guided clustering • Gui takes 20% of the time of All, and is close to Pru

Query Effectiveness • Ground Truth: generated by All • Pru may miss truly significant macro-clusters, but Gui can guarantee the recall
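For reference, precision and recall against a ground-truth set can be computed as in the sketch below. It assumes macro-clusters can be compared by identifier. The identifiers and the resulting numbers are made up for illustration and are not results from the experiments.

```python
from typing import Set, Tuple


def precision_recall(found: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision: fraction of found clusters that are true; recall: fraction of true clusters found."""
    hits = len(found & truth)
    precision = hits / len(found) if found else 1.0
    recall = hits / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical identifiers: a pruned run that misses one significant macro-cluster.
truth = {"m1", "m2", "m3", "m4"}
pruned = {"m1", "m2", "m3"}
p, r = precision_recall(pruned, truth)
```

A method that guarantees recall would return a superset of `truth`, giving a recall of 1.0 on this definition.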

Conclusions • We have investigated the problem of multi-dimensional analysis of atypical events in CPS • The atypical cluster is designed to represent the event and serve as the measure for the data model • The red-zone algorithm is proposed to retrieve the significant clusters for analytical queries • Performance evaluation on large real datasets • Thank You Very Much! Any Questions?
