Trajectory Data Mining and Management Hsiao-PingTsai 蔡曉萍@ CSIE, YuanZe Uni. 2009.12.04
Outline • Introduction to Data Mining • Background of Trajectory Data Mining • Part I: Group Movement Patterns Mining • Part II: Semantic Data Compression
Data Mining – Automated Analysis of Massive Data Sets Why Data Mining? • The explosive growth of data – toward petabyte scale • Commerce: Web, e-commerce, bank/credit transactions, … • Science: remote sensing, bioinformatics, … • Many others: news, digital cameras, books, magazines, … • We are drowning in data, but starving for knowledge!
What Is Data Mining? • Data mining (knowledge discovery from data) • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge, e.g., rules, regularities, patterns, constraints, from huge amounts of data • Confluence of multiple disciplines: database technology, statistics, machine learning, visualization, pattern recognition, algorithms, graph theory, neural networks, and others
Potential Applications • Data analysis and decision support • Market analysis and management • Risk analysis and management • Fraud detection and detection of unusual patterns (outliers) • Other Applications • Text mining and Web mining • Stream data mining • Bioinformatics and bio-data analysis • …
Data Mining Functionalities (1/2) • Multidimensional concept description: characterization and discrimination • Generalize, summarize, and contrast data characteristics • Frequent patterns, association, correlation vs. causality • Example rule: Diaper → Beer [support = 0.5%, confidence = 75%] • Discovering relations between data items • Classification and prediction • Construct models that describe and distinguish classes • Predict unknown or missing numerical values
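As a concrete illustration of the support/confidence notation used above, the following sketch computes both measures for a rule over a toy transaction set (the items and numbers here are illustrative, not taken from the slides):

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                          # P(antecedent and consequent)
    confidence = both / ante if ante else 0.0   # P(consequent | antecedent)
    return support, confidence

transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
support, confidence = rule_stats(transactions, {"diaper"}, {"beer"})
# support = 2/4, confidence = 2/3
```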
Data Mining Functionalities (2/2) • Cluster analysis • Clustering: group data to form classes • Maximize intra-class similarity and minimize inter-class similarity • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • Useful in fraud detection and rare-event (exception) analysis • Trend and evolution analysis • Trend and deviation, e.g., regression analysis • Sequential pattern mining, periodicity analysis, similarity-based analysis
Outline • Introduction to Data Mining • Background of Trajectory Data Mining • Part I: Group Movement Patterns Mining • Part II: Semantic Data Compression
Trajectory data are everywhere! • The world becomes more and more mobile • Prevalence of mobile devices, e.g., smart phones, car PNDs, notebooks, PDAs, … • Satellite, sensor, RFID, and wireless technologies have fostered many applications • Tremendous amounts of trajectory data • Market prediction: 25–50% of cellphones in 2010 will have GPS
Related Research Projects (1/2) • GeoPKDD: Geographic Privacy-aware Knowledge Discovery and Delivery (Pisa Uni., Piraeus Uni., …) • MotionEye: Querying and Mining Large Datasets of Moving Objects (UIUC) • GeoLife: Building social networks using human location history (Microsoft Research) • Reality Mining (MIT Media Lab) • Data Mining in Spatio-Temporal Data Sets (Australia's ICT Research Centre of Excellence) • Trajectory Enabled Service Support Platform for Mobile Users' Behavior Pattern Mining (IBM China Research Lab) • U.S. Army Research Laboratory
Related Research Projects (2/2) • Mobile Data Management (李強教授@CSIE.NCKU) • Energy efficient strategies for object tracking in sensor networks: A data mining approach (曾新穆教授@CSIE.NCKU) • Object tracking and moving pattern mining (彭文志教授@CSIE.NCTU) • Mining Group Patterns of Mobile Users (黃三義教授@CSIE.NSYSU) • …
Wireless Sensor Networks (1/2) • Advances in wireless sensor network (WSN) techniques are promising for various applications • Object tracking • Military surveillance • Dwelling security • … • These applications generate large amounts of location-related data, and many efforts are devoted to compiling the data to extract useful information • Past behavior analysis • Future behavior prediction and estimation
Wireless Sensor Networks (2/2) • A wireless sensor network (WSN) is composed of a large number of sensor nodes • Each node consists of sensing, processing, and communicating components • WSNs are data driven • Energy conservation is paramount among all design issues • Object tracking is viewed as a killer application of WSNs • A task of detecting a moving object’s location and reporting the location data to the sink periodically • Tracking moving objects is considered most challenging
Part I: Group Movement Patterns Mining Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, "Exploring Group Moving Pattern for Tracking Objects Efficiently," accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009
Motivation • Many applications are more concerned with the group relationships and their aggregated movement patterns • Movements of creatures have some degree of regularity • Many creatures are socially aggregated and migrate together • The application level semantics can be utilized to track objects in efficient ways • Data aggregation • In-network scheduling • Data compression
Assumptions • Objects each have a globally unique ID • A hierarchical WSN structure, where each sensor within a cluster has a locally unique ID • The location of an object is modeled by the ID of a nearby sensor (or cluster) • The trajectory of a moving object is thus modeled as a series of observations and expressed as a location sequence
Problem Formulation • Similarity • Given a similarity measure function sim_p and a minimal threshold sim_min, o_i and o_j are similar if their similarity score is at least the threshold, i.e., sim_p(o_i, o_j) ≥ sim_min • Group • A set of objects g is a group if g ⊆ so(o_i) for every o_i ∈ g, where so(o_i) denotes the set of objects that are similar to o_i • The moving object clustering (MOC) problem: given a set of moving objects O together with their associated location sequence dataset S and a minimal threshold sim_min, the MOC problem is formulated as partitioning O into non-overlapping groups, denoted by G = {g1, g2, ..., gi}, such that the number of groups is minimized, i.e., |G| is minimal
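Under one reading of these definitions (a set g is a group when its members are pairwise similar), the similarity and group tests can be sketched as follows; the threshold and the score table are made-up values:

```python
SIM_MIN = 0.6  # illustrative sim_min threshold

def similar(score, sim_min=SIM_MIN):
    # o_i and o_j are similar when sim_p(o_i, o_j) >= sim_min
    return score >= sim_min

def is_group(g, sim):
    """g: list of object ids; sim: dict mapping sorted id pairs to scores."""
    return all(similar(sim[tuple(sorted((a, b)))])
               for i, a in enumerate(g) for b in g[i + 1:])

# Hypothetical pairwise similarity scores for four objects
sim = {(0, 1): 0.8, (0, 2): 0.7, (1, 2): 0.65,
       (0, 3): 0.2, (1, 3): 0.1, (2, 3): 0.3}
# {0, 1, 2} forms a group; any set containing object 3 does not
```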
Challenges of the MOC Problem • How to discover the group relationships? • A centralized approach comparing similarity over entire movement trajectories is problematic: compiling all data at a single node is expensive, and local characteristics might be blurred • A distributed mining approach is more desirable • Other issues • Heterogeneous data from different tracking configurations • Trade-off between resolution and privacy preservation
The Proposed DGMPMine Algorithm • To resolve the MOC problem, we propose a distributed group movement pattern mining algorithm that • provides transmission efficiency • provides flexibility • improves discriminability • improves clustering quality • preserves privacy
Definition of a Significant Movement Pattern • A subsequence that occurs more frequently carries more information about the movement of an object • The movement transition distribution characterizes the movements of an object • Definition of a movement pattern • A subsequence s of a sequence S is significant if its occurrence probability is higher than a minimal threshold, i.e., P(s) ≥ P_min • A significant movement pattern is a significant subsequence s together with its transition distribution P(δ|s), with the constraint that P(δ|s) must differ from P(δ|suf(s)) by a ratio of r or 1/r • This corresponds to a variable-order Markov model (VMM)
Learning of Significant Movement Patterns • Learning movement patterns in the trajectory data set with a Probabilistic Suffix Tree (PST) • A PST is an implementation of a VMM with the least storage requirement • The PST building algorithm learns from a location sequence data set and generates a compact tree with O(n) complexity in both computation and space • It stores significant movement patterns together with their empirical probabilities and conditional empirical probabilities • Advantages: • Useful and efficient for prediction • Controllable tree depth (size)
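A minimal sketch of the idea (not the paper's exact PST construction algorithm): collect contexts up to length L whose empirical probability reaches P_min, together with their next-symbol distributions. The sequence and thresholds are made-up:

```python
from collections import Counter, defaultdict

def learn_patterns(seq, L=2, p_min=0.1):
    """Return significant contexts (length <= L) and their next-symbol
    distributions, a simplified stand-in for a PST's contents."""
    counts = Counter()
    nexts = defaultdict(Counter)
    n = len(seq)
    for i in range(n):
        for l in range(1, L + 1):
            if i + l > n:
                break
            s = seq[i:i + l]
            counts[s] += 1
            if i + l < n:
                nexts[s][seq[i + l]] += 1   # symbol observed after context s
    patterns = {}
    for s, c in counts.items():
        total = n - len(s) + 1              # number of windows of this length
        if c / total >= p_min:              # significance test: P(s) >= p_min
            dist = nexts[s]
            z = sum(dist.values())
            patterns[s] = {sym: k / z for sym, k in dist.items()} if z else {}
    return patterns

patterns = learn_patterns("abababac", L=2, p_min=0.2)
# "ab" is a frequent context; after it, "a" is always the next symbol here
```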
Prediction Complexity • [Figure: example of a location sequence and the generated PST]
Similarity Comparison • A novel pattern-based similarity measure is proposed to compare the similarity of objects • Measuring the similarity of two objects based on their movement patterns • Providing better scalability and resilience to outliers • Free from sequence alignment and variable-length handling • Considering not only the patterns shared by two objects but also their relative importance to individual objects • Providing better discriminability
The Novel Similarity Measure sim_p • sim_p computes the similarity of objects o_i and o_j based on their PSTs, from the Euclidean distance of the significant patterns' predicted probabilities with respect to T_i and T_j, scaled by a normalization factor • P̂_Ti(s): the predicted value of the occurrence probability of s based on T_i • The patterns considered: the union of significant patterns of T_i and T_j • L: the maximal length of the VMM (or maximal depth of a PST) • Σ: the alphabet of symbols (IDs of a cluster of sensors)
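The exact formula is not reproduced on this slide, so the following is only a hedged sketch in the same spirit (not the paper's definition of sim_p): compare the occurrence probabilities each model assigns to the union of both objects' significant patterns, and normalize the distance into a score. All numbers are illustrative:

```python
import math

def sim_patterns(p_i, p_j):
    """p_i, p_j: dict pattern -> predicted occurrence probability."""
    union = set(p_i) | set(p_j)
    # Euclidean distance over the union of significant patterns
    dist = math.sqrt(sum((p_i.get(s, 0.0) - p_j.get(s, 0.0)) ** 2
                         for s in union))
    norm = math.sqrt(len(union)) or 1.0   # crude normalization factor
    return 1.0 - dist / norm

a = {"ab": 0.4, "ba": 0.4, "ac": 0.1}     # hypothetical pattern probabilities
b = {"ab": 0.35, "ba": 0.45, "ac": 0.1}   # moves much like a
c = {"cd": 0.5, "dc": 0.5}                # moves very differently
```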
Local Grouping Phase - The GMPMine Algorithm • Step 1: Learn movement patterns for each object • Step 2: Compute the pair-wise similarity scores to construct a similarity graph • Step 3: Partition the similarity graph into highly connected subgraphs • Step 4: Choose representative movement patterns for each group
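Steps 2 and 3 can be sketched as follows, with plain connected components standing in for the paper's highly connected subgraph (HCS) partitioning; the scores and threshold are made-up:

```python
def partition(objects, sim, threshold):
    """Threshold pairwise scores into a graph, then split into components
    (a simplification of HCS partitioning)."""
    adj = {o: set() for o in objects}
    for (a, b), s in sim.items():
        if s >= threshold:                 # keep only sufficiently similar pairs
            adj[a].add(b)
            adj[b].add(a)
    seen, groups = set(), []
    for o in objects:
        if o in seen:
            continue
        stack, comp = [o], set()           # depth-first traversal
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        groups.append(comp)
    return groups

sim = {("o1", "o2"): 0.9, ("o2", "o3"): 0.8, ("o3", "o4"): 0.2}
groups = partition(["o1", "o2", "o3", "o4"], sim, threshold=0.5)
# o1-o3 end up together; o4 is left alone
```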
Global Ensembling Phase • Inconsistency may exist among local grouping results • The trajectory of a group may span several clusters • Group relationships may vary at different locations • A CH may have incomplete statistics • … • A consensus function is required to combine multiple local grouping results to • remove inconsistency • improve clustering quality • improve stability
Global Ensembling Phase (contd.) • Normalized Mutual Information (NMI), defined in terms of entropy and joint entropy, is useful for measuring the shared information between two grouping results • Given K local grouping results, the objective is to find a solution that keeps most of the information of the local grouping results
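NMI can be sketched directly from these definitions, using NMI(X, Y) = I(X; Y) / sqrt(H(X) H(Y)) with I(X; Y) = H(X) + H(Y) - H(X, Y); the label vectors are toy data:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """Normalized mutual information between two label vectors."""
    n = len(x)
    joint = Counter(zip(x, y))
    hx, hy = entropy(x), entropy(y)
    hxy = -sum((c / n) * math.log2(c / n) for c in joint.values())
    mi = hx + hy - hxy                 # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return mi / math.sqrt(hx * hy) if hx and hy else 0.0

# Identical partitions (up to relabeling) share all information (NMI = 1);
# independent partitions share none (NMI = 0)
```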
The CE Algorithm • For a set of similarity thresholds D, we reformulate our objective accordingly • The CE algorithm includes three steps: • Measuring the pair-wise similarity to construct a similarity matrix using the Jaccard coefficient • Generating the partitioning results for a set of thresholds based on the similarity matrix • Selecting the final ensembling result
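The first step can be sketched under an assumed interpretation (not necessarily the paper's exact definition): score two objects by the Jaccard coefficient of their (result, group) memberships across the K local grouping results. The grouping labels below are made-up:

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(groupings, objects):
    """groupings: list of dicts object -> group id (one per local result)."""
    member = {o: {(k, g[o]) for k, g in enumerate(groupings)} for o in objects}
    return {(a, b): jaccard(member[a], member[b])
            for i, a in enumerate(objects) for b in objects[i + 1:]}

groupings = [{"o1": 0, "o2": 0, "o3": 1},   # local result from cluster head 1
             {"o1": 0, "o2": 0, "o3": 0}]   # local result from cluster head 2
sim = similarity_matrix(groupings, ["o1", "o2", "o3"])
# o1 and o2 agree in both results; o1 and o3 agree in only one
```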
Example of CE • (a) Similarity graph (δ=0.1) • (b) Highly connected subgraphs (δ=0.1)
Part II: Semantic Data Compression Hsiao-Ping Tsai, De-Nian Yang, and Ming-Syan Chen, "Exploring Application Level Semantics for Data Compression," accepted by IEEE Trans. on Knowledge and Data Engineering (TKDE), 2009
Introduction • Data transmission is one of the most energy-expensive operations in WSNs • In a batch-and-send network, buffering data in NAND flash memory reduces network energy consumption, and data compression is a paradigm in WSNs to increase network throughput • However, few works address application-dependent semantics in the data, such as the correlations of a group of moving objects • How to manage the location data for a group of objects? • Compress data with general algorithms like Huffman coding? • Or compress a group of trajectory sequences simultaneously?
Motivation • Redundancy in a group of location sequences comes from two aspects • Vertical redundancy, arising from the group relationships among objects • Horizontal redundancy, arising from the statistics and predictability of symbols
What Is Predictability of Symbols? • With group movement patterns shared, the next location (symbol) can be predicted • Replacing predictable items with a common symbol helps reduce entropy!
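The effect can be checked in a few lines; the sequence, the predictability flags, and the "#" hit symbol are all illustrative choices, not from the slides:

```python
import math
from collections import Counter

def entropy(seq):
    """Per-symbol Shannon entropy of a sequence, in bits."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

original = "abcdabcdabcd"
predicted = [True, False, True, True] * 3   # hypothetical predictability flags

# Substitute every predictable symbol with the common hit symbol '#'
replaced = "".join("#" if hit else s for s, hit in zip(original, predicted))
# The distribution becomes skewed toward '#', lowering the entropy
```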
Problem Formulation • Assumptions • A batch-based tracking network • Group movement patterns are shared between a sender and a receiver • The Group Data Compression (GDC) problem: given the group movement patterns of a group of objects, the GDC problem is formulated as a merge problem and a hit item replacement (HIR) problem to reduce the number of bits required to represent their location sequences • The merge problem is to combine multiple location sequences to reduce the overall sequence length • The HIR problem aims to minimize the entropy of a sequence such that the amount of data is reduced, with or without loss of information
Our Approach • The proposed two-phase and two-dimensional (2P2D) algorithm • Sequence merge phase: utilizing the group relationships to merge the location data of a group of objects • Entropy reduction phase: utilizing the object movement patterns to reduce the entropy of the merged data horizontally, with a guaranteed reduction of entropy • Compressibility is enhanced with or without information loss
Sequence Merge Phase • We propose the Merge algorithm, which • avoids redundant reporting of locations by trimming multiple identical symbols into a single symbol (e.g., 60 symbols -> 20 symbols) • chooses a qualified symbol to represent multiple symbols when a tolerance of loss of accuracy is specified • The maximal distance between the reported location and the real location is kept below a specified error bound eb • When multiple qualified symbols exist, we choose the symbol that minimizes the average location error
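A simplified, lossless-only sketch of the merge idea (the error-bound case with qualified symbols is omitted): when every object in the group reports the same location symbol at a time step, emit it once for the whole group. The sequences are made-up:

```python
def merge(sequences):
    """sequences: equal-length location sequences of one group's objects."""
    out = []
    for column in zip(*sequences):          # one column per time step
        if len(set(column)) == 1:
            out.append(column[0])           # identical reports trimmed to one
        else:
            out.append(tuple(column))       # divergent reports kept verbatim
    return out

seqs = ["aabc", "aabd", "aabc"]             # three objects, four time steps
merged = merge(seqs)
# First three columns collapse to single symbols; the last stays per-object
```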
Entropy Reduction Phase • Group movement patterns carry the information about whether an item of a sequence is predictable • Since some items are predictable, extra redundancy exists How to remove the redundancy and even increase the compressibility?
Entropy Reduction Phase • According to Shannon's source coding theorem, the entropy of a source is a lower bound on the average number of bits per symbol achievable by lossless compression • Definition of entropy: H(X) = -Σ_x p(x) log2 p(x) • Increasing the skewness of the data reduces the entropy
The Hit Item Replacement (HIR) Problem • A simple and intuitive method is to replace all predictable symbols to increase the skewness • However, this simple method cannot guarantee a reduction of the entropy • Definition of the HIR problem: given a sequence and the information about whether each item is predictable, the HIR problem is to decide whether to replace each of the predictable symbols in the sequence with a hit symbol so as to minimize the entropy of the sequence
Three Rules • Accumulation rule: • Concentration rule: • Multi-symbol rule:
~The End~ Any Questions?