

  1. SIGMOD 2007 Trajectory Clustering: A Partition-and-Group Framework June 13, 2007 Jae-Gil Lee 1), Jiawei Han 1), and Kyu-Young Whang 2) 1) Dept. of Computer Science, UIUC, USA 2) Dept. of Computer Science, KAIST, Korea

  2. Table of Contents • Motivation • Partition-and-Group Framework • Trajectory Clustering Algorithm: TRACLUS • Partitioning Phase • Grouping Phase • Performance Evaluation • Related Work • Conclusions

  3. Clustering • Definition: the process of grouping a set of physical or abstract objects into classes of similar objects [11] • Applications: market research, pattern recognition, data analysis, image processing, etc. • Representative algorithms: k-means [17], BIRCH [24], DBSCAN [6], OPTICS [2], and STING [22] • Target data: previous research has mainly dealt with clustering of point data

  4. Analysis on Trajectory Data • A tremendous amount of trajectory data of moving objects is being collected • Examples: vehicle position data, hurricane track data, animal movement data, etc. • A typical data analysis task is to find objects that have moved in a similar way → An efficient clustering algorithm for trajectories is urgently required

  5. Limitations of Existing Algorithms • The algorithm proposed by Gaffney et al. [7, 8] clusters trajectories as a whole • Clustering trajectories as a whole cannot detect similar portions of the trajectories (i.e., common sub-trajectories) • Example: if we cluster TR1~TR5 as a whole, we cannot discover the common behavior since they move in totally different directions [Figure: trajectories TR1–TR5 sharing a common sub-trajectory]

  6. Discovery of Common Sub-Trajectories • Discovering common sub-trajectories is very useful, especially if we have regions of special interest 1) Hurricane Landfall Forecasts [18]: meteorologists will be interested in the common behaviors of hurricanes near the coastline or at sea (i.e., before landing) 2) Effects of Roads and Traffic on Animal Movements [23]: zoologists will be interested in the common behaviors of animals near roads where the traffic rate has been varied • Our solution is to partition a trajectory into a set of line segments and then group similar line segments → A partition-and-group framework

  7. The Partition-and-Group Framework • Consists of two phases: partitioning and grouping • (1) Partition: a set of trajectories → a set of line segments • (2) Group: a set of line segments → clusters, each with a representative trajectory [Figure: trajectories TR1–TR5 partitioned into line segments and grouped into a cluster] Note: a representative trajectory is a common sub-trajectory

  8. Problem Statement • Given a set of trajectories I = {TR1, …, TRn}, our algorithm generates a set of clusters O = {C1, …, Cm} as well as a representative trajectory for each cluster Ci • Necessary definitions: • A trajectory is a sequence of multi-dimensional points, denoted as TRi = p1 p2 p3 … pj … p_leni • A cluster is a set of trajectory partitions; a trajectory partition is a line segment pi pj (i < j), where pi and pj are points chosen from the same trajectory • A representative trajectory is an imaginary trajectory that indicates the major behavior of the trajectory partitions
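To make the notation concrete, the illustrative sketches added throughout this transcript represent these objects with plain Python tuples (an assumption of the sketches, not part of the paper):

from typing import List, Tuple

Point = Tuple[float, float]        # a 2-D point p = (x, y)
Trajectory = List[Point]           # TR_i = p1 p2 ... p_len_i, in temporal order
Partition = Tuple[Point, Point]    # a line segment (p_i, p_j) with i < j,
                                   # both points taken from the same trajectory
Cluster = List[Partition]          # a set of trajectory partitions; its
                                   # representative trajectory is again a Trajectory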

  9. The Clustering Algorithm: TRACLUS • Based on the partition-and-group framework
Algorithm TRACLUS
Input: A set of trajectories I = {TR1, …, TRn}
Output: (1) A set of clusters O = {C1, …, Cm} (2) A set of representative trajectories
Algorithm:
/* Partitioning Phase */
01: for each TR ∈ I do
02:   Partition TR into a set L of line segments;
03:   Accumulate L into a set D;
/* Grouping Phase */
04: Group D into a set O of clusters;
05: for each C ∈ O do
06:   Generate a representative trajectory for C;
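A minimal Python sketch of this two-phase skeleton, for illustration; partition_fn, group_fn, and representative_fn are placeholders for the phases detailed on the following slides, not the paper's code:

def traclus(trajectories, eps, min_lns, partition_fn, group_fn, representative_fn):
    """Partition-and-group skeleton mirroring the TRACLUS pseudocode above."""
    # Partitioning phase
    D = []                                  # accumulated set of line segments
    for tr in trajectories:                 # 01: for each TR in I do
        L = partition_fn(tr)                # 02: partition TR into line segments
        D.extend(L)                         # 03: accumulate L into the set D
    # Grouping phase
    O = group_fn(D, eps, min_lns)           # 04: group D into a set O of clusters
    reps = [representative_fn(C, min_lns)   # 05-06: one representative trajectory
            for C in O]                     #        per cluster C
    return O, reps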

  10. Current Step (1/3)
Algorithm TRACLUS
/* Partitioning Phase */
01: for each TR ∈ I do
02:   Partition TR into a set L of line segments;
03:   Accumulate L into a set D;
/* Grouping Phase */
04: Group D into a set O of clusters;
05: for each C ∈ O do
06:   Generate a representative trajectory for C;

  11. Characteristic Points • Identify the points where the behavior of a trajectory changes rapidly; such points are called characteristic points • A trajectory is partitioned at every characteristic point • A line segment between consecutive characteristic points is called a trajectory partition [Figure: a trajectory annotated with its characteristic points and trajectory partitions]

  12. Desirable Properties of Trajectory Partitioning • Preciseness: the difference between a trajectory and the set of its trajectory partitions should be as small as possible • Conciseness: the number of trajectory partitions should be as small as possible [Figure: two extremes — characteristic points = starting and ending points (most concise), characteristic points = all points (most precise)] Note: the two properties are contradictory to each other → We need to find the optimal tradeoff

  13. Minimum Description Length (MDL) Principle • The MDL principle has been widely used in information theory • The MDL cost consists of two components [9]: L(H) and L(D|H), where H means the hypothesis, and D the data • L(H) is the length, in bits, of the description of the hypothesis • L(D|H) is the length, in bits, of the description of the data when encoded with the help of the hypothesis • The best hypothesis H to explain D is the one that minimizes the sum of L(H) and L(D|H)
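Restated as a formula (just a compact form of the bullets above): the MDL principle selects H* = argmin_H [ L(H) + L(D|H) ], i.e., the hypothesis whose description length plus the description length of the data given that hypothesis is smallest.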

  14. Translation into MDL Optimization • Finding the optimal partitioning translates to finding the best hypothesis using the MDL principle • H → a set of trajectory partitions, D → a trajectory • L(H) → the sum of the lengths of all trajectory partitions • L(D|H) → the sum of the differences between a trajectory and the set of its trajectory partitions • L(H) measures conciseness; L(D|H) measures preciseness

  15. Approximate Trajectory Partitioning • The cost of finding the optimal partitioning is prohibitive • Use an approximate algorithm; our approximation is to regard the set of local optima as the global optimum • Algorithm skeleton (See Fig. 8 in the paper): • Compute the MDL costs both when a point pk is a characteristic point and when it is not • If the former > the latter, choose pk−1 as a characteristic point • Otherwise, advance pk by increasing k (this greedy scan yields an approximate solution)
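A Python sketch of this greedy scan, added for illustration. Assumptions: costs are taken as log2 of lengths and deviations (encoding lengths in bits, in the spirit of the MDL mapping on slide 14), and L(D|H) here uses only the perpendicular deviation of the skipped points from the candidate segment, whereas the paper's cost also includes an angular term; all function names (bits, mdl_par, mdl_nopar, characteristic_points, partition_to_segments) are illustrative.

import math

def point_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def perp_dist(a, b, p):
    """Perpendicular distance from point p to the line through a and b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return point_dist(a, p)
    return abs(dx * (py - ay) - dy * (px - ax)) / length

def bits(x):
    # Encoding length in bits; values below 1 are clamped so they cost 0 bits
    return math.log2(max(x, 1.0))

def mdl_par(points, i, j):
    """Cost when p_i .. p_j is replaced by the single candidate segment (p_i, p_j)."""
    l_h = bits(point_dist(points[i], points[j]))
    l_d_h = sum(bits(perp_dist(points[i], points[j], points[k]))
                for k in range(i + 1, j))
    return l_h + l_d_h

def mdl_nopar(points, i, j):
    """Cost when every original segment between p_i and p_j is kept (L(D|H) = 0)."""
    return sum(bits(point_dist(points[k], points[k + 1])) for k in range(i, j))

def characteristic_points(points):
    """Greedy scan: emit a characteristic point whenever partitioning gets cheaper."""
    cps = [points[0]]                       # the start point is always characteristic
    start, length = 0, 1
    while start + length < len(points):
        cur = start + length
        if mdl_par(points, start, cur) > mdl_nopar(points, start, cur):
            cps.append(points[cur - 1])     # choose p_{cur-1} as a characteristic point
            start, length = cur - 1, 1
        else:
            length += 1
    cps.append(points[-1])                  # the end point is always characteristic
    return cps

def partition_to_segments(points):
    cps = characteristic_points(points)
    return list(zip(cps[:-1], cps[1:]))     # consecutive characteristic points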

  16. Current Step (2/3)
Algorithm TRACLUS
/* Partitioning Phase */
01: for each TR ∈ I do
02:   Partition TR into a set L of line segments;
03:   Accumulate L into a set D;
/* Grouping Phase */
04: Group D into a set O of clusters;
05: for each C ∈ O do
06:   Generate a representative trajectory for C;

  17. Distance between Line Segments • The weighted sum of three components: the perpendicular distance (d⊥), the parallel distance (d∥), and the angle distance (dθ) • Adapted from similarity measures used in the domain of pattern recognition [4] Remark: the sum of the distances between endpoints does not work well for line segment clustering
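A sketch of the three components for 2-D segments represented as ((x1, y1), (x2, y2)) pairs. The perpendicular and angle components follow the paper's construction (a Lehmer mean of the two point-to-line distances; ||lj||·sin θ, capped at ||lj|| for obtuse angles); the parallel component is taken here as the distance from each projected endpoint to the nearer endpoint of li (a symmetric variant), the longer segment is assumed to be passed as li, and the three weights default to 1:

import math

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def _proj(a, b, p):
    """Projection of point p onto the line through a and b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0.0 else ((px - ax) * dx + (py - ay) * dy) / denom
    return (ax + t * dx, ay + t * dy)

def segment_distance(li, lj, w_perp=1.0, w_par=1.0, w_angle=1.0):
    """Weighted sum of perpendicular, parallel, and angle distances between
    two 2-D line segments li = (si, ei) and lj = (sj, ej); li is assumed to
    be the longer segment."""
    (si, ei), (sj, ej) = li, lj
    ps, pe = _proj(si, ei, sj), _proj(si, ei, ej)   # projections of lj's endpoints

    # Perpendicular distance: Lehmer mean of the two point-to-line distances
    l1, l2 = _dist(sj, ps), _dist(ej, pe)
    d_perp = 0.0 if l1 + l2 == 0.0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)

    # Parallel distance: how far the projections lie from li's nearer endpoint
    d_par = min(min(_dist(ps, si), _dist(ps, ei)),
                min(_dist(pe, si), _dist(pe, ei)))

    # Angle distance: |lj| * sin(theta) for theta < 90 degrees, else |lj|
    vi = (ei[0] - si[0], ei[1] - si[1])
    vj = (ej[0] - sj[0], ej[1] - sj[1])
    len_i, len_j = math.hypot(*vi), math.hypot(*vj)
    if len_i == 0.0 or len_j == 0.0:
        d_angle = len_j
    else:
        cos_t = (vi[0] * vj[0] + vi[1] * vj[1]) / (len_i * len_j)
        cos_t = max(-1.0, min(1.0, cos_t))
        d_angle = len_j if cos_t < 0.0 else len_j * math.sqrt(1.0 - cos_t ** 2)

    return w_perp * d_perp + w_par * d_par + w_angle * d_angle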

  18. Density of Line Segments • Adapt the density definitions for points, originally proposed for DBSCAN [6], to line segments • Def. (ε-neighborhood): Nε(Li) = {Lj ∈ D | dist(Li, Lj) ≤ ε} • Def. (core line segment): Li is a core line segment w.r.t. ε and MinLns if |Nε(Li)| ≥ MinLns • Def. (directly density-reachable): Li is directly density-reachable from Lj w.r.t. ε and MinLns if Li ∈ Nε(Lj) and |Nε(Lj)| ≥ MinLns • Def. (density-reachable): the transitive closure of direct density-reachability • Def. (density-connected set ≡ cluster): 1) Maximal w.r.t. density-reachability 2) Any two line segments in the set are density-connected, i.e., both density-reachable from a third line segment
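The first two definitions transcribe almost directly into Python. dist can be any segment distance, e.g. the segment_distance sketch above (wrapped so that the longer segment is passed first); the neighborhood is returned as a list of indices and, as in DBSCAN, includes the segment itself:

def eps_neighborhood(i, segments, dist, eps):
    """N_eps(L_i) = { L_j in D | dist(L_i, L_j) <= eps }, as indices into segments."""
    return [j for j, seg in enumerate(segments) if dist(segments[i], seg) <= eps]

def is_core(i, segments, dist, eps, min_lns):
    """L_i is a core line segment w.r.t. eps and MinLns if |N_eps(L_i)| >= MinLns."""
    return len(eps_neighborhood(i, segments, dist, eps)) >= min_lns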

  19. Density of Line Segments (cont'd) [Figure: six line segments L1–L6 with MinLns = 3] • Example: • L1, L2, L3, L4, and L5 are core line segments • L2 (or L3) is directly density-reachable from L1 • L6 is density-reachable from L1, but not vice versa • L1, L4, and L5 are all density-connected Note: the shape of an ε-neighborhood is likely to be an ellipse rather than a circle

  20. Examples of ε-neighborhoods [Figure] Red lines: core line segments; blue lines: line segments in the ε-neighborhood

  21. Line Segment Clustering • Algorithm skeleton (See Fig. 8 in the paper): • Select an unprocessed line segment L • Retrieve all line segments density-reachable from L w.r.t. ε and MinLns • If L is a core line segment, a cluster is formed • Otherwise, L is marked as noise • Continue this process until all line segments have been processed • Filter out clusters whose trajectory partitions have been extracted from too few trajectories • Time complexity (See Lemma 3 in the paper): • O(n²) if an index does not exist • O(n log n) if an index exists
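A compact DBSCAN-style sketch of this grouping loop, for illustration. It precomputes all ε-neighborhoods without an index (hence the O(n²) behavior mentioned above), returns one cluster label per segment (-1 for noise), and omits the paper's final filter that discards clusters drawn from too few trajectories; dist is assumed to be a symmetric segment distance:

def group_segments(segments, dist, eps, min_lns):
    """Expand clusters of density-connected line segments, DBSCAN-style."""
    UNCLASSIFIED, NOISE = -2, -1
    labels = [UNCLASSIFIED] * len(segments)
    # Precompute eps-neighborhoods (quadratic in the number of segments)
    neigh = [[j for j in range(len(segments))
              if dist(segments[i], segments[j]) <= eps]
             for i in range(len(segments))]
    cluster_id = 0
    for i in range(len(segments)):
        if labels[i] != UNCLASSIFIED:
            continue
        if len(neigh[i]) < min_lns:
            labels[i] = NOISE                      # not a core segment
            continue
        # i is a core segment: start a new cluster and expand it
        queue = list(neigh[i])
        for j in queue:
            labels[j] = cluster_id
        while queue:
            j = queue.pop()
            if len(neigh[j]) >= min_lns:           # j is also a core segment
                for k in neigh[j]:
                    if labels[k] == UNCLASSIFIED:
                        labels[k] = cluster_id
                        queue.append(k)
                    elif labels[k] == NOISE:
                        labels[k] = cluster_id     # noise becomes a border segment
        cluster_id += 1
    return labels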

  22. Heuristic for Parameter Value Selection • Estimation of ε • Find the value of ε that minimizes the entropy of |Nε(L)| • Good clustering: |Nε(L)| tends to be skewed → the entropy is small • Worst clustering: |Nε(L)| tends to be uniform → the entropy is large • Estimation of MinLns • Choose a value between avg(|Nε(L)|) + 1 and avg(|Nε(L)|) + 3 • MinLns should be larger than avg(|Nε(L)|) to discover meaningful clusters [Figure: distribution of |Nε(L)| at the optimal ε; too small ε → every |Nε(L)| = 1, too large ε → every |Nε(L)| = # of line segments]
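One plausible reading of the entropy heuristic, for illustration: treat the normalized neighborhood sizes as a probability distribution, pick the candidate ε with minimum entropy by a simple grid search, and derive MinLns from the average neighborhood size at that ε (the paper's exact formulation and search procedure may differ):

import math

def neighborhood_entropy(segments, dist, eps):
    """Entropy of p_i = |N_eps(L_i)| / sum_j |N_eps(L_j)| over all segments."""
    sizes = [sum(1 for other in segments if dist(seg, other) <= eps)
             for seg in segments]
    total = float(sum(sizes))
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

def estimate_parameters(segments, dist, candidate_eps):
    """Grid search: pick the candidate eps with minimum entropy, then suggest
    MinLns = avg(|N_eps(L)|) + 2 (the slide recommends avg + 1 to avg + 3)."""
    best = min(candidate_eps,
               key=lambda e: neighborhood_entropy(segments, dist, e))
    avg = sum(sum(1 for other in segments if dist(seg, other) <= best)
              for seg in segments) / len(segments)
    return best, int(round(avg)) + 2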

  23. Current Step (3/3)
Algorithm TRACLUS
/* Partitioning Phase */
01: for each TR ∈ I do
02:   Partition TR into a set L of line segments;
03:   Accumulate L into a set D;
/* Grouping Phase */
04: Group D into a set O of clusters;
05: for each C ∈ O do
06:   Generate a representative trajectory for C;

  24. Representative Trajectories • Describe the overall movement of the trajectory partitions that belong to the cluster • Correspond to common sub-trajectories • Can be considered a model [1] for clusters • Useful for domain experts to understand the movement in the trajectories

  25. Representative Trajectory Generation • Sweep a vertical line across the line segments in the direction of the major axis, i.e., the average direction vector • At each sweep position where at least MinLns line segments are crossed, compute their average coordinate in the coordinate system rotated along the average direction vector [Figure: sweeping a line over a cluster of line segments with MinLns = 3]
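A simplified sketch of this sweep for segments given as ((x1, y1), (x2, y2)) pairs, for illustration: it stops only at the segments' endpoints, assumes the segments of the cluster point in roughly the same direction, and omits the smoothing step the paper applies to sweep positions that lie too close together:

import math

def representative_trajectory(segments, min_lns):
    """Sweep along the average direction vector; average the crossing segments."""
    # Average direction vector of the cluster (segments assumed roughly aligned)
    vx = sum(e[0] - s[0] for s, e in segments)
    vy = sum(e[1] - s[1] for s, e in segments)
    norm = math.hypot(vx, vy) or 1.0
    ux, uy = vx / norm, vy / norm            # unit vector of the rotated X' axis

    # Rotate every segment into the (X', Y') coordinate system
    def rot(p):
        return (p[0] * ux + p[1] * uy, -p[0] * uy + p[1] * ux)
    rotated = [tuple(sorted((rot(s), rot(e)))) for s, e in segments]

    # Sweep over the sorted X' coordinates of all start/end points
    xs = sorted({p[0] for seg in rotated for p in seg})
    rep = []
    for x in xs:
        crossing = [seg for seg in rotated if seg[0][0] <= x <= seg[1][0]]
        if len(crossing) >= min_lns:
            # Average Y' of the crossing segments, interpolated at X' = x
            ys = []
            for (x1, y1), (x2, y2) in crossing:
                t = 0.0 if x2 == x1 else (x - x1) / (x2 - x1)
                ys.append(y1 + t * (y2 - y1))
            y_avg = sum(ys) / len(ys)
            # Rotate (x, y_avg) back to the original coordinate system
            rep.append((x * ux - y_avg * uy, x * uy + y_avg * ux))
    return rep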

  26. An Example of a Representative Trajectory [Figure] Red line: the representative trajectory; blue line: the average direction vector; pink lines: line segments in a density-connected set

  27. A Quick View of a Clustering Result Simple synthetic data: 200 trajectories (25% are noise)

  28. Performance Evaluation • Use two real trajectory data sets • Hurricane track data set • Record the Atlantic hurricanes from the years 1950 through 2004 • Contain 570 trajectories and 17,736 points • Animal movement data set • Record the locations of elk, deer, and cattle from the years 1993 through 1996 (the Starkey project) • Elk1993: Contain 33 trajectories and 47,204 points; Deer1995: Contain 32 trajectories and 20,065 points • Validate the clustering quality 1) Estimate the parameter values for ε and MinLns 2) Try a few values around the estimated ones; determine the optimal parameter values by visual inspection

  29. Effectiveness of Parameter Estimation • Entropy depending on the value of ε; the ε with the minimum entropy is the estimated value [Figure: entropy vs. ε for (a) Hurricane Tracks and (b) Elk1993] • The optimal value is very close to the estimated value → The accuracy of our heuristic is quite high

  30. Clustering Result: Hurricane Tracks ε = 30 and MinLns = 6 → # of clusters = 7

  31. Clustering Result: Elk1993 ε = 27 and MinLns = 9 → # of clusters = 13

  32. Clustering Result: Deer1995 ε = 29 and MinLns = 8 → # of clusters = 2

  33. Effects of the Parameter Values • A larger ε or a smaller MinLns → a smaller number of larger clusters; e.g., ε = 33 and MinLns = 6 → 5 clusters (132 line segments) • A smaller ε or a larger MinLns → a larger number of smaller clusters; e.g., ε = 26 and MinLns = 6 → 13 clusters (31 line segments)

  34. Related Work • Clustering algorithms for points • Clustering algorithms for trajectories [7, 8] • Based on probabilistic clustering • Cluster trajectories as a whole • Distance measures for trajectories: LCSS [21] and EDR [5] • Based on the edit distance • Designed to compare the whole trajectory (time series) • Applications of the MDL principle [3, 13] • Graph partitioning (cross-association) • Distance function design for strings: CDM • Polyline simplification • Require additional parameters • Developed mainly for the Euclidean distance

  35. Challenging Issues • Efficiency • Use an index to execute ε-neighborhood queries • Not easy because our distance function is non-metric • Parameter insensitivity • Make our algorithm more insensitive to parameter values • Movement patterns • Support various types of movement patterns, especially circular motion • Temporal information • Take account of temporal information during clustering

  36. Conclusions • Proposed a novel framework, the partition-and-group framework, for clustering trajectories • Developed the trajectory clustering algorithm TRACLUS based on this framework • Demonstrated the effectiveness of TRACLUS using various real trajectory data → Provided a new paradigm in trajectory clustering

  37. Thank You!
