Time-focused density-based clustering of trajectories of moving objects Margherita D’Auria Mirco Nanni Dino Pedreschi
Plan of the talk • Introduction • Motivations • Problem & context • Density-based Clustering (OPTICS) • Density-based clustering on trajectories • Trajectory data model • Distance measure • Results • Temporal Focusing • A clustering quality measure • Heuristics for optimal temporal interval • Conclusions & future work
Motivations • Plenty of current and future data sources for spatio-temporal data • Sophisticated analysis methods are required in order to fully exploit them • Data mining methods • Which kinds of patterns/models? • Main objectives • A better understanding of the application domain • An improvement of private and public services
Problem & context • A distinguishing case: Mobile devices • PDAs • Mobile phones • LBS-enabled devices (may include the two above) • They (can) yield traces of their movement • An important problem: • Discovering groups of individuals that (approx.) move together in some period of time • E.g.: detection of traffic jams during rush hours • A candidate Data Mining reformulation of the problem • Clustering of individuals’ trajectories
Which kind of clustering? • Several alternatives are available • General requirements: • Non-spherical clusters should be allowed • E.g.: A traffic jam along a road • Its individuals should be represented as a single “snake-shaped” cluster • Tolerance to noise • Low computational cost • Applicability to complex, possibly non-vectorial data • A suitable candidate: Density-based clustering • In particular, we adopt OPTICS
A crushed intro to OPTICS • A density threshold is defined through two parameters: • ε: A neighborhood radius • MinPts: Minimum number of points • Key concepts: • Core objects • Objects with an ε-neighborhood that contains at least MinPts objects • Reachability-distance reach-d(p, q) • (simplified definition:) Distance between objects p and q • Example: • Object “q” is a core object if MinPts=2 • Object “p” is not • Their reach-d() is shown • [Figure: objects q and p, the ε-neighborhood of q, and reach-d(p,q)]
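The key concepts above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it uses the standard (non-simplified) OPTICS definition, where reach-d(p, q) is the larger of the core-distance of q and the distance between p and q, and it assumes 2-D points, Euclidean distance, and a neighborhood that counts the object itself. All function names are hypothetical.

```python
import math

def dist(a, b):
    # Euclidean distance between two 2-D points (assumption: planar data).
    return math.hypot(a[0] - b[0], a[1] - b[1])

def neighborhood(points, q, eps):
    # Indices of all objects within eps of points[q]; q itself is included.
    return [i for i, p in enumerate(points) if dist(points[q], p) <= eps]

def is_core(points, q, eps, min_pts):
    # q is a core object if its eps-neighborhood holds at least min_pts objects.
    return len(neighborhood(points, q, eps)) >= min_pts

def core_distance(points, q, eps, min_pts):
    # Distance from q to its min_pts-th nearest object within eps,
    # or None when q is not a core object.
    ds = sorted(dist(points[q], points[i]) for i in neighborhood(points, q, eps))
    return ds[min_pts - 1] if len(ds) >= min_pts else None

def reach_d(points, p, q, eps, min_pts):
    # Reachability distance of p with respect to q; None if q is not core.
    cd = core_distance(points, q, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[p], points[q]))
```

With `points = [(0, 0), (1, 0), (5, 0)]`, `eps = 2` and `min_pts = 2`, object 0 plays the role of “q” (core) and object 2 the role of “p” (not core), mirroring the example on the slide.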
A crushed intro to OPTICS • The algorithm: • 1. Repeatedly choose a non-visited random object, until a core object is selected • 2. Select the core object having the smallest reachability distance from all the visited core objects. If none can be found, go to step 1 • Output: reach-d() of all visited points (reachability plot) • [Figure: reachability plot in order of visit; a “jump” separates the left-hand group (0-9) from the right-hand one (10-18); a reachability threshold splits the plot into Cluster 1 and Cluster 2]
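The two steps above can be sketched as a small ordering routine. This is a simplified, quadratic-time sketch under stated assumptions, not the original OPTICS implementation: unvisited objects are picked in index order instead of at random, range queries are done by brute force, and the function name `optics_order` is hypothetical. It works with any distance function, which is what makes the approach applicable to trajectories.

```python
import math

def optics_order(objects, dist, eps, min_pts):
    """Return (visit order, reachability plot) for a list of objects."""
    n = len(objects)
    visited = [False] * n
    order, reach = [], []

    def neighbors(i):
        # Brute-force range query; includes i itself.
        return [j for j in range(n) if dist(objects[i], objects[j]) <= eps]

    def core_dist(i, nb):
        ds = sorted(dist(objects[i], objects[j]) for j in nb)
        return ds[min_pts - 1] if len(ds) >= min_pts else None

    for start in range(n):            # step 1 (index order, not random)
        if visited[start]:
            continue
        seeds = {start: math.inf}     # candidate -> best reachability so far
        while seeds:
            i = min(seeds, key=seeds.get)   # step 2: smallest reachability
            r = seeds.pop(i)
            visited[i] = True
            order.append(i)
            reach.append(r)
            nb = neighbors(i)
            cd = core_dist(i, nb)
            if cd is None:
                continue              # not a core object: do not expand
            for j in nb:
                if not visited[j]:
                    new_r = max(cd, dist(objects[i], objects[j]))
                    if new_r < seeds.get(j, math.inf):
                        seeds[j] = new_r
    return order, reach
```

On the 1-D toy data `[0, 1, 2, 10, 11, 12]` with absolute difference as distance, `eps = 1.5` and `min_pts = 2`, the reachability plot shows an infinite value at the start of each group, which corresponds to the “jump” between clusters described on the slide.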
Applying OPTICS to trajectories • Two key issues have to be solved • A suitable representation for trajectories is needed • Which data model for trajectories? • A means of comparing trajectories has to be provided • Which distance between objects? • OPTICS needs one to be defined in order to perform range queries
A trajectory data model • Raw input data: • Each trajectory is represented as a set of time-stamped coordinates • $T = (t_1, x_1, y_1), \ldots, (t_n, x_n, y_n)$ => Object position at time $t_i$ was $(x_i, y_i)$ • Data model • Parametric-spaghetti: linear interpolation between consecutive points
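A minimal sketch of the linear-interpolation data model follows. It assumes a trajectory is a list of `(t, x, y)` tuples with strictly increasing timestamps; the function name `position_at` is hypothetical, and returning `None` outside the observed time span is a design choice of this sketch.

```python
from bisect import bisect_right

def position_at(traj, t):
    """Position (x, y) at time t by linear interpolation, or None if
    t falls outside the time span covered by the trajectory."""
    if not traj:
        return None
    times = [p[0] for p in traj]
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_right(times, t) - 1      # last sample with timestamp <= t
    if i == len(traj) - 1:
        return tuple(traj[-1][1:])      # t is exactly the final timestamp
    t0, x0, y0 = traj[i]
    t1, x1, y1 = traj[i + 1]
    w = (t - t0) / (t1 - t0)            # fraction of the segment covered
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
```

For example, with `traj = [(0, 0.0, 0.0), (10, 10.0, 0.0), (20, 10.0, 10.0)]`, the object is halfway along the first segment at `t = 5` and halfway along the second at `t = 15`.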
A distance between trajectories • Adopted distance = average distance • It is a metric => efficient indexing methods can be used
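The slide does not spell out the formula, so the following is only a sketch under an assumption: “average distance” is read as the Euclidean distance between the two objects averaged over their common time interval. The integral is approximated numerically with the midpoint rule; an exact segment-by-segment computation would also be possible. The names `interp` and `average_distance` are hypothetical.

```python
import math

def interp(traj, t):
    # Linear interpolation on a list of (t, x, y) samples sorted by time.
    for (t0, x0, y0), (t1, x1, y1) in zip(traj, traj[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    raise ValueError("t outside the trajectory time span")

def average_distance(a, b, samples=100):
    """Approximate time-averaged Euclidean distance between trajectories
    a and b over the interval in which both are defined."""
    start = max(a[0][0], b[0][0])
    end = min(a[-1][0], b[-1][0])
    if end <= start:
        raise ValueError("trajectories do not overlap in time")
    total = 0.0
    for k in range(samples):
        t = start + (k + 0.5) * (end - start) / samples   # midpoint rule
        (xa, ya), (xb, yb) = interp(a, t), interp(b, t)
        total += math.hypot(xa - xb, ya - yb)
    return total / samples
```

Two objects moving in parallel at a constant offset of 3 units get an average distance of 3, as expected.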
A sample dataset • Set of trajectories forming 4 clusters + noise • Generated by the CENTRE system (KDDLab software)
OPTICS vs. HAC & K-means • [Figure: clustering results on the sample dataset for K-means, HAC-average and OPTICS]
Temporal focusing • Different time intervals can show different behaviours • E.g.: objects that are close to each other within a time interval can be far apart in other periods of time • The time interval becomes a parameter • E.g.: rush hours vs. low traffic times • Problem: significant time intervals are not always known a priori • An automated mechanism is needed to find them
Temporal focusing • The proposed method • Provide a notion of interestingness to be associated with time intervals • We define it in terms of estimated quality of the clustering extracted on the given time interval • Formalize the Temporal focusing task as an optimization problem • Discover the time interval that maximizes the interestingness measure
A quality measure for density-based clustering • General principle • High-density clusters separated by low-density noise are preferred • The method • High-density clusters correspond to low dents in the reachability plot => Evaluate the global quality Q of the clustering output as the negated average reachability within clusters (noise is discarded) • [Figure: reachability plots for low-density, medium-density and high-density clusters] • Definition: given ε′ and dataset D, compute $Q_{D,\varepsilon'}$ as: $Q_{D,\varepsilon'} = -R(D, \varepsilon') = -\mathrm{AVG}_{o \in D'}\ \text{reach-d}(o)$, where $D' = D \setminus \{\text{noise objects}\}$
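A minimal sketch of the quality measure is given below. It makes one simplifying assumption that the slide does not state: an object counts as noise when its reachability exceeds the threshold (so the first object of each cluster, whose reachability is infinite, is also dropped from the average). The function name `quality` is hypothetical.

```python
import math

def quality(reach, threshold):
    """Negated average reachability of the objects kept after discarding
    those whose reachability exceeds the threshold (treated as noise).
    Higher (closer to zero) means denser clusters."""
    kept = [r for r in reach if r <= threshold]
    if not kept:
        return -math.inf          # nothing but noise: worst possible score
    return -sum(kept) / len(kept)
```

A plot with deeper dents scores higher: reachabilities around 0.5 give a quality of -0.5, which beats the -1.0 obtained from reachabilities around 1.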
FAQs • How is Q() computed for a given time interval I? • Step 1: trajectory segments outside I are clipped away • Step 2: OPTICS is run on the clipped trajectories • Step 3: Q(I) is computed on the output reachability plot • How is the reachability threshold set for each interval? • A reachability threshold is needed in order to locate clusters (and noise) • The threshold for the largest I is manually set by the user • Thresholds for other intervals I′ ⊆ I are computed from the first one by proportionally rescaling w.r.t. average reachability • Is the optimal Q(I) biased towards tiny intervals? • Yes. The problem has been fixed by defining Q′(I) = Q(I) / log |I| => A small decrease in Q(I) is accepted when it yields a much larger I
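Step 1 and the length correction can be sketched as follows. This is an illustration under assumptions: trajectories are lists of `(t, x, y)` samples with strictly increasing timestamps and linear interpolation between them, the logarithm is natural, and |I| is measured in units that make it greater than 1 (so that log |I| is positive). The names `clip` and `adjusted_quality` are hypothetical.

```python
import math

def clip(traj, lo, hi):
    """Keep only the part of a piecewise-linear trajectory inside [lo, hi],
    interpolating new end points where a segment crosses the boundary."""
    out = []
    for (t0, x0, y0), (t1, x1, y1) in zip(traj, traj[1:]):
        a, b = max(t0, lo), min(t1, hi)
        if a >= b:
            continue                          # segment lies outside [lo, hi]
        for t in (a, b):
            w = (t - t0) / (t1 - t0)
            point = (t, x0 + w * (x1 - x0), y0 + w * (y1 - y0))
            if not out or out[-1][0] < t:     # skip duplicated junctions
                out.append(point)
    return out

def adjusted_quality(q, interval_length):
    """Q'(I) = Q(I) / log|I|; requires |I| > 1 so the divisor is positive."""
    if interval_length <= 1:
        raise ValueError("interval length must be greater than 1")
    return q / math.log(interval_length)
```

Because Q is negative, dividing by a larger log |I| moves the score towards zero: a quality of -1.1 on an interval of length 100 is preferred to a quality of -1.0 on an interval of length 10.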
Experiments • A more complex sample dataset (generated by CENTRE) • Clear clusters in the central time interval vs. dispersion on the borders
Optimizing Q() • Find the optimal Q() by plotting values for all time intervals • The optimum corresponds to the central time interval
Heuristics for optimum search • Each Q() value computation requires a run of the OPTICS algorithm • Computing all O(N^2) values is too expensive (N=|{sub-intervals}|) • Alternative approaches are needed • Preliminary tests with hill-climbing (i.e., greedy) approach: • Test on the same dataset • Global optimum found in 70.7% of runs • Avg. number of steps: 17 • Avg. OPTICS runs: 49 • [Figure: search space showing starting points, the global optimum and local optima]
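A greedy search of this kind can be sketched as below. The slide does not describe the neighborhood used, so this sketch assumes that an interval is a pair of indices `(i, j)` over `n` base sub-intervals and that a move shifts one end point by one position. The `score` callback stands in for Q′(I), which in the talk's setting would cost one OPTICS run per new interval; results are cached so each interval is scored once. The name `hill_climb` is hypothetical.

```python
def hill_climb(score, start, n):
    """Greedy search over intervals (i, j) with 0 <= i < j <= n.
    Returns (best interval, its score, steps taken, score evaluations)."""
    cache = {}

    def cached(interval):
        if interval not in cache:
            cache[interval] = score(interval)   # expensive call, done once
        return cache[interval]

    current = start
    steps = 0
    while True:
        i, j = current
        candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        candidates = [(a, b) for a, b in candidates if 0 <= a < b <= n]
        best = max(candidates, key=cached, default=current)
        if cached(best) <= cached(current):
            # No neighbor improves the score: local (possibly global) optimum.
            return current, cached(current), steps, len(cache)
        current = best
        steps += 1
```

Since the search can stop at a local optimum, it would typically be restarted from several starting intervals, which matches the multiple starting points reported on the slide.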
Conclusions & Future work • Summary of the work • Extension of OPTICS to a trajectory data model & distance • Definition of the Temporal Focusing problem • Definition of a clustering quality measure • (Preliminary) Tests with exhaustive & greedy optimization • Future work • Experimental validation over broader benchmarks • Tighter integration between OPTICS and search strategy • Alternative, domain-specific definitions of quality measures