Searching Trajectories by Locations – An Efficiency Study

Zaiben Chen¹, Heng Tao Shen¹, Xiaofang Zhou¹, Yu Zheng², Xing Xie² ¹ The University of Queensland ² Microsoft Research Asia

Outline • Research problem & application scenarios • Basic ideas • K Best-Connected Trajectory (k-BCT) query • The Incremental k-NN Algorithm (IKNN) • Performance study • Best-first • Depth-first • Optimization & extension • Experiments • Conclusion

Research Problem: Searching Trajectory Databases How to retrieve the trajectories we want? GPS trajectories collected by GeoLife Project, MSRA

Searching Trajectory Databases • Search by a location: Frentzos et al., GeoInformatica 2007; Pfoser et al., VLDB 2000 (R-tree variants) • Search by a sample trajectory: Chen et al., SIGMOD 2005; Vlachos et al., ICDE 2002; Yi et al., ICDE 1998, etc. (similarity search)

Searching Trajectory Databases • The problem we study: searching by multiple locations • To find trajectories that are ‘close’ to all the locations • Technically, it extends the single-location based query, but it is more complicated. • Practically, it provides a more general way to search trajectories. Two extreme cases: one location, many locations

Application motivations • The Microsoft GeoLife Project http://research.microsoft.com/en-us/projects/geolife/ GeoLife is a location-based service built on Microsoft Virtual Earth. Our work benefits the following two functions: (1) Travel recommendation, e.g. helping a visitor plan a trip to multiple attractions by considering other people’s travel trajectories. (2) Sharing life experiences & friend recommendation, e.g. finding out which users share a similar daily route through Queens Plaza, Central Stat., Mains St.


Application motivations • Geo-coding: from pictures to coordinates • The recommended route • The first step: define the closeness (i.e. distance) between a trajectory and the query locations

Similarity Function • The similarity function reflects how close a trajectory is to the given locations; we call the most similar trajectory the best-connected trajectory. • Notation: Q = {q1, q2, …, qm} is the set of query locations and R = {p1, p2, …, pn} is a trajectory. • Step 1. Find the closest trajectory point on R to each location qi: $Dist_q(q_i, R) = \min_{p_j \in R} |q_i, p_j|$ is the shortest distance from qi to R. • Step 2. Sum up the contribution of each matched pair (unordered query): $Sim(Q, R) = \sum_{i=1}^{m} e^{-Dist_q(q_i, R)}$
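The two steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming 2-D points, Euclidean distance, and the exponential contribution e^(-distance) used in the bound slides; the names `dist_q` and `similarity` are chosen here for illustration.

```python
import math

def dist_q(q, trajectory):
    """Step 1: shortest Euclidean distance from query location q to the trajectory."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in trajectory)

def similarity(query, trajectory):
    """Step 2: sum exp(-distance) over all query locations (unordered query)."""
    return sum(math.exp(-dist_q(q, trajectory)) for q in query)

# Toy data (made up for this example).
Q = [(0.0, 1.0), (2.0, 1.0)]
R = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(similarity(Q, R))  # both locations are at distance 1 from R: 2 * exp(-1)
```

Each query location contributes at most 1 (when it lies exactly on the trajectory), so Sim(Q, R) ranges between 0 and m.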

Problem Definition • k Best-Connected Trajectory (k-BCT) query: given a set of trajectories T = {R1, R2, …, Rn}, a set of query locations Q = {q1, q2, …, qm}, and the similarity function Sim(Q, R), the k-BCT query finds the k trajectories in T that have the highest similarity. • Assumption: the number of query locations is small (m is a small constant). • Intuition: the k-BCT result is the JOIN of m single-location based queries.
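As a reference point, the definition can be evaluated directly by scoring every trajectory. The sketch below is a brute-force baseline, not the method proposed in the talk; it assumes 2-D points and Euclidean distance, and the function names and toy data are made up for illustration.

```python
import heapq
import math

def similarity(query, trajectory):
    """Sim(Q, R): sum over query locations of exp(-shortest distance to R)."""
    return sum(
        math.exp(-min(math.hypot(p[0] - q[0], p[1] - q[1]) for p in trajectory))
        for q in query
    )

def k_bct_brute_force(trajectories, query, k):
    """Score every trajectory and keep the k with the highest similarity."""
    scored = ((similarity(query, traj), tid) for tid, traj in trajectories.items())
    return [tid for _, tid in heapq.nlargest(k, scored)]

# Toy database: trajectory id -> list of (x, y) points.
T = {
    "R1": [(0, 0), (1, 0), (2, 0)],
    "R2": [(0, 5), (1, 5), (2, 5)],
    "R3": [(10, 10), (11, 10)],
}
Q = [(0, 1), (2, 1)]
print(k_bct_brute_force(T, Q, 2))
```

This baseline touches every point of every trajectory, which is the cost the index-based search tries to avoid.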

Basic ideas Incremental k-NN Algorithm (IKNN) • Step 1. Index all the trajectory points in a single R-tree, so that the shortest distance from a query location to a trajectory can be obtained. • Step 2. Search for the λ nearest neighbors (λ-NN) of each query location (q1 to qm), using any traditional k-nearest-neighbor algorithm over the R-tree. For any trajectory scanned by a λ-NN search, its shortest distance to that query location is known. Candidate set C = {all scanned trajectories}

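IKNN algorithm • Step 3. Construct lower bounds of similarity. For a trajectory R1 in C, assume that its points p1, p2 and p3 have been scanned by the λ-NN searches of q1 and q2, with p1 the closest to q1 and p2 the closest to q2, while its closest point to q3, p5, has not been scanned yet. Then $Sim(Q, R_1) = e^{-|q_1, p_1|} + e^{-|q_2, p_2|} + e^{-|q_3, p_5|} \ge e^{-|q_1, p_1|} + e^{-|q_2, p_2|}$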
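A minimal sketch of the lower bound, assuming the matched distances of a candidate are kept in a dictionary keyed by query index (a representation chosen here for illustration). Query locations that have not matched any point yet simply contribute nothing, which is safe because every term e^(-distance) is positive.

```python
import math

def lower_bound(matched):
    """matched: query index -> shortest distance to the trajectory found so far.

    Unmatched query locations are left out, so the sum can only underestimate
    the true similarity.
    """
    return sum(math.exp(-d) for d in matched.values())

# R1 has been reached from q1 (distance 1.0) and q2 (distance 2.0), but not from q3.
lb = lower_bound({0: 1.0, 1: 2.0})
exact = lb + math.exp(-3.5)  # the true value adds the unknown q3 term (3.5 is made up)
print(lb <= exact)
```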

The Incremental k-NN algorithm • Step 4. Construct an upper bound of similarity. For any trajectory that is not covered by the λ-NN searches, e.g. R5, its distance to each qi must be at least radius_i, the radius of the λ-NN search region of qi. Hence $Sim(Q, R_5) = e^{-|q_1, R_5|} + e^{-|q_2, R_5|} + e^{-|q_3, R_5|} \le e^{-radius_1} + e^{-radius_2} + e^{-radius_3}$
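The upper bound for un-scanned trajectories depends only on the current search radii, so a single value covers all of them. A small sketch, assuming the radius of each λ-NN search is the distance to its λ-th nearest point (the function name is illustrative):

```python
import math

def upper_bound_unscanned(radii):
    """radii[i] is the radius of the lambda-NN search around query location q_i.

    A trajectory with no scanned point lies outside every search circle, so its
    distance to q_i is at least radii[i] and its similarity is at most this sum.
    """
    return sum(math.exp(-r) for r in radii)

print(upper_bound_unscanned([1.0, 2.0, 0.5]))
```

The bound shrinks as the radii grow, which is what eventually lets the search stop.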

The Incremental k-NN algorithm • Step 5. Check the STOP condition (pruning condition). For a k-BCT query, if we can get k candidate trajectories whose lower bounds are not less than the upper bound of similarity for all un-scanned trajectories, then the k best-connected trajectories must be included in the candidate set. • If the condition is satisfied, go to the refinement step; otherwise, increase λ by some Δ and repeat the search process. • As the search region of the λ-NN search enlarges, the k best-connected trajectories will eventually be found.
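Steps 2 to 5 can be put together in one loop. The sketch below is an illustration under stated assumptions, not the authors' implementation: a sorted list of all points per query location stands in for the R-tree (so a λ-NN search is just a prefix of that list), the database and the query are assumed non-empty, and all names are hypothetical.

```python
import math

def point_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def similarity(query, traj):
    return sum(math.exp(-min(point_dist(p, q) for p in traj)) for q in query)

def iknn(trajectories, query, k, delta=None):
    """Return the ids of the k best-connected trajectories, best first."""
    delta = delta or k
    # Stand-in for the R-tree: (distance, trajectory id) pairs sorted by distance.
    ranked = [sorted((point_dist(p, q), tid)
                     for tid, traj in trajectories.items() for p in traj)
              for q in query]
    total = len(ranked[0])
    lam = min(k, total)
    while True:
        known = {}   # trajectory id -> {query index: shortest distance}
        radii = []   # radius of each lambda-NN search
        for i, pts in enumerate(ranked):
            for d, tid in pts[:lam]:
                # The first hit of a trajectory is its closest point to q_i.
                known.setdefault(tid, {}).setdefault(i, d)
            radii.append(pts[lam - 1][0])
        lower = sorted((sum(math.exp(-d) for d in per_q.values())
                        for per_q in known.values()), reverse=True)
        upper_unscanned = sum(math.exp(-r) for r in radii)
        # STOP condition: k lower bounds reach the bound for un-scanned trajectories.
        enough = len(lower) >= k and lower[k - 1] >= upper_unscanned
        if enough or lam == total:
            break
        lam = min(lam + delta, total)
    # Refinement: compute the exact similarity for the candidates only.
    exact = sorted(((similarity(query, trajectories[tid]), tid) for tid in known),
                   reverse=True)
    return [tid for _, tid in exact[:k]]

T = {
    "R1": [(0, 0), (1, 0), (2, 0)],
    "R2": [(0, 5), (1, 5), (2, 5)],
    "R3": [(10, 10), (11, 10)],
}
print(iknn(T, [(0, 1), (2, 1)], k=2))
```

Note that every round rebuilds the candidate information from scratch, which mirrors the repeated λ-NN searches discussed next.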

Problem • The problem: we may need to increase λ and recompute the lower/upper bounds for many rounds before we eventually find the k-BCT results. • The λ-NN search will run for many rounds for every query location (let λ be a constant k initially, and Δ be k as well): round 1: 1 – k nearest neighbors; round 2: 1 – 2k nearest neighbors; …; round i: 1 – i·k nearest neighbors. • Trajectory points are visited multiple times: reaching a final λ = i·k visits k(1 + 2 + … + i) = k·i(i+1)/2 ≈ λ²/(2k) points per query location. Normally λ >> k, so the complexity is O(λ²). • Can we reduce the overlapped search regions?
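The cost of restarting the search each round is easy to check with a little arithmetic. The sketch below counts points visited per query location when λ starts at k and grows by Δ = k, and compares it with a single pass that never revisits a point; the function names are illustrative.

```python
def visited_with_restarts(k, rounds):
    """Round j re-scans the j*k nearest points from scratch."""
    return sum(j * k for j in range(1, rounds + 1))

def visited_incrementally(k, rounds):
    """An incremental search touches each of the final lambda = rounds*k points once."""
    return rounds * k

k, rounds = 10, 10  # final lambda = 100
print(visited_with_restarts(k, rounds), visited_incrementally(k, rounds))  # 550 100
```

With restarts the count grows quadratically in the final λ, while the incremental count stays linear.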

Efficiency study of the IKNN • Adaptation of the λ-NN algorithm • The best-first nearest neighbor search [Hjaltason et al., TODS99]: a priority queue stores all the R-tree entries that have yet to be visited, keyed by MINDIST, so MBRs and objects are visited in MINDIST order. • The depth-first nearest neighbor search [Roussopoulos et al., SIGMOD95]: it recursively descends the R-tree in a depth-first manner, while maintaining a global list of the k nearest candidates found so far. • Estimate the performance of the IKNN when adopting different λ-NN algorithms

Adaptation of the λ-NN algorithm • The best-first NN search • Retrieve the λ, λ+∆, λ+2∆, … nearest neighbors of each query location incrementally, until the k best-connected trajectories are included in the candidate set. • Benefit: the λ-NN result is returned incrementally. The search is I/O optimal and no overlap occurs, so the total number of visited points is Vsum = λ. • Shortcoming: memory consumption is NOT bounded. A priority queue stores all the R-tree entries that have yet to be visited, and in an extreme case it may grow as large as the whole dataset.
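The best-first idea can be sketched with a priority queue and a generator, so that asking for the next Δ neighbors simply resumes where the previous round stopped. This is a minimal illustration: a hand-built two-level tree of dictionaries stands in for a real R-tree, and `make_node`, `mindist` and `incremental_nn` are names chosen here.

```python
import heapq
import itertools
import math

def make_node(points=(), children=()):
    """Toy tree node: a leaf holds points, an inner node holds child nodes."""
    xs = [p[0] for p in points] + [c["mbr"][i] for c in children for i in (0, 2)]
    ys = [p[1] for p in points] + [c["mbr"][i] for c in children for i in (1, 3)]
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)),
            "points": list(points), "children": list(children)}

def mindist(q, mbr):
    """Smallest possible distance from q to anything inside the rectangle."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def incremental_nn(root, q):
    """Yield (distance, point) in increasing distance; nothing is visited twice."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(mindist(q, root["mbr"]), next(tie), root, None)]
    while heap:
        dist, _, node, point = heapq.heappop(heap)
        if point is not None:
            yield dist, point
            continue
        for child in node["children"]:
            heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child, None))
        for p in node["points"]:
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            heapq.heappush(heap, (d, next(tie), None, p))

root = make_node(children=[
    make_node(points=[(0, 0), (1, 1), (2, 0)]),
    make_node(points=[(5, 5), (6, 5), (5, 7)]),
    make_node(points=[(10, 0), (9, 1)]),
])
search = incremental_nn(root, (4, 4))
first_round = [next(search)[1] for _ in range(3)]   # lambda = 3
second_round = [next(search)[1] for _ in range(2)]  # the next delta = 2 neighbors
print(first_round, second_round)
```

The queue is what makes resumption possible, and it is also the source of the unbounded memory use: every entry that has been discovered but not yet expanded stays in it.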

The best-first strategy • Performance (R-tree leaf accesses) • Estimate the radius r of the circular region that contains λ points [Belussi et al., VLDB95] • Estimate the leaf accesses of a range query with radius r [Korn et al., TKDE 2001] • m independent λ-NN queries (Figure: query location q and the circle of radius r that contains λ objects)

Adaptation of the λ-NN algorithm • The depth-first NN search • Every time we search for the λ+∆ NN, we have to re-visit the search region of the previous λ-NN query. • Benefit: guaranteed memory usage, $O(c \log_c N)$ • Drawback: too many overlaps • A simple improvement: double λ at each round, to reduce the number of rounds and amortize the cost. • Pruning: all MBRs whose MAXDIST is smaller than the current search range of the λ-NN search have already been fully scanned, so they can be skipped in the search for the λ+∆ NN.
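A depth-first λ-NN search and the doubling schedule can be sketched as follows. This is an illustration under assumptions: a toy tree of dictionaries stands in for the R-tree, only the standard MINDIST pruning is shown (the MAXDIST-based skipping of already scanned MBRs is omitted), and all function names are hypothetical.

```python
import heapq
import math

def make_node(points=(), children=()):
    """Toy tree node: a leaf holds points, an inner node holds child nodes."""
    xs = [p[0] for p in points] + [c["mbr"][i] for c in children for i in (0, 2)]
    ys = [p[1] for p in points] + [c["mbr"][i] for c in children for i in (1, 3)]
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)),
            "points": list(points), "children": list(children)}

def mindist(q, mbr):
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def _visit(node, q, k, best):
    """best is a max-heap (distances negated) holding at most k candidates."""
    for p in node["points"]:
        d = math.hypot(p[0] - q[0], p[1] - q[1])
        if len(best) < k:
            heapq.heappush(best, (-d, p))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, p))
    for child in sorted(node["children"], key=lambda c: mindist(q, c["mbr"])):
        if len(best) == k and mindist(q, child["mbr"]) >= -best[0][0]:
            continue  # this subtree cannot improve the current k candidates
        _visit(child, q, k, best)

def depth_first_knn(root, q, k):
    """Return the k nearest points to q, closest first."""
    best = []
    _visit(root, q, k, best)
    return [p for _, p in sorted((-neg, p) for neg, p in best)]

def doubling_schedule(k, needed):
    """Values of lambda tried when lambda starts at k and doubles each round."""
    lam, schedule = k, [k]
    while lam < needed:
        lam *= 2
        schedule.append(lam)
    return schedule

root = make_node(children=[
    make_node(points=[(0, 0), (1, 1), (2, 0)]),
    make_node(points=[(5, 5), (6, 5), (5, 7)]),
    make_node(points=[(10, 0), (9, 1)]),
])
print(depth_first_knn(root, (4, 4), 3))
print(doubling_schedule(5, 33))  # [5, 10, 20, 40]
```

Only the k candidates and the recursion stack are kept in memory, but each new λ restarts from the root. In this example doubling requests 5 + 10 + 20 + 40 = 75 neighbors in total, against 5 + 10 + … + 35 = 140 when λ grows by Δ = 5 per round.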

The depth-first strategy • Performance (R-tree leaf accesses) • The search region is not necessarily a circle, so we cannot use the previous method directly. • Estimate the size of the first visited MBR (at any level) that contains at least λ points • Estimate the radius (MAXDIST) of the region that contains this MBR • R-tree nodes outside the circle centered at qi with radius MAXDIST won’t be visited. (Figure: query location qi, MBR1, and its MAXDIST)

The depth-first strategy (cont.) • Performance • Estimate the leaf access of a range query with radius MAXDIST [Korn et al. TKDE2001] • Finally,

Summary • Although the best-first strategy gives no guarantee on memory usage, it normally runs faster, and its priority queue can still easily fit in the main memory of a modern computer. • The modified depth-first strategy reaches nearly the same performance as the best-first strategy while preserving low memory consumption.

Optimization & Extension • Considering the importance of the query locations and assigning different weights in exploring objects. • Extension to query locations with an order specified

Experiments • 12,653 trajectories (1,147,116 points) collected by the GeoLife project • Number of query locations: 2 to 10 • Tests are conducted on a PC with a 2.1GHz CPU and 1GB of memory

Experiments – Node Access

Experiments – Query Time

Experiments – Memory Usage

Thank you
