Top-k Queries on Temporal Data
Feifei Li (1), Ke Yi (2), Wangchao Le (1)
(1) Florida State University; (2) Hong Kong University of Science & Technology
Problem Def.
• Temporal data: data that change over time.
• Typical examples
  - stock traces
  - objects' trajectories
[Figure: score plotted against time]
Problem Def.
• For efficient storage, indexing, querying, etc., a time series is often represented as a piecewise linear function, called a Piecewise Linear Approximation (PLA).
• Each PLA is called an object.
[Figure: a time series (score against time) and a PLA object with 4 line segments]
Problem Def. (cont.)
• Ranking queries on temporal data: top-k queries at time instants.
• Given a set of PLA objects {o_i | i = 1, …, n}, a time instant t and an integer k, a top-k/t query retrieves the k objects with the highest scores at time t.
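The definition above can be made concrete with a brute-force baseline. This is a minimal Python sketch, not the indexed method of the talk: it assumes each PLA object is stored as a time-sorted list of (time, score) breakpoints, and the names `pla_score` and `top_k_at` are hypothetical.

```python
import heapq
from bisect import bisect_right

def pla_score(breakpoints, t):
    """Score at time t of a PLA given as [(time, score), ...] sorted by time.
    Returns None when t lies outside the object's time span."""
    times = [p[0] for p in breakpoints]
    if t < times[0] or t > times[-1]:
        return None
    j = bisect_right(times, t) - 1
    if j == len(breakpoints) - 1:      # t is exactly the last breakpoint
        return breakpoints[-1][1]
    (t0, s0), (t1, s1) = breakpoints[j], breakpoints[j + 1]
    return s0 + (s1 - s0) * (t - t0) / (t1 - t0)   # linear interpolation

def top_k_at(objects, t, k):
    """Ids of the k objects with the highest score at time t (linear scan)."""
    scored = []
    for oid, bps in objects.items():
        s = pla_score(bps, t)
        if s is not None:
            scored.append((s, oid))
    return [oid for _, oid in heapq.nlargest(k, scored)]

objects = {
    "a": [(0, 1), (10, 11)],
    "b": [(0, 10), (10, 0)],
    "c": [(0, 5), (5, 5), (10, 5)],
}
print(top_k_at(objects, 2, 2))
```

Every query here touches all n objects, which is the cost an index is meant to avoid.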
State of the Art
• Use an R-tree
• R-tree revisited:
  - Indexes multi-dimensional information
  - Linear space
  - Branch and bound with a priority queue
  - No worst-case query cost guarantee (linear scan in the worst case)
• Treat an object as a trajectory
  - Break each trajectory into segments
  - Build the R-tree on the segments
• Use a kNN query at time t
  - Add an artificial query point q that is high enough (example on the next slide)
State of the Art (cont.)
• kNN query at time t using the R-tree
  - Use the minimum snapshot distance (MinSTDist): the distance from q measured along time instant t.
• Branch and bound with MinSTDist
  - Stop when the priority queue holds k objects whose MinSTDist is smaller than that of every unseen object.
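One way to read the artificial query point: if q sits at time t with a score at least as high as every object's, the distance along time t is q's score minus the object's score, so the k nearest objects are exactly the k highest-scoring ones. The Python sketch below illustrates only that equivalence, with no R-tree and no branch and bound; `snapshot_dist`, `knn_at` and the precomputed score dictionary are assumptions made for the example.

```python
def snapshot_dist(q_score, obj_score):
    """Distance between q and an object measured along time instant t
    (an assumed reading of the snapshot distance for a single object)."""
    return abs(q_score - obj_score)

def knn_at(scores_at_t, q_score, k):
    """Ids of the k objects nearest to q at time t.
    scores_at_t maps object id -> score at time t."""
    ranked = sorted(scores_at_t,
                    key=lambda oid: snapshot_dist(q_score, scores_at_t[oid]))
    return ranked[:k]

scores = {"a": 3.0, "b": 8.0, "c": 5.0}
q_high = max(scores.values()) + 1.0   # "high enough": above every score
print(knn_at(scores, q_high, 2))
```

If q is not above every score the equivalence breaks, which is why the slide asks for a point that is high enough.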
State of the Art (cont.)
• Efficiency of the R-tree based approach
  - Linear space consumption
  - Handles queries on higher-dimensional problems
• Deficiency of the R-tree based approach
  - No worst-case performance guarantee (build, query)
  - Current commercial DBMSs have limited support for R-trees
Our contribution
• We propose the seb-tree, the Sampled Envelope B-tree.
• Simplicity
  - The B-tree is the only building block, so it is easy to integrate into commercial DBMSs
• Optimal query performance
  - Answers a top-k/t query in a logarithmic number of I/Os in expectation
• Handles updates
  - 99.5% of updates end up as simple insertions/deletions
  - Only 0.5% of updates need to lock and modify a larger portion of the B-tree
• Size & construction
  - Occupies near-linear space
  - Requires near-linear time to build
Seb-tree (rand. sampling)
• Let S be a set of N line segments in the plane
• Build a series of random samples of S
  - Define l+1 sampling ratios p_i (0 ≤ i ≤ l), applied independently
  - Sample S with ratio p_i
  - This gives a sampled set S_i and an unsampled set US_i
  - In total, l+1 groups of S_i and US_i
• How to decide l and p_i?
  - l = [expression missing in the source], where kmax is the highest possible k
  - p_i is a geometrically decreasing series: p_i = 1/(2^i B), i = 0, 1, …, l, where B is the number of segments that fit in a disk block
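The sampling step can be sketched in Python as below. Because the expression for l is not recoverable from the slide, l is taken as an input here; `sampling_ratios`, `sample_levels` and the fixed seed are hypothetical choices for illustration.

```python
import random

def sampling_ratios(l, B):
    """p_i = 1 / (2^i * B) for i = 0..l, as on the slide."""
    return [1.0 / (2 ** i * B) for i in range(l + 1)]

def sample_levels(segments, l, B, seed=0):
    """For each level i, split the segments into (S_i, US_i) by keeping each
    segment in S_i independently with probability p_i."""
    rng = random.Random(seed)          # seeded only to make runs repeatable
    levels = []
    for p in sampling_ratios(l, B):
        sampled, unsampled = [], []
        for s in segments:
            (sampled if rng.random() < p else unsampled).append(s)
        levels.append((sampled, unsampled))
    return levels

segments = list(range(1000))           # stand-ins for line segments
levels = sample_levels(segments, l=3, B=4)
for i, (S_i, US_i) in enumerate(levels):
    print(i, len(S_i), len(US_i))
```

Each level is drawn from the full set S, so the samples are independent of one another rather than nested.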
Seb-tree (the upper envelope)
• For each sample S_i, compute its upper envelope env_i
  - What is the upper envelope? The pointwise maximum of the segments: at each time, the highest score among them.
• The upper envelope can be computed in near-linear time (1989)
[Figures: a random sample S_i; S_i and its upper envelope env_i]
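To make the upper envelope concrete, here is a deliberately naive Python sketch. It is not the near-linear algorithm cited on the slide: it tries every pairwise crossing, so it is quadratic. Segments are assumed to be tuples (x1, y1, x2, y2) with x1 < x2, and all function names are hypothetical.

```python
from itertools import combinations

def value_at(seg, x):
    """y-value of seg at x, or None when x is outside its x-span."""
    x1, y1, x2, y2 = seg
    if x < x1 or x > x2:
        return None
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

def crossing_x(a, b):
    """x where the supporting lines of a and b cross, or None if parallel."""
    ma = (a[3] - a[1]) / (a[2] - a[0])
    mb = (b[3] - b[1]) / (b[2] - b[0])
    if ma == mb:
        return None
    return (b[1] - mb * b[0] - a[1] + ma * a[0]) / (ma - mb)

def upper_envelope(segments):
    """Pieces (x_left, x_right, index of the top segment), left to right."""
    xs = set()
    for s in segments:
        xs.update((s[0], s[2]))                      # segment endpoints
    for a, b in combinations(segments, 2):
        x = crossing_x(a, b)
        if x is not None and max(a[0], b[0]) <= x <= min(a[2], b[2]):
            xs.add(x)                                # crossings inside both spans
    xs = sorted(xs)
    pieces = []
    for lo, hi in zip(xs, xs[1:]):
        mid = (lo + hi) / 2                          # the top segment is fixed inside (lo, hi)
        best = None
        for idx, s in enumerate(segments):
            v = value_at(s, mid)
            if v is not None and (best is None or v > best[0]):
                best = (v, idx)
        if best is None:
            continue                                 # gap: no segment covers this interval
        if pieces and pieces[-1][2] == best[1] and pieces[-1][1] == lo:
            pieces[-1] = (pieces[-1][0], hi, best[1])  # extend the previous piece
        else:
            pieces.append((lo, hi, best[1]))
    return pieces

segs = [(0, 0, 10, 10), (0, 10, 10, 0), (2, 4, 8, 4)]
print(upper_envelope(segs))
```

The third segment never reaches the top, so it contributes no piece; the boundaries between pieces are the envelope vertices used by the decomposition on the next slide.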
Seb-tree (the trapezoidal decomp.)
• For each vertex on env_i
  - shoot a vertical line upward
  - if the vertex is an endpoint of a segment, also shoot downward until the line hits another segment or score = 0
• This yields the trapezoidal decomposition of S_i: D(S_i).
[Figures: S_i and its upper envelope env_i; S_i and its decomposition]
Seb-tree (the conflict list)
• Conflict
  - consider a trapezoid ∆ from some D(S_i) and a segment s ∈ US_i
  - we say s conflicts with ∆ if s intersects ∆
• Conflict list
  - for each ∆, find all s ∈ US_i that conflict with it (do we need to consider s ∈ S_i?)
  - collect all such segments into a list, called the conflict list C(∆)
[Figure: a trapezoid ∆ crossed by segments Sa, Sb, Sc, Sd, Se, so C(∆) = {Sa, Sb, Sc, Sd, Se}]
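A conflict test can be sketched in Python under one simplifying assumption that the slides do not state outright: each trapezoid is modelled as the region above a single envelope piece, unbounded upward, and described by its floor (x_left, y_left, x_right, y_right). Segments are (x1, y1, x2, y2) with x1 < x2, touching counts as intersecting, and all names are hypothetical.

```python
def line_y(x1, y1, x2, y2, x):
    """y-value at x of the line through (x1, y1) and (x2, y2)."""
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

def conflicts(seg, trap):
    """True if seg reaches the region above trap's floor within trap's x-span."""
    lo, hi = max(seg[0], trap[0]), min(seg[2], trap[2])
    if lo > hi:
        return False                   # x-spans do not overlap
    # Both seg and the floor are linear on [lo, hi], so seg is on or above
    # the floor somewhere in the overlap iff it is at one of the two ends.
    return any(line_y(*seg, x) >= line_y(*trap, x) for x in (lo, hi))

def build_conflict_lists(traps, unsampled):
    """C(trap) for every trapezoid, by checking every unsampled segment."""
    return [[s for s in unsampled if conflicts(s, t)] for t in traps]

traps = [(0, 0, 5, 5), (5, 5, 10, 0)]          # floors of two trapezoids
s1, s2, s3 = (1, 3, 4, 3), (6, 1, 9, 0.5), (3, 10, 8, 10)
print(build_conflict_lists(traps, [s1, s2, s3]))
```

This all-pairs check costs time proportional to the number of trapezoids times the number of segments; the hierarchical decomposition later in the talk exists to avoid that.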
Seb-tree (the index)
• Let ∆_1, ∆_2, …, ∆_t be the trapezoids of D(S_i) from left to right
  - sorted by the starting x value of ∆
• Build a B-tree T_i on C(∆_1), C(∆_2), …, C(∆_t) in that order
• Build one B-tree for each level of sampling
  - l+1 B-trees in total
Size of seb-tree
• Lemma 1 (1989): E(|C(∆)|) = O(1/p)
  - By Lemma 1, for a ∆ on level i, E(|C(∆)|) = O(2^i B)
• Lemma 2 (1986): There are O(n · α(n)) vertices on the upper envelope of n line segments in the plane, where α(n) is the inverse Ackermann function, which can be treated as a constant for all practical input sizes.
  - D(S_i) has an expected O(N/(2^i B) · α(N/B)) trapezoids
  - B-tree T_i occupies O(N · α(N/B)) blocks
• Size of seb-tree
  - Over all the B-trees, the size of the seb-tree is [expression missing in the source]
More on seb-tree
• Each line segment may intersect multiple trapezoids
• How to build the conflict lists efficiently?
• Hierarchical decomposition
• Conflict lists can be built in near-linear time.
The hierarchical decomposition
• Let L_0 be the set of segments in S_i. We then build a gradation L_0, L_1, …, L_λ, where L_j is a 1/2 sample of L_(j-1) and λ = O(log |L_0|).
[Figure: levels L_0, L_1, L_2]
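The gradation can be sketched in Python as repeated coin flips. The stopping rule (stop once a level has a single segment, or when the next level would be empty) is an assumption made for the example, as are the name `build_gradation` and the fixed seed.

```python
import random

def build_gradation(L0, seed=0):
    """Return [L_0, L_1, ..., L_lambda], where each L_j keeps every segment
    of L_(j-1) independently with probability 1/2."""
    rng = random.Random(seed)          # seeded only to make runs repeatable
    levels = [list(L0)]
    while len(levels[-1]) > 1:
        nxt = [s for s in levels[-1] if rng.random() < 0.5]
        if not nxt:                    # assumed rule: never store an empty level
            break
        levels.append(nxt)
    return levels

gradation = build_gradation(range(64))
print([len(level) for level in gradation])
```

Since each level halves in expectation, the number of levels is logarithmic in |L_0| in expectation, matching λ = O(log |L_0|).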
The hierarchical decomposition
• For each L_j, we build its trapezoidal decomposition D(L_j)
• We further partition D(L_j) with the vertical dividing lines from the higher levels D(L_(j+1)), …, D(L_λ)
• Store all trapezoids in this hierarchy in a tree (HDT).
[Figure: levels L_0, L_1, L_2]
The hierarchical decomposition
• To decide which C(∆) a line segment belongs to at L_0, we search top-down from L_λ, visiting a ∆ if and only if the segment intersects it.
[Figure: segments seg1 and seg2 traced through the trapezoids of L_2, L_1 and L_0]
Cost on building conflict lists
• For a particular level S_i, the decomposition has a height of λ = O(log |S_i|)
• For a segment s, the time spent visiting the HDT is proportional to the size of the HDT, which is [expression missing in the source]
• At L_j, a conflict list has an expected size E(|C(∆)|) = O(2^(i+j) B)
• |L_j| = O(N/(2^(i+j) B)) and there are O(|L_j| · α(|L_j|)) trapezoids in D(L_j), so D(L_j) has an expected size of O(N · α(N/B) · log(N/B))
• The total time spent on all l+1 HDTs is [expression missing in the source]
Query on seb-tree
• A query on the seb-tree is simple (a single for-loop)
  - Given k and a time instant t, set i = 0
  1. Using B-tree T_i, do a point search to find the ∆ whose x-span contains t, and read its conflict list C(∆)
  2. If at least k segments in C(∆) intersect time t, return the top-k of them; otherwise, if i < l, set i = i+1 and repeat step 1
  3. If no level succeeds, scan the entire S to find the top-k segments
  - An improvement: instead of starting at i = 0, start directly at level i = log(k/B), because 2^i B needs to be at least k.
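The query loop above can be sketched in Python as follows. Sorted in-memory lists with `bisect` stand in for the B-trees, and each level is assumed to be a pair (sorted trapezoid start values, conflict lists in the same order) that has already been built as the earlier slides describe; the code only shows the control flow and does not build or validate the index. All names are hypothetical.

```python
from bisect import bisect_right

def score_at(seg, t):
    """Score of segment (x1, y1, x2, y2) at time t, or None outside its span."""
    x1, y1, x2, y2 = seg
    if t < x1 or t > x2:
        return None
    return y1 + (y2 - y1) * (t - x1) / (x2 - x1)

def top_k_scan(segments, t, k):
    """Top-k segments at time t among those that span t (linear scan)."""
    scored = [(score_at(s, t), s) for s in segments]
    scored = [p for p in scored if p[0] is not None]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [s for _, s in scored[:k]]

def start_level(k, B):
    """Smallest i with 2^i * B >= k, the improved starting level."""
    i = 0
    while (2 ** i) * B < k:
        i += 1
    return i

def seb_query(levels, all_segments, t, k, B):
    """levels[i] = (starts, conflict_lists); falls back to a full scan."""
    for starts, conflict_lists in levels[start_level(k, B):]:
        j = bisect_right(starts, t) - 1          # trapezoid whose x-span holds t
        if j < 0:
            continue
        hits = top_k_scan(conflict_lists[j], t, k)
        if len(hits) >= k:                       # step 2: enough segments at t
            return hits
    return top_k_scan(all_segments, t, k)        # step 3: brute-force scan

s1, s2, s3 = (0, 5, 10, 5), (0, 3, 10, 3), (0, 1, 10, 1)
levels = [([0, 5], [[s1], [s1]]), ([0], [[s1, s2, s3]])]
print(seb_query(levels, [s1, s2, s3], t=2, k=2, B=2))
```

In the toy index, level 0 holds only one segment per trapezoid, so a query with k = 2 moves on to level 1 before it can answer.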
Query cost
• The query performance guarantee comes from the B-tree
• For any query, the seb-tree index finds the top-k/t segments in expected O(log_B N + k/B) I/Os
• The probability that the seb-tree has to fall back to a brute-force scan is less than B/N, and scanning the whole data set needs O(N/B) I/Os, so the fallback adds only (B/N) · O(N/B) = O(1) to the expected query cost.
Updating the seb-tree
• Recall that to build the B-tree at level i, we need to
  - take a 1/(2^i B) sample of S to get S_i
  - build the trapezoidal decomposition D(S_i)
  - store the conflict lists in the level-i B-tree
• Given a new segment s
  - if it changes none of D(L_0), …, D(L_λ), simply follow the HDT to find the conflict lists s belongs to
  - if it does change one of D(L_0), …, D(L_λ), we need to rebuild a larger portion of the seb-tree
• Deletion can be handled similarly.
Space-query tradeoff
• Based on Lemma 1: we expect O(1/p) conflicting segments for any trapezoid on level S_i, where p is the sampling rate 1/(2^i B)
• To avoid expensive I/O, we define a threshold λ: when |C(∆)| > λ · O(1/p), simply do not store the list (and skip it at query time)
• In practice, λ = 3 or 4
[Figure: a trapezoid ∆ crossed by segments Sa, Sb, Sc, Sd, Se, with |C(∆)| = O(1/p)]
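The threshold rule can be sketched in Python as below. Two assumptions are made for the example: the constant hidden in O(1/p) is taken to be 1, so the cutoff is λ · 2^i · B, and a dropped list is represented by `None`. The parameter is called `lam` to keep it apart from the decomposition height, which the slides also write as λ.

```python
def prune_conflict_lists(conflict_lists, i, B, lam=3):
    """Keep a level-i conflict list only if it has at most lam * 2^i * B
    segments; oversized lists are replaced by None and skipped by queries."""
    limit = lam * (2 ** i) * B         # lam * (1/p) with p = 1 / (2^i * B)
    return [c if len(c) <= limit else None for c in conflict_lists]

lists = [list(range(12)), list(range(13)), []]
print([None if c is None else len(c)
       for c in prune_conflict_lists(lists, i=1, B=2)])
```

A query that lands on a dropped list simply moves on to the next level, trading a little query work for less stored data.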
Experiment
• How does the seb-tree behave when
  1) the number of time series changes
  2) the deviation of the time series changes
  3) the threshold λ changes
  4) kmax changes
• Comparison with the R-tree
Experiment (1) • Index size & construction time
Experiment (2) • Query cost
Experiment (3) • Effect of Kmax
Conclusion
• Studied ranking queries on temporal data
• Proposed the seb-tree, which
  - takes near-linear time to construct
  - occupies near-linear space
  - supports dynamic updates efficiently
  - employs the B-tree as its only building block