350 likes | 470 Views
Efficient Temporal Join Processing using Indices. Donghui Zhang University of California, Riverside Vassilis J. Tsotras University of California, Riverside Bernhard Seeger University of Marburg, Germany. Contents. Problem definition: GTE-Join Straightforward approaches
E N D
Efficient Temporal Join Processing using Indices Donghui Zhang University of California, Riverside Vassilis J. Tsotras University of California, Riverside Bernhard Seeger University of Marburg, Germany ICDE 2002, San Jose, CA
Contents • Problem definition: GTE-Join • Straightforward approaches • Temporal indexing • Proposed join algorithms • Performance study • Conclusions ICDE 2002, San Jose, CA
Problem Definition • Temporal record: (key, start, end, attributes) • TE-Join: two records qualify for join if • their time intervals intersect; and • their keys are equal. ICDE 2002, San Jose, CA
TE-Join: “find the locations and Managers of all departments over time”. ICDE 2002, San Jose, CA
Interesting because: • temporal relations are large; • TE-Join is a special case, when r and i are (-, +). Problem Definition • GTE-Join: general TE-Join – record keys should be in a certain range r and time intervals should intersect a given interval i. ICDE 2002, San Jose, CA
GTE-Join: “find the locations and managers of departments in range [D1, D2] during time [5, 10]”. ICDE 2002, San Jose, CA
Straightforward Solutions • Non-indexed join; • Unsynchronized join; • Synchronized join using B+-trees; • Synchronized join using R-trees. ICDE 2002, San Jose, CA
Straightforward Solutions • Non-indexed join: existing TE-Join research [Zur97] focuses on non-indexed join; not efficient for GTE-Join due to full scan. • Unsynchronized join: separate the selection and join phases; not efficient for: • storage of intermediate result; • selection in one relation ignores data distribution of the other relation. ICDE 2002, San Jose, CA
Not efficient: Straightforward Solutions • Synchronized using B+-trees; • If cluster on start: • Cluster on end is similar. ICDE 2002, San Jose, CA
Straightforward Solutions • Synchronized using B+-trees; • If cluster on key: • records with keys in r are stored together and are sorted; • focus on these records in each relation and sort-merge join, while skipping those whose intervals not in i. • However, not efficient since records in the query rectangleare scattered. ICDE 2002, San Jose, CA
Straightforward Solutions • Synchronized using R-trees; • Store each record as a two-dimensional interval in the R-tree; • Use existing R-tree join algorithms [BKS93, HJR97]; • Modification: integrate the selection regarding query rectangle. • However, not efficient since R-trees do not handle long intervals well. ICDE 2002, San Jose, CA
Our Solutions • Synchronized join using temporal indices. • Multi-version B+-tree (MVBT) [BGO+96]: asymptotically optimal space, update, query. • We propose: two categories of synchronized, MVBT-based join algorithms. (apply to other temporal indices as well) ICDE 2002, San Jose, CA
Review of MVBT • Suppose a page holds up to 3 records. ICDE 2002, San Jose, CA
Review of MVBT • Suppose a page holds up to 3 records. ICDE 2002, San Jose, CA
Review of MVBT • Suppose a page holds up to 3 records. ICDE 2002, San Jose, CA
Review of MVBT • Suppose a page holds up to 3 records. ICDE 2002, San Jose, CA
Review of MVBT • Suppose a page holds up to 3 records. ICDE 2002, San Jose, CA
Review of MVBT • A “forest”: different trees may overlap; • Root nodes correspond to contiguous, non-intersecting time intervals; • A record may be stored in multiple pages; end time of all but the last copy is +. • Range-Interval selection algorithms [BS96]: avoid duplicate by reporting the first copy. ICDE 2002, San Jose, CA
The Incorrect End Time Problem [BS96] reports first copy of x (whose end is +); would lead GTE-Join algorithms to join x with y. Solution: report the rightmost copy! ICDE 2002, San Jose, CA
Top-down Approaches • Idea: for each pair of trees, one from each MVBT forest, synchronized tree traversal (STT). • STT for two trees: • initially, join root nodes; • to join two nodes, join their children; • eventually, join elements in leaf pages. ? join condition? ICDE 2002, San Jose, CA
Balancing Condition Optimization (BCO) • To find <x, y>, Page 3 and page 0 has to join; • In general, join two pages even though they do not intersect. Inefficient! • BCO: balancing two conditions. (1) only intersecting pages join; (2) examine records even if not last copy. E.g. join <x, y> when joining page 2 with page 0. ICDE 2002, San Jose, CA
A1’ • With VHO: <A1, B2>, <A1, B3> Virtual Height Optimization (VHO) • At the middle level, STT joins: <A2, B2>, <A3, B2>, <A4, B2>,<A2, B3>, <A3, B3> ,<A4, B3> ICDE 2002, San Jose, CA
Sideways Approach 1: Link-based • find pairs of data pages that intersect with the right border of the query rectangle and with each other; • keep such pairs in priority queue; • sweep left synchronously. • In each leaf page, store a pointer to its predecessor; • For GTE-Join: D ICDE 2002, San Jose, CA
Sideways Approach 1: Link-based • find pairs of data pages that intersect with the right border of the query rectangle and with each other; • keep such pairs in priority queue; • sweep left synchronously. • In each leaf page, store a pointer to its predecessor; D • For GTE-Join: • special techniques to avoid duplicates. ICDE 2002, San Jose, CA
Sideways Approach 2: Plane Sweep • Similar to link-based; • Maintain two priority queues, one for each MVBT; • At each step, access the leaf page with the largest end time and add records to buffer; • To add records to buffer, join with existing records from the other MVBT; • Throw away useless records. ICDE 2002, San Jose, CA
Performance Study ICDE 2002, San Jose, CA
Experimental Setup • Implemented in GNU C++; • Sun Enterprise 250 Server machine with two UltraSPARC-II processors using Solaris 2.8; • Page size = 8KB; • Buffer size = 10MB; LRU buffer; • Each data set: 10 million records; • QRS: size ratio between the query rectangle and the whole space. • Long intervals: 1/100 of time space; • Short intervals: 1/10,000 of time space. ICDE 2002, San Jose, CA
GTE-Join Performance Joining mainly long intervals. ICDE 2002, San Jose, CA
GTE-Join Performance Joining mainly short intervals. ICDE 2002, San Jose, CA
GTE-Join Performance (Log Scale) Varying QRS. ICDE 2002, San Jose, CA
Conclusions • We addressed the GTE-Join; • Unsynchronized approach not efficient; • Synchronized approaches based on traditional indices (B+-tree, R-tree) also not efficient; • We proposed synchronized approaches based on temporal indices (MVBT); • We also proposed BCO and VHO optimizations; • Experiments: link-based is the best. ICDE 2002, San Jose, CA
Thank you! ICDE 2002, San Jose, CA