1 / 37

Joining Massive High-Dimensional Datasets

Joining Massive High-Dimensional Datasets. Tamer Kahveci Christian A. Lang Ambuj K. Singh Department of Computer Science University of California at Santa Barbara http://www.cs.ucsb.edu/~tamer. Motivation: Sample Queries. Join is fundamental database primitive

dholzer
Download Presentation

Joining Massive High-Dimensional Datasets

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Joining Massive High-Dimensional Datasets Tamer Kahveci Christian A. Lang Ambuj K. Singh Department of Computer Science University of California at Santa Barbara http://www.cs.ucsb.edu/~tamer Kahveci, Lang, Singh

  2. Motivation: Sample Queries • Join is fundamental database primitive • Spatial Join: Find all hotels in California that are within three miles of a recreation area. • Sequence Join: Find all pairs of companies from New York Exchange and Tokyo Exchange that have similar closing prices for one month Kahveci, Lang, Singh

  3. Hotels (m) Recreation areas (n) Buffer Recreation areas Hotels Motivation • We assume limited buffer space. • Joining two datasets is expensive • I/O cost • CPU cost • O(mn) Kahveci, Lang, Singh

  4. The Naive Solution: NLJ Dataset 1 (m pages) Buffer = B = 4 min{m,n}+mn/(B-1) page reads mn page comparisons Dataset 2 (n pages) We do not need to compare all page pairs! Kahveci, Lang, Singh

  5. Outline • Reducing search space: Prediction Matrix • Minimizing I/O cost by clustering • Square Cluster • Cost Cluster • Maximizing buffer reuse • Experimental results Kahveci, Lang, Singh

  6. PM-NLJ • Predict the candidate page pairs using plane sweep method on an index structure. Dataset 1 Dataset 2 Kahveci, Lang, Singh

  7. Prediction of Join Kahveci, Lang, Singh

  8. Prediction of Join Kahveci, Lang, Singh

  9. PM-NLJ • Predict the candidate page pairs using plane sweep method on an index structure. Dataset 1 • The final estimate is called Prediction Matrix (PM). • Restrict NLJ to marked entries of PM. • We call this method PM-NLJ. Dataset 2 Kahveci, Lang, Singh

  10. PM-NLJ • The number of marked entries = e. • Performance improvement rate = mn/e. Dataset 1 Dataset 2 Is there a better read schedule? Kahveci, Lang, Singh

  11. Outline • Reducing search space: Prediction Matrix • Minimizing I/O cost by clustering • Square Cluster • Cost Cluster • Maximizing buffer reuse • Experimental results Kahveci, Lang, Singh

  12. Minimizing Number of I/O:Square Clustering • PM-NLJ reads min{m’,n’}+e’ = 9 pages. • Let B=6. Dataset 1 • m’+n’ = 6 page reads suffices. • Savings = e’-max{m’,n’}. • Maximize e’ • Minimize max{m’,n’} • m’+n’ = B • m’=n’=B/2. Dataset 2 Kahveci, Lang, Singh

  13. Minimizing Number of I/O:Square Clustering Dataset 1 O(e) space & time complexity Can we reduce total I/O cost by reducing the amount of random seeks? Dataset 2 Kahveci, Lang, Singh

  14. Minimizing Random Seek Cost: Cost Clustering Dataset 1 • The location of • the pages is • important as • well as their • number! Dataset 2 Kahveci, Lang, Singh

  15. Minimizing Random Seek Cost: Cost Clustering Dataset 1 • O(e) space • complexity • O(e3/2) time • complexity Dataset 2 Kahveci, Lang, Singh

  16. Outline • Reducing search space: Prediction Matrix • Minimizing I/O cost by clustering • Square Cluster • Cost Cluster • Maximizing buffer reuse • Experimental results Kahveci, Lang, Singh

  17. C1 C2 C3 C4 C5 sum 5 5 4 5 2 21 Maximizing Cache Reuse Dataset 1 B = 5 pages C1 C3 Scenario 1 • Cluster order = (C4,C1,C3,C5,C2) • 5+4+3+2+5=19 page reads. C2 Dataset 2 C4 C5 Kahveci, Lang, Singh

  18. C1 C2 C3 C4 C5 sum 5 5 4 5 2 21 Maximizing Cache Reuse Dataset 1 Scenario 1 = 19 C1 Scenario 2 • Cluster order = (C4,C2,C1,C3,C5) • 5+2+3+3+2=15 page reads. C3 C2 Dataset 2 C4 What is the best schedule? C5 Kahveci, Lang, Singh

  19. Sharing Graph C2 2 C1 1 1 1 C3 3 C5 C4 Sharing Graph (SG) Dataset 1 C1 C3 C2 Dataset 2 C4 C5 Kahveci, Lang, Singh

  20. Finding Best Schedule • Each schedule is a path on SG. • Cache reuse = sum of weights of the edges of the corresponding path on SG. • Equivalent to TSP. • NP-Complete. • Use greedy heuristic to find optimal path. C2 2 C1 1 1 1 C3 3 C5 C4 Kahveci, Lang, Singh

  21. Outline • Reducing search space: Prediction Matrix • Minimizing I/O cost by clustering • Square Cluster • Cost Cluster • Maximizing buffer reuse • Experimental results Kahveci, Lang, Singh

  22. Experimental Setup: Datasets • Low dimensional data: • 2-D road intersections of Long Beach (LBeach) & Montgomery County (MGcounty). • 53K & 39K vectors • High dimensional data: • 60-D feature vectors for satellite image database (landsat). • 275K vectors • Sequence data: • Human chromosome 18 (HChr18) & mouse chromosome 18 (MChr18) • 4.2 M & 2.3 M nucleotides Kahveci, Lang, Singh

  23. Experimental Setup: Compared Techniques • NLJ • Epsilon Grid Order (EGO) [BBKK’01] • BFRJ [HJR’97] • PM-NLJ • Random-SC • SC • CC Kahveci, Lang, Singh

  24. Experimental Setup • Three optimizations tested: • OPT 1: reducing space by using the PM. • OPT 2: clustering. • OPT 3: cluster scheduling. Kahveci, Lang, Singh

  25. opt1 opt2 opt3 Itemized Cost Analysis Join on MGCounty & LBeach Kahveci, Lang, Singh

  26. opt1 opt2 opt3 Total Cost Analysis of Various Optimizations Self-join on HChr18 Kahveci, Lang, Singh Buffer Size (num pages)

  27. Comparison of SC & CC Kahveci, Lang, Singh

  28. Total Cost Analysis Join on landsat data Kahveci, Lang, Singh Buffer Size (num pages)

  29. Scalability Analysis Join on landsat data Kahveci, Lang, Singh Database Size (num vectors per database)

  30. Discussion • We proposed three optimizations for join operator. • Prediction matrix • Clustering • Buffer recycling • SC is 2 to 86 times faster than competing techniques for spatial databases, and 13 to 133 times faster than competing techniques for sequence databases • SC is very close to the optimal technique (CC). Kahveci, Lang, Singh

  31. Future Directions • The solution can be generalized to multi-way joins. • Similar optimizations can be applied to NN queries. • Can be applied to biological data. Kahveci, Lang, Singh

  32. Related Work • Join without index • Arge et al 1998 • Blasgen et al 1977 • Bohm et al 2001 • Chan et al 1997 • Graefe 1994 • Koudas et al 1997 • Koudas et al 2000 • Orenstein 1986 • Patel et al 1996 • Shim et al 2002 • Xiao et al 2001 • Join with index • Bercken et al 2000 • Bohm et al 2001 • Brinkhoff et al 1993 • Gurret et al 2000 • Hjaltson et al 1998 • Huang et al 1997 • Lo et al 1994 • Lo et al 1996 THANK YOU Kahveci, Lang, Singh

  33. Using Sharing Graph to Determine Cache Reuse Scenario 1 Scenario 2 C2 C2 2 2 C1 C1 1 1 1 1 1 1 0 C3 C3 0 0 3 3 C5 C5 C4 C4 Reuse = 1+1 = 2 Reuse = 3+2+1 = 6 Kahveci, Lang, Singh

  34. Spatial Join Example Recreation areas Hotels Kahveci, Lang, Singh

  35. Hotels Recreation areas Spatial Join Example Kahveci, Lang, Singh

  36. The Naive Solution: NLJ Dataset 1 (m pages) Buffer = B = 4 min{m,n}+mn/(B-1) page reads mn page comparisons Dataset 2 (n pages) We do not need to compare all page pairs! Kahveci, Lang, Singh

  37. Reading Pages in a Better Order 1 seek + 4 page transfers 3 seeks + 3 page transfers Kahveci, Lang, Singh

More Related