
Partition Based Spatial-Merge Join


Presentation Transcript


  1. Partition Based Spatial-Merge Join Presented by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790) 1

  2. Agenda • Problem • Definition of Spatial Join • PBSM Algorithm 3.1 Filter Step 3.2 Refinement Step 3.3 Number of Partitions 3.4 Spatial Partitioning Function • Performance 4.1 Indexed Nested Loops Join 4.2 R-tree Based Join Algorithm 4.3 Methodology 4.4 When No Indices Pre-exist 4.5 In the Presence of Pre-existing Indices 4.6 CPU Costs • Conclusion 2

  3. 1. Problem • In a spatial database system such as a GIS, join queries involve spatial objects and large amounts of memory • There is often no pre-computed data (e.g. indices) for the datasets • There is usually no index on intermediate results • Goal: solve this spatial join problem efficiently 3

  4. 2. Definition of Spatial Join • An operation that combines 2 or more datasets based on their spatial relationship • Example query Q: “Find all pairs of rivers and cities that intersect” • [Figure: rivers r1, r2 and cities c1–c5 on a map] Join Result Pairs: <r1,c1> <r1,c2> <r2,c5> 4
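As an illustration of the join predicate in this example query (not part of the original slides), here is a minimal Python sketch: rivers and cities are reduced to axis-aligned rectangles, and the join reports every pair whose extents intersect. All names and coordinates below are made up for the example.

from itertools import product

def intersects(a, b):
    # True if two rectangles (xl, yl, xu, yu) overlap
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Illustrative data only: rectangles standing in for river/city geometries
rivers = {"r1": (0, 0, 4, 2), "r2": (5, 3, 8, 5)}
cities = {"c1": (1, 1, 2, 2), "c2": (3, 0, 4, 1), "c5": (6, 4, 7, 5)}

# Brute-force spatial join: all (river, city) pairs whose extents intersect
result = [(r, c) for (r, rr), (c, cc) in product(rivers.items(), cities.items())
          if intersects(rr, cc)]
print(result)  # [('r1', 'c1'), ('r1', 'c2'), ('r2', 'c5')]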

  5. 3. PBSM Algorithm • Partition Based Spatial-Merge Join (PBSM) operates in 2 steps: a Filter Step followed by a Refinement Step • Each tuple in the inputs R and C carries a Unique Identifier (OID) 5

  6. 3.1 Filter Step • Purpose: to find all candidate pairs of objects whose MBRs intersect • For each input (R and C): • Compute the Minimum Bounding Rectangle (MBR) of each object, a rough approximation of its search region • Form a Key-Pointer Element (OID + MBR) for each object, producing the new inputs Rkp and Ckp • [Figure: Input Rkp = <r1, MBRr1>, <r2, MBRr2>; Input Ckp = <c1, MBRc1>, <c2, MBRc2>, …] 6
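A small sketch (our own, assuming each tuple is given as an OID plus a list of (x, y) points) of how key-pointer elements could be built in this step: each object's MBR is computed and paired with its OID to form the new inputs Rkp and Ckp.

def mbr(points):
    # Minimum Bounding Rectangle (xl, yl, xu, yu) of a point list
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def key_pointers(relation):
    # Build key-pointer elements (OID + MBR) for one input relation
    return [(oid, mbr(geometry)) for oid, geometry in relation]

# Illustrative input: two "rivers" given as polylines
R = [("r1", [(0, 0), (2, 1), (4, 2)]), ("r2", [(5, 3), (8, 5)])]
R_kp = key_pointers(R)   # [('r1', (0, 0, 4, 2)), ('r2', (5, 3, 8, 5))]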

  7. 3.1 Filter Step • Spatial Join (1st Scenario): Rkp and Ckp fit into main memory • Plane-Sweeping Technique: • Sort each input (Rkp and Ckp) by MBR.xl • Select the entry from either input (e.g. Rkp) with the smallest MBR.xl • Sweep along the x-axis from its MBR.xl to its MBR.xu, checking whether MBRr ∩ MBRc ≠ ∅ • Example: start with the first entry r1 and sweep a vertical line; MBRr1 intersects MBRc2, so add (OIDr1, OIDc2) to the result set; MBRr1 intersects MBRc1, so add (OIDr1, OIDc1); scan until MBR.xu, then start the next entry 7
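The plane sweep described on this slide can be sketched as follows (a simplified sketch, not the paper's actual code): both key-pointer lists are sorted on MBR.xl, the entry with the smallest MBR.xl is taken next, and only entries of the other list whose MBR.xl falls before its MBR.xu are checked for y-overlap. Entries are (oid, (xl, yl, xu, yu)) tuples; the sample data is illustrative.

def sweep_join(R_kp, C_kp):
    # Entries are (oid, (xl, yl, xu, yu)); returns candidate OID pairs
    R = sorted(R_kp, key=lambda e: e[1][0])
    C = sorted(C_kp, key=lambda e: e[1][0])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(C):
        if R[i][1][0] <= C[j][1][0]:
            oid_r, (_, yl, xu, yu) = R[i]
            k = j
            while k < len(C) and C[k][1][0] <= xu:   # sweep from MBR.xl to MBR.xu
                _, cyl, _, cyu = C[k][1]
                if yl <= cyu and cyl <= yu:          # y-extents overlap too
                    out.append((oid_r, C[k][0]))
                k += 1
            i += 1
        else:
            oid_c, (_, yl, xu, yu) = C[j]
            k = i
            while k < len(R) and R[k][1][0] <= xu:
                _, ryl, _, ryu = R[k][1]
                if yl <= ryu and ryl <= yu:
                    out.append((R[k][0], oid_c))
                k += 1
            j += 1
    return out

# Illustrative data: r1 intersects c1 and c2, r2 intersects c5
R_kp = [("r1", (0, 0, 4, 2)), ("r2", (5, 3, 8, 5))]
C_kp = [("c1", (1, 1, 2, 2)), ("c2", (3, 0, 4, 1)), ("c5", (6, 4, 7, 5))]
print(sweep_join(R_kp, C_kp))   # [('r1', 'c1'), ('r1', 'c2'), ('r2', 'c5')]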

  8. 3.1 Filter Step • Spatial Join (2nd Scenario): Rkp and Ckp do not fit into main memory • Spatial Partitioning Technique: • Choose partitions so that the corresponding pieces of both inputs (Rkp and Ckp) fit into memory simultaneously • Perform the Plane-Sweeping Technique as a preliminary spatial join within each partition • Output: result pairs <OIDR, OIDC> • [Figure: the universe divided into Partitions 0–3, with r1, r2 and c1–c5 spread across them] 8

  9. 3.2 Refinement Step • Purpose: • #1: To eliminate duplicates induced by partitioning • #2: To examine the actual R and C tuples and check whether their attributes satisfy the join condition • Example: r2 and c5 overlap both Partition 1 and Partition 3, so the filter step reports Partition 1: <r2,c5> and Partition 3: <r2,c5>, <r2,c4>; the pair <r2,c5> is a duplicate 9

  10. 3.2 Refinement Step • Procedure • #1: Sort the OID pairs: • Primary Sort Key: OIDR • Secondary Sort Key: OIDC • #2: Read the R tuples first, then the C tuples, and evaluate the exact join condition on each candidate pair
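A minimal sketch of this procedure (our own, with a hypothetical exact_intersects() predicate standing in for the real geometric test): sorting on (OIDR, OIDC) makes duplicates induced by partitioning adjacent so they can be skipped, and the surviving pairs are then checked against the actual tuples.

def refine(candidate_pairs, r_tuples, c_tuples, exact_intersects):
    # Step 1: sort OID pairs (primary key OID_R, secondary key OID_C)
    candidate_pairs.sort()
    result, prev = [], None
    for pair in candidate_pairs:
        if pair == prev:                    # duplicate induced by partitioning
            continue
        prev = pair
        oid_r, oid_c = pair
        # Step 2: fetch the actual tuples and evaluate the exact join condition
        if exact_intersects(r_tuples[oid_r], c_tuples[oid_c]):
            result.append(pair)
    return result

# Illustrative use: <r2,c5> was reported by two partitions; the predicate is a stub
pairs = [("r2", "c5"), ("r2", "c4"), ("r2", "c5")]
print(refine(pairs, {"r2": "geom_r2"}, {"c4": "geom_c4", "c5": "geom_c5"},
             lambda a, b: True))            # [('r2', 'c4'), ('r2', 'c5')]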

  11. 3.3 Number of Partitions • The number of partitions P is computed as: P = ⌈ ((R + C) × Sizekey-ptr) / M ⌉ where P: number of partitions, R: cardinality of R, C: cardinality of C, Sizekey-ptr: size of a key-pointer element (in bytes), M: size of main memory (in bytes) 11
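Reading this formula into code (a sketch; the cardinalities and byte sizes below are purely illustrative):

import math

def num_partitions(card_r, card_c, size_key_ptr, mem_bytes):
    # P = ceil((R + C) * Size_key-ptr / M): choose enough partitions that the
    # key-pointer elements of one partition of each input fit in memory together
    return math.ceil((card_r + card_c) * size_key_ptr / mem_bytes)

# Illustrative values: 500,000 + 60,000 tuples, 20-byte key-pointers, 4 MB of memory
print(num_partitions(500_000, 60_000, 20, 4 * 1024 * 1024))   # -> 3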

  12. 3.4 Spatial Partitioning Function • [Figure: the universe split by a “Regular” partitioning method into Partitions 0–3] • Spatial features are non-uniformly distributed and clustered • With the “Regular” Partitioning Method → large differences in the sizes of the partitions 12

  13. 3.4 Spatial Partitioning Function • “Tile-based” Partitioning Method (vs. the “Regular” Partitioning Method) • Step 1: Regular decomposition of the universe into NT tiles, where NT ≥ P • Step 2: Apply a Tile-to-Partition Mapping Scheme: Round Robin OR Hashing 13
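A sketch of this two-step function (our own; the grid layout, coordinates, and helper names are assumptions): the universe is cut into an nx × ny grid of tiles, each tile is mapped to a partition by round robin or hashing, and an MBR is assigned to every partition whose tiles it overlaps, which is where replication comes from.

def tile_to_partition(tile_id, p, scheme="round_robin"):
    # Step 2: Tile-to-Partition Mapping Scheme (Round Robin or Hashing)
    return tile_id % p if scheme == "round_robin" else hash(tile_id) % p

def partitions_for_mbr(mbr, universe, nx, ny, p):
    # Step 1: regular decomposition of the universe into nx*ny tiles (NT >= P),
    # then map every tile the MBR overlaps to its partition
    uxl, uyl, uxu, uyu = universe
    tw, th = (uxu - uxl) / nx, (uyu - uyl) / ny
    xl, yl, xu, yu = mbr
    cols = range(int((xl - uxl) // tw), min(int((xu - uxl) // tw), nx - 1) + 1)
    rows = range(int((yl - uyl) // th), min(int((yu - uyl) // th), ny - 1) + 1)
    tiles = {r * nx + c for r in rows for c in cols}
    return {tile_to_partition(t, p) for t in tiles}

# Illustrative: an MBR spanning two tiles of a 3x3 grid lands in 2 partitions
print(partitions_for_mbr((3.5, 1.0, 5.5, 2.0), (0, 0, 12, 12), 3, 3, 3))   # {0, 1}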

  14. 3.4 Spatial Partitioning Function • What is the PERFECT Spatial Partitioning Function? One that assigns an equal number of tuples to each partition • Considerations: • Number of Tiles • Tile-to-Partition Mapping Scheme (Round Robin OR Hashing) • Data sets used for investigation: • TIGER Road Data (62.4 MB, 456,613 tuples) • Sequoia Polygon Data (21.9 MB, 58,115 tuples) 14

  15. 3.4 Spatial Partitioning Function • The PERFECT Partitioning Function has a coefficient of variation = 0 • [Chart: Spatial Partitioning Function alternatives on the TIGER Road Data] • Observation: the Partitioning Function improves as the No. of Tiles increases → more uniform distribution 15
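For reference, the coefficient of variation used here is just the standard deviation of the tuples-per-partition counts divided by their mean; a quick sketch with illustrative counts:

from statistics import mean, pstdev

def coeff_of_variation(tuples_per_partition):
    # 0 means every partition received the same number of tuples (the PERFECT case)
    return pstdev(tuples_per_partition) / mean(tuples_per_partition)

print(coeff_of_variation([100, 100, 100, 100]))   # 0.0
print(coeff_of_variation([40, 160, 90, 110]))     # ~0.43, a skewed assignment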

  16. 3.4 Spatial Partitioning Function • Number of Tiles = an integral multiple of the Number of Partitions • [Charts: Replication Overhead for TIGER Road Data (16 Partitions) and Sequoia Polygon Data (16 Partitions)] • Observation: ↑ No. of Tiles → ↑ Replication Overhead 16

  17. Scenario: • No. of Tiles = 9 • P = 3 • Tile-to-Partition Mapping Scheme = Round Robin • With 9 tiles and round robin over 3 partitions, each entire column of tiles is mapped to a single partition ∴ ↓ replication caused by partitioning, ↓ replication overhead

  18. 3.4 Spatial Partitioning Function • Number of Tiles = an integral multiple of the Number of Partitions • [Charts: Replication Overhead for TIGER Road Data (16 Partitions) and Sequoia Polygon Data (16 Partitions)] • Observation: ↑ No. of Tiles → ↑ Replication Overhead 18

  19. 4. Performance • PBSM Join (1024 tiles) vs. Indexed Nested Loops Join vs. R-tree Based Join 19

  20. 4.1 Indexed Nested Loops Join 20

  21. 4.2 R-tree Based Join Algorithm 21

  22. 4.3 Methodology • Database System: Paradise • Machine: Sun SPARC-10/51 • 64 MB of memory • SunOS Release 4.1.3 • One Seagate 2GB disk 22

  23. TIGER file • Road, Hydrography and Rail data of the United States, etc. • 2 join queries: • Road with Hydrography • Road with Rail 23

  24. Sequoia 2000 Storage Benchmark • Polygon: regions of homogeneous land-use characteristics in California • Islands: holes in the polygon data 24

  25. 4.4 WHEN NO INDICES PRE-EXIST 25

  26. TIGER: Join Road with Hydrography • PBSM is 48-98% faster than the R-Tree Based join and 93-300% faster than the Idx. Nested Loops join. 26

  27. TIGER: Join Road with Rail Rail data: 2.4MB (Index: 1.0MB), fits in buffer pool; Idx. Nested Loops performs better than R-Tree Based. 27

  28. Clustered Data • Data stored contiguously, i.e. not randomly distributed • In real life, data are mostly in sequential order • Joins on clustered data are less computationally expensive 28

  29. Clustered TIGER: Join Road with Hydrography PBSM is 40% faster than the R-Tree Based; and 60-80% faster than the Idx. Nested Loops. 29

  30. Costs • Index Building Cost: • Cost of extracting the key-pointers from the input • Sorting the key-pointers • Building the index using the sorted key-pointers • If the input is clustered → no sorting of key-pointers → cost of building the index ↓ • Tree Joining Cost • Refinement Step Cost 30

  31. Sequoia Data PBSM is 13-27% faster than the R-Tree Based; and 17-114% faster than the Idx. Nested Loops. 31

  32. Summary • PBSM is better than the R-tree based and the Indexed Nested Loops based algorithms • When the sizes of the 2 inputs differ significantly, Indexed Nested Loops is better than the R-tree based algorithm • All algorithms improve if the join inputs are clustered 32

  33. 4.5 IN THE PRESENCE OF PRE-EXISTING INDICES 33

  34. TIGER: Join Road with Hydrography • When indices pre-exist on both inputs, the R-tree based algorithm has the best performance 34

  35. TIGER: Join Road with Rail When an index exists only on the smaller input, PBSM performs best. 35

  36. 4.6 CPU Cost • Inserting a table: CPU cost > I/O cost • The system is CPU intensive: much less I/O is needed 36

  37. 5. Principle Behind • Divide and Conquer • Optimization based on the available memory size

  38. 6. Playback of this presentation • Efficient PBSM algorithm • Comparison among different algorithms • Performance Analysis: Clustered Data, Indexed Data 38

  39. Questions? 39
