390 likes | 508 Views
Partition Based Spatial – Merge Join. Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790). 1. Agenda. Problem Definition of Spatial Join PBSM Algorithm 3.1 Filter Step
E N D
Partition Based Spatial – Merge Join Present by: Tony Tong (09049620) Cleo Tsang (09049630) June Yau (09030360) Chelsie Chan (10104740) Oengus Lam (10104790) 1
Agenda • Problem • Definition of Spatial Join • PBSM Algorithm 3.1 Filter Step 3.2 Refinement Step 3.3 Number of Partition 3.4 Spatial Partitioning Function • Performance 4.1 Indexed Nested Loops Join 4.2 R-tree Based Join Algorithm 4.3 Methodology 4.4 None of Indices Pre-exist 4.5 In the Presence of Pre-existing Index 4.6 CPU Costs • Conclusion 2
1. Problem In a spatial database system, like GIS, join queries objects involve large amount of memory Have no pre computing data for datasets Usually no index for intermediate result Solve this join problem efficiently 3
2. Definition of Spatial Join Q: “Find all pairs of rivers and cities that intersect” c2 c3 c5 Join Result Pairs: <r1,c1> <r1,c2> <r2,c5> c1 r1 c4 r2 An operation of combining 2 or more datasets based on their spatial relationship 4
3. PBSM Algorithm Filter Step Refinement Step Input R Unique Identifer (OID) Input C Partition Based Spatial-Merge Join (PBSM) PBSM operates in 2 steps 5
3.1 Filter Step c2 c3 c5 c1 r1 c4 r2 Key-Pointer Element Input Rkp <r1, mr1> <r2,mr2> Input Ckp <c1, mc1> <c2, mc2> ... • Purpose: • To find all objects whose MBR intersects the query rectangle • For each input (R and C), • Creation of Minimum Bounding Rectangle (MBR) • Rough Estimation for Search Region • Key-Pointer Element in New Input (Rkp and Ckp) (OID + MBR) 6
3.1 Filter Step c2 MBR.xu MBR.xl c3 c5 c1 r1 c4 r2 Start with the first entry r1, sweep a vertical line Check if MBRr1∩ MBRc2, add (OIDr1, OIDc2) to result set Check if MBRr1∩ MBRc1, add (OIDr1, OIDc1) to result set Scan until MBR.xu,start the next entry • Spatial Join (1st Scenario) • Rkp and Ckp fit into main memory • Plane-Sweeping Technique • Sort by MBR.xl for Each Input (Rkp and Ckp) • Select the MBR in Either Input (e.g. Rkp) with Smallest MBR.xl • Scan along the x-axis from MBR.xl to MBR.xu to check if MBRr∩ MBRc 7
3.1 Filter Step Partition 1 Partition 0 c2 c3 c5 Partition 3 Partition 2 c1 r1 c4 r2 • Spatial Join (2nd Scenario) • Rkp and Ckp do not fit into main memory • Spatial Partitioning Technique • Size of Each Partition for both Input (Rkp and Ckp) can fit into memory simultaneously • Perform Plane-Sweeping Technique for Preliminary Spatial Join in Each Partition • Result Pair <OIDR, OIDC> 8
3.2 Refinement Step Partition 1 Partition 0 Partition 3 Partition 2 c5 c4 • #2: To examine the actual R and S tuples & see if the attributes satisfy join condition r2 Result Pairs: Partition 1: <r2,c5> Partition 3: <r2,c5>, <r2,c4> • Purpose • #1: To eliminate duplicates induced by Partitioning 9
3.2 Refinement Step • Procedure • #1: Sort OID pairs: • Primary Sort Key: OIDR • Secondary Sort Key: OIDC • #2: Read R tuples first, then C
3.3 Number of Partitions • Number of Partition P is computed as:- where P: Number of partition R: Cardinality of R C : Cardinalityof C Sizekey-ptr: Size of a key-pointer element (in bytes) M: Size of main memory (in bytes) 11
3.4 Spatial Partitioning Function Partition 0 Partition 1 Universe Partition 2 Partition 3 Non-Uniform Distributed & Clustered Spatial Features By “Regular” Partitioning Method Large differences in size of partitions 12
3.4 Spatial Partitioning Function “Tile-based” Partitioning Method + Round Robin Mapping Scheme “Regular” Partitioning Method Step 1: Regular decomposition of universe into NT tiles, where NT> =P Step 2: Apply Tile-to-Partition Mapping Scheme Round Robin OR Hashing 13
3.4 Spatial Partitioning Function It assigns equal number of tuples to each partition • What is the PERFECTSpatial Partitioning Function ? • Considerations: • Number of Tiles • Tile-to-Partition Mapping Scheme(Round Robin OR Hashing) • Data set used for investigation: • Tiger Road Data (62.4MB, 456,613 tuples) • Sequoia Polygon Data (21.9MB, 58,115 tuples) 14
3.4 Spatial Partitioning Function The PERFECT Partitioning Function has a coefficient of variation = 0 Spatial Partitioning Function Alternatives: Tiger Road Data • Observation: Partitioning Function improves as No. of Tiles increases More uniform distribution 15
3.4 Spatial Partitioning Function Number of Tiles = An integral multiple of Number of Partitions Replication Overhead: Tiger Road Data (16 Partitions) Replication Overhead: SequoiaPolygon Data(16 Partitions) Observation: ↑ No. of Tiles, ↑ Replication Overhead 16
Scenario: • No. of Tiles = 9 • P = 3 • Tile-to-Partition Mapping Scheme = Round Robin c2 c1 r1 The entire column is being mapped to a single partition ∴ ↓ Replications by partitioning, ↓ Replication overheads
3.4 Spatial Partitioning Function Number of Tiles = An integral multiple of Number of Partitions Replication Overhead: Tiger Road Data (16 Partitions) Replication Overhead: SequoiaPolygon Data(16 Partitions) Observation: ↑ No. of Tiles, ↑ Replication Overhead 18
4. Performance Indexed Nested Loops Join PBSM Join(1024 tiles) R-tree Based Join V.S V.S 19
4.3 Methodology • Database System: Paradise • Machine: Sun SPARC-10/51 • 64 MB of memory • SunOS Release 4.1.3 • One Seagate 2GB disk 22
TIGER file • Road, Hydrography and Rail data of the United States etc… • 2 join queries • Road with Hydrography • Between the Road and the Rail data 23
Sequoia 2000 Storage Benchmark • Polygon • Regions of homogeneous landuse characteristics in California • Islands • Holes in the polygon data 24
TIGER: Join Road with Hydrograhy PBSM is 48-98% faster than the R-Tree Based; 93-300% faster than the Idx. Nested Loops. 26
TIGER: Join Road with Rail Rail data: 2.4MB (Index: 1.0MB), fits in buffer pool; Idx. Nested Loops performs better than R-Tree Based. 27
Cluster Data continuously , i.e. not randomly distributed Data are mostly in sequential order in real life Less computationally expensive 28
Clustered TIGER: Join Road with Hydrography PBSM is 40% faster than the R-Tree Based; and 60-80% faster than the Idx. Nested Loops. 29
Costs • Index Building Cost • Cost of extracting the key-pointers from the input • Sorting the key-pointers • Building the index using the sorted key pointers • If Input is clustered No sorting key-pointers Cost of building index • Tree Joining Cost • Refinement Step Cost 30
Sequoia Data PBSM is 13-27% faster than the R-Tree Based; and 17-114% faster than the Idx. Nested Loops. 31
Summary PBSM is better than R-tree and the Indexed Nested Loops based algorithm When sizes of 2 inputs differ significantly,Indexed Nested Loops is better than the R-tree based algorithm All algorithms improve if join inputs are clustered 32
TIGER: Join Road with Hydrograhy When indices pre-exist on both the inputs, the R-tree based algorithm has the best performance 34
TIGER: Join Road with Rail When an index exists only on the smaller input, PBSM performs best. 35
4.6 CPU Cost • Insert a table • CPU cost > I/O cost • System • CPU intensive • Much less I/O is needed 36
5. Principal Behind • Divide and Conquer • Optimization on memory size
6. Playback of this presentation Efficient PBSM algorithm Comparison among different algorithm Performance Analysis Clustered Data Indexed Data 38
Question ? 39