Should SDBMS support the Join Index?: A Case study from CrimeStat


Presentation Transcript


  1. Should SDBMS support the Join Index?: A Case study from CrimeStat Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹ ¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu ²Ned Levine and Associates, Houston, TX, Ned@nedlevine.com ³National Institute of Justice, Washington, D.C., Ronald.Wilson@usdoj.gov

  2. Outline • Introduction • Motivation • Problem Statement • Related Work • Contributions • Self-Join Index • Experimental Evaluation • Conclusion and Future Work

  3. Motivation Application Domains • Crime Analysis: Where are the burglary hotspots? • Epidemiology: Is cancer spatially clustered? • Transportation: Which major highways require traffic calming measures? An Example Query: Where are the burglary hotspots?

  4. W-Matrix and W-Queries • W-Matrix: matrix form of the Neighborhood Graph. • WN: Row-Normalized W-Matrix. • W-Queries: queries that perform a repeated computation of the W-Matrix. • Examples of W-Queries: Hotspots, K-Function, Moran’s I, Geary’s C, G Statistic.
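
To make the relationship between the W-Matrix and its row-normalized form WN concrete, here is a minimal Python sketch. It assumes a binary distance-threshold neighborhood (two points are neighbors when they lie within `radius` of each other); the slides do not state which neighborhood definition is used, and the function names `w_matrix` and `row_normalize` are hypothetical.

```python
import math


def w_matrix(points, radius):
    """Binary W-matrix: w[i][j] = 1.0 if points i and j (i != j) lie within `radius`.

    Assumption for illustration: a simple distance-threshold neighborhood.
    """
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                w[i][j] = 1.0
    return w


def row_normalize(w):
    """Divide each row by its sum so every non-empty row sums to 1."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Three mutually close points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
W = w_matrix(points, radius=1.5)
WN = row_normalize(W)
```

Each W-Query that recomputes `W` from scratch pays the quadratic distance scan again, which is the cost a pre-computed representation is meant to avoid.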

  5. W-Operations Notion of neighbors, successors and predecessors.
     Operations by category:
     • Neighbors: get-all-neighbors()
     • Successor(s): get-all-successors(), get-a-successor()
     • Predecessor(s): get-all-predecessors(), get-a-predecessor()
     • Composite: get-all-predecessors-of-a-successor(), get-a-predecessor-of-a-successor()
     • Others: Delete()
     Example calls on node N2:
     • get-all-neighbors(N2)
     • get-all-successors(N2)
     • get-all-predecessors(N2)
     • get-a-successor(N2, Node-id)
     • get-a-predecessor(N2, Node-id)
     • get-all-predecessors-of-a-successor(N2, Node-id)
     • get-a-predecessor-of-a-successor(N2, Node-id, Node-id)
     • Delete(N2, N1, N3)
     [Figure: example neighborhood graph on nodes N1 to N7, showing Input, Operation and Output for each call]
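
The W-Operations listed on this slide can be sketched over an in-memory neighborhood graph. The slides do not define successors and predecessors, so this sketch assumes, purely for illustration, that a successor is a neighbor with a larger node id and a predecessor is a neighbor with a smaller one. The class name `NeighborhoodGraph`, the integer node ids, and the single-argument `delete` are likewise assumptions and not the authors' interface.

```python
class NeighborhoodGraph:
    """Undirected neighborhood graph with W-Operation style accessors.

    Assumption: successors are neighbors with a larger id,
    predecessors are neighbors with a smaller id.
    """

    def __init__(self, edges):
        self.adj = {}
        for a, b in edges:
            self.adj.setdefault(a, set()).add(b)
            self.adj.setdefault(b, set()).add(a)

    def get_all_neighbors(self, node):
        return sorted(self.adj.get(node, ()))

    def get_all_successors(self, node):
        return [m for m in self.get_all_neighbors(node) if m > node]

    def get_all_predecessors(self, node):
        return [m for m in self.get_all_neighbors(node) if m < node]

    def get_all_predecessors_of_a_successor(self, node, succ_id):
        # Composite operation: validate the successor, then list its predecessors.
        if succ_id not in self.get_all_successors(node):
            raise KeyError(f"{succ_id} is not a successor of {node}")
        return self.get_all_predecessors(succ_id)

    def delete(self, node):
        # Remove the node and every reference to it held by its neighbors.
        for m in self.adj.pop(node, set()):
            self.adj[m].discard(node)


# Hypothetical graph loosely modeled on the N1..N7 example (ids 1..7).
g = NeighborhoodGraph([(1, 2), (2, 3), (2, 6), (2, 7), (3, 7)])
neighbors_of_2 = g.get_all_neighbors(2)
successors_of_2 = g.get_all_successors(2)
predecessors_of_2 = g.get_all_predecessors(2)
preds_of_succ_7 = g.get_all_predecessors_of_a_successor(2, 7)
g.delete(2)
neighbors_of_3_after_delete = g.get_all_neighbors(3)
```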

  6. W-Query Processing Algorithms
     [Figure: K plotted against Distance (Miles), with the Complete Spatial Randomness curve for reference]
     Algorithm CalcRipleyK
     • get-all-neighbors(N)
     • Frequency ← Size(get-all-successors(N))
     Algorithm Hotspots_JI
     • Stage 1: Hotspot Identification
       • Identify a Seed.
       • get-all-neighbors(Seed)
       • get-all-successors(Seed)
     • Stage 2: Hotspot Refinement
       • P ← get-a-predecessor-of-a-successor(Seed, succ-id)
       • If P correlates better with the Successor than with the Seed, remove the Successor from the successor list.
     • Stage 3: Update Remaining Nodes
       • For each S in Hotspot: Delete(S)
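
The frequency-counting step of CalcRipleyK can be illustrated with a small Python sketch of Ripley's K. It uses the common uncorrected estimator K(d) = (A / n²) · Σ_i Σ_{j≠i} 1[d_ij ≤ d], with no edge correction; whether the authors apply an edge correction or a different normalization is not stated in the slides. Successors are again assumed to be neighbors with a larger index, so each unordered pair is counted once and then doubled. The name `ripley_k` and the `area` parameter are hypothetical.

```python
import math


def ripley_k(points, distances, area):
    """Uncorrected Ripley's K estimate for each distance in `distances`.

    Assumptions: no edge correction; normalization by n * n;
    a "successor" of point i is a neighbor with a larger index.
    """
    n = len(points)
    result = {}
    for d in distances:
        pair_count = 0
        for i in range(n):
            # Frequency <- number of successors of i within distance d.
            pair_count += sum(
                1
                for j in range(i + 1, n)
                if math.dist(points[i], points[j]) <= d
            )
        # Each unordered pair stands for two ordered pairs (i, j) and (j, i).
        result[d] = area * 2 * pair_count / (n * n)
    return result


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
k_values = ripley_k(points, distances=[1.0, 1.5], area=100.0)
```

Under complete spatial randomness the expected value is πd², so estimates well above that curve at a given distance suggest clustering.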

  7. Problem Statement • Given: • A spatial (crime) data warehouse. • A set of W-Operations. • Find: • A suitable spatial index type representation. • Objective: • User response time is minimized. • Constraints: • Dataset is updated infrequently. • Concurrency control and recovery considerations are addressed separately. [Figure: Input (Data & W-Operations) and Output. Courtesy: Ned Levine and Associates]

  8. Challenges • Scalability to Large Datasets • Query: Where are the burglary hotspots? • Dataset Size = 14852 Crime Reports • CrimeStat Libraries’ Response Time = 2 Hrs 30 Minutes

  9. Related Work: Classification SDBMS Tools • Current R-Tree family index structures perform repeated on-the-fly W computation. Computationally expensive! • Our Approach: Pre-computed W (Self-join)

  10. Contributions • Modeled W-Queries • Proposed a set of W-Operations • W-Query Processing Algorithms • Self-join Index • Representation • Algebraic Cost model: Operations • Experimental Evaluation • Experimental Setup • User Response time analysis

  11. Self-Join Index: Representation Key Observations • W-Matrix: Neighborhood Graph or Self-join • Classical Join Index: Edge List • Which representation can localize neighbor, successor and predecessor information? Candidate representations: Edge List; Self-Join Adjacency List Index (W-Matrix ↔ Self-join Neighborhood Graph). The adjacency list LOCALIZES successor, predecessor and row-normalized information; the edge list SCATTERS these.
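
The contrast between the two representations can be sketched in Python. With an edge list, the edges touching one node are scattered across the list, so answering a neighbor query requires a full scan. An adjacency-list record keeps a node's predecessors, successors and row-normalization weight together. The record layout, the function names, and the id-ordering rule for successors and predecessors are illustrative assumptions, not the authors' on-disk format.

```python
from collections import defaultdict


def neighbors_from_edge_list(edge_list, node):
    """Edge list: every edge must be scanned to collect one node's neighbors."""
    found = set()
    for a, b in edge_list:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return sorted(found)


def build_adjacency_index(edge_list):
    """Adjacency list: one record per node holding all of its W information.

    Assumption: successors/predecessors are neighbors with larger/smaller ids,
    and the row-normalized weight of each neighbor is 1 / degree.
    """
    neighbors = defaultdict(set)
    for a, b in edge_list:
        neighbors[a].add(b)
        neighbors[b].add(a)
    index = {}
    for node, nbrs in neighbors.items():
        index[node] = {
            "predecessors": sorted(m for m in nbrs if m < node),
            "successors": sorted(m for m in nbrs if m > node),
            "row_weight": 1.0 / len(nbrs),
        }
    return index


edge_list = [(1, 2), (2, 3), (2, 6), (2, 7), (3, 7)]
index = build_adjacency_index(edge_list)
record_for_2 = index[2]  # a single lookup returns everything about node 2
scanned_neighbors_of_2 = neighbors_from_edge_list(edge_list, 2)
```

Because the dataset is updated infrequently, the one-time cost of building such an index can be amortized over many W-Queries.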

  12. Experimental Evaluation: Experiment Setup [Figure: experiment pipeline with components Dataset Size, Size of the Police Precincts, Self-Join Index Generator, SJALI, W-Query Processing Algorithms, Response Time Analysis] Candidate Algorithms: CalcRipleyK, Hotspots_JI. Candidates • CrimeStat Libraries • R-Tree: Tree Matching • Self-Join Index • Experiment Goals: Compare candidates on response times. Metric of Comparison: Response time. Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date). Hardware: Intel Xeon 3.2 GHz, 4 GB RAM

  13. Baltimore Auto-theft Dataset Baltimore County Auto Thefts from Jan 1996 to Sept 1996: 14852 Crime Reports. Courtesy: Ned Levine and Associates (www.nedlevine.com)

  14. Response Time Analysis: Comparison with R-Tree Questions: • How does the response time of the Hotspot Identification Query vary with dataset size? • How does the response time of the Ripley’s K function Query vary with dataset size? [Figures: response time comparison for K-Function computation; response time comparison for hotspot identification.] Fixed Parameters (Hotspots, K Function): # of max-significance levels = 100; Hotspot min-Size Threshold = 10 Crime Reports. Overall Trend: Self-join Index vs. R-Tree: response time reduced by a factor of 2.

  15. Response Time Analysis: Comparison with CrimeStat Questions: • How does the response time of the Ripley’s K function Query vary with dataset size? • How does the response time of the Hotspot Identification Query vary with dataset size? [Figures: response time comparison for hotspot identification; response time comparison for K-Function computation.] Fixed Parameters (K Function, Hotspots): # of max-significance levels = 100; Hotspot min-Size Threshold = 10 Crime Reports. Overall Trend: Self-join Index vs. CrimeStat: response time reduced by a factor of 40.

  16. Future Work • Experimental Quantification • I/O costs of W-Query Processing Algorithms. • Other W-Queries • Local Moran’s I, Local Getis-Ord. • Larger datasets (>= 100000 crime reports): will R-Tree be comparable? • I/O Cost Models for W-Query Processing Algorithms. • Further I/O Optimization • Extracting optimal page access sequences for processing W-Queries. • Optimizing the number of W-Query operations. Conclusions • W-Queries are important in spatial statistics, e.g. crime analysis, public health, transportation. • W-Operations of W-Queries. • The self-join adjacency list index is more scalable than R-Tree and CrimeStat.

  17. Acknowledgment • Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities. • This work was supported by grants from NSF, USDOD and NIJ. Thank You for your Questions, Comments and Patience!
