Should SDBMS support the Join Index?: A Case study from CrimeStat

Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹
¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu
²Ned Levine and Associates, Houston, TX, Ned@nedlevine.com
³National Institute of Justice, Washington D.C., Ronald.Wilson@usdoj.gov

Outline • Introduction • Motivation • Problem Statement • Related Work • Contributions • Self-Join Index • Experimental Evaluation • Conclusion and Future Work

Motivation
Application Domains
• Crime Analysis: Where are the burglary hotspots?
• Epidemiology: Is cancer spatially clustered?
• Transportation: Which major highways require traffic calming measures?
An Example Query: Where are the burglary hotspots?

W-Matrix and W-Queries
• W-Matrix: matrix form of the Neighborhood Graph.
• WN: Row-Normalized W-Matrix.
• W-Queries: queries that perform a repeated computation of the W-Matrix.
• Examples of W-Queries: Hotspots, K-Function, Moran’s I, Geary’s C, G Statistic.
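The slide does not spell out how the W-Matrix is built, so the following is only a minimal sketch under an assumed definition: two crime reports are neighbors when they lie within a fixed distance of each other. The function names `build_w_matrix` and `row_normalize` and the `radius` parameter are hypothetical and do not come from CrimeStat.

```python
import math

def build_w_matrix(points, radius):
    """Binary W-matrix (assumed definition): w[i][j] = 1 if i != j and the
    two points are at most `radius` apart, else 0."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                w[i][j] = 1.0
    return w

def row_normalize(w):
    """Row-normalized W-matrix (WN): every non-empty row sums to 1."""
    wn = []
    for row in w:
        total = sum(row)
        wn.append([v / total if total else 0.0 for v in row])
    return wn

# Hypothetical coordinates: three nearby reports and one isolated report.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
w = build_w_matrix(points, radius=1.5)
wn = row_normalize(w)
```

A W-Query that rebuilds `w` on every call repeats the quadratic distance scan each time, which is the repeated computation the slide refers to.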

W-Operations
Notion of neighbors, successors and predecessors.
Operations
• Neighbors: get-all-neighbors()
• Successor(s): get-all-successors(), get-a-successor()
• Predecessor(s): get-all-predecessors(), get-a-predecessor()
• Composite: get-all-predecessors-of-a-successor(), get-a-predecessor-of-a-successor()
• Others: Delete()
Example calls
• get-all-neighbors(N2)
• get-all-successors(N2)
• get-all-predecessors(N2)
• get-a-successor(N2, Node-id)
• get-a-predecessor(N2, Node-id)
• get-all-predecessors-of-a-successor(N2, Node-id)
• get-a-predecessor-of-a-successor(N2, Node-id, Node-id)
• Delete(N2, N1, N3)
[Figure: Input / Operation / Output table illustrating each call on a neighborhood graph with nodes N1 to N7]
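The operations listed above can be sketched over a directed neighborhood graph. The slides do not define the exact semantics, so this sketch assumes that successors are the targets of a node's outgoing edges, predecessors are the sources of its incoming edges, and neighbors are the union of both. The class name `NeighborhoodGraph`, the single-argument `delete`, and the choice to exclude the query node from the composite result are all assumptions made for illustration.

```python
class NeighborhoodGraph:
    """Hypothetical in-memory neighborhood graph supporting W-Operations."""

    def __init__(self):
        self.succ = {}  # node -> set of successor nodes
        self.pred = {}  # node -> set of predecessor nodes

    def add_edge(self, u, v):
        # Register both endpoints so every node has (possibly empty) sets.
        self.succ.setdefault(u, set()).add(v)
        self.pred.setdefault(v, set()).add(u)
        self.succ.setdefault(v, set())
        self.pred.setdefault(u, set())

    def get_all_successors(self, n):
        return sorted(self.succ.get(n, set()))

    def get_all_predecessors(self, n):
        return sorted(self.pred.get(n, set()))

    def get_all_neighbors(self, n):
        # Assumption: neighbors = successors union predecessors.
        return sorted(self.succ.get(n, set()) | self.pred.get(n, set()))

    def get_all_predecessors_of_a_successor(self, n, succ_id):
        # Assumption: n itself is excluded from the result.
        if succ_id not in self.succ.get(n, set()):
            return []
        return sorted(self.pred[succ_id] - {n})

    def delete(self, n):
        # Remove n and every edge that touches it.
        for s in self.succ.pop(n, set()):
            self.pred[s].discard(n)
        for p in self.pred.pop(n, set()):
            self.succ[p].discard(n)

g = NeighborhoodGraph()
for u, v in [("N1", "N2"), ("N2", "N3"), ("N4", "N3")]:
    g.add_edge(u, v)
neighbors_of_n2 = g.get_all_neighbors("N2")
```

The single-item variants such as get-a-successor() would follow the same pattern with a membership test instead of returning the whole set.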

W-Query Processing Algorithms
[Figure: K (0 to 3000) versus Distance (Miles), shown against Complete Spatial Randomness]
Algorithm CalcRipleyK
• get-all-neighbors(N)
• Frequency ← Size(get-all-successors(N))
Algorithm Hotspots_JI
• Stage 1: Hotspot Identification
  • Identify a Seed.
  • get-all-neighbors(Seed)
  • get-all-successors(Seed)
• Stage 2: Hotspot Refinement
  • P ← get-a-predecessor-of-a-successor(Seed, succ-id)
  • If P correlates better with the Successor than with the Seed, remove the Successor from the successor list.
• Stage 3: Update Remaining Nodes
  • For each S in Hotspot: Delete(S)
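To illustrate why a pre-computed self-join helps a K-Function query, here is a minimal sketch, not the CalcRipleyK algorithm itself. It assumes one common form of the Ripley's K estimator without edge correction, K(d) = A / n² × (number of ordered pairs within distance d), and a hypothetical index that stores each point's sorted neighbor distances. The names `build_self_join`, `ripley_k`, `d_max` and `area` are illustrative.

```python
import math
from bisect import bisect_right

def build_self_join(points, d_max):
    """Pre-compute, once, the sorted distances from each point to all
    other points within d_max (a stand-in for the self-join index)."""
    index = []
    for i, p in enumerate(points):
        dists = []
        for j, q in enumerate(points):
            if j == i:
                continue
            d = math.dist(p, q)
            if d <= d_max:
                dists.append(d)
        dists.sort()
        index.append(dists)
    return index

def ripley_k(index, area, distances):
    """K(d) = area / n^2 * (ordered pairs within d), no edge correction.
    Each distance bin only needs a binary search per point."""
    n = len(index)
    result = []
    for d in distances:
        pairs = sum(bisect_right(dists, d) for dists in index)
        result.append(area * pairs / (n * n))
    return result

# Hypothetical data: four points on the corners of a unit square,
# inside a study region of assumed area 4.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
index = build_self_join(points, d_max=2.0)
k_values = ripley_k(index, area=4.0, distances=[1.0, 1.5])
```

The distance scan runs once in `build_self_join`; every further distance bin reuses the stored lists instead of recomputing neighborhoods.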

Problem Statement
• Given:
  • A spatial (crime) data warehouse.
  • A set of W-Operations.
• Find:
  • A suitable spatial index type representation.
• Objective:
  • User response time is minimized.
• Constraints:
  • Dataset is updated infrequently.
  • Concurrency control and recovery considerations are addressed separately.
[Figure: Input (Data & W-Operations) and Output. Courtesy: Ned Levine and Associates]

Challenges
• Scalability to Large Datasets
  • Query: Where are the burglary hotspots?
  • Dataset Size = 14852 Crime Reports
  • CrimeStat Libraries’ Response Time = 2 hours 30 minutes

Related Work: Classification
• SDBMS Tools
• Current R-Tree family index structures perform repeated on-the-fly W computation. Computationally expensive!
• Our Approach: Pre-computed W (Self-join)

Contributions
• Modeled W-Queries
• Proposed a set of W-Operations
• W-Query Processing Algorithms
• Self-join Index
  • Representation
  • Algebraic Cost model: Operations
• Experimental Evaluation
  • Experimental Setup
  • User Response time analysis

Self-Join Index: Representation
Key Observations
• W-Matrix: Neighborhood Graph or Self-join
• Classical Join Index: Edge List
• Which representation can localize neighbor, successor and predecessor information?
[Figure: W-Matrix ↔ Self-join ↔ Neighborhood Graph, represented either as an Edge List or as a Self-Join Adjacency List Index]
• Adjacency list LOCALIZES successor, predecessor and row-normalized information.
• Edge list SCATTERS these.
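The contrast between the two representations can be sketched as follows. This is an illustrative in-memory layout, not the on-disk format of the self-join adjacency list index; the entry fields `succ`, `pred` and `wn`, and the equal-weight row normalization, are assumptions.

```python
from collections import defaultdict

def edge_list_successors(edges, n):
    """Edge list: successors of n are scattered, so the whole list is scanned."""
    return [v for (u, v) in edges if u == n]

def build_adjacency_index(edges):
    """Adjacency list: one entry per node holding its successors,
    predecessors and row-normalized weights side by side."""
    index = defaultdict(lambda: {"succ": [], "pred": [], "wn": {}})
    for u, v in edges:
        index[u]["succ"].append(v)
        index[v]["pred"].append(u)
    for entry in index.values():
        k = len(entry["succ"])
        # Assumption: binary weights, so each successor gets weight 1/k.
        entry["wn"] = {v: 1.0 / k for v in entry["succ"]}
    return dict(index)

# Hypothetical neighborhood graph.
edges = [("N1", "N2"), ("N2", "N3"), ("N2", "N4"), ("N4", "N3")]
adj = build_adjacency_index(edges)
n2_entry = adj["N2"]
```

With the adjacency layout a single lookup such as `adj["N2"]` returns everything a W-Operation on N2 needs, whereas the edge list needs one scan for successors and another for predecessors.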

Experimental Evaluation: Experiment Setup
[Figure: experiment setup; labels: Dataset Size, Size of the Police Precincts, Self-Join Index Generator (SJALI), W-Query Processing Algorithms, Response Time Analysis]
• Candidate Algorithms: CalcRipleyK, Hotspots_JI
• Candidates:
  • CrimeStat Libraries
  • R-Tree: Tree Matching
  • Self-Join Index
• Experiment Goals: Compare candidates on response times.
• Metric of Comparison: Response time
• Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date)
• Hardware: Intel Xeon 3.2 GHz, 4 GB RAM

Baltimore Auto-theft Dataset
• Baltimore County auto thefts from Jan 1996 to Sept 1996: 14852 Crime Reports
[Figure: map of crime reports. Courtesy: Ned Levine and Associates (www.nedlevine.com)]

Response Time Analysis: Comparison with R-Tree
Questions:
• How does the response time of the Hotspot Identification Query vary with dataset size?
• How does the response time of the Ripley’s K function Query vary with dataset size?
[Figure: Response time comparison for hotspot identification]
[Figure: Response time comparison for K-Function computation]
Fixed Parameters (Hotspots, K Function):
• # of max-significance levels = 100
• Hotspot min-Size Threshold = 10 Crime Reports
Overall Trend: Self-join Index vs. R-Tree: response time reduced by a factor of 2.

Response Time Analysis: Comparison with CrimeStat
Questions:
• How does the response time of the Ripley’s K function Query vary with dataset size?
• How does the response time of the Hotspot Identification Query vary with dataset size?
[Figure: Response time comparison for hotspot identification]
[Figure: Response time comparison for K-Function computation]
Fixed Parameters (K Function, Hotspots):
• # of max-significance levels = 100
• Hotspot min-Size Threshold = 10 Crime Reports
Overall Trend: Self-join Index vs. CrimeStat: response time reduced by a factor of 40.

Future Work
• Experimental Quantification
  • I/O costs of W-Query Processing Algorithms.
  • Other W-Queries: Local Moran’s I, Local Getis-Ord.
  • Larger datasets of >= 100000 reports: will R-Tree be comparable?
• I/O Cost Models for W-Query Processing Algorithms.
• Further I/O Optimization
  • Extracting optimal page access sequences for processing W-Queries.
  • Optimizing the number of W-Query operations.
Conclusions
• W-Queries are important in Spatial Statistics, e.g. crime analysis, public health, transportation.
• W-Operations of W-Queries.
• Self-join adjacency list index is more scalable than R-Tree and CrimeStat.

Acknowledgment
• Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities.
• This work was supported by grants from NSF, USDOD and NIJ.
Thank You for your Questions, Comments and Patience!
