Cascading spatio-temporal pattern discovery: A summary of results
Pradeep Mohan¹, Shashi Shekhar¹, James A. Shine², James P. Rogers²
¹University of Minnesota, Twin-Cities, {mohan,shekhar}@cs.umn.edu
²Engineering Research and Development Center, Alexandria, VA, {James.A.Shine, James.P.Rogers.II}@usace.army.mil
Outline
• Introduction
• Motivation
• Problem Statement
• Related Work
• Contributions
• Interest Measure
• CSTP Miner Algorithm
• Evaluation and Case Study
• Conclusion and Future Work
Motivation: Public Safety
[Figure: maps of event instances at time slots T1, T2, T3 and their aggregate, Aggregate(T1,T2,T3). Event types: Assault (A), Bar Closing (B), Drunk Driving (C). A diagram relates Bar Closing, Assault, and Drunk Driving as an example pattern.]
Cascading spatio-temporal pattern (CSTP)
• Partially ordered subsets of ST event types.
• Located together in space.
• Occur in stages over time.
Stages: Bar Closing, Assault, Drunk Driving, Hurricane, Climate change, etc.
Other applications: Climate change, epidemiology, evacuation planning.
Problem Definition
• Input: a) ST framework, b) directed ST neighbor relation R, c) interest measure threshold
• Output: A set of CSTPs with interestingness >= threshold
• Objective: Minimize computation costs while discovering statistically meaningful CSTPs.
• Constraints: Correctness and completeness
Example: ST Join (R) with R = {0.5 miles, 2 min.} and threshold = 0.5
[Figure: Aggregate(T1,T2,T3) event map and the resulting pattern over event types A, B, and C.]
Challenges and Contributions
Challenges
• Space and time are continuous
  • Many overlapping ST neighborhoods
  • Neighborhood enumeration is computationally challenging
• Conflicting requirements
  • Ex., statistical interpretation vs. computational scalability
• Exponential candidate space
  • Ex., candidate CSTPs are exponential in the number of event types
Contributions
• Interest measures
  • Statistical interpretation
  • Computational structure
• CSTP Miner Algorithm
  • Filtering strategies
• Evaluation
  • Experimental evaluation
  • Case study
Limitations of Related Work: ST Data Mining
Limitations
• [ST Co-occurrence]
  • Treats space and time independently.
  • Absence of partial order.
• [ST Sequence]
  • Does not account for multiply connected (e.g., non-linear) patterns.
  • Misses non-linear semantics.
  • No ST statistical interpretation.
Interest Measures
• Cascade Participation Ratio (CPR): conditional probability of observing an instance of the CSTP, having seen an instance of A.
• Cascade Participation Index (CPI): lower bound on the conditional probability of observing an instance of the CSTP, having seen an instance of A, B, or C.
[Figure: Aggregate(T1,T2,T3) event map with instances A.1 to A.4, B.1 to B.2, and C.1 to C.4, and the example CSTP over A, B, and C.]
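The two measures can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the pattern instances have already been enumerated, and it takes CPI to be the minimum CPR over the pattern's event types, which is one reading of the "lower bound" wording. The function names and the listed instances are hypothetical; only the event counts (4 of A, 2 of B, 4 of C) follow the figure.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]        # (event type, instance number), e.g. ("A", 1)
Instance = Tuple[Event, ...]   # one enumerated instance of the pattern


def cascade_participation_ratio(instances: Iterable[Instance],
                                event_type: str,
                                total_count: int) -> float:
    """Fraction of all events of `event_type` that appear in some instance."""
    participating = {ev for inst in instances for ev in inst
                     if ev[0] == event_type}
    return len(participating) / total_count


def cascade_participation_index(instances: Iterable[Instance],
                                totals: Dict[str, int]) -> float:
    """Minimum CPR over the pattern's event types (assumed definition)."""
    instances = list(instances)
    return min(cascade_participation_ratio(instances, etype, count)
               for etype, count in totals.items())


# Hypothetical enumerated instances of a pattern over B, A, and C.
pattern_instances = [
    (("B", 1), ("A", 1), ("C", 1)),
    (("B", 1), ("A", 1), ("C", 2)),
    (("B", 2), ("A", 3), ("C", 4)),
]
event_totals = {"A": 4, "B": 2, "C": 4}

cpr_a = cascade_participation_ratio(pattern_instances, "A", event_totals["A"])
cpi = cascade_participation_index(pattern_instances, event_totals)
print(cpr_a, cpi)  # 0.5 0.5
```

With a threshold of 0.5, this hypothetical pattern would be reported as prevalent because its CPI is not below the threshold.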
Interest Measures: Statistical Interpretation
Spatial statistics: ST K-Function (Diggle et al. 1995)
• The Cascade Participation Index (CPI) is an upper bound to the ST K-Function.
[Figure: example plotted over the X axis, Y axis, and time axis.]
CSTP Miner Algorithm: Overview
Inputs: R, CPI threshold, filtering choice
Flow: Upper bound filter → Candidate generation* → Cycle checking (cycles removed) → Multi-resolution filter (pruned CSTPs) → Compute CPI → Prune CSTPs → Prevalent CSTPs
• CPI computation involves ST Join.
• ST Join
  • Sort-merge over time
  • Nested loop over space
• Computational bottleneck!
*using the same strategy as [Kuramochi and Karypis ’04]
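The ST Join described above (sort-merge over time, nested loop over space) might look like the following Python sketch. It is an illustration under stated assumptions, not the CSTP Miner code: the "directed" relation is assumed to mean that the target event occurs strictly after the source event, distance is Euclidean, and the event tuples and the name `st_join` are hypothetical.

```python
import math
from typing import List, Tuple

Event = Tuple[str, float, float, float]  # (id, x in miles, y in miles, t in minutes)


def st_join(sources: List[Event], targets: List[Event],
            max_dist: float, max_lag: float) -> List[Tuple[str, str]]:
    """Pair each source with targets that follow it within max_lag and max_dist."""
    src = sorted(sources, key=lambda e: e[3])   # sort both sides by time
    tgt = sorted(targets, key=lambda e: e[3])
    pairs = []
    start = 0                                   # left edge of the time window
    for s in src:
        # Merge step: skip targets that do not occur strictly after s.
        while start < len(tgt) and tgt[start][3] <= s[3]:
            start += 1
        j = start
        # Nested loop over space, restricted to the time window.
        while j < len(tgt) and tgt[j][3] - s[3] <= max_lag:
            t = tgt[j]
            if math.hypot(t[1] - s[1], t[2] - s[2]) <= max_dist:
                pairs.append((s[0], t[0]))
            j += 1
    return pairs


# Hypothetical events; R = {0.5 miles, 2 min.} as in the problem definition example.
bar_closings = [("B.1", 0.0, 0.0, 0.0), ("B.2", 5.0, 5.0, 10.0)]
assaults = [("A.1", 0.3, 0.0, 1.0), ("A.2", 0.1, 0.1, 5.0),
            ("A.3", 5.2, 5.0, 11.5), ("A.4", 3.0, 3.0, 0.5)]

joined = st_join(bar_closings, assaults, max_dist=0.5, max_lag=2.0)
print(joined)  # [('B.1', 'A.1'), ('B.2', 'A.3')]
```

Sorting lets the time window advance monotonically, so only the spatial check is quadratic within each window; that spatial loop is where the bottleneck noted on the slide would show up.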
Filtering strategies
• Enhance savings: filter non-prevalent CSTPs before CPI computation
• Before candidate generation: Upper bound (UB) filter
  Key idea
  • CPI has an anti-monotone upper bound.
• After candidate generation: Multi-resolution ST (MST) filter
  Key idea
  • There exists a low-dimensional embedding in space and time.
  • Overestimate CPI by coarsening the ST dataset.
  • If Overestimate(CPI) < Threshold: pruned
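The coarsening idea can be sketched in Python for a single-edge pattern (source type followed by target type). This is a hedged illustration, not the MST filter from the paper: it assumes a uniform grid whose spatial cell size is at least the distance bound and whose time slot is at least the time bound, so any true neighbor pair falls in the same or an adjacent cell and the coarse ratio cannot underestimate the true one. All names and data are hypothetical.

```python
import math
from itertools import product
from typing import List, Set, Tuple

Event = Tuple[str, float, float, float]  # (id, x, y, t)
Cell = Tuple[int, int, int]


def cell_of(event: Event, cell_size: float, time_size: float) -> Cell:
    """Map an event to its coarse (x, y, t) grid cell."""
    _, x, y, t = event
    return (math.floor(x / cell_size), math.floor(y / cell_size),
            math.floor(t / time_size))


def may_participate(cell: Cell, other_cells: Set[Cell],
                    time_steps: Tuple[int, int]) -> bool:
    """True if some occupied cell of the other type is adjacent in space and time."""
    i, j, k = cell
    return any((i + di, j + dj, k + dk) in other_cells
               for di, dj in product((-1, 0, 1), repeat=2)
               for dk in time_steps)


def coarse_cpi_upper_bound(sources: List[Event], targets: List[Event],
                           cell_size: float, time_size: float) -> float:
    """Overestimate of CPI for the pattern source -> target on the coarse grid."""
    src_cells = {cell_of(e, cell_size, time_size) for e in sources}
    tgt_cells = {cell_of(e, cell_size, time_size) for e in targets}
    # A target follows its source, so it lies in the same or the next time slot.
    src_hits = sum(may_participate(cell_of(e, cell_size, time_size), tgt_cells, (0, 1))
                   for e in sources)
    tgt_hits = sum(may_participate(cell_of(e, cell_size, time_size), src_cells, (0, -1))
                   for e in targets)
    return min(src_hits / len(sources), tgt_hits / len(targets))


# Hypothetical events; grid sized to a neighborhood of 0.5 miles and 2 minutes.
bar_closings = [("B.1", 0.0, 0.0, 0.0), ("B.2", 5.0, 5.0, 10.0)]
assaults = [("A.1", 0.3, 0.0, 1.0), ("A.2", 0.1, 0.1, 5.0),
            ("A.3", 5.2, 5.0, 11.5), ("A.4", 3.0, 3.0, 0.5)]

bound = coarse_cpi_upper_bound(bar_closings, assaults, cell_size=0.5, time_size=2.0)
threshold = 0.6
pruned = bound < threshold   # prune without running the exact ST join
print(bound, pruned)  # 0.5 True
```

Because the check only touches occupied cells, a candidate whose overestimate is already below the threshold can be discarded before the exact, more expensive CPI computation.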
Evaluation Goals
a. What is the effect of # event types on execution time?
b. What is the effect of CPI threshold?
c. Other experiments: effect of neighborhood size, dataset size, grid parameters
• Real dataset: City of Lincoln, Nebraska, year 2007
• Matlab 7.0, X5355 2.66 GHz with 16 GB main memory and Linux OS
• Events within an interval of 10 minutes were assigned the same time stamp.
Experimental Analysis
Questions
a. What is the effect of # event types?
   Fixed parameters: CPI threshold = 0.2; time neighborhood = 1750 time stamps.
b. What is the effect of CPI threshold?
   Fixed parameters: # of event types = 5; time neighborhood = 1750 time stamps.
Trends
a. Pattern size is exponential in the number of event types.
b. MST filter enhances computational savings.
Lincoln, NE crime dataset: Case study
• Is bar closing a generator for crime-related CSTPs?
[Figure: bar locations in Lincoln, NE]
• Observation: Crime peaks around bar closing!
Questions
• Is bar closing a crime generator?
• Are there other generators (e.g., Saturday nights)?
K-S test: Saturday night is significantly different from normal-day bar closing (p-value = 1.249 × 10⁻⁷, K = 0.41)
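For readers unfamiliar with the K-S test, the two-sample statistic is the largest gap between the two empirical distribution functions. The pure-Python sketch below computes only that statistic; the samples are made up for illustration and are not the Lincoln data, and the p-value computation is left out.

```python
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


# Hypothetical samples, e.g. incident counts per time slot on two kinds of nights.
saturday_nights = [1, 2, 3, 4]
normal_nights = [3, 4, 5, 6]
print(ks_statistic(saturday_nights, normal_nights))  # 0.5
```

Once one sample is exhausted its empirical CDF is already 1, so the gap can only shrink afterwards and the loop can stop.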
Conclusions
• Cascading ST patterns are useful in applications like public safety and climate change science.
• ST multi-resolution filtering enhances computational performance.
• Complementary filtering strategies.
• Statistically interpretable interest measure.
Future work
• New interest measure alternatives.
• Qualitative comparison with graphical models (e.g., Dynamic Bayes Nets, Hidden Markov Models, etc.)
Acknowledgment
• Members of the Spatial Database and Data Mining Research Group, University of Minnesota, Twin-Cities.
• This work was supported by grants from the US Army and NSF.
Thank You for your Questions, Comments and Patience!
Crime Report Schema Alignment
University of Texas at Dallas
Overview
• Two different tables from two different data sources: Washington DC Incidents Reported and Lincoln_Nebraska Incidents Reported.
• Our goal is to align attributes between the two tables.
Heterogeneity
[ER diagrams of the Washington DC and Lincoln datasets: Incident_2007_reported is connected by "located" relationships to Football Match and Bars in both; crime appears as Crime in Washington DC and as Crime_type in Lincoln.]
• Crime is an attribute in the Washington DC dataset, while it is a table in the Lincoln dataset.
Schema Alignment
• Syntactic Matching: keyword-based matching on crime name
  • Lincoln.CrimeType.IncidentClassification = “Robbery”
  • Washington.Crime = “Robbery”
• Semantic Matching: semantically relevant
  A. Specialization vs. Generalization
  • Lincoln.CrimeType.IncidentClassification = “Death”
  • Washington.Crime = “Homicide”
  • Death is a superclass of Homicide
  B. Finding Semantic Matching
  • Definition of crimes: using shared words to determine similarity
  • Relevant words: find relevant words using K-medoid clustering and Normalized Google Distance (NGD)*
* Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Geographically-Typed Semantic Schema Matching,” In Proc. of ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2009), Seattle, Washington, USA, November 2009. Extended Version Submitted to Journal of Web Semantics, Springer.
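As background for the NGD step, the standard Normalized Google Distance formula can be written as a short Python function. The hit counts and index size below are made-up numbers chosen to make the arithmetic easy to follow; how the counts are obtained (which search engine, which query form) is not specified on the slide and is left out.

```python
import math


def normalized_google_distance(hits_x: int, hits_y: int,
                               hits_xy: int, total_pages: int) -> float:
    """NGD(x, y) from page counts for x, y, both together, and the index size."""
    if min(hits_x, hits_y, hits_xy) <= 0:
        return math.inf   # terms that never co-occur are treated as unrelated
    log_x, log_y, log_xy = math.log(hits_x), math.log(hits_y), math.log(hits_xy)
    return ((max(log_x, log_y) - log_xy)
            / (math.log(total_pages) - min(log_x, log_y)))


# Hypothetical counts: two terms that always co-occur, and two that rarely do.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
rarely_together = normalized_google_distance(1000, 1000, 10, 10**6)
print(always_together, round(rarely_together, 3))  # 0.0 0.667
```

Smaller values mean the two keywords tend to appear on the same pages, which is what lets a clustering step group terms such as crime names together.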
I. Finding Semantic Matching using Definition of Crime
• Finding shared words to determine similarity
• Larceny-Theft: Unlawful taking, carrying, leading, or riding away of property from the possession or constructive possession of another; attempts to do these acts are included in the definition. [1]
• Theft: Illegal taking of another person's property without that person's freely-given consent. [2]
• Assault: An act that causes another to apprehend an immediate harmful contact. [3]
• Red keywords are common words in crime definitions, while blue keywords are not common.
[1] http://www.fbi.gov/ucr/cius_04/offenses_reported/property_crime/larceny-theft.html
[2] http://en.wikipedia.org/wiki/Theft
[3] http://en.wikipedia.org/wiki/Assult
II. K-medoid + NGD Instance Similarity
Step 1: Extract distinct keywords from the compared columns C1 (Lincoln) and C2 (Washington DC).
  Keywords extracted from columns = {Arson, Theft, Stolen, …}
Step 2: Group distinct keywords together into semantic clusters over C1 ∪ C2.
  Example groups: “Arson”, “Theft”, “Burglary”, ….; “Arson”, “Theft”, “Northwest”, …. (legend: Column 1, Column 2)
Step 3: Calculate similarity.
  Similarity = H(C|T) / H(C)
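Step 3 can be illustrated with a small Python sketch. The slide does not define C and T, so the code assumes that C is the column a keyword occurrence comes from (C1 or C2) and T is the semantic cluster it was assigned to in Step 2; under that reading, a ratio near 1 means the clusters mix both columns evenly and a ratio near 0 means the columns fall into separate clusters. The cluster assignments are hard-coded, hypothetical stand-ins for the K-medoid + NGD output.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy in bits of the distribution given by raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts)


def entropy_similarity(assignments: List[Tuple[str, str]]) -> float:
    """H(C|T) / H(C) for (column label, cluster label) pairs (assumed reading)."""
    h_c = entropy(Counter(col for col, _ in assignments).values())
    if h_c == 0:
        return 0.0   # only one column present, nothing to compare
    per_cluster = {}
    for col, cluster in assignments:
        per_cluster.setdefault(cluster, Counter())[col] += 1
    n = len(assignments)
    h_c_given_t = sum(sum(cnt.values()) / n * entropy(cnt.values())
                      for cnt in per_cluster.values())
    return h_c_given_t / h_c


# Hypothetical cluster assignments for keywords drawn from the two columns.
mixed = [("C1", "crime"), ("C2", "crime"), ("C1", "crime"), ("C2", "crime")]
separated = [("C1", "crime"), ("C1", "crime"), ("C2", "place"), ("C2", "place")]
print(entropy_similarity(mixed), entropy_similarity(separated))  # 1.0 0.0
```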