Discovering Persistent Change Windows in Big Spatiotemporal Datasets: A Summary of Results

Discovering Persistent Change Windows in Big Spatiotemporal Datasets: A Summary of Results. Xun Zhou, Shashi Shekhar, Dev Oliver. 2nd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial 2013), Nov. 5, 2013

Outline • Motivation • Problem Formulation • Challenges • Our Contribution • Novelty • Validation

Motivation (1) • Understanding climate and environmental changes • A global challenge: where, when, how, why? • Deforestation: forest is logged at a certain speed • Desertification: grassland turns into desert • Urban changes: city sprawl, irrigation (vegetation increase) • Detecting changes is an essential step: where and when? [Figures: desertification, deforestation, urban sprawl]

Motivation (2) • Big Data for climate and earth science • Land cover data at various resolutions: MODIS, Landsat, etc. • Help domain scientists find potential regions of interest: desertification, deforestation, urban sprawl, … • Google time lapse: Amazon deforestation [1] • Our goal: find a spatial window and a time period in which the data value (e.g., vegetation cover) changes at a rate at or above a given threshold [Figure: Google time-lapse frames for 1984, 1998, and 2012] [1] Google Earth Engine, https://earthengine.google.org/#intro/

Problem Formulation: Basic Concepts • Spatiotemporal windows • A spatial field S; each location si has a time series of length |T| • Spatial window: a rectangular area in S • ST window: a pair <spatial window Sj, time interval Tj> • Spatially aggregated time series • For a spatial window Sj, $TS_j = \{\sum_{s_i \in S_j} x(s_i, 1), \sum_{s_i \in S_j} x(s_i, 2), \ldots, \sum_{s_i \in S_j} x(s_i, |T|)\}$ • x(si, 1), x(si, 2), … are the values at location si at times 1, 2, …, |T| • SUM can be replaced by AVG, etc. • Average change rate (ACR) • For an ST window with Tj = [t1, tn], $ACR(S_j, T_j) = \frac{TS_j(t_1) - TS_j(t_n)}{TS_j(t_1)\,(t_n - t_1)}$ • Persistent Change Window (PCW): an ST window with ACR ≥ threshold • Example: “Total (average) vegetation cover in an area changes at an average rate of … in a few years”
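The two definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `field[t][row][col]` layout, the 0-based inclusive window tuple, and the function names `aggregated_series` and `acr` are assumptions made for the example. The ACR follows the slide's formula literally, so it is positive when the aggregated value decreases.

```python
def aggregated_series(field, window):
    """SUM-aggregate a spatial window into one time series.

    field[t][row][col] holds x(s_i, t) (assumed layout);
    window = (r0, c0, r1, c1), inclusive and 0-based.
    """
    r0, c0, r1, c1 = window
    return [
        sum(grid[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))
        for grid in field
    ]


def acr(series, t1, tn):
    """Average change rate over [t1, tn], as written on the slide:
    (TS(t1) - TS(tn)) / TS(t1) / (tn - t1)."""
    return (series[t1] - series[tn]) / series[t1] / (tn - t1)


# Toy 2 x 2 field with 3 time steps: total cover goes 100 -> 80 -> 60.
field = [
    [[25, 25], [25, 25]],
    [[20, 20], [20, 20]],
    [[15, 15], [15, 15]],
]
ts = aggregated_series(field, (0, 0, 1, 1))
rate = acr(ts, 0, 2)  # (100 - 60) / 100 / 2
```

With a threshold of 15%, this toy window would qualify as a PCW over the full interval, since its ACR is 20% per time step.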

Problem statement • Given: • A spatial time series with |S| = M x N locations and |T| time steps • A threshold r on the average change rate (ACR) • Minimum window size Smin and minimum time length Tmin • Find: • All ST persistent change windows <Si, Ti> where ACR(Si, Ti) ≥ r • Objective: • Reduce computational cost • Constraints: • |Si| ≥ Smin and |Ti| ≥ Tmin • <Si, Ti> is not a subset of any other qualifying window <S’, T’>, i.e., there is no such window with Si ⊆ S’ and Ti ⊆ T’ • Completeness and correctness

Examples • Threshold: 15%, Smin = 6, Tmin = 2 • Red box (3x3) for T = [1, 4]: ACR = 16.5% • Yellow box (2x4) for T = [3, 4]: ACR = 14.5%, below the threshold • Output: <Red box, [1, 4]> [Figure: example grid with the red and yellow windows]

Challenges • Large number of candidates (big combinatorics) • M^2 x N^2 x T^2 candidate patterns (M x N locations, T time steps) • Patterns lack monotonicity • A temporal pattern may contain non-interesting parts • Sub-regions of a window may be non-interesting • Large datasets • 250 m MODIS tile: 4800 by 4800 pixels and 250 snapshots • Hundreds of such tiles in the dataset • Terabytes of data
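To make the combinatorics concrete, the sketch below counts every axis-aligned ST window before any Smin/Tmin filtering. The function name `count_st_windows` is hypothetical, and the count includes single-cell and single-time-step windows, so it is an upper bound on the candidates the slide refers to.

```python
def count_st_windows(m, n, t):
    """Number of axis-aligned ST windows in an m x n field with t time steps,
    before applying the Smin / Tmin constraints."""
    def intervals(k):
        # Contiguous ranges [a, b] with 1 <= a <= b <= k.
        return k * (k + 1) // 2

    return intervals(m) * intervals(n) * intervals(t)


small = count_st_windows(3, 3, 3)          # 6 * 6 * 6 = 216
tile = count_st_windows(4800, 4800, 250)   # one 250 m MODIS tile
```

Each factor grows quadratically in its dimension, which gives the M^2 x N^2 x T^2 order of growth stated above.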

Contributions • Formulate the Persistent Change Window (PCW) discovery problem • An ST window enumeration and pruning (SWEP) approach • Theoretical analysis: correctness, completeness, and space/time complexity • Case study on MODIS NDVI data • Experiments: scalability w.r.t. data volume and input parameters

Related work • Spatiotemporal change pattern discovery • Time point (CUSUM [2]) or interval [7] in a single time series • Zonal change at a time point (ST scan statistics [3, 4]) • Local/zonal change across a few snapshots (image differencing, object-based change detection [5, 6]) • Other footprints • Persistent zonal change (arbitrarily long interval in a long time series): our work

Baseline solution: Naïve approach • Two-step framework (N, M = sides of the spatial field, T = number of time steps) • Step 1: • Enumerate all pairs of {spatial window, time interval} and generate the aggregated time series for each window • Find interesting intervals for each spatial window and add them to the candidate set • O(N^3 x M^3 x T^3) • Step 2: • For each window-interval pair (S, T) in the candidate set, prune all the pairs that are dominated by it • O(k^2), where k is the total number of candidates from Step 1 • k = O(N^2 x M^2 x T^2) in the worst case • Time complexity: • O((M x N x T)^4) in the worst case
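The two-step baseline can be sketched as a brute-force search. This is an illustrative reading of the slide, not the authors' implementation: the `field[t][row][col]` layout, the 0-based inclusive window tuples, the interpretation of |Ti| as the number of time steps, and the name `naive_pcw` are all assumptions.

```python
from itertools import product


def naive_pcw(field, r, s_min, t_min):
    """Brute-force PCW search over field[t][row][col] (assumed layout).

    Returns maximal windows as (r0, c0, r1, c1, t1, tn), inclusive, 0-based.
    """
    T, M, N = len(field), len(field[0]), len(field[0][0])
    candidates = []
    # Step 1: enumerate every spatial window and every time interval.
    for r0, c0 in product(range(M), range(N)):
        for r1, c1 in product(range(r0, M), range(c0, N)):
            if (r1 - r0 + 1) * (c1 - c0 + 1) < s_min:
                continue
            cells = [(i, j) for i in range(r0, r1 + 1) for j in range(c0, c1 + 1)]
            ts = [sum(grid[i][j] for i, j in cells) for grid in field]
            for t1 in range(T):
                for tn in range(t1 + 1, T):
                    if tn - t1 + 1 < t_min or ts[t1] == 0:
                        continue
                    # ACR as written on the slide (positive for a decrease).
                    if (ts[t1] - ts[tn]) / ts[t1] / (tn - t1) >= r:
                        candidates.append((r0, c0, r1, c1, t1, tn))

    # Step 2: drop every candidate that lies inside another candidate.
    def inside(a, b):
        return (a != b and b[0] <= a[0] and b[1] <= a[1]
                and a[2] <= b[2] and a[3] <= b[3]
                and b[4] <= a[4] and a[5] <= b[5])

    return [w for w in candidates if not any(inside(w, o) for o in candidates)]


# Toy field: every cell declines 25 -> 20 -> 15.
field = [
    [[25, 25], [25, 25]],
    [[20, 20], [20, 20]],
    [[15, 15], [15, 15]],
]
result = naive_pcw(field, r=0.15, s_min=1, t_min=2)
```

On the toy field every sub-window qualifies, and Step 2 keeps only the window covering the whole field over the whole interval, which is the pairwise dominance pruning that costs O(k^2) in the baseline.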

An ST Window Enumeration-and-Pruning approach (SWEP) • Step 0: • Scan all the windows with top-left corner (1,1) and build a lookup table for all spatial windows • O(M x N x T) time cost, O(M x N x T) memory cost • Step 1: • Two-level BFS enumeration of all ST windows • Outer loop: enumerate all the left-bottom-near (LBN) locations, starting from (1,1,1) • Find the enumeration space for the current LBN using the recorded PCWs • Inner loop: enumerate all the valid right-top-far (RTF) locations for each LBN • Record all the PCWs found in this iteration • Step 2: • No refinement step is needed, because no dominated ST window is generated

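Step 0: Window sum lookup table • Each table entry stores the sum over the window from (1,1) to that location • SUM(Target area) = D - B - C + A, where A, B, C, and D are the table entries at the corners of the target area [Figure: target area at T = 4 with corner entries A, B, C, and D]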
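The D - B - C + A identity is the standard summed-area (prefix-sum) table. A minimal sketch follows, assuming a `field[t][row][col]` layout and hypothetical function names; one table is built per time step, which matches the O(M x N x T) time and memory cost stated for Step 0.

```python
def build_prefix_table(field):
    """table[t][i][j] = sum of field[t] over rows < i and columns < j.

    A zero row and column are prepended so no bounds checks are needed.
    """
    table = []
    for grid in field:
        m, n = len(grid), len(grid[0])
        p = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                p[i + 1][j + 1] = grid[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
        table.append(p)
    return table


def window_sum(table, t, r0, c0, r1, c1):
    """Sum over rows r0..r1 and columns c0..c1 (inclusive, 0-based) at time t."""
    p = table[t]
    d = p[r1 + 1][c1 + 1]   # D: everything up to the far corner
    b = p[r0][c1 + 1]       # B: strip above the target area
    c = p[r1 + 1][c0]       # C: strip left of the target area
    a = p[r0][c0]           # A: corner block subtracted twice, added back
    return d - b - c + a


field = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
table = build_prefix_table(field)
total = window_sum(table, 0, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
```

After the one-time scan, any spatial window sum costs four lookups, so the aggregated time series of a window can be read in O(T) instead of being re-summed cell by cell.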

Step 1: Two-level Enumeration (1) • Enumerate 3-D ST windows in the dataset using two corner locations • BFS on the left-bottom-near (LBN) and right-top-far (RTF) locations • Avoid visiting dominated ST windows • Challenge: record discovered PCWs for later pruning • For each LBN, record the discovered PCWs • For later LBNs, skip RTF locations inside these PCWs • Example: W1 = <LBN1, RTF1> is a PCW; for LBN2, we do not need to test RTFs inside W1 [Figures: LBN and RTF representation of a window; enumeration of LBN locations; enumeration of RTFs for each LBN]
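The skip rule can be illustrated with a simple containment test on the two corners. The sketch below is a brute-force filter under assumed 1-based (x, y, t) coordinates and hypothetical names (`contains`, `valid_rtfs`); it shows which RTFs survive for a given LBN, not the BFS order or the bookkeeping that SWEP uses.

```python
from itertools import product


def contains(pcw, lbn, rtf):
    """True if the window <lbn, rtf> lies inside the recorded PCW."""
    pcw_lbn, pcw_rtf = pcw
    return (all(a <= b for a, b in zip(pcw_lbn, lbn))
            and all(a <= b for a, b in zip(rtf, pcw_rtf)))


def valid_rtfs(lbn, shape, pcws):
    """RTF corners for lbn whose window is not dominated by a recorded PCW.

    shape = (max x, max y, max t); coordinates are 1-based as on the slides.
    """
    ranges = [range(lo, hi + 1) for lo, hi in zip(lbn, shape)]
    return [rtf for rtf in product(*ranges)
            if not any(contains(w, lbn, rtf) for w in pcws)]


# Suppose W1 = <(1,1,1), (2,2,3)> was already found in a 3 x 3 x 3 dataset.
pcws = [((1, 1, 1), (2, 2, 3))]
remaining = valid_rtfs((2, 2, 1), (3, 3, 3), pcws)
```

For the LBN (2,2,1), which lies inside W1, the three RTFs (2,2,1), (2,2,2), and (2,2,3) are skipped because the resulting windows would be dominated by W1.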

Step 1: Two-level enumeration (2) • A six-dimensional enumeration space [Figure: lattice of ST windows written as <LBN, RTF> for a 3 x 3 x 3 dataset. The root is <(1,1,1), (3,3,3)>; its six children are <(2,1,1), (3,3,3)>, <(1,2,1), (3,3,3)>, <(1,1,2), (3,3,3)>, <(1,1,1), (2,3,3)>, <(1,1,1), (3,2,3)>, and <(1,1,1), (3,3,2)>; deeper levels shrink the window further, one coordinate at a time]
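The lattice on this slide can be generated by shrinking a window one step at a time, either by advancing one LBN coordinate or by pulling back one RTF coordinate. The following sketch, with hypothetical helper names `children` and `bfs_windows`, enumerates the whole six-dimensional space breadth-first without any pruning, so it only illustrates the search space, not SWEP itself.

```python
from collections import deque


def children(window):
    """Windows obtained by shrinking <lbn, rtf> by one step in one coordinate."""
    lbn, rtf = window
    out = []
    for k in range(3):
        if lbn[k] < rtf[k]:
            moved = list(lbn)
            moved[k] += 1
            out.append((tuple(moved), rtf))
    for k in range(3):
        if lbn[k] < rtf[k]:
            moved = list(rtf)
            moved[k] -= 1
            out.append((lbn, tuple(moved)))
    return out


def bfs_windows(shape):
    """Breadth-first order over all ST windows, largest window first."""
    root = ((1, 1, 1), tuple(shape))
    seen = {root}
    queue = deque([root])
    order = []
    while queue:
        window = queue.popleft()
        order.append(window)
        for child in children(window):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


order = bfs_windows((3, 3, 3))
first_level = children(((1, 1, 1), (3, 3, 3)))
```

Visiting large windows first is what lets a discovered PCW rule out all of its descendants before they are ever tested.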

Evaluations • Theoretical • Correct • Complete • Time & space complexity • Case study • Land cover data: MODIS 250m NDVI Data • Experimental Evaluation • Change data volume (with fixed time length) • Change data volume (with fixed number of locations) • Change the location of pattern in the search space

Theoretical analysis • The SWEP algorithm is correct • The SWEP algorithm is complete • Space/time complexity (k = M x N x T), worst scenario vs. best scenario, Naive vs. SWEP [Table; entries as extracted: O(k^4), O(k^3), O(k^2), O(k), O(k^3), O(k), O(k^4), O(k^2)]

Case study • Initial results • MODIS 250 m NDVI data (16-day) • Time: 2000-2012, one annual snapshot (July 27/28 of each year) • Results of the proposed algorithm with average change rate ≥ 10% (outlined window): an annual increase of 11.5%, 2001-2012 • Irrigation in Saudi Arabia, shown by Google time lapse [1] [Figures: study area; NDVI maps for 2001, 2006, and 2012 with the outlined window; average NDVI in the outlined window; Google time-lapse frames for 2001, 2006, and 2012]

Experiments • Questions: • What is the impact of the data volume on run-time? • What is the impact of the pattern size on run-time? • Synthetic data • Data volume (area size, time length) • Pattern size (pattern volume ratio, PVR) • PVR = max pattern volume / ST data volume • Settings: • MATLAB 2013 under Linux • HP ProLiant BL280c G6 blade servers, with a quad-core 2.8 GHz Intel Xeon X5560 processor and 24 GB shared memory

Impact of Varying Dataset Size • Fixed PVR = 0.1 (worst case), varying data volume with fixed T = 20 • Fixed PVR = 0.95 (best case), varying data volume with fixed T = 20 • Fixed PVR = 0.1 (worst case), varying data volume with fixed |S| = 2500 • Fixed PVR = 0.95 (best case), varying data volume with fixed |S| = 2500 [Figures: four run-time plots, one per setting; x-axis: Data Volume (# values), from 2,000 to 50,000 for fixed T = 20 and from 25,000 to 125,000 for fixed |S| = 2500]

Impact of Varying Pattern Size • Fixed M = N = T, 125,000 total data points, varying PVR from 0.1 (worst case) to 1 (best case) • Summary: SWEP is orders of magnitude faster than the naïve algorithm as (1) the data volume and (2) the pattern size vary.

Conclusion and Future work • The PCW discovery problem is defined • A space-time window enumeration and pruning (SWEP) approach is proposed to mine PCW patterns • Correct, complete, and faster than the naïve approach • The case study preliminarily shows usefulness • Future work • Accelerate the approach using parallel computing (e.g., CUDA) • Improve the SWEP algorithm (e.g., multi-resolution enumeration) • More case studies on remote sensing datasets (e.g., Amazon deforestation) to compare with known results (e.g., Google time lapse)

Acknowledgements & References • Acknowledgements • NSF and USDOD for funding projects • Minnesota Supercomputing Institute (MSI) • Spatial DB & DM group @ UMN • References
[1] Google Earth Engine, https://earthengine.google.org/#intro/
[2] M. Basseville and I. V. Nikiforov. Detection of abrupt changes: theory and applications. Journal of the Royal Statistical Society: Series A (Statistics in Society), 158(1):185, 1995.
[3] M. Kulldorff. A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6):1481-1496, 1997.
[4] M. Kulldorff. Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164(1):61-72, 2001.
[5] P. Coppin et al. Review article: Digital change detection methods in ecosystem monitoring: a review. International Journal of Remote Sensing, 25(9):1565-1596, 2004.
[6] A. Singh. Review article: Digital change detection techniques using remotely-sensed data. International Journal of Remote Sensing, 10(6):989-1003, 1989.
[7] X. Zhou, S. Shekhar, P. Mohan, S. Liess, and P. K. Snyder. Discovering interesting sub-paths in spatiotemporal datasets: A summary of results. In 19th ACM SIGSPATIAL GIS, pages 44-53. ACM, 2011.

Step 1: Two-level enumeration (2) • Find the space to enumerate in each round • Skip any location that falls into the union of existing PCWs • “Covered space” of an LBN: the minimum set of RTF locations to traverse for that LBN • The covered space of an LBN is a subset of the covered space of each of its predecessors • The space to traverse for each LBN is the intersection of the covered spaces of all its direct parents [proof] [Figure: a PCW with corners labeled (1,1,1) and (2,2,3) in a space extending to (3,3,3)]

Step 1: Two-level enumeration (3) • Record the traversal space of each LBN location • Intersection of the covered spaces of all the parents • Store all covered spaces as a list of 3-D Boolean maps • Use a pointer array to link each LBN with a map • Merge duplicate maps [Figure: pointer array LBN1, LBN2, LBN3, LBN4 linked to the list of covered spaces Map1, Map2, Map3, …, Map k]
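The bookkeeping described here (a list of maps, a pointer per LBN, and merging of duplicates) could look roughly like the sketch below. It is an assumption-laden illustration: each 3-D Boolean map is stood in for by a `frozenset` of RTF locations, and the class `CoveredSpaceStore` and function `traversal_space` are hypothetical names, not part of the original work.

```python
class CoveredSpaceStore:
    """Keeps one copy of each distinct covered-space map and a pointer per LBN."""

    def __init__(self):
        self.maps = []       # list of distinct maps (frozensets of RTF locations)
        self._index = {}     # map -> position in self.maps
        self.pointer = {}    # LBN -> position in self.maps

    def assign(self, lbn, covered):
        key = frozenset(covered)
        if key not in self._index:       # merge duplicates: store each map once
            self._index[key] = len(self.maps)
            self.maps.append(key)
        self.pointer[lbn] = self._index[key]

    def lookup(self, lbn):
        return self.maps[self.pointer[lbn]]


def traversal_space(store, parents):
    """Intersection of the covered spaces of all direct parents of an LBN."""
    spaces = [store.lookup(p) for p in parents]
    return frozenset.intersection(*spaces)


store = CoveredSpaceStore()
store.assign((1, 1, 1), {(3, 3, 3), (2, 3, 3)})
store.assign((2, 1, 1), {(2, 3, 3), (3, 3, 3)})   # same map, shared storage
store.assign((1, 2, 1), {(3, 3, 3)})
space = traversal_space(store, [(2, 1, 1), (1, 2, 1)])
```

Sharing identical maps between LBNs keeps memory proportional to the number of distinct covered spaces rather than the number of LBN locations.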

