Efficient Algorithms for Large-Scale GIS Applications

Efficient Algorithms for Large-Scale GIS Applications Laura Toma Duke University

Why GIS? • How it all started.. • Duke Environmental researchers: • computing flow accumulation for Appalachian Mountains took 14 days (with 512MB memory) • 800km x 800km at 100m resolution  ~64 million points • GIS (Geographic Information Systems) • System that handles spatial data • Visualization, processing, queries, analysis • Indispensable tool • Modeling, analysis, prediction, decision making • Rich area of problems for Computer Science • Graphics, graph theory, computational geometry etc

Monitoring:keep an eye on the state of earth systems using satellites and monitoring stations (water, ecosystems, urban development) Modeling, simulation: predict consequences of human actions and natural processes Analysis and risk assessment:find the problem areas and analyse the possible causes (soil erosion, groundwater pollution, traffic jams…) Planning and decision support: provide information and tools for better management of natural and socio-economic resources GIS and the Environment

Precipitation in Tropical South America Lots of rain Dry H. Mitasova

Nitrogen in Chesapeake Bay High nitrogen concentrations H. Mitasova

Jockey’s Ridge evolution Combining IR-DOQQ, LIDAR and RTK GPS to assess the change: decreasing elevation, extending towards homes and a road C A B N H. Mitasova

Bald Head Island Renourishment 1998: LIDAR shoreline 1998 2000: LIDAR shoreline 2000 2001, Dec.: RTK GPS shoreline surface is 1998 LIDAR H. Mitasova

Sediment flow H. Mitasova

Reality: Height of terrain is a continuous function of two variables f(x,y) Estimate, predict, simulate Flooding, pollution Erosion, deposition Vegetation structure …. GIS: DEM (Digital Elevation Model) is a set of sample points and theirheights { x, y, hxy} Computations on Terrains Compute indices

3 3 3 3 2 2 2 2 4 4 4 4 7 7 7 7 5 5 5 5 8 8 8 8 7 7 7 7 1 1 1 1 9 9 9 9 DEM Representations TIN Grid Contour lines Sample points

Panama DEM

Modeling Flow on Terrains • What happens when it rains? • Predict areas susceptible to floods. • Predict location of streams. • Compute watersheds. • Flow is modeled using two basic attributes • Flow Direction (FD) • The direction water flows at a point • Flow Accumulation (FA) • Total amount of water that flows through a point (if water is distributed according to the flow directions)

Panama DEM - Flow Accumulation

Uses • Flow direction and flow accumulationare used for: • Computing other hydrological attributes • river network • moisture indices • watersheds and watershed divides • Analysis and prediction of sediment and pollutant movement in landscapes. • Decision support in land management, flood and pollution prevention and disaster management

Remote sensing technology Massive amounts of terrain data Higher resolutions (1km, 100m, 30m, 10m, 1m,…) NASA-SRTM Mission launched in 2001 Acquired data for 80% of earth at 30m resolution 5TB USGS Most of US at 10m resolution LIDAR 1m res Massive Terrain Data

Example: LIDAR Terrain Data • Massive (irregular) point sets (1-10m resolution) • Relatively cheap and easy to collect Example: Jockey’s ridge (NC coast)

It’s Growing! • Appalachian Mountains Area if approx. 800 km x 800 km Sampled at: • 100m resolution:  64 million points (128MB) • 30m resolution:  640 (1.2GB) • 10m resolution:  6400 = 6.4 billion (12GB) • 1m resolution:  600.4 billion (1.2TB)

Computing on Massive Data • GRASS (open source GIS) • Killed after running for 17 days on a 6700 x 4300 grid (approx 50 MB dataset) • TARDEM (research, U. Utah) • Killed after running for 20 days on a 12000 x 10000 grid (appox 240MB dataset) • CPU utilization5%, 3GB swap file • ArcInfo (ESRI, commercial GIS) • Can handle the 240MB dataset • Doesn’t work for datasets bigger than 2GB

Outline • Introduction • Flow direction and flow accumulation • Definitions, assumptions, algorithm outline. • Scalability to large terrains • Why not? • I/O-efficient algorithms • I/O-efficient flow accumulation • TerraFlow • Theoretical results • Conclusion

Flow Direction (FD) on Grids • Water flows downhill • follows the gradient • On grids: Approximated using 3x3 neighborhood • SFD (Single-Flow Direction): • FD points to the steepest downslope neighbor • MFD (Multiple-Flow direction) : • FD points to all downslope neighbors

Flow accumulation with MFD

Flow accumulation with SFD

Computing FD • Goal: compute FD for every cell in the grid (FD grid) • Algorithm: • For each cell compute SFD/MFD by inspecting 8 neighbor cells • Analysis: O(N) time for a grid of N cells • Is this all? • NO! flat areas: Plateas and sinks

FD on Flat Areas • …no obvious flow direction • Plateaus • Assign flow directions such that each cell flows towards the nearest spill point of the plateau • Sinks • Either catch the water inside the sink • Or route the water outside the sink using uphill flow directions • model steady state of water and remove (fill) sinks by simulating flooding: uniformly pouring water on terrain until steady state is reached • Assign uphill flow directions on the original terrain by assigning downhill flow directions on the flooded terrain

Flow Accumulation (FA) on Grids FA models water flow through each cell with “uniform rain” • Initially one unit of water in each cell • Water distributed from each cell to neighbors pointed to by its FD • Flow conservation: If several FD, distribute proportionally to height difference • Flow accumulation of cell is total flow through it Goal: compute FA for every cell in the grid (FA grid)

Computing FA • FD graph: • node for each cell • (directed) edge from cell a to b if FD of a points to b • FD graph must be acyclic • ok on slopes, be careful on plateaus • FD graph depends on the FD method used • SFD graph: a tree (or a set of trees) • MFD graph: a DAG (or a set of DAGs)

Computing FA: Plane Sweeping • Input: flow direction grid FD • Output: flow accumulation grid FA (initialized to 1) • Process cells in topological order. For each cell: • Read its flow from FA grid and its direction from FD grid • Update flow for downslope neighbors (all neighbors pointed to by cell flow direction) • Correctness • One sweep enough • Analysis • O(sort) + O(N) time for a grid of N cells • Note: Topological order means decreasing height order (since water flows downhill).

Dataset Size (log) Scalability Problem • We can compute FD and FA using simple O(N)-time algorithms • ..but.. for large sets..??

read/write head read/write arm track magnetic surface Scalability Problem: Why? • Most (GIS) programs assume data fits in memory • minimize only CPU computation • But.. Massive data does not fit in main memory! • OS places data on disk and moves data between memory and disk as needed • Disk systems try to amortize large access time by transferring large contiguous blocks of data • When processing massive data disk I/O is the bottleneck, rather than CPU time!

Disks are Slow “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Scalability to Large Data • Example: reading an array from disk • Array size N = 10 elements • Disk block size = 2 elements • Memory size = 4 elements (2 blocks) 1 5 2 6 3 8 9 4 7 10 1 2 10 9 5 6 3 4 8 7 Algorithm 1: Loads 10 blocks Algorithm 2: Loads 5 blocks N blocks >> N/B blocks • Block size is large (32KB, 64KB) N >> N/B N = 256 x 106, B = 8000 , 1ms disk access time  NI/Os take 256 x 103 sec = 4266 min = 71 hr  N/BI/Os take 256/8 sec = 32 sec

I/O model I/O-operation Read/write one block of data from/to disk I/O-complexity number of I/O-operations (I/Os) performed by the algorithm External memory or I/O-efficient algorithms: Minimize I/O-complexity • RAM model • CPU-operation • CPU-complexity • Number of CPU-operations performed by the algorithm • Internal memory algorithms: • Minimize CPU-complexity

I/O-Efficient Algorithms • O(N) I/Os is bad!! • Improve to O(N/B)I/Os(if possible) • Minimize the number of blocks transferred between main memory and disk • Compute on whole block while it is in memory • Avoid loading a block each time • Use techniques from PRAM algorithms

Sorting • Mergesort illustrates often used features: • Main memory sized chunks (for N/M runs) • Multi-way merge (repeatedly merge M/B of them)

Computing FAI/O-Analysis Algorithm:O(N) time • Process (sweep) cells in topological order. For each cell: • Read flow from FA grid and direction from FD grid • Update flow in FA grid for downslope neighbors • Problem: Cells of same height distributed over the terrain  scattered access to FA gridandFD gridO(N) blocks

[ATV00] I/O-Efficient Flow Accumulation • Eliminating scattered accesses to FD grid • Store FD grid in topological order • Eliminating scattered accesses to FA grid • Obs: flow to neighbor cell is only needed when its time comes to be processed: • Topological rank time when cell is processed priority • Push flow by inserting flow increment in priority queue with priority equal to neighbor’s priority • Flow of cell obtained using DeleteMin operations • Note: Augment each cell with priority of 8 neighbors • Obs: Space (~9N) traded for I/O • Turns O(N) grid accesses into O(N) priority queue operations • Use I/O-efficient priority queue [A95,BK97] • Buffered B-tree with with lazy updates

TerraFlow • TerraFlow is our suite of programs for flow routing and flow accumulation on massive grids[ATV`00,AC&al`02] • Flow routing and flow accumulation modeled as graph problems and solved in optimal I/O bounds • Efficient • 2-1000 times faster on very large grids than existing software • Scalable • 1 billion elements!! (>2GB data) • Flexible • Allows multiple methods flow modeling http://www.cs.duke.edu/geo*/terraflow

90 500 MHz Alpha, FreeBSD 4.0 TerraFlow 512 80 TerraFlow 128 70 ArcInfo 512 ArcInfo 128 60 50 Running Time (Hours) 40 30 20 10 0 Hawaii 56M Midwest 561M Lower NE 256M East-Coast 491M Washington 2G Cumberlands 80M TerraFlow • Significant speedup over ArcInfo for large datasets • East-Coast TerraFlow: 8.7 Hours ArcInfo: 78 Hours • Washington state TerraFlow: 63 Hours ArcInfo: % • GRASS cannot handle Hawaii dataset (killed after (17 days!)

I/O-Model D Parameters N = # elements in problem instance B = # elements that fit in disk block M = # elements that fit in main memory Fundamental bounds: • Sorting: sort(N) = Block I/O M P In practiceblock and main memory sizes are big

I/O-Efficient Graph Algorithms Graph G=(V,E) • Basic graph (searching) problems • BFS, DFS, SSSP, topological sorting • ..are big open problems in the I/O-model! • Standard internal memory algorithms: O(E) I/Os • No I/O-efficient algorithms are known for any of these problems on general graphs! • Lower bound Ω (sort(V)), best known Ω (V/sqrt(B)) • O(sort(E)) algorithms for special classes of graphs • Trees, grid graphs, bounded-treewidth graphs, outerplanar graphs, planar graphs • Exploit existence of small separators or geometric structure

SSSP on Grid Graphs [ATV’00] Grid graph O(N) vertices, O(N) edges Dijskstra’s algorithm: O(N) I/Os Goal: compute shortest path δ(s,t) in O(sort(N)) I/Os Lemma: The portion of δ(s,t)between intersection points with boundaries of subgrids is the shortest path within the subgrid

SSSP on Grid Graphs [ATV’00] Idea: Compute shortest paths locally in each subgrid then compute the shortest way to combine them together • Divide grid into subgrids of size BxB (assume M > B2) • Replace each BxB subgrid with complete graph on boundary nodes • Edge weight: shortest path between the two boundary vertices within the subgrid • Reduced graph GR O(N/B) vertices, O(N) edges

SSSP on Grid Graphs [ATV’00] Algorithm • Compute SSSP on GRfrom s to all boundary vertices • Find SSSP from s to all interior vertices: for any subgrid σ, for any t in σ δ(s,t) = min v in Bnd(σ){δ(s,v) + δ σ(v,t)} • Correctness: • easy to show using Lemma • Analysis: O(sort(N)) I/Os • Dijkstra algorithm using I/O efficient priority queue and graph blocking

DFS O(sort(N)) I/Os [AMTZ’01] O(sort(N)) I/Os [ABT’00] O(sort(N)) I/Os [ABT’00] ε-separators BFS SSSP Results on Planar graphs Planar graph G with N vertices • Separators can be computed in O(sort(N)) I/Os • I/O-efficient reductions [ABT’00, AMTZ’01]  BFS, DFS, SSSP in O(sort(N)) I/Os

SSSP on Planar Graphs • Similar with grid graphs. AssumeM > B2, bounded degree • Assume graph is separated • O(N/B2) subgraphs, O(B2) vertices each, S=O(N/B) separators • each subgraph adjacent to O(B) separators

Efficient Algorithms for Large-Scale GIS Applications

Efficient Algorithms for Large-Scale GIS Applications

Presentation Transcript

SQL Server: A Data Platform for Large-Scale Applications

GPU Requirements for Large Scale Scientific Applications

Efficient Parallel Software for Large-Scale Semidefinite Programs

EFFICIENT DYNAMIC VERIFICATION ALGORITHMS FOR MPI APPLICATIONS

Adaptive Performance Optimization for Large Scale Web Applications

Efficient Algorithms for Matching

Efficient Large-Scale Stereo Matching ACCV 2010

Efficient Large-Scale Structured Learning

LARGE SCALE

Stability Analysis Algorithms for Large-Scale Applications

Efficient Eigensolvers for Large-scale Electronic Nanostructure Calculations

Constraint-Driven Large Scale Circuit Placement Algorithms

Efficient Algorithms for Large-Scale Topology Discovery

Efficient Data Collection for Large-Scale Mobile Monitoring

Large Scale Applications

Data Mining Algorithms for Large-Scale Distributed Systems

Efficient Parallel and Distributed Algorithms for GIS polygon overlay

XML Fragment Caching for Large-Scale Mobile Commerce Applications

Algorithms and Software for Large-Scale Nonlinear Optimization

Constraint-Driven Large Scale Circuit Placement Algorithms

Efficient Algorithms for Large-Scale GIS Applications