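MASSIVE Terrain Datasets: On the Importance of Efficient Algorithms
Lars Arge
Department of Computer Science (Datalogisk Institut), Aarhus University (Aarhus Universitet)
Regional one-day course in computer science (Regionalt endagskursus datalogi), 20 March 2006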
Massive terrain datasets
Outline
• Massive (terrain) data
• Scalability problems (I/O bottleneck)
• Processing massive terrain data: Flow modeling on grid terrains
• Summary
Massive Data
Massive Data
• Massive datasets are being collected everywhere
• Storage management software is a billion-dollar industry
Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns (supermarket checkout)
• Web: Web crawl of 200M pages and 2000M links; Akamai stores 7 billion clicks per day
• Geography: NASA satellites generate 1.2TB per day
Example: Satellite Images
• Terabyte image database
Example: Grid Terrain Data
• Grid terrain data increasingly available
  • NASA SRTM mission acquired 30m data for around 80% of the Earth's land mass
  • US data readily available through the USGS National Map Seamless Data Distribution System
• Appalachian Mountains (800km x 800km)
  • 100m resolution: ~64M cells, ~128MB raw data (~500MB when processing)
  • ~1.2GB at 30m resolution
  • ~12GB at 10m resolution (much of US available from USGS)
  • ~1.2TB at 1m resolution (selected, mostly military, availability)
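The size estimates for the Appalachian grid follow from simple arithmetic. A minimal sketch, assuming a square 800km x 800km grid and 2 bytes per elevation sample (the bytes-per-cell value is an assumption chosen to match the ~128MB figure; the function name is illustrative only):

```python
def grid_size(extent_m: float, resolution_m: float, bytes_per_cell: int = 2):
    """Return (cells, raw_bytes) for a square grid covering extent_m x extent_m."""
    cells_per_side = round(extent_m / resolution_m)
    cells = cells_per_side * cells_per_side
    return cells, cells * bytes_per_cell


# Print the cell count and raw size for the resolutions listed on the slide.
for res in (100, 30, 10, 1):
    cells, raw = grid_size(800_000, res)
    print(f"{res:>4} m: {cells / 1e6:>12,.0f} M cells, {raw / 1e9:>10,.2f} GB raw")
```

Each tenfold refinement of the resolution multiplies the cell count by 100, which is why the 1m grid reaches the terabyte range. The slide's figures are rounded; under these assumptions the 30m case comes out somewhat larger, at about 1.4GB.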
Example: LIDAR Terrain Data
• Massive (irregular) point sets (1-10m resolution)
  • Becoming relatively cheap and easy to collect
• NC floodplain mapping program: www.ncfloodmaps.com
  • Collected LIDAR data for all of North Carolina (NC) after Hurricane Floyd in 1999
  • Still processing it
Hurricane Floyd
• Sep. 15, 1999
[Figure: two images, labeled 7 am and 3 pm]
Example: LIDAR Terrain Data
• US LIDAR data becoming available:
  • www.ncfloodmaps.com
  • USGS Center for LIDAR Information Coordination and Knowledge (CLICK)
  • NOAA LIDAR Data Retrieval Tool (LDART)
Scalability Problems
Scalability Problems: I/O-Bottleneck
[Figure: disk drive showing read/write head, read/write arm, track, and magnetic surface]
• I/O is often the bottleneck when handling massive datasets
• Disk access is 10^6 times slower than main memory access
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
• Disk systems try to amortize the large access time by transferring large contiguous blocks of data
• Need to store and access data so as to take advantage of blocks (locality)
Scalability Problems: Block Access Matters
• Example: Reading an array from disk
  • Array size N = 10 elements
  • Disk block size B = 2 elements
  • Main memory size M = 4 elements (2 blocks)
[Figure: two arrays of 10 elements, labeled 1 2 10 9 5 6 3 4 8 7 and 1 5 2 6 3 8 9 4 7 10. Algorithm 1: Loads N = 10 blocks. Algorithm 2: Loads N/B = 5 blocks]
• Difference between N and N/B is large since the block size is large
  • Example: N = 256 x 10^6, B = 8000, 1ms disk access time
    N I/Os take 256 x 10^3 sec = 4266 min = 71 hr
    N/B I/Os take 256/8 sec = 32 sec
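The block counts in this example can be reproduced with a small simulation. The sketch below rests on several assumptions that the slide does not state: each number sequence is read as the order in which the array cells are visited (the cell labeled k is visited k-th), memory holds 2 blocks, and the least recently used block is evicted. All names are illustrative.

```python
from collections import OrderedDict


def count_block_loads(labels, block_size, memory_blocks):
    """Count block loads when cells are visited in the order given by their labels."""
    # Map each label to the disk block that holds its array position.
    block_of = {label: pos // block_size for pos, label in enumerate(labels)}
    cache = OrderedDict()  # resident blocks, least recently used first
    loads = 0
    for label in sorted(labels):
        block = block_of[label]
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            loads += 1                     # miss: one I/O to fetch the block
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return loads


def io_time_seconds(num_ios, access_time_ms=1):
    """Total time for num_ios block transfers at access_time_ms each."""
    return num_ios * access_time_ms / 1000


scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]
clustered = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]
print(count_block_loads(scattered, block_size=2, memory_blocks=2))  # 10
print(count_block_loads(clustered, block_size=2, memory_blocks=2))  # 5

N, B = 256 * 10**6, 8000
print(io_time_seconds(N) / 3600)  # about 71 hours
print(io_time_seconds(N // B))    # 32.0 seconds
```

Under these assumptions the scattered order touches a new block on every access, while the clustered order uses both elements of a block before moving on.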
Scalability Problems: Block Access Matters
[Figure: running time plotted against data size, with the RAM size marked]
• Most programs are developed without memory considerations
  • Infinite memory
  • Uniform access cost
• They run on large datasets because the OS moves blocks as needed
• Modern operating systems use sophisticated paging and prefetching strategies
  • But if a program makes scattered accesses, even a good OS cannot take advantage of block access
Scalability problems!
Scalability: Hierarchical Memory
[Figure: running time plotted against data size, with the RAM size marked; memory hierarchy with levels L1, L2, and RAM]
• Block access is not only important on the disk level
• Machines have a complicated memory hierarchy
  • Levels get larger and slower
  • Block transfers on all levels
• We focus on the disk level
Processing Massive Terrain Data: Flow
Flow on Terrains
• Modeling of water flow on terrains has many important applications
  • Predict location of streams
  • Predict areas susceptible to floods
  • Compute watersheds
  • Predict erosion
  • Predict vegetation distribution
  • ...
• Conceptually, flow is modeled using two basic attributes
  • Flow direction: The direction water flows at a point
  • Flow accumulation: Amount of water flowing through a point
• Flow accumulation is used to compute other hydrological attributes, e.g. drainage network, topographic convergence index, ...
Flow Directions on Grid Terrains
[Figure: SFD and MFD illustrated on a small example grid with cell heights 3, 2, 4, 7, 5, 8, 7, 1, 9]
• Common terrain representation: Grid
• Flow directions: Water in each cell flows to downslope neighbor(s)
• Commonly used:
  • Single flow direction (SFD or D8): Flow to one (the steepest) downslope neighbor
  • Multiple flow direction (MFD): Flow to all downslope neighbors
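The two flow direction models can be sketched in a few lines. This is an illustration only: the nine heights from the figure are assumed to form a 3 x 3 grid in row-major order, slope is taken as height drop divided by the distance between cell centers, and the function names are hypothetical.

```python
import math

# Offsets to the eight neighbors of a grid cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def downslope_neighbors(grid, r, c):
    """Return [(slope, (row, col)), ...] for every strictly lower neighbor of (r, c)."""
    result = []
    for dr, dc in NEIGHBORS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
            drop = grid[r][c] - grid[nr][nc]
            if drop > 0:
                # Diagonal neighbors are sqrt(2) cell widths away.
                result.append((drop / math.hypot(dr, dc), (nr, nc)))
    return result


def sfd(grid, r, c):
    """Single flow direction (D8): the steepest downslope neighbor, or None at a sink."""
    candidates = downslope_neighbors(grid, r, c)
    return max(candidates)[1] if candidates else None


def mfd(grid, r, c):
    """Multiple flow direction: every downslope neighbor."""
    return [cell for _, cell in downslope_neighbors(grid, r, c)]


grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
print(sfd(grid, 1, 1))  # (2, 1): the cell of height 1 directly below the center
print(mfd(grid, 1, 1))  # [(0, 0), (0, 1), (0, 2), (2, 1)]
```

For the center cell (height 5), SFD picks the single steepest descent, while MFD returns all four lower neighbors. How MFD splits the water among them (for example in proportion to slope) is a further modeling choice not fixed here.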
Flow Accumulation on Grid Terrains
• Flow accumulation
  • Initially one unit of water in each cell
  • Water distributed from each cell according to flow direction(s)
  • Flow accumulation of a cell is the total flow through it
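The definition above translates directly into an in-memory computation. The following is a minimal sketch, assuming single flow direction (steepest descent) and a small illustrative 3 x 3 grid; it ignores memory and I/O entirely, and the function names are hypothetical.

```python
import math


def steepest_downslope(grid, r, c):
    """Return the steepest strictly lower neighbor of (r, c), or None at a sink."""
    best, target = 0.0, None
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            if (dr or dc) and 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
                slope = (grid[r][c] - grid[nr][nc]) / math.hypot(dr, dc)
                if slope > best:
                    best, target = slope, (nr, nc)
    return target


def flow_accumulation(grid):
    """Flow accumulation under single flow direction."""
    rows, cols = len(grid), len(grid[0])
    flow = [[1] * cols for _ in range(rows)]  # one unit of water in each cell
    # Visit cells from highest to lowest, so that a cell has received
    # all of its inflow before its own water is passed on.
    cells = sorted(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: grid[rc[0]][rc[1]], reverse=True)
    for r, c in cells:
        target = steepest_downslope(grid, r, c)
        if target is not None:
            flow[target[0]][target[1]] += flow[r][c]
    return flow


grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
for row in flow_accumulation(grid):
    print(row)
```

On this grid the two local minima (heights 2 and 1) collect 3 and 6 units respectively, which together account for all 9 cells.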
Flow Accumulation Example (Panama dataset)
Flow Modeling on Massive Grid Terrains
• Duke University environmental researchers had problems computing flow accumulation for the Appalachian Mountains
  • Recall: ~128MB raw data and ~500MB when processing
  • Running time: 14 days
• It could be much worse; recall:
  • ~1.2GB at 30m resolution
  • ~12GB at 10m resolution
  • ~1.2TB at 1m resolution
Flow Modeling on Massive Grid Terrains
• We surveyed other flow accumulation software
  • GRASS (leading open-source GIS)
    • Killed after 17 days on a 50MB dataset (6700 x 4300 grid)
  • TARDEM (specialized hydrology software)
    • Could handle the 50MB dataset
    • Killed after 20 days on a 240MB dataset (12000 x 10000 grid)
    • CPU utilization 5%, 3GB swap file
  • ArcGIS (leading commercial GIS)
    • Could handle the 240MB dataset
    • Sometimes very slow:
      • 3 days to process a 490MB dataset
      • 1 day to process a 560MB dataset
    • Does not work for datasets larger than 2GB
Flow Accumulation Scalability Problem
• The natural algorithm may require ~N I/Os
  • “Push” flow down the terrain by visiting cells in height order
  • This is a problem since cells of the same height are scattered over the terrain
• Natural to try “tiling” (ArcGIS?)
  • But computations in different tiles are not independent
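The cost of visiting cells in height order can be made concrete by counting block loads. The sketch below is an illustration under stated assumptions: a synthetic 64 x 64 terrain with random heights stored in row-major order (real terrain is spatially correlated, so this is close to a worst case), blocks of 64 cells, room for 4 blocks in memory, and least-recently-used eviction. None of these parameters come from the slide.

```python
import random
from collections import OrderedDict


def block_loads(access_order, block_size, memory_blocks):
    """Count block loads for a sequence of array positions under LRU replacement."""
    cache, loads = OrderedDict(), 0
    for pos in access_order:
        block = pos // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            loads += 1                     # miss: one I/O
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return loads


rows = cols = 64
rng = random.Random(0)                                # fixed seed for repeatability
heights = [rng.random() for _ in range(rows * cols)]  # synthetic row-major terrain

scan_order = range(rows * cols)                       # sequential sweep over the grid
height_order = sorted(scan_order, key=lambda p: heights[p], reverse=True)

B, MEMORY_BLOCKS = 64, 4
print("sequential scan:", block_loads(scan_order, B, MEMORY_BLOCKS))
print("height order:   ", block_loads(height_order, B, MEMORY_BLOCKS))
```

The sequential scan loads each block exactly once (N/B loads), whereas the height-ordered visit approaches one load per cell (~N loads), which is the gap the slide describes.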
TerraFlow
• We developed theoretically I/O-optimal algorithms using ~N/B I/Os
  • Avoiding scattered access by:
    • Grid storing input: Data duplication
    • Grid storing flow: “Lazy write”
• Implementation was very efficient
  • Appalachian Mountains flow accumulation in 3 hours!
  • Developed into a comprehensive software package for flow computation on massive grids (www.cs.duke.edu/geo*/terraflow)
    • Efficient: 2-1000 times faster than other software on massive grids
    • Scalable: 1 billion elements! (>2GB data)
    • Flexible: Different flow modeling (direction) methods
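The slide names the two ideas without detail. One way to picture them, offered here as a sketch of the general idea and not as TerraFlow's actual code: each cell record carries a copy of its downslope target's height (data duplication), and flow is not written into the target cell but queued as a message that the target collects when the height-ordered sweep reaches it (lazy write). Python's in-memory `heapq` stands in for an external-memory priority queue, single flow direction is assumed, and all function names are hypothetical.

```python
import heapq
import math


def steepest_downslope(grid, r, c):
    """Return the steepest strictly lower neighbor of (r, c), or None at a sink."""
    best, target = 0.0, None
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            if (dr or dc) and 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
                slope = (grid[r][c] - grid[nr][nc]) / math.hypot(dr, dc)
                if slope > best:
                    best, target = slope, (nr, nc)
    return target


def flow_accumulation_lazy(grid):
    rows, cols = len(grid), len(grid[0])
    # Data duplication: a record holds the cell's own sort key and the key of
    # its downslope target, so the sweep below never looks up a neighbor.
    records = []
    for r in range(rows):
        for c in range(cols):
            t = steepest_downslope(grid, r, c)
            key = (-grid[r][c], r, c)  # ascending key order = descending height
            target_key = None if t is None else (-grid[t[0]][t[1]], t[0], t[1])
            records.append((key, target_key))
    records.sort()

    pending = []  # min-heap of (target_key, amount): flow still in transit
    flow = {}
    for key, target_key in records:  # one sequential sweep over the sorted records
        total = 1                    # the cell's own unit of water
        while pending and pending[0][0] == key:
            total += heapq.heappop(pending)[1]  # collect flow sent to this cell
        flow[key[1:]] = total
        if target_key is not None:
            heapq.heappush(pending, (target_key, total))  # lazy write
    # Rebuild a grid only for display; a sweep-based version would stream results out.
    return [[flow[(r, c)] for c in range(cols)] for r in range(rows)]


for row in flow_accumulation_lazy([[3, 2, 4], [7, 5, 8], [7, 1, 9]]):
    print(row)
```

Because a target is always strictly lower than its source, its key sorts later, so every message is waiting at the top of the queue when its cell is reached. The sweep itself touches the records only sequentially; the scattered accesses are absorbed by the queue.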
TerraFlow
[Figure: bar chart of running time (hours, 0-90) for TerraFlow 512, TerraFlow 128, ArcInfo 512, and ArcInfo 128 on the datasets Hawaii 56M, Cumberlands 80M, Lower NE 256M, East-Coast 491M, Midwest 561M, and Washington 2G; measured on a 500 MHz Alpha running FreeBSD 4.0]
• Significant speedup over ArcInfo for large datasets
  • East-Coast (100m): TerraFlow: 8.7 hours, ArcInfo: 78 hours
  • Washington state (10m): TerraFlow: 63 hours, ArcInfo: %
• Incorporated in GRASS 5.0.2 and later
• Recently also extensions for ArcGIS 8 and 9
Denmark?
Denmark Terrain Data
• Mainly two data suppliers in Denmark
  • Kort & Matrikelstyrelsen
  • COWI A/S
• Grid/vector models based on paper maps/orthophotos
• LIDAR data for major cities
• Unfortunately not available online (and not free)
• But obviously increasing interest in terrain data/applications
New Project
• New (NABIIT) project: Development of algorithms and software for processing massive terrain data
  • COWI A/S
    • Problems processing LIDAR data during production and analysis (e.g. railroad noise)
  • Spatial analysis unit, Danish Institute of Agricultural Sciences
    • Use data, e.g. to comply with EU directives
  • Computer science, Aarhus University
    • Efficient algorithms
• Focus on
  • Terrain modeling, terrain flow analysis, influence of simplification
Example Sub-Projects
• Terrain modeling, e.g.:
  • Terrain models from “raw” LIDAR
    Process >10GB raw data in a few hours using only 128MB memory
• Terrain analysis, e.g.:
  • Erosion modeling (USLE factor computation)
  • Watershed hierarchy computation
    NC Neuse basin at 10m resolution (~400M cells) in 3 hours
Summary
Summary
• Massive datasets appear everywhere
• This leads to scalability problems
  • Due to hierarchical memory and slow I/O
• I/O-efficient algorithms greatly improve scalability
• Terrain data:
  • Massive grid data exists
  • New technologies are creating massive and very detailed datasets
  • Processing capabilities lag behind
Summary - Resources
• Google Earth: http://earth.google.com/
• USGS National Map: http://seamless.usgs.gov
• USGS Center for LIDAR Information: http://lidar.cr.usgs.gov
• NC floodmaps: http://www.ncfloodmaps.com
• NOAA LIDAR Data Retrieval Tool: http://www.csc.noaa.gov/crs/tcm/about_ldart.html
• TerraFlow: http://www.cs.duke.edu/geo*/terraflow
• Duke STREAM project: http://terrain.cs.duke.edu
• Kort & Matrikelstyrelsen: http://www.kms.dk
• COWI A/S: http://www.cowi.dk
• Geoforum: http://www.geoforum.dk/
Thanks / Tak
Lars Arge
large@daimi.au.dk