Efficient Handling of Massive (Terrain) Datasets

A A R H U S U N I V E R S I T E T Department of Computer Science Efficient Handling ofMassive (Terrain) Datasets Lars Arge

Massive Data Algorithmics • Massive data being acquired/used everywhere • Storage management software is billion-$ industry • Science is increasingly about mining massive data (Nature 2/06) Examples (2002): • Phone: AT&T 20TB phone call database, wireless tracking • Consumer: WalMart 70TB database, buying patterns • WEB: Google index 8 billion web pages • Geography: NASA satellites generate Terrabytes each day

Terrain Data • New technologies: Much easier/cheaper to collect detailed data • Previous ‘manual’ or radar based methods • Often 30 meter between data points • Sometimes 10 meter data available • New laser scanning methods (LIDAR) • Less than 1 meter between data points • Centimeter accuracy (previous meter) Denmark: • ~2 million points at 30 meter (<<1GB) • ~18 billion points at 1 meter (>>1TB) • COWI (and other) now scanning DK • NC scanned after Hurricane Floyd in 1999

read/write head read/write arm track magnetic surface Massive data = I/O-Bottleneck • I/O is often bottleneck when handling massive datasets • Disk access is 106 times slower than main memory access! • Disk systems try to amortize large access time transferring large contiguous blocks of data • Need to store and access data to take advantage of blocks! “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

I/O-efficient Algorithms • Taking advantage of block access important • Traditionally algorithms developed without block considerations • I/O-efficient algorithms leads to large runtime improvements Normal algorithm running time I/O-efficient algorithm datasize Main memory size

Scalability: Hierarchical Memory • Block access not only important on disk level • Machines have complicated memory hierarchy • Levels get largerandslower • Block transfers on all levels R A M L 1 L 2 running time datasize

My Research Work • Theoretically I/O- (and cache-) efficient algorithms work • Data structures, computational geometry, graph theory, … • Focus on Geographic Information Systems problems • Algorithm engineering work, e.g • TPIE system for simple, efficient, and portable implementation of I/O-efficient algorithms • Software for terrain data processing • LIDAR data handling • Terrain flow computations

Example: Terrain Flow • Terrain water flow has many important applications • Predict location of streams, areas susceptible to floods… • Conceptually flow is modeled using two basic attributes • Flow direction: The direction water flows at a point • Flow accumulation: Amount of water flowing through a point • Flow accumulation used to compute other hydrological attributes: drainage network, topographic convergence index… 7 am 3pm

Terrain Flow Accumulation • Collaboration with environmental researchers at Duke University • Appalachian mountains dataset: • 800x800km at 100m resolution  a few Gigabytes • On ½GB machine: • ArcGIS: • Performance somewhat unpredictable • Days on few gigabytes of data • Many gigabytes of data….. • Appalachian dataset would be Terabytes sized at 1m resolution • 14 days!!

Terrain Flow Accumulation: TerraFlow • We developed theoretically I/O-optimal algorithms • TPIE implementation was very efficient • Appalachian Mountains flow accumulation in 3 hours! • Developed into comprehensive software package for flow computation on massive terrains: TerraFlow • Efficient: 2-1000 times faster than existing software • Scalable: >1 billion elements! • Flexible: Flexible flow modeling (direction) methods • Extension to ArcGIS

LIDAR Terrain Data Work: TerraStream • Now TerraStream software “pipeline” for handling terrain data • Points to DEM (incl. breaklines) • DEM flow modeling (incl “flooding”, “flat” routing, “noise” reduce) • DEM flow accumulation (incl river extraction) • DEM hierarchical watershed computation • All work for both grid and TIN DEM’s • Capable of handling massive datasets • Test dataset: 400M point Neuse river basin (1/3 NC) (>17GB)

Examples of Ongoing Terrain Work • Terrain modeling, e.g • “Raw” LIDAR to point conversion (LIDAR point classification) (incl feature, e.g. bridge, detection/removal) • Further improved flow and erosion modeling (e.g. carving) • Contour line extraction (incl. smoothing and simplification) • Terrain (and other) data fusion (incl format conversion) • Terrain analysis, e.g • Choke point, navigation, visibility, change detection,… • Major grand goal: • Construction of hierarchical (simplified) DEM where derived features (water flow, drainage, choke points) are preserved/consistent

Thanks Lars Arge large@daimi.au.dk Work supported in part by • US National Science Foundation ESS grant EIA–98070734, RI grant EIA–9972879, CAREER grant EIA–9984099, and ITR grant EIA–0112849 • US Army Research Office grants W911NF-04-1-0278 and DAAD19-03-1-0352 • Ole Rømer Scholarship from Danish Science Research Council • NABIIT grant from the Danish Strategic Research Council • Danish National Research foundation (MADALGO center)

Efficient Handling of Massive (Terrain) Datasets