Modeling Big Data • Execution speed limited by: • Model complexity • Software efficiency • Spatial and temporal extent and resolution • Data size & access speed • Hardware performance
Combinatorials • If it takes 1 hour to run processing on Humboldt County (10,495 km²): • Entire US (~9.8 million km²): ~39 days
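A quick back-of-the-envelope check of that scaling, as a Python sketch: the total US area of roughly 9.83 million km² is an assumption (the slide only states the 39-day result), and run time is assumed to scale linearly with area.

```python
# Back-of-the-envelope scaling for the Humboldt County -> entire US example.
# The ~9.83 million km^2 US area is an assumption; the slide only gives the result.
humboldt_km2 = 10_495            # area processed in 1 hour
us_km2 = 9_833_520               # approximate total US area (assumed)
hours = us_km2 / humboldt_km2    # assumes run time scales linearly with area
print(f"{hours:.0f} hours = {hours / 24:.0f} days")   # ~937 hours, ~39 days
```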
Vector or Array Computing • Supercomputers of the '60s, '70s, and '80s • Harvard Architecture: • Separate program and data memory • 2^n processors execute the same program on different data • Vector arithmetic • Limited flexibility
Von Neumann Architecture • Instructions and data share memory • More flexible • Allows for one task to be divided into many processes and executed individually • (Diagram: CPU with ALU and cache, RAM holding both instructions and data, I/O)
Multiprocessing • Multiple processors (CPUs) working on the same task • Processes: • Applications: have a UI • Services: run in the background • Can be executed individually • See "Task Manager" • Processes can have multiple "threads"
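As a minimal sketch of multiple processes working on one task, here is a Python example using the standard multiprocessing module; the chunk names and the work done per chunk are hypothetical placeholders, not a specific tool from the slides.

```python
# Minimal sketch: run the same work function on several data "chunks" in
# separate worker processes. The chunk names and the work are hypothetical.
from multiprocessing import Pool

def process_chunk(name):
    # stand-in for a real geoprocessing step on one piece of the data
    return f"{name}: done"

if __name__ == "__main__":
    chunks = ["tile_01", "tile_02", "tile_03", "tile_04"]
    with Pool(processes=4) as pool:          # one worker process per chunk
        for result in pool.map(process_chunk, chunks):
            print(result)
```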
Threads • A process can have many threads • ArcGIS can now have two: one for the GUI and one for a geoprocessing task • Each obtains a portion of the CPU cycles • Must "sleep" or they can lock up • Share access to memory, disk, and I/O
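A minimal Python sketch of those points: two threads inside one process, sharing the process's memory and sleeping so neither monopolizes the CPU. The counter and timings are illustrative only.

```python
# Minimal sketch: two threads inside one process. They share the process's
# memory (the `shared` dict) and sleep periodically so neither hogs the CPU.
import threading
import time

shared = {"count": 0}
lock = threading.Lock()

def worker():
    for _ in range(5):
        with lock:             # shared memory, so updates must be guarded
            shared["count"] += 1
        time.sleep(0.01)       # yield the CPU instead of busy-waiting

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["count"])         # 10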
Distributed Processing • Task must be broken up into processes that can be run independently or sequentially • Typically: • Command-line driven • Scripts or compiled programs • R, Python, C++, Java, PHP, etc.
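A sketch of what such a command-line-driven worker might look like in Python; the --tile parameter, file name, and processing step are hypothetical, but the pattern (one independent chunk per invocation) is what lets a grid or cluster launch many copies at once.

```python
# Hypothetical command-line-driven worker: each invocation processes one
# independent chunk, so a scheduler can launch many copies in parallel.
import argparse

def run(tile_id):
    print(f"processing tile {tile_id} ...")   # real geoprocessing would go here

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process one tile of a larger job")
    parser.add_argument("--tile", type=int, required=True, help="tile index to process")
    args = parser.parse_args()
    run(args.tile)
```

Each copy could then be launched as, for example, `python worker.py --tile 7` (file name hypothetical).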
Distributed Processing • Grid – distributed computing • Beowulf – clusters of many simple computer boards (motherboards) • Condor – software to share free time on computers • "The Cloud?" – web-based "services"; should allow submission of processes in the future.
Trends • Processors are not getting faster • The internet is not getting faster • RAM continues to decrease in price • Hard disks continue to increase in size • Solid-state drives are available • The number of "cores" continues to increase
Future Computers? • 128k cores, lots of "cache" • Multi-terabyte RAM • Terabyte SSD drives • 100s-of-terabyte hard disks? • Allows for: • Large datasets in RAM (multi-terabyte) • Even larger datasets on "hard disks" • Lots of tasks to run simultaneously
Reality Check • Whether through local processing or distributed processing: • We will need to "parallelize" spatial analysis in the future to manage: • Larger datasets • Larger modeling extents and finer resolutions • More complex models • Desire: • Break up processing into "chunks" that can each be executed somewhat independently of the others
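One way to produce those "chunks", sketched in Python under the assumption that the data is a raster split into row blocks; the sizes are made up.

```python
# Minimal sketch: split a large raster extent into row blocks ("chunks")
# that can each be processed somewhat independently. Sizes are made up.
def row_chunks(n_rows, chunk_rows):
    """Yield (start_row, end_row) pairs covering n_rows in blocks."""
    for start in range(0, n_rows, chunk_rows):
        yield start, min(start + chunk_rows, n_rows)

for start, end in row_chunks(n_rows=10_000, chunk_rows=2_048):
    print(f"rows {start}..{end}")   # each range could go to its own process
```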
Challenge • Having all the software you need on the computer you are executing the task on • Virtual application: an entire computer disk image sent to another computer with all required software installed • Often easier to manage your own cluster: • Programs installed "once" • Shared hard-disk access • Communication between threads
Software • ArcGIS: installation, licensing, and processing overhead make it almost impossible to use • Quantum GIS, GRASS: installation makes them challenging • FWTools, C++ applications: • Use standard language libraries and functions to avoid compatibility problems
Data Issues • Break data along natural lines: • Different species • Different time slices • Window spatial data (oversize the windows where needed) • Vector data: size typically not an issue • Raster data: size is an issue
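A minimal sketch of splitting data along one such natural line (species here; time slices would work the same way); the records are invented placeholders.

```python
# Minimal sketch: partition records along a "natural line" (species) so each
# group becomes an independent unit of work. The records are invented.
from collections import defaultdict

records = [("sequoia", 1), ("redwood", 2), ("sequoia", 3)]

groups = defaultdict(list)
for species, value in records:
    groups[species].append(value)

for species, values in groups.items():
    print(species, values)      # each group can be processed on its own
```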
Windowing Spatial Data • Raster arithmetic windows naturally • Each pixel of the result depends only on the corresponding pixel in each source raster
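A small sketch (assuming NumPy) of why per-pixel arithmetic windows so easily: any tile of the output can be computed from just that tile of the inputs.

```python
# Minimal sketch (assumes NumPy): each output cell of raster arithmetic
# depends only on the same cell in the inputs, so any window of the result
# can be computed without looking at neighbouring pixels.
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)   # two small stand-in rasters
b = np.ones((4, 4))

full = a + b                            # whole-raster result
window = a[0:2, 0:2] + b[0:2, 0:2]      # same values from just one 2x2 window
assert np.array_equal(full[0:2, 0:2], window)
```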
Windowing Spatial Data • N x N filters: • Need to use oversized windows (each output pixel depends on its N x N neighborhood)
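A sketch (again assuming NumPy) of why N x N filters need oversized windows: to filter a block correctly, the block must be padded by the filter radius with real neighbouring pixels.

```python
# Minimal sketch (assumes NumPy): a 3x3 mean filter needs pixels from outside
# the block being processed, so each window must be oversized by the filter
# radius (1 pixel here) on every side.
import numpy as np

def mean3x3(a):
    """3x3 mean filter, valid region only (output is 2 rows/cols smaller)."""
    out = np.zeros((a.shape[0] - 2, a.shape[1] - 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = a[i:i + 3, j:j + 3].mean()
    return out

rng = np.random.default_rng(0)
raster = rng.random((8, 8))

full = mean3x3(raster)            # filtered values for interior pixels 1..6

# To reproduce the filtered values for pixels 1..3 alone, the window must
# include rows/cols 0..4, i.e. one extra pixel of overlap on each side.
oversized = raster[0:5, 0:5]
part = mean3x3(oversized)         # filtered values for pixels 1..3

assert np.allclose(full[0:3, 0:3], part)
```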
Windowing Spatial Data • Others are problematic: • Viewsheds • Stream networks • Spatial simulations