
Modeling Big Data



Presentation Transcript


  1. Modeling Big Data • Execution speed limited by: • Model complexity • Software Efficiency • Spatial and temporal extent and resolution • Data size & access speed • Hardware performance

  2. Combinatorials • If it takes 1 hour to run processing on Humboldt County: 10,495 km² • Entire US: 39 days
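
The 39-day figure can be reproduced with a short calculation. This is only a sketch: it assumes runtime grows linearly with area, and the US area of roughly 9.83 million km² is an assumed value that does not appear on the slide.

```python
# Assumed figure (not from the slides): total US area of roughly 9.83 million km².
US_AREA_KM2 = 9_830_000
HUMBOLDT_AREA_KM2 = 10_495   # from the slide
HOURS_FOR_HUMBOLDT = 1.0     # from the slide


def scaled_runtime_hours(target_area_km2, base_area_km2, base_hours):
    """Estimate runtime, assuming it grows linearly with the area processed."""
    return base_hours * target_area_km2 / base_area_km2


us_hours = scaled_runtime_hours(US_AREA_KM2, HUMBOLDT_AREA_KM2, HOURS_FOR_HUMBOLDT)
us_days = us_hours / 24
print(f"{us_hours:.0f} hours, about {us_days:.0f} days")
```

Doubling the resolution in both directions would multiply the pixel count, and under the same linear assumption the runtime, by four.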

  3. Vector or Array Computing • Supercomputers of the 1960s, 1970s, and 1980s • Harvard Architecture: • Separate program and data • 2^n processors execute the same program on different data • Vector arithmetic • Limited flexibility

  4. Von Neumann Architecture • Instructions and data share memory • More flexible • Allows for one task to be divided into many processes and executed individually [Diagram: CPU with ALU and cache; RAM holding instructions and data; I/O]

  5. Applications and Services

  6. Multiprocessing • Multiple processors (CPUs) working on the same task • Processes • Applications: have a UI • Services: Run in background • Can be executed individually • See “Task Manager” • Processes can have multiple “threads”

  7. Processes

  8. Threads • A process can have lots of threads • ArcGIS can now have two, one for the GUI and one for a geoprocessing task • Obtain a portion of the CPU cycles • Must “sleep” or can lock up • Share access to memory, disk, I/O
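
As an illustration of threads sharing memory within one process, here is a minimal Python sketch. The counter, the lock, and the `add_many` function are hypothetical names chosen for the example; the lock is one common way to keep concurrent updates to shared data consistent.

```python
import threading

counter = 0                      # shared by every thread in this process
counter_lock = threading.Lock()  # guards updates to the shared counter


def add_many(n):
    """Increment the shared counter n times, taking the lock for each update."""
    global counter
    for _ in range(n):
        with counter_lock:
            counter += 1


# Four threads update the same variable because they share the process memory.
threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

Without the lock, two threads could read the same old value and one update would be lost, which is the kind of problem shared access makes possible.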

  9. Distributed Processing • Task must be broken up into processes that can be run independently or sequentially • Typically: • Command line-driven • Scripts or compiled programs • R, Python, C++, Java, PHP, etc.
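
A command line-driven script for this kind of work might look like the sketch below. The `--chunk` and `--n-chunks` options and the row-based split are assumptions made for illustration; each machine would run the same script with a different chunk index.

```python
import argparse


def parse_args(argv=None):
    """Read the chunk index and the chunk count from the command line."""
    parser = argparse.ArgumentParser(description="Process one chunk of a larger job.")
    parser.add_argument("--chunk", type=int, required=True,
                        help="zero-based index of the chunk to process")
    parser.add_argument("--n-chunks", type=int, required=True,
                        help="total number of chunks")
    return parser.parse_args(argv)


def rows_for_chunk(chunk, n_chunks, n_rows):
    """Return the half-open row range [start, stop) assigned to this chunk."""
    start = chunk * n_rows // n_chunks
    stop = (chunk + 1) * n_rows // n_chunks
    return start, stop


def main(argv=None, n_rows=1000):
    args = parse_args(argv)
    start, stop = rows_for_chunk(args.chunk, args.n_chunks, n_rows)
    # Placeholder for the real work on rows start..stop-1.
    print(f"chunk {args.chunk}: rows {start} to {stop - 1}")
    return start, stop


# Equivalent to running: python script.py --chunk 2 --n-chunks 4
main(["--chunk", "2", "--n-chunks", "4"])
```

Because each invocation depends only on its own arguments, the runs can be started in any order or on different computers.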

  10. Distributed Processing • Grid – distributed computing • Beowulf – lots of simple computer boards (motherboards) • Condor – software to share free time on computers • “The Cloud?” – web-based “services”. Should allow submission of processes in the future.

  11. Trends • Processors are not getting faster • The internet is not getting faster • RAM continues to decrease in price • Hard discs continue to increase in size • Solid State Drives available • Number of “Cores” continues to increase

  12. Future Computers? • 128k cores, lots of “cache” • Multi-terabyte RAM • Terabyte SSDs • 100s of terabytes of hard disc storage? • Allows for: • Large datasets in RAM (multi-terabyte) • Even larger datasets on “hard disks” • Lots of tasks to run simultaneously

  13. Reality Check • Whether through local processing or distributed processing: • We will need to “parallelize” spatial analysis in the future to manage: • Larger datasets • Larger modeling extents and finer resolution • More complex models • Desire: • Break up processing into “chunks” that can each be executed somewhat independently of the others
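
The idea of breaking a job into independently executed chunks can be sketched in Python as follows. `process_chunk` is a hypothetical stand-in for a real model run, and a thread pool is used only to keep the sketch simple; for CPU-bound work, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and runs chunks on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    n = len(items)
    return [items[i * n // n_chunks:(i + 1) * n // n_chunks] for i in range(n_chunks)]


def process_chunk(chunk):
    """Hypothetical stand-in for a model run: sum of squares of the chunk."""
    return sum(v * v for v in chunk)


def run_in_chunks(items, n_chunks=4):
    """Process each chunk independently, then combine the partial results."""
    chunks = split_into_chunks(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


values = list(range(1000))
total = run_in_chunks(values)
print(total)
```

The approach works here because no chunk needs results from another; only the final combining step sees all of them.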

  14. Challenge • Having all the software you need on the computer you are executing the task on • Virtual Application: Entire computer disk image sent to another computer • All required software installed. • Often easier to manage your own cluster • Programs installed “once” • Shared hard disc access • Communication between threads

  15. Software • ArcGIS: installation, licensing, and processing make it almost impossible to use • Quantum GIS, GRASS: installation makes them challenging • FWTools, C++ applications • Use standard language libraries and functions to avoid compatibility problems

  16. Data Issues • Break data along natural lines: • Different species • Different time slices • Window spatial data • Oversized • Vector data: size typically not an issue • Raster data: size is an issue

  17. Windowing Spatial Data • Raster arithmetic is natural • Each pixel result is only dependent on one pixel in the source raster [Figure: result raster = raster + raster]
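
Because each output pixel depends only on the matching input pixels, raster arithmetic gives the same answer whether it runs on the whole raster or one window at a time. A minimal sketch, with rasters represented as plain lists of rows and hypothetical function names:

```python
def add_rasters(a, b):
    """Pixel-wise sum of two equally sized rasters (lists of rows)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def add_rasters_windowed(a, b, rows_per_window):
    """Same sum, computed one horizontal strip (window) at a time."""
    result = []
    for start in range(0, len(a), rows_per_window):
        stop = start + rows_per_window
        result.extend(add_rasters(a[start:stop], b[start:stop]))
    return result


raster_a = [[r * 4 + c for c in range(4)] for r in range(6)]
raster_b = [[1] * 4 for _ in range(6)]

whole = add_rasters(raster_a, raster_b)
windowed = add_rasters_windowed(raster_a, raster_b, rows_per_window=2)
print(whole == windowed)
```

Each strip could be handed to a different process, since no strip reads pixels that belong to another.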

  18. Windowing Spatial Data • N x N filters: • Need to use oversized windows [Figure: raster grid with rows and columns labeled]
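
The need for oversized windows can be shown with a small mean filter. In this sketch (function names and the strip-based windowing are assumptions for illustration), each strip is read with extra rows above and below, filtered, and then trimmed back to its core rows so that the result matches filtering the whole raster at once.

```python
def mean_filter(raster, n=3):
    """N x N mean filter; edge pixels average only the neighbours that exist."""
    half = n // 2
    n_rows, n_cols = len(raster), len(raster[0])
    out = []
    for r in range(n_rows):
        out_row = []
        for c in range(n_cols):
            vals = [raster[rr][cc]
                    for rr in range(max(0, r - half), min(n_rows, r + half + 1))
                    for cc in range(max(0, c - half), min(n_cols, c + half + 1))]
            out_row.append(sum(vals) / len(vals))
        out.append(out_row)
    return out


def mean_filter_windowed(raster, n=3, rows_per_window=2):
    """Filter strip by strip, reading n // 2 extra rows on each side of a strip."""
    half = n // 2
    n_rows = len(raster)
    out = []
    for start in range(0, n_rows, rows_per_window):
        stop = min(start + rows_per_window, n_rows)
        lo = max(0, start - half)       # oversized window: first row read
        hi = min(n_rows, stop + half)   # oversized window: one past last row read
        filtered = mean_filter(raster[lo:hi], n)
        out.extend(filtered[start - lo:stop - lo])  # keep only the core rows
    return out


raster = [[(r * r + 3 * c) % 7 for c in range(5)] for r in range(6)]
print(mean_filter_windowed(raster) == mean_filter(raster))
```

Filtering each strip without the extra rows would give different values along strip boundaries, because pixels there would be missing neighbours that exist in the full raster.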

  19. Windowing Spatial Data • Others are problematic: • Viewsheds • Stream networks • Spatial simulations [Image: ScienceDirect.com]
