
David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011



  1. Satellite Image Processing And Production With Apache Hadoop David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011

  2. Overview • Apache Hadoop • Applications, Environment and Use Case • Log Processing Example • EROS Science Processing Architecture (ESPA) and Hadoop • ESPA Processing Example • ESPA Implementation Strategy • Performance Results • Thoughts, Notes and Takeaway • Questions

  3. Apache Hadoop – What is it? • Open source distributed processing system • Designed to run on commodity hardware • Widely used for solving “Big Data” challenges • Has been deployed in clusters with thousands of machines and petabytes of storage • Two primary subsystems: Hadoop Distributed File System (HDFS) and the MapReduce engine

  4. Hadoop’s Applications • Web content indexing • Data mining • Machine learning • Statistical analysis and modeling • Trend analysis • Search optimization • … and of course, satellite image processing!

  5. Hadoop’s Environment • Linux and Unix • Java-based, but relies on SSH for job distribution • Jobs may be written in any language executable from a shell prompt • Java, C/C++, Perl, Python, Ruby, R, Bash, among others

  6. Hadoop’s Use Case • A group of machines is configured into a Hadoop cluster • Each machine contributes: • Local compute resources to MapReduce • Local storage resources to HDFS • Files are stored in HDFS • File sizes are typically measured in gigabytes and terabytes • A job is run against an input file in HDFS • The target input file is specified • The code to run against the input is also specified

  7. Hadoop’s Use Case • Unlike traditional systems, which move data to the code, Hadoop flips this and moves code to the data • Two software functions comprise a MapReduce job: • Map operation • Reduce operation • Upon execution: • Hadoop identifies the input file’s chunk locations, moves the algorithm code to those nodes, and executes it • The “Map” • Sorts the Map results and aggregates the final answer (single thread) • The “Reduce”

  8. Log Processing Example

  9. ESPA and Hadoop • Hadoop map code runs in parallel on the input (log file) • Processes a single input file as quickly as possible • Reduce code runs on the mapper output • ESPA processes satellite images, not text • Algorithms cannot run in parallel within an image • Cannot use satellite images as the input • Solution: use a text file with the image locations as input, and skip the reduce step • Rather than parallelizing within an image, ESPA handles many images at once
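A map-only driver in this style could be sketched as follows. The `process_scene` function and the input format (one scene location per line) are illustrative assumptions, not ESPA’s actual code; the point is that each map task works through whole scenes serially, and parallelism comes from Hadoop running many map tasks at once:

```python
import sys

def process_scene(scene_path):
    """Placeholder for the real per-scene work (fetch, atmospheric
    correction, packaging). Hypothetical; returns a status string."""
    return "ok"

def run_mapper(lines):
    """Each map task receives a slice of the input text file, one scene
    location per line, and processes scenes one at a time."""
    for line in lines:
        scene = line.strip()
        if not scene:
            continue
        status = process_scene(scene)
        # Emit scene<TAB>status; with no reduce step, this is the job output.
        yield f"{scene}\t{status}"

if __name__ == "__main__":
    for record in run_mapper(sys.stdin):
        print(record)
```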

  10. ESPA Processing Example

  11. Implementation Strategy • LSRD is budget-constrained for hardware • Other projects regularly excess (surplus) old hardware when warranties expire • Take ownership of these systems… if they fail, they fail • Also “borrow” compute and storage from other projects • Only network connectivity is necessary • Current cluster is 102 cores, at minimal expense • Cables, switches, etc.

  12. Performance Results • Original throughput requirement was 455 atmospherically corrected Landsat scenes per day • Currently able to process ~4,800! • Biggest bottleneck is local-machine storage I/O • Due to the implementation FTPing files instead of using HDFS as intended • Attempted to solve this with a RAM disk, but there was not enough memory • Currently evaluating solid-state disks

  13. Thoughts and Notes • The number of splits on an input file can be controlled via the dfs.block.size parameter • This in turn controls the number of map tasks run against an input file • Unlike typical Hadoop deployments, an ESPA-like implementation does not require massive storage • Input files are very small • Robust internal job-monitoring mechanisms are usually custom-built
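As a back-of-the-envelope check of the block-size point (assuming the default behavior of roughly one map task per HDFS block for text input), the split count follows from the file size and the block size, which is why shrinking dfs.block.size lets a tiny scene-list file fan out into many map tasks. The sizes below are illustrative:

```python
import math

def num_splits(file_size_bytes, block_size_bytes):
    """Approximate number of input splits (and hence map tasks),
    assuming one split per HDFS block -- the default for text input."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A scene-list file of ~100 KB is tiny; a small block size is what
# turns it into many map tasks. 100,000 / 1,024 rounds up to 98.
print(num_splits(100_000, 1_024))
```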

  14. Thoughts and Notes • Jobs written for Hadoop Streaming may be tested and run without Hadoop • cat inputfile.txt | ./mapper.py | sort | ./reducer.py > out.txt • Projects can share resources • Hadoop is tunable to restrict resource utilization on a per-machine basis • Provides instant productivity gains versus internal development • LSRD is all about science and science algorithms • Minimal time and budget for building internal systems

  15. Takeaways • Hadoop is proven and tested • Massively scalable out of the box • Cloud-based instances available from Amazon and others • Shortest path to processing massive amounts of data • Extremely tolerant of hardware failure • No specialized hardware or software needed • Flexible job API allows existing software skills to be leveraged • Industry adoption means support skills are available

  16. Questions
