Satellite Image Processing And Production With Apache Hadoop
David V. Hill, Information Dynamics, Contractor to USGS/EROS
12/08/2011
Overview • Apache Hadoop • Applications, Environment and Use Case • Log Processing Example • EROS Science Processing Architecture (ESPA) and Hadoop • ESPA Processing Example • ESPA Implementation Strategy • Performance Results • Thoughts, Notes and Takeaway • Questions
Apache Hadoop – What is it? • Open source distributed processing system • Designed to run on commodity hardware • Widely used for solving “Big Data” challenges • Has been deployed in clusters with thousands of machines and petabytes of storage • Two primary subsystems: Hadoop Distributed File System (HDFS) and the MapReduce engine
Hadoop’s Applications • Web content indexing • Data mining • Machine learning • Statistical analysis and modeling • Trend analysis • Search optimization • … and of course, satellite image processing!
Hadoop’s Environment • Runs on Linux and Unix • Java-based, but relies on ssh for job distribution • Jobs may be written in any language executable from a shell prompt • Java, C/C++, Perl, Python, Ruby, R, Bash, etc.
Hadoop’s Use Case • A group of machines is configured into a Hadoop cluster • Each machine contributes: • Local compute resources to MapReduce • Local storage resources to HDFS • Files are stored in HDFS • File sizes are typically measured in gigabytes and terabytes • A job is run against an input file in HDFS • The target input file is specified • The code to run against that input is also specified
Hadoop’s Use Case • Unlike traditional systems, which move data to the code, Hadoop flips this and moves code to the data • Two software functions comprise a MapReduce job: • A map operation • A reduce operation • Upon execution: • Hadoop locates the chunks of the input file, ships the code to those nodes and executes it • The “Map” • Sorts the map results and aggregates the final answer (single-threaded) • The “Reduce” • A minimal streaming example follows
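To make the map/reduce split concrete, here is a minimal Hadoop Streaming sketch in the spirit of the log-processing example from the overview. It is hypothetical, not the actual EROS code: the mapper assumes a web-access-log layout with the HTTP status in the ninth whitespace-delimited field, and the reducer totals the per-status counts.

    # wc_mapper.py -- emit "<status>\t1" for each log line (assumed access-log layout)
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:                # skip malformed lines
            print("%s\t1" % fields[8])     # field 8: HTTP status code

    # wc_reducer.py -- streaming input arrives sorted by key,
    # so per-key counts can be summed in a single pass
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

As the later “Thoughts and Notes” slide points out, this pair can be smoke-tested without Hadoop at all: cat access.log | ./wc_mapper.py | sort | ./wc_reducer.py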
ESPA and Hadoop • In the typical use case, Hadoop’s map code runs in parallel over the input (e.g., a log file) • The goal is to process a single input file as quickly as possible • Reduce code then runs on the mapper output • ESPA processes satellite images, not text • The science algorithms cannot be parallelized within an image • So the satellite images themselves cannot be used as the input • Solution: use a text file listing the image locations as the input, and skip the reduce step • Rather than parallelizing within an image, ESPA processes many images at once (a sketch follows)
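A minimal sketch of this map-only pattern, with hypothetical names (the scene URLs and the process_scene helper are illustrative, not the real ESPA code): each line of the input file names one scene, and each map task fetches and processes the scenes in its split independently.

    # espa_mapper.py -- map-only task: one scene location per input line
    import sys
    import subprocess

    def process_scene(url):
        # Placeholder for the real work: pull the scene to local disk,
        # then run the science algorithms against it.
        subprocess.check_call(["wget", "-q", url])   # assumed ftp/http fetch
        # ... atmospheric correction, etc. ...

    for line in sys.stdin:
        scene_url = line.strip()
        if scene_url:
            process_scene(scene_url)
    # Nothing is emitted: the job is configured with no reduce step.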
Implementation Strategy • LSRD is budget-constrained for hardware • Other projects regularly excess (retire) old hardware when its warranty expires • Take ownership of these systems… if they fail, they fail • Also ‘borrow’ compute and storage from other projects • Only network connectivity is necessary • The current cluster is 102 cores, acquired at minimal expense • Cables, switches, etc.
Performance Results • The original throughput requirement was 455 atmospherically corrected Landsat scenes per day • Currently able to process ~4,800! • The biggest bottleneck is local-machine storage I/O • This stems from FTPing files to each node instead of using HDFS as intended • A RAM disk was tried, but there was not enough memory • Currently evaluating solid-state disks
Thoughts and Notes • The number of splits on an input file can be controlled via the dfs.block.size parameter, which is applied when the file is written to HDFS • This in turn controls the number of map tasks run against that file (a sketch follows) • Unlike other Hadoop instances, an ESPA-like implementation does not require massive storage • The input files (scene lists) are very small • Robust internal job monitoring mechanisms are usually custom-built
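Here is a hypothetical driver illustrating both knobs with the Hadoop 1.x-era streaming interface (the paths, file names, and jar location are assumptions, not the actual ESPA configuration). The scene list is written to HDFS with a small block size so it splits into many chunks, one map task each, and the job sets mapred.reduce.tasks=0 to skip the reduce step.

    # run_espa_job.py -- hypothetical job driver
    import subprocess

    # Write the tiny scene list with a 1 MB block size so it is split
    # into many chunks; each chunk becomes one map task.
    subprocess.check_call([
        "hadoop", "fs", "-D", "dfs.block.size=1048576",
        "-put", "scene_list.txt", "/espa/input/scene_list.txt",
    ])

    # Launch the map-only streaming job; mapred.reduce.tasks=0 skips the reduce.
    subprocess.check_call([
        "hadoop", "jar", "hadoop-streaming.jar",   # jar path is installation-specific
        "-D", "mapred.reduce.tasks=0",
        "-input", "/espa/input/scene_list.txt",
        "-output", "/espa/output",
        "-mapper", "espa_mapper.py",
        "-file", "espa_mapper.py",
    ])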
Thoughts and Notes • Jobs written for Hadoop Streaming may be tested and run without Hadoop • cat inputfile.txt | ./mapper.py | sort | ./reducer.py > out.txt • Projects can share resources • Hadoop is tunable to restrict resource utilization on a per-machine basis • Provides instant productivity gains versus internal development • LSRD is all about science and science algorithms • There is minimal time and budget for building internal systems
Takeaways • Hadoop is proven and tested • Massively scalable out of the box • Cloud-based instances are available from Amazon and others • The shortest path to processing massive amounts of data • Extremely tolerant of hardware failure • No specialized hardware or software needed • A flexible job API allows existing software skills to be leveraged • Industry adoption means support skills are readily available