Dan Bassett, Jonathan Canfield December 13, 2011
What is Hadoop?
• Allows for the distributed processing of large data sets across clusters of computers
• Open-source project written in Java
• Actively supported
• Inspired by a project that Google started
What’s the big deal?
• Changes the economics and dynamics of large-scale computing
• Scalable
• Cost effective
• Flexible
• Fault tolerant
Commercially supported
• InfoSphere BigInsights
• Silicon Graphics CloudRack
• EMC Greenplum
• Google App Engine
• Oracle Big Data Appliance
• Cloudera CDH, Professional Services
• Microsoft Windows Server, SQL Server
Prominent Users
• Facebook - claims to have the largest Hadoop cluster in the world, at 30 PB
• Yahoo! - claims to have the world’s largest Hadoop production application
• eBay - 5.3 PB on a 532-node cluster
• New York Times - processed 4 TB of image data into 11 million PDFs at a cost of about $240
Architecture
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• MapReduce Engine
File System (HDFS)
• One big file system from many nodes
• Fault-tolerant
• Runs on low-cost commodity hardware
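The idea of "one big file system from many nodes" that survives a node failure can be sketched in a few lines of Python. This is a toy model only, not HDFS code or its API: the names `store_file` and `read_file`, the round-robin placement, and the tiny block size are all assumptions made for illustration.

```python
def store_file(data: bytes, nodes: list[str], block_size: int, replication: int):
    """Split data into blocks and copy each block to `replication` distinct nodes.

    Hypothetical round-robin placement; real HDFS uses its own placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    storage = {node: {} for node in nodes}  # node name -> {block id: block bytes}
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for block_id, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(block_id + r) % len(nodes)]
            storage[node][block_id] = block
    return storage, len(blocks)


def read_file(storage: dict, num_blocks: int, failed: frozenset = frozenset()) -> bytes:
    """Reassemble the file, reading each block from any node that is still up."""
    out = bytearray()
    for block_id in range(num_blocks):
        for node, held in storage.items():
            if node not in failed and block_id in held:
                out += held[block_id]
                break
        else:
            # Every node holding this block has failed.
            raise IOError(f"block {block_id} has no surviving replica")
    return bytes(out)
```

With two replicas per block, the file in this sketch can still be read after any single node is marked as failed; losing every node that holds a given block raises an error.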
MapReduce Engine
• Splits input data
• Assigns work to nodes
• Processed in parallel
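The three steps above can be sketched as a word count in plain Python. This is a minimal illustration, not Hadoop's Java API: worker threads stand in for cluster nodes, and the function names (`split_input`, `map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical. The grouping step between map and reduce is not listed on the slide but is part of the MapReduce model.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_input(lines, num_splits):
    """Split the input into roughly equal chunks, one per worker."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_words(split):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped):
    """Group the emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines, num_workers=3):
    splits = split_input(lines, num_workers)
    # Threads play the role of nodes; each one handles one split.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_words, splits))
        groups = shuffle(mapped)
        reduced = list(pool.map(lambda kv: reduce_counts(*kv), groups.items()))
    return dict(reduced)
```

Calling `word_count(["the quick fox", "the lazy dog"])` returns a dictionary mapping each word to its count. In a real cluster the splits would be file blocks and the workers would be separate machines.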
Resources
• Project Home: http://hadoop.apache.org/
• Wikipedia: http://en.wikipedia.org/wiki/Apache_Hadoop
• IBM: http://www-01.ibm.com/software/data/infosphere/hadoop/