Let's Break It Up: Using Informix with Hadoop Pradeep Natarajan Session: C13 IBM Corp. Tue 5/17/2011 04:40p
Agenda • Hadoop – What? • History • Hadoop – Why? • Hadoop Distributed File System • MapReduce algorithm • Hadoop – How? • Informix and Apache Hadoop Session C13
Session C13 Hadoop Overview • What is Hadoop? • Framework for very large scale data processing • Open source Apache project • Written in Java • Runs on Linux, Mac OS X, Windows, and Solaris • Hadoop core • Distributed file system • API & implementation of MapReduce • Web-based interface to monitor the cluster’s health
Hadoop Timeline • 2003 – Google’s GFS paper • 2004 – MapReduce paper • 2005 – Nutch using MapReduce • 2006 – Hadoop moves out of Nutch • 2007 – Yahoo! running a 1,000-node Hadoop cluster • 2008 – Hadoop becomes a top-level Apache project • 2010 – IBM introduces a portfolio of solutions & services for Big Data: IBM InfoSphere BigInsights Session C13
Session C13 Hadoop Overview • Why Hadoop? • Large volume of data • 100s of terabytes or petabytes of data • Need to scale out (i.e., across lots of nodes) • Distributed file system • Use cheap commodity hardware • In large clusters, nodes will fail • Automatic failover • Fault tolerance through data replication • Common infrastructure across all nodes
Hadoop Cluster (source: Apache Hadoop) • Typically a 2-level architecture • Nodes are commodity Linux PCs • 40 nodes/rack • Uplink from rack is 8 gigabit • Rack-internal is 1 gigabit Session C13
Session C13 Hadoop Overview • When should you use Hadoop? • Processing lots of unstructured data • Ex. Web search, image analysis, searching log files • Parallelization is possible • Running batch jobs is acceptable • Access to cheap hardware (or possibly a public cloud)
Session C13 Hadoop Overview • When NOT to use Hadoop? • Processor-intensive operations with little data • Ex. calculating the 1,000,000th digit of π (Pi) • Job is not easily parallelizable • Data is not self-contained • Need interactive processing or state-aware computation
Session C13 Hadoop Overview • Hadoop is NOT … • a replacement for an RDBMS • suitable for indexed/structured data • a substitute for ALL your data warehouses • a substitute for a high-availability, SAN-hosted file system • a POSIX file system
Powered By Hadoop Session C13
Session C13 Hadoop Distributed File System (HDFS) • Petabyte-scale file system for the cluster • Single name node for the cluster • Files are write-once / append-only (no random writes) • Optimized for streaming reads of large files • Data split into large blocks • Block size = 128 MB (as opposed to 4 KB in Unix) • Blocks are replicated to multiple data nodes
Session C13 HDFS Source: Apache Hadoop
Session C13 HDFS • Client • Intelligent • Talks to the name node to find location of blocks • Accesses data directly from the nearest data node replicas • Can only append to existing files
Session C13 HDFS • Name Node • Single name node per cluster • Manages file system namespace and metadata • Maps a file name to a set of blocks • Maps a block to the data nodes (replicas) • Data Node • Lots of them (1000s) • Manages data blocks and sends them to the client • Data is replicated; failure is expected
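The division of labor above — name node for metadata, data nodes for block data — is hidden behind the HDFS client API. Below is a minimal, illustrative sketch (not from the presentation) of writing and then reading a file through Hadoop's Java FileSystem API; the class name and the /user/demo path are hypothetical, and the name node URI is assumed to come from core-site.xml on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            // Reads the name node URI (fs.default.name) from core-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/greeting.txt");   // hypothetical path

            // Write: the name node picks target data nodes; the client then
            // streams the block(s) directly to those data nodes.
            FSDataOutputStream out = fs.create(file, true);
            out.write("Hello, HDFS\n".getBytes("UTF-8"));
            out.close();

            // Read: the name node returns the block locations; the client
            // pulls the data from the nearest replica.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), "UTF-8"));
            System.out.println(in.readLine());
            in.close();

            fs.close();
        }
    }

The same operations are also available from the command line, e.g. bin/hadoop fs -put and bin/hadoop fs -cat.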
Session C13 HDFS – File write Source: Isabel Drost, FOSDEM 2010
Session C13 HDFS – File Read Source: Isabel Drost, FOSDEM 2010
Session C13 MapReduce Programming Model • Targets data-intensive computations • Input data format – specified by user • Output – <key, value> pairs • Map & Reduce – user-specified algorithms • Data flow: Input → Map → intermediate <k, v> pairs → Reduce → Output
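To make the <key, value> flow concrete, here is a hedged sketch of the classic word-count Mapper and Reducer written against the org.apache.hadoop.mapreduce API; the class names are illustrative and are not part of the original deck.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: one line of input -> one <word, 1> pair per word
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emit intermediate <k, v>
            }
        }
    }

    // Reduce: all counts for one word -> a single <word, total> pair
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

The framework groups the intermediate pairs by key, so each word arrives at the reducer together with all of its emitted counts.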
Session C13 MapReduce Programming Model Source: Owen O’Malley, Yahoo!
Session C13 Hadoop MapReduce • Job tracker • One per cluster • Receives job requests from client • Schedules and monitors MR jobs on task trackers • Task tracker • Lots of them • Execute MR operations • Read blocks from data nodes
Session C13 Hadoop MapReduce Source: Isabel Drost, FOSDEM 2010
Session C13 Hadoop MapReduce Source: Isabel Drost, FOSDEM 2010
Session C13 Using Apache Hadoop • Requirements: Linux, Java 1.6, sshd, rsync • Configure SSH • Unpack Hadoop • Edit a few configuration files • Format the DFS on the name node • Start all the daemon processes
Session C13 Using Apache Hadoop • Steps for running a Hadoop job: • Compile your job into a JAR file • Copy input data into HDFS • Execute bin/hadoop jar with relevant args • Monitor tasks via Web interface (optional) • Examine output when job is complete
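The "compile your job into a JAR" and "execute bin/hadoop jar" steps assume the JAR contains a driver class that configures and submits the job. A minimal sketch, assuming the hypothetical WordCountMapper/WordCountReducer classes from the earlier MapReduce slide and the Hadoop 0.20-era API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver class packaged into the job JAR
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input already copied into HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet

            // Submits the job to the job tracker and waits for completion
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A typical invocation (JAR name and paths hypothetical): bin/hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output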
Session C13 Informix and Hadoop • Sqoop (SQL-to-Hadoop) • Command-line tool • Connects Hadoop and traditional database systems • Imports tables/databases from a DBMS into HDFS • Generates Java classes to interact with the imported data • Exports MR results back to a database
Session C13 Informix and Hadoop • Sqoop sits between Hadoop and the Informix database and uses a JDBC connection • IBM Data Server Driver for JDBC (DRDA protocol): sqoop --connect jdbc:ids://myhost.ibm.com:9198/stores_demo --table CUSTOMER --as-sequencefile • Informix JDBC driver (SQLI protocol): sqoop --connect jdbc:informix-sqli://myhost.ibm.com:9198/stores_demo:INFORMIXSERVER=ol_1170 --table CUSTOMER --as-sequencefile
Session C13 References • Apache Hadoop wiki – http://wiki.apache.org/hadoop/ • Apache Hadoop – http://hadoop.apache.org • Sqoop wiki – https://github.com/cloudera/sqoop/wiki/ • Cloudera Sqoop – http://www.cloudera.com/blog/2009/06/introducing-sqoop/
Questions ?!? Session C13
Let's Break It Up: Using Informix with Hadoop Pradeep Natarajan IBM Corp. pnatara@us.ibm.com (913)599-7136