HADOOP Nagarjuna K nagarjuna@outlook.com
Why and What is Hadoop? • A tool to process big data
What is BIG Data? • Facebook, Google+, etc. generate enormous amounts of user data • Machines, too, generate lots of data • We are having an online discussion right now; even the record of who attended this conference becomes data.
What is BIG Data? ...continued • Exponential growth of data posed challenges to Google, Yahoo, Microsoft and Amazon • They needed to go through TBs and PBs of data to answer questions such as: • Which websites and books were popular? • What kind of ads appealed to users? • Existing tools became inadequate for processing such large data sets.
Why is the data so BIG? • Until a couple of decades ago: floppy disks • Then: CD/DVD drives • Half a decade ago: hard drives (500 GB) • Now: 1 TB hard drives are available in abundance
Why is the data so BIG? • So what? • Even the technology to read the data has taken a leap forward.
How to handle data this BIG? • One BIG elephant, or • numerous small chickens?
How to handle data this BIG? • Think of the concept behind torrents: reduce the time to read by reading from multiple sources simultaneously. • Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read all of it in less than two minutes (see the back-of-the-envelope sketch below).
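A back-of-the-envelope sketch of the parallel-read argument above. The figures used here (a 1 TB data set and roughly 100 MB/s transfer rate per drive) are illustrative assumptions, not numbers from the slides.

    public class ReadTime {
        public static void main(String[] args) {
            double dataMB = 1e6;          // assumed: ~1 TB of data, expressed in MB
            double mbPerSec = 100;        // assumed: ~100 MB/s transfer rate per drive
            double oneDrive = dataMB / mbPerSec;       // ~10,000 s, roughly 2.8 hours
            double hundredDrives = oneDrive / 100;     // ~100 s, under two minutes
            System.out.printf("1 drive: %.0f s, 100 drives: %.0f s%n",
                    oneDrive, hundredDrives);
        }
    }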
How to handle data this BIG? -- Issues • How do we handle a system's ups and downs (machine failures)? • How do we combine the data from all the systems?
Problem 1: System's Ups and Downs • Commodity hardware is used for data storage and analysis • Chances of failure are very high • So, keep redundant copies of the same data across several machines • If one machine fails, another still has the data • Google came up with a file system, GFS (Google File System), that implements all of these details.
GFS • Divides data into chunks and stores them across the file system • Can store data sets ranging up to PBs
Problem 2: How to combine the data? • We analyze data on different machines, but how do we merge the results into a meaningful outcome? • Some (or all) of the data has to travel across the network before it can be merged. • Doing this correctly is notoriously challenging. • Again, Google's answer: MapReduce.
MapReduce • Provides a programming model that abstracts away disk reads and writes, transforming the problem into a computation over keys and values. • Two phases: • Map • Reduce
So what is Hadoop? • An operating system? Not quite. It provides: • A reliable shared storage system • An analysis system
History of Hadoop • Google was the first to build GFS and MapReduce • In 2004 they published a paper announcing this brand-new technology to the world • The technology was already well proven inside Google by 2004 (the MapReduce paper by Google)
History of Hadoop • Doug Cutting saw an opportunity and led the charge to develop an open-source version of this MapReduce system, called Hadoop. • Soon after, Yahoo and others rallied around to support the effort. • Now Hadoop is a core part of: • Facebook, Yahoo, LinkedIn, Twitter …
History of Hadoop • GFS → HDFS • MapReduce → MapReduce
HDFS -- A Brief • Design: streaming very large files on a commodity cluster. • Very large files • MBs to PBs • Streaming • Write-once, read-many approach • Once huge data has been placed, we tend to read it, not modify it • Time to read the whole data set is more important than the latency of any single read • Commodity cluster • No high-end servers • Yes, a high chance of failure (but HDFS is tolerant enough) • Replication is done to cope with failures (a small usage sketch follows below)
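A minimal sketch of the write-once, read-many pattern, using the Java FileSystem API shipped with hadoop-0.20. The path /user/demo/sample.txt is hypothetical, and the code assumes the cluster configuration (core-site.xml / hdfs-site.xml) is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // Write once: HDFS splits the stream into blocks and replicates them.
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("hello hdfs");
            out.close();

            // Read many: later jobs stream the file; we read the data, we do not modify it.
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();
        }
    }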
MapReduce -- A Brief • Large-scale data processing in parallel. • MapReduce provides: • Automatic parallelization and distribution • Fault tolerance • I/O scheduling • Status and monitoring • Two phases in MapReduce: • Map • Reduce
MapReduce -- A Brief • Map phase • map (in_key, in_value) -> list(out_key, intermediate_value) • Processes an input key/value pair • Produces a set of intermediate pairs • Reduce phase • reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one) • A WordCount sketch follows below.
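To make the two signatures concrete, here is the classic WordCount example, sketched against the org.apache.hadoop.mapreduce API available in hadoop-0.20. The class names are illustrative, and the job driver/setup code is omitted.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(in_key, in_value) -> list(out_key, intermediate_value)
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);                  // emit (word, 1)
            }
        }
    }

    // reduce(out_key, list(intermediate_value)) -> list(out_value)
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                            // combine all values for this key
            }
            context.write(key, new IntWritable(sum));      // emit (word, total count)
        }
    }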
Hadoop Cluster
Versions of Hadoop • We will deal with either of: • Apache hadoop-0.20 • Cloudera Hadoop (cdh3)
Pre-Requisites • Core Java • Acquaintance with Linux will help. • For a better experience: a Linux installation on your machine.
Thank you • Your feedback is very important for improving our course material and teaching methodologies. • Please email your suggestions to nagarjuna@outlook.com