HADOOP Nagarjuna K nagarjuna@outlook.com
Why and What is Hadoop? • A tool to process big data
What is BIG Data? • Facebook, Google+, etc. generate enormous amounts of user data • Machines, too, generate lots of data • We are having an online discussion right now; even the record of who attended this conference becomes data.
What is BIG Data? ...continued • Exponential growth of data posed challenges to Google, Yahoo, Microsoft and Amazon • They needed to go through TBs and PBs of data to answer questions such as: • Which websites and books were popular? • What kind of ads appealed to users? • Existing tools became inadequate for processing such large data sets.
Why is the data so BIG? • Until a couple of decades ago: floppy disks • Then: CD/DVD drives • Half a decade ago: hard drives (500 GB) • Now: 1 TB hard drives are available in abundance
Why is the data so BIG? • So what? • Even the technology to read the data has taken a leap forward.
How to handle data this BIG? • One BIG elephant, or • numerous small chickens?
How to handle data this BIG? • Think of the concept behind torrents: reduce the time to read by reading from multiple sources simultaneously. • Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read all of it in less than two minutes (see the back-of-the-envelope sketch below).
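A back-of-the-envelope sketch of the parallel-read argument above. The figures used here (a 1 TB data set and roughly 100 MB/s transfer rate per drive) are illustrative assumptions, not numbers from the slides.

    public class ReadTime {
        public static void main(String[] args) {
            double dataMB = 1e6;          // assumed: ~1 TB of data, expressed in MB
            double mbPerSec = 100;        // assumed: ~100 MB/s transfer rate per drive
            double oneDrive = dataMB / mbPerSec;       // ~10,000 s, roughly 2.8 hours
            double hundredDrives = oneDrive / 100;     // ~100 s, under two minutes
            System.out.printf("1 drive: %.0f s, 100 drives: %.0f s%n",
                    oneDrive, hundredDrives);
        }
    }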
How to handle data this BIG? -- Issues • How do we handle a system's ups and downs (machine failures)? • How do we combine the data from all the systems?
Problem 1: System's Ups and Downs • Commodity hardware is used for data storage and analysis • Chances of failure are very high • So, keep redundant copies of the same data across several machines • If one machine fails, another still has the data • Google came up with a file system, GFS (Google File System), that implements all of these details.
GFS • Divides data into chunks and stores them across the file system • Can store data sets ranging up to PBs
Problem 2: How to combine the data? • We analyze data on different machines, but how do we merge the results into a meaningful outcome? • Some (or all) of the data has to travel across the network before it can be merged. • Doing this correctly is notoriously challenging. • Again, Google's answer: MapReduce.
MapReduce • Provides a programming model that abstracts away disk reads and writes, transforming the problem into a computation over keys and values. • Two phases: • Map • Reduce
So what is Hadoop? • An operating system? Not quite. It provides: • A reliable shared storage system • An analysis system
History of Hadoop • Google was the first to build GFS and MapReduce • In 2004 they published a paper announcing this brand-new technology to the world • The technology was already well proven inside Google by 2004 (the MapReduce paper by Google)
History of Hadoop • Doug Cutting saw an opportunity and led the charge to develop an open-source version of this MapReduce system, called Hadoop. • Soon after, Yahoo and others rallied around to support the effort. • Now Hadoop is a core part of: • Facebook, Yahoo, LinkedIn, Twitter …
History of Hadoop • GFS → HDFS • MapReduce → MapReduce
HDFS -- A Brief • Design: streaming very large files on a commodity cluster. • Very large files • MBs to PBs • Streaming • Write-once, read-many approach • Once huge data has been placed, we tend to read it, not modify it • Time to read the whole data set is more important than the latency of any single read • Commodity cluster • No high-end servers • Yes, a high chance of failure (but HDFS is tolerant enough) • Replication is done to cope with failures (a small usage sketch follows below)
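A minimal sketch of the write-once, read-many pattern, using the Java FileSystem API shipped with hadoop-0.20. The path /user/demo/sample.txt is hypothetical, and the code assumes the cluster configuration (core-site.xml / hdfs-site.xml) is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // Write once: HDFS splits the stream into blocks and replicates them.
            FSDataOutputStream out = fs.create(file);
            out.writeUTF("hello hdfs");
            out.close();

            // Read many: later jobs stream the file; we read the data, we do not modify it.
            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();
        }
    }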
MapReduce -- A Brief • Large-scale data processing in parallel. • MapReduce provides: • Automatic parallelization and distribution • Fault tolerance • I/O scheduling • Status and monitoring • Two phases in MapReduce: • Map • Reduce
MapReduce -- A Brief • Map phase • map (in_key, in_value) -> list(out_key, intermediate_value) • Processes an input key/value pair • Produces a set of intermediate pairs • Reduce phase • reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one) • A WordCount sketch follows below.
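To make the two signatures concrete, here is the classic WordCount example, sketched against the org.apache.hadoop.mapreduce API available in hadoop-0.20. The class names are illustrative, and the job driver/setup code is omitted.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(in_key, in_value) -> list(out_key, intermediate_value)
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);                  // emit (word, 1)
            }
        }
    }

    // reduce(out_key, list(intermediate_value)) -> list(out_value)
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                            // combine all values for this key
            }
            context.write(key, new IntWritable(sum));      // emit (word, total count)
        }
    }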
Hadoop Cluster
Versions of Hadoop • We will deal with either of: • Apache hadoop-0.20 • Cloudera Hadoop (cdh3)
Pre-Requisites • Core Java • Acquaintance with Linux will help. • For a better experience: a Linux installation on your machine.
Thank you • Your feedback is very important for improving our course material and teaching methodologies. • Please email your suggestions to nagarjuna@outlook.com