100062108 李智宇、100062116 林威宏、100062220 施閔耀

Outline
• Introduction
• Architecture of Hadoop
• HDFS
• MapReduce
• Comparison
• Why Hadoop
• Conclusion

What is Hadoop?
• Open-source software framework
• Processes and stores big data
• Easy to use and implement, economical, flexible
• Runs on many nodes (servers)
• Written in Java
• Free license
• Created by Doug Cutting and Mike Cafarella in 2005

Advantages of Interpreted Language
• Cross-platform (e.g. Windows, Ubuntu, Mac OS X)
• Smaller executable program size
• Easier to modify during both development and execution

Architecture of Hadoop

Hadoop in Enterprise
[Figure: the Dell representation of the Hadoop ecosystem]

Hadoop in Enterprise

Who is using Hadoop?
• By 2013, more than half of the Fortune 50 used Hadoop

HDFS
• Hadoop Distributed File System
• Client: the user
• Name node: manages and stores metadata and the file namespace
• Data node: stores files
• Each data node periodically sends its status to the name node
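
The periodic status report from each data node can be pictured as a heartbeat check. The following is a minimal Python sketch, not Hadoop's actual implementation: the `NameNode` class, its method names, and the timeout value are all hypothetical and exist only for illustration.

```python
import time


class NameNode:
    """Toy name node that tracks data node liveness from status reports."""

    def __init__(self, timeout_seconds):
        # Assumed rule: a node is considered dead if it has been silent
        # for longer than timeout_seconds.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # data node id -> time of last status report

    def heartbeat(self, node_id, now=None):
        """Record a status report from a data node."""
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        """Return the ids of nodes whose last report is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Passing `now` explicitly makes the sketch easy to try without waiting: a node that reported at time 0 and never again shows up in `dead_nodes(now=12)` when the timeout is 10.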

HDFS: Writing data in HDFS
• Each file is divided into blocks (64 or 128 MB each), and each block has three copies on different data nodes.
• The client asks the name node for a list of data nodes sorted by distance and sends the file to the nearest one; that data node then forwards the file to the remaining nodes.
• When the operation above is done, the data node sends "done" to the name node.
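
The write path above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the HDFS client API: the function names, the `distances` dictionary, and the in-memory `storage` dictionary are hypothetical, and the 64 MB block size and three replicas are taken from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, one of the block sizes named on the slide
REPLICATION = 3                # three copies per block, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) ranges that cover a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def choose_pipeline(distances, replication=REPLICATION):
    """Pick the nearest data nodes, nearest first.

    distances maps a data node id to its (assumed) distance from the client.
    """
    return sorted(distances, key=lambda node: (distances[node], node))[:replication]


def write_block(block_id, pipeline, storage):
    """Store the block on each node in pipeline order and return the nodes that hold it.

    The loop stands in for forwarding: the client only talks to pipeline[0],
    and each node passes the block on to the next one.
    """
    stored_on = []
    for node in pipeline:
        storage.setdefault(node, set()).add(block_id)
        stored_on.append(node)
    return stored_on
```

For example, a 150 MB file yields three block ranges (64 MB, 64 MB, and 22 MB), and each block would be written through the three nearest data nodes.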

HDFS: Reading data in HDFS
• The client sends the filename to the name node, which returns a list of the file's blocks, with the data nodes holding each block sorted by distance.
• The client uses the list to get the file from the data nodes.
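
The two read steps can be modeled as one lookup on the name node followed by fetches from the data nodes. The Python sketch below is illustrative only: `locate_blocks`, `read_file`, and the dictionary layouts for `namespace`, `distances`, and `data_nodes` are assumptions, not HDFS data structures.

```python
def locate_blocks(namespace, filename, distances):
    """Name node side: list each block of the file with its replicas, nearest first.

    namespace maps a filename to a list of (block_id, replica_nodes) pairs.
    distances maps a data node id to its (assumed) distance from the client.
    """
    return [
        (block_id, sorted(replicas, key=lambda node: (distances[node], node)))
        for block_id, replicas in namespace[filename]
    ]


def read_file(block_locations, data_nodes):
    """Client side: fetch every block from its nearest data node and join the bytes.

    data_nodes maps a data node id to a dict of block_id -> block contents.
    """
    return b"".join(
        data_nodes[nodes[0]][block_id] for block_id, nodes in block_locations
    )
```

Note that the file contents never pass through the name node in this model; it only answers the question of where the blocks are.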

HDFS: failure
• Node failure
• Communication failure
• Data corruption

HDFS: handle failure
• Handling a write failure: the name node skips any data node that does not return an ACK.
• Handling a read failure: recall that when reading a file, the client gets a list of data nodes containing the file, so it can fall back to another node on the list.
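
The read-failure case can be sketched as trying each data node on the list in turn. This is a hedged Python illustration rather than HDFS client code: `read_block` and the `data_nodes` dictionary are hypothetical, and a missing dictionary entry is used to stand in for an unreachable node.

```python
def read_block(block_id, candidates, data_nodes):
    """Try each candidate data node in order and return (node, data) from the first that works.

    data_nodes maps a data node id to a dict of block_id -> block contents.
    A node that is absent from data_nodes models a failed or unreachable node.
    """
    for node in candidates:
        blocks = data_nodes.get(node)
        if blocks is not None and block_id in blocks:
            return node, blocks[block_id]
    raise IOError(f"no live replica of block {block_id!r}")
```

If every node on the list is down, the sketch raises an error, which corresponds to the case where no copy of the data is reachable.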

HDFS: handle failure
• Handling node failure: the name node finds out which data the failed node held, copies that data from the other replicas, and stores it on other data nodes.
• Note that HDFS cannot guarantee that at least one copy of the data stays alive.
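
The recovery step can be sketched as a re-replication pass over the name node's block map. The Python below is a simplified model under assumptions: `re_replicate`, the `block_map` layout, and the rule for choosing source and target nodes are hypothetical, and the target of three replicas comes from the earlier slide.

```python
def re_replicate(block_map, live_nodes, failed_node, replication=3):
    """Plan copies that restore the replica count after failed_node goes down.

    block_map maps a block id to the set of nodes holding it (updated in place).
    Returns (copies, lost): copies is a list of (block_id, source, target),
    and lost lists blocks whose only replicas were on the failed node.
    """
    copies, lost = [], []
    for block_id in sorted(block_map):
        holders = block_map[block_id]
        if failed_node not in holders:
            continue
        holders.discard(failed_node)
        if not holders:
            # No surviving replica: this is the case HDFS cannot protect against.
            lost.append(block_id)
            continue
        source = min(holders)
        candidates = sorted(node for node in live_nodes if node not in holders)
        while len(holders) < replication and candidates:
            target = candidates.pop(0)
            copies.append((block_id, source, target))
            holders.add(target)
    return copies, lost
```

The `lost` list makes the caveat on the slide concrete: a block whose replicas were all on failed nodes cannot be restored from anywhere.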

MapReduce
• Similar to divide-and-conquer
• First, use "Map" to divide tasks
• Second, use "Shuffle" to "transfer the data from the mapper nodes to a reducer's node and decompress if needed."
• Third, use "Reduce" to "execute the user-defined reduce function to produce the final output data."
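
The three phases can be illustrated with the classic word-count example. This is a single-process Python sketch of the idea, not Hadoop's Java MapReduce API: the function names are hypothetical, and the shuffle here only groups values by key in memory instead of moving data between nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_outputs):
    """Shuffle: group all emitted values by key, as if routing each key to one reducer."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce: apply the user-defined reduce function (here, sum) to each key's values."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    return reduce_phase(shuffle_phase(map_phase(doc) for doc in documents))
```

Each call to `map_phase` depends only on its own document, which is why the map step can be spread across many nodes in a real cluster.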

MapReduce-Map

MapReduce-Shuffle

MapReduce-Reduce

MapReduce

Comparison

Why Hadoop? Technically
[Figure: comparison of Grep task results with Vertica and DBMS-X]

Why Hadoop? Technically
• Simple structure vs. optimization
• Transaction time not minimized
• Lower performance with the same number of nodes
• No compelling reason to choose Hadoop

Why Hadoop? Commercially

Why Hadoop? Commercially
• Cheap (buy more servers to beat a DBMS)
• Flexible (both in design and deployment)
• Easier to design
• Easier to scale up
• Can be combined with other systems to achieve better performance

Conclusion
• Hadoop is much easier for users to implement and more economical
• MapReduce advocates should study the techniques used in parallel DBMSs
• Hybrid systems are also popular
• As its performance improves, we believe Hadoop will lead the trend in big data computing

Reference
• http://hadoop.apache.org/
• http://www.runpc.com.tw/content/cloud_content.aspx?id=105318
• http://en.wikipedia.org/wiki/Apache_Hadoo
• https://www.facebookbrand.com/
• http://assets.fontsinuse.com/static/use-media-items/15/14246/full-2048x768/522903b7/Yahoo_Logo.png
• http://wiki.apache.org/hadoop/PoweredBy
• http://semiaccurate.com/assets/uploads/2011/09/Amazon-logo.jpg
• http://www.conceptcupboard.com/blog/wp-content/uploads/2013/09/google.jpg

Reference
• http://datashieldcorp.com/files/2013/11/adobe-LOGO-2.jpg
• http://upload.wikimedia.org/wikipedia/commons/7/77/The_New_York_Times_logo.png
• http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf
• http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf
• http://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fwww.classcloud.org%2Fcloud%2Fraw-attachment%2Fwiki%2FHinet100402%2F02.HadoopOverview.pdf&ei=IE2XUtLfBMfxiAea_oHQCA&usg=AFQjCNFoIXxLJrOnoul4cKJpQ8v3_kuTYg

Reference
• http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf
• https://www.google.com.tw/url?sa=t&rct=j&q&esrc=s&source=web&cd=1&ved=0CCkQFjAA&url=http%3A%2F%2Fwww.psgtech.edu%2Fyrgcc%2Fattach%2FMAP%2520REDUCE%2520PROGRAMMING.ppt&ei=7lGXUtvCJsy5iAfWtYH4Bw&usg=AFQjCNGWRKJLal-tvbvORULZV6_Te2y74g&sig2=Ba77ihsV1SEqcNeEFkRzfg
• https://www.cs.duke.edu/starfish/files/hadoop-models.pdf
• http://dotnetmis91.blogspot.tw/2010/04/hdfs-hadoop-mapreduce.html
• http://wiki.apache.org/hadoop/HDFS
• http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html

Reference
• http://en.wikipedia.org/wiki/Interpreted_language
• A Comparison of Approaches to Large-Scale Data Analysis by Sam Madden
• http://www.cc.ntu.edu.tw/chinese/epaper/0011/20091220_1106.htm
• http://web.cs.wpi.edu/~cs561/s12/Lectures/6/Hadoop.pdf
• http://www.mobilemartin.com/mobile/show-me-the-mobile-money.jpg
