100062108 李智宇、 100062116 林威宏、 100062220 施閔耀
Outline
• Introduction
• Architecture of Hadoop
• HDFS
• MapReduce
• Comparison
• Why Hadoop
• Conclusion
What is Hadoop?
• Open-source software framework
• Processes and stores big data
• Easy to use and implement, economical, flexible
• Runs on clusters of many nodes (servers)
• Written in Java
• Free license (Apache License 2.0)
• Created by Doug Cutting and Mike Cafarella in 2005
Advantages of Interpreted Language
• Cross-platform (e.g., Windows, Ubuntu, Mac OS X)
• Smaller executable program size
• Easier to modify during both development and execution
Architecture of Hadoop
Hadoop in Enterprise
The Dell representation of the Hadoop ecosystem.
Hadoop in Enterprise
Who is using Hadoop?
By 2013, more than half of the Fortune 50 were using Hadoop.
HDFS
• Hadoop Distributed File System
• Client: the user
• Name node: manages and stores the metadata and the namespace of the files
• Data node: stores the file data
• Each data node periodically sends its status to the name node
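The periodic status report can be pictured with a small sketch. This is a toy model under stated assumptions, not the HDFS implementation: the class `NameNode`, its methods, and the 30-second timeout are hypothetical names and values chosen for illustration.

```python
import time


class NameNode:
    """Toy name node that remembers when each data node last reported in."""

    def __init__(self, timeout_seconds=30.0):
        # Hypothetical threshold: a node silent for longer than this is treated as dead.
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # data node id -> time of its last status report

    def receive_heartbeat(self, node_id, now=None):
        """Record a status report from a data node."""
        self.last_heartbeat[node_id] = time.time() if now is None else now

    def dead_nodes(self, now=None):
        """Return the data nodes whose last report is older than the timeout."""
        now = time.time() if now is None else now
        return sorted(
            node_id
            for node_id, seen in self.last_heartbeat.items()
            if now - seen > self.timeout_seconds
        )


name_node = NameNode(timeout_seconds=30.0)
name_node.receive_heartbeat("dn1", now=0.0)
name_node.receive_heartbeat("dn2", now=0.0)
name_node.receive_heartbeat("dn1", now=25.0)  # dn1 keeps reporting, dn2 goes silent
print(name_node.dead_nodes(now=40.0))  # ['dn2']
```

Passing `now` explicitly keeps the example deterministic; a real deployment would rely on the clock instead.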
HDFS: Writing data in HDFS
• Each file is divided into blocks (64 MB or 128 MB each), and each block is stored as three copies on different data nodes.
• The client asks the name node for a list of data nodes sorted by distance and sends the data to the nearest one; that data node then forwards the data to the remaining nodes.
• When the operation is complete, the data node reports "done" to the name node.
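The block arithmetic and the choice of target nodes can be sketched as follows. This is a simplified illustration, assuming a 64 MB block size and a policy of always using the three nearest nodes; the function `plan_write` and the node names are hypothetical, and real HDFS placement is more involved.

```python
import math

BLOCK_SIZE_MB = 64  # the slides mention 64 MB or 128 MB; 64 is assumed here
REPLICATION = 3     # three copies of every block


def plan_write(file_size_mb, nodes_by_distance):
    """Split a file into blocks and pick the data nodes that receive each block.

    Returns a list of (block_index, block_size_mb, pipeline) tuples, where
    pipeline[0] is the nearest node (written by the client) and the remaining
    entries receive the block from the node before them.
    """
    if len(nodes_by_distance) < REPLICATION:
        raise ValueError("not enough data nodes for the requested replication")
    block_count = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    pipeline = nodes_by_distance[:REPLICATION]  # simplification: same nodes for every block
    plan = []
    for block_index in range(block_count):
        remaining = file_size_mb - block_index * BLOCK_SIZE_MB
        plan.append((block_index, min(BLOCK_SIZE_MB, remaining), list(pipeline)))
    return plan


for block in plan_write(200, ["dn3", "dn1", "dn4", "dn2"]):
    print(block)
```

For a 200 MB file this yields four blocks, three of 64 MB and a final one of 8 MB, each sent through the pipeline dn3, dn1, dn4.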
HDFS: Reading data in HDFS
• The client sends the filename to the name node, and the name node returns a list of the file's blocks, with the data nodes holding each block sorted by distance.
• The client uses the list to fetch the file from the data nodes.
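The lookup performed by the name node can be sketched as a pair of dictionary lookups followed by a sort. All names here (`locate_blocks`, `namespace`, `block_locations`, `distance`) and the sample data are hypothetical and only illustrate the idea.

```python
def locate_blocks(filename, namespace, block_locations, distance):
    """Return [(block_id, [data nodes, nearest first]), ...] for one file.

    namespace:       filename -> ordered list of block ids
    block_locations: block id -> data nodes holding a copy
    distance:        data node -> distance from the requesting client
    """
    return [
        (block_id, sorted(block_locations[block_id], key=lambda node: distance[node]))
        for block_id in namespace[filename]
    ]


namespace = {"/logs/a.txt": ["b1", "b2"]}
block_locations = {"b1": ["dn1", "dn2", "dn3"], "b2": ["dn2", "dn3", "dn4"]}
distance = {"dn1": 2, "dn2": 0, "dn3": 4, "dn4": 1}

for block_id, nodes in locate_blocks("/logs/a.txt", namespace, block_locations, distance):
    print(block_id, nodes)
```

With this sample data the client would read block b1 from dn2 first and block b2 from dn2 as well, since dn2 is the closest node in both lists.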
HDFS: failure
• Node failure
• Communication failure
• Data corruption
HDFS: handle failure
• Handling write failures: the name node skips any data node that does not return an ACK.
• Handling read failures: recall that when reading a file, the client receives a list of the data nodes that contain the file, so it can try another node on the list.
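Read-side recovery can be sketched as a loop over the replica list. This is an illustrative sketch only: `read_block` and `fake_fetch` are hypothetical names, and the transport is simulated with a function that raises `ConnectionError` for one node.

```python
def read_block(block_id, replicas_by_distance, fetch):
    """Try each replica, nearest first, and return (node, data) from the first that answers."""
    for node in replicas_by_distance:
        try:
            return node, fetch(node, block_id)
        except ConnectionError:
            continue  # this replica is unreachable, try the next one on the list
    raise IOError(f"no reachable replica for block {block_id}")


def fake_fetch(node, block_id):
    # Simulated transport (assumption for the demo): dn2 is down, other nodes return dummy bytes.
    if node == "dn2":
        raise ConnectionError(f"{node} is unreachable")
    return f"{block_id}@{node}".encode()


served_by, data = read_block("b1", ["dn2", "dn1", "dn3"], fake_fetch)
print(served_by, data)  # dn1 b'b1@dn1'
```

The nearest node (dn2) fails, so the block is served by the next node on the list.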
HDFS: handle failure
• Handling node failure: the name node determines which data the failed node held, copies that data from the other data nodes that still hold it, and stores the new copies on other data nodes.
• Note that HDFS cannot guarantee that at least one copy of the data survives.
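The recovery step, and the reason a surviving copy cannot be guaranteed, can be sketched as follows. This is a toy model assuming a replication target of three; `re_replicate` and the sample block map are hypothetical.

```python
def re_replicate(block_locations, failed_node, live_nodes, replication=3):
    """Restore the replica count of every block the failed node held.

    Updates block_locations in place and returns (copies, lost):
    copies is a list of (block_id, source_node, target_node) transfers,
    lost is a list of blocks whose only copies were on the failed node.
    """
    copies, lost = [], []
    for block_id, holders in block_locations.items():
        if failed_node not in holders:
            continue
        holders.remove(failed_node)
        if not holders:
            lost.append(block_id)  # no surviving copy to restore from
            continue
        candidates = [n for n in live_nodes if n != failed_node and n not in holders]
        while len(holders) < replication and candidates:
            target = candidates.pop(0)
            copies.append((block_id, holders[0], target))
            holders.append(target)
    return copies, lost


locations = {
    "b1": ["dn1", "dn2", "dn3"],
    "b2": ["dn2", "dn3", "dn4"],
    "b3": ["dn1"],  # under-replicated block held only by the node that fails
}
copies, lost = re_replicate(locations, "dn1", ["dn2", "dn3", "dn4"])
print(copies)  # [('b1', 'dn2', 'dn4')]
print(lost)    # ['b3']
```

Block b1 is restored to three copies, while b3 is lost because its only copy was on the failed node.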
MapReduce
• Similar to divide-and-conquer
• First, use "Map" to divide the work into tasks
• Second, use "Shuffle" to "transfer the data from the mapper nodes to a reducer's node and decompress if needed."
• Third, use "Reduce" to "execute the user-defined reduce function to produce the final output data."
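The three phases can be sketched in a single process with the commonly used word-count task. Word counting is not taken from the slides; it is an assumed example, and the function names are hypothetical. On a cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all values that share a key, as if routing them to one reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: the user-defined function, here a sum of the counts for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```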
MapReduce-Map
MapReduce-Shuffle
MapReduce-Reduce
MapReduce
Comparison
Comparison
Why Hadoop? Technically
Comparison of Grep task results with Vertica and DBMS-X
Why Hadoop? Technically
• Simple structure vs. optimization
• Transaction time is not minimized
• Lower performance with the same number of nodes
• No compelling technical reason to choose Hadoop
Why Hadoop? Commercially
Why Hadoop? Commercially
• Cheap (buy more servers to beat a DBMS)
• Flexible (in both design and deployment)
• Easier to design
• Easier to scale up
• Can be combined with other systems to achieve better performance
Conclusion
• Hadoop is much easier for users to implement and more economical
• MapReduce advocates should study the techniques used in parallel DBMSs
• Hybrid systems are also popular
• As its performance improves, we believe Hadoop will lead the trend in big data computing
Reference
• http://hadoop.apache.org/
• http://www.runpc.com.tw/content/cloud_content.aspx?id=105318
• http://en.wikipedia.org/wiki/Apache_Hadoo
• https://www.facebookbrand.com/
• http://assets.fontsinuse.com/static/use-media-items/15/14246/full-2048x768/522903b7/Yahoo_Logo.png
• http://wiki.apache.org/hadoop/PoweredBy
• http://semiaccurate.com/assets/uploads/2011/09/Amazon-logo.jpg
• http://www.conceptcupboard.com/blog/wp-content/uploads/2013/09/google.jpg
• http://datashieldcorp.com/files/2013/11/adobe-LOGO-2.jpg
• http://upload.wikimedia.org/wikipedia/commons/7/77/The_New_York_Times_logo.png
• http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf
• http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf
• http://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fwww.classcloud.org%2Fcloud%2Fraw-attachment%2Fwiki%2FHinet100402%2F02.HadoopOverview.pdf&ei=IE2XUtLfBMfxiAea_oHQCA&usg=AFQjCNFoIXxLJrOnoul4cKJpQ8v3_kuTYg
• http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf
• https://www.google.com.tw/url?sa=t&rct=j&q&esrc=s&source=web&cd=1&ved=0CCkQFjAA&url=http%3A%2F%2Fwww.psgtech.edu%2Fyrgcc%2Fattach%2FMAP%2520REDUCE%2520PROGRAMMING.ppt&ei=7lGXUtvCJsy5iAfWtYH4Bw&usg=AFQjCNGWRKJLal-tvbvORULZV6_Te2y74g&sig2=Ba77ihsV1SEqcNeEFkRzfg
• https://www.cs.duke.edu/starfish/files/hadoop-models.pdf
• http://dotnetmis91.blogspot.tw/2010/04/hdfs-hadoop-mapreduce.html
• http://wiki.apache.org/hadoop/HDFS
• http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
• http://en.wikipedia.org/wiki/Interpreted_language
• A Comparison of Approaches to Large-Scale Data Analysis by Sam Madden
• http://www.cc.ntu.edu.tw/chinese/epaper/0011/20091220_1106.htm
• http://web.cs.wpi.edu/~cs561/s12/Lectures/6/Hadoop.pdf
• http://www.mobilemartin.com/mobile/show-me-the-mobile-money.jpg