
Cloud Computing Virtualization Technology -- Cloud Computing Data Processing Technology -- Hadoop -- MapReduce


Presentation Transcript


  1. Cloud Computing Virtualization Technology -- Cloud Computing Data Processing Technology -- Hadoop -- MapReduce 賴智錦 / 詹奇峰, Department of Electrical Engineering, National University of Kaohsiung, 2009/08/05

  2. Cloud Computing Data Processing Technology • What is large data? From the point of view of the infrastructure required to do analytics, data comes in three sizes: • Small data • Medium data • Large data Source: http://blog.rgrossman.com/

  3. Cloud Computing Data Processing Technology • Small data: • Small data fits into the memory of a single machine. • Example: a small dataset is the dataset for the Netflix Prize. (The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.) • The Netflix Prize dataset consists of over 100 million movie ratings from 480 thousand randomly-chosen, anonymous Netflix customers who rated over 17 thousand movie titles. • This dataset is just 2 GB of data and fits into the memory of a laptop. Source: http://blog.rgrossman.com/

  4. Cloud Computing Data Processing Technology • Medium data: • Medium data fits onto a single disk or disk array and can be managed by a database. • It is becoming common today for companies to create data warehouses of 1 to 10 TB or larger. Source: http://blog.rgrossman.com/

  5. Cloud Computing Data Processing Technology • Large data: • Large data is so large that it is challenging to manage it in a database, and instead specialized systems are used. • Scientific experiments, such as the Large Hadron Collider (LHC, the world's largest and highest-energy particle accelerator), produce large datasets. • Log files produced by Google, Yahoo, Microsoft and similar companies are also examples of large datasets. Source: http://blog.rgrossman.com/

  6. Cloud Computing Data Processing Technology • Large data sources: • Most large datasets were produced by the scientific and defense communities. • Two things have changed: • Large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media. • The ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies. Source: http://blog.rgrossman.com/

  7. Cloud Computing Data Processing Technology • Large data sources: • Two things have changed: • This provides a metric by which to measure the effectiveness of analytic infrastructure and analytic models. • Using this metric, Google settled upon analytic infrastructure that was quite different from the grid-based infrastructure generally used by the scientific community. Source: http://blog.rgrossman.com/

  8. Cloud Computing Data Processing Technology • What is a large data cloud? • A good working definition is that a large data cloud provides • storage services and • compute services layered over the storage services, both of which scale to a data center and have the reliability associated with a data center. Source: http://blog.rgrossman.com/

  9. Cloud Computing Data Processing Technology • What are some of the options for working with large data? • The most mature large data cloud application is the open source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop's implementation of MapReduce. • An important advantage of Hadoop is that it has a very robust community supporting it, and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS. Source: http://blog.rgrossman.com/

  10. Cloud Computing Data Processing Technology • The cloud grew out of parallel computing, but it is better at data computation than the grid. -- Dr. 林誠謙, head of Academia Sinica Grid Computing (ASGC) • Cloud computing grew out of parallel computing techniques and does not depart from the philosophy of grid computing, but it concentrates more on data processing. • Because each individual processing run handles only a small amount of data, cloud computing has developed implementation approaches different from those of grid computing. -- Dr. 黃維誠, project director, Enterprise and Project Management Division, National Center for High-performance Computing (NCHC)

  11. Cloud Computing Data Processing Technology • The tasks best suited to cloud computing are mostly those in which data is processed very frequently, but each individual run handles only a small amount of data. -- Dr. 黃維誠, project director, Enterprise and Project Management Division, National Center for High-performance Computing (NCHC) Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

  12. Cloud Computing Data Processing Technology • [Table: source — compiled by iThome, June 2008]

  13. Cloud Computing Data Processing Technology • Web search: each page to be matched is in fact a small file that does not take much processor power to handle, so web search computations can be carried out on large numbers of personal computers. Setting up grid computing on personal computers, by contrast, is harder, because grid computing needs larger processing resources. • The difference in implementation is that cloud computing can combine large numbers of personal computers to deliver a service, while grid computing depends on high-performance computers that can supply large amounts of computing resources. -- Dr. 黃維誠, project director, National Center for High-performance Computing (NCHC) Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

  14. Cloud Computing Data Processing Technology • Cloud Computing: a distributed computing technique put forward by Google that makes it easy for developers to build global-scale application services; it automatically manages communication, task allocation and distributed storage among large numbers of standardized (non-heterogeneous) computers. • Grid Computing: integrates heterogeneous servers across network domains through standardized protocols and trust mechanisms, building computing clusters that share compute resources, storage resources and so on. • In-the-Cloud, or Cloud Service: the provider delivers the service over the Internet; users only need a browser to use it and do not need to understand how the provider's servers operate.

  15. Cloud Computing Data Processing Technology • The MapReduce model: the key technique Google applies in cloud computing, letting developers write programs that process massive amounts of data. A Map program first splits the data into independent chunks and distributes them to a large number of computers; a Reduce program then consolidates the results and outputs what the developer needs. • Hadoop: an open-source cloud computing framework written in Java that implements Google's cloud computing techniques, although the distributed file system it uses differs from Google's. In 2006 Yahoo became the project's main contributor and user. Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

  16. Cloud Computing Data Processing Technology • Data processing: when a large amount of data is involved, how to split, compute and merge it in parallel so that the people handling it can obtain a summary of that data directly. • Parallel data-analysis languages: • Google's Sawzall project and Yahoo's Pig project are both high-level languages for processing massive amounts of data in parallel. • Google's Sawzall is built on top of MapReduce and Yahoo's Pig on top of Hadoop (Hadoop being a clone of MapReduce), so the two are almost of the same lineage.

  17. Hadoop: Why? • Need to process 100TB datasets with multi-day jobs • On 1-node: • Scanning at 50 MB/s = 23 days • On 1000 node cluster: • Scanning at 50 MB/s = 33 min • Need framework for distribution • Efficient, reliable, usable
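
A quick check of these figures: 100 TB ≈ 10^8 MB, and 10^8 MB ÷ 50 MB/s = 2 × 10^6 seconds ≈ 23 days on a single node; spread evenly over 1,000 nodes that becomes about 2 × 10^3 seconds ≈ 33 minutes.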

  18. Hadoop: Where? • Batch data processing, not real-time/user facing • Log Processing • Document Analysis and Indexing • Web Graphs and Crawling • Highly parallel, data intensive, distributed applications • Bandwidth to data is a constraint • Number of CPUs is a constraint • Very large production deployments (GRID) • Several clusters of 1000s of nodes • LOTS of data (Trillions of records, 100 TB+ data sets)

  19. What is Hadoop? • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. • The project includes: • Core: provides the Hadoop Distributed Filesystem (HDFS) and support for the MapReduce distributed computing framework. • MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines. • Chukwa: a data collection system for managing large distributed systems. Chukwa is built on top of the HDFS and MapReduce framework and inherits Hadoop's scalability and robustness.

  20. What is Hadoop? • HBase: builds on Hadoop Core to provide a scalable, distributed database. • Hive: a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying and analysis of datasets. • Pig: a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core. • ZooKeeper: a highly available and reliable coordination service. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.

  21. Hadoop History • 2004 - Initial versions of what is now the Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella • December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. • January 2006 - Doug Cutting joins Yahoo! • February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS. • March 2006 - Formation of the Yahoo! Hadoop team • April 2006 - Sort benchmark run on 188 nodes in 47.9 hours

  22. Hadoop History • May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes • May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark) • October 2006 - Research cluster reaches 600 Nodes • December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs • April 2007 - Research clusters - 2 clusters of 1000 nodes Source: http://hadoop.openfoundry.org/slides/Hadoop_OSDC_08.pdf

  23. Hadoop Components • Hadoop Distributed Filesystem (HDFS) • is a distributed file system designed to run on commodity hardware. • is highly fault-tolerant and is designed to be deployed on low-cost hardware. • provides high throughput access to application data and is suitable for applications that have large data sets. • relaxes a few POSIX requirements to enable streaming access to file system data. (POSIX: Portable Operating System Interface [for Unix]) • was originally built as infrastructure for the Apache Nutch web search engine project. • is part of the Apache Hadoop Core project.

  24. Hadoop Components • Hadoop Distributed Filesystem (HDFS) Source: http://hadoop.apache.org/core/

  25. Hadoop Components • HDFS Assumptions and Goals • Hardware failure • Hardware failure is the norm rather than the exception. • An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. • The fact that there are a huge number of components, each with a non-trivial probability of failure, means that some component of HDFS is always non-functional. • Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

  26. Hadoop Components • HDFS Assumptions and Goals • Streaming Data Access • Applications that run on HDFS need streaming access to their data sets. • They are not general purpose applications that typically run on general purpose file systems. • HDFS is designed more for batch processing rather than interactive use by users. • The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS.

  27. Hadoop Components • HDFS Assumptions and Goals • Large Data Sets • Applications that run on HDFS have large data sets. • A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. • It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

  28. Hadoop Components • HDFS Assumptions and Goals • Simple Coherency Model • HDFS applications need a write-once-read-many access model for files. • A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. • A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
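
A minimal Java sketch of this write-once-read-many pattern against Hadoop's FileSystem API (the path is hypothetical; with a default configuration FileSystem.get falls back to the local file system rather than HDFS):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml if present
      FileSystem fs = FileSystem.get(conf);
      Path file = new Path("/tmp/coherency-demo.txt"); // hypothetical path

      // Write once: create, write, close. After close() the file is
      // treated as immutable under the coherency model above.
      FSDataOutputStream out = fs.create(file);
      out.writeUTF("written once");
      out.close();

      // Read many: any number of clients can now stream the file.
      FSDataInputStream in = fs.open(file);
      System.out.println(in.readUTF());
      in.close();
    }
  }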


  30. Hadoop Components • HDFS Assumptions and Goals • "Moving Computation is Cheaper than Moving Data" • A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. • This minimizes network congestion and increases the overall throughput of the system. • It is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

  31. Hadoop Components • HDFS Assumptions and Goals • Portability Across Heterogeneous Hardware and Software Platforms • HDFS has been designed to be easily portable from one platform to another. • This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

  32. Hadoop Components • HDFS: Namenode and Datanode • HDFS has a master/slave architecture • An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. • In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. • HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

  33. Hadoop Components • HDFS: Namenode and Datanode • HDFS has a master/slave architecture • The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. • The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. • The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
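
To make the metadata/data split concrete, here is a small Java sketch (assuming an existing HDFS file is passed as the first argument): the block-location query is answered by the NameNode, while the block contents would be read directly from the DataNodes.

  import java.util.Arrays;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocations {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path file = new Path(args[0]);                   // an existing HDFS file

      // Metadata query (served by the NameNode): which blocks make up
      // the file, and which DataNodes hold each replica.
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("offset " + block.getOffset()
            + " length " + block.getLength()
            + " hosts " + Arrays.toString(block.getHosts()));
      }
      // The block data itself is streamed from those DataNodes
      // (e.g. via fs.open(file)); it never flows through the NameNode.
    }
  }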

  34. Hadoop Components • Hadoop Distributed Filesystem (HDFS) Source: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

  35. Hadoop Components • HDFS: The File System Namespace • HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. • The file system namespace hierarchy is similar to most other existing file systems. • one can create and remove files, move a file from one directory to another, or rename a file. • The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. • An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
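
A short Java sketch of these namespace operations and of the per-file replication factor, again through the FileSystem API (all directory and file names here are hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());

      // Namespace operations; each change is recorded by the NameNode.
      fs.mkdirs(new Path("/user/demo/in"));                                // create directories
      fs.rename(new Path("/user/demo/in"), new Path("/user/demo/input"));  // rename / move
      fs.delete(new Path("/user/demo/tmp"), true);                         // recursive delete (returns false if absent)

      // Create an empty file, then ask HDFS to keep 3 replicas of each of its blocks.
      Path data = new Path("/user/demo/input/data.txt");
      fs.create(data).close();
      fs.setReplication(data, (short) 3);
    }
  }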

  36. Hadoop Components • Hadoop Distributed Processing Framework • Using the MapReduce metaphor • Map/Reduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters of commodity hardware. • A simple programming model that applies to many large-scale computing problems • Hides messy details in the MapReduce runtime library: • Automatic parallelization • Load balancing • Network and disk transfer optimization • Handling of machine failures • Robustness

  37. Hadoop Components • A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. • The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. • The Map/Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. • The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. • The slaves execute the tasks as directed by the master.

  38. Hadoop Components • Although the Hadoop framework is implemented in Java™, Map/Reduce applications need not be written in Java. • Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. • Hadoop Pipes is a SWIG-compatible C++ API to implement Map/Reduce applications (not based on JNI™, the Java Native Interface).

  39. Hadoop Components • MapReduce concepts • Definition: • Map function: Take a set of (key, value) pairs and generate a set of intermediate (key, value) pairs by applying some function to all these pairs. E.g., (k1, v1) → list(k2, v2) • Reduce function: Merge all pairs with the same key, applying a reduction function on the values. E.g., (k2, list(v2)) → list(k3, v3) • Input and output types of a Map/Reduce job: • Read a lot of data • Map: extract something meaningful from each record • Shuffle and Sort • Reduce: aggregate, summarize, filter, or transform • Write the results • (input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)

  40. Hadoop Components • MapReduce concepts • [Diagram: word-count data flow. The input lines "the quick brown fox", "the fox ate the mouse" and "the small mouse" pass through Map, Shuffle & Sort and Reduce; per-line pairs such as (the, 1), (fox, 1), (ate, 1) are grouped and summed into totals such as (fox, 2) and (mouse, 2) in the output.]

  41. Hadoop Components • Consider the problem of counting the number of occurrences of each word in a large collection of documents:

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

  The map function emits each word plus an associated count of occurrences ("1" in this example). The reduce function sums together all the counts emitted for a particular word.
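
The same word-count computation written against Hadoop's Java MapReduce API — essentially the classic Hadoop WordCount example, shown here in the 0.20-era org.apache.hadoop.mapreduce style:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1) for every word in the line.
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "word count");      // 0.20-era constructor (later Job.getInstance)
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }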

  42. Hadoop Components • MapReduce Execution Overview 1. The MapReduce library in the user program first shards the input files into M pieces of typically 16-64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines. 2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. 3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory. Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.

  43. Hadoop Components • MapReduce Execution Overview 4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers. 5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used. Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
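
A minimal sketch of the partitioning function mentioned in step 4. Hadoop's default HashPartitioner assigns each intermediate key to one of the R reduce partitions in essentially this way:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      // Mask the sign bit so the result is non-negative, then take the
      // remainder modulo R (the number of reduce tasks).
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

A job can plug in its own scheme with job.setPartitionerClass(WordPartitioner.class) and choose R with job.setNumReduceTasks(R).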

  44. Hadoop Components • MapReduce Execution Overview 6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition. 7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.

  45. Hadoop Components • MapReduce Examples • Distributed Grep ("globally search a regular expression and print" the matching lines): • The map function emits a line if it matches a given pattern. • The reduce function is an identity function that just copies the supplied intermediate data to the output. • Count of URL Access Frequency: • The map function processes logs of web page requests and outputs <URL, 1>. • The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
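
A sketch of the map side of such a distributed grep in Hadoop's Java API; "grep.pattern" is an assumed job property set by the driver, and with the number of reduce tasks set to 0 (or an identity reducer) the job output is simply the matching lines:

  import java.io.IOException;
  import java.util.regex.Pattern;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
      // Assumed property name, e.g. conf.set("grep.pattern", "ERROR.*timeout") in the driver.
      pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (pattern.matcher(value.toString()).find()) {
        context.write(value, NullWritable.get());   // emit the matching line as the key
      }
    }
  }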

  46. Hadoop Components • MapReduce Examples • Reverse Web-Link Graph: • The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". • The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>. • Inverted Index: • The map function parses each document, and emits a sequence of <word, document ID> pairs. • The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. • The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

  47. Hadoop Components • MapReduce Examples • Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. • The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). • The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

  48. MapReduce Programs in Google's Source Tree • New MapReduce Programs per Month Source: http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf

  49. Who Uses Hadoop • Amazon/A9 • Facebook • Google • IBM • Joost • Last.fm • New York Times • PowerSet (now Microsoft) • Quantcast • Veoh • Yahoo! • More at http://wiki.apache.org/hadoop/PoweredBy

  50. Hadoop Resources • http://hadoop.apache.org • http://developer.yahoo.net/blogs/hadoop/ • http://code.google.com/intl/zh-TW/edu/submissions/uwspr2007_clustercourse/listing.html • http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 • J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 51(1):107-113, 2008. • T. White, Hadoop: The Definitive Guide (MapReduce for the Cloud), O'Reilly, 2009.
