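Software Architecture Assignment 1. Project Survey: Hadoop
Yang Xiaoliang, MG0933047, Department of Computer Science and Technology, Nanjing University, yangxiaoliang2006@gmail.com, 2010-3-23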
Contents
• Introduction to Hadoop
• Hadoop architecture
• Applications of Hadoop
2014/11/18
Introduction
• KB: 1,000 bytes
• MB: 1,000,000 bytes
• GB: 1,000,000,000 bytes
• TB: 1,000,000,000,000 bytes
• PB: 1,000,000,000,000,000 bytes
• …
These are the data volumes we have to deal with today.
Data volume processed by Google
• Google processes over 20 petabytes of data per day (Niall Kennedy, Jan/2008)
• 0.34 seconds…
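To put the 20 PB per day figure in perspective, the sketch below converts it into an average throughput. It assumes the decimal units listed in the introduction and an even spread of work over 24 hours, which is a simplification; the constant and function names are illustrative only.

```python
# Decimal (SI) byte units, matching the powers of 1000 listed in the introduction.
KB = 1000
MB = 1000 ** 2
GB = 1000 ** 3
TB = 1000 ** 4
PB = 1000 ** 5

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_bytes_per_second(bytes_per_day: int) -> float:
    """Average throughput needed to process a daily volume evenly over 24 hours."""
    return bytes_per_day / SECONDS_PER_DAY


# The "over 20 petabytes per day" figure quoted on the slide.
daily_volume = 20 * PB
rate = average_rate_bytes_per_second(daily_volume)
print(f"{rate / GB:.1f} GB/s on average")
```

Under these assumptions the average comes out at roughly 230 GB per second, far more than a single machine can read from disk, which is the motivation for spreading both storage and computation across a cluster.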
Google's three distributed infrastructure systems
• GFS (Google File System)
• MapReduce
• BigTable
Google has published several important papers on them:
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 2003
• Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. 2004
• Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al. Bigtable: A Distributed Storage System for Structured Data. 2006
• Mike Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. 2006
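Introduction to Hadoop
• Hadoop is an open-source software project of Apache; Doug Cutting began developing it in 2004.
• Hadoop is a distributed system for storing and processing massive amounts of data. It consists of several components, mainly HDFS, MapReduce, HBase, Hive, Pig, and ZooKeeper. HDFS is the open-source counterpart of Google's GFS, HBase of Google's BigTable, and ZooKeeper of Google's Chubby.
• Hadoop is used and studied at a large number of companies.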
Introduction to Hadoop: A9.com - Amazon
• We build Amazon's product search indices using the streaming API and pre-existing C++, Perl, and Python tools.
• We process millions of sessions daily for analytics, using both the Java and streaming APIs.
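The streaming API mentioned above lets any program that reads lines from standard input and writes tab-separated key/value lines to standard output act as a mapper or reducer. The sketch below illustrates that contract with a word count. It is not A9's code: the function names are hypothetical, and the `sorted` call stands in for the sort-by-key step that the framework performs between the two phases.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one "word<TAB>1" line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: in a real streaming job each function would read sys.stdin,
# and the framework (not this script) would sort the mapper output by key.
sample = ["hadoop streaming", "hadoop"]
mapped = sorted(map_lines(sample))
for line in reduce_lines(mapped):
    print(line)
```

Because the contract is just text lines on standard streams, existing C++, Perl, or Python tools can be reused without being rewritten in Java.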
Introduction to Hadoop: Yahoo!
• More than 100,000 CPUs in more than 25,000 computers running Hadoop
• Our biggest cluster: 4000 nodes (2×4-CPU boxes with 4×1 TB disks and 16 GB RAM)
• Used to support research for Ad Systems and Web Search
• Also used to do scaling tests to support development of Hadoop on larger clusters
Introduction to Hadoop: Facebook
• We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
• Currently we have 2 major clusters:
• A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
• A 300-machine cluster with 2400 cores and about 3 PB raw storage.
• Each (commodity) node has 8 cores and 12 TB of storage.
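The cluster totals quoted above can be cross-checked against the per-node figures. The sketch below does that arithmetic, assuming every node has exactly 8 cores and 12 TB and that 1 PB = 1000 TB; the `Cluster` class is a hypothetical helper, not part of Hadoop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical helper: derive cluster totals from per-node figures."""
    machines: int
    cores_per_node: int = 8        # per-node figure quoted on the slide
    storage_tb_per_node: int = 12  # per-node figure quoted on the slide

    @property
    def total_cores(self) -> int:
        return self.machines * self.cores_per_node

    @property
    def raw_storage_pb(self) -> float:
        # Decimal units: 1 PB = 1000 TB.
        return self.machines * self.storage_tb_per_node / 1000


for cluster in (Cluster(1100), Cluster(300)):
    print(cluster.machines, cluster.total_cores, cluster.raw_storage_pb)
```

The core counts match the slide exactly (8800 and 2400). The storage works out to 13.2 PB and 3.6 PB, somewhat above the quoted "about 12 PB" and "about 3 PB", so the quoted totals are presumably rounded or not every node carries the full 12 TB.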
Introduction to Hadoop: Baidu, the leading Chinese-language search engine
• Hadoop is used to analyze search logs and to do mining work on the web page database
• It handles about 3000 TB per week
Introduction to Hadoop: In China, many companies, including China Mobile, NetEase, Taobao, Tencent, Kingsoft, and Huawei, are studying and using it.
Hadoop architecture
Hadoop consists of the following components:
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Avro: A data serialization system that provides dynamic integration with scripting languages.
• Chukwa: A data collection system for managing large distributed systems.
• HBase: A scalable, distributed database that supports structured data storage for large tables.
• HDFS: A distributed file system that provides high-throughput access to application data.
• Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• MapReduce: A software framework for distributed processing of large data sets on compute clusters.
• Pig: A high-level data-flow language and execution framework for parallel computation.
• ZooKeeper: A high-performance coordination service for distributed applications.
Hadoop architecture: HDFS
The structure of HDFS follows the design of GFS: a GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients.
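The single-master design can be illustrated with a toy in-memory model: the master keeps only metadata (which chunks make up a file and where their replicas live), while chunkservers hold the data itself. This is a minimal sketch under simplifying assumptions, not GFS or HDFS code; every class and function name is hypothetical, the chunk size is tiny for readability, and replica placement is plain round-robin. In HDFS the corresponding roles are called NameNode and DataNode.

```python
from itertools import cycle

CHUNK_SIZE = 4  # bytes; tiny on purpose (GFS uses 64 MB chunks)
REPLICAS = 2    # copies per chunk in this toy model


class ChunkServer:
    """Stores chunk contents keyed by chunk handle."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}


class Master:
    """Holds metadata only: file -> chunk handles, chunk handle -> replicas."""

    def __init__(self, servers):
        self.file_chunks = {}      # file name -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> chunkservers holding a copy
        self._next_server = cycle(servers)  # simple round-robin placement

    def allocate(self, filename):
        """Register a new chunk for the file and pick servers for its replicas."""
        handles = self.file_chunks.setdefault(filename, [])
        handle = f"{filename}#{len(handles)}"
        handles.append(handle)
        self.chunk_locations[handle] = [
            next(self._next_server) for _ in range(REPLICAS)
        ]
        return handle


def write_file(master, filename, data):
    """Client side: ask the master where to put each chunk, then send the data."""
    for start in range(0, len(data), CHUNK_SIZE):
        handle = master.allocate(filename)
        for server in master.chunk_locations[handle]:
            server.chunks[handle] = data[start:start + CHUNK_SIZE]


def read_file(master, filename):
    """Client side: look up locations on the master, read from a chunkserver."""
    parts = []
    for handle in master.file_chunks[filename]:
        server = master.chunk_locations[handle][0]  # any replica would do
        parts.append(server.chunks[handle])
    return b"".join(parts)


servers = [ChunkServer(f"cs{i}") for i in range(3)]
master = Master(servers)
write_file(master, "log.txt", b"hello hadoop")
print(read_file(master, "log.txt"))
```

The point of the split is that file contents never pass through the master; clients contact it only for metadata and then talk to the chunkservers directly.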
Hadoop architecture: HDFS
Hadoop architecture: MapReduce Architecture
Hadoop architecture: MapReduce Execution Overview
MapReduce data flow
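The data flow of a MapReduce job can be sketched in a single process as three phases: map each input record to key/value pairs, shuffle the pairs so that all values for a key end up together, then reduce each group to a result. The example below builds a small inverted index this way. It is a simplified illustration, not Hadoop's API; all function names and the sample documents are made up.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the user's map function to every input record."""
    for record in records:
        yield from map_fn(record)


def shuffle_phase(pairs):
    """Group values by key and order the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key and its grouped values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def index_map(doc):
    """Emit (word, document id) once per distinct word in the document."""
    doc_id, text = doc
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    """Collect the documents containing the word into a sorted list."""
    return sorted(doc_ids)


docs = [("d1", "hadoop stores data"), ("d2", "hadoop processes data")]
index = reduce_phase(shuffle_phase(map_phase(docs, index_map)), index_reduce)
print(index)
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data across the network, but the shape of the computation is the same.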
Applications of Hadoop
As of October 2009, commercial applications of Hadoop included (from Wikipedia):
• Log analysis of various kinds
• Marketing analytics
• Machine learning and/or sophisticated data mining
• Image processing
• Processing of XML messages
• Web crawling and/or text processing
• …
References
• http://hpc.cs.tsinghua.edu.cn/dpcourse/index.htm
• http://hadoop.apache.org/
• http://en.wikipedia.org/wiki/Hadoop