Supplementary topic: Peer-to-Peer Computing
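Disclaimer • This courseware is intended solely for teaching at the School of Computer Science, Beihang University (Beijing University of Aeronautics and Astronautics); • It was revised using a number of online resources (papers, research reports, technical reports, etc.), and citation information was not accurately recorded when that material was adopted.
Outline • 1. Background and origins • 2. P2P file-sharing systems • 3. Academic research on P2P • 4. P2P applications • 5. "Active peer-to-peer computing"
The current network computing environment • High bandwidth • More capable and cheaper machines (PCs) • Wide network coverage • A huge number and rich variety of resources on the network • Complex structure • Highly dynamic • Many social factors • Security • Reliability • Availability • Efficiency • …
Diversification of distributed computing technologies • Peer-to-peer computing (Peer to Peer, P2P) • Data resources • Grid • Data resources, especially files • Computing resources • Software resources • Web Service • Software resources • WWW, ebXML, …
Term: Peer to Peer • P2P has attracted wide attention because of its distinctive characteristics • Unlike the client/server (C/S) computing model, P2P makes better use of the resources on PCs and makes it easier for ordinary users to take part • Previously, the dominant application model was client/server: resources reside mainly on the server, from which clients download them; the server is the center and serves many clients.
In the P2P model, any two nodes are peers: each node is both a client and a server, and it both provides and consumes resources. • There are many nodes, and they do not "know" one another • The system is more dynamic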
Term: Overlay Network • "Overlay networks is a term for networks that run on top of an existing infrastructure but provide certain additional functionality." (Source: www.overlay-networks.info) • An overlay introduces a new structure of who is connected with whom • This new structure uses the connectivity of the lower layers for its links. The layer below the overlay is called the underlay.
The most well-known overlay is the Internet Protocol (IP) itself. • It abstracts from the technology and physical location present in the lower layers (medium access, physical layer). • It forms a structure that is optimized to provide connectivity between all connected networks and their entities, no matter where they are and what access technology they use.
Origins of P2P [Figure: peers holding different sets of media files (1.mp3, 2.mp3, 3.mp3, a.mpeg, b.mpeg, e.mpeg, …); some files appear on more than one peer]
Evolution: purely unstructured (random network)
Evolution: structured [Figure: structured overlay of nodes a, b, c, …, d, e, f, g, h]
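2. P2P File-Sharing Systems • Timeline (1999, 2000, 2001, 2002, …): Napster, Gnutella, FastTrack, LimeWire, iMesh & Grokster, Morpheus, Kazaa, eDonkey, OverNet, eDonkey2000, DC++, BitTorrent, eXeem
One central server alone has 450,000 users and 61.37 million files • 1.03 million OverNet users • Clients connect simultaneously to the DHT-based OverNet network and to a network based on central servers
Classification of P2P file-sharing systems • P2P nodes are scattered across different parts of the physical network; P2P connects them through an application-layer network, the overlay network. • Nodes communicate over the overlay to search for and retrieve information resources. • Depending on the overlay structure and the search method, P2P systems can be classified as unstructured P2P, structured P2P, hierarchical P2P, …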
2.1 Centralized • E.g., Napster (1999) • Lookups via a central server, which holds an index of all data items • The IP address of the server must be known • Only file retrieval and storage are decentralized • Problems: • Single point of failure • Server presents a bottleneck • Scalability issues
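The centralized lookup described above can be sketched roughly as follows. This is a minimal illustration, not Napster's actual protocol: the `CentralIndex` class, its method names, and the peer addresses are all hypothetical. The point it shows is that the server stores only the index, while the files stay on the peers.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical central server: it stores only the index, never the files."""

    def __init__(self):
        # file name -> set of addresses of peers sharing that file
        self._index = defaultdict(set)

    def register(self, peer_addr, file_names):
        """A peer announces the files it shares."""
        for name in file_names:
            self._index[name].add(peer_addr)

    def unregister(self, peer_addr):
        """A peer leaves; drop it from every index entry."""
        for peers in self._index.values():
            peers.discard(peer_addr)

    def lookup(self, file_name):
        """Return the peers to download from; the transfer itself is peer to peer."""
        return sorted(self._index.get(file_name, set()))


index = CentralIndex()
index.register("10.0.0.1:6699", ["1.mp3", "2.mp3"])
index.register("10.0.0.2:6699", ["2.mp3", "a.mpeg"])
both = index.lookup("2.mp3")        # two peers share this file
index.unregister("10.0.0.1:6699")
remaining = index.lookup("2.mp3")   # only the second peer is left
```

Every lookup goes through the single `CentralIndex` object, which is exactly where the single point of failure and the bottleneck listed above come from.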
2.2 "Pure" Unstructured • E.g., FreeNet and Gnutella 0.4 • No central entity, except for a bootstrap server • Search requests are flooded to (potentially all) nodes • Problems: • Significant traffic overhead • Topological mismatch between overlay and physical network causes zigzag routes
Unstructured P2P evaluation • Pros • Easy to search by keywords & regular expressions (RE) (*britney* will easily find britney_spears.mp3) • Low maintenance costs (adding or removing peers) • Easy to deploy
Unstructured P2P evaluation • Cons • Broadcast generates lots of traffic → scales badly → maximum search depth must be imposed → not all nodes are reached → not all matches are found
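The trade-off above (flooding is simple, but a depth limit means some matches are missed) can be made concrete with a small sketch. The topology, the file placement, and the function name `flood_search` are assumptions made for illustration; wildcard matching uses Python's standard `fnmatch`.

```python
from collections import deque
from fnmatch import fnmatch


def flood_search(neighbours, files, start, pattern, ttl):
    """Breadth-first flood from `start`, forwarding the query at most `ttl` hops."""
    visited = {start}
    queue = deque([(start, ttl)])
    hits, messages = [], 0
    while queue:
        node, remaining = queue.popleft()
        # each reached node checks its own shared files against the pattern
        hits += [(node, f) for f in files.get(node, []) if fnmatch(f, pattern)]
        if remaining == 0:
            continue  # TTL exhausted: do not forward any further
        for nxt in neighbours[node]:
            messages += 1            # every forwarded copy costs one message
            if nxt not in visited:   # duplicate copies are dropped by the receiver
                visited.add(nxt)
                queue.append((nxt, remaining - 1))
    return hits, messages, visited


# hypothetical five-node overlay
neighbours = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D"],
}
files = {"C": ["3.mp3"], "E": ["britney_spears.mp3"]}

hits2, messages2, reached2 = flood_search(neighbours, files, "A", "*britney*", ttl=2)
hits3, messages3, reached3 = flood_search(neighbours, files, "A", "*britney*", ttl=3)
```

In this toy overlay, node E is three hops from A, so the search with `ttl=2` finds nothing, while `ttl=3` finds the file at the price of more messages.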
2.3 Structured • DHTs (Distributed Hash Tables) can be used to distribute data items across nodes in a deterministic manner • This allows the data to be found again easily • Hash functions: functions that are applied to a (long) message to produce an effectively unique (short) ‘fingerprint’ of the message
Given M, it is easy to compute h = H(M) • Can process long messages quickly • Given h, it is hard to compute M such that H(M) = h • Cannot manufacture a message that will hash to a particular value • Given M, it is hard to find another message M’ such that H(M) = H(M’) • Cannot derive two messages that hash to the same value (collision free) • Popular hash functions • Message Digest 5 (MD5) produces 128-bit digests • Secure Hash Algorithm 1 (SHA-1) produces 160-bit digests
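As a quick check of the digest sizes quoted above, the following sketch uses Python's standard `hashlib`. The message bytes are arbitrary examples, and `hashlib.md5` may be unavailable on builds restricted to FIPS-approved algorithms.

```python
import hashlib

message = b"britney_spears.mp3"

md5_digest = hashlib.md5(message).digest()
sha1_digest = hashlib.sha1(message).digest()

# digest() returns raw bytes, so the length in bits is 8 * len(...)
md5_bits = len(md5_digest) * 8
sha1_bits = len(sha1_digest) * 8

# hashing is deterministic: the same message always gives the same fingerprint
same_again = hashlib.sha1(message).digest()

# a one-character change in the message gives a different fingerprint
other = hashlib.sha1(b"britney_spears.mp4").digest()
```

Determinism is what a DHT relies on: any node that hashes the same name obtains the same key.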
Each data item is equipped with a key (a hash value) • Each node is equipped with a key in the same range
Key space partitioning • The size of the key space is the size of the range of the hash function (e.g., 2^128 for MD5 or 2^160 for SHA-1) • The key space is partitioned into contiguous segments • Each node is responsible for a certain segment
Each node knows a (small) number of neighbours. • Partial view of the system: O(log N) neighbours • This allows them to route search requests (for data item D) towards nodes closer (in terms of identifier values) to the destination in O(log N) hops
Storing data items in DHTs • When a data item is stored into a DHT, a hash function calculates the item’s key • Can be based on file name, metadata, file content, etc. • The data item is then forwarded to the node responsible for that key • Complexity: O(log N), where N is the number of nodes in the network
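One way to picture the partitioning is the sketch below. It assumes a particular convention that the slides do not specify: node identifiers are placed on a ring, and a key belongs to the first node whose identifier is greater than or equal to the key, wrapping around at the end (the successor rule used by Chord-style DHTs). The names `key_of` and `responsible_node` are hypothetical.

```python
import hashlib
from bisect import bisect_left

M_BITS = 160  # SHA-1 key space: keys lie in [0, 2**160)


def key_of(name: str) -> int:
    """Map a file name or node name to an integer key in the SHA-1 key space."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def responsible_node(node_ids, key):
    """Return the first node id >= key, wrapping around; node_ids must be sorted."""
    i = bisect_left(node_ids, key)
    return node_ids[i % len(node_ids)]


# eight hypothetical nodes; each one owns the segment that ends at its own id
node_ids = sorted(key_of(f"node-{i}") for i in range(8))
owner = responsible_node(node_ids, key_of("3.mp3"))

# the same rule on a tiny, hand-checkable key space
small = [10, 20, 30]
examples = [responsible_node(small, k) for k in (15, 20, 31)]
```

In the small example, key 15 falls in the segment owned by node 20, key 20 is owned by node 20 itself, and key 31 wraps around to node 10.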
When a node receives a query identifying a data item it checks… • If it’s responsible for the data item, it sends the item back (or a failure message if the item doesn’t exist) • If it’s not responsible, it forwards the query to one of its neighbours • A typical routing metric is that of numerical distance • Messages are forwarded to neighbours whose keys are numerically closest to the key in the request
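The forwarding rule above (hand the request to the neighbour numerically closest to the key) can be sketched on a toy network. Several simplifying assumptions are made here and are not from the slides: the identifiers 0..N-1 are all occupied, so the node responsible for key k is node k; distance is plain |id - key| rather than ring distance; and each node's neighbours sit at distances 1, 2, 4, … in both directions, which gives the O(log N) partial view. `Node`, `route`, `put`, and `get` are hypothetical names.

```python
class Node:
    def __init__(self, node_id, n_nodes):
        self.id = node_id
        self.store = {}
        # neighbours at distances 1, 2, 4, ... in both directions: O(log N) entries
        self.neighbours = []
        step = 1
        while step < n_nodes:
            for other in (node_id - step, node_id + step):
                if 0 <= other < n_nodes:
                    self.neighbours.append(other)
            step *= 2


def route(nodes, start, key):
    """Greedy routing: return the list of node ids visited on the way to `key`."""
    path = [start]
    current = nodes[start]
    while True:
        best = min(current.neighbours, key=lambda n: abs(n - key), default=current.id)
        if abs(best - key) >= abs(current.id - key):
            return path  # no neighbour is closer: current node is responsible
        current = nodes[best]
        path.append(best)


N = 64
nodes = [Node(i, N) for i in range(N)]


def put(start, key, value):
    """Forward the item to the responsible node and store it there."""
    path = route(nodes, start, key)
    nodes[path[-1]].store[key] = value
    return path


def get(start, key):
    """Forward the query; the responsible node answers (None if the item is absent)."""
    path = route(nodes, start, key)
    return nodes[path[-1]].store.get(key), path


store_path = put(0, 37, "3.mp3")
value, query_path = get(63, 37)
max_hops = max(len(route(nodes, s, k)) - 1 for s in range(N) for k in range(N))
```

Because each hop at least halves the remaining distance in this layout, no route among the 64 nodes needs more than 6 hops, which matches the O(log N) bound stated for storing and looking up items.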
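Comparing the two approaches • The unstructured Gnutella network is better suited to finding popular resources, and statistics show that this is how people actually behave, so it works very well. • Structured P2P networks locate resources quickly. The problems: first, the number of unpopular files is also huge, and maintaining that many keys in a high-churn network is expensive. OverNet, too, is used mainly for popular resources, so what advantage does a DHT have over Gnutella?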
2.4 Hybrid P2P (Super Node) • E.g., Gnutella 0.6, KaZaA • Additional dynamic hierarchy: Super Peers • Hub-based network • Reduced message load • Leaf nodes announce their shared content to the super peer they are connected to
Gnutella (0.6?) Bootstrapping • “Some peers are more equal than others” (G. Orwell) [Figure label: Server]
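A rough sketch of the super-peer idea: leaves announce their file lists to one super peer, and queries are flooded only among super peers. The class and function names, the three-super-peer chain, and substring matching are all assumptions for illustration, not the Gnutella 0.6 or KaZaA wire protocol.

```python
from collections import defaultdict


class SuperPeer:
    def __init__(self, name):
        self.name = name
        self.index = defaultdict(set)  # file name -> leaves that share it
        self.links = []                # neighbouring super peers

    def announce(self, leaf, file_names):
        """A leaf node uploads its list of shared files to this super peer."""
        for f in file_names:
            self.index[f].add(leaf)


def search(start, keyword, ttl):
    """Breadth-first search over super peers only, at most `ttl` hops from `start`."""
    seen = {start.name}
    frontier = [start]
    hits = set()
    for _ in range(ttl + 1):
        next_frontier = []
        for sp in frontier:
            # a super peer answers on behalf of its leaves from its local index
            hits |= {(leaf, f) for f, leaves in sp.index.items()
                     if keyword in f for leaf in leaves}
            for other in sp.links:
                if other.name not in seen:
                    seen.add(other.name)
                    next_frontier.append(other)
        frontier = next_frontier
    return hits


# hypothetical chain of super peers: sp1 - sp2 - sp3
sp1, sp2, sp3 = SuperPeer("sp1"), SuperPeer("sp2"), SuperPeer("sp3")
sp1.links, sp2.links, sp3.links = [sp2], [sp1, sp3], [sp2]
sp3.announce("leaf-a", ["britney_spears.mp3"])
sp1.announce("leaf-b", ["3.mp3"])

near = search(sp1, "britney", ttl=1)
far = search(sp1, "britney", ttl=2)
```

Leaf nodes never receive the flooded query, which is where the reduced message load comes from; the depth limit still applies among the super peers.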
Gnutella List of shared files
Gnutella • Supernodes form a pure P2P network • Breadth-first search (depth = 2 → TTL = 2) • Overhead O(degree x N) [Figure: search message with keyword “britney”; one hit; some nodes not reached]
2.5 Semantic Overlay Network • Broadcasting all queries to all information sources obviously doesn’t scale efficiently • Hash-based queries scale, but can’t be complex • no approximate queries • no range queries • no text queries • Arising intuitions: • It’s better to route queries only to peers that are more likely to have answers • Shared content often has pronounced ontological structure (music, movies, scientific papers etc.)
Conceptual idea • Peers are clustered • Clusters overlap • Query is distributed to relevant cluster(s) only • Query is routed within each relevant cluster only • Irrelevant clusters are not bothered with the query
Semantic Overlay Network (SON) • Virtual, abstract, independent layer of selected peers • Advantages • Introduces semantic “views” to the physical network • Mediation and integration (correspondences, query rewriting) • Reduces flooding of the network
Formal definition of SON • A Semantic Overlay Network (SON) is a set of triples (links) {(n_i, n_j, L)}, where n_i, n_j are the linked peers and L is a string (the name of a category) • Each SON_L implements the functions: • Join(n_i) • Search(q) • Leave(n_i) [Figure: example categories rock, jazz, country]
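The triple-based definition and the three functions can be sketched as below. The slides do not say how Join picks links or how Leave repairs them, so this sketch makes its own assumptions: a newcomer links to the most recent member, a departing peer's former neighbours are chained together, and Search matches by substring. The `SON` class and the document store are hypothetical.

```python
class SON:
    """Links of one semantic overlay network, stored as triples (n_i, n_j, L)."""

    def __init__(self, label, documents):
        self.label = label          # L: the category name
        self.documents = documents  # peer -> document names (hypothetical store)
        self.members = []
        self.links = set()          # {(n_i, n_j, L)}

    def join(self, peer):
        if peer in self.members:
            return
        if self.members:
            # assumption: link the newcomer to the most recent member
            self.links.add((self.members[-1], peer, self.label))
        self.members.append(peer)

    def leave(self, peer):
        if peer not in self.members:
            return
        nbrs = sorted({a if b == peer else b
                       for a, b, _ in self.links if peer in (a, b)})
        self.links = {t for t in self.links if peer not in t[:2]}
        self.members.remove(peer)
        # assumption: chain the former neighbours so the SON stays connected
        for a, b in zip(nbrs, nbrs[1:]):
            self.links.add((a, b, self.label))

    def search(self, q):
        """Query only the peers reachable over this SON's own links."""
        if not self.members:
            return []
        seen, stack = {self.members[0]}, [self.members[0]]
        while stack:
            node = stack.pop()
            for a, b, _ in self.links:
                if node in (a, b):
                    other = b if node == a else a
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
        return sorted((p, d) for p in seen
                      for d in self.documents.get(p, []) if q in d)


documents = {"p1": ["miles_davis.mp3"], "p2": ["coltrane.mp3"],
             "p3": ["miles_live.mp3"], "p4": ["garth_brooks.mp3"]}
jazz = SON("jazz", documents)
for peer in ("p1", "p2", "p3"):
    jazz.join(peer)

before = jazz.search("miles")
jazz.leave("p2")
after = jazz.search("miles")
outside = jazz.search("garth")  # p4 never joined the jazz SON, so it is not asked
```

Peer p4 is never bothered with jazz queries, which is the intended saving; the cost is that anything p4 holds is invisible to searches inside this SON.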
SON: Hierarchical definition • A SON is an overlay network associated with a concept of a classification hierarchy • For example, we have 9 SONs for classification of music by style or 4 SONs for classification of music by tone • The documents of a peer must be assigned to concepts, so that the peer can be assigned to the corresponding SONs
Criteria of a good classification hierarchy • A classification hierarchy is good if • Documents in each category belong to a small number of peers (high granularity + equal popularity) • Peers have documents in a small number of categories (sensible, moderate granularity) • The classification algorithm is fast and error-free
Sources of errors • The format of the files may not follow the expected standard • The classification ontology may be incompatible with the files • Users misspell file names • So, 25% of music files were classified incorrectly • But a peer can still be correctly classified even if some of its documents are misclassified! • So, only 4% of peers were classified incorrectly
Peer assignment strategies • Conservative strategy: place a peer in SON_c if it has any document classified in concept c; this produces too many links • Less conservative strategy: place a peer in SON_c if it has a “significant” number of documents classified in concept c; this prevents finding all documents • Final solution: use Layered SONs
Layered SONs: Overview • Hierarchy of concepts • I. Apply the less conservative strategy with a threshold parameter (≥ 15% in the figure) • II. Consider the combination of “non-assigned” concepts and try to join the peer to an “upper-level” SON
Layered SONs: Searching • A query can be assigned to: • Leaf concept, i.e. precisely classified (figure a) • Non-leaf concept, i.e. imprecisely classified (figures b, c) • Imprecise classification leads to additional overhead [Figure: panels a, b, c, each showing a query entering the concept hierarchy]
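The three assignment strategies can be compared on a small example. This is a sketch under stated assumptions: the concept hierarchy and document counts are invented, the 15% threshold is taken from the figure label, and the handling of concepts that still miss the threshold at the root (they are dropped here) is a guess rather than something the slides specify.

```python
def conservative(doc_counts):
    """Join SON_c as soon as the peer has any document in concept c."""
    return {c for c, n in doc_counts.items() if n > 0}


def less_conservative(doc_counts, threshold):
    """Join SON_c only if concept c holds a significant share of the peer's documents."""
    total = sum(doc_counts.values())
    return {c for c, n in doc_counts.items() if total and n / total >= threshold}


def layered(doc_counts, parent, threshold):
    """Step I: threshold test per concept. Step II: pool non-assigned concepts upward.

    `parent` maps each concept to its parent concept (the root maps to None).
    """
    total = sum(doc_counts.values())
    assigned = set()
    level = dict(doc_counts)
    while level:
        leftovers = {}
        for concept, n in level.items():
            if total and n / total >= threshold:
                assigned.add(concept)
            elif parent.get(concept) is not None:
                # combine non-assigned concepts under their upper-level concept
                up = parent[concept]
                leftovers[up] = leftovers.get(up, 0) + n
            # assumption: documents still below the threshold at the root are dropped
        level = leftovers
    return assigned


# hypothetical hierarchy and one peer's document counts
parent = {"music": None, "rock": "music", "jazz": "music",
          "soft rock": "rock", "pop rock": "rock", "bebop": "jazz"}
doc_counts = {"soft rock": 10, "pop rock": 10, "bebop": 80}

all_sons = conservative(doc_counts)
strict_sons = less_conservative(doc_counts, threshold=0.15)
layered_sons = layered(doc_counts, parent, threshold=0.15)
```

Here neither rock sub-style reaches 15% on its own, so the less conservative strategy loses the peer's rock documents, while the layered strategy pools them (20% together) and joins the upper-level rock SON instead.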
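Summary of the types of P2P file-sharing systems • Centralized Unstructured P2P: the overlay structure and the placement of information resources resemble Pure Unstructured P2P, but there is additionally a central server that holds the index of all information resources. It therefore carries the risk of a single point of failure [2]. The typical example is Napster [7]. • Pure Unstructured P2P: the P2P overlay is usually a random or unstructured mesh, and information resources may reside on any node. Retrieval uses flooding or random-walk algorithms, so search efficiency is low. Typical systems include Gnutella [4], Freenet [5], and Morpheus [6].
Structured P2P: the overlay usually adopts a topology such as a mesh, ring, d-dimensional torus, or butterfly [2], and where a file is stored is tightly coupled to the overlay structure. Retrieval uses distributed hash table (DHT) techniques. Typical systems include Chord [10], Pastry [9], Tapestry [8], and CAN [11]. • Hierarchical P2P: nodes are divided into two interconnected layers; the upper layer consists of high-capacity, relatively stable nodes called Super Peers, and the lower layer consists of ordinary nodes [12]. The typical system is KaZaA [13].
Semantic P2P: nodes with similar content connect to the same super peer node [14] [15]; connections between nodes are influenced by content, so a node with many files of a type such as “Jazz” connects to other similar nodes [16]; and shortcuts connect peers that share similar interests and thus spontaneously form semantic communities [17].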
3. Related Research Problems • Over the past few years, the top networking and systems conferences have all featured P2P and overlay networking • ACM SIGCOMM • ACM PODC • IEEE ICNP • IEEE INFOCOM • IEEE ICDCS • IEEE P2P • IPTPS • OSDI • SOSP • …
P2P research • Overlay (2002-): Locality Aware, Topology Aware, Self Configurable, End-to-End Addressability • Searching, Locating, Routing (2001-2004) • Transport, Downloads (2003-2006) • Semantic (2003-2006) • Measurement (2001-2004) • Reliability, Resilience, Efficiency, Availability, Load Balancing (2003-): Churn • Reputation, Fairness, Incentives, Free-riding, Economy (2004-2006) • Security, Trust (2002-) • Application (2002-2005): Multicast, Group Communication, Data Management