260 likes | 354 Views
“A Local Search Mechanism for Peer-to-Peer Networks”. Vana Kalogeraki, Dimitrios Gunopulos & Demetris Zeinalipour (University of California – Riverside) < vana@cs.ucr.edu , dg@cs.ucr.edu , csyiazti@cs.ucr.edu >.
E N D
“A Local Search Mechanism for Peer-to-Peer Networks” Vana Kalogeraki, Dimitrios Gunopulos & Demetris Zeinalipour (University of California – Riverside) < vana@cs.ucr.edu, dg@cs.ucr.edu, csyiazti@cs.ucr.edu > CIKM 2002 – Eleventh International Conference on Information and Knowledge Management November 4-9, Mclean VA http://www.cs.ucr.edu/~csyiazti/publications.html
Presentation Outline • Introduction: Information Retrieval (I.R) in Peer-to-Peer networks. • Techniques for Distributed I.R. • Breadth-First Search. • Random Breadth-First Search. • Intelligent Search with profiling. • Experimental Evaluation. • Related Work. • Conclusions & Future Work.
The virtual P2P topology The physical topology Introduction to Peer-to-Peer • Peer-to-Peer Computing definition: “Sharing of computer resources and information through direct exchange” • Clients (downloaders) are also servers • Clients may join or leave the network at any time => highly fault-tolerant but with a cost! • Searches are done within the virtual network while actual downloads are done offline (with HTTP).
Introduction to Peer-to-Peer • Peer-to-Peer (P2P) systems are increasingly becoming popular. • P2P file-sharing systems, such as Gnutella, Napster and Freenet realized a distributed infrastructure for sharing files. • Traditionally, files were shared using the Client-Server model (e.g. http). Not scalable since they are centralized services. • P2P uncover new advantages in simplicity of use, robustness, self organization and scalability.
keywords Information Retrieval in P2P Problem: “How to efficiently retrieve Information in P2P systems where each node shares a collection of documents?” • Documents consists of keywords. • Resembles Information Retrieval but resources are distributed now. • Primary Data Structures such as Global Inverted Indexes can’t be maintained efficiently.
Solutions for P2P Information Retrieval 1) Centralized Approaches • Centralized Indexes • e.g. Napster, SETI@HOME 2) Purely Distributed Approaches • Each node has only local knowledge. • I.R is done using Brute force mechanisms • e.g. Gnutella, Fasttrack (Kazaa) 3) Hybrid Approaches • One or more peers have partial indexes of the contents of others. • e.g. Limewire's Ultrapeers Centralized Index 1) Upload Index 2) Query/QueryHit 3) Download (offline) 1 2 3 1) Connect 2) Query/QueryHit 3) Download (offline) 1,2 3 1) Connect 2) IntelligentQuery/QueryHit 3) Download (offline) 1,2 3
Motivation • On 1st June we crawled the Gnutella P2P Network for 5 hours with 17 workstations. • We analyzed 15,153,524 query messages. • Observation: High locality of specific queries. • We try to exploit this property for more efficient searches?
Presentation Outline • Introduction: Information Retrieval (I.R) in Peer-to-Peer networks. • Techniques for Distributed I.R. • Breadth-First Search. • Random Breadth-First Search. • Intelligent Search with profiling. • Experimental Evaluation. • Related Work. • Conclusions & Future Work.
Techniques for Distributed I.R. • Breadth-First Search (Gnutella) • Each Query Message is propagated along all outgoing links of a peer using TTL (time-to-live). • TTL is decremented on each forward until it becomes 0 • Technique for I.R in P2P systems such as Gnutella. • Results? • The physical network comes to its knees • Long Delays for search results. P2P Network N A QUERY 1 QUERYHIT 2 Peer q Peer d
Peer q Techniques for Distributed I.R. 2. Modified Random BFS • Each Query Message is forwarded to only a fraction of outgoing links (e.g. ½ of them). • TTL is again decremented on each forward until it becomes 0. • Results? • Fewer Messages but possibly less results • This algorithm is probabilistic. • Some segments may become unreachable unreachable B A QUERY 1 P2P Network N QUERYHIT C 2 Peer d
Peer q Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM) • Idea: Each Query Message is forwarded intelligently based on what queries a peer answered in the past. • Components of ISM (for each node u) • Profile Mechanism, for eachneighborN(u). • Peer Ranking Mechanism, for ranking peers locally and send a search query only to the ones that most likely will answer. • Similarity Function, for finding similar search queries. • Search Mechanism, for propagating queries based on local indexes A QUERY 1 profiles QUERYHIT 2 ? Peer d
Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM) a) Profile mechanism. • Maintains a list of past queries routed through that host. • Every time a QueryHit is received the table is updated • The profile manager uses a Least Recently Used policy to keep most recent queries in repository. • Profiles are kept for neighbors only so the cost for maintaining this cost is O(Td),Tis a limiting factor per profile, dis the degree of a node Size: T*d }
Example Assume host Pkneeds to forward a query q=“italy disaster” to two of its peers {P1, P2, P3}.Pkmaintains queries {q1 ,q2,. ,q5}in its profile. => PsimP(P1, q) = 0.81 = 0.8 P1 Sim(q, q1) = 0.8 Sim(q, q2) = 0.6 Sim(q, q3) = 0.5 Sim(q, q4) = 0.4 Sim(q, q5) = 0.4 P2 { } => PsimP(P2, q) = 0.61 + 0.51 = 1.1 P3 { } => PsimP(P3, q) = 0.41 + 0.31 = 0.7 Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM) b) Peer Ranking Mechanism. • Before forwarding a Query Message a peer performs an on-the-fly ranking of its peers to determine the best paths. • We use the Aggregate Similarity of peer Pi to a query q, computed by a peer Pk as:
Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM) c) Similarity Function – The cosine similarity. • Assume that Lis a set of all words (in Profile Manager)\ e.g. L={elections, bush, clinton, super, bowl, san, diego, … ,italy, earthquake, disaster} • We define an |L|-dimensional space where each query is a vector. If q=“italy disaster” => q (vector of q) = [0,0,0,…,1,0,1] • Recall that we have a vector for each qi stored in the Profile Manager ( i.e. qi)
Peer q Techniques for Distributed I.R. 3. Intelligent Search Mechanism (ISM) d) Search Mechanism • Utilizes the Peer Ranking Mechanism to forward Queries to nodes that will potentially contain the info we are looking for Peer d profiles ? QUERY 1 ?
Presentation Outline • Introduction: Information Retrieval (I.R) in Peer-to-Peer networks. • Techniques for Distributed I.R. • Breadth-First Search. • Random Breadth-First Search. • Intelligent Search with profiling. • Experimental Evaluation. • Related Work. • Conclusions & Future Work.
mexico Data-Peer (e.g. usa) argentina Routing Structures (Profiles) u.k china italy XQL PDOM-XML P2P Network Manager Module india france greece germany usa.graph XML Data Files Experimental Evaluation • We use a decentralized Newspaper application built on top of the REUTERS dataset (22,531 documents grouped by 84 countries). • Random Network of 100 peers • Each peer has documents from 3 countries • The average degree of a node is 7 ~= log2100 (connected graph)
Experimental Evaluation • We perform 400 sequential queries with a delay of 4 sec. • We compare Doc. Ratio (recall rate) vs. Num. of messages • BFS (Gnutella Message Flooding) (forward to degree nodes). • Modified BFS (randomly forward to degree/2 nodes). • Intelligent Search Mechanism (forward to M=3 highest rank nodes + 1 random).
Experimental Evaluation • We measure Doc. Ratio (recall rate) vs. Num. of messages with Time-to-Live (TTL)=4 • BFS (Gnutella) uses ~763 messages w/ recall rate 100% • Random BFS(degree/2) uses ~120 (16%) msgs w/ recall rate 42% • Intelligent Search uses ~131 (17%) msgs w/ recall rate ~55% • Recall Rate improves over time with Intelligent Search since Peer Profiles get more knowledge.
Experimental Evaluation • We again measure Doc. Ratio (recall rate) vs. Num. of messages by increasing Time-to-Live (TTL) = 5 • BFS (TTL=4) uses ~763 messages w/ recall rate 100% • Random BFS(degree/2) uses ~28% msgs w/ recall rate ~72% • Intelligent Search uses ~35%(of BFS msgs) w/ recall rate ~90% ! • A large number of peers receive unnecessary messages. • We get almost identical recall (90%) with only 35% of msgs
Presentation Outline • Introduction: Information Retrieval (I.R) in Peer-to-Peer networks. • Techniques for Distributed I.R. • Breadth-First Search. • Random Breadth-First Search. • Intelligent Search with profiling. • Experimental Evaluation. • Related Work. • Conclusions & Future Work.
Related Work • Improving Search in P2P B.Yang et al. (Stanford) • Iterative Deepening, until Z results are returned • Directed BFS based on aggregate statistics (e.g. num of results a peer returned, shortest queue, forwarded the most data) • Local Indexes, each node maintains an index over the data of peers r hops away. • Routing Indices for P2P Crespo et al. (Stanford) • Compound Indices, each node sends a clustered summary of its topic to its neighbors. (e.g. 100 databases, 4 theory, 10 OS) • Might be too costly for Highly dynamic P2P systems.
Related Work • Freenet (Clark et al.) Search by Identifiers. uses SHA1 hashes of resources and information is retrieved based on the key closeness in a DFS manner. • Others such as Chord. Systems that focus on scalable object location, which becomes feasible by hashing and distributing objects in the P2P system. (Searches are by Identifier).
Conclusions • P2P systems offer several advantages such as scalability, robustness and simplicity of use. • Efficient P2P Information Retrieval is not feasible with the current Search Algorithms. • We propose an Intelligent Search Mechanism that uses local knowledge to improve Information Retrieval in P2P. • Our mechanism achieves 90% recall rate while using only 35% of the initial messaging.
Future Work • We plan to deploy our middleware infrastructure on a larger P2P network with more Queries. • We want to probe different Network Topologies such as ASMap with PowerLaws. • We want to probe different Peer-Profile maintenance policies at peers. • Compare the performance of our method with different proposed algorithms (iterative deepening, local indexes, etc).
“A Local Search Mechanism for Peer-to-Peer Networks” Vana Kalogeraki, Dimitrios Gunopulos & Demetris Zeinalipour (University of California – Riverside) < vana@cs.ucr.edu, dg@cs.ucr.edu, csyiazti@cs.ucr.edu > CIKM 2002 – Eleventh International Conference on Information and Knowledge Management November 4-9, Mclean VA http://www.cs.ucr.edu/~csyiazti/publications.html