410 likes | 421 Views
Explore the use of semantic overlay networks for efficient peer-to-peer information retrieval. Learn about P2P search applications, architectures, and innovative techniques like Rolling-Index and Content-Directed Search.
E N D
Search in P2P architecture Ge Xu, Nadia Sheikh, Xia Wang
P2P Data Dissemination and Retrieval • We will discuss • Two types of P2P search applications • Use P2P as search engine for IR • P2P file sharing and retrieval • Two kinds of P2P architectures • Structured • Unstructured
Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks C. Tang, Z. Xu, S. Dwarkadas SIGCOMM’03, Kalsruhe, Germany Ge Xu
Information Retrieval (IR) • IR systems are more and more popular and important for internet applications
Centralized IR Systems • Problems • Scalability • Server failure Server client client client client
IR on P2P • P2P architecture can be used for building highly scalable IR systems • Key Problem--randomly distributed data • Flooding: too expensive • Selective: risk of missing document. • This paper aims to release this problem by semantic overlay
Some preparation knowledge of IR • Vector Space Model (VSM) • Every document is present as a vocabulary vector D=( ) • As the same, the query Q is also present as a vector ( ). • The similarity between D and Q is computed by • Key problem of VSM. • Dimension of vector is too large.
Some preparation knowledge of IR • Latent Semantic Indexing (LSI) • Project the vector to a low-dimension space by singular value decomposition (SVD). • Actually, LSI clusters documents by topic. • The semantic related documents are projected to vector space closely. • How to organize a network whose nodes are corresponding to a vector space.
Semantic Overlay Network • Content-Addressable Network (CAN) • CAN partitions a d-dimension Cartesian space into zones. • Assigns each zone to a node. • When a new node join, assign it a point in Cartesian space, route to the zone contains the point, then split that zone and assign the new zone to the new node. B E A (0-0.5) (0-0.5) (0,5-0.75) (0-0.5) (0,75-1) (0-0.5) C D (0-0.5) (0.5-1) (0.5-1) (0.5-1) A 2-dimension CAN
Main Idea of PSEARCH A D C B PSEARCH Engine
Discussion about this strategy • Advantages • All data transmission is query and reference of result which is independent with corpus size. • Scalable, failure-tolerance. • Key problems it will cope with • The dimension of semantic vector is much larger than the overlay network. • Semantic vectors are not distributed evenly in semantic space. • It is proved that limiting search region in high-dimension space is difficult.
Rolling-Index: Dimension Difference • Motivation • The number of dimension relevant to a particular document is much smaller. • Queries are usually short and likely to be captured by a few concepts. • Algorithm • Rotate the semantic vector V=( ) to some rotated space • The index of document is distributed in each rotated space i. • Query is also rotated and routed to every corresponding node.
Balancing Index Distribution Idea is that assign special Cartesian space to node. • When new node joins, randomly select a document, randomly select a rotate space i, decide node position by this space.
Content-Directed Search • Main Idea: use the content of node and the search history to guide the current search. • Algorithm • Figure out a vector which contains info about content and search history. • Compare the query vector with it. • Set a threshold to stop search steps.
Summery • PSEARCH is the first P2P IR system on semantic overlay. • It make the IR system scalable and reduce the computation and data transmission in a large degree. • One problem is that it is not proved that it could be used in huge corpus such as web IR which contains billions of files.
YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology Authors: Prasanna Ganesan, Qixiang Sun, Hector Garcia-Molina Stanford University, INFOCOM2003 Presenter: Xia Wang
Outline • Research problem • Existing solutions • Overview YAPPERS • Maintenance • Enhancement • Evaluation • Conclusion
Research problem • Building a peer-to-peer lookup service on top of an arbitrary dynamic P2P overlay network. • A lookup service is a service that maintains a dynamic set of key-value associations (multiple values for a single key), and permits queries that request values associated with a key. • Partial lookup: a query that requests some values of a key • Total lookup: a query that requests all the values of a key
Existing solutions • Gnutella: no organization of the content in the network. Each node simply stores its own <key, values> pairs. • Total lookup in Gnutella requires flooding the entire network. • Frequent network topology changes have very little impact on performance.
Existing solutions (Cont’) • DHT-based systems: Can, Chord, Pastry, and Tapestry • A distributed hash table on top of the overlay networks. • Keys are hashed into a keyspace, each node is responsible for a small segment of the keyspace. • Advantage: Lookup is more efficient • Disadvantages: • Require strict constraints on the overlay topology Node joins and departures may force state changes at many other nodes. • Hosts with popular files may become hot spots
Goal of YAPPERS(Yet Another Peer-to-Peer system) • Impose no constraints on topology • Optimize for partial lookups. • Contact only nodes that can contribute to the results rather than flooding blindly. • Minimize the effect caused by topology changes.
Overview of YAPPERS • The keyspace is partitioned into a small number of buckets (colors), each node is also assigned a color. • <key, value> is stored on a node with the same color as the key. • A query for a key is forwarded to those nodes that have the same color as the key.
Partition Nodes Given any overlay, first partition nodes into buckets (colors) based on hash of IP Cited from author’s presentation in INFOCOM2003
Partition Nodes Given any overlay, first partition nodes into buckets (colors) based on hash of IP Cited from author’s presentation in INFOCOM2003
Partition Nodes (2) Around each node, there is at least one node of each color X Y Cited from author’s presentation in INFOCOM2003 May require backup color assignments
Store yellow content at a yellow node Register Content Partition content space into buckets (colors) and register pointer at “nearby” nodes. Nodes around Z form a small hash table! Z Store red content locally Cited from author’s presentation in INFOCOM2003
Searching Content Start at a “nearby” colored node, search other nodes of the same color. W X Y U V Z Cited from author’s presentation in INFOCOM2003
Searching Content (2) A smaller overlay for each color and use Gnutella-style flood Cited from author’s presentation in INFOCOM2003 Fan-out = degree of nodes in the smaller overlay
Basic Algorithm • IN(A) = {v | δG(v, A) <= h} where δG returns the minimum distance between two nodes in G. So IN(A) contains all nodes within h hops of A in the overlay network, including A itself. • A node X with IP address IPx is assigned key k if HASH(k) (HASH(IPx) mod b) where b is the number of distinct hash buckets.
Basic Algorithm (Cont’) • Multiple nodes in IN(X) have the same color as the key k --- randomly pick one • No nodes in IN(X) have the same color as the key k --- Backup scheme • (Backup): If no Ci isin IN(X) color Ci is assigned to a node with color C(i+1)mod b. If there are multiple nodes of color C(i+1)mod b, choose the node with the smallest IP address. (like successor(k) in Chord)
Basic Algorithm (Cont’) • frontier of node A N(v) is the set of nodes directly connected to node v • Extended neighborhood
B, C are included in IN(A) • Nodes D and E are in the F(A). • Therefore, EN(A) includes H and J. • According to the definition, EN(A) consists of all nodes within 2h+1 hops of A Fig. 2. Relationship between IN(A), F(A), and E(A)
Total lookup example Fig. 1. Using both the immediate (denoted by IN) and the extended (denoted by EN) neighborhood information, a lookup for a key kg at node A can reach all nodes that stores key kg
Since YAPPERS uses a 2h+1 hops extended neighborhood, and edge deletion requires both X and Y to broadcast the event to its surviving neighbors with a time-to-live field of 2h hops. X Y 2h hops 2h hops Topology maintenace • Edge deletion:
Edge Insertion X needs Y’s neighbors information to build its IN(X) and EX(X). Z only needs to know the topology information that is within 2h-1 hops of Y. X Z Y
Enhancement • Fringe node problem: Using the backup assignment, a node with high connectivity may be assigned by its neighbors more colors than it desires. • Large fan-out problem: Each of A’s frontier nodes may point to one or more different forwarding nodes. Consequently, the forwarding fan-out degree at node A is proportional to the number of the frontier nodes. Ex. If an overlay network’s average node degree is 5, then the fan-out degree is O(5h+1) In a star topology, the central node A is overloaded by the fringe nodes B, C, D, and E as they assign large chunks of key space to A.
Enhancement (Cont’) • Fringe node problem: • Prune away low connectivity nodes. Ex. Recursively prune all nodes with degree 1 from the overlay network. • Biased backup node assignment: A node with small immediate neighborhood will not assign backup colors to a node with large immediate neighborhood.
Enhancement (Cont’) • Reducing Forward Fan-out: • If a frontier node F assigns C to node B via the backup mechanism, then forward the request to B. • If a frontier node F assigns C to a set of nodes S, do not forward to any nodes in S if SIN(A). Otherwise, only forward to one of the nodes in S. • In step (2), when choosing one node from S, try to pick common nodes between multiple frontier nodes.
Evaluation Search Efficiency
Evaluation Search overhead
Conclusion • Advantages: • YAPPERS is a Hybrid that uses distributed hash tables (DHTs) in small areas and uses intelligent forwarding over large areas. • Disturbs only a small fraction of the nodes in the overlay for each search. • No requirements of restructuring the underlying overlay network • Each node requires only knowledge of a small neighborhood and is independent of the rest of the overlay and is less affected by the node arrivals and departures. • Disadvantages: • Although YAPPERS expects small affection of the whole network on node departure and arrival, the actual cost may not be much better than other DHT-based system. No comparison to other DHTs has been done in the paper. • Fan-out degree is much higher than Gnutella even with enhancement.