Exploiting locality for scalable information retrieval in peer-to-peer networks D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos Manos Moschous November, 2004
Issues in p2p networks • Content-based vs. file-identifier information retrieval. • Dynamic (ad-hoc) networks. • Scalability (no global knowledge). • Query messages (flooding causes network congestion). • Recall rate. • Efficiency (recall rate / query messages). • Query Response Time (QRT).
IR in pure p2p networks • BFS technique • Each peer forwards the query to all of its neighbors. • Simple. • Poor performance and network utilization. • Uses a TTL to limit propagation. • RBFS technique • Each peer forwards the query to a random subset of its neighbors. • Reduces query messages. • Probabilistic algorithm.
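As a rough sketch, RBFS's forwarding step can be written as follows (the function name and the `fraction` parameter are illustrative assumptions, not values from the paper):

```python
import random

def rbfs_forward(neighbors, fraction=0.5, rng=None):
    """RBFS: instead of flooding all neighbors (BFS), forward the
    query to a random subset of them. `fraction` is illustrative."""
    rng = rng or random.Random()
    k = max(1, int(len(neighbors) * fraction))
    return rng.sample(neighbors, k)

# With 8 neighbors and fraction 0.5, the query goes to 4 random peers.
print(len(rbfs_forward(list(range(8)))))  # → 4
```

Because the subset is random, RBFS may miss the network segments that actually hold relevant documents; this is the weakness that ranking-based approaches try to address.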
IR in pure p2p networks • >RES technique • Each peer forwards the query to some of its neighbors, based on aggregated statistics. • Heuristic: the peers that returned the Most Results in the Past (over the last 10 queries). • Explores: • the larger network segments; • the most stable neighbors; • but not (!) the nodes whose content is actually related to the query. • >RES is a quantitative rather than a qualitative approach.
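A minimal sketch of the >RES selection rule, assuming a per-neighbor history of result counts (function and variable names are illustrative):

```python
def res_select(history, m=10, k=2):
    """>RES sketch: rank neighbors by the total number of results
    they returned for their last m queries, and forward the query
    to the top k. `history` maps a neighbor id to its per-query
    result counts, most recent last."""
    totals = {p: sum(counts[-m:]) for p, counts in history.items()}
    return sorted(totals, key=totals.get, reverse=True)[:k]

history = {"A": [0, 5, 3], "B": [9, 0, 0], "C": [1, 1, 1]}
print(res_select(history, k=2))  # → ['B', 'A']
```

Note that the ranking ignores *which* queries produced the results, which is exactly why >RES is quantitative rather than qualitative.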
The intelligent search mechanism (ISM) • Main idea: for each query, a peer estimates which of its peers are most likely to reply to that query, and propagates the query message to those peers only. • Exploits the locality of past queries. • Some characteristics: • Entirely distributed (requires only local knowledge). • Scales well with the size of the network. • Scales well to large data sets. • Works well in dynamic environments. • High recall rates. • Minimizes communication costs.
Architecture (ISM) (1/4) • Profiling structure: • a single table of past queries; • an LRU policy keeps the most recent queries; • the table size is limited, which keeps performance good.
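The bounded, LRU-evicted query table could be sketched like this (class name, capacity, and stored fields are illustrative assumptions):

```python
from collections import OrderedDict

class QueryProfile:
    """Illustrative profiling structure: a bounded table of the most
    recent queries seen from a neighbor, with LRU eviction."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> results returned

    def record(self, query, hits):
        self.entries.pop(query, None)   # refresh position if present
        self.entries[query] = hits
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop least recent query

p = QueryProfile(capacity=2)
p.record("p2p search", 4)
p.record("xml index", 1)
p.record("peer ranking", 7)          # evicts "p2p search"
print(list(p.entries))  # → ['xml index', 'peer ranking']
```

Bounding the table is what lets ISM remain local and cheap: only the most recent, most reusable query evidence is kept.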
Architecture (ISM) (2/4) • Query similarity function (cosine similarity). • Assumption: a peer that has a document relevant to a given query is also likely to have other documents that are relevant to other similar queries. • Qsim: Q² → [0,1], where L is the set of all words that have appeared in queries. • Example: for |L| = 4, q = (1,1,0,0) and qi = (1,0,1,0).
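The cosine similarity over binary query vectors, reproducing the slide's example:

```python
import math

def qsim(q, qi):
    """Cosine similarity of two queries represented as binary vectors
    over L, the set of all words that have appeared in queries."""
    dot = sum(a * b for a, b in zip(q, qi))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in qi))
    return dot / norm if norm else 0.0

# Slide example: |L| = 4, q = (1,1,0,0), qi = (1,0,1,0)
print(round(qsim([1, 1, 0, 0], [1, 0, 1, 0]), 2))  # → 0.5
```

The two queries share one of their two words each, giving 1 / (√2 · √2) = 0.5.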
Architecture (ISM) (3/4) • Peer ranking (Relevance Rank): RR_Pl(Pi, q) = Σ_j Qsim(qj, q)^α · S(Pi, qj), where: • Pi: each candidate peer. • Pl: the decision-maker node. • qj: a past query in Pl's profile. • α: allows us to add more weight to the most similar queries. • S(Pi, qj): the number of results returned by Pi for query qj.
Architecture (ISM) (4/4) • Search mechanism: • invoke the RR function to rank the neighbors; • forward the query to the top k (threshold) peers only.
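Putting the two steps together, a sketch of the ISM forwarding decision (the profile layout, `toy_sim` helper, and parameter names are illustrative assumptions; the paper uses the cosine Qsim above):

```python
def ism_forward(profiles, q, qsim, k=2, alpha=1.0):
    """ISM sketch: score each neighbor Pi with
    RR(Pi, q) = sum_j Qsim(qj, q)**alpha * S(Pi, qj),
    where profiles[Pi] is a list of (past query, results returned),
    then forward q to the k highest-ranked neighbors only."""
    rr = {pi: sum(qsim(qj, q) ** alpha * s for qj, s in past)
          for pi, past in profiles.items()}
    return sorted(rr, key=rr.get, reverse=True)[:k]

# Toy similarity for the example: Jaccard overlap of query words.
def toy_sim(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b))

profiles = {"A": [(("p2p", "search"), 5)], "B": [(("music",), 9)]}
print(ism_forward(profiles, ("p2p",), toy_sim, k=1))  # → ['A']
```

Even though B returned more results overall, A's past query is similar to the new one, so ISM prefers A; this is the qualitative step that >RES lacks.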
Experiments • Peerware: a distributed middleware infrastructure. • GraphGen: generates network topologies. • dataPeer: a p2p client which answers boolean queries from its local XML repository (XQL). • SearchPeer: a p2p client that performs queries and harvests the answers back from a Peerware network (it connects to a dataPeer and performs queries).
Experiments - DMP • If node Pk receives the same query q a second time with TTL2 > TTL1 (the TTL of the first copy), we allow the TTL2 message to proceed. • This may allow q to reach more peers than its predecessor did. • Without this fix the BFS behaviour is not predictable, and it may fail to find the nodes it was supposed to find. • Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP. • The experimental results presented in this work do not suffer from DMP. • This is why the number of messages is slightly higher (~30%) than the expected number of messages. • The total number of messages, for n nodes each with degree d_i, should be approximately Σ_{i=1..n} d_i.
Experiments - DMP • Query examples: • a set of 4 keywords; • at least one keyword of ≥ 4 characters. • Random topology: • each vertex selects its d neighbors randomly; • simple; • leads to connected topologies if the degree d > log2(n).
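A sketch of such a random-topology generator, under the stated rule that every vertex picks d neighbors at random (edges are made bidirectional, so actual degrees are at least d; the function name and seed are illustrative):

```python
import random

def random_topology(n, d, seed=0):
    """Each vertex selects d neighbors uniformly at random; edges are
    bidirectional, so every node ends up with degree >= d. Connected
    with high probability when d > log2(n)."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for u in rng.sample([u for u in range(n) if u != v], d):
            adj[v].add(u)
            adj[u].add(v)
    return adj

g = random_topology(104, 8)   # the Set1 topology: 104 nodes, degree 8
print(min(len(ns) for ns in g.values()) >= 8)  # → True
```

With n = 104 and d = 8 > log2(104) ≈ 6.7, the generated graph is connected with high probability, matching the condition on the slide.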
Experiments (Set1) • Reuters-21578 Peerware • Random topology of 104 nodes (static) with average degree 8 (running on a network of 75 workstations). • The documents are categorized by their country attribute (104 country files, one per node); each country file has at least 5 articles. • Data sets: • Reuters 10X10: a set of 10 random queries which are repeated 10 consecutive times (high locality of similar queries), which better suits ISM. • Reuters 400: a set of 400 random queries uniformly sampled from the initial 104 country files (lower repetition).
Results (Set1) – Reuters 10X10 (1/4) • Reducing query messages • ISM finds more documents than RBFS and >RES. • ISM achieves almost 90% recall rate while using only 38% of BFS's messages. • ISM and >RES initially suffer from a low recall rate.
Results (Set1) – Reuters 10X10 (2/4) • Digging deeper by increasing the TTL • Reaches more nodes, deeper in the network. • ISM achieves 100% recall rate while using only 57% of BFS's messages with TTL=4.
Results (Set1) – Reuters 10X10 (3/4) • Reducing query response time (QRT) • ~30-60% of BFS's QRT for TTL=4 and ~60-80% for TTL=5. • ISM requires more time than >RES because its decision involves some computation over the past queries.
Results (Set1) – Reuters 400 (4/4) • Improving the recall rate over time • ISM achieves a 95% recall rate while using 38% of BFS's messages. • During queries 150-200 major outbreaks occur in BFS. • ISM requires a learning period of about 100 queries before it starts matching the performance of >RES.
Experiments (Set2) • TREC-LATimes Peerware (random topology of 1000 nodes, static). • It contains approximately 132,000 articles. • These articles were horizontally partitioned into 1000 documents (each document contains 132 articles). • Each peer shares one or more of the 1000 documents (articles are replicated).
Experiments (Set2) • Data sets: • TREC 100: a set of 100 queries out of the initial 150 topics. • TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, which are repeated 10 consecutive times. • TREC 50X2: we first generated a set a of 50 queries randomly sampled out of the initial 150 topics, then merged it with another 50 queries randomly sampled out of a.
Results (Set2) – TREC100 (1/3) • Searching in a large-scale network topology • For TTL=5 we reach 859 of the 1000 nodes (BFS). • For TTL=6 we reach 998 of the 1000 nodes at a cost of 8500 messages/query. • For TTL=7 we reach all nodes at a cost of 10,500 messages/query. • ISM will not exhibit any learning behavior if the frequency of query terms is very low.
Results (Set2) – TREC 10X10 (2/3) • The effect of high term frequency • The recall rate improves dramatically when the frequency of query terms is high. • ISM achieves a higher recall rate than BFS (with BFS's TTL=5). • After a learning phase of 20-30 queries it scores 120% of BFS's recall rate while using 4 times fewer messages.
Results (Set2) – TREC 50X2 (3/3) • The effect of high term frequency • A more realistic set: a few terms occur many times in queries and most terms occur less frequently. • ISM monotonically improves its recall rate, and at the 90th query it again exceeds BFS's performance. • >RES's recall rate fluctuates and behaves as badly as RBFS's when the queries don't follow any constant pattern.
Experiments (Set3) • Searching in dynamic network topologies • Why network failures? • Misuse at the application layer (shutting down the PC without disconnecting). • An overwhelming amount of generated network traffic. • Poorly written p2p clients. • Simulating a dynamic environment: • The total number of suspended nodes is bounded by drop_rate. • drop_rate is evaluated every k seconds against a random number r. • If r < drop_rate the node breaks all its incoming and outgoing connections (for l seconds). • In our experiments: • k = 60,000 ms and l = 60,000 ms. • TREC-LATimes Peerware with the TREC 10X10 query set. • drop_rate ∈ {0.0, 0.05, 0.1, 0.2}. • r is a random number uniformly generated in [0.0, 1.0).
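One evaluation round of this failure model can be sketched as follows (the function name is illustrative; the per-node draw of r against drop_rate follows the rule above):

```python
import random

def failure_round(nodes, drop_rate, rng):
    """Every k seconds each node draws r uniformly from [0.0, 1.0);
    if r < drop_rate the node breaks all its connections for l
    seconds. Returns the set of nodes suspended in this round."""
    return {n for n in nodes if rng.random() < drop_rate}

rng = random.Random(0)
suspended = failure_round(range(1000), 0.1, rng)
# With drop_rate = 0.1, roughly 10% of the 1000 nodes are suspended.
print(0.05 < len(suspended) / 1000 < 0.15)  # → True
```

Repeating this every k = 60,000 ms, with suspensions lasting l = 60,000 ms, keeps the expected fraction of offline nodes near drop_rate at any point in time.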
Results (Set3) (1/3) • BFS mechanism • Increasing drop_rate decreases the number of messages. • BFS does not exhibit any learning behavior at any level of drop_rate. • BFS is tolerant of small drop_rates (5%) because it is highly redundant.
Results (Set3) (2/3) • >RES mechanism • Increasing drop_rate decreases the number of messages. • >RES does not exhibit any learning behavior at any level of drop_rate.
Results (Set3) (3/3) • ISM mechanism • Increasing drop_rate decreases the number of messages. • ISM performs quite well at low levels of drop_rate. • It is not expected to be tolerant of large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets a chance to be utilized).
Extending ISM to different environments • The ISM mechanism could easily become the query routing protocol for hybrid p2p environments (KaZaA, Gnutella). • Super Peers form a backbone infrastructure (long-lived network connectivity). • Regular Peers are unstable and less powerful. • How could it work? • A regular peer obtains a list of active Super Peers. • It connects to one or more Super Peers and posts queries. • A Super Peer utilizes the ISM mechanism and forwards the query to a selective subset of its Super Peer neighbors.
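A hypothetical sketch of the routing step at a Super Peer (all names here are illustrative, not from the paper): answer the query from the local index, then forward it to the highest-ranked Super Peer neighbors, where the ranking would come from ISM's RR function.

```python
def route_at_super_peer(local_index, query, rank, k=2):
    """Hybrid-deployment sketch: a Super Peer answers the query from
    its local index, then forwards it to the k neighbors that `rank`
    (e.g. ISM's RR scores for this query) values highest.
    `rank` maps a Super Peer neighbor id to a relevance score."""
    hits = [doc for doc, words in local_index.items() if query <= words]
    next_hops = sorted(rank, key=rank.get, reverse=True)[:k]
    return hits, next_hops

index = {"d1": {"p2p", "search"}, "d2": {"music"}}
hits, hops = route_at_super_peer(index, {"p2p"},
                                 {"S1": 3.0, "S2": 0.5, "S3": 1.2}, k=2)
print(hits, hops)  # → ['d1'] ['S1', 'S3']
```

Keeping the ISM profiles only on the stable Super Peer backbone sidesteps the obsolescence problem that regular-peer churn would cause.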