Text-Based Content Search and Retrieval in ad hoc P2P Communities Francisco Matias Cuenca-Acuna Thu D. Nguyen http://www.panic-lab.rutgers.edu/
Motivation • It is hard to find information in current P2P infrastructures • They are designed for name-based search • They don’t have quality metrics • They don’t rank results • Most are optimized to find popular content • The current Internet search model has proven effective for locating data • Intuitive term-based query model • Quality metrics and ranking are critical factors in the success of Internet search engines • They help users quickly pinpoint relevant documents in a vast repository
Goals & challenges • Empower P2P communities with search capabilities similar to those of Internet search engines • No central servers • Fault tolerance • Cannot employ the current model used by Internet search engines • No central management and administration • Resources are fragmented • Peer behavior is uncontrolled
Summary of PlanetP • Nodes maintain an index of their content • Represented as Bloom filters • Indexes and directories are replicated everywhere • Gossiping keeps peers synchronized • [Figure: two peers, each with local files (XML snippets), a Bloom filter over its keys [K1,..,Kn], and a local directory; gossiping connects the peers]
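To make the "index represented as a Bloom filter" idea concrete, here is a minimal sketch of how a peer might summarize the terms in its local files. The class name `BloomFilter`, the helper `summarize`, the filter size, the number of hashes, and the use of SHA-256 are all assumptions for illustration; the slides do not specify any of them.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; size and hash count are illustrative choices."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # May report a false positive, but never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def summarize(documents):
    """Build one filter over every whitespace-separated term in the documents."""
    bf = BloomFilter()
    for text in documents:
        for term in text.lower().split():
            bf.add(term)
    return bf
```

A peer would call `summarize` on its local files and gossip the resulting filter; other peers can then test `term in bf` without seeing the files themselves.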
Content search in PlanetP • [Figure: a query is processed in the steps local lookup, rank nodes, contact candidates, rank results, then STOP. The local directory maps each nickname (Alice, Bob, Charles, Diane, Edward, Fred, Gary) to its keys [K1,..,Kn]; Bob, Diane and Fred appear as candidate peers, and File1, File2 and File3 as retrieved results]
The Vector Space model • Documents and queries are represented as k-dimensional vectors • Words are weighted according to their relevance to the document • Documents are weighted according to their words • The angle between a query vector and a document vector indicates their similarity • [Figure: a document vector and a query vector]
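The angle-based similarity above is usually computed as the cosine between the two vectors. The following sketch assumes sparse vectors stored as dictionaries from term to weight; the function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if query_norm == 0.0 or doc_norm == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (query_norm * doc_norm)
```

A smaller angle gives a value closer to 1, so documents can be sorted by this value in descending order.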
Weight assignment (TFxIDF) • Idea • Use per-document Term Frequency (TF) to weight words (W_{D,t}) • Use inverse global popularity (IDF) to find good discriminators among the query terms • Intuition • TF indicates how related a document is to a particular concept • Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents • W_{D,t} = f(frequency of t in D) • IDF_t = f(no. of documents / frequency of t across documents)
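The slide leaves both functions f unspecified. As a sketch, one common choice is a logarithmic form for each: W_{D,t} = 1 + log(count of t in D) and IDF_t = log(1 + N / F_t), where F_t is read here as the total number of occurrences of t across all documents. Both the logarithmic forms and that reading of "frequency of t across documents" are assumptions; textbook variants often count documents containing t instead.

```python
import math
from collections import Counter


def tf_weight(count):
    """Assumed form of W_{D,t}: dampened term frequency."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(num_docs, freq_across_docs):
    """Assumed form of IDF_t: rarer terms receive larger values."""
    return math.log(1.0 + num_docs / freq_across_docs)


def score(query_terms, doc_tokens, num_docs, freq_across_docs):
    """Sum TF x IDF over the query terms that occur in the document."""
    counts = Counter(doc_tokens)
    return sum(
        tf_weight(counts[t]) * idf(num_docs, freq_across_docs[t])
        for t in query_terms
        if counts[t] > 0
    )


# Toy collection: "p2p" is in every document, "search" in only one.
docs = [["p2p", "search"], ["p2p", "gossip"], ["p2p", "bloom"]]
freq = Counter(t for d in docs for t in d)
rare = score(["search"], docs[0], len(docs), freq)
common = score(["p2p"], docs[0], len(docs), freq)
```

In this toy collection `rare` exceeds `common`, which matches the intuition that rare terms are better discriminators.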
Node & document ranking in PlanetP • Unfortunately, IDF is not well suited to P2P • Requires an appearance count for every word in the community • We introduce the use of the Inverse Peer Frequency (IPF) • IPF_t = f(no. of peers / no. of peers with documents containing t) • IPF can be computed with local information • IPF is compatible across the community
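Because every peer holds a copy of the directory, IPF can be computed without contacting anyone. The sketch below assumes IPF_t = log(1 + N / N_t) and ranks a peer by summing IPF over the query terms its summary claims to contain; both the logarithmic form and the summed ranking rule are assumptions, since the slide gives only the generic f. Plain Python sets stand in for the Bloom filters, and the names `ipf` and `rank_peers` are hypothetical.

```python
import math


def ipf(term, directory):
    """Assumed IPF_t = log(1 + N / N_t); directory maps peer -> term summary."""
    holders = sum(1 for summary in directory.values() if term in summary)
    if holders == 0:
        return 0.0  # no peer advertises the term
    return math.log(1.0 + len(directory) / holders)


def rank_peers(query_terms, directory):
    """Return peers with a nonzero score, best first (ties broken by name)."""
    scores = {}
    for peer, summary in directory.items():
        total = sum(ipf(t, directory) for t in query_terms if t in summary)
        if total > 0:
            scores[peer] = total
    return sorted(scores, key=lambda peer: (-scores[peer], peer))


# Sets stand in for the per-peer Bloom filters held in the local directory.
directory = {
    "Alice": {"gossip", "bloom"},
    "Bob": {"gossip"},
    "Charles": {"music"},
}
candidates = rank_peers(["gossip", "bloom"], directory)
```

Here Alice ranks above Bob because she advertises both query terms, and Charles is dropped because he advertises neither.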
Stopping condition • Intuitive idea: stop as soon as k documents are retrieved • Not good: a node might have a few highly ranked documents and many low-ranked ones • We propose an adaptive approach: • Contact nodes one by one and keep a list of the top k documents retrieved • Stop contacting candidates when p nodes in a row fail to contribute to the top k
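The adaptive rule can be sketched as a loop over the ranked candidates. The function name `adaptive_search` and the `fetch` callback (which returns a peer's scored documents) are hypothetical stand-ins for the real network call.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers in rank order; stop after p consecutive non-contributors.

    fetch(peer) must return a list of (score, doc_id) pairs.
    Returns (top-k results sorted best first, number of peers contacted).
    """
    top = []  # min-heap holding at most k (score, doc_id) pairs
    misses = 0
    contacted = 0
    for peer in ranked_peers:
        contacted += 1
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
                contributed = True
            elif score > top[0][0]:
                # Better than the current k-th best: replace it.
                heapq.heapreplace(top, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top, reverse=True), contacted


# Toy data: peer D is never reached because B and C both fail to contribute.
results = {
    "A": [(0.9, "a1"), (0.8, "a2")],
    "B": [(0.1, "b1")],
    "C": [(0.05, "c1")],
    "D": [(0.95, "d1")],
}
top_docs, num_contacted = adaptive_search(["A", "B", "C", "D"], results.get, k=2, p=2)
```

The toy run also shows the trade-off: a small p saves messages but can miss a good document held by a poorly ranked peer.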
Evaluation method • We use five well-known document collections • Each collection comes with a set of queries and relevance judgments • Here we present results for one (AP89) • We measure recall and precision
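For reference, recall and precision for a single query can be computed from the retrieved set and the collection's relevance judgments. This is a minimal sketch of the standard definitions; the function name and the sample document identifiers are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, two of which are among the three relevant ones.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
```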
Evaluation method • We use a simulator to test our algorithm • Different file distributions • Against a central search engine • Quantifying the effect of not using an adaptive stopping condition
More results • Adjusting the stop condition according to the community size and number of results expected • We provide a linear function to determine p • Recall as the community grows to 1000 (scalability) • Overlap between PlanetP’s results and the ones obtained by using standard TFxIDF • 80% on average
Conclusions • PlanetP matches TFxIDF's performance using the TFxIPF approximation • Gives P2P communities search capabilities as powerful as those of environments with centralized resources • TFxIPF is applicable beyond PlanetP • PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community • Our stopping heuristic limits searches to a small subset of the community yet allows enough peers to be contacted to guarantee good results
Related Work • Tapestry, Pastry, Chord and CAN • Implement a distributed hash table for P2P environments • Oriented towards name-based searches (for FS) • They already store all the information needed to implement TFxIPF • CORI and GlOSS • Address the problem of indexing and searching distributed collections of documents • They build a centralized index that has total knowledge of word usage, so they don’t contact unnecessary nodes