This project explores a self-organizing search engine for distributed content-based search on structured peer-to-peer overlay networks. The goal is to build a scalable and efficient search engine capable of indexing and searching rich content such as HTML, plain text, music, and image files. Two algorithms, V-hash and E-hash, are proposed for controlled placement of document indices on the overlay network to improve search accuracy. Experimental results show that the system achieves comparable accuracy to centralized information retrieval systems with significantly lower resource consumption.
Distributed Content-based Search on Structured Peer-to-Peer Overlay Networks Chunqiang Tang*, Zhichen Xu, Sandhya Dwarkadas*, Mallik Mahalingam HP Labs, Hewlett-Packard Company *Univ. of Rochester
Motivation • 93% of information produced worldwide is in digital form • Unique data added yearly exceeds one exabyte (10^18 bytes) • The volume of digital content is estimated to double annually • The contents are becoming richer • Efforts are under way to make these contents easier to access (e.g., QBIC, MPEG-7) This calls for scalable infrastructures capable of indexing and searching rich content such as HTML, plain text, music, image files, and so forth. This particular work focuses on content-based search Zhichen Xu
Motivation (cont’d) • P2P systems are scalable, fault-tolerant, and self-organizing • Progress has been made in storage, DNS, media streaming, web caching…, raising hope for a self-organizing distributed search engine • Content-based search in P2P is NOT yet solved • Most systems use simple keyword matches • They ignore developments in information retrieval (IR) • Rich queries are hard to support, e.g., searching for a song by whistling a tune, or searching for an image by submitting a sample of patches • They also have efficiency and accuracy problems • Centralized indexing, index/query flooding • Inaccuracy, high maintenance cost of heuristic-based approaches
The goals of our project • Build a self-organizing search engine out of P2P nodes • Extend centralized IR algorithms • Vector space model (VSM) and latent semantic indexing (LSI) • Documents and queries as vectors; not specific to text [P. Raghavan] • Differences from “centralized” systems such as Google • Google is designed for Web search and harnesses explicit cross-reference information • Explicit cross references do not exist in all digital contents • On the other hand, there are “richer” inter-relationships that the search engine can make use of [see our HotOS’03 paper] • P2P systems are self-organizing, low cost, easy to deploy, highly scalable…
Our approach, pSearch • A fundamental problem of existing approaches: documents are “randomly” distributed, so a query either has to search a large number of nodes or has to accept a high probability of missing important documents • Controlled placement of document indices in an overlay such that distance reflects dissimilarity in content • Two algorithms: • V-hash (whole vector hashing) requires the overlay to have a Cartesian space abstraction (historically pLSI) • E-hash (hash on individual elements) (historically pVSM)
Benefits of controlled placement for search • With VSM or LSI, documents and queries are vectors in a Cartesian (semantic) space • Similarity is measured as distance in the semantic space [Figure: a query and documents A, B, and C as points in the semantic space]
V-hash: map the semantic space to CAN [Figure: the query and documents A, B, and C placed in CAN zones]
Highlight of results • Achieves an accuracy comparable to centralized information retrieval systems by visiting a small number of nodes • E.g., with proper configuration, in a system with 128,000 nodes and 528,543 documents (from news, magazines, etc.), • pSearch searches only 19 nodes and transmits only 95.5 KB of data during the search, • the top 15 documents returned by v-hash and LSI have a 91.7% intersection
Overview • Background • A basic parallel LSI (v-hash) algorithm to highlight challenges • Solutions to the challenges • Experimental results • Discussions • Conclusions
Background---VSM and LSI • Documents and queries are vectors in a Cartesian space • Similarity between a query and a document is measured as the cosine of the angle between their vector representations • Precision of LSI ranges from comparable to up to 30% better than that of VSM • LSI can bring together documents that are semantically related even if they do not share terms • e.g., a search for car may return relevant documents that use automobile in the text
Background---The vector space model (VSM) • If a term t appears often in a document, then a query containing t should retrieve that document • A term’s scarcity across the collection is a measure of its importance • Documents and queries are both vectors • $D_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,t})$ • $w_{d,t} = tf_{d,t} \times idf_t$, where $tf_{d,t}$ is the frequency of t in document d and $idf_t$ is the inverse document frequency of t • There are many variations… • Similarity: $\frac{d \cdot q}{|d|\,|q|}$
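The weighting and similarity formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the slide notes that many tf-idf variations exist, and the particular choice idf = log(N / df), the whitespace tokenizer, and the names `tfidf_vectors` and `cosine` are assumptions made here, not taken from the talk.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) using w = tf * log(N / df) (one common variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["car engine repair", "car race track", "flower garden soil"]
vocab, vecs = tfidf_vectors(docs)
query = [1.0 if t == "engine" else 0.0 for t in vocab]  # one-term query
scores = [cosine(query, v) for v in vecs]
best = max(range(len(docs)), key=scores.__getitem__)  # index of the best match
```

Only the first document contains the query term, so it is the only one with a non-zero score.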
Background---Latent Semantic Indexing (LSI) • Map the term space to a lower-dimensional concept space • LSI uses Singular Value Decomposition (SVD) • Let $A$ be an $n \times m$ matrix of rank $r$, and let $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$ be the singular values of $A$ • $A = U D V^T$, where $D = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ is an $r \times r$ matrix, $U = (u_1, \ldots, u_r)$ is an $n \times r$ matrix, and $V = (v_1, \ldots, v_r)$ is an $m \times r$ matrix • LSI omits all but the $k$ largest singular values of $A$, i.e., • $A_k = U_k D_k V_k^T$, where $D_k = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k)$, $U_k = (u_1, \ldots, u_k)$, and $V_k = (v_1, \ldots, v_k)$
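A short derivation may help connect the truncated SVD to the semantic vectors used in the rest of the talk. It assumes that the rows of $A$ are terms and the columns are documents, and that the columns of $U_k$ are orthonormal; the hatted symbols are notation introduced here. This is the standard LSI "folding-in" step, shown as a sketch rather than as the exact procedure used in pSearch.

```latex
% Column j of A_k is document j in term space:
%   a_j = U_k D_k v^{(j)},  where v^{(j)} is the j-th row of V_k written as a column.
% Because U_k^T U_k = I_k, the k-dimensional semantic vector of document j is
\[
  v^{(j)} = D_k^{-1} U_k^{T} a_j .
\]
% A document d or query q that was not part of A is projected the same way:
\[
  \hat{d} = D_k^{-1} U_k^{T} d, \qquad \hat{q} = D_k^{-1} U_k^{T} q .
\]
```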
Background --- CAN [Ratnasamy01] • Cartesian space partitioned into zones • A node serves as “owner” of a zone • A key is a “point” in the Cartesian space • An object is stored on the node that owns the zone containing the point (key) [Figure: a Cartesian space divided into zones, one node per zone]
Low maintenance cost & self-organizing… • A node only needs to know the owners of its neighboring zones • Node join: pick a point and split the zone with the node that currently owns the point • Node departure: a neighboring node takes over the “state” of the departing node • Dynamism is shielded from the users and applications! [Figure: a new node joining and taking over a new zone]
Object lookup translates to logical routing • Find the node that owns the zone containing the point • Routing: traverse a series of neighboring zones from source to destination [Figure: a route through neighboring zones, hops labeled 1, 2, 3]
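The three CAN slides (zones, ownership, and routing through neighboring zones) can be illustrated with a toy model. The sketch below makes strong simplifying assumptions that are not in the talk: a 2-D unit square, an evenly partitioned `side` x `side` grid of zones, no wrap-around, and no node joins or departures. The function names `owner` and `route` are hypothetical.

```python
def owner(point, side):
    """Zone (column, row) whose region of the unit square [0, 1)^2 contains the point."""
    return tuple(min(int(c * side), side - 1) for c in point)

def route(src, key, side):
    """Greedy routing: move one neighboring zone per hop until the owner of `key` is reached."""
    dst = owner(key, side)
    path = [src]
    cur = list(src)
    while tuple(cur) != dst:
        # Correct the first axis that still differs; each hop crosses one zone boundary.
        for axis in range(2):
            if cur[axis] != dst[axis]:
                cur[axis] += 1 if dst[axis] > cur[axis] else -1
                break
        path.append(tuple(cur))
    return path

SIDE = 4                    # 4 x 4 = 16 zones, one node per zone
key = (0.80, 0.30)          # a key is a point in the Cartesian space
path = route((0, 0), key, SIDE)
```

Each node in this model only needs to know its grid neighbors, which mirrors the low-maintenance property stated on the previous slide.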
A basic parallel LSI algorithm (naïve v-hash) • 1: query routing • 2: local query + localized flooding • 3: results routing [Figure: the query and documents A, B, and C in CAN zones, with arrows numbered 1-3 for the three steps]
It is more complicated … • Dimensionality of the semantic space is typically very high • 50-350 for IR corpora; expected to increase with the corpus size • Nearest neighbor search in a high-dimensional space is very difficult • Dimensionality of CAN is much lower • When k is on the order of log(n) and zones are partitioned evenly, each node has only log(n) neighbors • CAN can only partition a small number of dimensions • Uneven distribution of semantic vectors in the semantic space • Global information Solutions: hierarchical clustering, rolling-index, and content-directed search
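Problems due to dimension mismatch: an example • Semantic space of 4 dimensions • Vd = (-0.1, 0.55, 0.57, -0.6), Vq = (0.55, -0.1, 0.6, -0.57) • Vd and Vq are similar on elements 2 and 3, counting from 0 (shown in red on the slide) • If CAN only partitions the first two dimensions, Vd and Vq are mapped far apart even though they are similar [Figure: Vd and Vq plotted on the first two dimensions, axes from -1 to 1]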
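The arithmetic behind this example is easy to check. The sketch below uses the two vectors from the slide and plain Euclidean distance (an assumption; the talk elsewhere uses cosine similarity) to compare what a CAN that partitions only elements 0 and 1 would see with what elements 2 and 3 would reveal.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vd = (-0.1, 0.55, 0.57, -0.6)
vq = (0.55, -0.1, 0.6, -0.57)

d_first = dist(vd[:2], vq[:2])  # seen by a CAN that partitions elements 0 and 1
d_last = dist(vd[2:], vq[2:])   # agreement hidden in elements 2 and 3

print(f"elements 0-1: {d_first:.3f}, elements 2-3: {d_last:.3f}")
```

On the first two elements the vectors are about 0.92 apart, while on the last two they are about 0.04 apart, so a placement based only on the first two elements separates a document from a query it matches well.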
Intuitions behind our solutions • The number of dimensions relevant to a particular document is typically much smaller than the dimensionality of the semantic space • Queries submitted to search engines can be very short, averaging fewer than 2.4 terms per query [Lempel & Moran]
Our solutions • Use clustering algorithms to identify the clusters of semantic vectors that correspond to, e.g., chemistry, computer science, etc. [Not yet evaluated] • Rotate the semantic space and map each of the rotated spaces to the same CAN • Use the contents stored on the neighboring nodes and queries received in the recent past to guide search
Hierarchical clustering: high-level idea [Figure: clusters 1, 2, and 3 with their cluster digests, sub-cluster digests 1.1-1.4 and 2.1-2.4, and CAN axes marked 0, 0.25, and 0.5] • Digests are typically made of the most important concepts (terms) in a domain • Challenge: efficiently and effectively deciding which cluster a document or query falls into
Rolling-index • Original vector for a document (or query): e0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11; CAN key (e0, e1) • Vector rotated by 2 elements: e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e0, e1; CAN key (e2, e3) • A later rotation: e9, e10, e11, e0, e1, e2, e3, e4, e5, e6, e7, e8; CAN key (e9, e10)
An example of rolling-index • Semantic space of 4 dimensions • Vd = (-0.1, 0.55, 0.57, -0.6), Vq = (0.55, -0.1, 0.6, -0.57) • Vd and Vq are similar on elements 2 and 3, counting from 0 (shown in red on the slide) • Precision at the cost of replication [Figure: Vd and Vq plotted in the original space and in the rotated space, axes from -1 to 1]
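The rolling-index idea can be sketched with the same two vectors. The code below assumes a CAN that partitions 2 dimensions and a rotation of 2 elements per rotated space; `rotate`, `can_key`, and `key_gap` are hypothetical helper names, and the real system's rotation step and number of spaces are configuration choices not fixed by this slide.

```python
def rotate(vec, shift):
    """Rotate a vector left by `shift` elements."""
    return tuple(vec[shift:] + vec[:shift])

def can_key(vec, can_dims, space):
    """CAN key in rotated space `space`: the first `can_dims` elements after rotation."""
    shift = (space * can_dims) % len(vec)
    return rotate(tuple(vec), shift)[:can_dims]

def key_gap(u, v, can_dims, space):
    """Largest per-axis difference between the two keys in one rotated space."""
    ku, kv = can_key(u, can_dims, space), can_key(v, can_dims, space)
    return max(abs(a - b) for a, b in zip(ku, kv))

vd = (-0.1, 0.55, 0.57, -0.6)
vq = (0.55, -0.1, 0.6, -0.57)

gap0 = key_gap(vd, vq, can_dims=2, space=0)  # original space: keys far apart
gap1 = key_gap(vd, vq, can_dims=2, space=1)  # rotated space: keys close together
```

The document index is stored once per rotated space, which is the replication cost the slide refers to; in return, the query finds the document near its key in the rotated space.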
Properties of SVD • Sorts elements in semantic vectors by decreasing importance. A large number of documents discussing popular concepts are likely to be correctly classified by a relatively small number of low-dimension elements
Query accuracy distribution • A total of 100 queries • 4 rotated spaces, each rotating the previous space by 25 elements • Accuracy: percentage overlap with a centralized baseline
Content-directed search • Curse of dimensionality • High-dimensional data spaces are sparsely populated. Even a very large hypercube in a high-dimensional space is unlikely to contain a point • The distance between a query and its nearest neighbor (NN) grows steadily with the dimensionality of the space • Use the contents stored on nodes and recently processed queries as hints to guide the search to the right places • Use samples from other nodes to determine the content similarity between a query and the content stored on those nodes
Content-directed search • Search for two documents • N: list of nodes to search • Step 1: N = {6, 14, 11, 9} • Step 2: a is identified and N = {7, 14, 11, …} • The closest document may not be on a direct routing neighbor [Figure: 4 x 4 grid of nodes numbered 1-16 showing documents a and b and query q]
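One way to read the two steps above is as a best-first search over the node grid, where the candidate list N is ordered by how similar each node's sampled content is to the query. The sketch below follows that reading. The 4 x 4 row-major numbering matches the figure, but the similarity scores, the `budget` parameter, and the function names are illustrative assumptions, not values or interfaces from the talk.

```python
import heapq

SIDE = 4  # nodes numbered 1..16 in row-major order, as in the figure

def neighbors(node):
    """Grid neighbors (up, down, left, right) of a node."""
    row, col = divmod(node - 1, SIDE)
    out = []
    if row > 0:
        out.append(node - SIDE)
    if row < SIDE - 1:
        out.append(node + SIDE)
    if col > 0:
        out.append(node - 1)
    if col < SIDE - 1:
        out.append(node + 1)
    return out

def directed_search(start, sample_score, budget):
    """Visit up to `budget` nodes, always expanding the most promising candidate."""
    visited, seen, frontier = [start], {start}, []

    def add_candidates(node):
        for m in neighbors(node):
            if m not in seen:
                seen.add(m)
                # heapq is a min-heap, so push the negated similarity.
                heapq.heappush(frontier, (-sample_score.get(m, 0.0), m))

    add_candidates(start)
    while frontier and len(visited) < budget:
        _, node = heapq.heappop(frontier)
        visited.append(node)
        add_candidates(node)
    return visited

# Hypothetical similarities between the query and content sampled from each node.
scores = {6: 0.9, 7: 0.8, 11: 0.3, 14: 0.2, 9: 0.1}
order = directed_search(10, scores, budget=4)
```

With these made-up scores, the search starting at node 10 first visits node 6, then node 7, which is not a direct neighbor of node 10, before falling back to node 11.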
Content-directed replication & caching • Selectively replicate contents stored on surrounding nodes • The threshold is set according to the node’s storage capacity, computing power, and network connectivity [Figure: the same 4 x 4 grid of nodes numbered 1-16 with documents a and b and query q]
Experimental Results • Software packages • SMART [Cornell] + LAS2 from SVDPACK [netlib] + eCAN simulator • Validated correctness using the MEDLINE corpus [Buckley] • Experiments with TREC-7,8; Topics 351-450 as queries • Term-by-document matrix built by sampling 15% of the documents • 79,316 sampled docs and 83,098 indexed terms • All 528,543 docs projected onto 300 dimensions after SVD • Metrics • Number of visited nodes • $\text{Accuracy} = \frac{|A \cap B|}{|A|} \times 100\%$, where A is the set of documents returned by LSI and B is the set of documents returned by v-hash
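The accuracy metric is a set overlap and can be written directly. The function name and the document identifiers below are made up for illustration.

```python
def accuracy(lsi_results, vhash_results):
    """Percentage of the centralized LSI results that v-hash also returned."""
    a, b = set(lsi_results), set(vhash_results)
    if not a:
        raise ValueError("the LSI result set must not be empty")
    return 100.0 * len(a & b) / len(a)

lsi_top = ["d1", "d2", "d3", "d4"]
vhash_top = ["d2", "d4", "d1", "d9"]
acc = accuracy(lsi_top, vhash_top)  # 3 of the 4 LSI documents were found
```

Note that the metric ignores ranking order within the returned sets; it only measures how many of the baseline documents were recovered.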
Scalability with respect to the system size • As the system size increases exponentially, the number of visited nodes increases only moderately • For a 32k-node system, v-hash can achieve an accuracy of 90% by visiting 139 nodes
Effect of the number of returned documents • 10,000 nodes in total • The number of visited nodes grows quickly, but the average number of nodes that need to be searched to return one document decreases drastically
Using actual contents and past queries to direct searches When queries have locality, learning from past history can increase the accuracy while reducing the number of visited nodes
Replication improves search efficiency and accuracy • Visit 24 nodes in a 10,000-node system to achieve accuracy higher than 96.8% • Replicating direct neighbors’ content: the scalability declines from O(n) to O(n/log(n))
An example of a large system of 128K nodes • The Repl-query series uses both the content and past queries to guide the sampling • Combining replication and the query heuristics, pSearch can achieve an accuracy of 91.7% by visiting 19 nodes, or an accuracy of 98% by visiting 45 nodes
Discussions • V-hash requires the overlay to have a Cartesian space abstraction • For an individual document or query, the most significant elements may not be contiguous • Clustering is needed for larger corpora • The element hashing (e-hash) algorithm eliminates the constraint • We expect content-directed search to improve as the corpus size increases • Selective content replication and query result caching have the potential to substantially improve performance while keeping the scalability of the storage high • Study how other IR algorithms such as PageRank can complement our approach • Integrate attribute-based, content-based, and context-based search
E-hash • Query = global_rank (sigma (local_ranking)) • Intelligent storage management based on query patterns [Figure: per-term weight lists (e.g., computer: w1, network: w2, …; sports: w1, network: w2, …) stored on an overlay such as Chord or Pastry]
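The slide gives e-hash only as the formula global_rank(sigma(local_ranking)), so the sketch below is one plausible reading rather than the authors' algorithm: each term is hashed to a node, that node keeps the per-document weights for the term, and a query sums the local rankings of its terms before ranking globally. The overlay size, the SHA-1 based placement, and all function names are assumptions made for illustration; a real deployment would use the lookup of an overlay such as Chord or Pastry.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 8  # hypothetical overlay size

def node_for(term):
    """Hash an individual term (vector element) to a node id."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

def build_index(doc_weights):
    """{doc_id: {term: weight}} -> {node: {term: {doc_id: weight}}}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, weights in doc_weights.items():
        for term, w in weights.items():
            index[node_for(term)][term][doc_id] = w
    return index

def search(index, query_terms, top_k):
    """Sum the local rankings of each query term, then rank globally."""
    totals = defaultdict(float)
    for term in query_terms:
        local = index[node_for(term)].get(term, {})  # local ranking on one node
        for doc_id, w in local.items():
            totals[doc_id] += w                      # sigma over query terms
    # Global rank: highest total first, doc id as a deterministic tie-breaker.
    return sorted(totals, key=lambda d: (-totals[d], d))[:top_k]

docs = {
    "d1": {"computer": 0.8, "network": 0.5},
    "d2": {"sports": 0.9, "network": 0.4},
    "d3": {"computer": 0.2},
}
index = build_index(docs)
top = search(index, ["computer", "network"], top_k=2)
```

Because each term is placed independently, this scheme needs only a key-to-node lookup and not a Cartesian space, which is consistent with the point in the Discussions slide that e-hash removes the constraint v-hash places on the overlay.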
Conclusion • pSearch is the first system that organizes contents around their semantics in a P2P network • This makes it possible to achieve an accuracy comparable to state-of-the-art centralized IR systems while visiting only a small number of nodes • We propose the use of hierarchical clustering, rolling-index, and content-directed search to reduce the dimensionality of the search space and to resolve the dimensionality mismatch between the semantic space and CAN • We employ content-aware node bootstrapping to balance the load