Document retrieval • Similarity • Vector space model • Multi dimension • Search • Range query • KNN query • Query processing example
Range Query • [Figure: data points a to m plotted on x and y axes (0 to 10), with an index tree whose Root holds entries E1 and E2; E1 holds E3, E4, E5 and E2 holds E6, E7, which in turn hold the points]
Information retrieval in Structured P2P overlay • High dimension -> Low dimension • Dimension reduction! • Support range query and KNN query • Guarantee precision and recall
Peer-to-Peer VSM (pVSM) • VSM : vector space model • Basic ideas • The m most heavily-weighted terms ti, i=1,…,m are identified • The corresponding (h(ti), index) pairs are stored in DHT • Index : pointer to the actual document.
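The publishing step above can be sketched as follows. This is a minimal illustration, not the pVSM implementation: the DHT is simulated with a plain dictionary, `h` is a hypothetical SHA-1 based stand-in for the overlay's hash function, and the names `top_terms`, `publish` and `lookup` are assumptions made for this example.

```python
import hashlib

def top_terms(weights, m):
    """Return the m most heavily weighted terms (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:m]]

def h(term, bits=32):
    """Hypothetical stand-in for the DHT hash function h(t)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def publish(dht, doc_id, weights, m):
    """Store one (h(t_i), index) pair per top term; doc_id plays the role of the index."""
    for term in top_terms(weights, m):
        dht.setdefault(h(term), []).append((term, doc_id))

def lookup(dht, term):
    """Return the document pointers published under a term."""
    return [doc for t, doc in dht.get(h(term), []) if t == term]

dht = {}  # a dict simulating the distributed hash table
publish(dht, "doc1", {"peer": 0.9, "overlay": 0.7, "the": 0.1}, m=2)
publish(dht, "doc2", {"peer": 0.5, "index": 0.8, "a": 0.05}, m=2)
print(lookup(dht, "peer"))
```

Only the m selected terms are published, so a lookup for a lightly weighted term such as "the" finds nothing even though the word occurs in a document.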
Peer-to-Peer LSI (pLSI) • LSI : Latent semantic indexing • Use SVD to transform and truncate a matrix of term vectors computed from VSM to discover the semantics of terms and documents • Basic idea • l: dimensionality of LSI semantic space • k: dimensionality of CAN Cartesian space • Make l=k
pLSI (cont.) • Challenges for pLSI • Sphere distribution of semantic vectors • Solution • Transforming the sphere space
Latent Semantic Indexing • [Figure: vector space model, SVD, TSVD; steps 1 and 2: project new vectors, compute similarity]
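The LSI steps named above can be written out as a short worked formulation. This is the standard textbook form, not necessarily the notation of the original slides: the symbols A (term-by-document matrix from the VSM), U, Σ, V, q (a new query or document vector) and d̂_j (the j-th document in the reduced space) are assumptions introduced here, while l is the dimensionality of the LSI semantic space.

```latex
% Full SVD of the term-by-document matrix and its rank-l truncation (TSVD)
A = U \Sigma V^{\mathsf{T}}, \qquad
A_l = U_l \Sigma_l V_l^{\mathsf{T}}

% Projecting a new vector q into the l-dimensional semantic space
\hat{q} = \Sigma_l^{-1} U_l^{\mathsf{T}} q

% Cosine similarity between the projected query and document j
\operatorname{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```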
iDistance – Indexing the Distance • Space partitioning into n clusters • Reference points pi • Each cluster mapped to an interval • Each object x mapped to a 1-d value: iDist(x) = i*c + dist(pi,x) • Values indexed in a B+-Tree
Query R(q,r) • A query intersects cluster i if dist(pi,q) - r ≤ ri, where ri is the radius of cluster i • For each such cluster, scan the interval [i*c+dist(pi,q)-r, i*c+dist(pi,q)+r]
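A minimal sketch of the iDistance mapping and range query follows, under stated assumptions: a sorted Python list with `bisect` stands in for the B+-tree, distances are Euclidean, each object joins the cluster of its nearest reference point, and ri is taken to be the largest distance from pi to any object in cluster i. The class name `IDistanceIndex` is hypothetical. The scanned interval is additionally clamped to [i*c, i*c + ri] so that it never spills into a neighbouring cluster's key range.

```python
import bisect
import math

class IDistanceIndex:
    def __init__(self, refs, c):
        self.refs = refs                 # reference points p_i
        self.c = c                       # spacing constant; must exceed every cluster radius
        self.keys = []                   # sorted 1-d keys (stand-in for a B+-tree)
        self.objs = []                   # objects, parallel to self.keys
        self.radius = [0.0] * len(refs)  # r_i: farthest object seen in cluster i

    def insert(self, x):
        # Assign x to its nearest reference point, then map it to i*c + dist(p_i, x).
        i = min(range(len(self.refs)), key=lambda j: math.dist(self.refs[j], x))
        d = math.dist(self.refs[i], x)
        key = i * self.c + d
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        self.objs.insert(pos, x)
        self.radius[i] = max(self.radius[i], d)

    def range_query(self, q, r):
        result = []
        for i, p in enumerate(self.refs):
            dq = math.dist(p, q)
            if dq - r > self.radius[i]:
                continue                 # query ball does not intersect cluster i
            lo = i * self.c + max(dq - r, 0.0)
            hi = i * self.c + min(dq + r, self.radius[i])
            start = bisect.bisect_left(self.keys, lo)
            end = bisect.bisect_right(self.keys, hi)
            # Candidates from the scanned interval are verified with the real distance.
            result.extend(x for x in self.objs[start:end] if math.dist(q, x) <= r)
        return result

points = [(1, 1), (2, 0), (9, 9), (10, 8), (5, 5)]
index = IDistanceIndex(refs=[(0, 0), (10, 10)], c=100)
for pt in points:
    index.insert(pt)
print(sorted(index.range_query((1, 0), 1.5)))
```

The interval scan only produces candidates; the final distance check is still needed because many objects can share the same distance to a reference point while lying far from q.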
M-Chord • Basic principles • Choose a set of n pivots p0,…,pn-1 from an a priori given sample dataset • Divide the set of indexed objects I into clusters C0,…,Cn-1: each object belongs to the cluster of its closest pivot • Every object x may be excluded without evaluating d(q,x) if |d(x,pi) - d(q,pi)| > r for some pivot pi
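The pivot-based exclusion can be illustrated with a short sketch. It assumes the usual triangle-inequality filter, where x is skipped when |d(x,pi) - d(q,pi)| > r for some pivot pi, and it assumes that the object-to-pivot distances were stored at insertion time. The function names and the Euclidean metric are choices made for this example only.

```python
import math

def assign_cluster(x, pivots, d=math.dist):
    """Index of the closest pivot (assumed clustering rule)."""
    return min(range(len(pivots)), key=lambda i: d(pivots[i], x))

def can_exclude(x_to_pivots, q_to_pivots, r):
    """True if the triangle inequality already proves d(q, x) > r."""
    return any(abs(dx - dq) > r for dx, dq in zip(x_to_pivots, q_to_pivots))

def filtered_range_search(objects, pivots, q, r, d=math.dist):
    """objects is a list of (x, precomputed distances from x to every pivot)."""
    q_to_pivots = [d(p, q) for p in pivots]
    result, evaluated = [], 0
    for x, x_to_pivots in objects:
        if can_exclude(x_to_pivots, q_to_pivots, r):
            continue              # d(q, x) is never computed for this object
        evaluated += 1
        if d(q, x) <= r:
            result.append(x)
    return result, evaluated

pivots = [(0, 0), (10, 0)]
data = [(1, 0), (5, 0), (9, 0)]
objects = [(x, [math.dist(p, x) for p in pivots]) for x in data]
print(filtered_range_search(objects, pivots, q=(1, 1), r=1.5))
```

In this small run only one of the three objects needs a real distance evaluation, which is the saving the filter is meant to provide when d is expensive.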
M-Chord • Pivot selection • Influences the performance of the search algorithm • Publish • Use iDistance to map the dataset into a one-dimensional domain and join this domain with the Chord protocol • Use an order-preserving function h to map the domain onto the [0, 2^m) interval
M-Chord • Data structure • Chord routing information • B+-tree storage for the (K_{i-1}, K_i] (mod 2^m) interval
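The key mapping and the per-node key interval can be sketched as below. The slides do not say which order-preserving function h is used, so a plain linear scaling of the iDistance domain [0, n*c) onto [0, 2^m) is assumed here purely for illustration; `make_order_preserving_hash` and `responsible_node` are hypothetical names. Node responsibility follows the Chord successor rule: the node with key K_i stores keys in (K_{i-1}, K_i], wrapping around modulo 2^m.

```python
import bisect

def make_order_preserving_hash(n_clusters, c, m):
    """Assumed linear mapping from iDistance keys in [0, n*c) to [0, 2^m)."""
    domain = n_clusters * c
    size = 1 << m

    def h(key):
        if not 0 <= key < domain:
            raise ValueError("key outside the iDistance domain")
        return min(int(key / domain * size), size - 1)

    return h

def responsible_node(node_ids, k):
    """Successor of k on the ring: first node id >= k, wrapping to the smallest."""
    pos = bisect.bisect_left(node_ids, k)
    return node_ids[pos % len(node_ids)]

h = make_order_preserving_hash(n_clusters=4, c=10.0, m=8)
nodes = [50, 120, 200]            # sorted Chord node keys K_i
for idist_key in (3.5, 12.0, 33.0):
    print(idist_key, h(idist_key), responsible_node(nodes, h(idist_key)))
```

Because h preserves order, a contiguous iDistance interval maps to a contiguous arc of the ring, so an interval scan touches only neighbouring nodes.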
M-Chord • Range search • For each cluster Ci, determine the interval Ii of keys to be scanned • Send an INTERVALSEARCH(Ii, q, r) request to the node N_Ii responsible for the midpoint of the interval Ii • Wait for all responses and create the final answer set
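The first two steps of the range search can be sketched as a planning function that decides, for each cluster, which key interval to scan and which node receives the request. It assumes the iDistance interval [i*c + d(pi,q) - r, i*c + d(pi,q) + r] clamped to the cluster's extent, a linear order-preserving h, Euclidean distance, and Chord successor routing; `plan_range_search` and its parameters are hypothetical, and no messages are actually sent.

```python
import bisect
import math

def plan_range_search(pivots, radii, q, r, c, m, node_ids):
    """Return {cluster i: (interval I_i, id of the node receiving INTERVALSEARCH)}."""
    domain = len(pivots) * c
    size = 1 << m

    def h(key):
        # Assumed linear order-preserving mapping onto [0, 2^m).
        return min(int(key / domain * size), size - 1)

    plan = {}
    for i, p in enumerate(pivots):
        dq = math.dist(p, q)
        if dq - r > radii[i]:
            continue                      # query ball misses cluster i entirely
        lo = i * c + max(dq - r, 0.0)
        hi = i * c + min(dq + r, radii[i])
        midpoint_key = h((lo + hi) / 2)
        pos = bisect.bisect_left(node_ids, midpoint_key)
        plan[i] = ((lo, hi), node_ids[pos % len(node_ids)])
    return plan

plan = plan_range_search(
    pivots=[(0, 0), (10, 0)], radii=[4.0, 4.0],
    q=(1, 0), r=2, c=10, m=8, node_ids=[60, 130, 250],
)
print(plan)
```

A full implementation would then dispatch one request per planned entry and merge the replies, which corresponds to the last bullet of the slide.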
M-Chord • INTERVALSEARCH(Ii, q, r)
M-Chord • KNN search • The iDistance approach to KNN query processing (a sequence of range queries with growing radius) is not suitable for a distributed environment • Multiple range iterations would result in a large number of successive message transmissions, increasing the overall response time • Solution • Employ a low-cost heuristic to find k objects that are near q • Run the Range(q, ρk) query, where ρk is the distance from q to the k-th of those objects, and return the k nearest objects from the query result
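The two-phase KNN idea can be sketched as follows. The heuristic itself is not specified in the slides, so the candidates are simply passed in; the radius is assumed to be the distance from q to the k-th closest candidate, and `range_query` is any function returning all indexed objects within that radius. The name `knn_via_single_range` is hypothetical.

```python
import heapq
import math

def knn_via_single_range(q, k, candidates, range_query, d=math.dist):
    """One range query whose radius comes from k heuristic candidates."""
    if len(candidates) < k:
        raise ValueError("the heuristic must supply at least k objects")
    # Radius: distance to the k-th closest candidate. The ball of this radius
    # holds at least k indexed objects, so it also holds the true k nearest.
    rho_k = sorted(d(q, x) for x in candidates)[k - 1]
    answer = range_query(q, rho_k)
    return heapq.nsmallest(k, answer, key=lambda x: d(q, x))

data = [(0, 0), (1, 0), (2, 0), (5, 0), (9, 0)]

def brute_force_range(q, r):
    # Stand-in for the distributed Range(q, r) query.
    return [x for x in data if math.dist(q, x) <= r]

print(knn_via_single_range((1.2, 0), 2, [(5, 0), (0, 0)], brute_force_range))
```

Even a poor heuristic yields a correct answer, since the candidates are themselves indexed objects; a worse heuristic only enlarges the radius and therefore the cost of the single range query.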