This presentation describes an approach to improving search accuracy in P2P file-sharing systems by automatically tuning file descriptions: peers incorporate descriptors retrieved from other peers to enrich their local descriptions, making files easier to match against queries. Experimental results demonstrate significant gains in search performance.
Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems
Presented by: Dongmei Jia, Illinois Institute of Technology, April 11, 2008
D. Jia, W. G. Yee, L. T. Nguyen, O. Frieder. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. In Proc. of the 7th IEEE Intl. Conf. on Peer-to-Peer Computing (P2P), Ireland, Sept. 2007.
Outline • Objective • Problem • Proposed Approach • Experimental Results • Conclusions
Objective • To improve the accuracy of search in P2P file-sharing systems. • To find poorly described data.
Problem Statement • Characteristics: • Binary files (e.g., music files). • Each replica is described by a descriptor. • Descriptors are sparse. • Descriptors vary across peers. • Queries are conjunctive. • Problem: poor/sparse descriptions make files hard to match with queries!
Approach • Peers independently search the network for other descriptors of local files • Incorporate them into the local replica’s descriptor • Search implemented by “probe” queries
Example • File F has two descriptors: D1 = {Mozart} at Peer1 and D2 = {piano} at Peer2. • Peer3 issues the conjunctive query Q = {Mozart, piano}: it matches neither D1 nor D2, so no result is returned for Q! • Peer1 sends a probe ("tell me your description of F"), receives D2 = {piano}, and merges its terms to form D1' = {Mozart, piano}. • Reissued, Q = {Mozart, piano} now matches Peer1's enriched descriptor D1'.
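A minimal sketch of how this example plays out, assuming set-based descriptors and a simple conjunctive-match helper (illustrative names, not the authors' implementation):

```python
def matches(query, descriptor):
    """A conjunctive query matches only if every query term is in the descriptor."""
    return set(query) <= set(descriptor)

# Replicas of file F and their (sparse) descriptors.
d1 = {"mozart"}            # Peer1
d2 = {"piano"}             # Peer2

q = {"mozart", "piano"}
print(matches(q, d1), matches(q, d2))   # False False: no result for Q

# Peer1 probes for other descriptors of F and merges their terms.
d1_enriched = d1 | d2                   # D1' = {mozart, piano}
print(matches(q, d1_enriched))          # True: Peer1's replica now matches Q
```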
How P2P File-Sharing Systems Work • Peers share a set of files. • Each replica of a file is identified by a descriptor. • Every descriptor contains a unique hash key (MD5) identifying the file. • A query is routed to all reachable peers. • Each query result contains the matching replica's descriptor and the identity of the source server.
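A sketch of the descriptor structure described above, with field names chosen for illustration (the paper does not prescribe this layout):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """Describes one replica of a shared file (illustrative layout)."""
    key: str                                   # content hash, same for every replica
    terms: set = field(default_factory=set)    # sparse, peer-specific description

def file_key(content: bytes) -> str:
    # The system uses an MD5 content hash as the file's unique key.
    return hashlib.md5(content).hexdigest()

replica = Descriptor(key=file_key(b"...file bytes..."),
                     terms={"mozart", "sonata"})
```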
Probe Query Design • Contains one term: the key of a file. • Matches all the replicas of the file reached by the probe query.
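A sketch of probe-query evaluation under these two rules; the Peer/Replica structures are assumptions for the example, not the system's wire format:

```python
from collections import namedtuple

Replica = namedtuple("Replica", ["key", "terms"])
Peer = namedtuple("Peer", ["peer_id", "replicas"])

def probe_query(key, reachable_peers):
    """A probe query carries only the file's hash key and matches every
    reachable replica of that file, returning each replica's descriptor."""
    return [(p.peer_id, r.terms)
            for p in reachable_peers
            for r in p.replicas
            if r.key == key]

peers = [Peer("peer2", [Replica("abc123", {"piano"})]),
         Peer("peer3", [Replica("abc123", {"mozart", "piano"})])]
print(probe_query("abc123", peers))   # descriptors of all reachable replicas of F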
Design Challenges • When to probe • What file to probe • What to do with probe results • How to control cost • Do this in a fully distributed way
When to Probe? • When a peer is under-utilized • Measured by the number of responses returned, Nr • When a peer has a high desire to participate • Measured by the number of files published, Nf • When the system is active • Measured by the number of queries received, Nq
When to Probe? (Cont’d) • Triggering condition: T > Nr / (Nf · Nq) + Np, with Nf, Nq > 0, where T is a user-defined threshold, Np is the number of probe queries already performed, and Nr / (Nf · Nq) is the number of results returned per shared file per incoming query. • All metrics are maintained locally by each peer, so the mechanism is easy to implement.
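A sketch of this triggering test using the locally maintained counters (variable names are assumptions):

```python
def should_probe(n_r, n_f, n_q, n_p, threshold):
    """Trigger a probe while results-per-shared-file-per-query plus the
    number of probes already issued is still below the threshold T."""
    if n_f <= 0 or n_q <= 0:
        return False                    # condition is defined only for Nf, Nq > 0
    return threshold > n_r / (n_f * n_q) + n_p

# An under-utilized peer (few responses, few probes so far) decides to probe.
print(should_probe(n_r=2, n_f=10, n_q=100, n_p=1, threshold=2.0))   # True
```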
What File to Probe? • Goal is to increase the peer's participation level • Criteria to choose from: • File that has been probed the least (RR) • File that appears in the least or most query responses (LPF or MPF) • File with the smallest descriptor
What to do with Probe Results? • Select terms from the result set to add to the local descriptor • Random (rand) • Weighted random (wrand) • Most frequent (mfreq) • Least frequent (lfreq) • Stop when local descriptor size limit is reached
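A sketch of one of these policies, weighted-random (wrand) copying, using the standard library; the function and its signature are illustrative assumptions, not the paper's code:

```python
import random
from collections import Counter

def wrand_copy(local_terms, probe_results, max_size, rng=random):
    """wrand: draw candidate terms from the probe results with probability
    proportional to their frequency, until the descriptor size limit is hit."""
    local = set(local_terms)
    counts = Counter(t for descriptor in probe_results for t in descriptor
                     if t not in local)
    candidates = list(counts.keys())
    weights = [counts[t] for t in candidates]
    while candidates and len(local) < max_size:
        term = rng.choices(candidates, weights=weights, k=1)[0]
        i = candidates.index(term)      # remove the chosen term from the pool
        candidates.pop(i)
        weights.pop(i)
        local.add(term)
    return local

enriched = wrand_copy({"mozart"},
                      [{"mozart", "piano"}, {"piano", "k545"}],
                      max_size=3)
```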
Experimental Setup • Table: parameters used in the simulation • Table: query length distribution
Metrics • MRR (mean reciprocal rank) = average over all queries of 1/rank of the first correct result. • Precision = |A ∩ R| / |R|. • Recall = |A ∩ R| / |A|. A: set of replicas of the desired file. R: result set of the query.
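A small sketch of these metrics for a single query's ranked result list (data values are made up for illustration); MRR is then the mean of the reciprocal ranks over all queries:

```python
def reciprocal_rank(ranked_results, desired_replicas):
    """1/rank of the first result that is a replica of the desired file; 0 if none."""
    for rank, item in enumerate(ranked_results, start=1):
        if item in desired_replicas:
            return 1.0 / rank
    return 0.0

def precision(result_set, desired_replicas):
    return len(result_set & desired_replicas) / len(result_set) if result_set else 0.0

def recall(result_set, desired_replicas):
    return len(result_set & desired_replicas) / len(desired_replicas) if desired_replicas else 0.0

A = {"replica1", "replica2"}                # replicas of the desired file
ranked = ["other", "replica1", "replica2"]  # ranked results of one query
R = set(ranked)
print(reciprocal_rank(ranked, A), precision(R, A), recall(R, A))   # 0.5 0.667 1.0
```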
Data • TREC wt2g Web track. • Arbitrary set of 1,000 Web docs from 37 Web domains. • Preprocessing • Stemming; removal of HTML markup and stop words. • Final data set • 800,000 terms, of which 37,000 are unique.
Experimental Results – Applying Probe Results to Local Descriptors • Figure: MRR with various term copying techniques
Experimental Results - Probe Triggering • No probing (base case). • Random • Assign each peer a probability of probing. • 5K probes are issued over the 10K queries. • T5K • Tune T to perform 5K probes over the 10K queries.
Experimental Results - Probe Triggering (Cont’d) • MRR improvement over the no-probing base case: random triggering +20%; T5K +30%
Experimental Results - Probe Triggering (Cont’d) • Probing dramatically increases the MRR of longer queries, mitigating the query over-specification problem.
Experimental Results - Probe Triggering (Cont’d) • Figure: effect of various probing rates on MRR
Experimental Results - Probe File Selection • Rand – randomly select a file to probe (base case). • LPF – least popular first. • Min query hits; on a tie, min descriptor size. • MPF – most popular first. • Max query hits; on a tie, min descriptor size. • RR-LPF – round-robin-LPF. • Min probes; on a tie, LPF. • RR-MPF – round-robin-MPF. • Min probes; on a tie, MPF.
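A sketch of the RR-MPF selection rule (least-probed first, ties broken by most query hits, then smallest descriptor); the per-file fields are illustrative assumptions:

```python
def rr_mpf_select(files):
    """RR-MPF: pick the file probed least often; break ties by most query hits
    (popularity), then by smallest descriptor.  Each entry is a dict with
    assumed fields: 'probes', 'query_hits', 'descriptor'."""
    return min(files, key=lambda f: (f["probes"], -f["query_hits"], len(f["descriptor"])))

shared = [
    {"name": "a.mp3", "probes": 2, "query_hits": 9, "descriptor": {"mozart"}},
    {"name": "b.mp3", "probes": 1, "query_hits": 5, "descriptor": {"piano", "live"}},
    {"name": "c.mp3", "probes": 1, "query_hits": 7, "descriptor": {"jazz"}},
]
print(rr_mpf_select(shared)["name"])   # c.mp3: fewest probes, then most query hits
```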
Experimental Results - Probe File Selection (Cont’d) • Compared with the Rand base case, only RR-MPF achieves both better performance (~10% higher MRR) and lower cost (~10% lower).
Putting Them Together… • Probe configuration: T5K, RR-MPF, wrand • Probing improves MRR by ~30%
Explanation • Triggering (tuning T): • Probes are issued by under-active peers. • File selection (RR-MPF): • RR avoids probing the same file repeatedly • MPF improves a peer's ability to share popular files • Term copying (wrand): • wrand selects from the bag of words in proportion to term frequency • Allows new queries to be matched, with a bias toward more strongly associated terms
How to Control Cost? • Cost components: • Probe query results • File query results • Cost metric: average number of responses per file query • Randomly sample each type of result on the server side with probability P • What is the impact on performance?
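A sketch of server-side result sampling with probability P (illustrative only; the actual sampling mechanism is not specified here beyond the description above):

```python
import random

def sample_results(results, p, rng=random):
    """Server-side cost control: forward each probe or file-query result
    independently with probability p, dropping the rest."""
    return [r for r in results if rng.random() < p]

# e.g. forward roughly 70% of probe responses and 90% of file-query responses
probe_sample = sample_results(list(range(100)), p=0.7)
query_sample = sample_results(list(range(100)), p=0.9)
```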
Performance/Cost Analysis • Figure: total per-file-query cost for different file and probe query sampling rates
Performance/Cost Analysis (Cont’d) • MRR increases under all sampling settings
Performance/Cost Analysis (Cont’d) • Example: sampling can both reduce cost (-15%) and improve performance (+18%)
Conclusions and Future Work • Probing enriches data descriptions • MRR is improved by ~30% • Sampling is effective in controlling cost • It can reduce cost by 15% and improve performance by 18% at the same time • Future work: better ways of controlling cost
Thank You! • Any Questions?