
Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems

This research focuses on improving search accuracy in P2P file-sharing systems by automatically refining file descriptions. The proposed approach incorporates descriptors retrieved from other peers to enrich local descriptions, making files easier to match against queries. Experimental results demonstrate significant improvements in search performance.

Presentation Transcript


  1. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems Presented by: Dongmei Jia Illinois Institute of Technology April 11, 2008 D. Jia, W. G. Yee, L. T. Nguyen, O. Frieder. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. In Proc. of the 7th IEEE Intl. Conf. on Peer-to-Peer Computing (P2P), Ireland, Sept. 2007

  2. Outline • Objective • Problem • Proposed Approach • Experimental Results • Conclusions

  3. Objective • To improve the accuracy of search in P2P file-sharing systems. • To find poorly described data.

  4. Problem Statement • Characteristics: • Files are binary (e.g., music files). • Each replica is described by a descriptor. • Descriptors are sparse and vary across peers. • Queries are conjunctive. • Problem: poor/sparse descriptions make files hard to match with queries!

  5. Approach • Peers independently search the network for other descriptors of local files • Incorporate them into the local replica’s descriptor • Search implemented by “probe” queries

  6. Example • File F has descriptors at three peers: D1 = {Mozart} at Peer1, D2 = {piano} at Peer2, and D3 = {Mozart, piano} at Peer3. • The conjunctive query Q = {Mozart, piano} matches neither D1 nor D2, so those peers return no result for Q. • Peer1 sends a probe query ("tell me your description of F"), receives the other peers’ descriptors, and enriches its own to D1’ = {Mozart, piano}. • Q now matches D1’, so Peer1’s replica of F is found.
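The enrichment step in this example can be illustrated with a minimal Python sketch, assuming set-based descriptors and strict conjunctive (AND) matching; the `matches` helper is a hypothetical name, not from the paper.

```python
# A minimal sketch of the example above, assuming conjunctive (AND) matching
# of query terms against a replica's descriptor. Names are illustrative.

def matches(query, descriptor):
    """A replica matches only if every query term appears in its descriptor."""
    return query.issubset(descriptor)

Q = {"Mozart", "piano"}

D1 = {"Mozart"}            # Peer1's sparse descriptor of file F
D2 = {"piano"}             # Peer2's descriptor of the same file F
print(matches(Q, D1), matches(Q, D2))   # False False -> no result for Q

# Peer1 probes the network for other descriptors of F and merges the terms.
D1_enriched = D1 | D2      # D1' = {Mozart, piano}
print(matches(Q, D1_enriched))          # True -> F is now found at Peer1
```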

  7. How P2P File-Sharing Systems Work • Peers share a set of files. • Each replica of a file is identified by a descriptor. • Every descriptor contains a unique hash key (MD5) identifying the file. • A query is routed to all reachable peers. • Each query result contains the matching replica’s descriptor and the identity of the source server.

  8. Probe Query Design • Contains one term: the key of a file. • Matches all the replicas of the file reached by the probe query.
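A minimal sketch of the structures implied by slides 7–8, assuming set-based descriptors keyed by an MD5 content hash; the `Descriptor` class and `answer_probe` helper are illustrative names, not from the paper.

```python
# Illustrative data structures for descriptors and probe queries, assuming an
# MD5 content hash identifies all replicas of the same file.
import hashlib
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    key: str                                   # MD5 of the file content, same for all replicas
    terms: set = field(default_factory=set)    # sparse, per-peer description

def md5_key(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

def answer_probe(probe_key: str, local_descriptors: list, server_id: str):
    """A probe query carries only the file key; every local replica of that
    file responds with its descriptor plus the identity of the source server."""
    return [(d, server_id) for d in local_descriptors if d.key == probe_key]

# Usage: Peer2 answers Peer1's probe for file F.
key_F = md5_key(b"...contents of file F...")
peer2_files = [Descriptor(key_F, {"piano"})]
print(answer_probe(key_F, peer2_files, server_id="peer2"))
```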

  9. Design Challenges • When to probe • What file to probe • What to do with probe results • How to control cost • Do this in a fully distributed way

  10. When to Probe? • When a peer is not busy and under-utilized • Measured by number of responses returned Nr • When a peer has a high desire to participate • Measured by number of files published Nf • When the system is active • Measured by number of queries received Nq

  11. When to Probe? (Cont’d) • Triggering mechanism: probe when T > Nr / (Nf × Nq) + Np, with T, Nf, Nq > 0, where T is a user-defined threshold, Np is the number of probe queries already performed, and Nr / (Nf × Nq) is the number of results returned per shared file per incoming query. • All the metrics are maintained locally by each peer, so the mechanism is easy to implement.
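The triggering test can be written directly from the locally maintained counters; the sketch below assumes the reconstructed condition T > Nr / (Nf × Nq) + Np, and the `should_probe` function name is illustrative.

```python
# A minimal sketch of the probe-triggering test from slide 11, using the
# locally maintained counters described there. Variable names mirror the slide.

def should_probe(T: float, Nr: int, Nf: int, Nq: int, Np: int) -> bool:
    """Trigger a probe when T > Nr / (Nf * Nq) + Np, i.e. when the peer has
    returned few results per shared file per incoming query (under-utilized)
    and has not already issued many probes. Requires Nf, Nq > 0."""
    if Nf <= 0 or Nq <= 0:
        return False
    return T > Nr / (Nf * Nq) + Np

# Example: an under-utilized peer that has not probed yet triggers a probe.
print(should_probe(T=5.0, Nr=2, Nf=10, Nq=100, Np=0))   # True
print(should_probe(T=5.0, Nr=2, Nf=10, Nq=100, Np=10))  # False (already probed a lot)
```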

  12. What File to Probe? • Goal is to increase participation level • Criteria to choose from: • File that is least probed (RR) • File that is in the least or most query responses (LPF or MPF) • File with the smallest descriptor

  13. What to do with Probe Results? • Select terms from the result set to add to the local descriptor • Random (rand) • Weighted random (wrand) • Most frequent (mfreq) • Least frequent (lfreq) • Stop when local descriptor size limit is reached
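As one possible reading of the wrand policy on slide 13, the sketch below draws new terms from the probe results at random, weighted by frequency, until a descriptor size limit is reached; the function name and the size-limit parameter are illustrative.

```python
# A sketch of the "wrand" term-copying policy: terms from the probe results
# are drawn at random in proportion to their frequency, and copying stops once
# the local descriptor reaches a size limit.
import random
from collections import Counter

def copy_terms_wrand(local: set, probe_results: list, size_limit: int) -> set:
    """probe_results is a list of descriptors (sets of terms) returned by a
    probe query; more frequent terms are more likely to be copied."""
    counts = Counter(t for desc in probe_results for t in desc if t not in local)
    enriched = set(local)
    while counts and len(enriched) < size_limit:
        terms, weights = zip(*counts.items())
        pick = random.choices(terms, weights=weights, k=1)[0]
        enriched.add(pick)
        del counts[pick]
    return enriched

# Example: enrich D1 = {Mozart} with terms seen in two probe responses.
print(copy_terms_wrand({"Mozart"}, [{"piano", "sonata"}, {"piano"}], size_limit=3))
```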

  14. Experimental Setup Parameters Used in the Simulation: Query Length Distribution:

  15. Metrics • MRR (mean reciprocal rank) = mean over queries of 1 / (rank of the first correct result) • Precision = |A ∩ R| / |R| • Recall = |A ∩ R| / |A| • A: set of replicas of the desired file. R: result set of the query.
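A small Python sketch of these metrics, assuming A is the set of relevant replicas and R is the ranked result list of one query; MRR is then the mean of the per-query reciprocal ranks. Function names are illustrative.

```python
# Per-query metric computations matching the definitions above.

def reciprocal_rank(ranked_results, A):
    """1 / rank of the first result that is a replica of the desired file."""
    for rank, r in enumerate(ranked_results, start=1):
        if r in A:
            return 1.0 / rank
    return 0.0

def precision(R, A):
    return len(set(R) & A) / len(R) if R else 0.0

def recall(R, A):
    return len(set(R) & A) / len(A) if A else 0.0

# MRR is the mean of reciprocal_rank over all test queries.
A = {"replica1", "replica2"}
R = ["other", "replica1", "replica2"]
print(reciprocal_rank(R, A), precision(R, A), recall(R, A))  # 0.5 0.666... 1.0
```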

  16. Data • TREC wt2g Web track. • Arbitrary set of 1,000 Web docs from 37 Web domains. • Preprocessing: stemming and removal of HTML markup and stop words. • Final data set: 800,000 terms, of which 37,000 are unique.

  17. Experimental Results – Applying Probe Results to Local Descriptors MRR with Various Term Copying Techniques

  18. Experimental Results - Probe Triggering • No probing (base case). • Random • Assign each peer a probability of probing. • 5K probes are issued over the 10K queries. • T5K • Tune T to perform 5K probes over the 10K queries.

  19. Experimental Results - Probe Triggering (Cont’d) MRR improves by ~20% with Random triggering and ~30% with T5K, relative to the no-probing base case.

  20. Experimental Results - Probe Triggering (Cont’d) Probing dramatically increases the MRR of longer queries, helping solve the query over-specification problem.

  21. Experimental Results - Probe Triggering (Cont’d) Effect of Various Probing Rates on MRR.

  22. Experimental Results - Probe File Selection • Rand – randomly select a file to probe (base case). • LPF – least popular first. • Min query hits; on a tie, min descriptor size. • MPF – Most popular first. • Max query hits; on a tie, min descriptor size. • RR-LPF – round-robin-LPF. • Min probes; on a tie, LPF. • RR-MPF – round-robin-MPF. • Min probes; on a tie, MPF.
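A minimal sketch of the RR-MPF rule as described on slide 22 (fewest probes first, then most query hits, then smallest descriptor); the `FileStats` record and its per-file counters are hypothetical bookkeeping, not from the paper.

```python
# Illustrative RR-MPF file selection: least-probed file first, ties broken by
# most query hits, then by smallest descriptor.
from dataclasses import dataclass

@dataclass
class FileStats:
    key: str
    probes: int           # how many times this file has been probed
    query_hits: int       # how many query responses included this file
    descriptor_size: int  # number of terms in the local descriptor

def select_rr_mpf(files):
    # min probes first; on a tie, max query hits (MPF); then min descriptor size.
    return min(files, key=lambda f: (f.probes, -f.query_hits, f.descriptor_size))

files = [
    FileStats("a", probes=2, query_hits=9, descriptor_size=4),
    FileStats("b", probes=1, query_hits=3, descriptor_size=7),
    FileStats("c", probes=1, query_hits=8, descriptor_size=5),
]
print(select_rr_mpf(files).key)   # "c": least probed, then most popular
```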

  23. Experimental Results - Probe File Selection (Cont’d) Compared with the Rand base case, only RR-MPF achieves both better performance (~+10%) and lower cost (~-10%).

  24. Putting Them Together… • Probing improves MRR by ~30%. • Probe configuration: T5K triggering, RR-MPF file selection, wrand term copying.

  25. Explanation • Triggering: tuning T ensures probes are issued by underactive peers. • File selection: • RR avoids the same file being probed repeatedly. • MPF improves a peer’s ability to share popular files. • Term copying: • wrand selects from the bag of result terms in proportion to term frequency. • This allows new queries to be matched, with a bias toward more strongly associated terms.

  26. How to Control Cost? • Cost components: • Probe query results • File query results • Cost: avg number of responses per file query • Randomly sample each type of result on server side with a probability P • Impact on performance?
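A minimal sketch of the server-side sampling idea on slide 26, assuming each matching result is kept independently with probability P; the function name is illustrative.

```python
# Server-side result sampling: before replying, keep each matching result
# (file-query or probe-query) only with probability P.
import random

def sample_results(results, P: float):
    """Return a random subset of results, keeping each with probability P."""
    return [r for r in results if random.random() < P]

matches = ["replica1", "replica2", "replica3", "replica4"]
print(sample_results(matches, P=0.5))   # roughly half the responses, on average
```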

  27. Performance/Cost Analysis Total Per-file-query Cost for Different File and Probe Query Sampling Rates

  28. Performance/Cost Analysis (Cont’d) MRR is increased in all sampling settings

  29. Performance/Cost Analysis (Cont’d) Example: sampling can both reduce cost (by ~15%) and improve performance (by ~18%).

  30. Conclusions and Future Work • Probing enriches data descriptions • MRR is improved by ~30% • Sampling is effective in controlling cost • It can reduce cost by 15% and improve performance by 18% at the same time • Future work: better ways of controlling cost

  31. Thank You! • Any Questions?
