This presentation describes an approach to improving search accuracy in P2P file-sharing systems by automatically tuning file descriptions: peers incorporate descriptors retrieved from other peers to enrich their local descriptions, making files easier to match against queries. Experimental results demonstrate significant gains in search performance.
Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems
Presented by: Dongmei Jia, Illinois Institute of Technology, April 11, 2008
D. Jia, W. G. Yee, L. T. Nguyen, O. Frieder. Distributed, Automatic File Description Tuning in Peer-to-Peer File-Sharing Systems. In Proc. of the 7th IEEE Intl. Conf. on Peer-to-Peer Computing (P2P), Ireland, Sept. 2007.
Outline • Objective • Problem • Proposed Approach • Experimental Results • Conclusions
Objective • To improve the accuracy of search in P2P file-sharing systems. • To find poorly described data.
Problem Statement • Characteristics: • Binary files (e.g., music files). • Each replica is described by a descriptor. • Descriptors are sparse. • Descriptors vary across peers. • Queries are conjunctive. • Problem: poor/sparse descriptions make files hard to match with queries!
Approach • Peers independently search the network for other descriptors of local files • Incorporate them into the local replica’s descriptor • Search implemented by “probe” queries
Example • File F has two descriptors: D1 = {Mozart} at Peer1 and D2 = {piano} at Peer2. • Peer3 issues the conjunctive query Q = {Mozart, piano}: it matches neither D1 nor D2, so no result is returned for Q! • Peer1 sends a probe ("tell me your description of F"), receives D2 = {piano}, and merges its terms to form D1' = {Mozart, piano}. • Reissued, Q = {Mozart, piano} now matches Peer1's enriched descriptor D1'.
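A minimal sketch of how this example plays out, assuming set-based descriptors and a simple conjunctive-match helper (illustrative names, not the authors' implementation):

```python
def matches(query, descriptor):
    """A conjunctive query matches only if every query term is in the descriptor."""
    return set(query) <= set(descriptor)

# Replicas of file F and their (sparse) descriptors.
d1 = {"mozart"}            # Peer1
d2 = {"piano"}             # Peer2

q = {"mozart", "piano"}
print(matches(q, d1), matches(q, d2))   # False False: no result for Q

# Peer1 probes for other descriptors of F and merges their terms.
d1_enriched = d1 | d2                   # D1' = {mozart, piano}
print(matches(q, d1_enriched))          # True: Peer1's replica now matches Q
```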
How P2P File-Sharing Systems Work • Peers share a set of files. • Each replica of a file is identified by a descriptor. • Every descriptor contains a unique hash key (MD5) identifying the file. • A query is routed to all reachable peers. • Each query result contains the matching replica's descriptor and the identity of the source server.
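A sketch of the descriptor structure described above, with field names chosen for illustration (the paper does not prescribe this layout):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """Describes one replica of a shared file (illustrative layout)."""
    key: str                                   # content hash, same for every replica
    terms: set = field(default_factory=set)    # sparse, peer-specific description

def file_key(content: bytes) -> str:
    # The system uses an MD5 content hash as the file's unique key.
    return hashlib.md5(content).hexdigest()

replica = Descriptor(key=file_key(b"...file bytes..."),
                     terms={"mozart", "sonata"})
```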
Probe Query Design • Contains one term: the key of a file. • Matches all the replicas of the file reached by the probe query.
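A sketch of probe-query evaluation under these two rules; the Peer/Replica structures are assumptions for the example, not the system's wire format:

```python
from collections import namedtuple

Replica = namedtuple("Replica", ["key", "terms"])
Peer = namedtuple("Peer", ["peer_id", "replicas"])

def probe_query(key, reachable_peers):
    """A probe query carries only the file's hash key and matches every
    reachable replica of that file, returning each replica's descriptor."""
    return [(p.peer_id, r.terms)
            for p in reachable_peers
            for r in p.replicas
            if r.key == key]

peers = [Peer("peer2", [Replica("abc123", {"piano"})]),
         Peer("peer3", [Replica("abc123", {"mozart", "piano"})])]
print(probe_query("abc123", peers))   # descriptors of all reachable replicas of F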
Design Challenges • When to probe • What file to probe • What to do with probe results • How to control cost • Do this in a fully distributed way
When to Probe? • When a peer is under-utilized • Measured by the number of responses returned, Nr • When a peer has a high desire to participate • Measured by the number of files published, Nf • When the system is active • Measured by the number of queries received, Nq
When to Probe? (Cont’d) • Triggering condition: T > Nr / (Nf · Nq) + Np, with Nf, Nq > 0, where T is a user-defined threshold, Np is the number of probe queries already performed, and Nr / (Nf · Nq) is the number of results returned per shared file per incoming query. • All metrics are maintained locally by each peer, so the mechanism is easy to implement.
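A sketch of this triggering test using the locally maintained counters (variable names are assumptions):

```python
def should_probe(n_r, n_f, n_q, n_p, threshold):
    """Trigger a probe while results-per-shared-file-per-query plus the
    number of probes already issued is still below the threshold T."""
    if n_f <= 0 or n_q <= 0:
        return False                    # condition is defined only for Nf, Nq > 0
    return threshold > n_r / (n_f * n_q) + n_p

# An under-utilized peer (few responses, few probes so far) decides to probe.
print(should_probe(n_r=2, n_f=10, n_q=100, n_p=1, threshold=2.0))   # True
```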
What File to Probe? • Goal is to increase the peer's participation level • Criteria to choose from: • File that has been probed the least (RR) • File that appears in the least or most query responses (LPF or MPF) • File with the smallest descriptor
What to do with Probe Results? • Select terms from the result set to add to the local descriptor • Random (rand) • Weighted random (wrand) • Most frequent (mfreq) • Least frequent (lfreq) • Stop when local descriptor size limit is reached
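A sketch of one of these policies, weighted-random (wrand) copying, using the standard library; the function and its signature are illustrative assumptions, not the paper's code:

```python
import random
from collections import Counter

def wrand_copy(local_terms, probe_results, max_size, rng=random):
    """wrand: draw candidate terms from the probe results with probability
    proportional to their frequency, until the descriptor size limit is hit."""
    local = set(local_terms)
    counts = Counter(t for descriptor in probe_results for t in descriptor
                     if t not in local)
    candidates = list(counts.keys())
    weights = [counts[t] for t in candidates]
    while candidates and len(local) < max_size:
        term = rng.choices(candidates, weights=weights, k=1)[0]
        i = candidates.index(term)      # remove the chosen term from the pool
        candidates.pop(i)
        weights.pop(i)
        local.add(term)
    return local

enriched = wrand_copy({"mozart"},
                      [{"mozart", "piano"}, {"piano", "k545"}],
                      max_size=3)
```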
Experimental Setup • Table: parameters used in the simulation • Table: query length distribution
Metrics • MRR (mean reciprocal rank) = average over all queries of 1/rank of the first correct result. • Precision = |A ∩ R| / |R|. • Recall = |A ∩ R| / |A|. A: set of replicas of the desired file. R: result set of the query.
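A small sketch of these metrics for a single query's ranked result list (data values are made up for illustration); MRR is then the mean of the reciprocal ranks over all queries:

```python
def reciprocal_rank(ranked_results, desired_replicas):
    """1/rank of the first result that is a replica of the desired file; 0 if none."""
    for rank, item in enumerate(ranked_results, start=1):
        if item in desired_replicas:
            return 1.0 / rank
    return 0.0

def precision(result_set, desired_replicas):
    return len(result_set & desired_replicas) / len(result_set) if result_set else 0.0

def recall(result_set, desired_replicas):
    return len(result_set & desired_replicas) / len(desired_replicas) if desired_replicas else 0.0

A = {"replica1", "replica2"}                # replicas of the desired file
ranked = ["other", "replica1", "replica2"]  # ranked results of one query
R = set(ranked)
print(reciprocal_rank(ranked, A), precision(R, A), recall(R, A))   # 0.5 0.667 1.0
```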
Data • TREC wt2g Web track. • Arbitrary set of 1,000 Web docs from 37 Web domains. • Preprocessing • Stemming; removal of HTML markup and stop words. • Final data set • 800,000 terms, of which 37,000 are unique.
Experimental Results – Applying Probe Results to Local Descriptors • Figure: MRR with various term copying techniques
Experimental Results - Probe Triggering • No probing (base case). • Random • Assign each peer a probability of probing. • 5K probes are issued over the 10K queries. • T5K • Tune T to perform 5K probes over the 10K queries.
Experimental Results - Probe Triggering (Cont’d) • MRR improvement over the no-probing base case: random triggering +20%; T5K +30%
Experimental Results - Probe Triggering (Cont’d) • Probing dramatically increases the MRR of longer queries, mitigating the query over-specification problem.
Experimental Results - Probe Triggering (Cont’d) • Figure: effect of various probing rates on MRR
Experimental Results - Probe File Selection • Rand – randomly select a file to probe (base case). • LPF – least popular first. • Min query hits; on a tie, min descriptor size. • MPF – most popular first. • Max query hits; on a tie, min descriptor size. • RR-LPF – round-robin-LPF. • Min probes; on a tie, LPF. • RR-MPF – round-robin-MPF. • Min probes; on a tie, MPF.
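A sketch of the RR-MPF selection rule (least-probed first, ties broken by most query hits, then smallest descriptor); the per-file fields are illustrative assumptions:

```python
def rr_mpf_select(files):
    """RR-MPF: pick the file probed least often; break ties by most query hits
    (popularity), then by smallest descriptor.  Each entry is a dict with
    assumed fields: 'probes', 'query_hits', 'descriptor'."""
    return min(files, key=lambda f: (f["probes"], -f["query_hits"], len(f["descriptor"])))

shared = [
    {"name": "a.mp3", "probes": 2, "query_hits": 9, "descriptor": {"mozart"}},
    {"name": "b.mp3", "probes": 1, "query_hits": 5, "descriptor": {"piano", "live"}},
    {"name": "c.mp3", "probes": 1, "query_hits": 7, "descriptor": {"jazz"}},
]
print(rr_mpf_select(shared)["name"])   # c.mp3: fewest probes, then most query hits
```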
Experimental Results - Probe File Selection (Cont’d) • Compared with the Rand base case, only RR-MPF achieves both better performance (~10% higher MRR) and lower cost (~10% lower).
Putting Them Together… • Probe configuration: T5K, RR-MPF, wrand • Probing improves MRR by ~30%
Explanation • Triggering (tuning T): • Probes are issued by under-active peers. • File selection (RR-MPF): • RR avoids probing the same file repeatedly • MPF improves a peer's ability to share popular files • Term copying (wrand): • wrand selects from the bag of words in proportion to term frequency • Allows new queries to be matched, with a bias toward more strongly associated terms
How to Control Cost? • Cost components: • Probe query results • File query results • Cost metric: average number of responses per file query • Randomly sample each type of result on the server side with probability P • What is the impact on performance?
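A sketch of server-side result sampling with probability P (illustrative only; the actual sampling mechanism is not specified here beyond the description above):

```python
import random

def sample_results(results, p, rng=random):
    """Server-side cost control: forward each probe or file-query result
    independently with probability p, dropping the rest."""
    return [r for r in results if rng.random() < p]

# e.g. forward roughly 70% of probe responses and 90% of file-query responses
probe_sample = sample_results(list(range(100)), p=0.7)
query_sample = sample_results(list(range(100)), p=0.9)
```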
Performance/Cost Analysis • Figure: total per-file-query cost for different file and probe query sampling rates
Performance/Cost Analysis (Cont’d) • MRR increases under all sampling settings
Performance/Cost Analysis (Cont’d) • Example: sampling can both reduce cost (-15%) and improve performance (+18%)
Conclusions and Future Work • Probing enriches data descriptions • MRR is improved by ~30% • Sampling is effective in controlling cost • It can reduce cost by 15% and improve performance by 18% at the same time • Future work: better ways of controlling cost
Thank You! • Any Questions?