
Integrating Semantics-Based Access Mechanisms with P2P File Systems

This paper presents a system design for integrating semantics-based access mechanisms into P2P file systems, addressing the limitations of current DHT-based systems. It introduces semantic indexing and locating utilities, using locality-sensitive hashing and semantic extractors to enable efficient semantic search capabilities. Evaluation results demonstrate load distribution and performance improvements.

Presentation Transcript


  1. Integrating Semantics-Based Access Mechanisms with P2P File Systems • Yingwu Zhu, Honghao Wang, and Yiming Hu

  2. Outline • Background • System Design • Related Work • Conclusions

  3. Background • Current P2P file systems (e.g., CFS and PAST) • Layer FS functionality on a distributed hash table (DHT), e.g., Chord or Pastry • Do not support semantics-based access • Because DHTs support only exact-match lookups

  4. Motivation • A problem of DHT-based P2P file systems • Support only exact-match lookups given a file object identifier fileID • get(fileID): retrieves the file corresponding to the fileID • put(fileID, file): stores the file with the fileID as a DHT key
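The exact-match limitation can be illustrated with a minimal sketch. It assumes a single in-memory dictionary standing in for the DHT and SHA-1 of the file contents as the fileID; `ToyDHT` and `file_id_of` are hypothetical names for illustration, not part of CFS, PAST, Chord, or Pastry.

```python
import hashlib


class ToyDHT:
    """Hypothetical stand-in for a DHT: one dict instead of many peers."""

    def __init__(self):
        self._store = {}

    def put(self, file_id, data):
        # Store the file under its fileID, used as the DHT key.
        self._store[file_id] = data

    def get(self, file_id):
        # Exact-match lookup: returns None unless the key matches exactly.
        return self._store.get(file_id)


def file_id_of(data: bytes) -> str:
    # Assumption: the fileID is a SHA-1 digest of the contents.
    return hashlib.sha1(data).hexdigest()


dht = ToyDHT()
doc = b"peer-to-peer file systems built on a DHT"
dht.put(file_id_of(doc), doc)

found = dht.get(file_id_of(doc))             # same bytes, same key
near_miss = dht.get(file_id_of(doc + b"!"))  # one extra byte, different key
print(found is not None, near_miss is None)
```

A file that differs by a single byte hashes to an unrelated key, so a lookup for "something like this file" has nothing to match against.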

  5. Motivation • A challenge to P2P file systems • Provide convenient access to vast amounts of information • E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.

  6. Targeted Application • Semantic search expressed in natural language • Query: “locate files that might contain k1, k2 and k3” • *k1, k2 and k3 are three distinct keywords

  7. Targeted Application (Cont’d) • Or, a more useful search: • Query: “locate files similar to f1” • The query results are materialized via semantic directories

  8. System Architecture • Extends a P2P file system to support semantics-based access • Major Components • Semantic Extractor Registry • Semantic Indexing and Locating Utility

  9. Regular Indexing • Indexing • key=hash(keywords or contents) • put(key, file-location); get(key) • A and B are semantically close files, but their contents differ, so they will be mapped to different index nodes • Traditional hash functions try to be uniform and collision free • [Figure: peer nodes holding File A and File B; Key=hash(contents of A) and Key=hash(contents of B) map to different index nodes]

  10. Locality Sensitive Hashing • A family of hash functions F is locality sensitive if, for a function h drawn at random from F and any two sets A and B, we have $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$, where sim is the similarity function • Min-wise independent permutations are an LSH family
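A minimal sketch of min-wise hashing may make the property concrete. It assumes salted SHA-1 digests as an approximation of min-wise independent permutations and Jaccard resemblance as sim(A, B); the helper names (`minhash`, `estimate_sim`) and the choice of 256 hash functions are illustrative assumptions, not taken from the slides.

```python
import hashlib


def _h(seed: int, token: str) -> int:
    # Assumption: a salted SHA-1 digest approximates one random permutation.
    digest = hashlib.sha1(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(tokens, seed: int) -> int:
    # h(A): the smallest permuted value among the elements of the set.
    return min(_h(seed, t) for t in tokens)


def jaccard(a, b) -> float:
    # Exact similarity, used here as sim(A, B).
    return len(a & b) / len(a | b)


def estimate_sim(a, b, num_hashes: int = 256) -> float:
    # Fraction of hash functions for which h(A) == h(B).
    hits = sum(minhash(a, s) == minhash(b, s) for s in range(num_hashes))
    return hits / num_hashes


A = {f"w{i}" for i in range(0, 30)}
B = {f"w{i}" for i in range(10, 40)}   # shares 20 of 40 distinct elements
print(jaccard(A, B), estimate_sim(A, B))
```

The fraction of matching min-hash values approaches the exact similarity as more hash functions are used, which is the sense in which a single function collides with probability sim(A, B).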

  11. Semantic Indexing • Using locality-sensitive hashing functions • A & B are likely (say, with a 60% chance) to be indexed to the same index node • Similar contents are likely to generate the same hash result • [Figure: peer nodes holding File A and File B, semantically close but different files; Key=hash(contents of A) and Key=hash(contents of B) map to the same index node]

  12. Improving Semantic Indexing • How to improve the likelihood that A & B are mapped together? • Use n (n>1) sets of semantic-hash functions, giving n index nodes • The more functions we use, the higher the likelihood • Probability of finding the file = $1 - (1 - p)^n$, where p is the probability that a single hash maps A and B together • n is normally small (e.g., n<20) • [Figure: Files A and B, semantically close but different, on peer nodes; Key1=hash1(contents of A), Key1=hash1(contents of B), Key2=hash2(contents of A), Key2=hash2(contents of B) map to index nodes]
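As a worked example of this probability, assume the illustrative 60% per-hash chance mentioned on the previous slide (p = 0.6) and that the n hash functions behave independently:

```latex
\begin{align*}
n = 1:\quad  & 1 - (1 - 0.6)^{1}  = 0.6 \\
n = 2:\quad  & 1 - (1 - 0.6)^{2}  = 1 - 0.16 = 0.84 \\
n = 5:\quad  & 1 - (1 - 0.6)^{5}  = 1 - 0.01024 = 0.98976 \\
n = 10:\quad & 1 - (1 - 0.6)^{10} \approx 0.9999
\end{align*}
```

Under these assumptions a handful of hash functions already makes a miss unlikely, which is consistent with n staying small.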

  13. System Architecture • [Figure: major components of the system architecture: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT]

  14. Semantic Extractor Registry • A set of semantic extractors • Leverage IR algorithms such as the vector space model (VSM) and latent semantic indexing (LSI) • Represent a file as a semantic vector (SV), typically 200-300 keywords • Semantically close files have similar SVs
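A rough sketch of what a text extractor could produce follows. It is an assumption-heavy simplification: plain term frequency stands in for VSM/LSI weighting, the stopword list is a toy one, and Jaccard resemblance is assumed as the similarity between SVs. `semantic_vector` and `sim` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a real IR resource.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "on"})


def semantic_vector(text: str, k: int = 300) -> frozenset:
    # Keep the k most frequent non-stopword terms as the file's SV.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return frozenset(term for term, _ in counts.most_common(k))


def sim(sv_a: frozenset, sv_b: frozenset) -> float:
    # Assumed similarity between two SVs: Jaccard resemblance.
    union = sv_a | sv_b
    return len(sv_a & sv_b) / len(union) if union else 1.0


sv_a = semantic_vector("Locality sensitive hashing on a DHT")
sv_b = semantic_vector("Locality sensitive hashing in the DHT layer")
print(sorted(sv_a), sim(sv_a, sv_b))
```

Treating the SV as a keyword set keeps it directly usable as input to set-based hashing.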

  15. Semantic Indexing • Given a file’s SV • Step 1: Derive a small number of semantic IDs (semIDs) from the SV using LSH • Step 2: Index the file using these semIDs as the DHT keys • If two files are similar, some of their semIDs are likely to be the same

  16. Semantic Indexing • Using n groups of m hash functions • XOR the hash results within a group • Result: the indices of semantically close files are hashed to the same peers with probability $1 - (1 - p^m)^n$ • p is expected to be high for semantically close files, and so is the probability • *p = sim(f1, f2), the similarity between the two files’ SVs
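The grouping scheme can be sketched as follows, assuming an SV is a set of keywords and that each hash function is a min-hash built from a salted SHA-1 digest. The function name `derive_semids`, the 64-bit semID width, and the seed layout are illustrative assumptions rather than details from the slides.

```python
import hashlib


def _h(seed: int, token: str) -> int:
    # Assumption: salted SHA-1 approximates one min-wise permutation.
    digest = hashlib.sha1(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def derive_semids(sv, n: int = 20, m: int = 5):
    """Return n semIDs; each is the XOR of m min-hash values of the SV."""
    semids = []
    for group in range(n):
        sem_id = 0
        for j in range(m):
            seed = group * m + j          # a distinct hash function per slot
            sem_id ^= min(_h(seed, t) for t in sv)
        semids.append(sem_id)
    return semids


f1 = {f"kw{i}" for i in range(100)}
f2 = (f1 - {"kw0", "kw1"}) | {"new0", "new1"}   # 98 shared keywords of 102
shared = sum(a == b for a, b in zip(derive_semids(f1), derive_semids(f2)))
print(f"{shared} of 20 semIDs match")
```

A group's semID matches only when all m min-hash values in that group match (ignoring chance XOR collisions), which is where the p^m term comes from; taking n groups gives the outer exponent.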

  17. Effects of n and m • Semantically close files are hashed to the same peers with probability $1 - (1 - p^m)^n$ • A big n would • Increase the probability • Increase the load of indexing / querying • A small m might • Increase the probability • Cluster the indices of dissimilar files on the same peers, affecting load balancing
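The trade-off can be explored numerically with a short sketch that simply evaluates the formula; the similarity values and the (m, n) pairs below are arbitrary illustrative choices, and `match_probability` is a hypothetical helper name.

```python
def match_probability(p: float, m: int, n: int) -> float:
    # Probability that at least one of n groups of m hashes agrees.
    return 1 - (1 - p ** m) ** n


# Compare close files (high p) with dissimilar files (low p).
for m, n in [(1, 20), (5, 20), (5, 5), (10, 20)]:
    close = match_probability(0.9, m, n)
    far = match_probability(0.3, m, n)
    print(f"m={m:2d} n={n:2d}  p=0.9 -> {close:.4f}   p=0.3 -> {far:.4f}")
```

Raising n lifts the probability for every p, while raising m suppresses matches between dissimilar files much faster than between close ones, which is why a small m tends to cluster unrelated indices on the same peers.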

  18. Semantic Locating • Given a query’s SV • Step 1: Derive a small number of semIDs from the SV using LSH • Step 2: Locate the semantically close files using these semIDs as the DHT keys • Goal: answer a query by consulting only a small number of peer nodes
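Indexing and locating can be put together in a small end-to-end sketch. It assumes SVs are keyword sets, min-hash over salted SHA-1 as the LSH family, and a plain dictionary in place of the DHT; `semids`, `index_file`, `locate`, and the `peerN:/...` locations are all hypothetical names for illustration.

```python
import hashlib
from collections import defaultdict


def semids(sv, n: int = 20, m: int = 5):
    """n semIDs, each the XOR of m min-hash values over the SV's keywords."""
    def h(seed, token):
        d = hashlib.sha1(f"{seed}:{token}".encode()).digest()
        return int.from_bytes(d[:8], "big")

    ids = []
    for g in range(n):
        x = 0
        for j in range(m):
            x ^= min(h(g * m + j, t) for t in sv)
        ids.append(x)
    return ids


# Stand-in for the DHT: semID -> set of file locations.
index = defaultdict(set)


def index_file(location, sv):
    # Indexing: put(semID, file-location) for each of the file's semIDs.
    for key in semids(sv):
        index[key].add(location)


def locate(query_sv):
    # Locating: get(semID) for each of the query's semIDs, merged.
    hits = set()
    for key in semids(query_sv):
        hits |= index.get(key, set())
    return hits


base = {f"kw{i}" for i in range(100)}
index_file("peer1:/a.txt", base)
index_file("peer2:/b.txt", {f"other{i}" for i in range(100)})

query = (base - {"kw0", "kw1"}) | {"new0", "new1"}   # close to a.txt
results = locate(query)
print(results)
```

Each query touches at most n keys, so in a real DHT it would consult at most n index nodes, in line with the goal of contacting only a small number of peers.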

  19. Evaluation • Load distribution of semantic indexing • Semantic indices per peer node • Performance of semantic locating • Percentage of semantically close files that can be located

  20. Semantic Indexing • [Figure: load distribution when the system indexes 10,000 files, n=20, m=5; axes: number of file indexes per node, number of peer nodes]

  21. Semantic Indexing • [Figure: load distribution in a 1000-node system, n=20, m=5; axes: number of file indexes per node, number of indexed files (x1000)]

  22. Perf. of Semantic Locating • [1] Apply n groups of m hash functions • [2] Percentage of files located (128-byte fingerprint limit as a SV) • [3] m and n determine the performance of semantic locating • [Figure axes: n, m, percentage]

  23. Conclusions • The first step to support semantics-based access in P2P file systems • LSH-based semantic indexing and locating approach • Imposes a small storage overhead (several MBs) • Efficiency: answers a query by consulting a small number of peers (e.g., 20) • Approximate results, but acceptable • Future work: query consistency and refinement, evaluation using IR workloads, etc.
