This paper presents a system design for integrating semantics-based access mechanisms into P2P file systems, addressing the limitations of current DHT-based systems. It introduces semantic indexing and locating utilities, using locality-sensitive hashing and semantic extractors to enable efficient semantic search capabilities. Evaluation results demonstrate load distribution and performance improvements.
Integrating Semantics-Based Access Mechanisms with P2P File Systems Yingwu Zhu, Honghao Wang and Yiming Hu
Outline • Background • System Design • Related Work • Conclusions
Background • Current P2P file systems (e.g., CFS and PAST) • Layer FS functionality on a distributed hash table (DHT), e.g., Chord or Pastry • Do not support semantics-based access • Because DHTs support only exact-match lookups
Motivation • A problem of DHT-based P2P file systems • Support only exact-match lookups given a file object identifier fileID • get(fileID): retrieves the file corresponding to the fileID • put(fileID, file): stores the file with the fileID as a DHT key
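The exact-match limitation can be seen in a minimal sketch. `ToyDHT` and `file_id_for` are hypothetical names, the DHT is a single in-memory dictionary, and deriving the fileID from a SHA-1 of the contents is an assumption made for illustration only.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a DHT: one dictionary keyed by fileID."""

    def __init__(self):
        self._store = {}

    def put(self, file_id, data):
        # Store the file under its fileID, used as the DHT key.
        self._store[file_id] = data

    def get(self, file_id):
        # Exact-match lookup: returns None unless the key matches exactly.
        return self._store.get(file_id)


def file_id_for(data: bytes) -> str:
    # Assumption: the fileID is a content hash.
    return hashlib.sha1(data).hexdigest()


dht = ToyDHT()
doc_a = b"peer-to-peer file systems built on distributed hash tables"
doc_b = b"peer-to-peer file systems built on distributed hash tables."

id_a = file_id_for(doc_a)
dht.put(id_a, doc_a)

found = dht.get(id_a)                    # exact fileID: the file comes back
near_miss = dht.get(file_id_for(doc_b))  # one extra character: nothing found
```

Although `doc_b` differs from `doc_a` by a single character, its fileID is unrelated, so the lookup cannot reach the stored file.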
Motivation • A challenge to P2P file systems • Provide convenient access to vast amounts of information • E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.
Targeted Application • Semantic search expressed in natural language • Query: “locate files that might contain k1, k2 and k3” • *k1, k2 and k3 are three distinct keywords
Targeted Application (Cont’d) • Or, a more useful search: • Query: “locate files similar to f1” • The query results are materialized via semantic directories
System Architecture • Extends a P2P file system to support semantics-based access • Major Components • Semantic Extractor Registry • Semantic Indexing and Locating Utility
Regular Indexing • Indexing • key=hash(keywords or contents) • put(key, file-location); get(key) • A and B will be mapped to different index nodes • A and B have different contents • Traditional hash functions try to be uniform and conflict free • [Figure: peer nodes and index nodes; File A with Key=hash(contents of A) and File B with Key=hash(contents of B); A and B are semantically close (but different) files]
Locality Sensitive Hashing • A family of hash functions F is locality sensitive if, for h drawn at random from F and any two sets A and B, we have $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$, where sim is the similarity function • Min-wise independent permutations are LSH
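The property can be checked empirically with a small sketch. It assumes the similarity function is the Jaccard coefficient (the measure that min-wise permutations are known to preserve) and uses explicit random permutations of a tiny universe; the function names and the sets `A` and `B` are made up for illustration.

```python
import random


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def minhash(items, rank):
    """Element of `items` that comes first under the permutation `rank`."""
    return min(items, key=rank.__getitem__)


def collision_rate(a, b, universe, trials, seed=0):
    """Fraction of random permutations under which a and b hash alike."""
    rng = random.Random(seed)
    order = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(order)  # draw a fresh random permutation
        rank = {x: i for i, x in enumerate(order)}
        hits += minhash(a, rank) == minhash(b, rank)
    return hits / trials


# Two overlapping sets: 60 shared elements out of 100 in the union.
A = set(range(0, 80))
B = set(range(20, 100))

estimate = collision_rate(A, B, range(100), trials=2000)
print(jaccard(A, B), estimate)
```

With enough permutations the measured collision rate should settle close to the Jaccard similarity of the two sets.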
Semantic Indexing • Using locality-sensitive hashing functions • A & B are likely (say with 60% chance) to be indexed to the same index node • Similar contents are likely to generate the same hash result • [Figure: peer nodes and index nodes; File A with Key=hash(contents of A) and File B with Key=hash(contents of B); A and B are semantically close (but different) files]
Improving Semantic Indexing • How to improve the likelihood that A & B are mapped together? • Using n (n>1) sets of semantic-hash functions • n index nodes • The more functions we use, the higher the likelihood • Probability of finding the file = $1 - (1-p)^n$ • n is normally small (e.g., n<20) • [Figure: peer nodes and index nodes; File A with Key1=hash1(contents of A) and Key2=hash2(contents of A); File B with Key1=hash1(contents of B) and Key2=hash2(contents of B); A and B are semantically close (but different) files]
System Architecture • [Figure: major components of the system architecture: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT]
Semantic Extractor Registry • A set of semantic extractors • Leverage IR algorithms such as the vector space model (VSM) and latent semantic indexing (LSI) • Represent a file as a semantic vector (SV), typically 200-300 keywords • Semantically close files have similar SVs
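As a rough illustration of what an extractor produces, the sketch below keeps the k most frequent non-stopword terms of a text as its SV. This is a deliberately crude stand-in, not VSM or LSI; `toy_semantic_vector`, `overlap`, the stopword list, and the sample texts are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the slides).
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"})


def toy_semantic_vector(text, k=5):
    """Return the k most frequent non-stopword terms as a set of keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {term for term, _ in counts.most_common(k)}


def overlap(sv_a, sv_b):
    """Jaccard overlap between two keyword sets."""
    return len(sv_a & sv_b) / len(sv_a | sv_b)


sv_a = toy_semantic_vector("hash hash hash table table peer", k=2)
sv_b = toy_semantic_vector("hash hash table table table node", k=2)
sv_c = toy_semantic_vector("music music video video video", k=2)

print(overlap(sv_a, sv_b), overlap(sv_a, sv_c))
```

Texts on the same topic end up with overlapping keyword sets, while unrelated texts do not, which is the property the indexing step relies on.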
Semantic Indexing • Given a file’s SV • Step 1: Derive a small number of semantic IDs (semIDs) from the SV using LSH • Step 2: Index the file by using these semIDs as the DHT keys • If two files are similar, some of their semIDs are likely to be the same
Semantic Indexing • Using n groups of m hash functions • XOR the hash results within a group • Results: • The indices of semantically close files are hashed to the same peers with probability $1-(1-p^m)^n$ • p is expected to be high for semantically close files, and so is the probability • *p=sim(f1,f2), the similarity between the two files’ SVs
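The grouping scheme can be sketched as follows, assuming an SV is a set of keywords and approximating each min-wise hash function by a salted SHA-1 (a common shortcut, not a true min-wise independent permutation). The names `sem_ids` and `shared` and the sample keyword sets are hypothetical.

```python
import hashlib


def _h(group, j, term):
    """Salted hash standing in for the j-th hash function of a group."""
    digest = hashlib.sha1(f"{group}:{j}:{term}".encode()).digest()
    return int.from_bytes(digest, "big")


def sem_ids(sv_terms, n, m):
    """Derive n semIDs: each is the XOR of m min-hash values of the SV."""
    ids = []
    for g in range(n):
        combined = 0
        for j in range(m):
            # Min-hash of the SV under hash function (g, j).
            combined ^= min(_h(g, j, t) for t in sv_terms)
        ids.append(combined)
    return ids


def shared(ids_a, ids_b):
    """Number of groups in which two files received the same semID."""
    return sum(x == y for x, y in zip(ids_a, ids_b))


a = {f"w{i}" for i in range(100)}     # SV of file A
b = {f"w{i}" for i in range(5, 105)}  # close to A: 95 of 105 keywords shared
c = {f"z{i}" for i in range(100)}     # unrelated to A

ids_a, ids_b, ids_c = (sem_ids(s, n=10, m=2) for s in (a, b, c))
print(shared(ids_a, ids_b), shared(ids_a, ids_c))
```

Two files agree on a group's semID only if all m min-hashes in that group agree, which is where the $p^m$ term comes from; using n groups gives n independent chances to agree.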
Effects of n and m • Semantically close files are hashed to the same peers with probability $1-(1-p^m)^n$ • A big n would • Increase the probability • Increase the load of indexing / querying • A small m might • Increase the probability • Cluster the indices of dissimilar files to the same peers, affecting load balancing
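The trade-off can be explored numerically by evaluating the formula directly. The function name and the sample values of p, m, and n below are illustrative choices, and the formula treats the hash functions as independent.

```python
def colocation_probability(p, m, n):
    """Probability 1 - (1 - p**m)**n that two files with SV similarity p
    share at least one semID when n groups of m hash functions are used."""
    return 1.0 - (1.0 - p ** m) ** n


# Compare a pair of close files (p = 0.9) with a dissimilar pair (p = 0.1).
for m in (1, 2, 5):
    for n in (5, 10, 20):
        close = colocation_probability(0.9, m, n)
        far = colocation_probability(0.1, m, n)
        print(f"m={m} n={n:2d}  close={close:.3f}  dissimilar={far:.3f}")
```

Reading the table row by row shows both effects from the slide: raising n lifts the probability for every pair, while lowering m lifts it for dissimilar pairs as well, which is what clusters unrelated indices on the same peers.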
Semantic Locating • Given a query’s SV • Step 1: Derive a small number of semIDs from the SV using LSH • Step 2: Locate the semantically close files by using these semIDs as the DHT keys • Goal: answer a query by consulting only a small number of peer nodes
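Indexing and locating can be put together in a toy end-to-end sketch. The DHT is a single in-memory dictionary, the min-wise hashes are approximated with salted SHA-1, and `ToySemanticIndex`, `sem_ids`, and the sample keyword sets are hypothetical names introduced only for this illustration.

```python
import hashlib
from collections import defaultdict


def sem_ids(terms, n, m):
    """n semIDs, each the XOR of m salted min-hash values of the SV."""
    ids = []
    for g in range(n):
        x = 0
        for j in range(m):
            x ^= min(
                int.from_bytes(hashlib.sha1(f"{g}:{j}:{t}".encode()).digest(), "big")
                for t in terms
            )
        ids.append(x)
    return ids


class ToySemanticIndex:
    """Semantic index over a dictionary that stands in for the DHT."""

    def __init__(self, n, m):
        self.n, self.m = n, m
        self._dht = defaultdict(set)  # semID -> names of indexed files

    def index(self, name, terms):
        # Step 2 of indexing: store the file under each of its semIDs.
        for sid in sem_ids(terms, self.n, self.m):
            self._dht[sid].add(name)

    def locate(self, query_terms):
        # At most n lookups, one per semID derived from the query's SV.
        hits = set()
        for sid in sem_ids(query_terms, self.n, self.m):
            hits |= self._dht.get(sid, set())
        return hits


idx = ToySemanticIndex(n=10, m=2)
idx.index("a", {f"w{i}" for i in range(100)})
idx.index("c", {f"z{i}" for i in range(100)})

# Query SV close to file "a" (95 of 105 keywords shared).
result = idx.locate({f"w{i}" for i in range(5, 105)})
print(result)
```

The query touches only the n keys derived from its own SV, which mirrors the goal of consulting a small number of peers rather than flooding the network.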
Evaluation • Load distribution of semantic indexing • Semantic indices per peer node • Performance of semantic locating • Percentage of semantically close files that can be located
Semantic Indexing • [Figure: load distribution when the system indexes 10,000 files, n=20, m=5; axes: number of file indexes per node, number of peer nodes]
Semantic Indexing • [Figure: load distribution in a 1000-node system, n=20, m=5; axes: number of file indexes per node, number of indexed files (x1000)]
Performance of Semantic Locating • [Figure: percentage of files located as a function of n and m] • [1] Apply n groups of m hash functions • [2] Percentage of files located (128-byte fingerprint limit as an SV) • [3] m and n determine the performance of semantic locating
Conclusions • A first step toward supporting semantics-based access in P2P file systems • LSH-based semantic indexing and locating approach • Imposes small storage overhead (several MBs) • Efficiency: answers a query by consulting a small number of peers (e.g., 20) • Approximate results, but acceptable • Future work: query consistency and refinement, evaluation using IR workloads, etc.