Integrating Semantics-Based Access Mechanisms with P2P File Systems

Integrating Semantics-Based Access Mechanisms with P2P File Systems Yingwu Zhu, Honghao Wang and Yiming Hu

Outline • Background • System Design • Related Work • Conclusions

Background • Current P2P file systems (e.g., CFS and PAST) • Layer FS functionality on a distributed hash table (DHT), e.g., Chord or Pastry • Do not support semantics-based access • Because DHTs support only exact-match lookups

Motivation • A problem of DHT-based P2P file systems • Support only exact-match lookups given a file object identifier fileID • get(fileID): retrieves the file corresponding to the fileID • put(fileID, file): stores the file with the fileID as a DHT key
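
The exact-match limitation can be illustrated with a minimal sketch. The `ToyDHT` class and the `make_file_id` helper are hypothetical stand-ins (an in-memory dictionary, with SHA-1 of the contents assumed as the fileID); they are not the CFS or PAST interfaces.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a DHT; supports exact-match put/get only."""

    def __init__(self):
        self._store = {}

    def put(self, file_id, file):
        # Store the file under its fileID, which acts as the DHT key.
        self._store[file_id] = file

    def get(self, file_id):
        # Return the file for this exact fileID, or None if absent.
        return self._store.get(file_id)


def make_file_id(content: bytes) -> str:
    # Assumption for illustration: the fileID is the SHA-1 of the contents.
    return hashlib.sha1(content).hexdigest()


dht = ToyDHT()
content = b"peer-to-peer file systems"
fid = make_file_id(content)
dht.put(fid, content)

found = dht.get(fid)                                              # exact fileID
missing = dht.get(make_file_id(b"peer-to-peer file system"))      # near miss
```

A lookup with the exact fileID returns the file, while a fileID computed from almost identical contents finds nothing, which is why similarity search needs extra machinery on top of the DHT.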

Motivation • A challenge to P2P file systems • Provide convenient access to vast amounts of information • E.g., provide semantics-based search capabilities to efficiently locate semantically close files for browsing, purging, etc.

Targeted Application • Semantic search expressed in natural language • Query: “locate files that might contain k1, k2 and k3” • *k1, k2 and k3 are three distinct keywords

Targeted Application (Cont’d) • Or, a more useful search: • Query: “locate files similar to f1” • The query’s results are materialized via semantic directories

System Architecture • Extends a P2P file system to support semantics-based access • Major Components • Semantic Extractor Registry • Semantic Indexing and Locating Utility

Regular Indexing • Indexing • key=hash(keywords or contents) • put(key, file-location); get(key) • A and B have different contents, so they will be mapped to different index nodes • Traditional hash functions try to be uniform and conflict free • [Figure: peer nodes and index nodes; A and B are semantically close (but different) files, with Key=hash(contents of A) and Key=hash(contents of B)]
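
A short sketch of why regular indexing separates similar files. SHA-1 is assumed here as the "traditional" hash, and the two strings are made-up file contents.

```python
import hashlib


def regular_key(content: str) -> str:
    # A conventional, uniform hash of the full contents.
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


file_a = "semantic search in peer to peer file systems"
file_b = "semantic search in peer to peer file system"   # one character shorter

key_a = regular_key(file_a)
key_b = regular_key(file_b)
same_key = key_a == key_b   # False: the keys are unrelated
```

Because the two keys are unrelated, a DHT would place the index entries for A and B on index nodes that have nothing to do with each other.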

Locality Sensitive Hashing • A family of hash functions F is locality sensitive if, for h drawn from F and any two sets A and B, we have: $\Pr_{h \in F}[h(A) = h(B)] = \mathrm{sim}(A, B)$, where sim is the similarity function • Min-wise independent permutations are LSH
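
The LSH property can be checked empirically with a small sketch. Two things are assumptions here: sim is taken to be the Jaccard similarity |A∩B| / |A∪B| (the similarity that min-wise permutations are known to preserve), and a salted SHA-1 ranking stands in for a random min-wise independent permutation. The function names are illustrative.

```python
import hashlib


def min_hash(items, seed):
    """Return the element of `items` ranked first by the seed-th salted hash."""
    def rank(item):
        return hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return min(items, key=rank)


def jaccard(a, b):
    # Set similarity: size of the intersection over size of the union.
    return len(a & b) / len(a | b)


def collision_rate(a, b, num_hashes=1000):
    # Fraction of hash functions h for which h(A) == h(B).
    hits = sum(min_hash(a, s) == min_hash(b, s) for s in range(num_hashes))
    return hits / num_hashes


A = set(range(0, 80))
B = set(range(20, 100))
sim = jaccard(A, B)            # 60 shared elements out of 100
estimate = collision_rate(A, B)
print(sim, estimate)
```

The observed collision rate should sit close to the Jaccard similarity, since h(A) = h(B) exactly when the first-ranked element of A∪B falls in A∩B.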

Semantic Indexing • Using locality-sensitive hashing functions • A & B are likely (say with 60% chance) to be indexed to the same index node • Similar contents are likely to generate the same hash result • [Figure: peer nodes and index nodes; A and B are semantically close (but different) files, with Key=hash(contents of A) and Key=hash(contents of B)]

Improving Semantic Indexing • How to improve the likelihood that A & B are mapped together? • Using n (n>1) sets of semantic-hash functions • n index nodes • The more functions we use, the higher the likelihood • Probability of finding the file = $1-(1-p)^n$ • n normally is small (e.g., n<20) • [Figure: peer nodes and index nodes; A and B are semantically close (but different) files, with Key1=hash1(contents of A), Key1=hash1(contents of B), Key2=hash2(contents of A), Key2=hash2(contents of B)]
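
As a worked example of the formula on this slide, the sketch below evaluates 1-(1-p)^n for a few values of n. The value p=0.6 is borrowed from the "say with 60% chance" illustration; the function name is hypothetical.

```python
def hit_probability(p, n):
    """Probability that at least one of n independent hashes maps A and B together."""
    return 1 - (1 - p) ** n


for n in (1, 2, 5, 10, 20):
    print(n, round(hit_probability(0.6, n), 4))
```

With p=0.6, two hash functions already raise the probability to 0.84 and five raise it to about 0.99, which is consistent with the remark that n can stay small.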

System Architecture • [Figure: major components of the system architecture. Labels: Application/User, FS, Extractor Registry, Semantic Indexing and Locating Utility, DHT]

Semantic Extractor Registry • A set of semantic extractors • Leverage IR algorithms, VSM and LSI • Represent a file as a semantic vector (SV), typically 200-300 keywords • Semantically close files have similar SVs
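
A toy extractor can make the idea of a semantic vector concrete. This is only a term-frequency sketch, not VSM or LSI proper; the stopword list, the `semantic_vector` name and the `max_terms` cap of 300 (taken from the 200-300 keyword range above) are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real extractor would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def semantic_vector(text, max_terms=300):
    """Map text to its most frequent non-stopword terms and their counts."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return dict(counts.most_common(max_terms))


sv = semantic_vector("The DHT stores the file; the DHT routes a lookup to the file.")
print(sv)
```

Two files about the same topic would share most of their high-frequency terms, so the keyword sets of their SVs overlap heavily.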

Semantic Indexing • Given a file’s SV • Step 1: Derive a small number of semantic IDs (semIDs) from the SV using LSH • Step 2: Index the file by using these semIDs as the DHT keys • If two files are similar, some of their semIDs are likely to be the same
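
The two indexing steps can be sketched as follows. Everything concrete here is an assumption: SVs are treated as keyword sets, one salted SHA-1 min-hash per seed stands in for the LSH family, a dictionary stands in for the DHT, and the file locations are invented.

```python
import hashlib
from collections import defaultdict


def _h(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def derive_sem_ids(sv, n=20):
    """Step 1: derive n semIDs, one per min-hash function."""
    sem_ids = []
    for seed in range(n):
        # The keyword ranked first by the seed-th salted hash.
        winner = min(sv, key=lambda term: _h(f"{seed}:{term}"))
        sem_ids.append(_h(f"semid:{seed}:{winner}"))
    return sem_ids


index = defaultdict(set)   # stand-in for the DHT: semID -> file locations


def index_file(location, sv, n=20):
    """Step 2: publish the file location under each semID."""
    for sem_id in derive_sem_ids(sv, n):
        index[sem_id].add(location)


sv_a = {f"term{i}" for i in range(0, 50)}
sv_b = {f"term{i}" for i in range(5, 55)}   # shares 45 of 55 keywords with sv_a
index_file("peer1:/docs/a", sv_a)
index_file("peer2:/docs/b", sv_b)

shared = set(derive_sem_ids(sv_a)) & set(derive_sem_ids(sv_b))
print(len(shared), "of 20 semIDs are shared")
```

Every shared semID is a DHT key under which both files are indexed, so a single lookup on that key finds them together.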

Semantic Indexing • Using n groups of m hash functions • XOR the hash results within a group • Results: • The indices of semantically close files are hashed to the same peers with probability $1-(1-p^m)^n$ • p is expected to be high for semantically close files, so is the probability • *p=sim(f1,f2), the similarity between the two files’ SVs
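
The grouping scheme can be sketched as below: each of the n groups XORs m min-hash values into one semID, so two files agree on a group only if all m min-hashes in it agree. Treating SVs as keyword sets and using 64-bit salted SHA-1 values as the hash family are assumptions, as are the function names.

```python
import hashlib


def _h64(text):
    # First 8 bytes of SHA-1 as a 64-bit integer.
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def min_hash_value(sv, seed):
    # Smallest salted hash value over the keywords of the SV.
    return min(_h64(f"{seed}:{term}") for term in sv)


def grouped_sem_ids(sv, n=20, m=5):
    """Return n semIDs; each is the XOR of m min-hash values."""
    sem_ids = []
    for group in range(n):
        sem_id = 0
        for j in range(m):
            sem_id ^= min_hash_value(sv, group * m + j)
        sem_ids.append(sem_id)
    return sem_ids


sv_a = {f"term{i}" for i in range(0, 100)}
sv_b = {f"term{i}" for i in range(1, 101)}     # nearly identical to sv_a
sv_c = {f"other{i}" for i in range(0, 100)}    # no keywords in common

ids_a, ids_b, ids_c = (grouped_sem_ids(s) for s in (sv_a, sv_b, sv_c))
close_matches = sum(x == y for x, y in zip(ids_a, ids_b))
far_matches = sum(x == y for x, y in zip(ids_a, ids_c))
print(close_matches, far_matches)
```

The near-identical pair agrees on most groups, while the unrelated pair agrees on none, which is the behaviour the $1-(1-p^m)^n$ expression describes.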

Effects of n and m • Semantically close files are hashed to the same peers with probability $1-(1-p^m)^n$ • A big n would • Increase the probability • Increase the load of indexing / querying • A small m might • Increase the probability • Cluster the indices of dissimilar files to the same peers, affecting load-balancing
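
The trade-off can be made concrete by evaluating the expression for a similar pair and a dissimilar pair. The values p=0.9 and p=0.3 are illustrative choices, not figures from the evaluation, and the function name is hypothetical.

```python
def colocate_probability(p, n, m):
    """Probability that at least one of n groups of m hashes matches."""
    return 1 - (1 - p ** m) ** n


for p in (0.9, 0.3):            # similar pair vs. dissimilar pair
    for m in (1, 5):
        for n in (5, 20):
            prob = colocate_probability(p, n, m)
            print(f"p={p} m={m} n={n} -> {prob:.4f}")
```

With m=1 and n=20, even a dissimilar pair (p=0.3) lands on a common peer almost every time, whereas m=5 cuts that to roughly 5% while the similar pair (p=0.9) still collides almost surely. This is the clustering effect that a small m causes.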

Semantic Locating • Given a query’s SV • Step 1: Derive a small number of semIDs from the SV using LSH • Step 2: Locate the semantically close files by using these semIDs as the DHT keys • Goal: answer a query by consulting only a small number of peer nodes
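
A minimal end-to-end sketch of locating, under the same kind of assumptions as before: SVs are keyword sets, salted SHA-1 min-hashes stand in for the LSH family, a dictionary stands in for the DHT, and the file locations are made up. The query here is "locate files similar to f1", so the query SV is the SV of an already indexed file.

```python
import hashlib
from collections import defaultdict


def _h(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def derive_sem_ids(sv, n=20):
    # One semID per min-hash function (seed).
    return [
        _h(f"semid:{seed}:" + min(sv, key=lambda t: _h(f"{seed}:{t}")))
        for seed in range(n)
    ]


def locate(index, query_sv, n=20):
    """Look up at most n semIDs and rank locations by number of matches."""
    hits = defaultdict(int)
    for sem_id in derive_sem_ids(query_sv, n):
        for location in index.get(sem_id, ()):
            hits[location] += 1
    return sorted(hits, key=lambda loc: (-hits[loc], loc))


files = {
    "peer1:/a": {f"term{i}" for i in range(0, 50)},
    "peer2:/b": {f"term{i}" for i in range(5, 55)},    # close to /a
    "peer3:/c": {f"other{i}" for i in range(0, 50)},   # unrelated
}

index = {}   # stand-in for the DHT: semID -> file locations
for location, sv in files.items():
    for sem_id in derive_sem_ids(sv):
        index.setdefault(sem_id, set()).add(location)

results = locate(index, files["peer1:/a"])
print(results)
```

The query touches at most n keys, so it consults only a small number of index nodes, and the unrelated file never appears in the results because it shares no semID with the query.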

Evaluation • Load distribution of semantic indexing • Semantic indices per peer node • Performance of semantic locating • Percentage of semantically close files that can be located

Semantic Indexing • [Figure: load distribution when the system indexes 10,000 files, n=20, m=5. Axes: number of file indexes per node; number of peer nodes]

Semantic Indexing • [Figure: load distribution in a 1000-node system, n=20, m=5. Axes: number of file indexes per node; number of indexed files (x1000)]

Performance of Semantic Locating • [Figure: percentage of files located as n and m vary] • Apply n groups of m hash functions • Percentage of files located (128-byte fingerprint limit as a SV) • m and n determine the performance of semantic locating

Conclusions • The first step to support semantics-based access in P2P file systems • LSH-based semantic indexing and locating approach • Impose small storage overhead (several MBs) • Efficiency: answer a query by consulting a small number of peers (e.g., 20) • Approximate results, but acceptable • Future work: query consistency and refinement, evaluation using IR workloads etc.
