SNFS: The design and implementation of a Social Network File System
Ch. Kaidos, A. Pasiopoulos, N. Ntarmos, P. Triantafillou
University of Patras
Shameless plug...
• If interested, please check out:
• eXO: Decentralized Autonomous Scalable Social Networking, 5th Conference on Innovative Data Systems Research (CIDR 2011), 2011.
Social Networks
• Our take:
  • Search for
    • People (friends, experts, …)
    • Content (books, photos, videos, blogs, websites, …)
  • Form entities (collections)
    • Friends-lists, content-libs
  • Search for entities
    • Using previously-formed collections…
• SNFS currently provides the foundation for these…
Tagging
• Profiles:
  • Sets of tags describing entities.
• “Search for”:
  • Based on profiles.
  • Ranked retrieval (top-k)
[Figure: Tag 1, Tag 2, Tag 3, Tag 4, Tag 5]
Current State
• 5,000,000,000 photos; 3,000 photos/min (as of September 2010)
• 2,000,000,000 videos served up each day (May 2010)
• 600,000,000 monthly active users (January 2011)
• 15,000,000 books (October 2010); 130,000,000 by the end of the decade
Current State
Need to access published content:
• 22,750,000,000 queries in search engines
• 4,000,000,000 queries in YouTube
• 351,000,000 queries in Facebook
• 416,000,000 queries in MySpace
(U.S. market figures, December 2009)
Current State
• How do I provide interesting objects to my users?
• How do I find stuff I want?
Proposal
A content-aware file system for Social Network Systems
• Useful to users...
• ... and to service providers too!
Previous Work on File Indexing
• 1991 – Semantic File Systems by Gifford
• 1996 – BeFS by Giampaolo and Meurillon, part of BeOS
  • BeOS never had commercial success...
• 1998 – Indexing Service on Windows NT, not needed at the time
  • Remnant of the Object File System from the never-realized Cairo project
• Typically:
  • No ranked retrieval
  • No user input (tags)
  • No user relationships
Desktop Searches
• 2004 – Windows Desktop Search, widely popular
• 2005... – Mac OS X's Spotlight, Google Desktop, Beagle, Strigi, Tracker...
• Typically:
  • No ranked retrieval?
  • No user relationships
  • No use of relationships for searching
Problems
Power tools for power users...
But for average users...
• Boolean operators???
• SQL-like queries???
Previous Work on Ranked Retrieval
• 1968 – SMART system by Salton, which introduced weights in retrieval in place of classical Boolean retrieval
• 1975 – Vectors and cosine similarity by Salton
• 1988 – Other similarity functions tested and evaluated by Salton and Buckley
• 2003 – Fagin proposes and compares several efficient algorithms for top-k retrieval
Design – SNFS
• Tags are extracted from each object and stemmed, and their frequencies are counted.
• Each object is associated with a unique id in a tree.
• Weights are calculated for each tag and document.
• A tf-idf weighting scheme was chosen.
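The weighting step above can be sketched as follows. This is a minimal illustration, not the SNFS implementation: the slides do not say which tf-idf variant or stemmer is used, so the formula (relative term frequency times log(N/df)), the toy `naive_stem` helper, and the sample objects are all assumptions.

```python
import math
from collections import Counter

def naive_stem(tag: str) -> str:
    """Toy stand-in for a real stemmer (assumption): lowercase, drop a trailing 's'."""
    tag = tag.lower()
    return tag[:-1] if tag.endswith("s") and len(tag) > 3 else tag

def tfidf_weights(objects: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """Map each object id to {stemmed tag: tf-idf weight}."""
    # Count how often each stemmed tag occurs in each object.
    counts = {oid: Counter(naive_stem(t) for t in tags) for oid, tags in objects.items()}
    n = len(counts)
    # Document frequency: number of objects that contain each tag.
    df = Counter(term for c in counts.values() for term in c)
    weights = {}
    for oid, c in counts.items():
        total = sum(c.values())
        weights[oid] = {
            term: (freq / total) * math.log(n / df[term])
            for term, freq in c.items()
        }
    return weights

# Hypothetical objects with their raw tags.
objects = {
    1: ["Photos", "beach", "photo"],
    2: ["beach", "video"],
    3: ["books"],
}
weights = tfidf_weights(objects)
```

With this variant, a tag that appears in every object gets weight 0, so rare tags dominate the ranking.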
Design – SNFS
• Term weight and object ID are stored in an inverted index.
• Each posting list of the index is a B+Tree stored in secondary memory.
• The position of the root of each B+Tree in the index is stored in a Red-Black Tree.
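The index layout above can be approximated in memory as a rough sketch. The stand-ins are assumptions made for brevity: a Python dict plays the role of the Red-Black Tree directory, and a sorted list plays the role of each on-disk B+Tree posting list. The function name and the sample weights are hypothetical.

```python
from collections import defaultdict

def build_index(weights):
    """Turn {object_id: {term: weight}} into {term: [(weight, object_id), ...]}.

    Each posting list is sorted by decreasing weight, which is the order
    the top-k algorithms expect for sorted access.
    """
    index = defaultdict(list)
    for oid, term_weights in weights.items():
        for term, w in term_weights.items():
            index[term].append((w, oid))
    for postings in index.values():
        # Highest weight first; ties broken by object id for determinism.
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(index)

# Hypothetical per-object term weights.
sample_weights = {
    1: {"beach": 0.2, "photo": 0.7},
    2: {"beach": 0.5},
    3: {"photo": 0.1, "beach": 0.9},
}
index = build_index(sample_weights)
```

Keeping each list ordered by weight is what lets a query stop reading early instead of scanning whole posting lists.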
Design – Search and retrieval
• The query is split into terms, which are then stemmed.
• The score of each document is calculated using a threshold algorithm and a tf-idf function.
Threshold Algorithms
NRA (No Random Access) Algorithm
Input: posting lists sorted on weight (decreasing)
[Figure: example run over the posting lists of terms t1, t2, t3, containing documents d1 to d5 and read at depths 1, 2, 3. Threshold at depth 1: s1+s2+s3; at depth 2: s4+s5+s6; at depth 3: s7+s8+s9.]
The algorithm halts when no object ranked below the top-k can improve its score enough to exceed the threshold.
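A compact sketch of NRA may make the halting rule concrete. It follows the textbook formulation (lower and upper score bounds per seen object) and is not taken from the SNFS code; the function name, the in-memory lists, and the sample weights are assumptions, and weights are assumed non-negative.

```python
def nra_top_k(posting_lists, k):
    """Top-k by summed weight using sorted access only.

    posting_lists: one list per query term of (weight, doc_id) pairs,
    each sorted by decreasing weight. Returns up to k (doc_id, lower_bound)
    pairs; a lower bound may be below the final score if NRA halts early.
    """
    if k <= 0:
        return []
    m = len(posting_lists)
    seen = {}              # doc_id -> {list index: weight read so far}
    bottoms = [0.0] * m    # last weight read in each list
    lower, top = {}, []
    max_depth = max((len(p) for p in posting_lists), default=0)
    for depth in range(max_depth):
        for i, postings in enumerate(posting_lists):
            if depth < len(postings):
                w, doc = postings[depth]
                seen.setdefault(doc, {})[i] = w
                bottoms[i] = w
            else:
                bottoms[i] = 0.0   # exhausted list cannot add anything more
        # Lower bound: what we have read. Upper bound: assume the best
        # still-possible weight in every list where the doc is unseen.
        lower = {d: sum(s.values()) for d, s in seen.items()}
        upper = {d: lower[d] + sum(bottoms[i] for i in range(m) if i not in s)
                 for d, s in seen.items()}
        ranked = sorted(seen, key=lambda d: (-lower[d], d))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth = lower[top[-1]]
            threshold = sum(bottoms)   # best score of a completely unseen doc
            if threshold <= kth and all(upper[d] <= kth for d in rest):
                break
    return [(d, lower[d]) for d in top]

# Hypothetical posting lists for three query terms.
lists = [
    [(0.9, "d1"), (0.8, "d2"), (0.1, "d3")],
    [(0.7, "d2"), (0.6, "d3"), (0.2, "d1")],
    [(0.8, "d2"), (0.3, "d1"), (0.2, "d3")],
]
best = nra_top_k(lists, 1)
```

In this example the top-1 answer is settled after reading two entries per list, since by then no other document's upper bound can reach the leader's lower bound. The per-document bound bookkeeping is the state-keeping cost that NRA pays for avoiding random accesses.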
Threshold Algorithms
TA (Threshold Algorithm with random accesses)
Input: posting lists sorted on weight (decreasing)
[Figure: example run over the posting lists of terms t1, t2, t3, containing documents d1 to d5 and read at depths 1, 2, 3; each newly seen document is completed by random accesses to the other lists. Threshold at depth 1: s1+s2+s3; at depth 2: s4+s5+s6; at depth 3: s7+s8+s9.]
The algorithm halts once the threshold is no higher than the score of the last (k-th) object in the top-k.
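For comparison, here is a similarly hedged sketch of TA in its textbook form. The per-list dicts that simulate random accesses, the function name, and the sample data are assumptions; in a disk-based index each random access would be an extra lookup in another posting list.

```python
import heapq

def ta_top_k(posting_lists, k):
    """Top-k by summed weight using sorted plus random access.

    posting_lists: one list per query term of (weight, doc_id) pairs,
    each sorted by decreasing weight. Returns up to k (doc_id, score) pairs.
    """
    if k <= 0:
        return []
    # Random access is simulated with one lookup table per posting list.
    lookup = [{doc: w for w, doc in postings} for postings in posting_lists]
    scores = {}   # doc_id -> full score, computed on first sighting
    top = []
    max_depth = max((len(p) for p in posting_lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for postings in posting_lists:
            if depth < len(postings):
                w, doc = postings[depth]
                threshold += w   # sum of the weights read at this depth
                if doc not in scores:
                    # Random accesses: fetch this doc's weight in every list.
                    scores[doc] = sum(table.get(doc, 0.0) for table in lookup)
        top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        # No unseen doc can score above the threshold, so stop once
        # the k-th best known score has reached it.
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top

# Hypothetical posting lists for three query terms.
lists = [
    [(0.9, "d1"), (0.8, "d2"), (0.1, "d3")],
    [(0.7, "d2"), (0.6, "d3"), (0.2, "d1")],
    [(0.8, "d2"), (0.3, "d1"), (0.2, "d3")],
]
best = ta_top_k(lists, 1)
```

TA keeps almost no state beyond the current top-k, but every newly seen document triggers lookups in all other lists, which is where the extra disk accesses come from.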
Qualitative Comparison
• NRA vs. TA, compared on: disk accesses, system calls, state keeping and computation
• We expect TA to perform many more slow disk accesses.
• Does NRA's heavier state keeping and computation outweigh the cost of TA's disk accesses?
• We implement both, on hard disk and on RAM-disk, to find out...
Testing
- 4 real-world test sets
- Files containing tags from online objects
- Index is normally on secondary memory
- RAM-disk used to evaluate the effect of disk accesses
Results demanded vs. time, disk-based index [Plot: TA vs. NRA]
Results demanded vs. time, RAM-based index [Plot: TA vs. NRA]
Query terms vs. time, disk-based index [Plot: TA vs. NRA]
Query terms vs. time, RAM-based index [Plot: TA vs. NRA]
Beagle vs. NRA [Plots: terms vs. time; results vs. time]
Conclusions
SNFS:
- Indexing, storage, and ranked retrieval of entities in a SN.
- Study of the efficiency of algorithms, using real-world data and various implementations.
- Competitive performance (e.g. against Beagle).
- Many ways of further expansion.
Future Work
- Expansion for distributed systems and clouds
- Distributed file systems (HDFS)
- Distributed data structures
- Tagging, indexing, and searching for entity-collections: straightforward, as our ‘object’ implementation/abstraction captures this.
- Establishing entities consisting of relationships between entities, using advanced tagging, and searching for these…