Connections: Using Context to Enhance File Search, from SOSP '05. Russell Greenspan, CS 523, April 2006.
File System 1.0
• Folder-based
  • Too many files to organize into folders in a meaningful way
  • Files in folders do not match how they are interrelated, e.g. a folder containing the papers read in this course
• Attribute-based
  • But... users are unwilling to manually assign attributes to files
Content-based File System
• Indexed
  • Keep full inverted indices: every occurrence of every word is indexed with a pointer to the document that contains it and the location of the word in that document
  • Return all occurrences of the search term(s)
  • Index size is comparable to the size of the document collection
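
A minimal sketch of a full (positional) inverted index as described on this slide. The function name `build_positional_index`, the whitespace tokenizer, and the sample documents are assumptions made for illustration, not part of the original talk.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} so every occurrence is recorded."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

# Illustrative documents (not from the source).
docs = {
    "notes.txt": "context helps file search",
    "paper.txt": "file search with context and more context",
}
index = build_positional_index(docs)
occurrences = {doc_id: positions for doc_id, positions in index["context"].items()}
```

Because one entry is stored per occurrence, the index grows with the total number of words, which is why its size is comparable to that of the collection.
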
Content-based File System
• Ranked results
  • Use inverted indices (as opposed to full inverted indices) that store the documents in which a term appears, not its position within each document
  • To judge relevance, use probabilistic methods such as term frequency within a document and inverse document frequency over the collection
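
The ranking idea can be sketched with a plain TF-IDF score. This is one common formulation, assumed here for illustration; real engines such as Indri use more elaborate models. The name `tf_idf_rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query_terms):
    """Score each document by the sum over query terms of tf * idf."""
    n_docs = len(docs)
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, words in counts.items():
        total_words = sum(words.values())
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in counts.values() if term in c)
            if df == 0 or total_words == 0:
                continue
            tf = words[term] / total_words   # term frequency within this document
            idf = math.log(n_docs / df)      # inverse document frequency
            score += tf * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative documents (not from the source).
docs = {"a": "kernel trace kernel", "b": "trace log", "c": "shopping list"}
ranked = tf_idf_rank(docs, ["kernel"])
```

Only term counts per document are needed, so positions can be dropped from the index.
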
Content-based File System: Limitations
• How to index binary data?
• How to manipulate contextual details?
• Google Desktop Search
  • No substring search
  • Only the first 10,000 words of each document and the first 100,000 documents are indexed, likely to bound index size
Glimpse: Two-level Content Indexing
• To deal with bloated indices, Glimpse offers a hybrid of full inverted indices and sequential search
• Subdivide the file space into manageable blocks, then index the occurrence of terms within each block
• Occurrences of the same term in the same block are stored only once, greatly reducing index size
• On query, use the index to find blocks containing the search terms, then run a sequential search tool like grep within those blocks
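
The two-level scheme can be sketched as follows. This is a toy model, not Glimpse's actual implementation: blocks are fixed-size groups of files, and the sequential pass is a plain word scan standing in for grep. All names and sample data are assumptions.

```python
def build_block_index(files, block_size):
    """Group files into blocks and record, per word, which blocks contain it."""
    names = sorted(files)
    blocks = [names[i:i + block_size] for i in range(0, len(names), block_size)]
    index = {}
    for block_id, block in enumerate(blocks):
        for name in block:
            # A set stores each (word, block) pair once, however often it occurs.
            for word in set(files[name].lower().split()):
                index.setdefault(word, set()).add(block_id)
    return blocks, index

def search(files, blocks, index, term):
    """Use the index to pick candidate blocks, then scan only those files."""
    term = term.lower()
    hits = []
    for block_id in sorted(index.get(term, ())):
        for name in blocks[block_id]:
            if term in files[name].lower().split():  # sequential scan step
                hits.append(name)
    return hits

# Illustrative files (not from the source).
files = {
    "a.txt": "alpha beta",
    "b.txt": "beta gamma",
    "c.txt": "gamma delta",
    "d.txt": "delta alpha",
}
blocks, index = build_block_index(files, block_size=2)
alpha_hits = search(files, blocks, index, "alpha")
beta_hits = search(files, blocks, index, "beta")
```

Larger blocks shrink the index further but lengthen the sequential scan, which is the trade-off the hybrid design exposes.
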
Connections: Context-based File System
• Web-based context
  • “Hub” nodes (nodes that link to many sites) and “authority” nodes (nodes that are linked to often)
  • Web pages linked within a specified vicinity of other pages; a virtual neighborhood
• How can context be applied in file systems?
Connections Architecture
• Find “Temporal Locality”
• Tracer
  • Sits at the system call layer in the kernel, monitoring file system and process management calls
• Relation Graph
  • Stores a graph of relationships between files
  • Nodes in the graph are files; an edge between two nodes indicates a contextual relationship between the files, and the edge weight indicates the strength of that relationship
Identifying Relationships
• Relation window
  • Files accessed within a given window of time; too short a window might miss relationships, while too long a window connects unrelated files
  • Increment the edge weight by 1 each time an operation is repeated
  • Do not re-increment the weight if the same input is already in the relation window
Identifying Relationships
• Operation Type
  • open – temporal relationship between files (accessed at nearby points in time)
  • read/write – causal relationship, since data read from file A can affect data later written to file B
  • all-ops – input is the source file of an mmap, stat, dup, link, or rename operation; output is the destination of a dup, link, or rename
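
A rough sketch of how a relation window and the read/write operation type could combine to build weighted edges. The trace format `(timestamp, op, path)`, the function name `build_relation_graph`, and the 30-second window are assumptions for illustration; the slides do not specify them.

```python
from collections import defaultdict, deque

def build_relation_graph(trace, window_seconds):
    """trace: (timestamp, op, path) tuples sorted by timestamp.

    Each write adds weight 1 to an edge from every distinct file read
    within the window to the file being written.
    """
    weights = defaultdict(int)
    recent_reads = deque()  # (timestamp, path) pairs still inside the window
    for timestamp, op, path in trace:
        # Drop reads that have fallen out of the relation window.
        while recent_reads and timestamp - recent_reads[0][0] > window_seconds:
            recent_reads.popleft()
        if op == "read":
            recent_reads.append((timestamp, path))
        elif op == "write":
            # A set of sources means a repeated input is counted only once.
            for source in {p for _, p in recent_reads}:
                if source != path:
                    weights[(source, path)] += 1
    return dict(weights)

# Illustrative trace (not from the source).
trace = [
    (0, "read", "A"),
    (1, "read", "A"),   # same input again inside the window
    (2, "write", "C"),
    (40, "write", "D"), # the reads of A have expired by now
]
graph = build_relation_graph(trace, window_seconds=30)
```

Here the two reads of A contribute a single increment to the edge A -> C, and the late write to D creates no edge because no input remains in the window.
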
Identifying Relationships – Relation Graph Example
• open(A), open(B) [graph nodes: A, B]
• read(A), write(C) [graph nodes: A, B, C]
• dup(C, D), read(A), write(D) [graph nodes: A, B, C, D]
Context-based Search
• Take the results of a content-based search (e.g. Indri)
• For each file in the results, perform a breadth-first search starting at the file’s node; store all nodes touched in a separate subgraph
  • Limit path length so that only relevant files are reached
  • Limit edge weight so that frequently accessed files are considered only by the files most relevant to them
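
The bounded breadth-first search might look like the sketch below. The graph representation, the name `context_subgraph`, and the reading of "limit edge weight" as skipping edges below a minimum weight are all assumptions; the slides do not give the exact cutoff rule.

```python
from collections import deque

def context_subgraph(graph, start, max_path_length, min_weight):
    """graph: {node: {neighbor: weight}}. Return the set of nodes reached."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_path_length:
            continue  # path-length limit: do not expand further from here
        for neighbor, weight in graph.get(node, {}).items():
            # Assumed cutoff: ignore weak edges below min_weight.
            if weight >= min_weight and neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen

# Illustrative relation graph (not from the source).
graph = {
    "A": {"B": 5, "D": 2},
    "B": {"C": 1},
    "C": {"D": 8},
    "D": {"E": 3},
}
one_hop = context_subgraph(graph, "A", max_path_length=1, min_weight=2)
two_hops = context_subgraph(graph, "A", max_path_length=2, min_weight=2)
```

In this example C is never reached because its only incoming edge from the searched region has weight 1, below the cutoff.
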
Ranking Results
• If a file is rarely used in association with content-matched files, it should receive a lower ranking
• Take each node’s content-matched ranking and augment it with contextual relationships from the Relation Graph
Basic-BFS
• Content-based rankings: A=4, B=1, C=0, D=2
• Consider node D
  • Update D’s rankval with the rankvals of its incoming edges, each scaled by the percentage of D’s total incoming edge weight that the edge represents
  • For example, A->D = (2/10) * 4 (A’s rankval)
• Repeat to get the total weight pushed to each node from all contextual relationships
[Figure: relation graph over nodes A, B, C, D with edge weights 5, 1, 2, 8]
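
The push step for node D can be worked through in a few lines. The rankvals and the A->D weight of 2 out of a total of 10 come from the slide; the assumption that the remaining incoming weight of 8 arrives from C is a guess made for illustration, as is the helper name `pushed_weight`. How the pushed weight is then combined with D's own rankval is not stated on the slide, so it is left out.

```python
def pushed_weight(incoming, rankvals):
    """incoming: {source: edge_weight} for one node.

    Each source pushes its rankval scaled by its share of the node's
    total incoming edge weight.
    """
    total = sum(incoming.values())
    return sum((weight / total) * rankvals[source]
               for source, weight in incoming.items())

rankvals = {"A": 4, "B": 1, "C": 0, "D": 2}  # content-based rankings from the slide
incoming_to_d = {"A": 2, "C": 8}             # C -> D = 8 is an assumption
contribution = pushed_weight(incoming_to_d, rankvals)
```

Under these assumptions A contributes (2/10) * 4 and C contributes (8/10) * 0, so nearly all of the weight pushed to D comes from the lighter but content-relevant edge.
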
Ranking Results
• HITS algorithm
  • Improve the ranking of “authorities”, nodes linked to many times, and “hubs”, nodes with many outgoing links
• PageRank
  • Rank by the probability of reaching a particular node on a random walk of the graph (Google’s ranking algorithm)
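
The random-walk view of PageRank can be sketched with power iteration. This is the textbook formulation adapted to weighted edges, not necessarily the variant used in Connections; the damping factor of 0.85, the iteration count, and the sample graph are conventional or illustrative assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {node: {neighbor: weight}}. Return {node: rank}; ranks sum to about 1."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump share first.
        new_rank = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            targets = graph.get(node, {})
            if targets:
                total = sum(targets.values())
                for target, weight in targets.items():
                    # Follow an outgoing edge with probability proportional to its weight.
                    new_rank[target] += damping * rank[node] * weight / total
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / count
        rank = new_rank
    return rank

# Illustrative graph (not from the source): A and B both point at C, C points back at A.
ranks = pagerank({"A": {"C": 1}, "B": {"C": 1}, "C": {"A": 1}})
```

C ends up ranked highest because both other nodes lead to it, while B, which nothing links to, keeps only the random-jump share.
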
Evaluation
• Compared to content-only ranking (via Indri)
  • Recall (fewer false negatives) increased from 13% to 22% for top-10 and from 34% to 74% overall
  • Precision (fewer false positives) increased from 23% to 29% for top-10 and from 15% to 16% overall
• Best precision from:
  • Read/write filter
  • Path length of 3
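
For reference, the two metrics under their standard definitions can be computed as below. The helper name `precision_recall` and the sample file names are illustrative, and this is not the evaluation harness used in the paper.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned files that are relevant.
    Recall: share of relevant files that were returned."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative query result (not from the source).
precision, recall = precision_recall(
    returned=["a", "b", "c", "d"],
    relevant=["a", "b", "e"],
)
```

Returning more context-related files tends to raise recall sharply, while precision moves less because extra irrelevant files come along with the relevant ones.
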
Performance
• The background service needs 23 seconds per day on average to merge trace results into the Relation Graph
• On average, index size is less than 1% of data set size
• Queries take 2.62 seconds on average (0.98s for content search and 1.64s for context search)
Discussion
• Other applicable context information? Applications, user personalization
• Deleted files: should they be left in?
• Network file access?
• Implement closer to the kernel? Could better handle renamed files, organize a virtual directory structure, and assign attributes
References
• C. Soules and G. Ganger. Connections: Using Context to Enhance File Search. Symposium on Operating Systems Principles, October 2005.
• C. Soules and G. Ganger. Why Can't I Find My Files? New Methods for Automating Attribute Assignment. 9th Workshop on Hot Topics in Operating Systems (HotOS IX), May 2003.
• D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: Terabyte Track. Text Retrieval Conference, 2004.
• U. Manber and S. Wu. GLIMPSE: A Tool to Search Through Entire File Systems. Winter USENIX Technical Conference, pages 23–32. USENIX Association, 1994.