Enhanced Hierarchical File System Indexer • Matthew Madson • Evan Figueroa • COGS 188
Hypothesis • Our claim is that by using the file system’s metadata (e.g. directory names) as additional information to build enhanced queries, transparently to the user, we can improve the precision of the user’s queries and return more relevant search results.
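As a rough illustration of what "using directory names as additional information" could look like, here is a minimal sketch that turns the directories in a file's path into lowercase terms. The function name `path_terms` and the tokenization rule (split on anything that is not a letter or digit) are assumptions made for this example; the slides do not say how the path was actually parsed.

```python
import re
from pathlib import PurePosixPath


def path_terms(file_path):
    """Return lowercase terms taken from the directory names of file_path.

    Hypothetical helper: the file name itself is ignored and each
    directory name is split on non-alphanumeric characters.
    """
    terms = []
    for part in PurePosixPath(file_path).parent.parts:
        for token in re.split(r"[^A-Za-z0-9]+", part):
            if token:  # skip empty pieces produced by separators
                terms.append(token.lower())
    return terms


if __name__ == "__main__":
    print(path_terms("docs/Java_API/lucene-core/index.html"))
```

Terms produced this way could then be stored in a separate index field or added to a query alongside the terms the user typed.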
Our Test Corpus • 54,977 documents across ~120 files
Our Test Corpus Cont. • Top 5 (Re-indexed): Class, Method, Object, Public, valu • Keywords • File noise words • Stem • Analyze with Snowball Analyzer • Top 5: Use, Class, Method, Object, File
Technology Used • Lucene • Lucene Snowball Analyzer • Google Collections • PDFRenderer • PDFBox • CHMDeco & Istorage
Assessment • Compare relevance between 4 query formats: • Baseline Query: Contents: non-path terms & path terms • Boosted Path Baseline Query: Contents: path terms; Contents: non-path terms^2.0
Assessment Cont. • Boosted Path Query (Non-Baseline): Parsed-path: query terms; Contents: path terms; Contents: non-path terms^2.0 • Path Query (Non-Baseline): Parsed-path: query terms; Contents: query terms
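The four query formats can be sketched as strings in Lucene's classic query-parser syntax (`field:term`, with `^2.0` as a boost). This is only an illustration under stated assumptions: the field names `contents` and `parsed-path` are lowercased from the slides, "query terms" is taken to mean the non-path terms followed by the path terms, and terms are assumed to be plain tokens with no characters that need escaping. The helper names are hypothetical.

```python
def clause(field, terms, boost=None):
    """Join terms into 'field:term' clauses, optionally boosted."""
    suffix = "" if boost is None else f"^{boost}"
    return " ".join(f"{field}:{term}{suffix}" for term in terms)


def build_queries(non_path_terms, path_terms):
    """Return the four query formats as query-parser strings.

    Assumption: 'query terms' = non-path terms + path terms.
    """
    query_terms = list(non_path_terms) + list(path_terms)
    boosted_contents = " ".join([
        clause("contents", path_terms),
        clause("contents", non_path_terms, boost=2.0),
    ])
    return {
        "baseline": clause("contents", query_terms),
        "boosted_path_baseline": boosted_contents,
        "boosted_path": clause("parsed-path", query_terms) + " " + boosted_contents,
        "path": clause("parsed-path", query_terms) + " " + clause("contents", query_terms),
    }


if __name__ == "__main__":
    for name, query in build_queries(["index", "writer"], ["lucene"]).items():
        print(f"{name}: {query}")
```

Building all four strings from the same two term lists keeps the comparison fair, since each format differs only in which field is searched and which terms are boosted.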
How? • Run all 4 query formats with the same test query. • Take the top 10 results from each query format and put them in a bag. • Shuffle the bag and remove duplicates. • Assess (Yes / No) whether each result is relevant to the test query. • Reverse the process to see how well each query format did.
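The assessment procedure can be sketched roughly as follows. Pooling and shuffling hide which query format produced a result, so the Yes / No judgments are made blind. Scoring each format by the fraction of its top 10 results judged relevant (precision at 10) is one reading of "reverse the process", not necessarily the measure the authors used. The function names and the sample document IDs are hypothetical.

```python
import random


def pool_results(results_by_format, k=10, seed=None):
    """Put each format's top-k results in one bag, shuffle, drop duplicates."""
    bag = []
    for ranked in results_by_format.values():
        bag.extend(ranked[:k])
    random.Random(seed).shuffle(bag)
    # dict.fromkeys keeps the first occurrence of each result, in shuffled order
    return list(dict.fromkeys(bag))


def precision_at_k(results_by_format, relevant, k=10):
    """Map judgments back to each format: share of its top-k judged relevant."""
    scores = {}
    for name, ranked in results_by_format.items():
        top = ranked[:k]
        scores[name] = sum(doc in relevant for doc in top) / len(top) if top else 0.0
    return scores


if __name__ == "__main__":
    results = {
        "baseline": ["d1", "d2", "d3"],
        "path": ["d2", "d4", "d5"],
    }
    to_judge = pool_results(results, seed=0)  # shown to the assessor
    judged_relevant = {"d2", "d4"}            # results marked "Yes"
    print(to_judge)
    print(precision_at_k(results, judged_relevant))
```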