K-tree/forest: Efficient Indexes for Boolean Queries

Rakesh M. Verma and Sanjiv Behl
University of Houston
www.cs.uh.edu/~rmverma

Boolean queries
• Alice and Bob -- Retrieve documents containing both Bob and Alice
• Alice or Bob -- Retrieve documents containing either Bob or Alice or both
• Alice and not Bob, …
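As a small illustration of the three query types, the sketch below treats each keyword as the set of documents containing it, so that and, or, and and-not become set intersection, union, and difference. The toy collection `docs` and the helper `containing` are hypothetical names introduced only for this example.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "alice met bob",
    2: "alice alone",
    3: "bob alone",
    4: "nobody here",
}

def containing(word):
    """Return the set of document ids whose text contains the word."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

alice = containing("alice")
bob = containing("bob")

and_result = alice & bob       # Alice and Bob
or_result = alice | bob        # Alice or Bob
and_not_result = alice - bob   # Alice and not Bob
```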

Existing solutions
Query: Bob and Alice
Inverted file
• Retrieve inverted list (on disk) for Bob
• Retrieve inverted list for Alice
• Merge the lists to compute the intersection, or
• For “And” only: retrieve the shorter list and scan the docs (disk I/Os “saved?” at the expense of CPU time)
• Google query times: Bob 0.11 s, Alice 0.1 s, Bob and Alice 0.2 s
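The two inverted-file strategies above can be sketched as follows. This is a minimal in-memory illustration, assuming each inverted list is a sorted list of document ids; the function names and the `fetch_doc` callback are hypothetical, and in the setting of the slides each list or document fetch would cost disk I/Os.

```python
def intersect_sorted(a, b):
    """Merge technique: walk two sorted inverted lists in step."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def scan_shorter(shorter_list, other_word, fetch_doc):
    """Scan technique: read only the shorter list, then check each
    listed document for the other keyword (CPU work instead of a
    second list retrieval)."""
    return [d for d in shorter_list if other_word in fetch_doc(d).split()]

# Hypothetical data for illustration.
texts = {3: "bob and alice", 5: "alice bob", 7: "bob only"}
bob_list = [3, 5, 7]
alice_list = [1, 3, 4, 5, 8]

merged = intersect_sorted(bob_list, alice_list)
scanned = scan_shorter(bob_list, "alice", texts.__getitem__)
```

Both calls return the same documents here; they differ in what has to be read to get there.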

Existing solutions
Query: Bob and Alice
Build secondary index on inverted lists
• Retrieve secondary index on Bob’s list from disk (assuming secondary index on Bob’s list is smaller)
• Search for Alice in secondary index
• Retrieve documents

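K-tree (leaves point to lists on disk)
[Figure: a binary tree whose root tests Alice, with branches 0 and 1; each child tests Bob, with branches 0 and 1; the four leaves point to lists on disk.]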
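The figure can be read as a complete binary tree with one level per keyword and one leaf per 0/1 path. The sketch below is one possible in-memory reading of that picture, not the authors' implementation: it assumes each leaf holds the list of documents whose keyword membership matches its path exactly, and it keeps leaves in a dictionary keyed by path instead of on disk. The class `KTree` and its methods are hypothetical names.

```python
from itertools import product

class KTree:
    def __init__(self, keywords):
        self.keywords = list(keywords)
        # One leaf per 0/1 path, i.e. 2^K leaves for K keywords.
        self.leaves = {
            path: [] for path in product((0, 1), repeat=len(self.keywords))
        }

    def add(self, doc_id, words):
        """File the document under the leaf matching its keyword bits."""
        path = tuple(int(k in words) for k in self.keywords)
        self.leaves[path].append(doc_id)

    def match(self, required):
        """required maps keyword -> 1 (must contain) or 0 (must not).
        Keywords left out may take either branch. This sketch filters
        all paths; a pointer-based tree would follow only the
        matching branches."""
        pos = {k: i for i, k in enumerate(self.keywords)}
        hits = []
        for path, doc_ids in self.leaves.items():
            if all(path[pos[k]] == bit for k, bit in required.items()):
                hits.extend(doc_ids)
        return sorted(hits)

    def any_of(self, words):
        """Or query: collect every leaf with at least one requested bit set."""
        idx = [self.keywords.index(w) for w in words]
        hits = []
        for path, doc_ids in self.leaves.items():
            if any(path[i] for i in idx):
                hits.extend(doc_ids)
        return sorted(hits)

tree = KTree(["alice", "bob"])
tree.add(1, {"alice", "bob"})
tree.add(2, {"alice"})
tree.add(3, {"bob"})
tree.add(4, set())

both = tree.match({"alice": 1, "bob": 1})        # Alice and Bob
alice_only = tree.match({"alice": 1, "bob": 0})  # Alice and not Bob
either = tree.any_of(["alice", "bob"])           # Alice or Bob
```

Under this reading, an and or and-not query that fixes every keyword lands on a single leaf, while an or query, or a query that leaves some keyword unspecified, has to gather several leaves.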

Experiments
• Data
  • 1 million word documents divided into pages of 100 words each
  • Pages indexed by keywords contained
• Methods
  • BST-based inverted file using merge or scan technique
  • K-tree
• Queries of type:
  • Single keyword
  • Two keywords “and/and-not”

Results for single-word query

Method                                        I/Os
BST-based inverted file                       31.26
K-tree (parallel)                             25.36
K-tree (sequential)                           37.05
K-tree (sequential with no fragmentation)     31.26

Note: index in memory, inverted lists on disk for all methods. Results are averages for all possible queries of type listed before.

Results for two-word “and” query

Method                                        I/Os
BST-based inverted file (merge)               62.52
BST-based inverted file (scan)                10.13
K-tree (parallel)                              0.57
K-tree (sequential)                            0.77
K-tree (sequential with no fragmentation)      0.61

Note: index in memory, inverted lists on disk for all methods. Results are averages for all possible queries of type listed before.

K-forest
• Tradeoff: size of K-forest vs. post-processing
• In general, choose the size of subset, s, so that C(K,s)·2^s <= available memory. K can be reduced by standard techniques and by considering frequency.
[Figure: an index on subsets of size 3 pointing to K-trees for 3 keywords.]
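The sizing rule can be worked through numerically: a K-forest with one K-tree per s-keyword subset has C(K,s) trees of 2^s leaves each. The sketch below picks the largest s that fits a budget, under the assumption that "available memory" is counted in leaf entries; the function names and the figures K = 20 and a budget of one million are illustrative only.

```python
from math import comb

def forest_leaves(K, s):
    """Total leaves in a forest with one 2^s-leaf tree per s-subset."""
    return comb(K, s) * 2 ** s

def largest_subset_size(K, budget):
    """Largest s, growing from 1, whose forest still fits the budget.
    Stops at the first s that does not fit, since the count first
    rises with s before falling back to 2^K at s = K."""
    best = 0
    for s in range(1, K + 1):
        if forest_leaves(K, s) > budget:
            break
        best = s
    return best

size_for_3 = forest_leaves(20, 3)            # 1140 subsets * 8 leaves
best_s = largest_subset_size(20, 1_000_000)
single_tree = 2 ** 20                        # one K-tree over all 20 keywords
```

With these illustrative numbers, subsets of size 3 need 9,120 leaves, size 5 is the largest that fits, and a single K-tree over all 20 keywords would need 1,048,576 leaves.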

K-tree highlights
• Advantages:
  • And/But queries: no post-processing
  • Or queries: require some K-tree traversal
  • Easy to implement
  • Easy to parallelize, especially for shorter and/and-not queries and all or queries
• Disadvantage:
  • Size 2^K for K keywords, but this is overkill since user queries are typically short (over 90% of queries contain at most 5 keywords). Queries with 10 or more keywords are very rare.

Conclusions and Future Work
• We have presented efficient structures (K-tree/forest) for Boolean queries
• One direction is to do more experiments using, for example, TREC collections
• Another direction is to study how document characteristics can help in choosing the “right set of keywords” to include in these structures
