Accelerating search engine query processing using the GPU
Sudhanshu Khemka

Prominent Document Scoring Models
The Vector Space Model • Treats each document as a vector with one component for each term in the dictionary. • The weight of a component is computed with the tf-idf weighting scheme, w_{t,d} = tf_{t,d} \cdot \log(N/df_t), where tf_{t,d} is the number of occurrences of term t in document d and \log(N/df_t) is the inverse document frequency of t (N is the number of documents, df_t the number of documents containing t). • As the query is itself a mini document, the model represents the query as a vector too. • The similarity between the query vector q and a document vector d is their cosine similarity: sim(q,d) = (q \cdot d) / (|q|\,|d|).
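The tf-idf weighting and cosine similarity above can be sketched in a few lines. This is a minimal illustration, not code from the presentation; the function names and the tiny corpus in the usage note are invented for the example.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector: weight(t) = tf(t, d) * log(N / df(t))."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A query is scored against a document by building a tf-idf vector for each (the query being treated as a mini document) and taking their cosine.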
The Language Model Based Approach to IR • Builds a probabilistic language model for each document d and ranks documents by P(d|q). • The formula is simplified using Bayes' rule: P(d|q) = P(q|d) P(d) / P(q). • P(q) is the same for all documents and P(d) is treated as uniform across all documents. Thus, ranking by P(d|q) is equivalent to ranking by P(q|d). • P(q|d) can be estimated in a number of ways. For example, using the maximum likelihood estimate and the unigram assumption: P(q|d) = \prod_{t \in q} tf_{t,d} / L_d, where L_d is the length of d.
A lot of research has been done on efficient CPU algorithms that improve query response time • We look at the task of improving query response time from a different perspective • Instead of focusing only on writing efficient algorithms for the CPU, we shift our attention to the processor itself and formulate the following question: • "Can we accelerate search engine query processing using the GPU?"
Why the GPU? • The GPU's programming model is highly suitable for processing data in parallel • It allows programmers to define a grid of thread blocks; each thread in a thread block can execute a subset of the operations in parallel • This is useful for information retrieval, as the score of each document can be computed in parallel.
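The grid/block/thread mapping can be illustrated with a CPU emulation: each logical thread scores one document, with the global document index derived from the block and thread indices exactly as in a 1-D CUDA launch. This is a sketch with invented names and a toy term-count score, not the presentation's actual kernel; a real GPU would run the threads concurrently and use hundreds of threads per block.

```python
import math

THREADS_PER_BLOCK = 4  # illustrative; real kernels typically use 128-1024

def score_kernel(block_idx, thread_idx, docs, query, scores):
    """One logical thread computes the score of one document."""
    i = block_idx * THREADS_PER_BLOCK + thread_idx  # global thread id
    if i < len(docs):  # bounds guard, as in a real CUDA kernel
        scores[i] = sum(docs[i].count(t) for t in query)  # toy score: term counts

def launch(docs, query):
    """CPU stand-in for a 1-D kernel launch covering every document."""
    n_blocks = math.ceil(len(docs) / THREADS_PER_BLOCK)
    scores = [0.0] * len(docs)
    for b in range(n_blocks):          # on a GPU, blocks run in parallel
        for t in range(THREADS_PER_BLOCK):  # and so do threads within a block
            score_kernel(b, t, docs, query, scores)
    return scores
```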
Past work • Ding et al., in their paper "Using Graphics Processors for High Performance IR Query Processing," implement a variant of the vector space model, Okapi BM25, on the GPU and demonstrate promising results. • Okapi BM25 (in its standard form) scores a document d against a query q as: score(q,d) = \sum_{t \in q} idf(t) \cdot tf_{t,d}(k_1 + 1) / (tf_{t,d} + k_1(1 - b + b \cdot |d|/avgdl)). • In particular, they provide data-parallel algorithms for inverted list intersection, list compression, and top-k scoring.
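A reference implementation of the standard BM25 formula above, for comparison with the vector space and language model scorers. This is a generic textbook version with the usual defaults k1 = 1.2, b = 0.75, not the exact variant Ding et al. implement on the GPU.

```python
import math
from collections import Counter

def bm25_score(query, doc, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Standard Okapi BM25 score of one document for a query.

    df: term -> number of documents containing the term
    avgdl: average document length over the collection
    """
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in df:
            continue  # term unseen in the collection contributes nothing
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
        numerator = tf[t] * (k1 + 1)
        denominator = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * numerator / denominator
    return score
```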
My contribution • Propose an efficient GPU implementation of the second ranking model, the language model based approach to document scoring • Method: • Apply a divide and conquer approach, since we need to compute P(q|d) for each document in the collection • Each block on the GPU calculates the scores of a subset of the documents, sorts the scores, and transfers the results to an array in the GPU's global memory • After all the blocks have written their sorted scores to the array in global memory, we use a parallel merge algorithm to merge the results and return the top k • Satish et al., in their paper "Designing Efficient Sorting Algorithms for Manycore GPUs," provide a merge sort implementation that they report to be the fastest in the literature.
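The per-block sort followed by a global merge can be sketched sequentially on the CPU. This sketch (invented for illustration) splits the score array into blocks, sorts each block in descending order, and merges the sorted runs to take the top k; on the GPU the per-block sorts would run concurrently and the final step would use a parallel merge such as the one by Satish et al.

```python
import heapq

def topk_blockwise(scores, block_size, k):
    """Sort each block of scores descending, merge the sorted runs, take top k."""
    blocks = [sorted(scores[i:i + block_size], reverse=True)
              for i in range(0, len(scores), block_size)]  # one run per GPU block
    merged = heapq.merge(*blocks, reverse=True)  # sequential stand-in for parallel merge
    return [s for _, s in zip(range(k), merged)]
```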