Efficient Processing of Top-k Spatial Keyword Queries
João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg
SSTD 2011 - Minneapolis, Minnesota, USA
Outline
• Top-k spatial keyword queries
• Current approaches
• Spatial inverted index
• Single-keyword queries
• Multiple-keyword queries
• Experimental evaluation
• Conclusion
Motivation
• More and more documents on the Internet are associated with a spatial location
  • Examples: tweets, images (Flickr), Wikipedia articles, OpenStreetMap objects, …
• Most of these geotagged objects are also associated with text (a description)
Top-k spatial keyword queries
• Query
  • Spatial location
  • Query keywords
• Returns the k best spatio-textual objects ranked in terms of both
  • Spatial distance to the query location
  • Textual relevance to the query keywords
[Figure: example query with the keywords "Italian food"]
Another example…
• Query
  • Spatial location
  • Query keywords
• Returns the k best spatio-textual objects ranked in terms of both
  • Spatial distance to the query location
  • Textual relevance to the query keywords
[Figure: query location q, the objects, and the distance between q and an object]
Ranking objects
• The score of an object p for a query q combines two measures
  • The spatial proximity (δ) is computed from the normalized Euclidean distance between p and q
  • The textual relevance (θ) is the cosine similarity between the description of p and the query keywords
• The query preference parameter (α) defines the importance of one measure over the other
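The ranking on this slide can be sketched in Java as follows. This is a minimal illustration, not the authors' code. It assumes the score is the linear combination α·δ + (1 − α)·θ, that δ is turned into a proximity as 1 − distance/dMax so that higher scores are better, and that descriptions are sparse term-weight maps. The class name, method names, and the `dMax` parameter are hypothetical.

```java
import java.util.Map;

final class SpatioTextualScore {
    // Assumed proximity in [0, 1]: 1 at the query location, 0 at distance dMax or beyond.
    static double proximity(double px, double py, double qx, double qy, double dMax) {
        double d = Math.hypot(px - qx, py - qy);
        return 1.0 - Math.min(d, dMax) / dMax;
    }

    // Cosine similarity between two sparse term-weight vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Assumed combination: alpha weighs proximity (delta) against textual relevance (theta).
    static double score(double alpha, double delta, double theta) {
        return alpha * delta + (1.0 - alpha) * theta;
    }
}
```

With α close to 1 the ranking is driven mostly by distance; with α close to 0 it is driven mostly by text.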
Current approaches
• Employ a modified R-tree [1,2]
• Each node keeps an abstract document representing all documents in the node sub-tree
• Abstract document
  • Pairs (term, weight), one pair per term
  • The weight permits computing an upper-bound score for the objects in the node sub-tree
[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial web objects”, VLDB, 2009.
[2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: “IR-tree: an efficient index for geographic document search”, TKDE, 2010.
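As a rough illustration of the abstract document on this slide, the Java sketch below merges the term-weight pairs of the documents in a sub-tree. It assumes the stored weight is the maximum weight of the term over those documents, which is one way to obtain a valid upper bound; the slide itself does not state the aggregation rule. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AbstractDocumentSketch {
    // Builds one (term, weight) pair per term, keeping the largest weight seen.
    // Assumption: taking the maximum gives an upper bound for every document below the node.
    static Map<String, Integer> merge(List<Map<String, Integer>> documents) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> doc : documents) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return merged;
    }
}
```

The same merge can be applied one level up, to the abstract documents of the child nodes, to obtain the abstract document of a parent node.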
Example
[Figure: modified R-tree with root entries e1, e2, e3 over the objects p1 to p7 and the query q; entries are annotated with abstract documents given as term:weight pairs over the terms bar, pop, pub, rock, and samba]
• For simplicity, we assume that the impact of a term is defined by its frequency
Current approaches
• There are several variations
  • Incorporating document similarity
  • Clustering the nodes
• Main problems
  • Frequent and infrequent terms are stored in the same way (have the same cost)
  • Queries access many nodes because of the high dimensionality of text
  • Complex management of inverted files and/or vectors, one per node
Spatial inverted index (S2I)
• Like an inverted index, S2I maps each term to the objects that contain the term
• The most frequent terms are stored in aggregated R-trees (aR-trees)
• The less frequent terms are stored in blocks in a file
• The aR-tree permits accessing the objects in decreasing order of term relevance
• The blocks permit storing the less frequent terms efficiently
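The term-to-storage mapping can be sketched in Java as below. This is a simplified in-memory illustration under stated assumptions, not the S2I implementation: postings are kept in plain lists, and a term counts as "frequent" once it has more postings than a hypothetical `blockCapacity`. The class, the `Posting` type, and the threshold rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class S2ISketch {
    enum Storage { BLOCK, TREE }

    // One posting: the object, its location, and the impact of the term in the object.
    static final class Posting {
        final int objectId;
        final double x, y, impact;

        Posting(int objectId, double x, double y, double impact) {
            this.objectId = objectId;
            this.x = x;
            this.y = y;
            this.impact = impact;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();
    private final int blockCapacity; // hypothetical threshold, not taken from the slides

    S2ISketch(int blockCapacity) {
        this.blockCapacity = blockCapacity;
    }

    void add(String term, Posting p) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(p);
    }

    // Assumed rule: a term that no longer fits in one block would be kept in an aR-tree.
    Storage storageOf(String term) {
        List<Posting> list = postings.get(term);
        int n = (list == null) ? 0 : list.size();
        return n > blockCapacity ? Storage.TREE : Storage.BLOCK;
    }
}
```

Because each term has its own block or tree, a query touches only the storage of its own keywords.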
Distribution of terms
[Figure: frequency of each term; axes: Terms, Frequency]
• The distribution of terms is very skewed
• A few hundred terms account for 50% of the text
Example
Aggregated R-tree (max) for frequent terms (e.g., pub)
• Only relevant objects are evaluated
• The objects are accessed in decreasing order of score
[Figure: aR-tree for the term "pub" and the query q. The root e0 has the entries e1 (max=1) and e2 (max=2); e1 contains p1(1) and p2(1); e2 contains p5(2), p6(2), and p7(1). The number in parentheses is the term impact, and each entry stores the max value of its sub-tree.]
Single-keyword queries
• Only a single block or tree is accessed
• Block
  • All the objects are read and the k best are reported
• Tree
  • The nodes are accessed in decreasing order of score
  • The algorithm terminates when the score of the k-th object is higher than the score of any unvisited node
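The tree case can be sketched in Java as a best-first traversal with a max-heap. This is an illustrative sketch under assumptions, not the authors' algorithm: the upper bound of a node is taken as α·(1 − mindist/dMax) + (1 − α)·maxImpact, impacts are assumed to be already normalized to [0, 1], and the tree is held in memory. All class, field, and parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class SingleKeywordTopK {
    /** Tree entry: a leaf object (children == null) or an internal node. */
    static final class Entry {
        final String id;
        final double minX, minY, maxX, maxY; // bounding box; a single point for objects
        final double maxImpact;              // exact impact for objects, max over the sub-tree for nodes
        final List<Entry> children;

        Entry(String id, double x, double y, double impact) {
            this.id = id;
            this.minX = x; this.maxX = x; this.minY = y; this.maxY = y;
            this.maxImpact = impact;
            this.children = null;
        }

        Entry(String id, List<Entry> children) {
            double x0 = Double.POSITIVE_INFINITY, y0 = Double.POSITIVE_INFINITY;
            double x1 = Double.NEGATIVE_INFINITY, y1 = Double.NEGATIVE_INFINITY, m = 0.0;
            for (Entry c : children) {
                x0 = Math.min(x0, c.minX); y0 = Math.min(y0, c.minY);
                x1 = Math.max(x1, c.maxX); y1 = Math.max(y1, c.maxY);
                m = Math.max(m, c.maxImpact);
            }
            this.id = id;
            this.minX = x0; this.minY = y0; this.maxX = x1; this.maxY = y1;
            this.maxImpact = m;
            this.children = children;
        }
    }

    // Exact score for an object, upper bound for a node (minimum distance, maximum impact).
    static double bound(Entry e, double qx, double qy, double alpha, double dMax) {
        double dx = Math.max(0.0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0.0, Math.max(e.minY - qy, qy - e.maxY));
        double proximity = 1.0 - Math.min(Math.hypot(dx, dy), dMax) / dMax;
        return alpha * proximity + (1.0 - alpha) * e.maxImpact;
    }

    static List<String> topK(Entry root, double qx, double qy, double alpha, double dMax, int k) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(
            (a, b) -> Double.compare(bound(b, qx, qy, alpha, dMax), bound(a, qx, qy, alpha, dMax)));
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry e = heap.poll();
            if (e.children == null) {
                result.add(e.id);        // no remaining entry can score higher than this object
            } else {
                heap.addAll(e.children); // expand the most promising node
            }
        }
        return result;
    }
}
```

An object popped from the heap is final because every remaining entry has a bound that is not larger, which matches the termination condition stated on the slide.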
Example, processing top-1
[Figure: aR-tree for the query q. The root e0 has the entries e1 (max=1) and e2 (max=2); e1 contains p1(1) and p2(1); e2 contains p5(2), p6(2), and p7(1). Node scores use the minimum distance to q.]
• Max-heap: <e2, e1>
• Max-heap: <e1>
• Max-heap: <p5, p6, e1, p7>
• Top-1: p5, the object at the head of the heap
Multiple-keyword queries
• Requires aggregating the partial scores of the objects for each term t of the query keywords
• Similar to Fagin’s algorithm (NRA)
  • Different bounds
[Equations: partial score for a term t, and the score aggregated over the query terms]
Multiple-keyword algorithm
• For each term t in q, access the objects p in S2I in decreasing order of partial score
  • The objects are retrieved from a tree or block
• Update the lower-bound score of p
  • Sum of the known partial scores plus the lowest possible partial score (using the spatial distance)
• Update the upper-bound score of the visited objects
• Return the objects whose lower-bound score cannot be exceeded by the remaining objects
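The bound updates can be sketched in Java as below. This is a hedged illustration in the style of NRA, not the bounds used by the authors. It assumes the partial score of p for one term is α·δ/n + (1 − α)·w, where n is the number of query terms and w is the term weight, so that an unseen term contributes at least the spatial share α·δ/n. For the upper bound it assumes an unseen term contributes at most the partial score last retrieved from that term's list, called `frontier` here. All names are hypothetical.

```java
final class PartialScoreBounds {
    // known[i] is the partial score of p for query term i, or NaN if p has not been seen for term i.

    // Lower bound: known partial scores plus the spatial share for every unseen term.
    static double lowerBound(double[] known, double alpha, double delta) {
        double spatialShare = alpha * delta / known.length;
        double sum = 0.0;
        for (double s : known) {
            sum += Double.isNaN(s) ? spatialShare : s;
        }
        return sum;
    }

    // Upper bound: known partial scores plus the current frontier of every unseen term.
    // frontier[i] is the partial score of the last object retrieved for term i.
    static double upperBound(double[] known, double[] frontier) {
        double sum = 0.0;
        for (int i = 0; i < known.length; i++) {
            sum += Double.isNaN(known[i]) ? frontier[i] : known[i];
        }
        return sum;
    }
}
```

Under these assumptions an object can be reported once its lower bound is at least the upper bound of every other candidate and of any object not yet retrieved.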
Experimental evaluation
• We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]
• Both approaches are implemented in Java
• Measures: response time, I/O, update time, and index size
• Size of tree nodes and blocks: 4KB
[1] Cong, G., Jensen, C.S., Wu, D.: “Efficient retrieval of the top-k most relevant spatial web objects”, VLDB, 2009.
Datasets
Variables studied
• Number of results (k)
  • 10, 20, 30, 40, 50
• Number of query keywords
  • 1, 2, 3, 4, 5
• Query preference parameter (α)
  • 0.1, 0.3, 0.5, 0.7, 0.9
• Scalability (Twitter dataset)
  • 1M, 2M, 3M, 4M
Number of results (k)
• The response time of S2I is one order of magnitude lower because it performs fewer disk accesses
• The DIR-tree reads several nodes before finding the top-k because of the high dimensionality of text
Number of query keywords
• S2I is one order of magnitude better in I/O and response time
Insertion time and index size
• S2I requires neither updating inverted files (and vectors) nor computing document similarity
• S2I requires more space
Conclusions
• Top-k spatial keyword queries are intuitive and have several applications
• We propose a new index
  • Terms with different frequencies are stored differently
• We propose algorithms for single- and multiple-keyword queries
• The efficiency of our approach is verified through experiments on synthetic and real datasets
Thanks!
More information…
João B. Rocha-Junior
joao@idi.ntnu.no
http://www.idi.ntnu.no/~joao
Scalability
• The improvement of S2I over the DIR-tree increases with the cardinality of the dataset
Different datasets
• The advantage of S2I over the DIR-tree is higher for datasets with few terms per document
Terms removal
• Terms with length = 1
• Terms that have no letter character
  • Per-character check: !Character.isLetter(token.charAt(i))
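The two removal rules can be sketched in Java as a single predicate. This is a minimal sketch built around the `Character.isLetter(token.charAt(i))` check quoted on the slide. The class name and the `keep` method are hypothetical, and treating the empty string like a length-1 term is an assumption.

```java
final class TermFilter {
    // Returns true if the token should be kept as an index term.
    static boolean keep(String token) {
        // Rule 1: drop terms of length 1 (and, by assumption, empty tokens).
        if (token.length() <= 1) {
            return false;
        }
        // Rule 2: drop terms that contain no letter character at all.
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

For example, under these rules a token such as "pub" is kept, while "a" and "1234" are removed.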