Efficient Processing of Top-k Spatial Keyword Queries


Presentation Transcript


  1. Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg SSTD 2011 - Minneapolis, Minnesota, USA

  2. Outline
     • Top-k spatial keyword queries
     • Current approaches
     • Spatial inverted index
     • Single-keyword queries
     • Multiple-keyword queries
     • Experimental evaluation
     • Conclusion

  3. Motivation
     • More and more documents on the Internet are being associated with a spatial location
     • Ex: tweets, images (Flickr), Wikipedia sites, OpenStreetMap objects, …
     • Most of these geotagged objects are associated with a text (description)

  4. Top-k spatial keyword queries
     • Query
       • Spatial location
       • Query keywords
     • Returns the k best spatio-textual objects ranked in terms of both
       • Spatial distance to the query location
       • Textual relevance to the query keywords
     • Example: Italian food

  5. Another example…
     • Query
       • Spatial location
       • Query keywords
     • Returns the k best spatio-textual objects ranked in terms of both
       • Spatial distance to the query location
       • Textual relevance to the query keywords
     • Figure: the query location q, the objects, and the distance from q to an object

  6. Ranking objects
     • Score: combines the spatial proximity and the textual relevance of an object p to a query q
     • The spatial proximity (δ) is the normalized Euclidean distance between p and q
     • The textual relevance (θ) is the cosine similarity between the description of p and the query keywords
     • The query preference parameter (α) defines the importance of one measure over the other
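The three quantities on this slide can be sketched in Java as follows. The transcript does not preserve the score formula, so the weighted sum `alpha * proximity + (1 - alpha) * relevance` is an assumption, as is turning the normalized distance into a proximity with `1 - distance / maxDist` so that closer objects score higher. The class name, method names, and `maxDist` are hypothetical.

```java
import java.util.Map;

// Sketch only: the weighted sum and the 1 - d/maxDist normalization are assumptions.
final class SpatialKeywordScore {
    private SpatialKeywordScore() {}

    /** Spatial proximity in [0, 1]: 1 at the query location, 0 at maxDist or farther. */
    static double proximity(double px, double py, double qx, double qy, double maxDist) {
        double distance = Math.hypot(px - qx, py - qy);
        return 1.0 - Math.min(distance / maxDist, 1.0);
    }

    /** Cosine similarity between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> doc, Map<String, Double> query) {
        double dot = 0.0, docNorm = 0.0, queryNorm = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            dot += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
            queryNorm += e.getValue() * e.getValue();
        }
        for (double w : doc.values()) {
            docNorm += w * w;
        }
        return (docNorm == 0.0 || queryNorm == 0.0) ? 0.0 : dot / Math.sqrt(docNorm * queryNorm);
    }

    /** Combined score; alpha in [0, 1] weights proximity against textual relevance. */
    static double score(double alpha, double proximity, double relevance) {
        return alpha * proximity + (1.0 - alpha) * relevance;
    }
}
```

With α = 0.5, an object at half the maximum distance whose description matches the query exactly would score 0.5 · 0.5 + 0.5 · 1.0 = 0.75 under these assumptions.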

  7. Current approaches
     • Employ a modified R-tree [1,2]
     • Each node keeps an abstract document representing all documents in the node sub-tree
     • Abstract document
       • Pairs (term, weight), one pair per term
       • The weight permits computing an upper-bound score for the objects in the node sub-tree
     [1] Cong, G., Jensen, C.S., Wu, D.: "Efficient retrieval of the top-k most relevant spatial web objects", VLDB, 2009.
     [2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: "IR-tree: an efficient index for geographic document search", TKDE, 2010.
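A minimal Java sketch of how an abstract document could yield an upper bound. It assumes the weight kept per term is the maximum weight of that term in the sub-tree (the slide only says the weight permits an upper bound), and that the textual score is a plain sum of term weights, in the spirit of the frequency simplification used in the example. `AbstractDocument`, `merge`, and `upperBound` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: "max weight per term" and "sum of weights" are assumptions.
final class AbstractDocument {
    private AbstractDocument() {}

    /** Builds a node's abstract document: one (term, weight) pair per term, keeping the largest weight. */
    static Map<String, Integer> merge(List<Map<String, Integer>> children) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> child : children) {
            for (Map.Entry<String, Integer> e : child.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return merged;
    }

    /** No object below the node can score higher than this for the given query terms. */
    static int upperBound(Map<String, Integer> abstractDoc, List<String> queryTerms) {
        int bound = 0;
        for (String term : queryTerms) {
            bound += abstractDoc.getOrDefault(term, 0);
        }
        return bound;
    }
}
```

A node whose bound for the query terms is 0 contains no relevant object and can be skipped without reading its children.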

  8. Example
     • Figure: a query q with the keywords "pub rock" over an R-tree whose root has the entries e1, e2, and e3, with the objects p1 to p7 below them; each node carries an abstract document of (term, weight) pairs over the terms bar, pop, pub, rock, and samba
     • For simplicity, we assume that the impact of a term is defined by its frequency

  9. Current approaches
     • There are several variations
       • Incorporating document similarity
       • Clustering the nodes
     • Main problems
       • Frequent and infrequent terms are stored in the same way (have the same cost)
       • Accesses several nodes due to text dimensionality
       • Complex management of inverted files and/or vectors, one per node

  10. Spatial inverted index (S2I)
     • Similarly to an inverted index, S2I maps terms to the objects that contain the term
     • The most frequent terms are stored in aggregated R-trees (aR-trees)
     • The less frequent terms are stored in blocks in a file
     • The aR-tree permits accessing the objects in decreasing order of term relevance
     • The blocks permit storing the less frequent terms efficiently
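The term-to-storage mapping can be illustrated with a small in-memory Java sketch. It is not the S2I implementation: plain lists stand in for both the file blocks and the aR-trees, and the rule "a term moves to a tree once it has more postings than fit in a block" with the `blockCapacity` parameter is an assumption about how frequent and infrequent terms are told apart. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: names and the size threshold are hypothetical, and in-memory
// lists stand in for the on-disk blocks and aR-trees.
final class S2ISketch {
    enum Storage { NONE, BLOCK, TREE }

    /** One occurrence of a term: the object, its location, and the term impact. */
    static final class Posting {
        final String objectId;
        final double x, y, impact;

        Posting(String objectId, double x, double y, double impact) {
            this.objectId = objectId;
            this.x = x;
            this.y = y;
            this.impact = impact;
        }
    }

    private final int blockCapacity;
    private final Map<String, List<Posting>> postings = new HashMap<>();

    S2ISketch(int blockCapacity) {
        this.blockCapacity = blockCapacity;
    }

    void add(String term, Posting posting) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(posting);
    }

    /** Infrequent terms stay in a block; frequent terms would be kept in an aR-tree. */
    Storage storageOf(String term) {
        int count = lookup(term).size();
        if (count == 0) {
            return Storage.NONE;
        }
        return count <= blockCapacity ? Storage.BLOCK : Storage.TREE;
    }

    List<Posting> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```

The point of the split is that each term is stored once, at a cost that matches its frequency, instead of being repeated in a per-node inverted file.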

  11. Distribution of terms
     • Plot: frequency versus terms
     • The distribution of terms is very skewed
     • A few hundred terms take up 50% of the text

  12. Example

  13. Aggregated R-tree (max) for frequent terms (e.g., pub)
     • Only relevant objects are evaluated
     • The objects are accessed in decreasing order of score
     • Figure: root e0 with the entries e1 (max=1) and e2 (max=2); objects with their term impact: p1(1), p2(1), p5(2), p6(2), p7(1); query location q

  14. Single-keyword queries
     • Only a single block or tree is accessed
     • Block
       • All the objects are read and the k best are reported
     • Tree
       • The nodes are accessed in decreasing order of score
       • The algorithm terminates when the score of the k-th object is higher than the score of any unvisited node
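The tree case can be sketched in Java as a best-first traversal with a max-heap. This is an illustration under assumptions, not the authors' code: the score is taken to be `alpha * proximity + (1 - alpha) * impact` with impacts already normalized to [0, 1], a node's upper bound uses the minimum distance from q to its bounding rectangle together with its max impact, and every inner node is assumed to have at least one child. `BestFirstTopK`, `Entry`, and `maxDist` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch only: the scoring formula and all names are assumptions.
final class BestFirstTopK {
    /** Either an object (a point, no children) or an inner node (bounding rectangle plus max impact). */
    static final class Entry {
        final String id;
        final double minX, minY, maxX, maxY, maxImpact;
        final List<Entry> children;

        Entry(String id, double x, double y, double impact) { // object
            this.id = id;
            this.minX = x;
            this.maxX = x;
            this.minY = y;
            this.maxY = y;
            this.maxImpact = impact;
            this.children = new ArrayList<>();
        }

        Entry(String id, Entry... kids) { // inner node, aggregates its children
            double loX = Double.POSITIVE_INFINITY, loY = Double.POSITIVE_INFINITY;
            double hiX = Double.NEGATIVE_INFINITY, hiY = Double.NEGATIVE_INFINITY, best = 0.0;
            for (Entry c : kids) {
                loX = Math.min(loX, c.minX);
                loY = Math.min(loY, c.minY);
                hiX = Math.max(hiX, c.maxX);
                hiY = Math.max(hiY, c.maxY);
                best = Math.max(best, c.maxImpact);
            }
            this.id = id;
            this.minX = loX;
            this.minY = loY;
            this.maxX = hiX;
            this.maxY = hiY;
            this.maxImpact = best;
            this.children = Arrays.asList(kids);
        }

        boolean isObject() {
            return children.isEmpty();
        }
    }

    /** Exact score for an object, upper bound for a node (minimum distance, maximum impact). */
    static double bound(Entry e, double qx, double qy, double alpha, double maxDist) {
        double dx = Math.max(0.0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0.0, Math.max(e.minY - qy, qy - e.maxY));
        double proximity = 1.0 - Math.min(Math.hypot(dx, dy) / maxDist, 1.0);
        return alpha * proximity + (1.0 - alpha) * e.maxImpact;
    }

    static List<String> topK(Entry root, int k, double qx, double qy, double alpha, double maxDist) {
        // Max-heap on the bound; the bound is recomputed on each comparison to keep the sketch short.
        PriorityQueue<Entry> heap = new PriorityQueue<>((a, b) ->
                Double.compare(bound(b, qx, qy, alpha, maxDist), bound(a, qx, qy, alpha, maxDist)));
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry e = heap.poll();
            if (e.isObject()) {
                result.add(e.id); // no unvisited entry can score higher than this object
            } else {
                heap.addAll(e.children);
            }
        }
        return result;
    }
}
```

An object that reaches the top of the heap has a score at least as high as the bound of every unvisited node, so it can be reported immediately; the loop ends once k objects have been reported.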

  15. Example, processing top-1
     • Figure: root e0 with the entries e1 (max=1) and e2 (max=2); objects p1(1), p2(1), p5(2), p6(2), p7(1); the minimum distance from q to an entry; the top-1 result
     • Max-heap states shown: <e2, e1>, <e1>, <p5, p6, e1, p7>

  16. Multiple-keyword queries
     • Requires aggregating the partial scores of the objects for each term t of the query keywords
     • Similar to Fagin's algorithm (NRA)
       • Different bounds
     • Formulas on the slide: partial score, score

  17. Multiple-keyword algorithm
     • For each term t in q, access the objects p in S2I in decreasing order of partial score
       • The objects are retrieved from a tree or block
     • Update the lower-bound score of p
       • Sum of the partial scores known plus the lowest possible partial score (using the spatial distance)
     • Update the upper-bound score of the visited objects
     • Return the objects whose lower-bound score cannot be exceeded by the remaining objects
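The steps above follow the general NRA pattern, which can be sketched in Java as below. It is a simplified illustration: each term's objects are given as an in-memory list already sorted by decreasing partial score, and an unknown partial score is bounded below by 0, which is looser than the distance-based lower bound the slide describes. The upper bound for an unknown term is the last partial score read from that term's list. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: names are hypothetical and an unknown partial score is bounded
// below by 0 instead of the distance-based bound mentioned on the slide.
final class NraSketch {
    static final class Partial {
        final String id;
        final double score;

        Partial(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /** lists.get(t) holds the objects for query term t in decreasing order of partial score. */
    static List<String> topK(List<List<Partial>> lists, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        int m = lists.size();
        Map<String, double[]> seen = new HashMap<>(); // NaN marks a partial score not read yet
        double[] frontier = new double[m];            // last partial score read from each list
        for (int depth = 0; ; depth++) {
            boolean exhausted = true;
            for (int t = 0; t < m; t++) {
                List<Partial> list = lists.get(t);
                if (depth < list.size()) {
                    Partial p = list.get(depth);
                    seen.computeIfAbsent(p.id, id -> unknown(m))[t] = p.score;
                    frontier[t] = p.score;
                    exhausted = false;
                } else {
                    frontier[t] = 0.0; // list consumed: objects not seen in it score 0 for this term
                }
            }
            List<String> ranked = rankByLowerBound(seen);
            if (exhausted || canStop(ranked, seen, frontier, k)) {
                return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
            }
        }
    }

    private static double[] unknown(int m) {
        double[] scores = new double[m];
        Arrays.fill(scores, Double.NaN);
        return scores;
    }

    /** Sum of the known partial scores, with fallback[t] substituted for each unknown one. */
    private static double bound(double[] scores, double[] fallback) {
        double sum = 0.0;
        for (int t = 0; t < scores.length; t++) {
            sum += Double.isNaN(scores[t]) ? fallback[t] : scores[t];
        }
        return sum;
    }

    private static double lower(double[] scores) {
        return bound(scores, new double[scores.length]);
    }

    private static List<String> rankByLowerBound(Map<String, double[]> seen) {
        List<String> ids = new ArrayList<>(seen.keySet());
        ids.sort((a, b) -> {
            int c = Double.compare(lower(seen.get(b)), lower(seen.get(a)));
            return c != 0 ? c : a.compareTo(b);
        });
        return ids;
    }

    /** True when neither an unseen object nor a seen one outside the top k can exceed the k-th lower bound. */
    private static boolean canStop(List<String> ranked, Map<String, double[]> seen,
                                   double[] frontier, int k) {
        if (ranked.size() < k) {
            return false;
        }
        double kth = lower(seen.get(ranked.get(k - 1)));
        if (bound(unknown(frontier.length), frontier) > kth) {
            return false;
        }
        for (String id : ranked.subList(k, ranked.size())) {
            if (bound(seen.get(id), frontier) > kth) {
                return false;
            }
        }
        return true;
    }
}
```

As with NRA in general, stopping early guarantees the right set of k objects, but the order inside that set reflects lower bounds and may differ from the order by final score.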

  18. Experimental evaluation
     • We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]
     • Both approaches are implemented in Java
     • Measures: response time, I/O, update time, and index size
     • Size of tree nodes and blocks: 4KB
     [1] Cong, G., Jensen, C.S., Wu, D.: "Efficient retrieval of the top-k most relevant spatial web objects", VLDB, 2009.

  19. Datasets

  20. Variables studied
     • Number of results
       • 10, 20, 30, 40, 50
     • Number of query keywords
       • 1, 2, 3, 4, and 5
     • Query preference rate (α)
       • 0.1, 0.3, 0.5, 0.7, 0.9
     • Scalability (Twitter dataset)
       • 1M, 2M, 3M, 4M

  21. Number of results (k)
     • The response time of S2I is one order of magnitude better due to fewer disk accesses
     • DIR-tree reads several nodes before finding the top-k due to text dimensionality

  22. Number of query keywords
     • S2I is one order of magnitude better in I/O and response time

  23. Insertion time and index size
     • S2I does not require updating inverted files (and vectors) or computing document similarity
     • S2I requires more space

  24. Conclusions
     • Top-k spatial keyword queries are intuitive and have several applications
     • We propose a new index
       • Terms with different frequencies are stored differently
     • We propose algorithms for single- and multiple-keyword queries
     • The efficiency of our approach is verified through experiments on synthetic and real datasets

  25. Thanks!
     • More information…
     • João B. Rocha-Junior
     • joao@idi.ntnu.no
     • http://www.idi.ntnu.no/~joao

  26. Scalability
     • The improvement of S2I over DIR-tree increases with the cardinality of the datasets

  27. Different datasets
     • The advantage of S2I over DIR-tree is higher for datasets with few terms per document

  28. Terms removal
     • Terms with length=1
     • Terms that have no letter character
       • !Character.isLetter(token.charAt(i)) for every position i
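The two removal rules can be written as a small Java filter. This is a sketch around the `Character.isLetter(token.charAt(i))` test quoted on the slide; the class and method names are hypothetical, and treating the empty string like a one-character term is an assumption.

```java
// Sketch only: TermFilter and keep are hypothetical names.
final class TermFilter {
    private TermFilter() {}

    /** Returns true if the token should be indexed as a term. */
    static boolean keep(String token) {
        if (token.length() <= 1) {
            return false; // terms with length = 1 are removed (empty tokens too, by assumption)
        }
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return true; // at least one letter character
            }
        }
        return false; // no letter character at all, e.g. "2011" or "!!"
    }
}
```

Under these rules "pub" and "a1" are kept, while "a", "2011", and "!!" are dropped.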
