A NEW TOPIC: QUERIES WITH GEO-INFORMATION

WEB&MOBILE GROUP
Zheng Huo

SIX RELATED TOPICS
• Spatial pattern mining (Xiangmei Hu)
  • Mining Interesting Locations and Travel Sequences from GPS Trajectories [WWW09]
  • WhereNext: a Location Predictor on Trajectory Pattern Mining [SIGKDD09]
  • Migration Motif: A Spatial-Temporal Pattern Mining Approach for Financial Markets [SIGKDD09]
• Social network (Ruxia Ma)
• Opinion (Jing Zhao)
  • Rated Aspect Summarization of Short Comments [WWW09]
  • How Opinions are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes [WWW09]
  • OpinionMiner: A Machine Learning System for Web Opinion Mining and Extraction [SIGKDD09]
• Geo + query intention (Zheng Huo)
  • Discovering Users’ Specific Geo Intention in Web Search [WWW09]
  • A Probabilistic Topic-Based Ranking Framework for Location-Sensitive Domain Information Retrieval [SIGIR09]
  • Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects [VLDB09]
  • Keyword Search in Spatial Databases: Towards Searching by Document [ICDE09]
• Geographic + image
• kNN applications

OUTLINE
• Background
• Overview
• Methods
  • SDIR
  • GIU method
  • Top-k
  • Others
• Conclusions & future work

BACKGROUND
• Many web queries contain geo info
  • About 30% of queries may have geo intent; about half of those carry explicit geo info.
  • Example queries: “Italian restaurant”, “Car dealer”, “L.A. hotel”
  • About 13% of queries contain a place name
    • 84% of them have explicit city info.
    • 2.6% have state info.
    • 13.4% have country info.
• Geo info can be used in many fields, such as
  • Recommendation systems
  • Improving users’ search experience
  • Advertisement matching

BACKGROUND(cont’)
[Diagram: Q(location, terms) → score of “textual relevance” + score of “spatial relevance” → hybrid score → ranking]
• Why can’t traditional methods solve this problem well?
  1. Spatial relevance is computed from Euclidean distance, which does not suit every case.
  2. The two scores are combined with a linear function, which is not the best way to combine them.
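
The traditional pipeline criticized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `max_dist` normalization, and the weight `alpha` are hypothetical and do not come from any of the cited papers.

```python
import math


def spatial_relevance(query_loc, doc_loc, max_dist):
    """Map Euclidean distance to a score in [0, 1]; closer means higher.

    The linear decay and the max_dist cutoff are assumptions of this sketch.
    """
    dist = math.dist(query_loc, doc_loc)
    return max(0.0, 1.0 - dist / max_dist)


def hybrid_score(text_score, spatial_score, alpha=0.5):
    """Linear combination of textual and spatial relevance.

    alpha is a hypothetical mixing weight chosen by the system designer.
    """
    return alpha * text_score + (1.0 - alpha) * spatial_score
```

Both weaknesses named on the slide are visible here: distance is the only spatial signal, and one fixed `alpha` has to fit every query.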

OVERVIEW
[Diagram: taxonomy of queries with geo-information and the methods covered]
• Explicit geo-information: queries like “Beijing Hotels”, “Paris toggery”
• Implicit geo-information
  • Local info: queries like “Italian Restaurant”, “Dentist”
  • Neighbor info: queries like “Car dealer”, “Real estate”
  • Specific region and others: queries like “State Maps”, “Hotels”
• Methods shown: SDIR method, GIU methods, top-k query, spatial query

A TOPIC-BASED METHOD: SDIR
• An example
  • A piece of news: an NBA match review of the game between the L.A. Lakers and the Rockets (from Houston), in which some other teams, such as the Boston Celtics, are mentioned briefly.
  • Queries sent to a search engine or IR system over web pages & documents:
    • q1: “Los Angeles basketball game”
    • q2: “Houston basketball game”
    • q3: “Boston basketball game”
A Probabilistic Topic-based Ranking Framework for Location-sensitive Domain Information Retrieval, SIGIR’09

SDIR(cont’)
• Problem definition
  • DEFINITION 1. A spatial query is expressed as q = (qS, qT), in which qS represents the geographical condition implied by q and qT represents the search terms, excluding location names.
  • DEFINITION 2. When evaluated against spatial queries, a document can be viewed as d = (dS, dT), in which dS is the list of location names found in d and dT represents the document text.
• We can define the ranking function as:
  F(q, d) = F(qT, qS, dT, dS)
• Assuming that spatial relevance and textual relevance are independent, we can write it as:
  F(q, d) = FT(qT, dT) ⊕ FS(qS, dS)

SDIR(cont’)
• Framework of SDIR (Spatial-related Domain Information Retrieval)
  • Topic: a generalized abstraction of document contents. Here, each NBA team is a topic.
  • Topic layer: sits between the query layer and the document layer and consists of topics.
  • Topic center: the location a topic is about. For the Rockets, Houston is the topic center.
  • Q-T relevance ϕ(q, t): evaluates the relevance between a query and a topic.
  • D-T relevance ψ(d, t): evaluates the relevance between a document and a topic.

SDIR(cont’)
• Some formulas
  F(q, d) = FT(qT, dT) ⊕ FS(qS, dS)
  F(q, d) = ∑j ϕ(q, tj) ψ(d, tj) ωtj(q, d)
• Callouts on the formula:
  1. It works directly between the query and the document.
  2. Popular IR metrics can be used here, such as tf-idf and the cosine function.
  3. Here, the authors use an extended version of the tf-idf method.
• With ϕ(q, t) = p(t|q) and ψ(d, t) = p(d|t):
  F(q, d) = ∑j p(d|tj) p(tj|q) ωtj(q, d)
• Applying Bayes’ theorem:
  F(q, d) ∝ ∑j p(tj|qS) p(tj|qT) p(tj|dS) p(tj|dT) ωtj(q, d) / p(tj)
  • Some of these terms are obtained from the topic model; others can be obtained directly from the training set.
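
A minimal Python sketch of the topic-layer sum F(q, d) = ∑j p(d|tj) p(tj|q) ωtj(q, d) may make the formula easier to read. The function name, the dictionary layout, and every probability below are made up for illustration; in the paper these values come from the trained topic model and the training set.

```python
def sdir_score(p_topic_given_query, p_doc_given_topic, omega):
    """F(q, d) = sum over topics t_j of p(d | t_j) * p(t_j | q) * omega_tj(q, d)."""
    return sum(
        p_doc_given_topic[t] * p_topic_given_query[t] * omega[t]
        for t in p_topic_given_query
    )


# Hypothetical numbers for the query "Houston basketball game" and the
# Lakers vs. Rockets match review; they are not taken from the paper.
p_topic_given_query = {"Rockets": 0.7, "Lakers": 0.2, "Celtics": 0.1}
p_doc_given_topic = {"Rockets": 0.5, "Lakers": 0.4, "Celtics": 0.1}
omega = {"Rockets": 1.0, "Lakers": 1.0, "Celtics": 1.0}

score = sdir_score(p_topic_given_query, p_doc_given_topic, omega)
```

With these toy values the Rockets topic contributes most of the score, which matches the intuition that a Houston query should favor the Rockets-related part of the review.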

SDIR(cont’)
• The method is domain-based: the authors trained a model for the domain “NBA basketball games”. This domain is location-related because most fans are interested in local teams.
• How to learn the topic model?
  1. Determine which domain to focus on.
  2. Data collection
    • Topic documents: crawl data from well-supported web sites, including the NBA official site, ESPN, and Yahoo! Sports
    • Fans: at least 10,000 geo-records for each team
  3. Find a suitable distribution model: use a Gaussian process (GP) classifier
    • It returns probabilistic results for class labels, which matches the ranking purpose well.
    • GP is non-parametric and does not place prior assumptions.
    • GP is a kernel machine, which is highly flexible and configurable.

SDIR(cont’)
• Procedure overview
  • LTS (Geographical Influence Lookup Table): divides the entire geo-area into small grids of the same size.
  • LTT (Term-Topic Lookup Table): for example, given m topics.
[Diagram: the query (q) is split into qS and qT; LTS and LTT feed ϕ(q, tj); the document (d) goes through an inverted index to give ψ(d, tj); these are combined with ωtj(q, d) into F(q, d).]

SDIR(cont’)
• Implementation
  • Data set: take the NBA topic as an example
  • Training set: documents crawled from ESPN/NBA team pages are labeled with the corresponding teams. At least 10,000 records for each team.
  • Geo-grid: cut the entire US main territory into smaller square grids, each of which is 0.2° × 0.2°
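
The geo-grid lookup can be sketched as follows. Only the 0.2° cell size comes from the slide; the grid origin (24.0° N, 125.0° W), the function names, and the table contents are hypothetical values chosen for this illustration.

```python
import math

CELL_DEG = 0.2  # cell size from the slide: 0.2 x 0.2 degrees


def grid_cell(lat, lon, origin_lat=24.0, origin_lon=-125.0):
    """Return the (row, col) of the cell containing a point.

    The origin is an assumed south-west corner of the covered area.
    Points lying exactly on a cell edge are subject to floating-point rounding.
    """
    row = math.floor((lat - origin_lat) / CELL_DEG)
    col = math.floor((lon - origin_lon) / CELL_DEG)
    return row, col


# Made-up lookup table: one cell around Houston with a toy topic distribution.
lookup_table = {grid_cell(29.76, -95.37): {"Rockets": 0.9, "Lakers": 0.1}}


def topic_distribution(lat, lon):
    """Look up the per-topic probabilities stored for the cell of a location."""
    return lookup_table.get(grid_cell(lat, lon), {})
```

Precomputing one distribution per cell turns the spatial part of the ranking into a dictionary lookup at query time, which is presumably why a lookup table is used instead of evaluating the model per query.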

SDIR(cont’)
[Figure: learned geographic distributions. 2-team distribution: Celtics (+1) vs. Bulls (-1). 5-team distribution: Celtics, Bulls, Rockets, Lakers, Suns.]

SDIR(cont’)
• Location: simulate a user at 4 locations
• Query: “MVP” (implicit geo-info)
• Euclidean distance is not suitable here: people from Pittsburgh prefer Boston to Cleveland, although Cleveland is much nearer.

SDIR(cont’)
• Pros and cons
  • High ranking quality on queries with geo-information.
  • Suitable for both explicit and implicit geo queries.
  • BUT it is domain-based: each topic model must be trained separately.
  • Each topic must have exactly one center; the method can’t deal with multiple centers in one topic.

GIU METHOD
• Overview of the system
Discovering Users’ Specific Geo Intention in Web Search, WWW’09

GIU METHOD(cont’)
• Classifier I: detect implicit geo intent
  • Use the WOE tool
  • For each city Ck, build a bigram language model
  • Q = w1 · · · wn, where the wi are the strings that compose the query
  • The probability of each word is conditioned on the identity of the previous word

GIU METHOD(cont’)
• City language model
  • Calculate the posterior probability of a city given the query. The prior over cities is a uniform distribution, and the query likelihood is obtained from the previous formula.
  • Attention! Once the city language models are built, “a city” means a city in the city language model, not the geographic one. A high probability means the query is related to this city; it does not mean the query was issued from that city.
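
A toy version of the city language model is sketched below: one bigram model per city, and a posterior over cities under a uniform prior. The class and function names, the add-one smoothing, and the training queries are assumptions of this sketch; the slides do not say how the paper smooths its models.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model over the queries associated with one city."""

    def __init__(self, queries):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for q in queries:
            tokens = ["<s>"] + q.lower().split()
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def log_prob(self, query, vocab_size):
        """log P(Q | city) with add-one smoothing (an assumption of this sketch)."""
        tokens = ["<s>"] + query.lower().split()
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigram_counts[(prev, word)] + 1
            den = self.context_counts[prev] + vocab_size
            total += math.log(num / den)
        return total


def city_posterior(query, models):
    """P(city | Q) under a uniform prior, so the prior cancels out."""
    vocab = set().union(*(m.vocab for m in models.values()))
    logs = {c: m.log_prob(query, len(vocab)) for c, m in models.items()}
    peak = max(logs.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(v - peak) for c, v in logs.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# Made-up training queries with the city name already removed.
models = {
    "boston": BigramLM(["red sox tickets", "red sox schedule"]),
    "houston": BigramLM(["rockets tickets", "rockets schedule"]),
}
posterior = city_posterior("red sox tickets", models)
```

In this toy setup the query scores higher for "boston" than for "houston", reflecting relatedness to the city rather than where the query was issued.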

GIU METHOD(cont’)
• Overall data description
• Three learning tasks
  • Classifier I: detecting implicit geo queries
  • Classifier II: discriminating different localization capabilities of geo queries: local geo intent, neighbor region geo intent, etc.
  • City language models: predicting geo entities related to a query

GIU METHOD(cont’)
• Implementation
  • Use real-world web search logs from Yahoo!
  • Training subset I
    • Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries
    • All the explicit geo queries in the training set are used to generate the city language model (CLM)

GIU METHOD(cont’)
• Generating labels (DN+ and DN- are sets of domain names)
  • Step 1: get the clicked URL (domain name) for each query
  • Step 2: identify queries in DN+
  • Step 3: identify queries in DN-
  • Step 4: take the non-location parts of positive samples as the final implicit geo intent queries
• 67 DNs in DN+, 64 DNs in DN-
• Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries to train the classifiers.

GIU METHOD(cont’)
• Evaluate the classifiers

GIU METHOD(cont’)
• Evaluating Classifier II
  • Input: implicit geo queries
  • Task: discriminate LG, NRG, and RG
  • The result of the classification forms training subset II.
[Figure: evaluation results for the LG, NRG, and RG classes]

GIU METHOD(cont’)
• Training models evaluation
  • The training data is training subset II.
  • The classifiers classify the queries generated at city level.
  • The result of this step forms training subset III / testing subset III.
[Figure: results with low-dimensional features vs. all features]

GIU METHOD(cont’)
• Location-specific query discovery
  • A threshold ta is tuned with training subset III.

GIU METHOD(cont’)
• Conclusions of GIU method
  • WOE tool
  • Detect implicit geo intent using the probability of co-occurrence of a city and a query. The CLM is generated here.
  • Discriminate LG, NRG, and RG geo intention, and predict the location of the entity in Q.

GIU METHOD(cont’)
• Pros and cons
  • Works for both explicit and implicit geo queries.
  • Compared with the topic-based method, the GIU method is more flexible and useful.
  • BUT it is a query-log-based method, which is constraining.
  • The classifiers themselves are not improved, and their performance is not very good.

TOP-K
• Introduction (local geo info)
[Figure: query point Q among objects O1 to O6]
• Questions:
  • How to represent location proximity and text relevancy?
  • What kind of index combines both location proximity and text relevancy?
Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects, VLDB’09

TOP-K
• A simple example

TOP-K
• Hybrid index: an IR-tree
[Figure: objects & bounding rectangles]

TOP-K
• IR-tree algorithm
[Figure: priority-queue states during the search, front of the queue first]
  1. R7
  2. (R5, 0.05119), (R6, 0.269)
  3. (R2, 0.1048), (R1, 0.238), (R6, 0.269)
  4. (R1, 0.238), (R6, 0.269), (O3, 0.481), (O4, 0.517), (O8, 0.686)
  5. (O1, 0.238), (R6, 0.269), (O3, 0.481), (O2, 0.512), (O4, 0.517), (O8, 0.686)
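
The queue trace suggests a best-first traversal driven by a priority queue, where a smaller score is better. The Python sketch below shows that general technique only. It is not the paper's implementation: a real IR-tree derives each node's bound from its bounding rectangle and inverted file, whereas here the bounds are stored directly. The parent-child links are inferred from the trace, and R6 is left without children because they are not shown on the slide.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    name: str
    score: float  # lower bound on the score of anything below this node
    children: list = field(default_factory=list)
    is_object: bool = False


def top_k(root, k):
    """Best-first search: always expand the entry with the smallest score."""
    tie = count()  # tie-breaker so the heap never compares Node objects
    heap = [(root.score, next(tie), root)]
    results = []
    while heap and len(results) < k:
        score, _, node = heapq.heappop(heap)
        if node.is_object:
            results.append((node.name, score))
            continue
        for child in node.children:
            heapq.heappush(heap, (child.score, next(tie), child))
    return results


def obj(name, score):
    return Node(name, score, is_object=True)


# Scores are taken from the slide; the tree shape is an assumption.
r1 = Node("R1", 0.238, [obj("O1", 0.238), obj("O2", 0.512)])
r2 = Node("R2", 0.1048, [obj("O3", 0.481), obj("O4", 0.517), obj("O8", 0.686)])
r5 = Node("R5", 0.05119, [r1, r2])
r6 = Node("R6", 0.269)  # children not shown on the slide, left empty here
root = Node("R7", 0.0, [r5, r6])

best = top_k(root, 1)
```

With `k = 1` the search returns O1, which holds whatever R6 contains, because R6's bound (0.269) is already worse than O1's score (0.238). Results for larger `k` depend on the objects under R6, which this sketch does not model.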

TOP-K
• DIR-tree
  • Motivation: in the IR-tree, bounding rectangles focus only on location proximity.

TOP-K
• DIR-tree(cont’)
[Figure: IR-tree vs. DIR-tree on a top-2 query]

TOP-K
• Conclusions
  • Proposed a new indexing framework for location-aware top-k text retrieval.
  • The framework integrates the inverted file for text retrieval and the R-tree for spatial proximity querying in a novel manner.
  • BUT it only serves users searching for local geo-information.

CONCLUSIONS & FUTURE WORK
• Discovering users’ implicit geo intention has been a hot research topic in recent years.
• Some existing methods train models on large data sets, which makes them hard to adjust and to apply to other domains.
• For local geo information, the problem comes down to kNN search.
• Besides training-based methods, are there other ways to model users’ implicit geo intention?

Thanks! Q&A?
