Context-Aware Query Classification Huanhuan Cao¹, Derek Hao Hu², Dou Shen³, Daxin Jiang⁴, Jian-Tao Sun⁴, Enhong Chen¹ and Qiang Yang² ¹University of Science and Technology of China; ²Hong Kong University of Science and Technology; ³Microsoft Corporation; ⁴Microsoft Research Asia
Motivation • Understanding Web users' information needs is one of the most important problems in Web search. • Such information can help improve the quality of many Web search services, such as: • Ranking • Online advertising • Query suggestion, etc.
Challenges • The main challenges of query classification: • Lack of feature information • Ambiguity • Multiple intents • The first problem has been studied widely: • Query expansion by top search results • Leveraging a web directory • However, the second and third problems are far from being solved.
Why is context useful? • Given a query, context means the previous queries and clicked URLs in the same session. • It is assumed that: • Context is semantically related to the current query. • Context may help to label appropriate categories for the current query. • Therefore, it makes sense to exploit context to disambiguate the current query.
Overview • Problem statement • Model query context by CRF • Features of CRF • Experiment • Conclusion and future work
Problem Statement: Context • In a user search session, suppose the user has issued a series of queries q1 q2 … qT-1 and clicked some returned URLs U1 U2 … UT-1. • If the user issues a query qT at time T, we call q1 q2 … qT-1 and U1 U2 … UT-1 the query context of qT, • and we call qt (t ∈ [1, T-1]) the contextual queries of qT.
Query Context • [Figure: the query context of qT, i.e., the preceding queries and clicked URLs in the same session]
Problem Statement: QC with Context and a Taxonomy • The objective of query classification (QC) with context is to classify a user query qT into a ranked list of K categories cT1, cT2, ..., cTK among the Nc categories {c1, c2, …, cNc}, given the context of qT. • A target taxonomy Υ is a tree of categories where {c1, c2, …, cNc} are the leaf nodes of this tree.
Modeling Query Context by CRF • We model the label sequence with a linear-chain CRF: p(c1c2…ct | q) = (1/Z(q)) exp( Σt Σk λk fk(ct-1, ct, q, t) ), where q represents q1q2…qt, the fk are feature functions with weights λk, and Z(q) is the normalization factor.
Why CRF? • The two main advantages of CRF are: • 1) It can incorporate general feature functions to model the relation between observations and unobserved states; • 2) It needs no prior knowledge of the type of the conditional distribution. • Given 1), we can incorporate external web knowledge. • Given 2), we need no assumption about the form of p(c|q).
Features of CRF • When we use a CRF to model query context, one of the most important steps is choosing effective feature functions (see the sketch below). • We should consider: • Relevance between queries and category labels, to leverage local information of queries; • Relevance between adjacent labels, to leverage contextual information.
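To make this concrete, here is a minimal sketch of feeding per-query feature dictionaries into a linear-chain CRF. It assumes the sklearn-crfsuite package and hypothetical session data, feature names, and labels; the paper does not specify an implementation, so this only illustrates the general setup, not the authors' code.

```python
# Sketch (not the authors' implementation): a linear-chain CRF over the
# label sequence of a search session, trained with sklearn-crfsuite.
import sklearn_crfsuite

def query_features(session, t):
    """Feature dict for the t-th query in a session.

    `session` is a list of (query, clicked_urls) pairs; the keys below
    (term occurrence) stand in for the feature functions described in the
    slides. Label-confidence scores would be added here as real-valued
    features in the full model.
    """
    query, clicked_urls = session[t]
    feats = {"bias": 1.0}
    for term in query.lower().split():
        feats[f"term={term}"] = 1.0  # term-occurrence features
    return feats

def session_to_instances(session):
    return [query_features(session, t) for t in range(len(session))]

# Hypothetical training data: each element is one labeled session.
sessions = [
    [("cheap flights to paris", ["expedia.com"]),
     ("paris hotels", ["booking.com"])],
]
labels = [["Living/Travel & Vacation", "Living/Travel & Vacation"]]

X = [session_to_instances(s) for s in sessions]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))  # most likely category label per query in each session
```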
Relevance between queries and category labels • Term occurrence • The terms of qt are obvious features for supporting ct. • Due to the limited size of the training data, many useful terms indicating category information may not be covered. • General label confidence • Leverage an external web directory such as Google Directory: the confidence is computed as Mct,qt / M, • where M means the number of returned results and Mct,qt means the number of returned results with label ct after mapping.
Relevance between queries and category labels • Click-aware label confidence • Combines the click information with the knowledge of an external web directory. • CConf(ct, ut) can be calculated by multiple approaches. • Here, we use the VSM to calculate the cosine similarity between the term vectors of ct and ut (see the sketch below).
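A small sketch of the two confidence features follows. It assumes we already have (a) the mapped directory labels of the top M search results for qt and (b) bag-of-words text for a category and for a clicked page; the function and variable names are hypothetical, and scikit-learn's vectorizer is used only for illustration.

```python
# Sketch of the general and click-aware label-confidence features.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def general_label_confidence(result_labels, ct):
    """Fraction of the top-M directory results whose mapped label equals ct
    (i.e., Mct,qt / M)."""
    if not result_labels:
        return 0.0
    return Counter(result_labels)[ct] / len(result_labels)

def click_aware_label_confidence(category_text, clicked_page_text):
    """CConf(ct, ut): cosine similarity between the VSM term vectors of the
    category description and the clicked page's text."""
    vec = CountVectorizer()
    m = vec.fit_transform([category_text, clicked_page_text])
    return float(cosine_similarity(m[0], m[1])[0, 0])

# Toy usage with hypothetical labels and text:
print(general_label_confidence(
    ["Living/Travel", "Living/Travel", "Shopping"], "Living/Travel"))  # ≈ 0.67
print(click_aware_label_confidence(
    "travel vacation flights hotels", "book cheap flights and hotels online"))
```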
Relevance between Adjacent Labels • Direct relevance between adjacent labels • Occurrence of the adjacent label pair <ct-1, ct> • The weight implies how likely the two labels are to co-occur. • Taxonomy-based relevance between adjacent labels • Limited by the sampling approach and the size of the training data, some reasonable adjacent label pairs may be under-represented or may not occur at all. • We therefore also model indirect relevance between adjacent labels through the taxonomy (see the sketch below).
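The exact taxonomy-based score is not spelled out on this slide; the sketch below shows one simple way to derive a relevance value for an adjacent label pair from the category tree, using the length of the shared path prefix. This is an assumption for illustration, not necessarily the measure used in the paper.

```python
# Sketch: taxonomy-based relevance between two category labels, assuming
# each label is written as a path in the tree, e.g. "Level1/Level2".
def taxonomy_relevance(label_a: str, label_b: str) -> float:
    """Score in [0, 1]: fraction of tree levels the two label paths share."""
    path_a, path_b = label_a.split("/"), label_b.split("/")
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(path_a), len(path_b))

print(taxonomy_relevance("Living/Travel & Vacation", "Living/Car & Garage"))  # 0.5
print(taxonomy_relevance("Living/Travel & Vacation", "Computers/Internet"))   # 0.0
```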
Experiment • Data set: • 10,000 randomly selected sessions from one day's search log of a commercial search engine. • Three labelers first label all possible categories in the KDDCUP'05 taxonomy for each unique query of the training data.
Examples of multiple-category queries • A large ratio of multiple-category queries implies the difficulty of QC without context.
Label Sessions • Then the three human labelers are asked to cross-label each session of the data set with a sequence of level-2 category labels. • For each query, a labeler gives the most appropriate category label by considering: • The query itself; • The query context; • The clicked URLs of the query.
Tested Approaches • Baselines: • Non-context-aware baseline: the bridging classifier (BC) proposed by Shen et al. • Naïve context-aware baseline: the collaborating classifier (CC), which concatenates a test query with the previous query and classifies the combination with BC. • CRFs: • CRF-B: CRF with basic features (term occurrence, general label confidence, and direct relevance between adjacent labels) • CRF-B-C: CRF with basic features + click-aware label confidence • CRF-B-C-T: CRF with basic features + click-aware label confidence + taxonomy-based relevance
Evaluation Metrics • Given a test session q1 q2 … qT, we let qT be the test query and let the queries q1 q2 … qT-1 and the corresponding clicked URL sets U1 U2 … UT-1 be the query context. • For qT, we evaluate a tested approach by (see the sketch below): • Precision (P): δ(cT ∈ CT,K) / K • Recall (R): δ(cT ∈ CT,K) • F1 score (F1): 2*P*R / (P+R), • where cT means the ground-truth label and CT,K means the set of the top K predicted labels; δ(*) is a Boolean function indicating whether * is true (=1) or false (=0).
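For clarity, the metrics for a single test query can be computed as in the sketch below (a hypothetical helper; K is taken from the length of the predicted list).

```python
# Sketch: precision, recall, and F1 for one test query, given the
# ground-truth label c_T and the top-K predicted labels C_{T,K}.
def precision_recall_f1(ground_truth, top_k_labels):
    k = len(top_k_labels)
    hit = 1.0 if ground_truth in top_k_labels else 0.0
    precision = hit / k if k else 0.0
    recall = hit
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1("Living/Travel", ["Living/Travel", "Shopping/Stores"]))
# -> (0.5, 1.0, 0.666...)
```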
Overall results • 1) The naïve context-aware baseline consistently outperforms the non-context-aware baseline. • 2) The CRFs consistently outperform both baselines. • 3) CRF-B-C-T > CRF-B-C > CRF-B: click information and taxonomy-based relevance are both useful.
Case study • The context is about travel and the user has clicked a travel-guide web page; our approach gives the most appropriate label in the first position.
Efficiency of Our Approach • Offline training: • Each iteration takes about 300 ms. • The time cost of training a CRF is acceptable. • Online cost: • Calculating features • Label confidence
Conclusion and Future Work • In this paper, we propose a novel approach to query classification that models query context with CRFs. • Experiments on a real search log clearly show that our approach outperforms both a non-context-aware baseline and a naïve context-aware baseline. • The current approach cannot leverage contextual information for the first queries of a session, which motivates our future work on leveraging contextual information beyond session boundaries.