A Simple Unsupervised Query Categorizer for Web Search Engines. Prashant Ullegaddi and Vasudeva Varma, Search and Information Extraction Lab, Language Technologies Research Center, IIIT-Hyderabad 500 032. ICON 2010.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies Research Center IIIT-Hyderabad 500 032 ICON 2010
Outline • Query categorization • Related work • Importance of ranking • Challenges • Design goals • Our approach • Results • Conclusion
Query categorization (QC) • Automatic categorization (classification) of user queries into one or more pre-defined categories • Note that categories are pre-defined and may vary across different applications • However, for a particular application, categories remain the same over a reasonable amount of time
Contributions • Solving query categorization as a purely information retrieval problem • Emphasis on the importance of ranking categories in QC systems • Our system, being simple and unsupervised in nature, can establish a new baseline
Related Work • Text categorization techniques (Shen et al., 2005, 2006): • Solve QC as a text categorization problem • But queries are not as rich as text documents in terms of context • Text classifiers are trained with a static vocabulary, which may not account for the dynamic nature of the Web.
…Related Work • Graph based models (Diemert and Vandelle, 2009). • Constructing concept graphs built from search query logs • Once the concept graph is constructed, a query is categorized by traversing through the graph. • Not all search engines have the luxury of large search query logs.
Research Questions • Can we solve QC by considering it purely as an IR problem? • Can we combine existing, relatively standard IR techniques to solve QC? • Can an already categorized corpus be used for conducting query categorization? • Can we establish a new baseline for QC systems?
Importance of ranking • Consider category listings of two hypothetical systems for the query “Ipod” • It is obvious from this example that ranking plays an important role for QC systems
Challenges • Category representation: • Categories need to be defined (covering most of the Web) • Each category needs to be represented by a set of documents that best describe that category. Category representation is needed in order to solve QC purely as an IR problem.
…Challenges • Query expansion/enrichment: Usually queries are very short. • Average query length in KDD Cup 2005 was 3.12 words. • 22.5% of the queries were of length 3 words. • 78.7% of the queries had at most 4 words.
Category Representation • Categories of Open Directory Project (ODP) for QC • Web documents that are classified under a category represent that category. • Approximately 2.4 million English documents (of ODP) to represent categories • These documents are classified into approximately 380K categories. • Here the assumption is that these categories cover the entire Web. • This corpus of ODP documents is used to perform QC.
Design Goals • Our design goals: • Simple • Unsupervised framework • Implementable on Web scale • To solve QC as a “search” problem, since “search” is a task a Web search engine can afford for free.
Our Approach • [Figure: pipeline diagram with the labels ODP documents, Expanded Query, ODP Categories, Target Categories]
Query Expansion • Pseudo-relevance feedback query expansion • Submit the query to a Web search engine • Collect stemmed terms (Q’) from the titles and snippets of the top N search results • Remove stop words • Weight the terms by their document frequency (DF)
…Query Expansion • Common concepts for a query usually occur in most of the top web documents obtained for a query • This information is best captured by DF • These common concepts represent the query • [Figure: the query “Serena Williams” is submitted to a Web search engine; the terms Tennis, Sports, WTA and Wimbledon recur across the returned results]
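The DF-based expansion described on these two slides can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the `(title, snippet)` input format, the function names and the sample results are all assumptions, and stemming is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "is", "to", "on", "at"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def expand_query(results, top_n=10, num_terms=5):
    """Return the (term, DF) pairs with the highest document frequency.

    `results` is a list of (title, snippet) pairs, assumed to come from
    a Web search engine for the original query.
    """
    df = Counter()
    for title, snippet in results[:top_n]:
        # A set counts each term once per result, which is what DF measures.
        df.update(set(tokenize(title + " " + snippet)))
    # Sort by descending DF, breaking ties alphabetically for determinism.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_terms]

# Made-up search results for the query "Serena Williams".
results = [
    ("Serena Williams", "tennis player and WTA champion"),
    ("Serena Williams news", "tennis sports coverage"),
    ("WTA profile", "Serena Williams tennis statistics"),
]
expanded = expand_query(results, num_terms=4)
print(expanded)
```

Terms that recur across many results (here "tennis") outrank terms that appear in only one, which is the intuition behind using DF rather than raw term frequency.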
Central Idea • The ODP documents that match the query-related concepts are good enough to carry out QC • In essence, topically similar documents • This fact is leveraged in our unsupervised approach to QC
Query Categorization • Search the expanded query on the ODP Web document corpus • Each ODP document retrieved for the query belongs to at least one ODP category, and these categories form the categorization of the query • An optional taxonomy mapping is applied when the target categories differ from those of ODP
Taxonomy Mapping for KDD Cup dataset • We map ODP categories to KDD Cup categories to evaluate on the KDD Cup dataset • Note that these mappings are computed once, offline
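A minimal sketch of this retrieval step, under stated assumptions: the four-document corpus is made up, the document score is a plain weighted term overlap standing in for a real IR ranking function, and summing document scores per category is one possible way to rank categories rather than the method confirmed by the slides.

```python
from collections import defaultdict

# Toy stand-in for the ODP corpus: (category path, document text). Entirely made up.
ODP_DOCS = [
    ("Sports/Tennis", "tennis players wta rankings wimbledon"),
    ("Sports/Tennis", "wimbledon tennis tournament history"),
    ("Computers/Hardware", "laptop hardware reviews"),
    ("Arts/Music", "music players and albums"),
]

def categorize(expanded_query, docs, top_docs=3):
    """Rank categories by the summed scores of the retrieved documents.

    `expanded_query` maps term -> weight (for example its DF from expansion).
    """
    scored = []
    for category, text in docs:
        terms = set(text.split())
        # Weighted overlap between the query terms and the document terms.
        score = sum(w for t, w in expanded_query.items() if t in terms)
        if score > 0:
            scored.append((score, category))
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    # Each retrieved document contributes its score to its own category.
    category_scores = defaultdict(float)
    for score, category in scored[:top_docs]:
        category_scores[category] += score
    return sorted(category_scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = categorize({"tennis": 3, "wta": 2, "players": 1}, ODP_DOCS)
print(ranking)
```

No classifier is trained at any point: the categories fall out of which already-categorized documents the search retrieves.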
…Taxonomy Mapping • Search for each target category in the ODP category descriptions • For a target category t, let the set of retrieved ODP categories be C • Map every category in C to target category t • Repeat this for the other target categories to obtain the full set of mappings
…Taxonomy Mapping • Let C(Q) be the set of ODP categories returned for a query Q • Target categories to which more of the categories in C(Q) are mapped are ranked higher • The top K categories in the target space are returned as the top K target categories for the query
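The offline mapping and the vote-based ranking can be sketched as below. This is an illustration under assumptions: the category names and descriptions are hypothetical, and a naive substring match stands in for the search over ODP category descriptions.

```python
from collections import Counter, defaultdict

def build_mapping(target_categories, odp_descriptions):
    """Offline step: map each ODP category to the targets that retrieve it.

    A substring test on the description stands in for a real search.
    """
    mapping = defaultdict(set)
    for target in target_categories:
        for odp_category, description in odp_descriptions.items():
            if target.lower() in description.lower():
                mapping[odp_category].add(target)
    return mapping

def rank_targets(odp_categories, mapping, k=5):
    """Online step: each ODP category in C(Q) votes for its mapped targets."""
    votes = Counter()
    for odp_category in odp_categories:
        votes.update(mapping.get(odp_category, ()))
    # More votes rank higher; ties are broken alphabetically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [target for target, _ in ranked[:k]]

# Hypothetical ODP category descriptions and target categories.
odp_descriptions = {
    "Sports/Tennis": "sports sites about tennis",
    "Sports/Tennis/Players": "sports personalities in tennis",
    "Arts/Music": "music and entertainment",
}
mapping = build_mapping(["Sports", "Entertainment"], odp_descriptions)
top = rank_targets(["Sports/Tennis", "Sports/Tennis/Players", "Arts/Music"], mapping, k=2)
print(top)
```

Because `build_mapping` depends only on the two taxonomies, it runs once and its result is reused for every query.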
Dataset • KDD Cup 2005 dataset (Li et al., 2005) • A set of 800K unlabeled queries sampled from MSN search query logs • 67 predefined categories • A set of 800 queries (sampled from the 800K queries) was labeled • Three labelers independently labeled this set • Each query was tagged with at most 5 categories • This dataset serves as the standard dataset for QC evaluation
Evaluation Metrics Precision, Recall and F1 are defined, respectively, as follows:
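As an illustration of how such metrics can be computed, here is a minimal sketch that assumes micro-averaged definitions: correct category assignments are counted over all queries, then divided by the number of assignments made (precision) or by the number of human-labeled assignments (recall). The function name, queries and labels are made up.

```python
def micro_prf(predicted, gold):
    """Micro-averaged precision, recall and F1.

    `predicted` and `gold` map each query to a set of category labels.
    """
    correct = sum(len(predicted.get(q, set()) & labels) for q, labels in gold.items())
    n_predicted = sum(len(labels) for labels in predicted.values())
    n_gold = sum(len(labels) for labels in gold.values())
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for two queries.
gold = {"ipod": {"Hardware", "Music"}, "serena williams": {"Tennis"}}
predicted = {"ipod": {"Hardware", "Shopping"}, "serena williams": {"Tennis"}}
p, r, f1 = micro_prf(predicted, gold)
print(p, r, f1)
```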
Results *High precision reported by KBS System is due to binary categorization
On Results • Though the F1 reported for our system is marginally lower, we believe our system should be viewed from a different perspective • It solves QC purely as an information retrieval problem • It combines relatively standard techniques to solve QC, making it • simple, and • implementable on a very large scale
…On Results • Our system is unsupervised in nature • Our system does not make use of resources like search query logs • Thus, we believe the results reported complement our design goals to a reasonable extent
Conclusion • A simple, unsupervised, yet effective approach to query categorization • Leverages an already categorized corpus (ODP) to perform QC • Advantages • Simple approach • Unsupervised • Existing IR techniques can be used • Avoids multiclass classification