Behavior-driven clustering of queries into topics. Luca Maria Aiello Debora Donato Umut Ozertem Filippo Menczer. CIKM 2011, Glasgow. Granularity levels. Query Session Goal Mission Topic. Concise representation. Aggregation. Meaningful semantics. USER PROFILING IN SEARCH ENGINES.
Behavior-driven clustering of queries into topics
Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer
CIKM 2011, Glasgow
USER PROFILING IN SEARCH ENGINES
Granularity levels: Query, Session, Goal, Mission, Topic
Concise representation, aggregation, meaningful semantics
MISSIONS AND TOPICS
A search mission can be identified as a set of queries that express a complex search need, possibly articulated in smaller goals.
A topic is a mental object or cognitive content, i.e., the sum of what can be perceived, discovered or learned about any real or abstract entity.
QUERY STREAM DECOMPOSITION
Queries in the same mission; queries in consecutive missions
Donato et al.: Do you want to take notes? Identifying research missions in Y! search pad. WWW’10
Taxonomies; user behavior and intent
Same topic; different topic
MERGING MISSIONS
TOPIC DETECTOR STATS
• Gradient Boosted Decision Tree (GBDT)
• Aggregation (min, max, avg, std) of 62 query pair features
• AUC 0.95
• 10-fold cross-validation on 500K pairs
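The aggregation step can be illustrated with a minimal sketch. It assumes that every pair feature is computed on all cross pairs of two query sets and then summarized by min, max, avg, and std. The two toy features (`jaccard_terms`, `length_diff`) and the function `aggregate_pair_features` are hypothetical stand-ins, since the slides do not list the 62 features.

```python
from itertools import product
from statistics import mean, pstdev


def jaccard_terms(q1, q2):
    """Hypothetical pair feature: Jaccard overlap of the query terms."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def length_diff(q1, q2):
    """Hypothetical pair feature: absolute difference in number of terms."""
    return float(abs(len(q1.split()) - len(q2.split())))


# Stand-ins for the 62 query pair features mentioned on the slide.
PAIR_FEATURES = {"jaccard_terms": jaccard_terms, "length_diff": length_diff}


def aggregate_pair_features(set_a, set_b, features=PAIR_FEATURES):
    """Summarize each pair feature over all cross pairs of two non-empty query sets."""
    vector = {}
    for name, fn in features.items():
        values = [fn(qa, qb) for qa, qb in product(set_a, set_b)]
        vector[f"{name}_min"] = min(values)
        vector[f"{name}_max"] = max(values)
        vector[f"{name}_avg"] = mean(values)
        vector[f"{name}_std"] = pstdev(values)  # population standard deviation
    return vector
```

A vector of this kind could then be passed to a binary classifier such as a GBDT that decides whether the two query sets belong to the same topic.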
GREEDY AGGLOMERATIVE TOPIC EXTRACTION (GATE)
• Topic detector applied to pairs of query sets
• O(log|M|·|M|²) (heavily parallelizable)
1. Missions of the same user → supermissions
2. Query sets of different users → higher-level topics
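A greedy agglomerative merge of this kind can be sketched as below. This is not the authors' implementation: the detector `term_overlap`, the `threshold` value, and the function `greedy_merge` are assumptions made for illustration, and a trained topic detector would take the place of the toy scoring function. A heap of candidate pairs keeps the best-scoring merge on top, which is one common way to reach a bound of the form |M|² log|M|.

```python
import heapq
from itertools import combinations


def term_overlap(set_a, set_b):
    """Hypothetical detector: Jaccard overlap of the terms in two query sets."""
    ta = {t for q in set_a for t in q.split()}
    tb = {t for q in set_b for t in q.split()}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def greedy_merge(query_sets, detector=term_overlap, threshold=0.3):
    """Repeatedly merge the best-scoring pair of query sets above the threshold."""
    clusters = {i: frozenset(s) for i, s in enumerate(query_sets)}
    next_id = len(clusters)

    # Max-heap (scores negated) of candidate merges between live clusters.
    heap = []
    for i, j in combinations(clusters, 2):
        score = detector(clusters[i], clusters[j])
        if score >= threshold:
            heapq.heappush(heap, (-score, i, j))

    while heap:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(i) | clusters.pop(j)
        # Score the new cluster against every remaining cluster.
        for k, other in clusters.items():
            score = detector(merged, other)
            if score >= threshold:
                heapq.heappush(heap, (-score, k, next_id))
        clusters[next_id] = merged
        next_id += 1

    return sorted(sorted(c) for c in clusters.values())
```

Under these assumptions, the two stages listed on the slide would correspond to two calls: one on the missions of each user, and one on the resulting query sets pooled across users.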
EVALUATION
40K users, 3 months of Y! log
EVALUATION: BASELINE
URL cover graph
• OSLOM community detection algorithm
• Weighted undirected graph
• Maximizing local fitness function of clusters
• Automatic hierarchy detection
Lancichinetti et al.: Finding statistically significant communities in networks. PLoS ONE, 2011.
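The slides do not define the edge weights of the URL cover graph. The sketch below assumes one plausible construction: queries are nodes, and the weight of an undirected edge is the number of clicked URLs the two queries share. The function name `build_url_cover_graph` and the input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_url_cover_graph(clicks):
    """Build a weighted undirected query graph from (query, clicked_url) pairs.

    Returns {(q1, q2): weight} with q1 < q2. The weight is the number of
    clicked URLs shared by the two queries (an assumption, not from the slides).
    """
    queries_by_url = defaultdict(set)
    for query, url in clicks:
        queries_by_url[url].add(query)

    weights = defaultdict(int)
    for queries in queries_by_url.values():
        for q1, q2 in combinations(sorted(queries), 2):
            weights[(q1, q2)] += 1
    return dict(weights)
```

With this construction, a query whose clicked URLs are shared with no other query gets no edges, so a graph-based community detection algorithm has nothing to cluster it with.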
EVALUATION: QUERY SET COVERAGE
Fraction of queries considered in the clustering phase
• GATE: 1
• OSLOM: 0.2
Figure: size distribution of the connected components of the URL cover graph
EVALUATION: SINGLETON RATIO
Fraction of queries that remain isolated in singleton clusters
• GATE: 0.55-0.27
• OSLOM: 0.88
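As a small illustration of the metric, the sketch below computes the singleton ratio of a clustering given as a list of query lists. The function name `singleton_ratio` and the input format are assumptions.

```python
def singleton_ratio(clusters):
    """Fraction of all queries that sit alone in a cluster of size one."""
    total_queries = sum(len(c) for c in clusters)
    if total_queries == 0:
        return 0.0
    singletons = sum(1 for c in clusters if len(c) == 1)
    return singletons / total_queries
```

For example, a clustering with one pair and two isolated queries has a singleton ratio of 2/4.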
EVALUATION: AGGREGATION ABILITY
Topics aggregated in two consecutive steps or levels
• GATE: 500K
• OSLOM: 100K
EVALUATION: PURITY vs. COVERAGE
• Coverage: number of unique clicked URLs for the query
• Purity: average pointwise mutual information of pairs of query-related relevant terms
• Relevant terms are extracted from top clicked results using a predefined dictionary
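The two measures can be sketched as follows. The slides do not say how the term probabilities are estimated, so this sketch assumes they come from document-level co-occurrence counts in a reference collection `docs` (a list of term sets). The names `pmi`, `purity`, and `coverage` are hypothetical.

```python
import math
from itertools import combinations


def pmi(x, y, docs):
    """Pointwise mutual information log(p(x,y) / (p(x) p(y))) from document counts."""
    n = len(docs)
    p_xy = sum(1 for d in docs if x in d and y in d) / n
    if p_xy == 0:
        return float("-inf")  # the two terms never co-occur
    p_x = sum(1 for d in docs if x in d) / n
    p_y = sum(1 for d in docs if y in d) / n
    return math.log(p_xy / (p_x * p_y))


def purity(relevant_terms, docs):
    """Average PMI over all pairs of relevant terms (0.0 if fewer than two terms)."""
    pairs = list(combinations(sorted(set(relevant_terms)), 2))
    if not pairs:
        return 0.0
    return sum(pmi(x, y, docs) for x, y in pairs) / len(pairs)


def coverage(clicked_urls):
    """Number of unique clicked URLs."""
    return len(set(clicked_urls))
```

Returning negative infinity for pairs that never co-occur is a simplification; a smoothed estimate would avoid one such pair dominating the average.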
USER PROFILING FROM TOPICS
Figure: missions, topic detector, topics, user topical profile
Values shown in the figure: 1.9, 0.0, 0.7, 3.2, 0.0, 0.41, 0.0, 2.9, 0.24, 0.35
PROFILES FOR “PREDICTION”
• Sequence of missions of the profiled user vs. sequence of a random one
• Sequence-profile match using topic detector
• Success: 0.65 (0.72 less frequent, 0.55 most frequent)
CONCLUSIONS
• New behavior-driven notion of topics
• Bottom-up topic extraction algorithm
• Favorable comparison with graph-based clustering
• Effective user profiling
• Other baselines
• More accurate predictions
ACKNOWLEDGMENTS
Fil Menczer: Prof. Informatics @ IU, Director CNetS @ IU
Emre Velisapaoglu: Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale
Umut Ozertem: Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale
Debora Donato: Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale