Implementing Query Classification HYP: End of Semester Update, prepared by Minh
Previously… • Web search queries: • Understand user goal • Broder (2002): • Queries are classified into 3 categories: • Informational • Navigational • Transactional
Previously… • Functional Faceted Web Query Classification • Ambiguity: Polysemous, General, Specific • Authority Sensitivity: Yes - No • Spatial Sensitivity: Yes - No • Temporal Sensitivity: Yes - No • Query’s 4-Tuple: <Am, Au, S, T> • 3 * 2 * 2 * 2 = 24 different combinations.
Temporal Sensitivity • Definition: • A keyword is temporally sensitive if the results a web search engine returns for it tend to change over time. • Example: • Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc. • Not temporally sensitive: video, buying car, etc.
Up-to-date Project Scope • Objective: to analyze the temporal sensitivity facet of web search queries. • Problem: find the temporal correlation between web queries
Web Query Histogram • Periodic queries: Champions League Final • Non-periodic queries: Liverpool
Query Correlation • Observation: two keywords can be temporally related to each other
Proposed System Framework • Request each query’s histogram from Google Trends • Use a histogram digitizer program (Plotparser by WeiHua) to extract the numerical data • Query Correlation: • Calculate the correlation coefficient between queries • Query classification
Query Correlation: 1st attempt • Calculate the correlation coefficient: • Using 45 months of data: January 2004 to September 2007 • Calculate the coefficient over the entire histograms
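The slides do not say which correlation coefficient was used. As a minimal sketch, assuming the Pearson coefficient over two equal-length histograms (one value per month), the calculation could look like this. The function name `pearson` and the choice to return 0.0 for a flat histogram are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and each sequence's squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a flat histogram is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)
```

With 45 monthly values per keyword, each call compares two lists of length 45.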
Result classification: 1st attempt • Data of 15 different popular keywords, of which: • Periodic keywords: • Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Chrismas(!). • Related keywords: • PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami • All keywords are compared with each other based on the correlation coefficient of their histograms. • (15*14)/2 = 105 instances
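The instance count is the number of unordered keyword pairs. The short sketch below enumerates those pairs; the keyword spellings are copied from the slide as listed, and the variable names are illustrative.

```python
from itertools import combinations

# Keywords exactly as listed on the slide (spelling kept as-is).
periodic = ["Champions League Final", "Grammy", "Pro Evolution Soccer",
            "Oscar Winner", "Valentine", "Chrismas"]
related = ["PS2", "Xbox", "Jack Nicholson", "Beyonce", "chocolate",
           "chocolateNews", "Liverpool", "EA Sport", "Konami"]

keywords = periodic + related
# Every unordered pair of distinct keywords is one instance to classify.
pairs = list(combinations(keywords, 2))
print(len(keywords), len(pairs))  # 15 105
```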
Result classification: 1st attempt • Classification based on threshold method: • Statistical result: • Threshold value: 0.25
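A threshold classifier on a single feature can be sketched as follows. The direction of the comparison (at or above the threshold counts as related) and the names `classify_pair` and `THRESHOLD` are assumptions; only the value 0.25 comes from the slide.

```python
THRESHOLD = 0.25  # threshold value reported for the 1st attempt


def classify_pair(coef, threshold=THRESHOLD):
    """Label a keyword pair from its correlation coefficient.

    Assumption: coefficients at or above the threshold count as related.
    """
    return "related" if coef >= threshold else "unrelated"


# Made-up coefficients for illustration only, not results from the project.
example_coefs = {("keyword A", "keyword B"): 0.40,
                 ("keyword A", "keyword C"): 0.05}
labels = {pair: classify_pair(c) for pair, c in example_coefs.items()}
```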
1st attempt Problems: • Very low threshold value • Only one feature used. • Using entire histogram, while some keywords are only temporally related to each other at some periods of time. • Example: Valentine – Chocolate (Correlation appears during February)
Query Correlation: 2nd attempt • Interesting period: • Period in which two queries are highly related to each other • -> Segmentation (Clustering) problem
Clustering Using Simple K means • Algorithm to predict no. of clusters • Use WEKA to cluster the histogram
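The project used WEKA for this step. The slide does not state exactly what was clustered, so the following is only a rough plain-Python sketch of k-means (Lloyd's algorithm), assuming the monthly volume values are clustered into k levels so that months in the highest level could mark an interesting period. The function name, the deterministic initialisation, and that interpretation are all assumptions.

```python
def kmeans_1d(values, k, max_iterations=100):
    """Cluster scalar values into k groups; returns (labels, centroids)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: k evenly spaced picks from the sorted values.
    centroids = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]

    def nearest(v):
        # Index of the centroid closest to v.
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(max_iterations):
        labels = [nearest(v) for v in values]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return [nearest(v) for v in values], centroids
```

As with WEKA's k-means, the caller still has to supply k, so the quality of the segmentation depends on choosing k well.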
Query Correlation: 2nd attempt • Periodic keyword detection: • Identify repeated patterns using correlation • A periodic query tends to have a high correlation coefficient across its repeated parts.
Interesting Periods Projection • Interesting periods from the related keyword’s histogram are projected onto the periodic keyword’s histogram
Result Classification: 2nd Attempt • Using the previous dataset • Related keywords are compared with each of the periodic keywords for correlation • Result: • Managed to increase the threshold value to 0.5
2nd attempt problems • K-means clustering does not guarantee correct detection of interesting periods: • K-means requires the number of clusters as input • -> the algorithm implemented to determine the number of clusters failed to provide the correct value • Small training data set. • The threshold detection method is too simple.
Query Correlation: 3rd attempt • Need to find another way to identify interesting periods. • Peak period: • Period in which there is a high peak in query volume • Peak detection problem: • Mapping and smoothing using convolution
Clustering using peak detection • Mapping:
Clustering using peak detection • Smoothing using convolution:
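The kernel used for smoothing is not given on the slide. As a sketch, assuming a simple moving-average kernel of odd width and zero padding at the edges, smoothing by convolution could be written as below. The names `convolve_same` and `moving_average` are illustrative.

```python
def convolve_same(signal, kernel):
    """Discrete convolution, output the same length as the input.

    Assumes an odd-length kernel and treats values outside the signal as zero.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + half - j  # convolution reads the signal in reverse
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def moving_average(signal, width=3):
    """Smooth a histogram with a flat kernel (width is an assumed parameter)."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    return convolve_same(signal, [1.0 / width] * width)
```

A single spike of height 3 smoothed with width 3 spreads into three values of about 1, which removes one-bin noise before peak detection. Because of the zero padding, values near the edges are pulled towards zero.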
Clustering using peak detection • Peak detection: use a simple slope-change algorithm to determine peaks and valleys • (threshold value: the mean of the histogram)
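One possible reading of the slope-change rule is sketched below: a peak is a point where the slope turns from positive to negative and the value exceeds the histogram mean, and a valley is a point where the slope turns from negative to positive. Applying the mean threshold only to peaks, and ignoring flat plateaus, are assumptions of this sketch.

```python
def find_turning_points(signal):
    """Return (peaks, valleys) as lists of indices into signal."""
    threshold = sum(signal) / len(signal)  # mean of the histogram
    peaks, valleys = [], []
    for i in range(1, len(signal) - 1):
        slope_in = signal[i] - signal[i - 1]
        slope_out = signal[i + 1] - signal[i]
        if slope_in > 0 and slope_out < 0 and signal[i] > threshold:
            peaks.append(i)  # rising then falling, above the mean
        elif slope_in < 0 and slope_out > 0:
            valleys.append(i)  # falling then rising
    return peaks, valleys


peaks, valleys = find_turning_points([1, 2, 8, 2, 1, 3, 1, 9, 1])
print(peaks, valleys)  # [2, 7] [4, 6]
```

The small bump at index 5 is a local maximum but lies below the mean, so it is not reported as a peak. The valleys on either side of a peak can then serve as the boundaries of a peak period.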
Interesting Periods Projection • Interesting periods from the related keyword’s histogram are projected onto the periodic keyword’s histogram, and vice versa
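Projection can be read as follows: take the periods detected on one histogram, cut the same index ranges out of both histograms, and correlate them window by window. The sketch below assumes Pearson correlation and half-open index ranges `(start, end)`; the function names and the sample numbers are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either window is flat (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def project_periods(periods, source_hist, target_hist):
    """Correlate both histograms inside each period found on source_hist."""
    coefs = []
    for start, end in periods:  # half-open ranges of histogram indices
        coefs.append(pearson(source_hist[start:end], target_hist[start:end]))
    return coefs


# Illustrative numbers only: one period detected on the first histogram.
hist_a = [1, 1, 9, 1, 1, 1]
hist_b = [0, 2, 10, 2, 0, 0]
coefs_a_to_b = project_periods([(1, 4)], hist_a, hist_b)
```

Running the same function with the periods detected on the second histogram gives the projection in the other direction.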
Result Classification: 3rd attempt • Use a larger training data set: • 47 popular keywords, of which: • 15 periodic keywords and 32 related keywords • Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef). • Data size: 15 * 32 = 480 instances
Result Classification: 3rd attempt • Apply Naïve Bayes Classifier (WEKA): • 6 features: • Average Coef from related keyword projection (AveRCoef) • Average Coef from periodic keyword projection (AvePCoef) • Overall Average Coef [= (AveRCoef+AvePCoef)/2] • Max Coef from related keyword projection (MaxRCoef) • Max Coef from periodic keyword projection (MaxPCoef) • Average Max Coef [= (MaxRCoef+MaxPCoef)/2 ]
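Given the per-period coefficients from both projections, the six features follow directly from the definitions above. In this sketch the keys `OverallAveCoef` and `AveMaxCoef` are names chosen here for the two unnamed features; the other four keys are the slide's own abbreviations.

```python
def correlation_features(related_coefs, periodic_coefs):
    """Build the six features from two non-empty lists of coefficients.

    related_coefs: Coef values from the related keyword's projection.
    periodic_coefs: Coef values from the periodic keyword's projection.
    """
    ave_r = sum(related_coefs) / len(related_coefs)
    ave_p = sum(periodic_coefs) / len(periodic_coefs)
    max_r = max(related_coefs)
    max_p = max(periodic_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }


# Illustrative input values only.
features = correlation_features([0.5, 1.0], [0.0, 0.5])
```

Each of the 480 keyword pairs would yield one such feature vector for the classifier.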
Result Classification: 3rd attempt • Statistical Result: • Confusion Matrix
Future attempt: Query Normalization • Search volumes tend to increase as the Internet becomes more popular • Histogram for the Top 20 most popular keywords of all time:
Future attempt: Normalization • Histograms need to be normalized to remove the effect of this trend! • Proposed action: • Subtract the time effect • Current problem: scaling adds further distortion. • -> the histograms from Google have already been scaled, and we have no access to the raw data.
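How the time effect would be subtracted is not specified. One simple option, shown here purely as an assumption, is to fit a straight line to the histogram by least squares and subtract it. This sketch does not address the scaling problem noted above, since it works on whatever values the histogram already contains.

```python
def detrend_linear(series):
    """Subtract the least-squares straight line from a sequence of values."""
    n = len(series)
    if n == 0:
        raise ValueError("series must not be empty")
    mean_t = (n - 1) / 2  # mean of the time indices 0..n-1
    mean_s = sum(series) / n
    denom = sum((t - mean_t) ** 2 for t in range(n))
    if denom:
        slope = sum((t - mean_t) * (s - mean_s)
                    for t, s in enumerate(series)) / denom
    else:
        slope = 0.0  # a single point has no trend
    intercept = mean_s - slope * mean_t
    # Residuals: what is left after removing the fitted trend line.
    return [s - (intercept + slope * t) for t, s in enumerate(series)]
```

A histogram that grows by a constant amount each month is reduced to zeros, while peaks on top of the growth remain visible as positive residuals.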
Future attempt: From Periodic to Non-periodic • Find the correlation between two non-periodic queries. • Proposed problem: some keywords are searched heavily shortly after other keywords • Example: “tsunami” is usually searched after “earthquake” has been searched.
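A follow-on relationship like this could be measured with a lagged correlation: shift the second histogram back by a number of bins and keep the shift with the highest coefficient. The sketch below assumes Pearson correlation and equal-length histograms; the function names and the sample numbers are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is flat (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def best_lag(leader, follower, max_lag):
    """Return (lag, coef) where follower delayed by lag best matches leader."""
    if len(leader) != len(follower) or max_lag > len(leader) - 2:
        raise ValueError("need equal lengths and max_lag <= len - 2")
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        # Compare leader[t] with follower[t + lag] on the overlapping part.
        coef = pearson(leader[:-lag], follower[lag:])
        if coef > best[1]:
            best = (lag, coef)
    return best


# Illustrative numbers only: the second series repeats the first one bin later.
first = [0, 0, 10, 1, 0, 0, 0, 0]
second = [0, 0, 0, 10, 1, 0, 0, 0]
lag, coef = best_lag(first, second, max_lag=3)
```

A positive best lag with a high coefficient would suggest that the second keyword tends to be searched after the first.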
Future attempt: From Periodic to Non-Periodic • [Figure: histograms for Earthquake and Tsunami]
Potential Applications • Results re-ranking: • Move more up-to-date results higher in the result list • Example: when a user searches for Beyonce around the time of the Grammy -> results related to Grammy get a higher rank • Server Buffering: • When a user queries Beyonce, web pages related to Grammy are buffered on the local server, in the expectation that the user will search for Grammy next.