Tagging with Queries: How and Why?
Ioannis Antonellis (antonell@cs.stanford.edu), Hector Garcia-Molina (hector@cs.stanford.edu), Jawed Karim (jawed@cs.stanford.edu)
Content on the Web [Figure labels: Back Link Text, Search queries, Page Text, Forward Link Text; example terms: Cnn, Obama, Critics, news] Stanford Infolab
How? • Basic observation: the HTTP referrer field (the Referer header) contains the search query when a visitor arrives from a search results page
How? • Basic observation: the HTTP referrer field contains the search query 1) Extract queries from the web access log
Web Access Log
a997c1950718d75c03f22ca8715e50b3 [28/Feb/2007:23:45:47 -0800] /group/svsa/cgi-bin/www/officers.php http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts
a64344ffd6638d0f6fb2a0284f98b28b [28/Feb/2007:23:45:49 -0800] /group/King/ "http://www.google.com.au/search?hl=en&q=Martin+Luther+King&meta="
413fa663474b2288c1661882e7e62aea [28/Feb/2007:23:46:02 -0800] /group/pandegroup/folding/results.html "http://www.google.com/search?sourceid=navclient-menuext&ie=UTF-8&q=RESULTS"
3d2edd4dfa7778da92875ee67a319433 [28/Feb/2007:23:46:03 -0800] /group/vpge/sgsi/entrepreneurship/ "http://www.google.com/search?hl=en&q=summer+institute+of+entrepreneurship"
ac49793239a6c490023e460fd4863a48 [28/Feb/2007:23:46:06 -0800] / "http://www.google.com/search?sourceid=navclient&hl=ko&ie=UTF-8&rlz=1T4SUNA_ko___KR209&q=stanford"
1c9893680
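The log entries above suggest how step 1 might work in practice. The following is a minimal Python sketch, not the authors' implementation. It assumes the four-field layout shown on the slide (hashed client id, bracketed timestamp, requested path, referrer URL) and that the search engine passes the query in a `q` parameter, as Google does in these entries. The function name `extract_query_tag` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract_query_tag(log_line):
    """Return (requested_path, search_query), or None if no query is found.

    Assumed layout (taken from the sample log, not from a documented format):
    <hashed id> [<timestamp>] <path> <referrer, optionally double-quoted>
    """
    # The timestamp contains a space, so split only after the closing bracket.
    close = log_line.find("]")
    if close == -1:
        return None
    fields = log_line[close + 1:].split()
    if len(fields) < 2:
        return None
    path = fields[0]
    referrer = fields[1].strip('"')
    # parse_qs decodes "+" and percent-escapes in the query string.
    params = parse_qs(urlsplit(referrer).query)
    terms = params.get("q")
    if not terms:
        return None
    return path, terms[0]


if __name__ == "__main__":
    sample = ('a64344ffd6638d0f6fb2a0284f98b28b [28/Feb/2007:23:45:49 -0800] '
              '/group/King/ "http://www.google.com.au/search?hl=en'
              '&q=Martin+Luther+King&meta="')
    print(extract_query_tag(sample))
```

Each returned pair associates a page with one query that led a visitor to it, which is the raw material for a query tag. Lines that are truncated or carry no `q` parameter yield `None` and can simply be skipped.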
How? • Basic observation: the HTTP referrer field contains the search query 1) Extract queries from the web access log 2) Embed JavaScript code in web pages to capture search queries
Embeddable code
How? • Basic observation: the HTTP referrer field contains the search query 1) Extract queries from the web access log 2) Embed JavaScript code in web pages and capture search queries • Convince the server administrator or page owner
Query tags
Information value of Query Tags WebBase • Datasets: • Stanford Query Logs: 360,000 URLs, 900,000 query tags • Delicious@Stanford: 3,000 URLs, 5,500 tags
Experiments: Summary • URL coverage • Query tags vs. Delicious tags • Query/Delicious tags vs. page text
URL coverage • Query logs provide tags for ~110 times more URLs than Delicious • 13% of Delicious URLs (380 URLs) are tagged only by Delicious
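The coverage figures above are the kind of result a simple set comparison over the two URL collections would produce. The sketch below is illustrative only: the function name `coverage_summary`, the dictionary keys, and the toy URLs are hypothetical, and it does not reproduce the datasets or numbers from the slides.

```python
def coverage_summary(query_urls, delicious_urls):
    """Compare the URLs tagged by query logs with those tagged by Delicious."""
    query_urls = set(query_urls)
    delicious_urls = set(delicious_urls)
    if not delicious_urls:
        raise ValueError("delicious_urls must not be empty")
    # URLs that have Delicious tags but never appear in the query logs.
    only_delicious = delicious_urls - query_urls
    return {
        "url_ratio": len(query_urls) / len(delicious_urls),
        "only_delicious": len(only_delicious),
        "only_delicious_share": len(only_delicious) / len(delicious_urls),
    }


if __name__ == "__main__":
    # Toy data, for illustration only.
    print(coverage_summary({"/a", "/b", "/c", "/d"}, {"/a", "/e"}))
```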
Query Tags • Query logs provide 42 query tags per URL on average
Delicious Tags • Delicious provides 3 tags per URL on average
Tags for common URLs (URLs tagged by both sources) • Query logs provide 250 query tags per URL on average for common URLs • Delicious provides 5 tags per URL on average for common URLs
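A per-URL average like the ones reported above could be computed as follows. This is a hedged sketch under two assumptions that the slides do not confirm: tags are counted as distinct, case-insensitive strings per URL, and "common URLs" means URLs present in both sources. The helper names `tags_by_url` and `mean_tags_per_url` are hypothetical.

```python
from collections import defaultdict


def tags_by_url(pairs):
    """Group (url, tag) pairs into {url: set of distinct lowercase tags}."""
    grouped = defaultdict(set)
    for url, tag in pairs:
        grouped[url].add(tag.lower())
    return dict(grouped)


def mean_tags_per_url(tags, urls=None):
    """Average number of distinct tags per URL, optionally limited to `urls`."""
    selected = list(tags) if urls is None else [u for u in urls if u in tags]
    if not selected:
        return 0.0
    return sum(len(tags[u]) for u in selected) / len(selected)


if __name__ == "__main__":
    # Toy data, for illustration only.
    query_tags = tags_by_url([("/a", "stanford"), ("/a", "Stanford"),
                              ("/a", "infolab"), ("/b", "king")])
    delicious_tags = tags_by_url([("/a", "university")])
    common = set(query_tags) & set(delicious_tags)
    print(mean_tags_per_url(query_tags))
    print(mean_tags_per_url(query_tags, common))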
Query Tags vs Page Text • For every URL, 1 out of 3 query tags is not present in the page text
Delicious Tags vs Page Text • For every URL, 1 out of 2 Delicious tags is not present in the page text
Tags for common URLs • For common URLs, 1 out of 2 query/Delicious tags is not present in the page text
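The tag versus page text comparisons above depend on how "present in the page text" is defined, which the slides do not specify. The sketch below assumes one plausible definition: a tag counts as present when every word of the tag occurs somewhere in the page text, ignoring case and punctuation. The names `tokenize` and `missing_fraction` are hypothetical.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_fraction(tags, page_text):
    """Fraction of tags for one URL that do not appear in its page text.

    Assumed rule: a tag is present only if all of its words occur in the page.
    """
    tags = list(tags)
    if not tags:
        return 0.0
    page_words = tokenize(page_text)
    missing = [tag for tag in tags if not tokenize(tag) <= page_words]
    return len(missing) / len(tags)


if __name__ == "__main__":
    # Toy data, for illustration only.
    page = "Martin Luther King papers at Stanford"
    print(missing_fraction(["stanford", "martin luther king", "folding results"], page))
```

Averaging this fraction over all URLs gives a single figure comparable in form to the "1 out of 3" and "1 out of 2" statements, although a different matching rule (for example, exact phrase matching) would give different values.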
Conclusions • Query tags: • Can be extracted in a distributed fashion • Are a promising new source of information • Can provide many new tags for a large fraction of the Web
Thank You! (DEMO) http://tags.stanford.edu