Potential Query Log Sets Alexander Yeh MITRE Corp. October 2008
Possible Issues with a "Query Log" Corpus
• Resembles queries of real interest to somebody
• Has some 'geo' aspect
• Multi-lingual
  • MITRE in-house has limitations on languages
• Permission to use and distribute (even after the evaluation)
More Recent Suggestions (While at Workshop)
• Local search queries from various Wikipedias
  • Multi-lingual
  • Privacy? Probably not as bad as with other search logs (more like encyclopedia lookups)
  • Permission?
  • Long enough to be interesting from a "geo" standpoint?
More Recent Suggestions (Continued)
• Treat GikiP topics as queries
  • E.g.: GP4 "Which Swiss cantons border Germany?"
  • Multi-lingual, have permission, no privacy problem
  • Combine with GikiP 2009 for publicity purposes
  • But few in number (15 in 2008 pilot)
  • Realistic enough?
• Use logs generated by an evaluation (like iCLEF)
  • Multi-lingual, permissions & privacy dealt with
  • But realistic enough?
  • Has "geo" aspect?
More Recent Suggestions (Concluded)
• Timway search logs from Hong Kong
  • Chinese, English, usually 1 language in a query
  • Used in some studies, but usual permission & privacy issues
  • Also, finding annotator(s) may be an issue:
    • Chinese probably in Cantonese (versus "official" Mandarin dialect) - not too bad in written form
    • Probably traditional characters (not mainland China's simplified characters)
Potential Query Log Data Sets - 1
• Tumba! (Diana Santos, Nuno Cardoso and others)
  • Available, large amount, a lot not released before
  • In Portuguese: need to hire and train somebody who can annotate Portuguese
Potential Query Log Data Sets - 2
• Workshop on Web Search Click Data 2009 (WSCD 2009)
  • http://research.microsoft.com/users/nickcr/wscd09/
  • MSN search query log
  • Large amount, relatively new (and so not seen as much)
  • Pursuing getting permission (asking Nick Craswell)
• Cancelled query parsing task in CLEF 2008
  • Current status: cannot release data outside of Microsoft
Potential Query Log Data Sets - 3
• Query parsing task in CLEF 2007
  • Query log of 800K English queries (unlabeled), 100 queries of labeled training data and 500 queries of test data
  • Presumably this log is still available for use in a new query parsing task
  • Use same set, but generate new training and test data
  • One disadvantage: the CLEF community is already familiar with this data set
Can Easily Obtain the Following Query Log Data Sets, But …
• Can easily obtain a number of data sets, but
  • They are old, and so may already have been seen by the CLEF community
  • Problems getting permissions to use these
    • Anticipate problems, or
    • Been asked not to use
Query Log Data Sets that are Easy to Obtain
• KDD Cup 2005: Ying Li, a co-chair, asked us not to use
• AlltheWeb_2001.gz, AlltheWeb_2002.gz, AltaVista_2002.zip: Jim Jansen: the data sharing agreement has expired
• Excite_1997_small.zip, Excite_1997_large.zip, Excite_1999.zip, Excite_2001.gz: from Jim Jansen. Need Excite's permission?
Query Log Data Sets that are Easy to Obtain (Concluded)
• AOL query log: from http://gregsadetsky.com/aol-data/
  • Was made available to the public for a while
  • Created a controversy about privacy
• But all these data sets will have similar privacy issues
A Way to 'Use' these Data Sets (John Burger):
• Use the existing logs as 'inspiration' for a made-up log corpus
  • May have been done by others, like NIST
  • Will not need permission
  • Will not have been seen before
  • Can ensure no privacy disclosures
  • But will take time to produce the made-up data
Privacy Concerns
• Though best known from the AOL query logs, all these data sets may contain private data
• One way to 'remove': use the existing logs as 'inspiration' for a made-up log corpus (mentioned above)
• A fast, incomplete way to remove private data: remove the query timestamps and the links indicating which queries came from the same site, and randomize the order of the queries
  • A lot of the 'disclosures' come from grouping the queries by a common source
  • But the removed information is then no longer available to a query parser
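The fast, incomplete approach could be sketched roughly as below. This is only an illustration under stated assumptions: the record layout and the field names (`user_id`, `timestamp`, `query`) are hypothetical and do not come from any of the logs discussed, and the function name `strip_and_shuffle` is made up for this example.

```python
import random
from typing import Iterable, List, Mapping, Optional


def strip_and_shuffle(
    records: Iterable[Mapping[str, str]],
    query_field: str = "query",  # hypothetical field name
    seed: Optional[int] = None,
) -> List[str]:
    """Keep only the query text and return the queries in random order.

    Dropping every other field removes the timestamps and the links that
    tie queries to a common source; shuffling removes the ordering cue.
    """
    queries = [record[query_field] for record in records]
    # A fixed seed is for reproducible experiments only; leave it as None
    # when the goal is to hide the original order.
    random.Random(seed).shuffle(queries)
    return queries


# Hypothetical sample log; all values are invented.
log = [
    {"user_id": "u1", "timestamp": "2008-10-01T09:00", "query": "hotels near lake geneva"},
    {"user_id": "u1", "timestamp": "2008-10-01T09:02", "query": "train zurich to basel"},
    {"user_id": "u2", "timestamp": "2008-10-01T09:05", "query": "swiss cantons bordering germany"},
]
anonymized = strip_and_shuffle(log)
```

The query strings themselves are left untouched, so private data typed into a query survives this step, which is why the approach is only a partial one.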
Privacy Concerns (Concluded)
• A slower, more complete way to remove private data: review the data (perhaps as it is annotated) and flag any queries containing private data
  • Either substitute the flagged data with fictional information or remove the flagged queries from the data sets
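The flag-then-substitute-or-remove step might look something like the sketch below. It assumes a human reviewer has already set the flag and, where desired, written a fictional replacement; the `AnnotatedQuery` type, its fields, and the `scrub` function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedQuery:
    text: str
    has_private_data: bool = False      # set by the human reviewer
    replacement: Optional[str] = None   # fictional substitute, if one was written


def scrub(queries: List[AnnotatedQuery]) -> List[str]:
    """Return query texts with flagged queries substituted or removed."""
    cleaned = []
    for q in queries:
        if not q.has_private_data:
            cleaned.append(q.text)         # unflagged: keep as-is
        elif q.replacement is not None:
            cleaned.append(q.replacement)  # flagged: use fictional text
        # flagged with no replacement: drop the query entirely
    return cleaned


# Hypothetical reviewed queries; all names and addresses are invented.
reviewed = [
    AnnotatedQuery("swiss cantons bordering germany"),
    AnnotatedQuery("jane roe 1 example street", True, "pat doe 2 sample avenue"),
    AnnotatedQuery("jane roe phone number", True),
]
scrubbed = scrub(reviewed)
```

Substitution keeps the query count and the general shape of the data, while removal is simpler but shrinks the corpus; which to prefer would depend on how many queries end up flagged.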