Language Identification of Search Engine Queries • Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203, hakan@unt.edu • Yookyung Kim, Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA 95054, ykim@yahoo-inc.com • ACL 2009
Outline • Introduction • Data Generation • Language Identification • Conclusions and Future Work
Introduction(1) • Task: decide in which language a given text is written • The problem is heavily studied • It is of critical importance to search engines when applied to queries • Challenge: lack of any standard or publicly available data set
Introduction(2) • There are cases where a correct identification of the language is not necessary • Example: the query “homo sapiens”, entered by a user in Spain • Add a non-linguistic feature to the system
Data Generation(1) • Data set: constructed from queries with clicked URLs • Source: Yahoo! Search Engine, for each language • Time: a three-month period
Data Generation(2) • Preprocessing: • Remove any numbers, special characters, and extra spaces • Lowercase all letters of the queries • Calculate the frequencies of the URLs for each query • A web page is 474 words on average • Identify the language of each web page using one of the existing methods
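The query clean-up steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `normalize_query` is hypothetical, and replacing each removed character with a space (rather than deleting it) is an assumption.

```python
import re


def normalize_query(query: str) -> str:
    """Lowercase a query, drop digits and special characters, collapse spaces."""
    lowered = query.lower()
    # Keep letters (including accented ones) and whitespace; everything else
    # becomes a space so that neighbouring words are not glued together.
    letters_only = "".join(
        ch if ch.isalpha() or ch.isspace() else " " for ch in lowered
    )
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", letters_only).strip()
```

Note that a query made up mostly of digits, such as "ACL 2009", shrinks to a single token under this normalization.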
Data Generation(3) • Use Table 1 (T1) and Table 2 (T2) to store the above information • T1: [q, u, f_u] • T2: [u, l] • q: query; u: a unique URL; l: language identified for u; f_u: the frequency of u • Combine T1 and T2 into T3 • T3: [q, l, f_l, c_{u,l}] • l: a language; f_l: the count of clicks for l; c_{u,l}: the count of unique URLs in language l
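One way to picture the T1 + T2 → T3 combination is the sketch below. The in-memory representation (a list of tuples for T1, a dict for T2) and the name `build_t3` are assumptions made for illustration; the slides do not say how the tables were stored or joined.

```python
from collections import defaultdict


def build_t3(t1, t2):
    """Join T1 rows (q, u, f_u) with T2 (u -> l).

    Returns a dict mapping (q, l) to (f_l, c_ul), where f_l is the click
    count for language l and c_ul is the number of unique URLs in l.
    """
    clicks = defaultdict(int)
    urls = defaultdict(set)
    for q, u, f_u in t1:
        lang = t2.get(u)
        if lang is None:
            continue  # assumption: URLs without an identified language are skipped
        clicks[(q, lang)] += f_u
        urls[(q, lang)].add(u)
    return {key: (clicks[key], len(urls[key])) for key in clicks}
```

For example, a query clicked on two Spanish pages (3 and 2 clicks) and one English page (1 click) yields the rows (q, es, 5, 2) and (q, en, 1, 1).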
Data Generation(4) • The data contains a lot of noise • 1. A query maps to more than one language. Solution: assign a weight w_{q,l} to each query-language pair and set a threshold parameter W; if w_{q,l} < W, remove the query • 2. Navigational queries. Example: ACL 2009
Data Generation(5) • Solution: set two threshold parameters F and U; if F_q > F or U_q < U, remove the query • Algorithm
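The two noise filters can be sketched together as below. Several things here are assumptions, because the slides do not define them: w_{q,l} is taken to be f_l / F_q, F_q is taken to be the total click count of q, and U_q the total number of unique clicked URLs of q. The removal condition is copied from the slide as written; the default thresholds are the values W = 1, F = 50, U = 5 quoted in the deck. The function name `filter_queries` is hypothetical.

```python
def filter_queries(t3, W=1.0, F=50, U=5):
    """Filter T3 rows; t3 maps (q, l) -> (f_l, c_ul). Returns q -> l."""
    total_clicks = {}
    total_urls = {}
    for (q, _), (f_l, c_ul) in t3.items():
        total_clicks[q] = total_clicks.get(q, 0) + f_l  # F_q (assumed definition)
        total_urls[q] = total_urls.get(q, 0) + c_ul     # U_q (assumed definition)

    kept = {}
    for (q, lang), (f_l, _) in t3.items():
        f_q, u_q = total_clicks[q], total_urls[q]
        if f_q == 0:
            continue  # no clicks at all, nothing to weigh
        if f_q > F or u_q < U:
            continue  # navigational-query filter, condition as on the slide
        if f_l / f_q < W:
            continue  # language weight w_{q,l} below the threshold W
        kept[q] = lang
    return kept
```

With W = 1, a query survives only if all of its clicks land on pages in a single language.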
Data Generation(6) • How to tune the parameters: depends on the size of the data set (Silverstein et al., 1999) • W = 1, F = 50, U = 5 • How many queries are filtered out: 5%~10% of the queries • Pick 500 queries randomly and have human annotators label them • Category-1: the query does not contain any foreign terms • Category-2: there exist some foreign terms, but the query would still be expected to bring web pages in the same language • Category-3: the query belongs to another language, or all the terms are foreign to the annotator
Data Generation(7) • How much of this multilinguality does the parameter selection eliminate? • Result: Category-1: 47.6%, Category-1+2: 60.2%
Language Identification(1) • Implement three models, each using a different existing feature: 1. statistical model 2. knowledge-based model 3. morphological model • EuroParl corpora • Combine all three models in a machine learning framework using a novel approach • Add a non-linguistic feature
Language Identification(2) • Test set: 3500 human-annotated queries
Statistical model • Character-based n-gram features (n = 1 to 7) • Vocabulary from the training corpus (EuroParl) • Generate a probability distribution from these counts • This can be done with the SRILM Toolkit, using Kneser-Ney discounting and interpolation
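To make the idea of a character n-gram language model concrete, here is a small self-contained sketch. It is deliberately simplified and is not what the slide describes in detail: it uses a single fixed order (n = 3) and add-one smoothing in place of SRILM's interpolated Kneser-Ney discounting, and the class and function names are hypothetical.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (a stand-in for Kneser-Ney)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.alphabet = set()

    def _pad(self, text):
        # Pad so that word-initial and word-final characters get full contexts.
        return " " * (self.n - 1) + text.lower() + " "

    def train(self, text):
        padded = self._pad(text)
        self.alphabet.update(padded)
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def log_prob(self, text):
        padded = self._pad(text)
        vocab = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log(
                (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + vocab)
            )
        return total


def identify(query, models):
    """Return the language whose model gives the query the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(query))
```

A real system would train one such model per language on a large corpus; with only a sentence or two of training text the scores are illustrative at best.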
Knowledge-based model • Word-based n-gram features (n = 1) • Vocabulary from the training corpus (EuroParl) • Generate a probability distribution from these counts
Morphological model • Gather affix information from the corpora in an unsupervised fashion (Harald Hammarström, 2006) • Give a score to each affix
Language Identification(3) • Performance
Decision tree classification • Each model can complement the others in certain cases • Training data: the automatically annotated data set • Feature: confidence score • Use the kurtosis measure
Decision tree classification • An example: for the query “the sovereign individual”, the statistical model identifies it as English with k = 7.6 > (4.47 + 1.96) = 6.43, so the query's confidence score is “en-HIGH” • The DT classifier is implemented with the Weka Machine Learning Toolkit (Witten and Frank, 2005)
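The kurtosis-based confidence score can be illustrated with the sketch below. The slide gives only the numbers for one example, so several details are assumptions: kurtosis is computed here as the non-excess fourth standardized moment of a model's per-language probabilities, the threshold is the sum 4.47 + 1.96 taken from the example, and "LOW" is a placeholder label for any score that is not HIGH. The function names are hypothetical.

```python
def kurtosis(values):
    """Fourth standardized moment (non-excess) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical: no peak to measure
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


def confidence_label(lang, probs, threshold=4.47 + 1.96):
    """Attach a confidence level to a model's top language.

    probs: the model's probability for every candidate language.
    A sharply peaked distribution has high kurtosis and is labelled HIGH;
    "LOW" is a placeholder for everything else (assumption).
    """
    level = "HIGH" if kurtosis(probs) > threshold else "LOW"
    return f"{lang}-{level}"
```

The intuition is that a model which puts nearly all of its probability on one language is more trustworthy than one whose scores are spread over several languages, and the decision tree receives that trust level as a feature.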
Decision tree classification • Outperforms all the models for each size on average
Decision tree classification • M_{li,lj}: language li misclassified by the system as lj
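A misclassification table of this kind can be tallied as in the sketch below. It assumes that each entry is a raw count of test queries; the slide does not say whether counts or rates were reported, and the function name is hypothetical.

```python
from collections import Counter


def misclassification_counts(gold, predicted):
    """Count, for each pair (li, lj), how often true language li was labelled lj."""
    counts = Counter()
    for true_lang, guess in zip(gold, predicted):
        if true_lang != guess:
            counts[(true_lang, guess)] += 1
    return counts
```

Reading the largest entries shows which language pairs the classifier confuses most often.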
Non-linguistic feature • The non-linguistic feature is the language information of the country • It helps the search engine guess the language • Example: the query “how to tape for plantar fasciits” (labelled as Category-2) is classified as a Portuguese query
non-linguistic feature • Increase test set size to 430 queries
Conclusions • A completely automated method to generate a reliable data set • Built a decision tree classifier that improves the results on average • Built a second classifier that takes into account the geographical information of the users
Future Work • Improve the accuracy of the data generation • Examine the parameter values more carefully • Extend the number of languages in the data set • Consider other alternatives to the decision tree framework