Comparing Word Relatedness Measures Based on Google n-grams. Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ, Faculty of Computer Science, Dalhousie University, Halifax, Canada. islam@cs.dal.ca, eem@cs.dal.ca, vlado@cs.dal.ca. COLING 2012.
Introduction • Word relatedness has a wide range of applications • IR: image retrieval, query expansion… • Paraphrase recognition • Malapropism detection and correction • Automatic creation of thesauri • Speech recognition • …
Introduction • Methods can be categorized into three groups: • Corpus-based • Supervised • Unsupervised • Knowledge-based • Uses semantic resources • Hybrid
Introduction • This paper focuses on unsupervised corpus-based measures • Six measures are compared
Problem • Unsupervised corpus-based measures usually use co-occurrence statistics, mostly word n-grams and their frequencies • These co-occurrence statistics are corpus-specific • Most corpora do not provide co-occurrence statistics, and thus cannot be used on-line • Some measures use web search results, but those results vary over time
Motivation • How can different measures be compared fairly? • Observation • These measures rely on co-occurrence statistics • A corpus with co-occurrence information, e.g. Google n-grams, is probably a good common resource
Google N-Grams • A publicly available corpus with • Co-occurrence statistics (uni-gram to 5-gram) • A large volume of text: not web text, but digitized books, over 5.2 million books published since 1500 • Data format: • ngram year match_count volume_count • e.g.: • analysis is often described as 1991 1 1
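The data format on the slide (ngram, year, match_count, volume_count) can be read with a short sketch; the tab-separated layout is an assumption, as the slide does not show the field delimiter.

```python
def parse_ngram_line(line):
    """Split one raw Google n-grams line into (ngram, year, match_count, volume_count).
    Assumes tab-separated fields; the n-gram itself may contain spaces, so we
    split from the right to keep it intact."""
    ngram, year, match_count, volume_count = line.rstrip("\n").rsplit("\t", 3)
    return ngram, int(year), int(match_count), int(volume_count)

# The 5-gram example from the slide:
print(parse_ngram_line("analysis is often described as\t1991\t1\t1"))
```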
Another Motivation • To find an indirect mapping between Google n-grams and web search results • Thus, the measures might be used on-line
How About WordNet? • In 2006, Budanitsky and Hirst evaluated five knowledge-based measures using WordNet • Creating a resource like WordNet requires a lot of effort • Its word coverage is not sufficient for many NLP tasks • The resource is language-specific, while Google n-grams covers more than 10 languages
Notations • C(w1 … wn) • Frequency of the n-gram • D(w1 … wn) • Number of web documents containing the n-gram (up to 5-grams) • M(w1, w2) • C(w1 wi w2), the frequency of tri-grams with any word wi between w1 and w2
Notations • (w1, w2) • 1/2 [ C(w1 wi w2) + C(w2 wi w1) ] • N • Number of documents used in Google n-grams • |V| • Number of uni-grams in Google n-grams • Cmax • Maximum frequency in Google n-grams
Assumptions • Some measures use web search results, i.e., co-occurrence information not provided by Google n-grams, but • C(w1) ≥ D(w1) • C(w1 w2) ≥ D(w1 w2) • This is because uni-grams and bi-grams may occur multiple times in a single document
Assumptions • Taking the lower limits as approximations • C(w1) ≈ D(w1) • C(w1 w2) ≈ D(w1 w2)
Measures • Jaccard Coefficient • Simpson Coefficient
Measures • Dice Coefficient • Pointwise Mutual Information
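The slides list the first four measures by name only. A sketch of their standard count-based formulations follows, with C(·) substituted for document counts D(·) under the approximation assumption from the previous slides; the exact variants used in the paper may differ.

```python
import math

def jaccard(c1, c2, c12):
    """Standard Jaccard coefficient over co-occurrence counts."""
    return c12 / (c1 + c2 - c12)

def simpson(c1, c2, c12):
    """Simpson (overlap) coefficient: co-occurrence over the rarer word's count."""
    return c12 / min(c1, c2)

def dice(c1, c2, c12):
    """Standard Dice coefficient."""
    return 2 * c12 / (c1 + c2)

def pmi(c1, c2, c12, n):
    """Pointwise mutual information (log base 2), with n the corpus size
    (number of documents, per the Notations slide)."""
    return math.log2((c12 * n) / (c1 * c2))
```

For example, with C(w1) = 10, C(w2) = 20, and C(w1 w2) = 5, Jaccard gives 5/25 = 0.2 while Dice gives 10/30 ≈ 0.33.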
Measures • Normalized Google Distance (NGD) variation
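The slide does not reproduce the formula of the NGD variation; the standard Normalized Google Distance (Cilibrasi and Vitányi) that it builds on can be sketched as follows, using document counts D(·) and corpus size N from the Notations slides. The paper's variation may transform this quantity further.

```python
import math

def ngd(d1, d2, d12, n):
    """Standard Normalized Google Distance over document counts.
    d1, d2: document counts of the two words; d12: joint count; n: corpus size.
    0 means maximally related; larger values mean less related."""
    log_d1, log_d2 = math.log(d1), math.log(d2)
    return (max(log_d1, log_d2) - math.log(d12)) / (math.log(n) - min(log_d1, log_d2))
```

When the two words always co-occur (d1 = d2 = d12), the distance is 0.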
Measures • Relatedness based on Tri-grams (RT)
Evaluation • Compare with human judgments • Human agreement is considered to be the upper limit • Evaluate the measures with respect to a particular application • Relatedness of words • Text similarity
Compare With Human Judgments • Rubenstein and Goodenough's 65 word pairs • 51 people rated 65 pairs of English words on a scale of 0.0 to 4.0 • Miller and Charles' 28 noun pairs • Restricted R&G's set to 30 pairs, rated by 38 human judges • Most researchers use 28 pairs because 2 were omitted from an early version of WordNet
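Agreement with these human ratings is measured by correlation (the Results slide reports Pearson's r). A minimal sketch of Pearson's correlation between a measure's scores and the mean human ratings:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A measure whose scores are a positive linear function of the human ratings scores a perfect r = 1.0.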
Application-based Evaluation • TOEFL's 80 synonym questions • Given a problem word, e.g. infinite, and four alternative words, limitless, relative, unusual, and structural, choose the most related word • ESL's 50 synonym questions • Same as the TOEFL synonym questions task • Except that the synonym questions are from English as a Second Language tests
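On the synonym tasks, each relatedness measure answers a question by picking the alternative with the highest score against the problem word. A sketch using the slide's example question, with toy scores standing in for a real measure (the scores are illustrative, not from the paper):

```python
def answer_synonym_question(problem_word, alternatives, relatedness):
    """Pick the alternative most related to the problem word.
    `relatedness` is any word-relatedness measure (a callable here)."""
    return max(alternatives, key=lambda w: relatedness(problem_word, w))

# Hypothetical scores for the example question (illustrative only):
toy_scores = {("infinite", "limitless"): 0.9, ("infinite", "relative"): 0.2,
              ("infinite", "unusual"): 0.1, ("infinite", "structural"): 0.05}
best = answer_synonym_question("infinite",
                               ["limitless", "relative", "unusual", "structural"],
                               lambda a, b: toy_scores[(a, b)])
print(best)  # limitless
```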
Text Similarity • Find the similarity between two text items • Plug the different relatedness measures into a single text similarity measure, and evaluate its results on a standard data set • 30 sentence pairs from one of the most widely used data sets were used
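One common way such a text similarity measure can be built on top of word relatedness is best-match averaging; this is only a sketch of that general scheme, not the paper's exact method (which follows Islam and Inkpen (2008)).

```python
def text_similarity(words_a, words_b, relatedness):
    """Simple best-match scheme: for each word in one text, take its best
    relatedness score against the other text, average, and symmetrize.
    `relatedness` is any word-relatedness measure (a callable here)."""
    def directed(src, dst):
        return sum(max(relatedness(w, v) for v in dst) for w in src) / len(src)
    return 0.5 * (directed(words_a, words_b) + directed(words_b, words_a))
```

With exact-match relatedness (1.0 for identical words, 0.0 otherwise), identical texts score 1.0 and disjoint texts score 0.0.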
Result • Pearson correlation coefficient with mean human similarity ratings: • Ho et al. (2010), who used a WordNet-based measure and applied those scores in Islam and Inkpen (2008), achieved 0.895 • Tsatsaronis et al. (2010) achieved 0.856 • Islam et al. (2012) achieved 0.916 • The improvement over Ho et al. (2010) is statistically significant at the 0.05 level
Conclusion • Any measure that uses n-gram statistics can easily apply the Google n-gram corpus and be fairly evaluated against existing work on standard data sets for different tasks • An indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine was found using some assumptions
Conclusion • Measures based on n-grams are language-independent • They can be implemented for another language if a sufficiently large n-gram corpus exists for it