Comparing Word Relatedness Measures Based on Google n-grams. Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ, Faculty of Computer Science, Dalhousie University, Halifax, Canada. islam@cs.dal.ca, eem@cs.dal.ca, vlado@cs.dal.ca. COLING 2012.
Introduction • Word relatedness has a wide range of applications • IR: image retrieval, query expansion… • Paraphrase recognition • Malapropism detection and correction • Automatic creation of thesauri • Speech recognition • …
Introduction • Methods can be categorized into three groups: • Corpus-based • Supervised • Unsupervised • Knowledge-based • Uses semantic resources • Hybrid
Introduction • This paper focuses on unsupervised corpus-based measures • Six measures are compared
Problem • Unsupervised corpus-based measures usually use co-occurrence statistics, mostly word n-grams and their frequencies • These co-occurrence statistics are corpus-specific • Most corpora do not provide co-occurrence statistics, and thus cannot be used on-line • Some measures use web search results, but those results vary over time
Motivation • How can different measures be compared fairly? • Observation • These measures rely on co-occurrence statistics • A corpus with co-occurrence information, e.g. Google n-grams, is probably a good common resource
Google N-Grams • A publicly available corpus with • Co-occurrence statistics (uni-gram to 5-gram) • A large volume of text: not web text, but digitized books, over 5.2 million books published since 1500 • Data format: • ngram year match_count volume_count • e.g.: • analysis is often described as 1991 1 1
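The data format on the slide (ngram, year, match_count, volume_count) can be read with a short sketch; the tab-separated layout is an assumption, as the slide does not show the field delimiter.

```python
def parse_ngram_line(line):
    """Split one raw Google n-grams line into (ngram, year, match_count, volume_count).
    Assumes tab-separated fields; the n-gram itself may contain spaces, so we
    split from the right to keep it intact."""
    ngram, year, match_count, volume_count = line.rstrip("\n").rsplit("\t", 3)
    return ngram, int(year), int(match_count), int(volume_count)

# The 5-gram example from the slide:
print(parse_ngram_line("analysis is often described as\t1991\t1\t1"))
```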
Another Motivation • To find an indirect mapping between Google n-grams and web search results • Thus, the measures might be used on-line
How About WordNet? • In 2006, Budanitsky and Hirst evaluated five knowledge-based measures using WordNet • Creating a resource like WordNet requires a lot of effort • Its word coverage is not sufficient for many NLP tasks • The resource is language-specific, while Google n-grams covers more than 10 languages
Notations • C(w1 … wn) • Frequency of the n-gram • D(w1 … wn) • Number of web documents containing the n-gram (up to 5-grams) • M(w1, w2) • C(w1 wi w2), the frequency of tri-grams with any word wi between w1 and w2
Notations • (w1, w2) • 1/2 [ C(w1 wi w2) + C(w2 wi w1) ] • N • Number of documents used in Google n-grams • |V| • Number of uni-grams in Google n-grams • Cmax • Maximum frequency in Google n-grams
Assumptions • Some measures use web search results, i.e., co-occurrence information not provided by Google n-grams, but • C(w1) ≥ D(w1) • C(w1 w2) ≥ D(w1 w2) • This is because uni-grams and bi-grams may occur multiple times in a single document
Assumptions • Taking the lower limits as approximations • C(w1) ≈ D(w1) • C(w1 w2) ≈ D(w1 w2)
Measures • Jaccard Coefficient • Simpson Coefficient
Measures • Dice Coefficient • Pointwise Mutual Information
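The slides list the first four measures by name only. A sketch of their standard count-based formulations follows, with C(·) substituted for document counts D(·) under the approximation assumption from the previous slides; the exact variants used in the paper may differ.

```python
import math

def jaccard(c1, c2, c12):
    """Standard Jaccard coefficient over co-occurrence counts."""
    return c12 / (c1 + c2 - c12)

def simpson(c1, c2, c12):
    """Simpson (overlap) coefficient: co-occurrence over the rarer word's count."""
    return c12 / min(c1, c2)

def dice(c1, c2, c12):
    """Standard Dice coefficient."""
    return 2 * c12 / (c1 + c2)

def pmi(c1, c2, c12, n):
    """Pointwise mutual information (log base 2), with n the corpus size
    (number of documents, per the Notations slide)."""
    return math.log2((c12 * n) / (c1 * c2))
```

For example, with C(w1) = 10, C(w2) = 20, and C(w1 w2) = 5, Jaccard gives 5/25 = 0.2 while Dice gives 10/30 ≈ 0.33.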
Measures • Normalized Google Distance (NGD) variation
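The slide does not reproduce the formula of the NGD variation; the standard Normalized Google Distance (Cilibrasi and Vitányi) that it builds on can be sketched as follows, using document counts D(·) and corpus size N from the Notations slides. The paper's variation may transform this quantity further.

```python
import math

def ngd(d1, d2, d12, n):
    """Standard Normalized Google Distance over document counts.
    d1, d2: document counts of the two words; d12: joint count; n: corpus size.
    0 means maximally related; larger values mean less related."""
    log_d1, log_d2 = math.log(d1), math.log(d2)
    return (max(log_d1, log_d2) - math.log(d12)) / (math.log(n) - min(log_d1, log_d2))
```

When the two words always co-occur (d1 = d2 = d12), the distance is 0.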
Measures • Relatedness based on Tri-grams (RT)
Evaluation • Compare with human judgments • Human agreement is considered to be the upper limit • Evaluate the measures with respect to a particular application • Relatedness of words • Text similarity
Compare With Human Judgments • Rubenstein and Goodenough's 65 word pairs • 51 people rated 65 pairs of English words on a scale of 0.0 to 4.0 • Miller and Charles' 28 noun pairs • Restricted R&G's set to 30 pairs, rated by 38 human judges • Most researchers use 28 pairs because 2 were omitted from an early version of WordNet
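Agreement with these human ratings is measured by correlation (the Results slide reports Pearson's r). A minimal sketch of Pearson's correlation between a measure's scores and the mean human ratings:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A measure whose scores are a positive linear function of the human ratings scores a perfect r = 1.0.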
Application-based Evaluation • TOEFL's 80 synonym questions • Given a problem word, e.g. infinite, and four alternative words, limitless, relative, unusual, and structural, choose the most related word • ESL's 50 synonym questions • Same as the TOEFL synonym questions task • Except that the synonym questions are from English as a Second Language tests
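On the synonym tasks, each relatedness measure answers a question by picking the alternative with the highest score against the problem word. A sketch using the slide's example question, with toy scores standing in for a real measure (the scores are illustrative, not from the paper):

```python
def answer_synonym_question(problem_word, alternatives, relatedness):
    """Pick the alternative most related to the problem word.
    `relatedness` is any word-relatedness measure (a callable here)."""
    return max(alternatives, key=lambda w: relatedness(problem_word, w))

# Hypothetical scores for the example question (illustrative only):
toy_scores = {("infinite", "limitless"): 0.9, ("infinite", "relative"): 0.2,
              ("infinite", "unusual"): 0.1, ("infinite", "structural"): 0.05}
best = answer_synonym_question("infinite",
                               ["limitless", "relative", "unusual", "structural"],
                               lambda a, b: toy_scores[(a, b)])
print(best)  # limitless
```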
Text Similarity • Find the similarity between two text items • Plug the different relatedness measures into a single text similarity measure, and evaluate its results on a standard data set • 30 sentence pairs from one of the most widely used data sets were used
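One common way such a text similarity measure can be built on top of word relatedness is best-match averaging; this is only a sketch of that general scheme, not the paper's exact method (which follows Islam and Inkpen (2008)).

```python
def text_similarity(words_a, words_b, relatedness):
    """Simple best-match scheme: for each word in one text, take its best
    relatedness score against the other text, average, and symmetrize.
    `relatedness` is any word-relatedness measure (a callable here)."""
    def directed(src, dst):
        return sum(max(relatedness(w, v) for v in dst) for w in src) / len(src)
    return 0.5 * (directed(words_a, words_b) + directed(words_b, words_a))
```

With exact-match relatedness (1.0 for identical words, 0.0 otherwise), identical texts score 1.0 and disjoint texts score 0.0.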
Result • Pearson correlation coefficient with mean human similarity ratings: • Ho et al. (2010), who used a WordNet-based measure and applied those scores in Islam and Inkpen (2008), achieved 0.895 • Tsatsaronis et al. (2010) achieved 0.856 • Islam et al. (2012) achieved 0.916 • The improvement over Ho et al. (2010) is statistically significant at the 0.05 level
Conclusion • Any measure that uses n-gram statistics can easily apply the Google n-gram corpus and be fairly evaluated against existing work on standard data sets for different tasks • An indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine was found using some assumptions
Conclusion • Measures based on n-grams are language-independent • They can be implemented for another language if a sufficiently large n-gram corpus exists for it