Frequency Estimates for Statistical Word Similarity Measures

Egidio Terra and C.L.A. Clarke
School of Computer Science, University of Waterloo
Presenter: Cosmin Adrian Bejan

Introduction
• A comparative study of two methods for estimating the word co-occurrence frequencies required by word similarity measures used to solve human-oriented language tests.
• Examples of such tests:
  • determine the best synonym in a set of alternatives A = {A1, A2, A3, A4} for a specific target word TW in a context C = {w1’, w2’, …, wn’} \ TW;
  • determine the best synonym when no context is available.

Measuring Word Similarity
• The co-occurrence of two words can be depicted by a contingency table:
  • each dimension represents a discrete random variable Wi with range A = {wi, ¬wi} (the word is present or absent);
  • each cell represents a joint frequency, where Nmax is the maximum number of co-occurrences.
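To make the contingency table concrete, here is a minimal sketch that fills its four cells from raw counts. The function name `contingency_table` and the example counts are illustrative assumptions, not taken from the slides.

```python
def contingency_table(f_w1, f_w2, f_w1_w2, n_max):
    """Build the 2x2 joint-frequency table for two words.

    Rows index W1 (present, absent); columns index W2 (present, absent).
    n_max plays the role of Nmax, the maximum number of co-occurrences.
    """
    both = f_w1_w2                            # w1 and w2
    only_w1 = f_w1 - f_w1_w2                  # w1 without w2
    only_w2 = f_w2 - f_w1_w2                  # w2 without w1
    neither = n_max - f_w1 - f_w2 + f_w1_w2   # neither word
    if min(both, only_w1, only_w2, neither) < 0:
        raise ValueError("counts are inconsistent with n_max")
    return [[both, only_w1], [only_w2, neither]]


# Hypothetical counts, chosen only for illustration.
table = contingency_table(f_w1=30, f_w2=20, f_w1_w2=10, n_max=1000)
```

The four cells always sum to `n_max`, which is a quick sanity check on the input counts.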

Similarity between two words
• Pointwise Mutual Information
• χ²-test
• Likelihood ratio
• Average Mutual Information
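The slide lists these measures by name only. The sketch below computes each of them from a 2x2 contingency table using the standard textbook definitions (PMI in bits, Pearson's χ², the G² form of the likelihood ratio, and mutual information averaged over all four cells). These forms are an assumption and may differ in detail from the ones used by the authors. The function name `association_scores` is hypothetical, and all marginal totals are assumed to be positive.

```python
import math


def association_scores(table):
    """Score a 2x2 table [[w1&w2, w1&~w2], [~w1&w2, ~w1&~w2]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)   # marginal counts for W1 present / absent
    cols = (a + c, b + d)   # marginal counts for W2 present / absent

    # Pointwise mutual information of the (w1, w2) cell, in bits.
    if a > 0:
        pmi = math.log2((a / n) / ((rows[0] / n) * (cols[0] / n)))
    else:
        pmi = float("-inf")

    chi2 = g2 = ami = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n   # count under independence
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:
                ratio = observed / expected
                g2 += 2 * observed * math.log(ratio)
                ami += (observed / n) * math.log2(ratio)
    return {"pmi": pmi, "chi2": chi2, "g2": g2, "ami": ami}


# Hypothetical counts, chosen only for illustration.
scores = association_scores([[10, 20], [10, 960]])
independent = association_scores([[1, 9], [9, 81]])
```

With these definitions G² equals 2 · N · ln 2 times the average mutual information, so the two measures rank word pairs identically when N is fixed. For a table whose cells match the independence expectation, PMI, χ², and G² are all zero up to rounding.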

Context-supported similarity
• Cosine of Pointwise Mutual Information
• L1 norm
• Contextual Average Mutual Information
• Contextual Jensen-Shannon Divergence
• Pointwise Mutual Information of Multiple Words
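Three of the context-based measures rest on generic vector and distribution comparisons, sketched below. The slide gives no formulas, so these are the standard definitions of cosine similarity, L1 distance, and Jensen-Shannon divergence (in bits). How the vectors are built from context words, for example one PMI value or one conditional probability per context word, is an assumption, and all function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors, e.g. PMI values per context word."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(x - y) for x, y in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p(x) = 0 contribute 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions over two context words.
same = jensen_shannon([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon([1.0, 0.0], [0.0, 1.0])
```

Cosine is a similarity (higher means more alike), while L1 and Jensen-Shannon are distances (lower means more alike), so a synonym test would pick the maximum of the former and the minimum of the latter.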

Window-oriented approach
• fw_i: frequency of wi
• fw_1,w_2: co-occurrence frequency of w1 and w2
• N: size of the corpus in words
• P(wi) = fw_i / N
• fw_1,w_2 is estimated by the number of windows in which the two words co-occur.
• Nwt: number of windows of size t
• P(w1, w2) = fw_1,w_2 / Nwt
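The window-oriented estimates can be sketched as follows. This sketch assumes a window that slides one token at a time over a tokenized corpus, giving N - t + 1 windows of size t. The slides do not say how windows are placed, so that choice and the name `window_estimates` are assumptions.

```python
def window_estimates(tokens, w1, w2, t):
    """Estimate P(w1), P(w2) and P(w1, w2) with sliding windows of size t."""
    n = len(tokens)
    if t < 1 or n < t:
        raise ValueError("window size must be between 1 and the corpus size")
    n_windows = n - t + 1   # Nwt under the sliding-window assumption

    # Count the windows that contain both words.
    cooccurrences = 0
    for start in range(n_windows):
        window = tokens[start:start + t]
        if w1 in window and w2 in window:
            cooccurrences += 1

    p_w1 = tokens.count(w1) / n           # P(wi) = fw_i / N
    p_w2 = tokens.count(w2) / n
    p_joint = cooccurrences / n_windows   # P(w1, w2) = fw_1,w_2 / Nwt
    return p_w1, p_w2, p_joint


# Toy corpus, chosen only for illustration.
p_a, p_b, p_ab = window_estimates("a b c a d b".split(), "a", "b", t=2)
```

In the toy corpus only the first of the five windows of size 2 contains both words, so the joint estimate is 1/5.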

Document-oriented approach
• dfw_i: document frequency of word wi, the number of documents in which the word appears
• D: the number of documents
• P(wi) = dfw_i / D
• dfw_1,w_2: co-occurrence frequency of two words, the number of documents in which both words appear
• P(w1, w2) = dfw_1,w_2 / D
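A minimal sketch of the document-oriented estimates, assuming each document is already tokenized into a list of words. The name `document_estimates` and the toy collection are illustrative.

```python
def document_estimates(documents, w1, w2):
    """Estimate P(w1), P(w2) and P(w1, w2) from document frequencies."""
    if not documents:
        raise ValueError("at least one document is required")
    d = len(documents)   # D, the number of documents
    vocabularies = [set(doc) for doc in documents]

    df_w1 = sum(w1 in vocab for vocab in vocabularies)
    df_w2 = sum(w2 in vocab for vocab in vocabularies)
    df_both = sum(w1 in vocab and w2 in vocab for vocab in vocabularies)

    return df_w1 / d, df_w2 / d, df_both / d


# Toy collection, chosen only for illustration.
docs = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
p_a, p_b, p_ab = document_estimates(docs, "a", "b")
```

Unlike the window-oriented estimate, a word repeated within one document is counted once, and two words co-occur regardless of how far apart they are inside the document.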

Results for TOEFL test set

Results for TS1 and context
