Frequency Estimates for Statistical Word Similarity Measures. Egidio Terra and C.L.A. Clarke, School of Computer Science, University of Waterloo. Presenter: Cosmin Adrian Bejan.
Introduction
• A comparative study of two methods for estimating the word co-occurrence frequencies required by word similarity measures, applied to solving human-oriented language tests.
• Examples of such tests:
  • determine the best synonym in a set of alternatives A = {A1, A2, A3, A4} for a specific target word TW in a context C = {w1', w2', …, wn'} \ TW;
  • determine the best synonym when no context is available.
Measuring Word Similarity
• The notion of co-occurrence of two words can be depicted by a contingency table.
• Each dimension represents a discrete random variable W_i with range A = {w_i, ¬w_i} (presence or absence of word w_i).
• Each cell represents a joint frequency, where N_max is the maximum number of co-occurrences.
Similarity between two words
• Pointwise Mutual Information
• χ² test
• Likelihood ratio
• Average Mutual Information
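As an illustration of the first measure listed above, the following is a minimal sketch of pointwise mutual information used to pick the best synonym when no context is available. The slide's own formula is not preserved here, so the sketch assumes the standard definition, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))). The function names `pmi` and `best_alternative` and the dictionary layout are hypothetical and chosen only for this example.

```python
import math


def pmi(p_joint, p_w1, p_w2):
    """Pointwise mutual information in bits (standard definition, assumed here)."""
    # Words that never co-occur get the lowest possible score.
    if p_joint == 0.0:
        return float("-inf")
    return math.log2(p_joint / (p_w1 * p_w2))


def best_alternative(target, alternatives, p, p_joint):
    """Return the alternative with the highest PMI against the target word.

    p       maps a word to its estimated probability (hypothetical layout)
    p_joint maps a (target, alternative) pair to its joint probability
    """
    return max(
        alternatives,
        key=lambda alt: pmi(p_joint.get((target, alt), 0.0), p[target], p[alt]),
    )
```

The probabilities passed in can come from either of the two frequency estimates compared in this study; the scoring function itself does not depend on how they were obtained.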
Context-supported similarity
• Cosine of Pointwise Mutual Information
• L1 norm
• Contextual Average Mutual Information
• Contextual Jensen-Shannon Divergence
• Pointwise Mutual Information of Multiple Words
Window-oriented approach
• f_{w_i}: frequency of word w_i in the corpus
• f_{w_1,w_2}: co-occurrence frequency of w_1 and w_2, estimated by the number of windows in which the two words co-occur
• N: size of the corpus in words
• N_wt: number of windows of size t
• P(w_i) = f_{w_i} / N
• P(w_1, w_2) = f_{w_1,w_2} / N_wt
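The window-oriented estimates above can be sketched as follows. This is only an illustration under stated assumptions: the corpus is a flat list of tokens, and windows of size t slide one token at a time, which gives N - t + 1 windows. The slide does not say how windows are placed, so that choice, along with the function name `window_estimates`, is hypothetical.

```python
def window_estimates(tokens, w1, w2, t):
    """Estimate P(w1), P(w2) and P(w1, w2) from sliding windows of size t.

    Assumes windows advance one token at a time (an assumption of this sketch).
    """
    n = len(tokens)
    if n == 0:
        return 0.0, 0.0, 0.0

    # Number of windows of size t that fit in the corpus.
    n_windows = max(n - t + 1, 0)

    # Count the windows in which both words appear.
    f_joint = 0
    for start in range(n_windows):
        window = tokens[start:start + t]
        if w1 in window and w2 in window:
            f_joint += 1

    p_w1 = tokens.count(w1) / n          # P(w_i) = f_{w_i} / N
    p_w2 = tokens.count(w2) / n
    p_joint = f_joint / n_windows if n_windows else 0.0
    return p_w1, p_w2, p_joint
```

For example, on the six-token corpus `"a b c a b d".split()` with t = 2, two of the five windows contain both `a` and `b`, so the joint estimate is 2/5.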
Document-oriented approach
• df_{w_i}: document frequency of word w_i, that is, the number of documents in which the word appears
• df_{w_1,w_2}: co-occurrence frequency of two words, that is, the number of documents in which both words appear
• D: number of documents
• P(w_i) = df_{w_i} / D
• P(w_1, w_2) = df_{w_1,w_2} / D
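The document-oriented estimates can be sketched in the same way. The sketch assumes each document is already tokenized into a list of words; the function name `document_estimates` and the input layout are hypothetical.

```python
def document_estimates(documents, w1, w2):
    """Estimate P(w1), P(w2) and P(w1, w2) from document frequencies."""
    d = len(documents)
    if d == 0:
        return 0.0, 0.0, 0.0

    # A word counts once per document, however often it occurs there.
    vocab_per_doc = [set(doc) for doc in documents]

    df_w1 = sum(1 for vocab in vocab_per_doc if w1 in vocab)
    df_w2 = sum(1 for vocab in vocab_per_doc if w2 in vocab)
    df_joint = sum(1 for vocab in vocab_per_doc if w1 in vocab and w2 in vocab)

    # P(w_i) = df_{w_i} / D and P(w_1, w_2) = df_{w_1,w_2} / D
    return df_w1 / d, df_w2 / d, df_joint / d
```

Unlike the window-oriented estimate, both the marginal and the joint probabilities share the same denominator D here, and the distance between the two words inside a document plays no role.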