
Computing Semantic Similarities based on Machine-Readable Dictionaries

Abstract. If two words have similar definitions, they are semantically similar. A definition is represented by a definition vector. Each dimension represents a word in the dictionary.

leann

Presentation Transcript


  1. Computing Semantic Similarities based on Machine-Readable Dictionaries

  2. Abstract • If two words have similar definitions, they are semantically similar. • A definition is represented by a definition vector. • Each dimension represents a word in the dictionary. • The score of each dimension in the vector is calculated by a variation of tf*idf.

  3. Introduction • Machine-Readable Dictionaries (MRDs) contain human-encoded knowledge about words. • Transforming a hard-copy dictionary into a machine-readable one is far easier than building a new lexical ontology.

  4. Basic Idea • If two words have something in common, their definitions will also share some words. • Two definition vectors generated from definitions
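The basic idea can be illustrated with a minimal sketch. The toy dictionary below is made up for illustration and is not taken from LDCE or MCSD, and the helper names `definition_vector` and `shared_words` are hypothetical. Raw word counts stand in for the weighted scores used in the presentation.

```python
from collections import Counter

# Hypothetical toy dictionary; the entries are invented for illustration only.
toy_dictionary = {
    "car": "a road vehicle with an engine and four wheels",
    "truck": "a large road vehicle with an engine used to carry goods",
    "banana": "a long curved fruit with a yellow skin",
}


def definition_vector(word, dictionary):
    """Return the raw word counts of the definition of `word`."""
    return Counter(dictionary[word].split())


def shared_words(a, b, dictionary):
    """Return the set of words that occur in both definitions."""
    return set(definition_vector(a, dictionary)) & set(definition_vector(b, dictionary))


if __name__ == "__main__":
    print(sorted(shared_words("car", "truck", toy_dictionary)))
    print(sorted(shared_words("car", "banana", toy_dictionary)))
```

In this toy data the related pair shares content words such as "road" and "vehicle", while the unrelated pair shares only function words, which is why raw counts alone are not enough and a weighting step is needed.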

  5. Dictionaries and Preparation • Two machine-readable dictionaries: • Longman Dictionary of Contemporary English (LDCE) • Definitions use only about 2000 English words • Modern Chinese Standardized Dictionary (MCSD) • POS: BMM (backward maximum matching algorithm) • Dictionary definitions are formal and regular • Fewer ambiguities in definitions than in free text • Good results can therefore be obtained

  6. Measuring Similarities - 1/6 • Let W be the set of all words in a dictionary D. • W = {w_1, w_2, w_3, ..., w_max} • Through E, we have the definition vector of a word w.

  7. Measuring Similarities - 2/6 • i is the iteration number. • If a word a occurs in the definition of word b but seldom occurs in the definitions of other words, then a is important for the explanation of b.

  8. Measuring Similarities - 3/6 • This paper uses r(w, w_l) to measure the association between w and a word w_l in its definition. • tf(w, w_l): the number of occurrences of w_l in w's definitions. • ef(w_l): the number of words that have w_l in their definitions.
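The two counts defined on this slide can be sketched as follows. The transcript does not preserve the formula for r, so `assumed_weight` below is only an illustrative tf*idf-style combination, tf times log(N / ef), and should not be read as the paper's definition. The toy dictionary and all function names are hypothetical.

```python
import math

# Hypothetical toy dictionary; the entries are invented for illustration only.
toy_dictionary = {
    "car": "a road vehicle with an engine and four wheels",
    "truck": "a large road vehicle with an engine used to carry goods",
    "banana": "a long curved fruit with a yellow skin",
}


def tf(w, wl, dictionary):
    """Number of occurrences of `wl` in the definition of `w`."""
    return dictionary[w].split().count(wl)


def ef(wl, dictionary):
    """Number of entry words whose definition contains `wl`."""
    return sum(1 for text in dictionary.values() if wl in text.split())


def assumed_weight(w, wl, dictionary):
    """ASSUMPTION: a plain tf * log(N / ef) score, not the paper's r(w, wl)."""
    entries = ef(wl, dictionary)
    if entries == 0:
        return 0.0
    return tf(w, wl, dictionary) * math.log(len(dictionary) / entries)


if __name__ == "__main__":
    for wl in ("a", "road", "wheels"):
        print(wl, tf("car", wl, toy_dictionary), ef(wl, toy_dictionary),
              round(assumed_weight("car", wl, toy_dictionary), 3))
```

Under this assumed score, a word that appears in every definition gets weight zero, and a word that appears in few definitions gets a higher weight, which matches the intuition stated on the previous slide.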

  9. Measuring Similarities - 4/6 • C(w_l, w) is iteratively calculated as:

  10. Measuring Similarities – 5/6 • S: a set of stop words • δ_{β,N}: top βN words; β is set to 0.6 • w’ ∈ e(w) ∧ (w’, w_l) ∈ E_{i−1} • α is a weighting parameter. Words from the i-th iteration have a weight of α^(i−1).
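The stop-word filter and the top βN cut-off on this slide might be sketched as below. This is only an illustration under stated assumptions: N is taken to be the number of words left after stop-word removal, βN is rounded up, and ties are broken alphabetically. None of these details are given in the transcript, and `prune_vector` is a hypothetical name.

```python
import math


def prune_vector(vector, stop_words, beta=0.6):
    """Drop stop words, then keep the top beta * N entries by score.

    ASSUMPTIONS: N is the vector size after stop-word removal, the cut-off
    is rounded up, and ties are broken alphabetically.
    """
    kept = {word: score for word, score in vector.items() if word not in stop_words}
    k = math.ceil(beta * len(kept))
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


if __name__ == "__main__":
    scores = {"a": 5.0, "road": 2.0, "vehicle": 3.0, "engine": 1.0, "wheels": 4.0}
    print(prune_vector(scores, stop_words={"a"}))
```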

  11. Measuring Similarities – 6/6 • Pearson’s product-moment correlation coefficient
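As a minimal sketch, Pearson's product-moment correlation between two definition vectors can be computed as below. Aligning the two vectors over the union of their words, with missing words scored as 0, is an assumption made here to keep the example small; the presentation uses one dimension per dictionary word. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector; 0.0 is a convention
    return cov / math.sqrt(var_x * var_y)


def vector_similarity(u, v):
    """Correlate two sparse vectors (dicts mapping word -> score).

    ASSUMPTION: dimensions are the union of both key sets; absent words score 0.
    """
    dims = sorted(set(u) | set(v))
    return pearson([u.get(d, 0.0) for d in dims], [v.get(d, 0.0) for d in dims])


if __name__ == "__main__":
    car = {"road": 0.4, "vehicle": 0.4, "wheels": 1.1}
    truck = {"road": 0.4, "vehicle": 0.4, "goods": 1.1}
    print(vector_similarity(car, truck))
```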

  12. Evaluation

  13. Evaluation - Chinese set, M&C data set

  14. Evaluation - Chinese set

  15. Evaluation - Chinese set • Mdc: • Data sparseness problem of the Chinese Web • Search results in Chinese contain many more duplicates than their English counterparts. • Therefore, the double-checking approach is not suitable for the Chinese task. • Mhow: • Mhow outputs 1 for ”鳥, 鶴” (bird, crane), since ”鳥” is the hypernym of ”鶴”. • But for ”熔爐” (furnace) and ”火爐” (stove), Mhow outputs very low similarity, which is contrary to human intuition.

  16. Evaluation – M&C Data Set • Fig. 4 shows the effect of different α and iteration counts on the English data set.

  17. Conclusions • A novel method that uses a dictionary as the main resource for measuring word similarities. • Each dimension of the vector represents a word, and its value represents the importance of that word in the definition. • The importance value is calculated in a way similar to tf*idf over definitions.
