Measuring Semantic Similarity between Words Using HowNet. ICCSIT 2008. Liuling DAI, Yuning XIA, Bin LIU, ShiKun WU. School of Computer Science, Beijing Institute of Technology.
HowNet • W_C=工夫 • DEF={Ability|能力:host={human|人}} • DEF={Strength|力量:host={group|群體}{human|人}} • DEF={time|時間} • Word: 工夫 • Concept: {Ability|能力:host={human|人}} • Sememe: Ability|能力
Algorithms • Similarity between sememes • Similarity between concepts • Similarity between words • Amendment with thesaurus
Similarity between sememes • Strategy 1 • Strategy 2 • d: Distance between sememes S1 and S2 in the sememe hierarchy • h: Depth of the first common parent node of the two sememes • α, β: Parameters that adjust the contributions of d and h
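The two quantities d and h can be computed from any tree of sememes. The sketch below does this on a small hypothetical hierarchy (the node names and the `PARENT` table are illustrative, not taken from HowNet). The formulas for Strategy 1 and Strategy 2 are not reproduced in this text, so `sememe_similarity` uses a stand-in form, exp(-αd)·tanh(βh), purely as an assumption to show how d, h, α and β could interact. The default parameter values are arbitrary.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not real HowNet data.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "attribute": "entity",
    "ability": "attribute",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def distance_and_depth(s1, s2):
    """Return (d, h): path length between s1 and s2, and the depth of
    their first common parent (the root has depth 1 in this sketch)."""
    p1, p2 = path_to_root(s1), path_to_root(s2)
    steps_from_s1 = {n: i for i, n in enumerate(p1)}
    for steps_from_s2, n in enumerate(p2):
        if n in steps_from_s1:
            d = steps_from_s1[n] + steps_from_s2
            h = len(path_to_root(n))
            return d, h
    raise ValueError("sememes share no common ancestor")


def sememe_similarity(s1, s2, alpha=0.2, beta=0.6):
    """Assumed stand-in formula: decays with d, grows with h."""
    d, h = distance_and_depth(s1, s2)
    return math.exp(-alpha * d) * math.tanh(beta * h)


if __name__ == "__main__":
    print(distance_and_depth("human", "animal"))
    print(sememe_similarity("human", "animal"))
    print(sememe_similarity("human", "ability"))
```

With this stand-in, a pair whose common parent sits deep in the tree and close to both sememes scores higher than a pair that only meets at the root.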
Similarity between concepts • Word “Doctor” • DEF={human|人:{own|有:possession={Status|身分:domain={education|教育},modifier={HighRank|高等:degree={most|最}}},possessor={~}}} • human → Primary sememe • Status, own, … → Modifying sememes • possession, domain, … → Descriptors
Similarity between concepts • P, Q: Two concepts. Assume P has the fewer modifying sememes. • P_i, Q_j: The i-th modifying sememe of P and the j-th modifying sememe of Q. • S, T: Descriptor sets of P and Q • α, β, γ: Weights of the three parts (primary sememe, modifying sememes, descriptors)
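The concept-level equation is not reproduced in this text. The sketch below is one plausible reading, not the authors' formula: a weighted sum of a primary-sememe part, a modifying-sememe part and a descriptor part. The `Concept` class, the best-match averaging over modifying sememes, the Jaccard overlap for descriptor sets and the `exact_match` placeholder are all assumptions made for illustration. The default weights are the α, β, γ values listed on the Evaluation slide.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical container for one parsed HowNet DEF."""
    primary: str
    modifiers: list = field(default_factory=list)
    descriptors: set = field(default_factory=set)


def exact_match(s1, s2):
    """Placeholder sememe similarity: 1.0 if identical, else 0.0."""
    return 1.0 if s1 == s2 else 0.0


def concept_similarity(p, q, sememe_sim=exact_match,
                       alpha=0.54, beta=0.36, gamma=0.1):
    # Ensure P is the concept with the fewer modifying sememes.
    if len(p.modifiers) > len(q.modifiers):
        p, q = q, p

    # Part 1: primary sememes.
    primary = sememe_sim(p.primary, q.primary)

    # Part 2 (assumed): each P_i is matched to its best Q_j, then averaged.
    if p.modifiers:
        modifying = sum(
            max(sememe_sim(pi, qj) for qj in q.modifiers)
            for pi in p.modifiers
        ) / len(p.modifiers)
    else:
        modifying = 1.0 if not q.modifiers else 0.0

    # Part 3 (assumed): Jaccard overlap of the descriptor sets S and T.
    union = p.descriptors | q.descriptors
    descriptor = len(p.descriptors & q.descriptors) / len(union) if union else 1.0

    return alpha * primary + beta * modifying + gamma * descriptor


if __name__ == "__main__":
    doctor = Concept(
        primary="human",
        modifiers=["own", "Status", "education", "HighRank", "most"],
        descriptors={"possession", "domain", "modifier", "degree", "possessor"},
    )
    person = Concept(primary="human")
    print(concept_similarity(doctor, person))
```

Because the weights sum to 1, two identical concepts score 1 under this reading, and the primary sememe dominates the result.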
Similarity between words • One word may have several concepts. • Take the similarity of the most similar pair of concepts.
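The "most similar pair" rule amounts to a maximum over all concept pairs of the two words. A minimal sketch follows. The function name, the toy `concept_sim` callback and the choice to return 0.0 when a word has no concepts are assumptions.

```python
from itertools import product


def word_similarity(concepts1, concepts2, concept_sim):
    """Similarity of two words = best similarity over all concept pairs.

    concepts1 / concepts2: the concepts (DEFs) listed for each word.
    concept_sim: any function scoring a pair of concepts.
    """
    if not concepts1 or not concepts2:
        # Assumption: a word with no concepts cannot be compared here.
        return 0.0
    return max(concept_sim(c1, c2) for c1, c2 in product(concepts1, concepts2))


if __name__ == "__main__":
    # Concepts reduced to their primary sememe for illustration only.
    gongfu = ["Ability", "Strength", "time"]  # the three DEFs of 工夫
    other = ["time"]                          # hypothetical second word

    def toy_sim(a, b):
        return 1.0 if a == b else 0.0

    print(word_similarity(gongfu, other, toy_sim))
```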
Amendment with thesaurus • Some words are missing from HowNet, and some DEFs are too coarse. • Use the Chinese thesaurus TongyiciCilin (同義詞詞林); this should be the extended edition (同義詞詞林擴展版) from the Information Retrieval Lab (IR-Lab) at Harbin Institute of Technology. • d: Distance between W1 and W2
Similarity between words • Sim1: Eq. 6 (similarity in HowNet) • Sim2: Eq. 7 (similarity in TongyiciCilin) • α, β, γ, η: Parameters that scale the weights of the two parts.
Evaluation • Dataset • RG-65: Rubenstein and Goodenough collected synonymy judgments for 65 pairs of nouns. They asked 51 human judges to assign each pair a score between 0.0 and 4.0 indicating semantic similarity. • MC-28: Miller and Charles followed this idea, restricting themselves to 30 pairs of nouns selected from Rubenstein and Goodenough’s list, divided equally among words with high, intermediate, and low similarity. • To measure similarity between Chinese words, RG-65 was translated into Chinese manually.
Evaluation • Parameters • Similarity between sememes: Strategy 1: α = 1.6, β = 0.16; Strategy 2: α = 0.2, β = 0.16 • Similarity between concepts: α = 0.54, β = 0.36, γ = 0.1 • Similarity between words: on the Chinese dataset, α = 0.95, β = 0.05, γ = 0.95, η = 0.05; on the English dataset, α = 0.95, β = 0.05, γ = 0.45, η = 0.55
Result • HAPI: HowNet_Get_Concept_Similarity in the HowNet API
Result • In addition, the authors compare their results with eight groups of measures that rely on WordNet. • Table 1. Correlation coefficients of the algorithms
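Evaluations of this kind typically report how well an algorithm's scores correlate with the human ratings on a dataset such as RG-65. The sketch below assumes the Pearson coefficient; the text does not say which coefficient was used. The numbers in the usage example are made-up placeholders, not values from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Placeholder data: human ratings on the 0.0-4.0 scale and
    # hypothetical algorithm scores in [0, 1] for four word pairs.
    human = [3.9, 3.0, 1.2, 0.1]
    system = [0.95, 0.70, 0.30, 0.05]
    print(round(pearson(human, system), 3))
```

Pearson correlation is invariant to linear rescaling, so the 0.0 to 4.0 human scale and a 0 to 1 algorithm scale can be compared directly.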