OMIOTIS: A Thesaurus-based Measure of Semantic Relatedness George Tsatsaronis DB-NET Research Team, A.U.E.B. – http://www.db-net.aueb.gr SKEL N.C.S.R. “DEMOKRITOS” - http://www.iit.demokritos.gr/skel/ Web: http://www.db-net.aueb.gr/gbt/ e-mail: gbt@aueb.gr, gbt@iit.demokritos.gr Joint work with Iraklis Varlamis and Michalis Vazirgiannis
Presentation Layout • Lexical Ambiguity Problem • State of the art • Dictionary-based Approaches • Corpus-based Approaches • Hybrid Approaches • OMIOTIS • Applications • Word-to-Word Relatedness • SAT Analogy • Text-to-Text-Relatedness • Paraphrasing • Text Retrieval • Limitations • Future Work N.C.S.R. "Demokritos", December 2008
Presentation Layout • Lexical Ambiguity Problem ←
The Problem of Lexical Ambiguity • Syntactic Ambiguity • A word can occur with different parts of speech (POS) in text • e.g., “Oxides and hydroxides of metals and ammonia are included in bases.” • and “He bases his claim on some observation.” • Semantic Ambiguity • A word can occur with different meanings in text • e.g., “The old car needed constant attention.” • and “The troops stood at attention.” • A word can occur as part of a phrase • e.g., “The United States of America was founded by thirteen colonies of Great Britain.” • Stemming • e.g., is disturb the stem of the verb disturb, or of the noun disturbances, which can also have other meanings?
Problems Propagate • Text Retrieval • Text Classification • Paraphrasing • Other problem aspects • Machine Translation, Summarization
Impact of Syntactic and Semantic Ambiguity in Text Retrieval (1/2)
Impact of Syntactic and Semantic Ambiguity in Text Retrieval (2/2)
Problem Nature of Semantic Ambiguity • Polysemy: A word can have different meanings in different contexts (e.g., sentences, texts). • Thesauri, like WordNet, give us the possible meanings (senses) of any dictionary word. • WordNet uses synonym sets, called synsets, to represent the words’ senses. • e.g., the noun bank has 10 different synsets in WordNet.
Lexical Resources • Machine Readable Dictionaries (MRDs), like the Collins English Dictionary (CED), the Oxford Advanced Learner’s Dictionary (OALD), and the Longman Dictionary of Contemporary English (LDOCE). • Thesauri, like WordNet, Roget’s (recently available with a Java API), and EuroWordNet • For each word, all MRDs provide • Possible parts of speech (POS) • Possible meanings and respective definitions (glosses) • Usage examples • Thesauri add semantic relations (usually symmetrical) • Hierarchical: Hypernym/Hyponym, Meronym-Troponym/Holonym, etc. • Horizontal: Antonym/Synonym, Domain, Entailments/Causes, etc.
WordNet – an often used thesaurus • Developed at Princeton University, with more than 200,000 synsets. • Versions 2.0 and 2.1 come with semantic relations crossing POS. • The most widely used thesaurus in the WSD literature since 2000. • Senseval 2, Senseval 3, SemCor and SemEval are manually annotated on WordNet 2.0 and 2.1.
How to Tackle Lexical Ambiguity • Syntactic Ambiguity • POS Tagging (Brill, Viterbi algorithm, MaxEnt) • Semantic Ambiguity • Word Sense Disambiguation (knowledge-based, corpus-based, hybrid) • Phrase Detection (dictionary look-up) • Questions arise: • How to combine all these pieces of information? • Is there any other way to address lexical ambiguity? • A measure of semantic relatedness combining lexical and semantic features?
Presentation Layout • State of the art • Dictionary-based Approaches ←
Overall Idea • What is the semantic relatedness between Ckm and Cpq? (Figure: sense hierarchies for Term1 and Term2, with nodes such as C0, C11, C12, C13, ..., Ckm and Cpq.)
Notation • len(ci,cj) is the length of the shortest path between ci and cj • depth(ci) = len(root, ci) is the depth of a node • lso(ci,cj) is the lowest super-ordinate (or most specific common subsumer) of ci, cj. • Given any rel(ci,cj), the rel(wi,wj) is simply:
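The formula itself is not preserved in this transcript. A common convention in the literature, assumed here rather than confirmed by the slide, is to take the maximum over all pairs of senses, where s(w) (a symbol introduced only for this sketch) denotes the set of senses of word w:

```latex
\[
\mathrm{rel}(w_i, w_j) \;=\; \max_{c_i \in s(w_i),\; c_j \in s(w_j)} \mathrm{rel}(c_i, c_j)
\]
```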
Dictionary-based Semantic Relatedness • Wu and Palmer (1994) • Hirst and St-Onge (1998) • Leacock and Chodorow (1998) • Veale (2004)
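To make the notation concrete, here is a minimal sketch of two of the listed measures, Wu and Palmer and Leacock and Chodorow, in their commonly cited forms. The toy taxonomy and all function names are hypothetical, and counting the Leacock and Chodorow path and depth in nodes is a choice of this sketch, not something the slides specify.

```python
import math

# Hypothetical toy IS-A taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "oak": "plant",
}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def depth(c):
    """depth(c) = len(root, c), counted in edges."""
    return len(ancestors(c)) - 1

def lso(c1, c2):
    """Lowest super-ordinate: the deepest node that subsumes both concepts."""
    seen = set(ancestors(c1))
    for a in ancestors(c2):  # walks upward, so the first hit is the deepest
        if a in seen:
            return a
    return None

def path_len(c1, c2):
    """len(c1, c2) in edges; in a tree the shortest path runs through the lso."""
    common = lso(c1, c2)
    return (depth(c1) - depth(common)) + (depth(c2) - depth(common))

# Depth of the whole taxonomy counted in nodes (an assumption of this sketch).
MAX_DEPTH_NODES = max(depth(c) for c in PARENT) + 1

def wu_palmer(c1, c2):
    """2*depth(lso) / (len(c1,lso) + len(c2,lso) + 2*depth(lso))."""
    if c1 == c2:
        return 1.0  # also avoids 0/0 when both concepts are the root
    d = depth(lso(c1, c2))
    return 2 * d / (path_len(c1, c2) + 2 * d)

def leacock_chodorow(c1, c2):
    """-log(len / (2*D)), with len and D both counted in nodes."""
    nodes_on_path = path_len(c1, c2) + 1
    return -math.log(nodes_on_path / (2 * MAX_DEPTH_NODES))
```

In this toy tree, "dog" and "cat" share the subsumer "animal" and score higher under both measures than "dog" and "oak", which only meet at the root.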
Presentation Layout • State of the art • Corpus-based Approaches ←
Overall Idea • Core Idea: “A word is characterized by the company it keeps” • Terms t1 and t2 are represented as vectors (r11, r12, …, r1n) and (r21, r22, …, r2n) of features, such as co-occurrence frequencies, extracted from a large corpus (e.g., BNC, Wikipedia). • Frequencies are transformed by a variety of formulas and weights. • Techniques like LSA (SVD) can also be applied. • Relatedness can then be measured through the cosine of the angle between the two vectors.
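The idea above can be sketched in a few lines. This is a minimal illustration with raw sentence-level co-occurrence counts and no weighting or SVD; the three-sentence corpus and the function names are hypothetical stand-ins for a real corpus and pipeline.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real setup would use a large corpus such as the BNC.
CORPUS = [
    "the car needs fuel and a driver",
    "the truck needs fuel and a driver",
    "the banana is a sweet fruit",
]

def cooccurrence_vector(term, sentences):
    """Count how often every other word shares a sentence with `term`."""
    vec = Counter()
    for sentence in sentences:
        words = sentence.split()
        if term in words:
            vec.update(w for w in words if w != term)
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a term that never occurs is unrelated to everything
    return dot / (norm_u * norm_v)

car = cooccurrence_vector("car", CORPUS)
truck = cooccurrence_vector("truck", CORPUS)
banana = cooccurrence_vector("banana", CORPUS)
print(cosine(car, truck), cosine(car, banana))
```

Because "car" and "truck" appear in identical contexts here, their cosine is higher than that of "car" and "banana", which only share function words.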
Corpus-based Semantic Relatedness • PMI-IR (Turney 2001)
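PMI-IR estimates pointwise mutual information from search-engine hit counts. The sketch below shows only the underlying PMI computation from document counts; the counts are made up for illustration, and the function name and parameters are hypothetical rather than part of any published implementation.

```python
import math

def pmi(hits_both, hits_w1, hits_w2, total_docs):
    """Pointwise mutual information: log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    if hits_both == 0:
        return float("-inf")  # the two words never co-occur
    p_both = hits_both / total_docs
    p_w1 = hits_w1 / total_docs
    p_w2 = hits_w2 / total_docs
    return math.log2(p_both / (p_w1 * p_w2))

# Hypothetical hit counts, as if returned by queries over 1,000,000 pages.
TOTAL = 1_000_000
related = pmi(hits_both=900, hits_w1=2_000, hits_w2=3_000, total_docs=TOTAL)
independent = pmi(hits_both=6, hits_w1=2_000, hits_w2=3_000, total_docs=TOTAL)
print(related, independent)
```

A pair that co-occurs exactly as often as chance predicts scores about 0, while a pair that co-occurs far more often gets a clearly positive score.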
Presentation Layout • State of the art • Hybrid Approaches ←
Overall Idea • Frequencies of occurrence (FOC) propagate through the hierarchy of senses C0, C11, C12, C13, ..., Ckm, Cpq: • TF(CC0) = F(FOC(C11), FOC(C12), FOC(C13)) • FOC(C1j) = F(FOC(C2i)) • … • FOC(Ckm), FOC(Cpq)
Hybrid Approaches • Resnik (1995) • Jiang and Conrath (1997) • Lin (1998)
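The three measures listed above all build on information content, IC(c) = -log p(c), where p(c) comes from corpus frequencies propagated up the taxonomy. The following is a minimal sketch of their commonly cited forms; the taxonomy, the raw counts, and the function names are hypothetical.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and raw corpus counts per concept.
PARENT = {"entity": None, "animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal", "oak": "plant"}
RAW_COUNT = {"entity": 0, "animal": 10, "plant": 10, "dog": 40, "cat": 30, "oak": 10}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def propagated_counts():
    """Each occurrence of a concept also counts for all of its ancestors."""
    total = {c: 0 for c in PARENT}
    for c, n in RAW_COUNT.items():
        for a in ancestors(c):
            total[a] += n
    return total

FREQ = propagated_counts()
N = FREQ["entity"]  # the root subsumes every occurrence

def ic(c):
    """Information content: -log p(c)."""
    return -math.log(FREQ[c] / N)

def lso(c1, c2):
    """Lowest super-ordinate of two concepts."""
    seen = set(ancestors(c1))
    return next(a for a in ancestors(c2) if a in seen)

def resnik(c1, c2):
    """Resnik: similarity is the IC of the lowest super-ordinate."""
    return ic(lso(c1, c2))

def jiang_conrath_distance(c1, c2):
    """Jiang and Conrath: a distance, IC(c1) + IC(c2) - 2 * IC(lso)."""
    return ic(c1) + ic(c2) - 2 * ic(lso(c1, c2))

def lin(c1, c2):
    """Lin: 2 * IC(lso) / (IC(c1) + IC(c2))."""
    denom = ic(c1) + ic(c2)
    return 2 * ic(lso(c1, c2)) / denom if denom > 0 else 1.0
```

Note that Jiang and Conrath is a distance (smaller means more related), while Resnik and Lin are similarities.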
Presentation Layout • OMIOTIS ←
OMIOTIS: A Thesaurus-based Measure of Semantic Relatedness • OMIOTIS is a dictionary-based measure of semantic relatedness. • It does not require any type of training. It relies on the use of WordNet. • For the first time, three important factors are considered in tandem: • Semantic path length • Depth of senses comprising the path • Importance of the semantic edge types
Semantic Networks Construction • Veronis and Ide [Veronis and Ide 1990] developed the first method that utilizes semantic networks to disambiguate open class words. • It was one of the first formal semantic network definitions.
Incorporating more semantic information • Tsatsaronis et al. (2007) proposed a new method for constructing semantic networks and used spreading activation to process them. • Incorporated all of the available semantic information and developed a new strategy to spread the activation • Developed an edge weighting scheme analogous to TF-IDF.
Edge Weights and Activation Control • Edge weights are given by: • Activation is spread by: • Fan-out and distance constraints prevent the network from overflowing.
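The edge-weight and activation formulas of this slide are not preserved in the transcript, so the following is only a generic spreading-activation sketch, not the scheme of Tsatsaronis et al. (2007). It illustrates how a distance constraint and a fan-out constraint keep activation from flooding a network; the network, the weights, and both thresholds are hypothetical.

```python
from collections import deque

# Hypothetical weighted semantic network: node -> {neighbour: weight in (0, 1)}.
NETWORK = {
    "car": {"vehicle": 0.9, "wheel": 0.6},
    "vehicle": {"car": 0.9, "truck": 0.9},
    "wheel": {"car": 0.6},
    "truck": {"vehicle": 0.9},
}

def spread_activation(network, source, max_distance=2, max_fan_out=3):
    """Breadth-first spreading of activation from `source`.

    Activation passed along an edge is the sender's activation times the
    edge weight. Nodes reached at `max_distance` edges are not expanded
    further, and nodes with more than `max_fan_out` neighbours do not fire.
    Both rules are generic assumptions of this sketch.
    """
    activation = {source: 1.0}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        neighbours = network.get(node, {})
        if dist >= max_distance or len(neighbours) > max_fan_out:
            continue
        for nxt, weight in neighbours.items():
            incoming = activation[node] * weight
            # Keep only the strongest activation seen so far for each node.
            if incoming > activation.get(nxt, 0.0):
                activation[nxt] = incoming
                queue.append((nxt, dist + 1))
    return activation

print(spread_activation(NETWORK, "car"))
```

With the default limits, "truck" is reached through "vehicle"; lowering `max_distance` to 1 stops the spread before it gets there.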
OMIOTIS: Semantic Compactness Definition 1. Given a word thesaurus O, a weighting scheme that assigns a weight e ∈ (0, 1) to each edge, a pair of senses S = (s1, s2), and a path of length l connecting the two senses, the semantic compactness of S, SCM(S,O), is defined as $\mathrm{SCM}(S,O) = \prod_{i=1}^{l} e_i$, where e1, e2, ..., el are the weights of the path’s edges. If s1 = s2, SCM(S,O) = 1. If there is no path between s1 and s2, SCM(S,O) = 0.
OMIOTIS: Semantic Path Elaboration Definition 2. Given a word thesaurus O and a pair of senses S = (s1, s2), where s1, s2 ∈ O and s1 ≠ s2, and a path of length l between the two senses, the semantic path elaboration of the path, SPE(S,O), is defined as $\mathrm{SPE}(S,O) = \prod_{i=1}^{l} \frac{2 d_i d_{i+1}}{d_i + d_{i+1}} \cdot \frac{1}{d_{max}}$, where di is the depth of sense si according to O, and dmax is the maximum depth of O. If s1 = s2 and d = d1 = d2, SPE(S,O) = d / dmax. If there is no path from s1 to s2, SPE(S,O) = 0.
OMIOTIS: Semantic Relatedness Definition 3. Given a word thesaurus O and a pair of senses S = (s1, s2), the semantic relatedness of S, SR(S,O), is defined as max{SCM(S,O) · SPE(S,O)}, with the maximum taken over the paths that connect s1 and s2.
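The three definitions can be combined in a short sketch. It assumes that the candidate paths between two senses are already given as lists of edge weights and sense depths, and it reads the 1/dmax factor as applying to every harmonic mean in the product so that the values stay in [0, 1]. The function names, the example paths, and the value of dmax are hypothetical.

```python
import math

D_MAX = 16  # hypothetical maximum depth of the thesaurus

def scm(edge_weights):
    """Semantic compactness: product of the path's edge weights.
    An empty path (s1 = s2) gives 1, as in Definition 1."""
    return math.prod(edge_weights)

def spe(depths, d_max=D_MAX):
    """Semantic path elaboration: product over consecutive senses on the path
    of the harmonic mean of their depths, each normalised by d_max."""
    if len(depths) == 1:
        return depths[0] / d_max  # s1 = s2 case of Definition 2
    result = 1.0
    for d_a, d_b in zip(depths, depths[1:]):
        result *= (2 * d_a * d_b / (d_a + d_b)) / d_max
    return result

def semantic_relatedness(paths, d_max=D_MAX):
    """SR: maximum of SCM * SPE over the candidate paths.
    Each path is (edge_weights, depths), len(depths) == len(edge_weights) + 1."""
    if not paths:
        return 0.0  # no path between the two senses
    return max(scm(weights) * spe(depths, d_max) for weights, depths in paths)

# Two hypothetical paths between the same pair of senses.
long_shallow_path = ([0.8, 0.5], [4, 5, 4])
short_deep_path = ([0.6], [8, 8])
print(semantic_relatedness([long_shallow_path, short_deep_path]))
```

Here the short path through deep senses wins (0.6 · 0.5 = 0.3), which reflects the three factors named earlier: path length, depth of the senses on the path, and edge weights.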
Computation of Semantic Relatedness
OMIOTIS Where:
Presentation Layout • Applications • Word-to-Word Relatedness ←
Word-to-Word Data Sets • 65 pairs of words (R&G, Rubenstein and Goodenough) • 30 pairs of words (M&C, Miller and Charles) • Word-Similarity-353 Collection (Finkelstein et al. 2006) • For all pairs, we have human judgements (“gold standards”) • Evaluation measures the Spearman correlation between the ranking produced by a measure and the ranking given by the human judgements • Other measures have also been used, based on Kendall’s Tau
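As an illustration of the evaluation step, here is a minimal pure-Python sketch of the Spearman rank correlation, computed as the Pearson correlation of the ranks with tied values sharing an average rank. The five score pairs are invented for the example and do not come from any of the data sets above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j share one average rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks.
    Assumes neither list is constant (otherwise the variance is zero)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for five word pairs.
human = [3.9, 3.1, 1.2, 0.4, 2.5]
system = [0.80, 0.75, 0.10, 0.20, 0.50]
print(spearman(human, system))
```

Only the order of the scores matters, so a measure does not need to be on the same scale as the human judgements to be evaluated this way.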
Example (R&G)
Results
Presentation Layout • Applications • SAT Analogy ←
Scholastic Aptitude Test (SAT) • Given a pair of words, find the most analogous pair among 5 candidate pairs of words. • The key is to find the pair that best preserves, across all possible aspects, the semantic analogy of the initial pair.
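The task format can be sketched as follows. The slides do not spell out how OMIOTIS scores the candidates, so the selection rule here (pick the candidate pair whose relatedness value is closest to that of the initial pair) is purely an assumption for illustration, as are the function names and the scores.

```python
def answer_analogy(stem, candidates, rel):
    """Pick the candidate pair whose relatedness is closest to the stem pair's.

    `rel(word_a, word_b)` is any word-to-word relatedness function. Matching
    on a single relatedness value is an assumption of this sketch.
    """
    target = rel(*stem)
    return min(candidates, key=lambda pair: abs(rel(*pair) - target))

# Hypothetical relatedness scores, for illustration only.
TOY_SCORES = {
    ("mason", "stone"): 0.62,
    ("teacher", "chalk"): 0.35,
    ("carpenter", "wood"): 0.60,
    ("soldier", "gun"): 0.48,
    ("photograph", "camera"): 0.71,
    ("book", "word"): 0.52,
}

def toy_rel(a, b):
    """Look up a made-up relatedness score for an ordered word pair."""
    return TOY_SCORES[(a, b)]

stem = ("mason", "stone")
candidates = [
    ("teacher", "chalk"),
    ("carpenter", "wood"),
    ("soldier", "gun"),
    ("photograph", "camera"),
    ("book", "word"),
]
best = answer_analogy(stem, candidates, toy_rel)
print(best)
```

A single relatedness value cannot capture the type of relation between the two words, which is one reason a combination of several relatedness aspects may help on this task.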
Results on the 374 SAT Collection • OMIOTIS scores 131/374 (35%); if horizontal and vertical relatedness are combined, it reaches 198/374 (52.94%).
Presentation Layout • Applications • Text-to-Text-Relatedness ←
The 50-document Collection • Michael D. Lee et al. (2005) created a data set in which 83 subjects assigned similarity scores to all possible pairs among 50 documents. • Documents vary from 51 to 126 words. • The data set was assessed on whether it is within the normal range of standard English text, according to four language models • Log-normal, generalized inverse Gauss-Poisson, Yule-Simon and Zipfian. • The data set was found to be within normal range in terms of word frequency spectrum and vocabulary growth.
Results on the 50-document collection • The average ‘inter-rater’ correlation was 0.605 • Cosine similarity (bag-of-words representation and TF-IDF weighting) reaches a correlation of 0.27 with the human judgements. • OMIOTIS (early results) shows a correlation above 0.45. • LSA-based techniques score 0.6, but need training.
Presentation Layout • Applications • Paraphrasing ←
Microsoft Research Paraphrase Corpus • 5801 pairs of sentences gleaned over a period of 18 months. • Each pair of sentences was labelled as a paraphrase pair (1) or not (0) by two judges, with disagreements resolved by a third judge. • After the disagreements were resolved, 67% of the pairs were judged semantically equivalent.
Paraphrase Results • The table shows error reduction rates (%) from the standard vectorial model in the paraphrase task.
Presentation Layout • Applications • Text Retrieval ←
TREC 1, 4 and 6
Presentation Layout • Limitations ←
Limitations • Scaling was infeasible before the huge database was constructed. • Corpora, like Wikipedia, offer tremendous amounts of information. OMIOTIS does not take corpus information into account. • Context is not really taken into account, as WSD is not conducted.
Presentation Layout • Future Work ←
Future Work • Embed WSD information • Combine information arising from huge corpora • Combine thesauri (e.g., use Roget’s as well) Some interesting working ideas • Model the impact of ambiguity in text retrieval (similar attempts were made by Sanderson, using pseudowords) • Combine the indexing of OMIOTIS distances with a state-of-the-art IR platform, like Terrier. This will allow for online searching.