Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web
Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G.M. Petrakis, Evangelos Milios
ACM WIDM'2005
Semantic Similarity
• Semantic similarity is the conceptual similarity between terms that are not necessarily lexicographically similar, e.g., “car” and “automobile”
• Map two terms to an ontology and compute their relationship in that ontology
Objectives
• We investigate several semantic similarity methods and evaluate their performance: http://www.ece.tuc.gr/similarity
• We propose the Semantic Similarity Retrieval Model (SSRM) for computing similarity between documents containing semantically similar but not necessarily lexicographically similar terms: http://www.ece.tuc.gr/intellisearch
Ontologies
• Tools for representing information on a subject
• Hierarchical categorization of terms, from the most general to the most specific, e.g., object → artifact → construction → stadium
• Domain ontologies represent knowledge of a domain, e.g., the MeSH medical ontology
• General ontologies represent common-sense knowledge about the world, e.g., WordNet
WordNet
• A vocabulary and a thesaurus offering a hierarchical categorization of natural language terms
• More than 100,000 terms
• An ontology of natural language terms
• Nouns, verbs, adjectives and adverbs are grouped into synonym sets (synsets)
• Synsets represent terms or concepts, e.g., {stadium, bowl, arena, sports stadium}: a large structure for open-air sports or entertainments
WordNet Hierarchies
• The synsets are also organized into senses
• Senses: different meanings of the same term
• The synsets are related to other synsets higher or lower in the hierarchy by different types of relationships, e.g.,
• Hyponym/Hypernym (Is-A relationships)
• Meronym/Holonym (Part-Of relationships)
• Nine noun and several verb Is-A hierarchies
A Fragment of the WordNet Is-A Hierarchy [figure not recovered]
Semantic Similarity Methods
• Map terms to an ontology and compute their relationship in that ontology
• Four main categories of methods:
• Edge counting: similarity as a function of the path length between terms
• Information content: similarity as a function of the terms' probability of occurrence in a corpus
• Feature based: similarity between the terms' properties (e.g., definitions) or based on their relationships to other similar terms
• Hybrid: combine the above ideas
Example
• The edge-counting distance between “conveyance” and “ceramic” is 2
• An information content method would associate the two terms with their common subsumer and with their probabilities of occurrence in a corpus
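The edge-counting example can be sketched on a toy Is-A fragment. This is a minimal illustration, not the authors' implementation: the `PARENT` table is a hand-written assumption (in particular, the shared parent "instrumentality" is assumed here, since the figure is not available), and `ancestors` and `edge_distance` are hypothetical helper names.

```python
# Toy Is-A fragment stored as child -> parent.
# ASSUMPTION: node names and links are illustrative, not a verified WordNet copy.
PARENT = {
    "object": "entity",
    "artifact": "object",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
    "vehicle": "conveyance",
}


def ancestors(term):
    """Map each ancestor of `term` (and the term itself) to its distance in edges."""
    distance = {}
    steps = 0
    while term is not None:
        distance[term] = steps
        term = PARENT.get(term)  # None once the root is passed
        steps += 1
    return distance


def edge_distance(a, b):
    """Shortest Is-A path length between two terms, or None if unconnected."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None
    # The closest common subsumer minimizes the summed climb from both terms.
    return min(up_a[c] + up_b[c] for c in common)


print(edge_distance("conveyance", "ceramic"))  # 2
```

Two terms that share a direct parent are two edges apart, which matches the distance of 2 quoted on the slide under the assumed hierarchy.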
Semantic Similarity on WordNet
• The most popular methods are evaluated
• All methods are applied to a set of 38 term pairs
• Their similarity values are correlated with scores obtained from humans
• The higher the correlation of a method with the human scores, the better the method
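The evaluation step amounts to correlating a method's scores with human ratings over the same term pairs. The sketch below assumes Pearson's coefficient (the slides only say "correlated"); the two score lists are made-up placeholders, not the 38-pair data, and `pearson` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# ASSUMPTION: placeholder values for four term pairs, for illustration only.
human_scores = [3.9, 3.1, 0.4, 1.2]
method_scores = [0.95, 0.70, 0.10, 0.35]
print(round(pearson(human_scores, method_scores), 3))
```

A method whose scores rise and fall with the human ratings yields a coefficient close to 1.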
Evaluation [slide graphic not recovered]
Observations
• Edge counting and information content methods work by exploiting structure information
• Good methods take the position of the terms in the hierarchy into account
• Higher similarity for terms which are close together and lower (deeper) in the hierarchy, e.g., [Li et al. 2003]
• Information content is measured on WordNet rather than on a corpus [Seco2002]
• Similarity is computed only for nouns and verbs
• There is no taxonomic structure for other parts of speech
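Both observations can be illustrated numerically. The sketch below assumes the commonly cited forms of the two methods: a similarity that decays exponentially with path length and grows with the depth of the common subsumer (the shape attributed to Li et al.), and an information content computed from the number of hyponyms in the taxonomy instead of corpus counts (the idea attributed to Seco et al.). The function names and the default `alpha` and `beta` values are assumptions, not taken from the slides.

```python
import math


def li_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.6):
    """Assumed form: exp(-alpha * l) * tanh(beta * h).

    l = path length between the terms, h = depth of their common subsumer.
    alpha and beta are tunable parameters; the defaults are assumptions.
    """
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


def intrinsic_ic(num_hyponyms, total_concepts):
    """Assumed form: 1 - log(hypo(c) + 1) / log(total concepts).

    Leaves (no hyponyms) are maximally informative; the root is least.
    """
    return 1.0 - math.log(num_hyponyms + 1) / math.log(total_concepts)


# Same path length, but the second pair sits deeper in the hierarchy.
shallow = li_similarity(path_length=2, subsumer_depth=1)
deep = li_similarity(path_length=2, subsumer_depth=5)
print(shallow < deep)  # True
```

Under these assumed forms, two terms the same number of edges apart score higher when their common subsumer is deeper, and no corpus statistics are needed to obtain an information content value.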
http://www.ece.tuc.gr/similarity
Semantic Similarity Retrieval Model (SSRM)
• Classic retrieval models retrieve only documents that contain the query terms themselves
• SSRM also retrieves documents that contain semantically similar terms
• Queries and documents are initially assigned tf×idf weights
• q = (q1, q2, …, qN), d = (d1, d2, …, dN)
SSRM
• Query term re-weighting: similar terms reinforce each other
• Query term expansion with synonyms and similar terms
• Document similarity
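The re-weighting and document similarity steps can be sketched as follows. The slide formulas are not available, so the forms used here are assumptions: each query weight is boosted by the weights of other query terms whose similarity reaches a threshold, and the query-document score is a similarity-weighted sum over all term pairs, normalized by the sum of the weight products. `toy_sim`, `reweight`, `ssrm_similarity` and the values in `TOY_SIM` are hypothetical.

```python
# ASSUMPTION: made-up term similarities standing in for a WordNet-based method.
TOY_SIM = {
    frozenset(("car", "automobile")): 1.0,
    frozenset(("car", "vehicle")): 0.7,
}


def toy_sim(a, b):
    """Symmetric term similarity in [0, 1]; identical terms score 1."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def reweight(query, term_sim, threshold):
    """Boost each query weight by similar query terms (assumed form)."""
    boosted = {}
    for ti, wi in query.items():
        boost = sum(
            wj * term_sim(ti, tj)
            for tj, wj in query.items()
            if tj != ti and term_sim(ti, tj) >= threshold
        )
        boosted[ti] = wi + boost
    return boosted


def ssrm_similarity(query, doc, term_sim):
    """Normalized sum of q_i * d_j * sim(i, j) over all term pairs (assumed form)."""
    numerator = 0.0
    denominator = 0.0
    for qt, qw in query.items():
        for dt, dw in doc.items():
            numerator += qw * dw * term_sim(qt, dt)
            denominator += qw * dw
    return numerator / denominator if denominator else 0.0


# The query and the document share no term, yet the score is non-zero.
print(ssrm_similarity({"car": 1.0}, {"automobile": 1.0}, toy_sim))  # 1.0
```

A plain term-matching model would score this query-document pair as zero because the two vectors have no term in common; letting every query term interact with every document term is what allows semantically similar documents to be retrieved.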
Query Term Expansion [slide graphic not recovered]
Observations
• Specification of T?
• Large T may lead to topic drift
• Word sense disambiguation for expanding with the correct sense
• Expansion with co-occurring terms?
• SVD, local/global analysis
• Semantic similarity between terms of different parts of speech?
• Work with compound terms (phrases)
Evaluation of SSRM
• SSRM is evaluated through intellisearch, a system for information retrieval on the WWW
• 1.5 million Web pages with images
• Images are described by their surrounding text
• The problem of image retrieval is transformed into a problem of text retrieval
http://www.ece.tuc.gr/intellisearch
Methods
• Vector Space Model (VSM)
• SSRM
• Each method is represented by a precision/recall plot
• Each point is the average precision/recall over 20 queries
• The 20 queries are taken from the list of the most frequent Google image queries
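The points of such a plot can be computed as sketched below. This is a minimal sketch assuming that each point corresponds to a cutoff in the ranked answer list and that precision and recall are averaged over the queries at that cutoff; the exact protocol is not stated on the slide. `precision_recall` and `average_points` are hypothetical names, and the toy rankings are made up.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall of the top-k answers of one query."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_points(runs, cutoffs):
    """Average (precision, recall) over all queries at each cutoff.

    `runs` is a non-empty list of (ranked_answers, relevant_set) pairs.
    """
    points = []
    for k in cutoffs:
        pairs = [precision_recall(ranked, rel, k) for ranked, rel in runs]
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        avg_r = sum(r for _, r in pairs) / len(pairs)
        points.append((avg_p, avg_r))
    return points


# ASSUMPTION: two made-up queries with hand-labelled relevant answers.
runs = [
    (["a", "b", "c", "d"], {"a", "c", "x"}),
    (["x", "y"], {"x"}),
]
for precision, recall in average_points(runs, cutoffs=[1, 2]):
    print(round(precision, 2), round(recall, 2))
```

Plotting the averaged pairs for increasing cutoffs gives one curve per retrieval method, so VSM and SSRM can be compared on the same axes.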
Experimental Results [slide graphic not recovered]
MeSH and MedLine
• MeSH: ontology for medical and biological terms by the U.S. National Library of Medicine (N.L.M.)
• 22,000 terms
• MedLine: the premier bibliographic medical database of the N.L.M.
• 13 million references
Evaluation on MedLine [slide graphic not recovered]
Conclusions
• Semantic similarity methods approximate the human notion of similarity, reaching a correlation of up to 83% with human scores
• SSRM exploits this information to improve retrieval performance
• SSRM can work with any semantic similarity method and any ontology
Future Work
• Experimentation with more data sets (TREC) and ontologies
• Extend SSRM to work with:
• Compound terms
• More parts of speech (e.g., adverbs)
• Co-occurring terms
• More term relationships in WordNet
• More elaborate methods for the specification of thresholds
Try our system on the Web
• Semantic Similarity System: http://www.ece.tuc.gr/similarity
• SSRM: http://www.ece.tuc.gr/intellisearch