Computing Semantic Relatedness B.Tech Project – Second Stage Rohitashwa Bhotica (04005010) Under the guidance of Prof. Pushpak Bhattacharyya
OUTLINE • Introduction • Wiktionary • Semantic Relatedness • Page Rank • Implementation Steps • Results and Testing • Conclusion
Introduction • Computing semantic relatedness between words is useful in various applications • Many measures exist, almost all of them based on WordNet • Wiktionary models lexical semantic knowledge in a way similar to conventional wordnets • Wiktionary can therefore be a substitute for WordNet • We show how the concept-vector approach and PageRank are used to measure semantic relatedness with Wiktionary as the corpus
Wiktionary • Freely available, multilingual, web-based dictionary in over 150 languages • A project of the Wikimedia Foundation • Written collaboratively by online volunteers • The English version has over 800,000 entries • Contains many relation types such as synonyms, etymology, hypernymy, etc.
Differences between WordNet & Wiktionary • Wiktionary is constructed by users on the web rather than by expert linguists • This reduces creation costs and increases the size and speed of entry creation • Wiktionary is available in more languages • The Wiktionary schema is fixed but not enforced • Older entries are not updated and hence inconsistent • Wiktionary entries are not necessarily complete and may contain stubs; the link structure is also not symmetric
Similarities Between Wiktionary & WordNet • Wiktionary contains concepts connected to each other by lexical semantic relations • Both have glosses giving short descriptions • The editions for all major languages are large • Wiktionary articles are monitored by the community on the web, much as WordNet is maintained by its curators
Structure of a Wiktionary Entry • Stored in XML format with tags for title, author, creation date, comments, etc. • Meanings and various forms of the word, with examples • List of synonyms and related terms • Links to other words, represented by “[[ ]]” • Contains a list of translations of the word into other languages and the categories to which it belongs • Pronunciation and rhyming words as well
Example • http://en.wiktionary.org/wiki/bank • We can see the various meanings for the different forms of the word “bank” • List of derived and related terms present • Contains translations into other languages
Semantic Relatedness • Defines the resemblance between two words • A more general concept than similarity • Both similar and dissimilar entries can be related by lexical relationships such as meronymy • Car–petrol is more related than car–bicycle, even though car–bicycle is more similar • Humans can judge relatedness easily, unlike computers • Computers need vast amounts of common sense and world knowledge
Measures of Semantic Relatedness • Concept-Vector Based Approach • A word is represented as a high-dimensional concept vector v(w) = (v1, …, vn), where n is the number of documents • Each vector element stores the word's tf.idf score in the corresponding document • The vector v represents word w in concept space • Semantic relatedness can be calculated by relCV(w1, w2) = (v(w1) · v(w2)) / (|v(w1)| |v(w2)|) • This is also known as cosine similarity and the score varies from 0 to 1 (see the sketch below)
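A minimal sketch of the concept-vector computation using scikit-learn's TfidfVectorizer as a stand-in for the project's own tf.idf code; the three-word mini-corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus: each "document" is the text associated with a word.
docs = {
    "car":     "road vehicle engine petrol wheels transport",
    "petrol":  "fuel liquid engine car refined oil",
    "bicycle": "road vehicle two wheels pedal transport",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs.values())  # rows are tf.idf concept vectors

words = list(docs)
sims = cosine_similarity(matrix)  # pairwise relCV scores in [0, 1]
print(f"rel({words[0]}, {words[1]}) = {sims[0, 1]:.3f}")
print(f"rel({words[0]}, {words[2]}) = {sims[0, 2]:.3f}")
```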
Measures of Semantic Relatedness • Path-Length Based Measure • Computes semantic relatedness in WordNet • Treats WordNet as a graph and looks at the path length between concepts: the shorter the path, the more related the concepts • Gives good results when the path consists of is-a links • Concepts are nodes, and the semantic relations between them are treated as edges • SR is calculated by relPL (c1, c2) = Lmax – L (c1, c2) • Lmax is the length of the longest non-cyclic path and L (c1, c2) is the number of edges from concept c1 to c2 (a toy sketch follows)
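A toy sketch of the path-length measure on a hypothetical concept graph; GRAPH and L_MAX are made-up illustrations, not WordNet data:

```python
from collections import deque

# Hypothetical is-a graph: nodes are concepts, edges are semantic relations.
GRAPH = {
    "coin": ["nickel", "dime", "medium_of_exchange"],
    "nickel": ["coin"],
    "dime": ["coin"],
    "medium_of_exchange": ["coin", "credit_card"],
    "credit_card": ["medium_of_exchange"],
}
L_MAX = 4  # assumed length of the longest non-cyclic path in this toy graph

def path_length(c1, c2):
    """Breadth-first search for the number of edges between two concepts."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for nbr in GRAPH.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return L_MAX  # unreachable: treat as maximally distant

def rel_pl(c1, c2):
    return L_MAX - path_length(c1, c2)

print(rel_pl("nickel", "dime"))         # short path -> high relatedness
print(rel_pl("nickel", "credit_card"))  # longer path -> lower relatedness
```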
Measures of Semantic Relatedness • The problem is that it considers all links to be uniform in distance, which is not always the case • Many improvements use Information Content • The Resnik Measure • An information-content based relatedness measure • Concepts with higher information content are specific to particular topics, while those with lower information content are more general • Carving fork – HIGH, entity – LOW • The idea is that two concepts are semantically related in proportion to the amount of information they share
Measures of Semantic Relatedness • Considers the position of nouns in the is-a hierarchy • SR is determined by the information content of the lowest common concept that subsumes both concepts • For example: Nickel and Dime are subsumed by Coin, Nickel and Credit Card by Medium of Exchange • P(c) is the probability of encountering concept c • If a is-a b, then P(a) ≤ P(b) • Information content is calculated by the formula IC (concept) = – log (P (concept))
Measures of Semantic Relatedness • Thus relatedness is given by simres (c1, c2) = IC (LCS (c1, c2)) • Considers neither the information content of the concepts themselves nor the path length • One problem is that many concept pairs share the same subsumer and thus get the same score • May produce high measures on the basis of inappropriate word senses, e.g. tobacco and horse • Newer methods include the Jiang-Conrath, Lin and Leacock-Chodorow measures (a toy sketch of the Resnik measure follows)
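A toy sketch of the Resnik measure, assuming a hypothetical taxonomy and made-up concept probabilities P(c):

```python
import math

# Hypothetical is-a taxonomy (child -> parent) with corpus probabilities P(c).
PARENT = {"nickel": "coin", "dime": "coin", "coin": "medium_of_exchange",
          "credit_card": "medium_of_exchange", "medium_of_exchange": "entity"}
P = {"nickel": 0.01, "dime": 0.01, "coin": 0.05,
     "credit_card": 0.02, "medium_of_exchange": 0.2, "entity": 1.0}

def ancestors(c):
    """The concept itself plus its chain of is-a ancestors."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first shared concept on the is-a chains."""
    anc2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in anc2)

def ic(c):
    return -math.log(P[c])  # IC(c) = -log P(c)

def sim_res(c1, c2):
    return ic(lcs(c1, c2))

print(sim_res("nickel", "dime"))         # LCS = coin, high IC
print(sim_res("nickel", "credit_card"))  # LCS = medium_of_exchange, lower IC
```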
Page Rank • Developed by Larry Page and Sergey Brin • A link analysis algorithm that assigns a numerical weighting to a hyperlinked set of documents • Measures the relative importance of a page within the set • A link to a page is a vote of support which increases the rank of that page • It is a probability distribution representing the likelihood that a person randomly clicking on links will ultimately end up on a specific page
Simplified Algorithm • Assume the universe has 4 pages: A, B, C and D • The initial value of every page is 0.25 • Now suppose B, C and D link only to A • The rank of A is given by PR(A) = PR(B) + PR(C) + PR(D) • If B links to other pages as well, then the rank of A becomes PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) • L(B) is the number of outbound links from B
Simplified Algorithm • The page rank of u depends on the rank of each page v linking to u, divided by the number of links from v • Page rank is given by the general formula PR(u) = Σv∈Bu PR(v)/L(v), where Bu is the set of pages that link to u • Thus the page ranks of all pages in the corpus sum to 1
Final Algorithm • Damping factor: the imaginary surfer will stop clicking links after some time • d is the probability that the user will continue clicking • The damping factor is estimated at 0.85 here • The new page rank formula using this is PR(u) = (1 – d)/N + d · Σv∈Bu PR(v)/L(v), where N is the number of pages • To get the actual rank of a page we iterate this formula many times • Problem of dangling links: pages with no outbound links leak rank (a sketch of the iteration follows)
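A minimal sketch of the damped page-rank iteration, including the dangling-link fix described later in the implementation; the four-word link graph is hypothetical:

```python
# Hypothetical link graph: word -> list of words its article links to.
LINKS = {"bank": ["money", "river"], "money": ["bank"],
         "river": ["bank"], "coin": []}  # "coin" is a dangling page

D, ITERATIONS = 0.85, 30
words = list(LINKS)
N = len(words)

# Dangling-link fix from the slides: pages with no out-links link to all pages.
out = {w: (targets if targets else words) for w, targets in LINKS.items()}

rank = {w: 1.0 / N for w in words}
for _ in range(ITERATIONS):
    new = {w: (1 - D) / N for w in words}     # (1 - d)/N baseline
    for v, targets in out.items():
        share = D * rank[v] / len(targets)    # d * PR(v)/L(v)
        for u in targets:
            new[u] += share
    rank = new

print(rank)  # ranks sum to 1 across the corpus
```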
Page Rank in our Implementation • Wiktionary articles contain a link structure • The page rank of every word in the corpus can be calculated with the same algorithm • Higher-ranked words have a higher probability of being reached by random clicking • The algorithm is iterated 30 times • The problem is that the link structure is not symmetric and can still be improved
Implementation Steps • We use the Wiktionary corpus dated 15th March, 2008 • Parsing: • Split the large Wiktionary dump file into smaller files • Parse articles, removing irrelevant information such as comments and leaving only content words • Content words consist of the words in the glosses of an article and the synonyms, antonyms, etc. of the word • Content words are then stemmed with the Porter stemmer to maintain uniformity across all words • Stop words are removed to leave only the main words (a sketch of this step follows)
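A minimal sketch of the content-word extraction step, assuming NLTK's Porter stemmer and English stop-word list (the project's exact tooling is not specified; `nltk.download("stopwords")` is needed once):

```python
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def extract_content_words(article_text):
    """Strip wiki markup, drop stop words, and stem what remains."""
    # Remove XML comments; keep the text inside [[...]] link markup.
    text = re.sub(r"<!--.*?-->", " ", article_text, flags=re.DOTALL)
    text = re.sub(r"\[\[([^\]|]*\|)?([^\]]+)\]\]", r"\2", text)  # [[a|b]] -> b
    text = re.sub(r"[^A-Za-z]+", " ", text)
    tokens = [t.lower() for t in text.split()]
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(extract_content_words("A [[bank]] is an institution <!--stub--> for money."))
```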
Implementation Steps • Calculating SR using the C-V based approach: • We have the list of all words and their content words • Treating each word as a separate document, calculate the concept vector of each word • Calculate SR using these concept vectors • Example: a hypothetical worked example follows
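A hypothetical worked example with made-up tf.idf values: suppose the concept space has three documents and v(car) = (0.5, 0.2, 0), v(petrol) = (0.4, 0, 0.1). Then v(car) · v(petrol) = 0.5·0.4 + 0.2·0 + 0·0.1 = 0.20, |v(car)| = √(0.25 + 0.04) ≈ 0.539 and |v(petrol)| = √(0.16 + 0.01) ≈ 0.412, giving relCV(car, petrol) = 0.20 / (0.539 × 0.412) ≈ 0.90.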
Implementation Steps • List of Linked Words: • Each linked word is enclosed in “[[ ]]”s • We parse Wiktionary and store all these words • Calculating Page Rank: • We have the list of all links for every word in the corpus • Words that link to nothing are linked to all words, to solve the dangling links problem (as in the page-rank sketch above) • The Page and Brin algorithm is then used to calculate the ranks
Implementation Steps • Calculating SR using Page Rank: • The concept vector of each word is already computed • Multiply each element of the concept vector by its corresponding page rank • Compute cosine similarity using these weighted vectors • Example: a hypothetical sketch follows
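A minimal sketch of the page-rank weighting, with hypothetical three-dimensional concept vectors and page-rank scores:

```python
import math

def weighted_cosine(v1, v2, pagerank):
    """Cosine similarity after scaling each dimension by its page rank."""
    w1 = [x * pr for x, pr in zip(v1, pagerank)]
    w2 = [x * pr for x, pr in zip(v2, pagerank)]
    dot = sum(a * b for a, b in zip(w1, w2))
    n1 = math.sqrt(sum(a * a for a in w1))
    n2 = math.sqrt(sum(b * b for b in w2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical concept vectors and per-dimension page-rank scores.
print(weighted_cosine([0.5, 0.2, 0.0], [0.4, 0.0, 0.1], [0.9, 0.3, 0.5]))
```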
Results and Testing • The Miller & Charles (30 pairs), Rubenstein & Goodenough (65 pairs) and Finkelstein (353 pairs) datasets are used for testing • Pearson's correlation coefficient and Spearman's rank-order correlation coefficient are calculated for the results obtained on these datasets
Results and Testing • Pearson's correlation coefficient formula: r = Σ(xi – x̄)(yi – ȳ) / √(Σ(xi – x̄)² · Σ(yi – ȳ)²) • Results: [table of Pearson correlations on the three datasets]
Results and Testing • Spearman's correlation coefficient formula: ρ = 1 – 6 Σ di² / (n(n² – 1)), where di = xi – yi is the difference between the ranks of values Xi and Yi • All entries with 0 values are removed for this • Results: [table of Spearman correlations on the three datasets] (a sketch of computing both coefficients follows)
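Both coefficients can be computed with scipy; the human-judgement and computed-score lists below are hypothetical stand-ins for the dataset results:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores: human judgements vs. computed relatedness values.
human = [3.92, 3.84, 0.42, 2.61, 3.11]
computed = [0.81, 0.78, 0.05, 0.40, 0.66]

r, _ = pearsonr(human, computed)       # linear correlation of raw scores
rho, _ = spearmanr(human, computed)    # correlation of rank orders
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```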
Conclusion • The coverage of Wiktionary is very high for the datasets • The Pearson and Spearman correlation coefficients are lower for the second, PageRank-based method • Entries are still in a nascent stage, with no well-defined, symmetric link structure yet • Entries are not yet properly authored and edited • For the tougher datasets Fin1 and Fin2 the scores are low • The second method will improve once the link structure and the structure and content of articles improve
Conclusion (contd.) • Semantic relatedness between words can be used to solve word sense disambiguation, word choice problems, etc. • We saw the features of Wiktionary and measures for calculating semantic relatedness between words • We studied the concept of page rank and its application to calculating semantic relatedness • The results show that Wiktionary is a good and emerging semantic resource which will improve in the future
Bibliography • Torsten Zesch, Christof Müller, Iryna Gurevych. Using Wiktionary for Computing Semantic Relatedness, AAAI 2008 • Alexander Budanitsky, Graeme Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness, 2006 • Philip Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI 1995 • Siddharth Patwardhan, Satanjeev Banerjee, Ted Pedersen. Using Measures of Semantic Relatedness for Word Sense Disambiguation, 2003 • Wikimedia Foundation. Wikipedia, www.wikipedia.com • Philip Resnik, Mona Diab. Measuring Verb Similarity, 2000 • Larry Page, Sergey Brin. The PageRank Citation Ranking: Bringing Order to the Web, 1998
Bibliography • Wikimedia Foundation. Wiktionary, www.wiktionary.com • Siddharth Patwardhan, Ted Pedersen. Using WordNet-Based Concept Vectors to Estimate the Semantic Relatedness of Concepts, 2006 • Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, Eytan Ruppin. Placing Search in Context: The Concept Revisited, ACM TOIS, 2002 • Herbert Rubenstein, John B. Goodenough. Contextual Correlates of Synonymy, 1965 • George A. Miller, Walter G. Charles. Contextual Correlates of Semantic Similarity, 1991 • Jay J. Jiang, David W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, ROCLING 1997