Measuring the Semantic Similarity of Texts. Authors: Courtney Corley and Rada Mihalcea. Source: ACL-2005. Reporter: Yong-Xiang Chen.
Outline • Introduction • Semantic Similarity of Words • Semantic Similarity of Texts • A Walk-Through Example • Evaluation • Conclusion
Introduction • Measures of text similarity have been used for • IR, text classification, WSD, automatic evaluation of machine translation, text summarization • The typical approach is to use a simple lexical matching method and produce a similarity score • But most lexical matching metrics fail on texts that share meaning but few words, such as • I own a dog • I have an animal
Introduction (cont.) • LSA measures similarity between texts by including • Similar terms found in large text collections • In this paper, we explore a knowledge-based method for measuring the semantic similarity of texts • There are several methods for finding the semantic similarity of words • We combine these methods into a text-to-text semantic similarity method
Semantic Similarity of Words • The Leacock & Chodorow (Leacock and Chodorow, 1998) similarity • Length: the length of the shortest path between two concepts • D: the maximum depth of the taxonomy • The Wu and Palmer (Wu and Palmer, 1994) similarity
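For reference, these two measures are usually written as follows in the cited papers, with Length and D as defined on the slide. The function names, the subscripts, and LCS (the least common subsumer of the two concepts in the taxonomy) are notation assumed here for illustration.

```latex
\begin{align*}
\mathrm{Sim}_{lch}(c_1, c_2) &= -\log \frac{\mathit{length}(c_1, c_2)}{2D} \\
\mathrm{Sim}_{wup}(c_1, c_2) &= \frac{2 \cdot \mathrm{depth}\big(\mathrm{LCS}(c_1, c_2)\big)}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
\end{align*}
```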
Semantic Similarity of Words (cont.) • Resnik’s measure (Resnik, 1995): the information content (IC) of the least common subsumer (LCS) • P(c): the probability of encountering an instance of concept c in a large corpus • Lin’s metric (Lin, 1998) • The Jiang & Conrath (Jiang and Conrath, 1997) similarity
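The information-content measures are commonly stated as below, with P(c) as defined on the slide. The subscripts and the shorthand LCS for the least common subsumer of the two concepts are notation assumed here, not taken from the slide.

```latex
\begin{align*}
\mathrm{IC}(c) &= -\log P(c) \\
\mathrm{Sim}_{res}(c_1, c_2) &= \mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big) \\
\mathrm{Sim}_{lin}(c_1, c_2) &= \frac{2 \cdot \mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)} \\
\mathrm{Sim}_{jcn}(c_1, c_2) &= \frac{1}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \cdot \mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big)}
\end{align*}
```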
Language Models • Language models are used to account for the distribution of words in language • We take into account the specificity of words • For example, • collie and sheepdog: specific words, given a higher weight • go and be: generic words, given less importance • Term frequency (TF) does not always constitute a good measure of word importance • The distribution of words across an entire collection can be a good indicator of the specificity of the words: inverse document frequency (IDF)
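A minimal sketch of inverse document frequency, assuming the common definition log(N / df). The function name and the four-sentence toy collection are hypothetical, and the slides do not say which corpus or which IDF variant the authors used.

```python
import math


def inverse_document_frequency(documents):
    """Map each word to log(N / df), where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each word once per document, however often it occurs there.
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical toy collection, for illustration only.
docs = [
    "the collie is a sheepdog",
    "the dog can be loud",
    "we go to the park",
    "they will be there",
]
idf = inverse_document_frequency(docs)
# A rare word such as "collie" gets a larger weight than a frequent one such as "be".
```

Words that occur in few documents receive a large weight, which is the sense of specificity the slide describes.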
Semantic Similarity of Texts • A directional measure of semantic similarity • Indicates the semantic similarity of a text segment Ti with respect to a text segment Tj • Sets of open-class words: N, V, Adj, Adv • Determine pairs of similar words across the sets corresponding to the same open class in the two texts • For nouns and verbs, we use a measure based on WordNet • Apply lexical matching to the other word classes
Semantic Similarity of Texts (cont.) • maxSim: for each word in Ti, the highest semantic similarity to any word of the same class in Tj, according to one of the six word similarity measures • The score is between 0 and 1 with respect to Ti • If maxSim is greater than 0, the word is added to the set of similar words for the corresponding word class, WSpos • A bidirectional similarity combines the scores computed with respect to Ti and with respect to Tj
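The procedure on these two slides can be sketched as follows. This is one reading of the measure, not the authors' implementation: each word's best match is weighted by its IDF and normalized by the total IDF of the open-class words in Ti, and the bidirectional score is taken as the mean of the two directions. All function names, the default weight of 1.0 for words missing from the IDF table, and the toy word-similarity scores are assumptions.

```python
def directional_similarity(words_i, words_j, word_sim, idf):
    """Similarity of text i with respect to text j.

    words_i, words_j: dicts mapping a word class (e.g. "N") to a list of words.
    word_sim: function (word, word) -> score in [0, 1].
    idf: dict word -> weight; missing words default to 1.0 (an assumption).
    """
    numerator = 0.0
    denominator = 0.0
    for pos, words in words_i.items():
        candidates = words_j.get(pos, [])  # only compare within the same word class
        for word in words:
            weight = idf.get(word, 1.0)
            denominator += weight
            max_sim = max((word_sim(word, c) for c in candidates), default=0.0)
            if max_sim > 0:  # the word joins the set of similar words
                numerator += max_sim * weight
    return numerator / denominator if denominator else 0.0


def bidirectional_similarity(words_i, words_j, word_sim, idf):
    """Mean of the two directional scores."""
    forward = directional_similarity(words_i, words_j, word_sim, idf)
    backward = directional_similarity(words_j, words_i, word_sim, idf)
    return (forward + backward) / 2


# Hypothetical word-level scores standing in for a WordNet-based measure.
toy_scores = {frozenset(("dog", "animal")): 0.8, frozenset(("own", "have")): 0.9}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return toy_scores.get(frozenset((a, b)), 0.0)


text_1 = {"N": ["dog"], "V": ["own"]}
text_2 = {"N": ["animal"], "V": ["have"]}
score = bidirectional_similarity(text_1, text_2, toy_word_sim, idf={})
```

Passing the word-level measure as a function keeps the text-level computation independent of which of the word similarity methods is plugged in.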
A Walk-Through Example • First, the text segments are tokenized and POS tagged • The words are inserted into word class sets
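As an illustration of the second step, the sketch below groups already-tagged tokens into word class sets. It assumes Penn Treebank style tags and takes the tagger output as given; the function name, the class labels, and the sample input are hypothetical.

```python
def build_word_class_sets(tagged_tokens):
    """Group (word, tag) pairs into open-class sets by Penn Treebank tag prefix."""
    prefixes = {"NN": "N", "VB": "V", "JJ": "Adj", "RB": "Adv", "CD": "Card"}
    word_sets = {name: [] for name in prefixes.values()}
    for word, tag in tagged_tokens:
        name = prefixes.get(tag[:2])
        if name is not None:  # closed-class words (pronouns, determiners, ...) are skipped
            word_sets[name].append(word.lower())
    return word_sets


# Hypothetical tagger output for the sentence "I own a dog".
tagged = [("I", "PRP"), ("own", "VBP"), ("a", "DT"), ("dog", "NN")]
word_sets = build_word_class_sets(tagged)
```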
A Walk-Through Example (cont.) • We seek a WordNet-based semantic similarity for N and V • Only lexical matching for Adj, Adv, and cardinals
A Walk-Through Example (cont.) • Applying the text similarity measure gives • The semantic similarity with respect to text 1: 0.6702 • With respect to text 2: 0.7202 • A bidirectional measure of similarity: 0.6952
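The bidirectional score is consistent with taking the arithmetic mean of the two directional scores:

```latex
\[
\frac{0.6702 + 0.7202}{2} = \frac{1.3904}{2} = 0.6952
\]
```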
Evaluation • To test the effectiveness of the text semantic similarity metric • Automatically identify whether two text segments are paraphrases of each other • Corpora: • The Microsoft paraphrase corpus: 4,076 training pairs and 1,725 test pairs • The PASCAL corpus: 580 development pairs and 800 test pairs • Two settings • An unsupervised setting: a threshold of 0.5 • A supervised setting: the optimal threshold and the weights associated with the various similarity methods are determined through learning on training data
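The unsupervised setting amounts to thresholding the similarity score. A minimal sketch follows, assuming a pair is labeled a paraphrase when its score is strictly greater than 0.5 (the slide does not say how ties are handled). The function names, scores, and gold labels are hypothetical.

```python
def classify_paraphrases(scores, threshold=0.5):
    """Label a pair as a paraphrase when its similarity score exceeds the threshold."""
    return [score > threshold for score in scores]


def accuracy(predictions, gold):
    """Fraction of pairs whose predicted label matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical similarity scores and gold labels, for illustration only.
scores = [0.6952, 0.31, 0.82, 0.47]
gold = [True, False, True, True]
predictions = classify_paraphrases(scores)
acc = accuracy(predictions, gold)
```

In the supervised setting, the threshold argument would instead be chosen on the training pairs rather than fixed at 0.5.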
Evaluation (cont.) • Three baselines • Randomly choosing a true or false value for each text pair • Lexical matching, which counts the number of matching words • Using tf * idf • Paraphrase identification • The dog is eating the bone -> The bone is being eaten by the dog • Entailment identification • I can see a dog -> I can see an animal
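A minimal sketch of the lexical matching baseline, assuming whitespace tokenization, case folding, and a count of shared word types. Any weighting or normalization the authors applied is not reproduced here, and the function name is hypothetical.

```python
def matching_word_count(text_a, text_b):
    """Count the distinct words that appear in both texts (case-insensitive)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b)


# The pair from the introduction shares only the word "I".
intro_overlap = matching_word_count("I own a dog", "I have an animal")

# An English rendering of the paraphrase example shares most of its words.
paraphrase_overlap = matching_word_count(
    "The dog is eating the bone",
    "The bone is being eaten by the dog",
)
```

The first pair shows why this baseline struggles: the two sentences are semantically close but share almost no surface words.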
Conclusion • The accuracy of text semantic similarity for paraphrase identification (68.8%, 71.5%) • For the entailment data set, the accuracy of 58.3% is better than that of the PASCAL entailment evaluation (Dagan et al., 2005) • Our method relies on a bag-of-words approach • Improves significantly over the traditional methods • But ignores many important relationships in sentence structure