MEASURING THE SIMILARITY BETWEEN IMPLICIT SEMANTIC RELATIONS USING WEB SEARCH ENGINES. Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka (WSDM’09) Speaker : Yi-Ling Tai Date : 2009/11/23. OUTLINE. Introduction Method Retrieving Contexts Extracting Lexical Patterns
MEASURING THE SIMILARITY BETWEEN IMPLICIT SEMANTIC RELATIONS USING WEB SEARCH ENGINES Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka (WSDM’09) Speaker: Yi-Ling Tai Date: 2009/11/23
OUTLINE • Introduction • Method • Retrieving Contexts • Extracting Lexical Patterns • Identifying Semantic Relations • Measuring Relational similarity • Experiments • Conclusions
INTRODUCTION • Implicit semantic relations between two words • Google, YouTube (acquisition) • Ostrich, bird (is a large) • Similar semantic relations between two word-pairs • Google, YouTube → Yahoo, Inktomi • Ostrich, bird → lion, cat • This paper proposes a method to compute the similarity between implicit semantic relations in two word-pairs.
OUTLINE OF THE SIMILARITY METHOD • Web search component • query a Web search engine to find the contexts in which the two words co-occur • Pattern extraction component • extract lexical patterns that express semantic relations • Pattern clustering component • cluster the patterns to identify particular relations • Similarity computation component • compute the relational similarity between two word-pairs
RETRIEVING CONTEXTS • Snippets: brief summaries provided by Web search engines along with the search results • A snippet containing the two words captures their local context • Example query: “Google * * YouTube”
RETRIEVING CONTEXTS • “*” is the wildcard operator; it matches one word or none • To retrieve snippets for a word-pair (A, B), issue the queries • “A * B”, “B * A”, “A * * B”, “B * * A”, “A * * * B”, “B * * * A”, and A B • The wildcard queries find snippets in which the query words co-occur within a maximum of three words • The double quotes ensure that the two words appear in the specified order • Remove duplicate snippets • Snippets count as duplicates if they contain the exact same sequence of all words
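The query set and the duplicate filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `build_queries` and `deduplicate` are hypothetical, and the call to an actual search API is left out.

```python
def build_queries(a, b, max_gap=3):
    """Return the seven queries for word-pair (a, b), in the order listed on the slide."""
    queries = []
    for n_wildcards in range(1, max_gap + 1):
        stars = " ".join(["*"] * n_wildcards)
        # Double quotes force the two words to appear in the given order.
        queries.append(f'"{a} {stars} {b}"')
        queries.append(f'"{b} {stars} {a}"')
    # The last query is the plain pair without quotes or wildcards.
    queries.append(f"{a} {b}")
    return queries


def deduplicate(snippets):
    """Keep the first snippet for each distinct word sequence."""
    seen = set()
    unique = []
    for snippet in snippets:
        key = tuple(snippet.split())  # exact sequence of all words
        if key not in seen:
            seen.add(key)
            unique.append(snippet)
    return unique
```

For example, `build_queries("Google", "YouTube")` yields the six quoted wildcard queries followed by the unquoted `Google YouTube`.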
EXTRACTING LEXICAL PATTERNS • A shallow lexical pattern extraction algorithm • Extracts the semantic relations between two words from Web snippets • Does not require language preprocessing • Consists of the following three steps • Step 1: • Replace the two words with two variables X and Y • Replace all numeric values with D • Do not remove punctuation marks
EXTRACTING LEXICAL PATTERNS • Step 2: • Exactly one X and one Y must occur in a subsequence • The maximum length of a subsequence is L words • Each gap should not exceed g words • The total length of all gaps should not exceed G words • Expand all negation contractions, e.g. didn’t → did not • Step 3: • Select subsequences with frequency greater than N
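The three steps can be sketched as follows. This is a simplified brute-force enumeration, not the modified PrefixSpan algorithm used in the paper. It assumes that each of the two words is a single token, that a pattern is counted at most once per snippet, and that the regex-based tokenizer and contraction rule are adequate; the names `normalise`, `subsequences` and `extract_patterns` are hypothetical.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")  # keeps punctuation marks


def normalise(snippet, a, b):
    """Step 1 (plus contraction expansion): map the words to X and Y, numbers to D."""
    text = re.sub(r"(\w+)n['’]t\b", r"\1 not", snippet)  # didn't -> did not
    tokens = []
    for tok in TOKEN.findall(text):
        if tok.lower() == a.lower():
            tokens.append("X")
        elif tok.lower() == b.lower():
            tokens.append("Y")
        elif NUMBER.fullmatch(tok):
            tokens.append("D")
        else:
            tokens.append(tok.lower())
    return tokens


def subsequences(tokens, L=5, g=2, G=4):
    """Step 2: all subsequences with exactly one X and one Y within the limits."""
    results = set()

    def extend(chosen, last, skipped):
        xs, ys = chosen.count("X"), chosen.count("Y")
        if xs > 1 or ys > 1:
            return
        if xs == 1 and ys == 1:
            results.add(" ".join(chosen))
        if len(chosen) == L:
            return
        for gap in range(g + 1):  # skip at most g consecutive words
            nxt = last + 1 + gap
            if nxt >= len(tokens) or skipped + gap > G:
                break
            extend(chosen + [tokens[nxt]], nxt, skipped + gap)

    for start in range(len(tokens)):
        extend([tokens[start]], start, 0)
    return results


def extract_patterns(snippets, a, b, N=1, **limits):
    """Step 3: keep subsequences whose frequency is greater than N."""
    counts = Counter()
    for snippet in snippets:
        counts.update(subsequences(normalise(snippet, a, b), **limits))
    return {p: c for p, c in counts.items() if c > N}
```

On the snippet "Google to acquire YouTube for $1.65 billion", the generated subsequences include `X to acquire Y`, `X acquire Y` and `X to acquire Y for`.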
EXTRACTING LEXICAL PATTERNS • A modified PrefixSpan algorithm • Considers all the words in a snippet • Not limited to extracting patterns from only the mid-fix • Example patterns: X to acquire Y, X acquire Y, X to acquire Y for
IDENTIFYING SEMANTIC RELATIONS • A semantic relation can be expressed using more than one pattern. • If two word-pairs share many related patterns, we can expect a high relational similarity between them. • Cluster lexical patterns using their distributions over word-pairs to identify semantically related patterns.
IDENTIFYING SEMANTIC RELATIONS • p: word-pair frequency vector of pattern p • f(a_i, b_i, p): frequency with which pattern p occurs with the word-pair (a_i, b_i) • SORT: sorts the patterns in descending order of their total occurrence in all word-pairs • c: the vector sum of all word-pair frequency vectors corresponding to the patterns that belong to that cluster • ⊕: denotes vector addition • θ: similarity threshold
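A sequential clustering pass consistent with these definitions might look like the sketch below. It assumes cosine similarity between a pattern vector and a cluster vector, which the slide does not state, and it assumes a pattern joins its most similar cluster only when that similarity exceeds the threshold. The names `cosine` and `cluster_patterns` and the sparse-dict representation are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_patterns(pattern_vectors, theta):
    """pattern_vectors maps pattern -> {word_pair: frequency}."""
    # SORT: descending order of total occurrence over all word-pairs.
    order = sorted(
        pattern_vectors,
        key=lambda p: sum(pattern_vectors[p].values()),
        reverse=True,
    )
    clusters = []
    for pattern in order:
        vec = pattern_vectors[pattern]
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > theta:
            # Vector addition: fold the pattern vector into the cluster vector c.
            best["patterns"].append(pattern)
            for pair, freq in vec.items():
                best["centroid"][pair] = best["centroid"].get(pair, 0) + freq
        else:
            # No cluster is similar enough, so the pattern starts a new cluster.
            clusters.append({"patterns": [pattern], "centroid": dict(vec)})
    return clusters
```

Because each pattern is compared only against the current cluster vectors, the pass needs a single sweep over the sorted patterns.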
MEASURING RELATIONAL SIMILARITY • Each word-pair is represented by a feature vector • The elements of the feature vector are the total frequencies of the word-pair in each cluster • The relational similarity between two word-pairs is computed from their feature vectors (Formula 2) • Formula 2 includes a correlation matrix
MEASURING RELATIONAL SIMILARITY • The correlation between two clusters is given by the corresponding element of the correlation matrix • The correlation is computed using the union of the two clusters
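Formula 2 itself did not survive in these slides, so the sketch below only illustrates the general shape of a matrix-weighted similarity. It assumes a bilinear form over L2-normalised feature vectors, which may differ from the paper's exact definition. The names `l2_normalise`, `relational_similarity` and `identity` are hypothetical, and the matrix values in the usage note are made up.

```python
import math


def l2_normalise(v):
    """Scale a dense vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def relational_similarity(x, y, matrix):
    """Assumed form: x^T M y, where x and y are cluster-frequency vectors."""
    x, y = l2_normalise(x), l2_normalise(y)
    return sum(
        x[i] * matrix[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


def identity(n):
    """Identity matrix: clusters are treated as uncorrelated."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
```

With the identity matrix, two word-pairs that fall into different clusters get similarity 0. A matrix with non-zero off-diagonal entries, such as `[[1.0, 0.5], [0.5, 1.0]]`, lets correlated clusters contribute to the score.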
EXPERIMENTS • Dataset • 100 instances (word or named-entity pairs) • five relation types • ACQUIRER-ACQUIREE • PERSON-BIRTHPLACE • CEO-COMPANY • COMPANY-HEADQUARTERS • PERSON-FIELD
EXPERIMENTS • Manually select 20 instances for each relation type from • Wikipedia • online newspapers • company reviews • For each instance, download snippets using the Yahoo BOSS API
EXPERIMENTS - LEXICAL PATTERNS • Lexical Patterns • Run the pattern extraction algorithm • L = 5, g = 2, and G = 4 • The total number of unique patterns is 473910 • We select only the 148655 patterns that occur at least twice
EXPERIMENTS - PATTERN CLUSTERS • Ratio: number of singletons to total number of clusters
EXPERIMENTS - RELATION CLASSIFICATION • We evaluate the proposed relational similarity measure in a relation classification task. • k-nearest neighbor classification • classification accuracy • average precision • Rel(r): a binary-valued function that returns 1 if the word-pair at rank r has the same relation as the word-pair being classified, and 0 otherwise
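The evaluation can be illustrated with a small sketch. It assumes majority voting among the k most similar word-pairs and the standard definition of average precision (the mean of the precision at each relevant rank); the denominator used in the paper is not shown on the slide, so it is exposed here as the optional `num_relevant` parameter. The function names and the default `k=10` are hypothetical.

```python
from collections import Counter


def knn_classify(neighbours, k=10):
    """neighbours: list of (similarity, relation_label) for the other word-pairs."""
    top = sorted(neighbours, key=lambda t: t[0], reverse=True)[:k]
    # Majority vote over the labels of the k most similar word-pairs.
    return Counter(label for _, label in top).most_common(1)[0][0]


def average_precision(rels, num_relevant=None):
    """rels[r-1] is Rel(r): 1 if the word-pair at rank r shares the relation."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = num_relevant if num_relevant is not None else hits
    return total / denom if denom else 0.0
```

For a ranking with Rel values 1, 0, 1, the precision is 1 at rank 1 and 2/3 at rank 3, so the average precision over the two relevant items is about 0.83.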
EXPERIMENTS - RELATION CLASSIFICATION • similarity threshold θ = 0.955 • 2629 non-singleton clusters • 6930 singletons
EXPERIMENTS - RELATION CLASSIFICATION • The top 10 clusters with the largest number of lexical patterns • For each cluster, the top four patterns that occur in the largest number of word-pairs
RELATIONAL SIMILARITY MEASURES • Compare the following relational similarity measures • VSM: • Each word-pair is represented by a vector of pattern frequencies • The relational similarity between two word-pairs is computed as the cosine similarity of their vectors • LRA: • Latent Relational Analysis • Create a matrix in which the rows represent word-pairs and the columns represent lexical patterns • Apply singular value decomposition (SVD) to the matrix
RELATIONAL SIMILARITY MEASURES • IP: • Set the correlation matrix in Formula 2 to the identity matrix • Compute relational similarity using pattern clusters • CORR: • The proposed relational similarity measure
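As a point of comparison, the VSM baseline needs no clustering at all. A minimal sketch, assuming each word-pair is stored as a sparse dict from pattern to frequency; the name `vsm_similarity` and the example counts in the usage note are hypothetical.

```python
import math


def vsm_similarity(freq_ab, freq_cd):
    """Cosine similarity between two pattern-frequency vectors (dicts)."""
    dot = sum(f * freq_cd.get(p, 0) for p, f in freq_ab.items())
    norm_ab = math.sqrt(sum(f * f for f in freq_ab.values()))
    norm_cd = math.sqrt(sum(f * f for f in freq_cd.values()))
    return dot / (norm_ab * norm_cd) if norm_ab and norm_cd else 0.0
```

Two word-pairs that share no pattern score 0 under this baseline, for example `vsm_similarity({"X acquired Y": 3}, {"X bought Y": 2})`, even when their patterns express the same relation. The cluster-based measures are designed to address this case.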
CONCLUSIONS • We proposed a method to compute the similarity between implicit semantic relations in two word-pairs. • Requires only a few queries to compute • Can quickly compute relational similarity for unseen word-pairs • A general framework: designing relational similarity measures can be modeled as searching for a matrix