Three Approaches to Unsupervised WSD Dmitriy Dligach
Unsupervised WSD • No training corpora needed • No predefined tag set needed • Three approaches • Context-group Discrimination (Schütze, 1998) • Graph-based Algorithms (Agirre et al., 2006) • HyperLex (Véronis, 2004) • PageRank (Brin and Page, 1998) • Predominant Sense (McCarthy, 2006) • Thesaurus generation • Method in (Lin, 1998) • Earlier version in (Hindle, 1990)
Context-group Discrimination Algorithm • Sense Representations • Generate word vectors • Generate context vectors (from co-occurrence matrix) • Generate sense vectors (by clustering context vectors) • Disambiguate by computing proximity
Word Vectors • Represent each word w as a vector of co-occurrence counts • Two strategies to select dimensions • Local: select words from the contexts of the ambiguous word within a 50-word window • Either the 1,000 most frequent words, or • Use the χ² measure of dependence to pick 1,000 words • Global: select from the entire corpus regardless of the target word • Select the 20,000 most frequent words as features • 2,000 as dimensions • 20,000-by-2,000 co-occurrence matrix
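A minimal sketch of the local strategy, using only the standard library. The function name, the pre-tokenized corpus, and the 25-tokens-per-side window (to approximate the 50-word window) are assumptions for illustration, not details from Schütze's implementation:

```python
from collections import Counter

def word_vectors(corpus_tokens, target, window=25, n_dims=1000):
    """Sketch of the 'local' strategy: dimensions are the most frequent
    words found near the ambiguous target word."""
    # Collect words co-occurring with the target inside the window.
    context_counts = Counter()
    positions = [i for i, tok in enumerate(corpus_tokens) if tok == target]
    for i in positions:
        lo, hi = max(0, i - window), min(len(corpus_tokens), i + window + 1)
        context_counts.update(corpus_tokens[lo:i] + corpus_tokens[i + 1:hi])
    dims = [w for w, _ in context_counts.most_common(n_dims)]
    dim_index = {w: j for j, w in enumerate(dims)}

    # Each feature word's vector is its row of co-occurrence counts
    # against the dimension words.
    vectors = {w: [0] * len(dims) for w in dims}
    for i, tok in enumerate(corpus_tokens):
        if tok not in vectors:
            continue
        lo, hi = max(0, i - window), min(len(corpus_tokens), i + window + 1)
        for other in corpus_tokens[lo:i] + corpus_tokens[i + 1:hi]:
            j = dim_index.get(other)
            if j is not None:
                vectors[tok][j] += 1
    return dims, vectors
```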
Context Vectors • Word vectors conflate the senses of ambiguous words • Represent each context as the centroid of the vectors of the words occurring in it • Weight word vectors by IDF
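A minimal sketch of the centroid computation, assuming the word vectors from the previous sketch and externally supplied document frequencies (doc_freq, n_docs are assumed inputs):

```python
import math

def context_vector(context_words, vectors, doc_freq, n_docs):
    """A context is represented as the IDF-weighted centroid of the
    vectors of the words it contains (a second-order representation)."""
    dim = len(next(iter(vectors.values())))
    centroid = [0.0] * dim
    n = 0
    for w in context_words:
        if w not in vectors:
            continue
        idf = math.log(n_docs / (1 + doc_freq.get(w, 0)))  # rare words weigh more
        for j, v in enumerate(vectors[w]):
            centroid[j] += idf * v
        n += 1
    return [c / max(n, 1) for c in centroid]
```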
Sense Vectors • Cluster the approx. 2,000 context vectors • Use a combination of group-average agglomerative clustering (GAAC) and EM • Cluster a random sample of 50 of the 2,000 contexts with GAAC, which is O(n²) • Centroids of the resulting clusters become the input to EM • The overall procedure is still linear • Perform an SVD on the context vectors • Re-represent context vectors by their values on the 100 principal dimensions
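A sketch of this stage with scikit-learn stand-ins: TruncatedSVD for the SVD step, average-linkage AgglomerativeClustering for GAAC, and GaussianMixture for EM. The function name and defaults are assumptions; it expects a contexts-by-features matrix with more than 100 columns:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def sense_vectors(context_matrix, k=2, sample_size=50, seed=0):
    """GAAC on a small random sample seeds EM over all contexts;
    SVD first reduces contexts to 100 principal dimensions."""
    reduced = TruncatedSVD(n_components=100).fit_transform(context_matrix)

    # Group-average agglomerative clustering is O(n^2), so run it only
    # on a sample; its centroids initialize EM over the full set.
    rng = np.random.default_rng(seed)
    sample = reduced[rng.choice(len(reduced), size=sample_size, replace=False)]
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(sample)
    seeds = np.stack([sample[labels == c].mean(axis=0) for c in range(k)])

    em = GaussianMixture(n_components=k, means_init=seeds).fit(reduced)
    return em.means_  # the sense vectors (cluster centroids)
```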
Evaluation • Hand-labeled corpus of 10 naturally ambiguous words and 10 artificial words (pseudowords) • Throw out low-frequency senses, keeping only the 2 most frequent • Number of clusters • 2 clusters: use the gold standard to evaluate • 10 clusters: no gold standard; use purity • Sense-based IR
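For the 10-cluster setting, purity can be computed as below; a minimal sketch, assuming parallel lists of cluster and gold labels:

```python
from collections import Counter

def purity(cluster_labels, gold_labels):
    """Each cluster votes for its majority gold sense; purity is the
    fraction of instances matching their cluster's majority sense."""
    by_cluster = {}
    for c, g in zip(cluster_labels, gold_labels):
        by_cluster.setdefault(c, Counter())[g] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / len(gold_labels)
```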
Results (highlights) • Overall performance on pseudowords is higher than on naturally ambiguous words • Some pseudowords (wide range/consulting firm) and words (space in the area and volume senses) perform poorly because they are topically amorphous • IR evaluation • Vector-space model with senses as dimensions • 7.4% improvement on the TREC-1 collection
Graph-based Algorithms • Build a co-occurrence matrix • View it as a graph • Small world properties • Most nodes have few connections • Few are highly connected • Look for densely populated regions • Known as High-Density Components • Map ambiguous instances to one of these regions
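A minimal sketch of turning co-occurrence counts into such a graph with networkx; the contexts-as-token-lists input and the min_count threshold are assumptions:

```python
from itertools import combinations
import networkx as nx

def cooccurrence_graph(contexts, min_count=1):
    """Nodes are words; edges link words that co-occur in a context.
    The degree distribution is typically 'small world': most nodes
    have few connections, a few hubs are densely connected."""
    G = nx.Graph()
    for context in contexts:
        for a, b in combinations(set(context), 2):
            if G.has_edge(a, b):
                G[a][b]["count"] += 1
            else:
                G.add_edge(a, b, count=1)
    G.remove_edges_from(
        [(a, b) for a, b, d in G.edges(data=True) if d["count"] < min_count]
    )
    return G
```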
A Sample Co-Occurrence Graph • barrage – dam, play-off, barrier, roadblock, police cordon, barricade
Algorithm Details • Nodes correspond to words • Edges reflect the degree of semantic association between words • Model with conditional probabilities • w(A,B) = 1 − max[p(A|B), p(B|A)] • Detect high-density components • Sort nodes by their degree • Take the top one (the root hub) and remove it along with all its neighbors (hoping to eliminate the entire component) • Iterate until all the high-density components are found
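A sketch of the edge weighting and root-hub detection, assuming the networkx graph from the earlier sketch and a unigram frequency table; p(A|B) is estimated as count(A,B)/count(B), and the max_hubs cutoff is a simplification (Véronis uses frequency and degree thresholds to decide when to stop):

```python
def add_weights(G, freq):
    """w(A,B) = 1 - max[p(A|B), p(B|A)]: small weight = strong association."""
    for a, b, d in G.edges(data=True):
        p_ab = d["count"] / freq[b]   # p(A|B)
        p_ba = d["count"] / freq[a]   # p(B|A)
        d["weight"] = 1 - max(p_ab, p_ba)

def root_hubs(G, max_hubs=10):
    """Repeatedly take the highest-degree remaining node as a root hub
    and delete it with its neighbors, peeling off one high-density
    component per iteration."""
    H = G.copy()
    hubs = []
    while H.number_of_nodes() and len(hubs) < max_hubs:
        hub = max(H.nodes, key=H.degree)
        hubs.append(hub)
        H.remove_nodes_from([hub] + list(H.neighbors(hub)))
    return hubs
```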
Disambiguation • Delineate the high-density components • Need to attach them back to their root hubs • Attach the target word to all root hubs • Compute the minimum spanning tree (MST) • Map the ambiguous instance to one of the components • Examine each word in its context • Compute the distance from each of these words to each root hub (each word is under exactly one hub) • Compute the total score for each hub
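A sketch of the disambiguation step with networkx; the 1/(1 + distance) contribution per context word is modeled on Véronis's scoring, but the exact form used here is an assumption:

```python
import networkx as nx

def disambiguate(G, target, hubs, context_words):
    """Attach the target to every root hub with zero-weight edges,
    build an MST, then score each hub by how close the context words
    sitting under it are to the target."""
    G = G.copy()
    for hub in hubs:
        G.add_edge(target, hub, weight=0.0)
    mst = nx.minimum_spanning_tree(G, weight="weight")

    scores = dict.fromkeys(hubs, 0.0)
    for w in context_words:
        if w not in mst:
            continue
        path = nx.shortest_path(mst, target, w)   # unique path in a tree
        hub = next(n for n in path if n in hubs)  # each word is under one hub
        dist = nx.path_weight(mst, path, weight="weight")
        scores[hub] += 1.0 / (1.0 + dist)         # closer words count more
    return max(scores, key=scores.get)
```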
PageRank • Based on PageRank (Brin and Page, 1998), adapted for weighted graphs • An alternative way to rank nodes • Algorithm • Initialize nodes to random values • Compute the PageRank update • Iterate a fixed number of times
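A minimal weighted PageRank sketch. It assumes edge weights where larger means stronger association (HyperLex-style distance weights would need to be inverted first); the damping factor and iteration count are conventional defaults, not values from Agirre et al.:

```python
def pagerank(G, damping=0.85, iters=30):
    """A node's rank is split among its neighbors in proportion to
    edge weights; iterate a fixed number of times from a uniform start."""
    n_nodes = G.number_of_nodes()
    rank = {n: 1.0 / n_nodes for n in G}
    out_weight = {n: sum(d["weight"] for _, _, d in G.edges(n, data=True)) for n in G}
    for _ in range(iters):
        new = {}
        for n in G:
            incoming = sum(
                rank[u] * G[u][n]["weight"] / out_weight[u]
                for u in G.neighbors(n) if out_weight[u] > 0
            )
            new[n] = (1 - damping) / n_nodes + damping * incoming
        rank = new
    return rank
```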
Evaluation • First need to optimize 10 parameters, including: • P1. Minimum frequency of edges (occurrences) • P2. Minimum frequency of vertices (words) • P3. Edges with weights above this value are removed • Train on Senseval2 using unsupervised metrics • Entropy, Purity, and F-score • Evaluate on Senseval3 • Lexical sample task • 10-point gain over the MFS baseline • Beats a supervised system with lexical features by 1 point • All-words task • Little training data • Supervised systems barely beat the MFS baseline • This system is less than 1 point below the best system • The difference in performance is not statistically significant
Finding the Predominant Sense • Predominant senses in WordNet are derived from SemCor (a relatively small subset of the Brown corpus) • Idiosyncrasies • tiger (audacious person, not the animal) • star (celebrity or celestial body, depending on the corpus)
Distributional Similarity • Nouns that occur in object position of the same verbs are similar (e.g. beer and vodka as objects of to drink) • Can automatically generate a thesaurus-like neighbor list for the target word (Hindle, 1990; Lin, 1998) • w_0:s_0, w_1:s_1, …, w_n:s_n • The neighbor list conflates different senses • Quality and quantity of neighbors should relate to the predominant sense • Need to compute the proximity of each neighbor to each of the senses of the target word (e.g. Lesk or JCN similarity)
Algorithm • w – the target word • N_w = {n_1, n_2, …, n_k} – the ordered set of the top k most similar neighbors of the target word • {dss(w, n_1), dss(w, n_2), …, dss(w, n_k)} – distributional similarity scores for the k neighbors • ws_i ∈ senses(w) – the senses of the target word • wnss(ws_i, n_j) – the WordNet similarity score between sense i of the target word and the sense of neighbor n_j that maximizes this score • PrevalenceScore(ws_i) – the ranking of sense i of the target word as the predominant sense
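Putting the pieces together, the McCarthy et al. ranking sums each neighbor's distributional similarity, weighted by its WordNet similarity to the sense normalized over all senses: PrevalenceScore(ws_i) = Σ_j dss(w, n_j) × wnss(ws_i, n_j) / Σ_{i′} wnss(ws_{i′}, n_j). A minimal sketch, with dss and wnss passed in as functions (an assumed interface):

```python
def prevalence_scores(senses, neighbors, dss, wnss):
    """Each neighbor casts one dss-weighted vote, distributed across
    the target's senses in proportion to its WordNet similarity to
    each sense."""
    scores = {s: 0.0 for s in senses}
    for n in neighbors:
        norm = sum(wnss(s2, n) for s2 in senses)
        if norm == 0:
            continue  # neighbor bears no WordNet relation to any sense
        for s in senses:
            scores[s] += dss(n) * wnss(s, n) / norm
    return scores  # the top-scoring sense is predicted predominant
```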
Experiment 1 • Derive a thesaurus from the BNC • SemCor experiments • Metric: accuracy of finding the MFS • Metric: WSD accuracy • Baseline: random-sense accuracy • Upper bound for the WSD task is 67% • Both experiments beat the random baseline (54% and 48%, respectively) • Hand examination • Some errors are due to genre and time-period variation
Experiment 2 • Use the Senseval2 all-words task • Label each word with the first sense computed • automatically • according to SemCor • from the Senseval2 data itself (upper bound) • Automatic precision/recall are only a few points lower than SemCor's
Experiment 3 • Investigate how the MFS changes across domains • SPORTS and FINANCE domains of the Reuters corpus • No hand-annotated data, so examine by hand • Most words displayed the expected change in MFS • e.g. tie changes from draw to affiliation
Discussion: Algorithms • Context • Bag-of-words: Schütze and Agirre et al. • Syntactic: McCarthy et al. • Is bag-of-words sufficient? • E.g. for topically amorphous words • Co-occurrence • Co-occurrence matrix: Schütze and Agirre et al. • Used to look for similar nouns: McCarthy et al. • Order of co-occurrence • First order: all three papers • Second order: Schütze and McCarthy et al. • Higher order: Agirre et al. • PageRank computes global rankings • The MST links all nodes to the root • An advantage of the graph-based methods
Discussion: Evaluation • Testbeds: little ground for cross-comparison • Schütze: his own corpus • Agirre et al.: train parameters on Senseval2, test on Senseval3 data • McCarthy et al.: test on SemCor, Senseval2, and Reuters • Methodology • Map clusters to the gold standard (Schütze and Agirre et al.) • Unsupervised evaluation (Schütze and Agirre et al.) • Compare to various baselines (MFS, Lesk, random) • Use an application (Schütze)