Co-training & Self-training for Word Sense Disambiguation Author: RadaMihalcea
Introduction • Supervised learning -> best performance, but limited to words with sense-tagged data available, and accuracy depends on the amount of labeled data • Methods for building sense classifiers from less annotated data are explored • The applicability of co-training and self-training to supervised word sense disambiguation is investigated • Bootstrapping parameters are tuned for optimal performance
Bootstrapping • For co-training, two independent views are provided by two different feature sets, based on a local versus topical feature split • Self-training requires only one classifier and no feature split is involved • The class distribution ratio observed in the labeled data is kept constant to avoid imbalance in the training data • Parameters to be optimized: number of iterations (I), pool size (P) and growth size (G); a minimal bootstrapping loop is sketched below
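A minimal sketch of the co-training loop described above, assuming scikit-learn Naive Bayes classifiers and examples already encoded as non-negative count vectors; the function and parameter names (co_train, pool_size, growth_size) are illustrative, not taken from the paper, and the constant class-distribution constraint is omitted for brevity. Self-training follows the same loop with a single global classifier and one view.

```python
import random
from sklearn.naive_bayes import MultinomialNB

def co_train(local_X, topical_X, y, u_local, u_topical,
             iterations, pool_size, growth_size):
    """Bootstrapping with two views: each view's classifier labels
    unlabeled examples, which are then added to the shared labeled set."""
    labeled_local, labeled_topical, labels = list(local_X), list(topical_X), list(y)
    remaining = list(range(len(u_local)))            # indices of still-unlabeled examples
    for _ in range(iterations):                      # parameter I
        if not remaining:
            break
        pool = random.sample(remaining, min(pool_size, len(remaining)))   # parameter P
        clf_local = MultinomialNB().fit(labeled_local, labels)
        clf_topical = MultinomialNB().fit(labeled_topical, labels)
        for clf, view in ((clf_local, u_local), (clf_topical, u_topical)):
            probs = clf.predict_proba([view[i] for i in pool])
            # keep the G most confidently labeled pool examples (parameter G)
            best = sorted(zip(pool, probs), key=lambda t: t[1].max(), reverse=True)[:growth_size]
            for i, p in best:
                if i in remaining:
                    labeled_local.append(u_local[i])
                    labeled_topical.append(u_topical[i])
                    labels.append(clf.classes_[p.argmax()])
                    remaining.remove(i)
    # final classifiers trained on the grown labeled set
    return (MultinomialNB().fit(labeled_local, labels),
            MultinomialNB().fit(labeled_topical, labels))
```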
Supervised Word Sense Disambiguation • Preprocessing (a sketch follows below): • Removal of SGML tags • Tokenization • Part-of-speech annotation • Collocation removal • Issues: (1) selection of the best features, (2) choice of the learning algorithm
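A rough illustration of the preprocessing steps listed above, assuming NLTK (with its tokenizer and tagger data) is installed; the tag-stripping regular expression and function name are illustrative only.

```python
import re
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are available

def preprocess(raw_text):
    """Strip SGML tags, tokenize, and annotate tokens with parts of speech."""
    text = re.sub(r"<[^>]+>", " ", raw_text)   # removal of SGML-style tags
    tokens = nltk.word_tokenize(text)          # tokenization
    return nltk.pos_tag(tokens)                # part-of-speech annotation

preprocess("<p>The bank approved the loan.</p>")
# e.g. [('The', 'DT'), ('bank', 'NN'), ('approved', 'VBD'), ...]
```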
Classifiers • Naive Bayes • Local classifier: uses all local features -> used in co-training • Topical classifier: uses the SK features (10 keywords per word sense, each occurring at least 3 times in the annotated corpus) -> used in co-training • Global classifier: combination of the local and topical classifiers -> used in self-training • Supervised learning builds one classifier per word => co-training and self-training behave heterogeneously, and the best parameters differ from classifier to classifier (a sketch of the three classifiers follows below)
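One way to realize the local/topical/global split with scikit-learn Naive Bayes; the feature-extraction helpers and their names are placeholders, since the exact feature definitions come from the paper.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def local_features(tagged_context, position, window=3):
    """Local view: words and POS tags in a small window around the target word."""
    feats = {}
    for offset in range(-window, window + 1):
        i = position + offset
        if 0 <= i < len(tagged_context):
            word, tag = tagged_context[i]
            feats[f"w{offset}={word.lower()}"] = 1
            feats[f"t{offset}={tag}"] = 1
    return feats

def topical_features(tagged_context, sense_keywords):
    """Topical view: which sense-specific keywords occur anywhere in the context."""
    words = {w.lower() for w, _ in tagged_context}
    return {f"kw={k}": 1 for k in sense_keywords if k in words}

# Each view feeds its own Naive Bayes classifier (used in co-training);
# the global classifier uses the union of both feature dictionaries (self-training).
local_clf = make_pipeline(DictVectorizer(), MultinomialNB())
topical_clf = make_pipeline(DictVectorizer(), MultinomialNB())
global_clf = make_pipeline(DictVectorizer(), MultinomialNB())
```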
Parameter optimization • Determine an optimal parameter setting for each word in the data set • Explore different algorithms for selecting the bootstrapping parameters: • Best overall parameter setting • Best individual parameter setting • Best per-word parameter selection • A new method with an improved bootstrapping scheme using majority voting across several iterations
Optimal Settings • Measurements performed on the test set • 40 iterations performed for each setting • Experiments performed separately for co-training and self-training • The best set of values (G, P, I) is determined for each word (see the search sketch below) • Baseline: the global classifier
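A hedged sketch of how the optimal (G, P, I) setting per word could be found by exhaustive search over a grid; the candidate value lists and the evaluate(word, g, p, i) helper, which would run one bootstrapping experiment and return test-set precision, are hypothetical.

```python
from itertools import product

GROWTH = [1, 10, 20, 50, 100]      # illustrative candidate values for G
POOL = [500, 1000, 2000]           # illustrative candidate values for P
ITERATIONS = range(1, 41)          # 40 iterations per setting, as on the slide

def best_setting(word, evaluate):
    """Return the (G, P, I) triple with the highest precision for this word.
    `evaluate` is a hypothetical callable running one bootstrapping experiment."""
    return max(product(GROWTH, POOL, ITERATIONS),
               key=lambda gpi: evaluate(word, *gpi))
```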
Observations • Co-training and self-training have the same performance under optimal settings • Words with high baseline classifier performance show no improvement with either co-training or self-training • There are no commonalities among the optimal parameters of the different classifiers
Empirical Settings • Determining the optimal parameter values experimentally might be difficult • 20% of the training data is used to determine empirical settings • For each run (G, P, I), the precision of the base classifier and of the boosted classifier is recorded • Expt1: determine the total relative growth in performance for each possible parameter setting by adding up the relative improvements over all runs with that setting • Next, the value of each parameter is determined independently of the other parameters, following a similar approach
• The value leading to the highest growth is selected (a sketch of this selection follows below) • Co-training and self-training share the same set of parameter values giving the highest growth • Average results are worse than the baseline • Expt2: the best parameter values are identified for each word • The base classifier still performs better
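A small sketch of the "total relative growth" selection from Expt1, assuming each run is recorded as a (setting, base precision, boosted precision) tuple; the function name and data layout are illustrative.

```python
from collections import defaultdict

def pick_setting_by_relative_growth(runs):
    """runs: iterable of ((G, P, I), base_precision, boosted_precision) tuples.
    Sum the relative improvement over all runs for each setting; keep the best."""
    growth = defaultdict(float)
    for setting, base, boosted in runs:
        growth[setting] += (boosted - base) / base   # relative improvement for this run
    return max(growth, key=growth.get)
```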
Majority voting • Bootstrapping learning curves first exhibit non-uniform growth and then decline • The iterations at which the maximum and minimum are reached vary across classifiers • Combining co-training and self-training with majority voting slows the learning rate and produces a larger interval of constant performance • Performance stays above the baseline for a larger interval of iterations
• The parameter settings evaluation is repeated for the smoothed co-training and self-training with majority voting (see the voting sketch below) • Co-training results improve • Self-training results show no significant improvement
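A minimal sketch of smoothing by majority voting across bootstrapping iterations: each test example's final label is the one predicted most often by the classifiers saved at different iterations. The classifier list is assumed to come from a loop like the one sketched earlier; ties are broken arbitrarily.

```python
from collections import Counter

def majority_vote(classifiers, X):
    """Combine classifiers saved at several bootstrapping iterations:
    the label predicted most often for each example wins."""
    all_preds = [clf.predict(X) for clf in classifiers]
    return [Counter(preds).most_common(1)[0][0] for preds in zip(*all_preds)]
```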
Discussions • Dependencies might be present between the two feature views, since both are extracted from the same context • Words with accurate base classifiers show no improvement • Words with a higher number of senses show no improvement • Words with large subsets of their senses belonging to different domains show little or no improvement