COP5992 – DATA MINING TERM PROJECT
RANDOM SUBSPACE METHOD + CO-TRAINING
by SELIM KALAYCI
RANDOM SUBSPACE METHOD (RSM) • Proposed by Ho, "The Random Subspace Method for Constructing Decision Forests", 1998 • A technique for combining weak classifiers, in the same family as Bagging and Boosting.
RSM ALGORITHM
1. Repeat for b = 1, 2, . . ., B:
  (a) Select an r-dimensional random subspace X̃b from the original p-dimensional feature space X.
  (b) Construct a classifier Cb(x) in the subspace X̃b.
2. Combine the classifiers Cb(x), b = 1, 2, . . ., B, by simple majority voting into a final decision rule.
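A minimal Python sketch of this algorithm, assuming numeric features, integer class labels 0..K-1, and scikit-learn decision trees as the base classifier Cb; the names rsm_fit/rsm_predict and the defaults B = 50 and r = p/2 are illustrative choices, not from the slides.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rsm_fit(X, y, B=50, r=None, seed=None):
    # Step 1: train B base classifiers, each on an r-dimensional
    # random subspace of the p-dimensional feature space.
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    if r is None:
        r = p // 2  # roughly half the features, as in Ho's experiments
    ensemble = []
    for _ in range(B):
        feats = rng.choice(p, size=r, replace=False)        # step 1(a)
        clf = DecisionTreeClassifier().fit(X[:, feats], y)  # step 1(b)
        ensemble.append((feats, clf))
    return ensemble

def rsm_predict(ensemble, X):
    # Step 2: combine the B classifiers by simple majority voting.
    votes = np.stack([clf.predict(X[:, feats]) for feats, clf in ensemble])
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])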
MOTIVATION FOR RSM • Redundancy in the data feature space, either as a completely redundant feature set or as redundancy spread over many features • Weak classifiers whose training sample size is critical (small relative to the number of features)
RSM PERFORMANCE ISSUES • RSM Performance depends on: • Training sample size • The choice of a base classifier • The choice of combining rule (simple majority vs. weighted) • The degree of redundancy of the dataset • The number of features chosen
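To make the combining-rule choice concrete, below is a small sketch of weighted voting that works with the (features, classifier) pairs produced by the rsm_fit sketch above; using per-classifier validation accuracy as the weight is an assumption for illustration, and simple majority voting is the special case where all weights equal 1.

import numpy as np

def weighted_vote(ensemble, weights, X, n_classes):
    # Each classifier's vote counts in proportion to its weight,
    # e.g. its accuracy on a held-out validation set (an assumed choice).
    scores = np.zeros((X.shape[0], n_classes))
    for (feats, clf), w in zip(ensemble, weights):
        pred = clf.predict(X[:, feats]).astype(int)
        scores[np.arange(len(pred)), pred] += w
    return scores.argmax(axis=1)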
DECISION FORESTS (by Ho) • A combination of trees instead of a single tree • Assumption: the dataset has some redundant features • Works efficiently with any decision tree algorithm and data splitting method • Ideally, seek the most accurate individual trees with the lowest similarity between them
UNLABELED DATA • Small number of labeled documents • Large pool of unlabeled documents • How to classify unlabeled documents accurately?
CO-TRAINING • Blum and Mitchell, "Combining Labeled and Unlabeled Data with Co-Training", 1998. • Requirements: • Two feature sets (views), each sufficient to train a strong classifier • The two views are conditionally independent given the class
APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm:
Obtain a small set L of labeled examples
Obtain a large set U of unlabeled examples
Obtain two sets F1 and F2 of features that are sufficiently redundant
While U is not empty do:
  Learn classifier C1 from L based on F1
  Learn classifier C2 from L based on F2
  For each classifier Ci do:
    Ci labels examples from U based on Fi
    Ci chooses the most confidently predicted examples E from U
    E is removed from U and added (with their given labels) to L
End loop
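A minimal Python sketch of this loop, assuming numeric features and Gaussian naive Bayes as the base learner (Blum and Mitchell used naive Bayes over word statistics); the name co_training and the parameters k and max_rounds are illustrative, and picking the top-k most confident examples overall simplifies the paper's per-class growth quotas.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X_lab, y_lab, X_unlab, F1, F2, k=1, max_rounds=30):
    # L = (L_X, L_y): labeled pool; U: unlabeled pool;
    # F1, F2: column indices of the two redundant feature sets.
    L_X, L_y, U = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    c1, c2 = GaussianNB(), GaussianNB()
    for _ in range(max_rounds):
        c1.fit(L_X[:, F1], L_y)  # learn C1 from L based on F1
        c2.fit(L_X[:, F2], L_y)  # learn C2 from L based on F2
        if len(U) == 0:
            break
        moved = []
        for clf, F in ((c1, F1), (c2, F2)):
            proba = clf.predict_proba(U[:, F])
            pick = np.argsort(proba.max(axis=1))[-k:]          # most confident E
            labels = clf.classes_[proba[pick].argmax(axis=1)]  # Ci's own labels
            L_X = np.vstack([L_X, U[pick]])                    # add E to L
            L_y = np.concatenate([L_y, labels])
            moved.extend(pick.tolist())
        # Remove E from U. If both classifiers picked the same example,
        # this sketch adds it to L twice; a fuller version would resolve that.
        U = np.delete(U, np.unique(moved), axis=0)
    return c1, c2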
THINGS TO DO • How can we measure redundancy and use it efficiently? • Can we improve Co-training? • How can we apply RSM efficiently to: • Supervised learning • Semi-supervised learning • Unsupervised learning
QUESTIONS?