Semi-supervised learning for protein classification
Brian R. King, Department of Computer Science, University at Albany, SUNY
Chittibabu Guda, Ph.D., Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY
The problem
• Develop computational models of protein structure and function from sequence alone, using machine-learned classifiers
• Input: data; Output: a model (function) h : X → Y
• Traditional approach: supervised learning
• Challenges:
  • Experimentally determined data are expensive, limited, and subject to noise/error
  • Large repositories of unannotated data (TrEMBL 37.5: 5,035,267 sequences vs. Swiss-Prot 54.5: 289,473)
  • Data representation, bias from unbalanced / underrepresented classes, etc.
• AIM: develop a method that uses both labeled and unlabeled data, while improving performance given the challenges presented by small, unbalanced datasets
Solution
• Semi-supervised learning: use Dl (labeled) and Du (unlabeled) data for model induction
• Method: generative, Bayesian probabilistic model
  • Based on ngLOC, a supervised Naïve Bayes classification method
  • Input / feature representation: sequence n-gram model
  • Assumptions: multinomial distribution; sequences and n-grams are i.i.d.
  • Parameter estimation: use EXPECTATION MAXIMIZATION! (see the sketch below)
• Test setup: prediction of subcellular localization, eukaryotic non-plant sequences only
  • Dl: data annotated with subcellular localization for eukaryotic, non-plant sequences
    • DL-2: EXT/PLA (~5,500 sequences, balanced)
    • DL-3: GOL [65%] / LYS [14%] / POX [21%] (~600 sequences, unbalanced)
  • Du: set drawn from ~75K eukaryotic, non-plant protein sequences
• Comparative method: transductive SVM (TSVM)
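Below is a minimal sketch of the approach on this slide: a multinomial Naïve Bayes classifier over sequence n-gram counts, with EM folding the unlabeled set Du into parameter estimation. All function names, the smoothing constant, and the initialization scheme are illustrative assumptions, not taken from the ngLOC implementation.

```python
import numpy as np

def ngram_counts(seq, n, vocab):
    """Count overlapping n-grams of a sequence against a fixed vocabulary (dict: gram -> index)."""
    x = np.zeros(len(vocab))
    for i in range(len(seq) - n + 1):
        g = seq[i:i + n]
        if g in vocab:
            x[vocab[g]] += 1
    return x

def fit_nb(X, R, alpha=1.0):
    """M-step: class priors and multinomial parameters from a (soft) label matrix R."""
    priors = R.sum(axis=0) / R.sum()
    counts = R.T @ X + alpha                       # Laplace smoothing (assumed)
    theta = counts / counts.sum(axis=1, keepdims=True)
    return np.log(priors), np.log(theta)

def posteriors(X, log_prior, log_theta):
    """E-step: P(class | x) under the multinomial Naive Bayes model."""
    log_p = X @ log_theta.T + log_prior
    log_p -= log_p.max(axis=1, keepdims=True)      # shift for numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(Xl, yl, Xu, n_classes, n_iter=20):
    """EM over labeled (Xl, yl) and unlabeled (Xu) n-gram count matrices."""
    Rl = np.eye(n_classes)[yl]                     # hard labels for Dl
    lp, lt = fit_nb(Xl, Rl)                        # initialize from Dl alone
    for _ in range(n_iter):
        Ru = posteriors(Xu, lp, lt)                # E-step on Du
        lp, lt = fit_nb(np.vstack([Xl, Xu]), np.vstack([Rl, Ru]))  # M-step on both
    return lp, lt
```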
Algorithms based on EM

EM-λ on DL-3 data
• λ controls the effect of unlabeled (UL) data on parameter adjustments (see the sketch below)
• ALL labeled data (~600 sequences); varied amounts of UL data
• EM-λ outperforms TSVM on this problem
  • (TSVM failed to converge on large amounts of UL data, despite parameter selection)
• NOTE: TSVM performed very well on binary, balanced classification problems

Basic EM on DL-2 data
• Varied amounts of labeled data; 25,000 UL sequences
• Most improvement when labeled data is limited
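A hedged sketch of the λ idea: unlabeled posteriors are scaled by λ ∈ [0, 1] before the M-step, so each Du instance contributes at most λ of a labeled instance's weight. This reuses fit_nb and posteriors from the sketch above; the exact weighting scheme used in the talk may differ.

```python
import numpy as np

def em_lambda(Xl, yl, Xu, n_classes, lam=0.1, n_iter=20):
    """EM with unlabeled data down-weighted by lam (fit_nb/posteriors as sketched earlier)."""
    Rl = np.eye(n_classes)[yl]
    lp, lt = fit_nb(Xl, Rl)
    for _ in range(n_iter):
        Ru = lam * posteriors(Xu, lp, lt)          # lam shrinks Du's vote in the M-step
        lp, lt = fit_nb(np.vstack([Xl, Xu]), np.vstack([Rl, Ru]))
    return lp, lt
```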
Algorithm – EM-CS
• The core ngLOC method outputs a confidence score (CS)
• Improve running time through intelligent selection of unlabeled instances: use instance xi only if CS(xi) ≥ CSthresh (see the selection sketch below)
• Test on DL-3 data: first, determine the range of CS scores through cross-validation without UL data: 33.5-47.8 (dependent on the level of similarity in the data and the size of the dataset)
• Using only sequences that meet or exceed CSthresh significantly reduces the UL data required (97.5% eliminated)
• NOTE: it is possible to reduce the UL data too much
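The selection step might look like the following. ngLOC's published CS is a specific function of the class posteriors; here the margin between the top two posteriors, scaled to 0-100, is used as a stand-in, so both the score definition and the default threshold are assumptions.

```python
import numpy as np

def confidence_scores(Xu, log_prior, log_theta):
    """Stand-in CS: margin between the top two class posteriors, scaled to [0, 100]."""
    P = posteriors(Xu, log_prior, log_theta)       # posteriors() from the earlier sketch
    top2 = np.sort(P, axis=1)[:, -2:]
    return 100.0 * (top2[:, 1] - top2[:, 0])

def select_unlabeled(Xu, log_prior, log_theta, cs_thresh=40.0):
    """Keep only unlabeled instances whose CS meets the threshold (most of Du is dropped)."""
    keep = confidence_scores(Xu, log_prior, log_theta) >= cs_thresh
    return Xu[keep]
```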
Conclusion
• Benefits:
  • Probabilistic: can extract unlabeled sequences of "high confidence" (difficult with SVM or TSVM)
  • Extraction of knowledge from the model: discriminative n-grams and anomalies, via information-theoretic measures such as KL-divergence (again, difficult with SVM or TSVM; see the sketch below)
  • Computational resources: time significantly lower than SVM and TSVM; space dependent on the n-gram model
  • Can use large amounts of unlabeled data
  • Applicable to prediction of any structural or functional characteristic
  • Outputs a global model (transduction is not global!)
• Most substantial gain with limited labeled data
• Current work in progress:
  • TSVMs
  • Improve performance on smaller, unbalanced data
  • Select an improved, lower-dimensional feature-space representation
  • Ensemble classifiers, Bayesian model averaging, mixture of experts
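One way the "discriminative n-grams" point could be realized, shown here purely as an illustration: rank each n-gram by how far its per-class multinomial probabilities diverge from the pooled (class-marginal) distribution, a KL-style score. The formulation below is an assumption, not the measure used in the talk.

```python
import numpy as np

def discriminative_ngrams(log_theta, class_priors, top_k=20):
    """Rank n-grams by a per-feature KL-style divergence from the pooled distribution.

    log_theta:    (n_classes, n_ngrams) log multinomial parameters
    class_priors: (n_classes,) class probabilities (e.g. np.exp(log_prior))
    """
    theta = np.exp(log_theta)
    pooled = class_priors @ theta                  # marginal n-gram distribution
    contrib = class_priors[:, None] * theta * np.log(theta / pooled)
    score = contrib.sum(axis=0)                    # per-n-gram divergence contribution
    return np.argsort(score)[::-1][:top_k]         # indices of the most discriminative n-grams
```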