Regression based KNN for gene function prediction using heterogeneous data sources
Zizhen Yao, Larry Ruzzo
yzizhen, ruzzo @cs.washington.edu
Background
• E. coli classification schemes
  • KEGG, COG, MultiFun
  • Common functional classes (10-19 classes): Metabolism, Translation, Transporter, Cell Motility
• Biological information used for inference
  • Microarray expression, protein interaction, evolutionary history
• Methods
  • Support vector machine, Bayesian, rule-based
Introduction to KNN
• Idea – for each query instance:
  • Choose the k nearest neighbors
  • Choose the class voted for by the majority of the neighbors
• Design issues
  • Similarity / distance metric
  • Voting scheme
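A minimal sketch of the plain KNN idea above, assuming Euclidean distance over numeric feature vectors and simple majority voting; the function and variable names (and the toy class labels) are illustrative, not taken from the poster.

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_X, train_y, k=5):
    """Plain KNN: find the k training instances closest to the query
    (Euclidean distance here) and return the majority class."""
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every training instance
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    votes = Counter(train_y[i] for i in nearest)      # tally the neighbors' class labels
    return votes.most_common(1)[0][0]                 # majority vote

# Toy example: three genes described by 2-D feature vectors, two classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
y = np.array(["metabolism", "metabolism", "transport"])
print(knn_classify(np.array([0.05, 0.1]), X, y, k=2))  # -> metabolism
```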
Algorithm Flow Chart
• Training (training data)
  • For every pair of training genes, calculate the predictors
  • Learn the similarity metric
• Testing (testing data)
  • Calculate the predictor values using the testing and training data
  • Choose the k nearest neighbors
  • Voting
  • Output: a list of predictions with confidence scores
Predictors
• Microarray expression data
  • Expression correlation
• Sequencing data
  • Chromosomal position
  • Chromosomal distance
  • Transcription direction
  • Block indicator
  • Protein sequence similarity
  • Paralog indicator
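Several of the predictors above are simple functions of a gene pair. The sketch below shows what three of them might look like; the dict fields ('expr', 'pos', 'strand') and the default genome length are assumptions for illustration, and the block indicator, paralog indicator, and protein sequence similarity are omitted.

```python
import numpy as np

def pair_predictors(gene_a, gene_b, genome_length=4_600_000):
    """Pairwise predictors for one gene pair.  Each gene is a dict with
    hypothetical fields: 'expr' (microarray expression profile),
    'pos' (chromosomal position in bp), and 'strand' ('+' or '-')."""
    # Expression correlation across microarray conditions
    expr_corr = np.corrcoef(gene_a["expr"], gene_b["expr"])[0, 1]

    # Chromosomal distance on the (circular) E. coli chromosome
    d = abs(gene_a["pos"] - gene_b["pos"])
    chrom_dist = min(d, genome_length - d)

    # Indicator: both genes transcribed in the same direction
    same_direction = int(gene_a["strand"] == gene_b["strand"])

    return np.array([expr_corr, chrom_dist, same_direction])

# Toy example with made-up values
g1 = {"expr": [1.2, 0.8, 2.1, 0.3], "pos": 120_000, "strand": "+"}
g2 = {"expr": [1.0, 0.9, 1.8, 0.4], "pos": 125_500, "strand": "+"}
print(pair_predictors(g1, g2))
```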
Similarity (Distance) Metric
• Classical metrics are not appropriate because the predictors
  • are of heterogeneous data types and scales
  • differ in relevance
  • are correlated
• Goal: estimate the likelihood that a pair of genes is in the same class, based on the predictors
Learning Similarity Metric
• Regression methods
  • Response: whether the pair of genes is in the same class
  • Find f mapping the predictors to this response
• Logistic regression
• Local regression
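One concrete way to realize the logistic-regression variant: treat every pair of labeled training genes as one example, with the pairwise predictor vector as input and a 0/1 response indicating whether the two genes share a function class. The sketch below uses scikit-learn's LogisticRegression; the helper names and the reuse of `pair_predictors`-style features are assumptions, and the local-regression variant is not shown.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def build_pair_examples(genes, labels, predictor_fn):
    """One example per pair of labeled genes:
    X = pairwise predictors, y = 1 if the pair shares a function class."""
    X, y = [], []
    for i, j in combinations(range(len(genes)), 2):
        X.append(predictor_fn(genes[i], genes[j]))
        y.append(int(labels[i] == labels[j]))
    return np.array(X), np.array(y)

def learn_similarity_metric(X_pairs, y_same_class):
    """Fit f: predictors -> P(same class) with logistic regression."""
    return LogisticRegression(max_iter=1000).fit(X_pairs, y_same_class)

def similarity(model, pair_features):
    """Similarity of a gene pair = estimated probability of sharing a class."""
    return model.predict_proba(np.asarray(pair_features).reshape(1, -1))[0, 1]
```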
Probabilistic Voting Scheme
• Goal: estimate the probability that the query gene belongs to each class
  • Range: [0, 1]
• Assigns a higher confidence score to predictions voted for by more neighbors, or by neighbors with higher credibility
• Report predictions that are above a certain threshold value
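A sketch of one way such a voting scheme could work, assuming the similarity model above already gives P(query and neighbor share a class) for each of the k nearest neighbors; the specific weighting below is an illustrative choice, not necessarily the formula used in this work.

```python
def probabilistic_vote(neighbor_classes, neighbor_probs, threshold=0.5):
    """Turn the k nearest neighbors into per-class confidence scores in [0, 1].
    neighbor_probs[i] is the learned P(query and neighbor i share a class).
    Illustrative scheme: each class scores the credibility-weighted fraction
    of neighbors that carry it, so both more neighbors and more credible
    neighbors raise the score; only classes above the threshold are reported."""
    total = sum(neighbor_probs)
    scores = {}
    for cls, p in zip(neighbor_classes, neighbor_probs):
        scores[cls] = scores.get(cls, 0.0) + p / total
    return {cls: s for cls, s in scores.items() if s >= threshold}

# Toy example: two credible "metabolism" neighbors, one weak "transport" neighbor
print(probabilistic_vote(["metabolism", "metabolism", "transport"],
                         [0.9, 0.8, 0.3]))   # ~ {'metabolism': 0.85}
```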
Results Summary
• Combining all 4 predictors yields the best result.
• Using expression data only, the regression-based KNN method outperforms SVM.
• Performance varies across function classes.
• Confidence scores are strongly correlated with accuracy.
Contribution
• KNN
  • Simplicity, efficiency, flexibility
  • Results are easy to interpret and useful for guiding case studies
• Similarity metric
  • Integrates heterogeneous data sources
• Voting scheme
  • Statistical inference
• A general framework to incorporate other information