Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks
Yetian Chen
2008-12-12

2008 Nobel Prize in Chemistry
• Roger Tsien
• Osamu Shimomura
• Martin Chalfie
• Green Fluorescent Protein (GFP): use GFP to track a protein in living cells

The cellular localization information of a protein is embedded in the protein sequence
• PKKKRKV: nuclear localization signal
• VALLAL: transmembrane segment
• Challenge: predict the cellular localization sites from the amino acid sequence of a protein

Extracting cellular localization information from protein sequence
• mcg: McGeoch's method for signal sequence recognition.
• gvh: von Heijne's method for signal sequence recognition.
• alm: Score of the ALOM membrane spanning region prediction program.
• mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.
• erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.
• pox: Peroxisomal targeting signal in the C-terminus.
• vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.
• nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.

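As a small illustration of how one of these attributes could be derived, the sketch below computes an erl-style binary flag from a sequence string. The function name `erl_attribute` and the example sequences are hypothetical. The slide only states that erl records the presence of the "HDEL" substring, so a plain substring test is assumed here.

```python
def erl_attribute(sequence: str) -> int:
    """Return 1 if the 'HDEL' substring occurs in the sequence, else 0.

    Assumption: a plain case-insensitive substring test, as the slide
    describes the attribute only as "presence of HDEL substring".
    """
    return 1 if "HDEL" in sequence.upper() else 0


# Hypothetical sequences, for illustration only.
print(erl_attribute("MKTLLAVHDEL"))  # 1
print(erl_attribute("MKTLLAVAAAA"))  # 0
```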
Problem Statement & Datasets

Protein Name   mcg    gvh    lip    chg    aac    alm1   alm2   Location
EMRB_ECOLI     0.71   0.52   0.48   0.50   0.64   1.00   0.99   cp
ATKC_ECOLI     0.85   0.53   0.48   0.50   0.53   0.52   0.35   imS
NFRB_ECOLI     0.63   0.49   0.48   0.50   0.54   0.76   0.79   im

• Dataset 1: 336 proteins from E. coli (Prokaryote Kingdom)
  http://archive.ics.uci.edu/ml/datasets/Ecoli
• Dataset 2: 1484 proteins from yeast (Eukaryote Kingdom)
  http://archive.ics.uci.edu/ml/datasets/Yeast

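The rows above can be read into (name, features, location) records with a few lines of Python. This is a sketch that assumes the whitespace-separated layout shown on the slide (name first, location last). `ProteinRecord` and `parse_record` are hypothetical names, and the three rows are the ones from the slide, not the full dataset.

```python
from typing import List, NamedTuple


class ProteinRecord(NamedTuple):
    name: str
    features: List[float]
    location: str


def parse_record(line: str) -> ProteinRecord:
    """Split one whitespace-separated row into name, feature values, location."""
    fields = line.split()
    return ProteinRecord(fields[0], [float(v) for v in fields[1:-1]], fields[-1])


# The three example rows shown on the slide.
SAMPLE_ROWS = """\
EMRB_ECOLI 0.71 0.52 0.48 0.50 0.64 1.00 0.99 cp
ATKC_ECOLI 0.85 0.53 0.48 0.50 0.53 0.52 0.35 imS
NFRB_ECOLI 0.63 0.49 0.48 0.50 0.54 0.76 0.79 im
"""

records = [parse_record(row) for row in SAMPLE_ROWS.splitlines() if row.strip()]
for record in records:
    print(record.name, len(record.features), record.location)
```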
Implementation of AI algorithms
• Decision Tree
> C5
• Neural Network
> Single-layer feed-forward NN: Perceptrons
> Multilayer feed-forward NN: one hidden layer

Implementation of Decision Tree: C5
• Preprocessing of dataset
> If a feature is linear and continuous, divide its range into 5 equal-width bins (tiny, small, medium, large, huge), then discretize each value into its bin.
> If a feature value is missing (?), replace ? with tiny.
• Generating training set and test set
> Randomly split the dataset so that 70% goes to the training set and 30% to the test set.
• Learning the decision tree
> Using the decision tree learning algorithm in Chapter 18.3 of the textbook
• Testing

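The preprocessing and splitting steps above could look roughly like the following. This is a sketch under assumptions: the function names, the use of `None` for a missing value, the clamping of out-of-range values, and the `seed` parameter are illustrative choices, not details taken from the slides.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")

BIN_LABELS = ["tiny", "small", "medium", "large", "huge"]


def discretize(value: Optional[float], lo: float, hi: float) -> str:
    """Map a value to one of 5 equal-width bins over [lo, hi].

    A missing value (None here, '?' in the data) is replaced with 'tiny'.
    """
    if value is None or hi == lo:
        return BIN_LABELS[0]
    width = (hi - lo) / len(BIN_LABELS)
    index = int((value - lo) / width)
    # Clamp so that value == hi falls into the last bin.
    return BIN_LABELS[min(max(index, 0), len(BIN_LABELS) - 1)]


def split_dataset(
    examples: Sequence[T], train_fraction: float = 0.7, seed: Optional[int] = None
) -> Tuple[List[T], List[T]]:
    """Randomly split examples into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(10), seed=0)
print(len(train), len(test))  # 7 3
print(discretize(0.5, 0.0, 1.0))  # medium
```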
Implementation of Neural Networks
• Structure of Perceptrons and two-layer NN
[Figure: two network diagrams, labeled "Perceptrons" and "Two-layer NN". Inputs: Att 1, Att 2, Att 3, Att 4. Output nodes: cp, imS, im. For the example shown, the desired output is cp = 1, imS = 0, im = 0.]
Example input: the sample E. coli rows from the Problem Statement & Datasets slide (features mcg, gvh, lip, chg, aac, alm1, alm2; locations cp, imS, im).

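The desired output in the figure (cp = 1, imS = 0, im = 0) corresponds to a one-hot encoding of the location label, with one output node per class. A minimal sketch, where `one_hot` and `CLASSES` are hypothetical names and only the three locations shown on the slide are listed:

```python
from typing import List


def one_hot(location: str, classes: List[str]) -> List[int]:
    """Desired output vector: 1 for the true location's node, 0 elsewhere."""
    return [1 if c == location else 0 for c in classes]


CLASSES = ["cp", "imS", "im"]  # only the locations shown on the slide

print(one_hot("cp", CLASSES))   # [1, 0, 0]
print(one_hot("imS", CLASSES))  # [0, 1, 0]
```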
Implementation of Perceptrons & Two-layer NN: Algorithms

Function PERCEPTRONS-LEARNING(examples, network)
  initially set correct = 0
  initialize the weight matrix w[i][j] with random numbers within [-0.5, 0.5]
  While (correct < threshold)   // threshold = 0.0, 0.1, 0.2, …, 1.0
    for each e in examples do
      calculate output for each output node   // g() is the sigmoid function [formula not preserved]
      prediction = r such that [condition not preserved]
      if r != y(e)
        for each output node j
          for i = 1, …, m
            [weight update for w[i][j] not preserved]
          endfor
        endfor
      endif
    endfor
  endwhile
  Return w[i][j]

2-layer NN(example, network)
  Using the Back-Prop-Learning in Chap 20.5 of textbook

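The formulas for the node output, the prediction rule, and the weight update did not survive in the pseudocode above, so the Python sketch below fills them in with common choices that are assumptions, not the author's confirmed method: output out_j = g(sum_i w[i][j] * x_i), prediction by largest output, and the gradient-descent (delta) rule for sigmoid units. The learning rate `alpha`, the `max_epochs` guard, and the convention of appending a constant 1.0 to each input as a bias are also illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, index of true class)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def outputs(w: List[List[float]], x: Sequence[float]) -> List[float]:
    """Sigmoid activation of every output node j (assumed form of the output)."""
    return [sigmoid(sum(w[i][j] * x[i] for i in range(len(x))))
            for j in range(len(w[0]))]


def predict(w: List[List[float]], x: Sequence[float]) -> int:
    """Assumed prediction rule: the output node with the largest activation."""
    out = outputs(w, x)
    return max(range(len(out)), key=out.__getitem__)


def accuracy(w: List[List[float]], examples: Sequence[Example]) -> float:
    return sum(predict(w, x) == y for x, y in examples) / len(examples)


def perceptron_learning(examples: Sequence[Example], n_inputs: int, n_classes: int,
                        threshold: float, alpha: float = 0.1,
                        max_epochs: int = 1000, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    # Weights initialized uniformly in [-0.5, 0.5], as in the pseudocode.
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_classes)] for _ in range(n_inputs)]
    for _ in range(max_epochs):  # guard added here; not part of the pseudocode
        if accuracy(w, examples) >= threshold:  # "correct" read as training accuracy
            break
        for x, y in examples:
            out = outputs(w, x)
            if max(range(n_classes), key=out.__getitem__) != y:
                for j in range(n_classes):
                    target = 1.0 if j == y else 0.0
                    # Assumed update: delta rule, using g'(in) = out * (1 - out).
                    delta = alpha * (target - out[j]) * out[j] * (1.0 - out[j])
                    for i in range(n_inputs):
                        w[i][j] += delta * x[i]
    return w


# Toy, linearly separable data (hypothetical); last input is a constant bias.
toy = [([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1)]
weights = perceptron_learning(toy, n_inputs=3, n_classes=2, threshold=1.0, alpha=0.5)
print(accuracy(weights, toy))
```

As in the pseudocode, weights change only on misclassified examples, and training stops once the fraction of correctly classified training examples reaches the threshold.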
Results
• The statistics for Decision Tree are averaged over 100 runs.
• The statistics for Perceptrons and Two-layer NN are averaged over 50 runs.
• Threshold is the termination condition for training the neural networks.

Conclusions
• The two datasets are not linearly separable.
• For the E. coli dataset, DT, Perceptrons, and Two-layer NN achieve similar accuracy.
• For the yeast dataset, Perceptrons and Two-layer NN achieve slightly better accuracy than DT.
• All three AI algorithms have much better accuracy than the simple majority algorithm.

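For reference, the simple majority algorithm used as the baseline above can be sketched as follows: always predict the most frequent training label and measure accuracy on the test labels. The function name and the label lists are hypothetical and are not the actual dataset splits.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_baseline(train_labels: Sequence[str],
                      test_labels: Sequence[str]) -> Tuple[str, float]:
    """Predict the most common training label for every test example."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_class for label in test_labels)
    return majority_class, hits / len(test_labels)


# Hypothetical labels, for illustration only.
train_labels = ["cp", "cp", "im", "cp", "imS"]
test_labels = ["cp", "im", "cp", "imS"]
print(majority_baseline(train_labels, test_labels))  # ('cp', 0.5)
```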
Future work
• Probabilistic model
• Bayesian network
• K-Nearest Neighbor
• ……

A protein localization sites prediction scheme
[Figure: features (mcg, gvh, alm, mit, erl, pox, vac, nuc) feed into the classifiers, which output the prediction.]
Guides experimental design and biological research, saving much labor and time!