
Protein Localization Prediction: Decision Tree & Neural Networks

Predicting the cellular localization sites of proteins using a decision tree and neural networks. The localization information is embedded in the protein sequence but is challenging to extract. This presentation implements two AI algorithms, a decision tree and neural networks, to make accurate predictions and extract valuable insights.


Presentation Transcript


  1. Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks Yetian Chen 2008-12-12

  2. 2008 Nobel Prize in Chemistry: Osamu Shimomura, Martin Chalfie, and Roger Tsien, for the discovery and development of Green Fluorescent Protein (GFP). GFP can be used to track a protein in living cells.

  3. The cellular localization information of a protein is embedded in its sequence. For example, PKKKRKV is a nuclear localization signal, and VALLAL is a transmembrane segment. Challenge: predict the cellular localization site from the amino acid sequence of a protein.

  4. Extracting cellular localization information from the protein sequence
  • mcg: McGeoch's method for signal sequence recognition.
  • gvh: von Heijne's method for signal sequence recognition.
  • alm: Score of the ALOM membrane-spanning region prediction program.
  • mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.
  • erl: Presence of the "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute (see the sketch below).
  • pox: Peroxisomal targeting signal in the C-terminus.
  • vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.
  • nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.
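Of these attributes, erl is the only one computable by a direct substring test; the rest come from external scoring programs. A minimal sketch of the erl test, assuming the simple presence check described above (the function name and the example sequence are illustrative):

```python
def erl_attribute(sequence: str) -> int:
    """Binary attribute: 1 if the ER-retention signal 'HDEL' occurs in the sequence, else 0."""
    return 1 if "HDEL" in sequence else 0

# Hypothetical sequence containing the HDEL retention signal
print(erl_attribute("MKTAYIAKQRHDEL"))  # -> 1
```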

  5. Problem Statement & Datasets

  Protein Name   mcg   gvh   lip   chg   aac   alm1  alm2  Location
  EMRB_ECOLI     0.71  0.52  0.48  0.50  0.64  1.00  0.99  cp
  ATKC_ECOLI     0.85  0.53  0.48  0.50  0.53  0.52  0.35  imS
  NFRB_ECOLI     0.63  0.49  0.48  0.50  0.54  0.76  0.79  im

  Dataset 1: 336 proteins from E. coli (Prokaryote Kingdom), http://archive.ics.uci.edu/ml/datasets/Ecoli
  Dataset 2: 1484 proteins from yeast (Eukaryote Kingdom), http://archive.ics.uci.edu/ml/datasets/Yeast
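The UCI files are plain whitespace-separated text: a protein name, the numeric attributes, and the class label on each line. A minimal loading sketch, assuming that format and the local file name ecoli.data:

```python
def load_ecoli(path="ecoli.data"):
    """Parse one protein per line: name, numeric attributes, localization label."""
    examples = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            name, *attrs, label = fields
            examples.append((name, [float(a) for a in attrs], label))
    return examples

data = load_ecoli()
print(len(data))  # expect 336 for the E. coli dataset
```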

  6. Implementation of AI algorithms
  • Decision Tree > C5
  • Neural Network
    > Single-layer feed-forward NN: Perceptrons
    > Multilayer feed-forward NN: one hidden layer

  7. Implementation of Decision Tree: C5
  • Preprocessing of the dataset (see the sketch below)
    > If the attribute is linear and continuous, divide its range into 5 equal-width bins: tiny, small, medium, large, huge. Then discretize the data points into these bins.
    > If a feature value is missing (?), replace ? with tiny.
  • Generating the training set and test set
    > Randomly split the dataset so that 70% of the examples form the training set and 30% the test set.
  • Learning the decision tree
    > Using the decision tree learning algorithm in Chapter 18.3 of the textbook.
  • Testing
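A minimal sketch of the preprocessing and split described above (the bin labels and the missing-value rule follow the slide; everything else, including the function names, is an assumption):

```python
import random

BINS = ["tiny", "small", "medium", "large", "huge"]

def discretize(values):
    """Map continuous values into 5 equal-width bins; a missing value '?' becomes 'tiny'."""
    known = [v for v in values if v != "?"]
    lo, hi = min(known), max(known)
    width = (hi - lo) / 5 or 1.0  # guard against a constant attribute
    return ["tiny" if v == "?" else BINS[min(4, int((v - lo) / width))]
            for v in values]

def train_test_split(examples, train_frac=0.70):
    """Random 70/30 split into training and test sets."""
    shuffled = examples[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```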

  8. Implementation of Neural Networks
  • Structure of Perceptrons and two-layer NN
  [Diagram: the attribute values (Att 1, Att 2, Att 3, Att 4, …) feed the input nodes; there is one output node per localization class, and the desired output is 1 at the true class and 0 elsewhere (e.g. cp = 1, imS = 0, im = 0 for a cp protein). The same input/output structure is shown for the Perceptrons and the two-layer NN, illustrated with the three example proteins from slide 5. A one-hot target encoding sketch follows.]
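The desired-output columns in the diagram amount to a one-hot encoding of the class label. A minimal sketch (the class list is truncated to the three labels shown; extend it with the remaining localization sites):

```python
CLASSES = ["cp", "imS", "im"]  # extend with the rest of the localization classes

def one_hot(label):
    """Desired output vector: 1 at the true class's output node, 0 elsewhere."""
    return [1 if c == label else 0 for c in CLASSES]

print(one_hot("cp"))  # -> [1, 0, 0]
```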

  9. Implementation of Perceptrons & Two-layer NN: Algorithms

  function PERCEPTRONS-LEARNING(examples, network)
      initialize the weight matrix w[i][j] with random numbers in [-0.5, 0.5]
      correct = 0
      while correct < threshold do            // threshold = 0.0, 0.1, 0.2, …, 1.0
          for each e in examples do
              calculate the output o_j = g(in_j) for each output node j   // g() is the sigmoid function
              prediction = the class r whose output node is largest
              if prediction != y(e) then
                  for each output node j do
                      for i = 1, …, m do
                          w[i][j] = w[i][j] + α × (y_j(e) − o_j) × g′(in_j) × x_i(e)
                      endfor
                  endfor
              endif
          endfor
          correct = fraction of examples classified correctly
      endwhile
      return w[i][j]

  Two-layer NN(examples, network): uses the BACK-PROP-LEARNING algorithm in Chapter 20.5 of the textbook.
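A runnable Python sketch of the perceptron training loop above (the learning rate, the epoch cap, and the data shapes are assumptions; the update is the sigmoid delta rule from the pseudocode):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptrons(examples, n_inputs, n_outputs,
                      threshold=0.9, alpha=0.1, max_epochs=1000):
    """examples: list of (x, y) pairs, x a list of n_inputs floats,
    y a one-hot list of n_outputs ints (as produced by one_hot above)."""
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_outputs)]
         for _ in range(n_inputs)]
    for _ in range(max_epochs):
        n_correct = 0
        for x, y in examples:
            ins = [sum(w[i][j] * x[i] for i in range(n_inputs))
                   for j in range(n_outputs)]
            out = [sigmoid(v) for v in ins]
            pred = max(range(n_outputs), key=lambda j: out[j])
            if y[pred] == 1:
                n_correct += 1
            else:
                # delta rule: w += alpha * (y_j - o_j) * g'(in_j) * x_i
                for j in range(n_outputs):
                    grad = out[j] * (1.0 - out[j])  # sigmoid derivative g'(in_j)
                    for i in range(n_inputs):
                        w[i][j] += alpha * (y[j] - out[j]) * grad * x[i]
        if n_correct / len(examples) >= threshold:
            break
    return w
```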

  10. Results
  • The statistics for the decision tree are averaged over 100 runs.
  • The statistics for the Perceptrons and the two-layer NN are averaged over 50 runs.
  • Threshold is the termination condition for training the neural networks.

  11. Conclusions
  • The two datasets are not linearly separable.
  • For the E. coli dataset, the decision tree, Perceptrons, and two-layer NN achieve similar accuracy.
  • For the yeast dataset, the Perceptrons and two-layer NN achieve slightly better accuracy than the decision tree.
  • All three AI algorithms achieve much better accuracy than the simple majority algorithm (see the baseline sketch below).
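For reference, the simple majority algorithm mentioned above always predicts the most frequent class in the training set. A minimal sketch (the function name is an assumption):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the training set's most frequent class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)
```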

  12. Future work
  • Probabilistic model
  • Bayesian network
  • K-Nearest Neighbor
  • …

  13. A protein localization site prediction scheme: the features (mcg, gvh, alm, mit, erl, pox, vac, nuc) feed the classifiers, which output a prediction. Such a scheme can guide experimental design and biological research, saving much labor and time!

  14. Thank you!
