An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments. Jessica Wehner and Madhavi Ganapathiraju Department of Biomedical Informatics University of Pittsburgh School of Medicine Pittsburgh PA USA Presented by Thahir P. Mohamed.
An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments
Jessica Wehner and Madhavi Ganapathiraju
Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
Presented by Thahir P. Mohamed
Advancing Practice, Instruction & Innovation through Informatics, October 19-23, 2008
Protein Structure
• Primary structure: chain of amino acids
• Secondary structure: sub-structures such as helices and strands
• Tertiary structure: atomic-resolution structure of the protein
Protein structure is essential for the successful design of drugs.
Challenges in Protein Structure Prediction
• X-ray crystallography and NMR spectroscopy are wet-lab methods to determine structure.
• Very expensive
• Very time consuming
• Computational techniques are applied to predict protein structure
Computational Protein Structure Prediction
• Machine learning techniques are applied to predict structure
• Experimentally determined structures are used to learn to predict new structures
• When there is not enough data to learn from:
• Active learning is applied to select the next protein to be studied experimentally
Active Learning
[Figure sequence across five slides. Panel labels: Unlabeled Proteins, Cluster, Clustered Proteins, Selection Algorithm, Labeled Proteins, Prediction. Legend: Possible Labels.]
Active learning guides the selection of data points for which you ask for labels.
Membrane Protein Structure Prediction
Membrane protein importance and challenges
Membrane proteins:
• 30% of genes
• Cell regulation and signaling pathways
• 60% of drug targets
Yet,
• Difficult to study experimentally
• 1% of known protein structures
Active learning can help compensate for the small number of known membrane protein (MP) structures by making use of the large number of known MP sequences.
'Features' Representation
Position: 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
Residue:  A L H W R A A G A A T V L L V I V E R G A P G A Q L I
Topology: - - - - - M M M M M M M M M M M M - - - - - - - - - -
Charge:   - - p - p - - - - - - - - - - - - n p - - - - - - - -
E-Prop:   D d . . A D D . D D a d d d d d d D A . D D . D a d d
Properties: Charge, Size, Polarity, Aromaticity, Electronic Properties
Data reduction is performed by singular value decomposition (SVD), resulting in a final 4 features per window.
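The windowing and SVD step might look roughly like the sketch below. It is only an illustration under stated assumptions: it encodes the charge property alone, the residue groupings in `POSITIVE` and `NEGATIVE` are inferred from the p/n marks on the slide, and the window length of 5 and all function names are hypothetical. The slides do not say how the properties were actually encoded or which window length was used.

```python
import numpy as np

# Assumed charge classes, inferred from the p/n/- marks on the slide.
POSITIVE = set("HKR")
NEGATIVE = set("DE")

def charge_code(residue):
    """Return a one-hot vector over (neutral, positive, negative)."""
    if residue in POSITIVE:
        return [0.0, 1.0, 0.0]
    if residue in NEGATIVE:
        return [0.0, 0.0, 1.0]
    return [1.0, 0.0, 0.0]

def window_features(sequence, window=5):
    """Concatenate the per-residue codes over each sliding window."""
    codes = np.array([charge_code(r) for r in sequence])
    rows = [codes[i:i + window].ravel()
            for i in range(len(sequence) - window + 1)]
    return np.array(rows)

def reduce_with_svd(features, n_components=4):
    """Project each window onto the top right singular vectors."""
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return features @ vt[:n_components].T

sequence = "ALHWRAAGAATVLLVIVERGAPGAQLI"  # residues shown on the slide
raw = window_features(sequence, window=5)        # one row per window
reduced = reduce_with_svd(raw, n_components=4)   # 4 features per window
```

Each row of `reduced` is one window described by 4 numbers, which is the form the clustering step consumes.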
Clustering the Data
• Neural network self-organizing map (SOM)
• Finds centroids of clusters in the data
[Figure: 3D plot with axes Dim 1, Dim 2, Dim 3]
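For readers unfamiliar with SOMs, here is a minimal NumPy sketch of the general technique. The grid size, learning-rate schedule, neighborhood width, and function names are all assumptions chosen for illustration; the slides do not give the configuration the authors used.

```python
import numpy as np

def train_som(data, grid_shape=(3, 3), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small self-organizing map and return the node weights (centroids)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid coordinates of each node, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize node weights from randomly chosen data points.
    weights = data[rng.choice(len(data), size=rows * cols, replace=True)].astype(float)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighborhood shrinks over time
            # Best matching unit: the node whose weights are closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbors toward x.
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights

def assign_nodes(data, weights):
    """Return, for each data point, the index of its nearest node."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Toy stand-in for the 4-feature window vectors.
rng = np.random.default_rng(1)
data = rng.normal(size=(60, 4))
weights = train_som(data)
nodes = assign_nodes(data, weights)
```

The rows of `weights` play the role of cluster centroids, and `nodes` records which node each window falls into.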
Design 1: Density-based Selection
• Find the densest cluster
• Choose the N points closest to its centroid
• Find labels for these points (TM or NTM)
• Find the majority label, say L
• Assign L to all points in the cluster
• Repeat for the next densest cluster
Clusters with no known structures are marked for study by experiments.
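The steps above can be sketched as follows. This is a hedged illustration, not the authors' code: "densest" is taken here to mean the cluster with the most members, and `oracle` is a hypothetical callback that returns "TM", "NTM", or `None` when no structure is known for a point.

```python
import numpy as np
from collections import Counter

def density_based_labeling(data, nodes, centroids, oracle, n_queries=10):
    """Label clusters densest-first by querying points near each centroid."""
    predicted = np.full(len(data), None, dtype=object)
    needs_experiment = []
    sizes = Counter(nodes.tolist())
    # Assumption: density is approximated by cluster size.
    for node, _ in sizes.most_common():
        members = np.flatnonzero(nodes == node)
        dists = np.linalg.norm(data[members] - centroids[node], axis=1)
        closest = members[np.argsort(dists)[:n_queries]]
        labels = [oracle(int(i)) for i in closest]
        known = [label for label in labels if label is not None]
        if not known:
            # No known structure near the centroid: mark for wet-lab study.
            needs_experiment.append(node)
            continue
        majority = Counter(known).most_common(1)[0][0]
        predicted[members] = majority
    return predicted, needs_experiment

# Toy example: cluster 0 has known labels, cluster 1 has none.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.3], [5.0, 5.0], [5.1, 5.0]])
nodes = np.array([0, 0, 0, 1, 1])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
true_labels = ["TM", "TM", "NTM", None, None]
predicted, needs_experiment = density_based_labeling(
    data, nodes, centroids, oracle=lambda i: true_labels[i], n_queries=2)
```

In the toy run, the two points nearest the first centroid are both "TM", so the whole cluster receives that label, while the second cluster has no known labels and is returned in `needs_experiment`.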
Design 1 Results
• Increase the number of data points for which we ask for structure.
• Compare how accuracy varies between guided selection (via active learning) and random selection.
A total of only 10 labels per node (~1% of the data).
Design 2: Protein-based Selection
• Pick a random protein
• Find labels for all windows in this protein
• For each node containing labels, find the mode L of all labels it contains
• Assign L to the remaining data in the node
• Repeat and update for a new protein, until half have been selected
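A minimal sketch of this procedure is given below, under stated assumptions: the function and argument names are hypothetical, `oracle` stands in for looking up a window's experimentally determined label, and the sketch computes only the final assignment after half the proteins are selected rather than tracking the per-protein updates.

```python
import random
from collections import Counter, defaultdict

def protein_based_labeling(window_protein, window_node, oracle, seed=0):
    """Label every window using the per-node mode of labels from selected proteins.

    window_protein[i] is the protein that window i comes from,
    window_node[i] is the cluster node of window i,
    oracle(i) returns the true label of window i.
    """
    rng = random.Random(seed)
    proteins = sorted(set(window_protein))
    rng.shuffle(proteins)
    # Select proteins in random order until half have been chosen.
    selected = set(proteins[: len(proteins) // 2])

    known = {}
    node_labels = defaultdict(list)
    for i, protein in enumerate(window_protein):
        if protein in selected:
            known[i] = oracle(i)
            node_labels[window_node[i]].append(known[i])

    # Mode of the known labels within each node.
    node_mode = {node: Counter(labels).most_common(1)[0][0]
                 for node, labels in node_labels.items()}
    # Queried windows keep their label; the rest inherit their node's mode,
    # or None if the node holds no labeled window.
    return [known.get(i, node_mode.get(node))
            for i, node in enumerate(window_node)]

# Toy example: four proteins, two windows each, two nodes.
window_protein = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
window_node = [0, 1, 0, 1, 0, 1, 0, 1]
true_labels = ["TM", "NTM", "TM", "NTM", "TM", "NTM", "TM", "NTM"]
predicted = protein_based_labeling(
    window_protein, window_node, oracle=lambda i: true_labels[i])
```

Varying `seed` changes the protein selection order, which is one way to repeat the procedure over different permutations.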
Protein-based Results
Repeated for different permutations of protein selection order, and observed several metrics.
[Figure: plot with axis label Percent]
Conclusions
• We developed a framework that allows us to select a few proteins or fragments of proteins which, when annotated with experimental methods, may be used to label the remaining protein sequences.
• We have shown that it is possible to achieve higher accuracy with guided selection of data than with random selection of data.
Acknowledgements
Madhavi Ganapathiraju, Jessica Wehner
JW funded through the NIH-NSF Bioengineering & Bioinformatics Summer Institute
Visit us at the Department of Biomedical Informatics, University of Pittsburgh: www.dbmi.pitt.edu/madhavi
Thank you!
[Photo: Cathedral of Learning, University of Pittsburgh]