Functional Annotation of Proteins via the CAFA Challenge
Lee Tien, Duncan Renfrow-Symon, Shilpa Nadimpalli, Mengfei Cao
COMP150PBT | Fall 2010
What’s the problem?
• Huge bottleneck = finding a protein’s function when given a protein sequence
• Incomplete, inaccurate, or inconsistent annotations are difficult to work with and can propagate
• No good way to measure the accuracy of an annotation predictor
What are Gene Ontology (GO) terms?
• GO = controlled vocabulary of “gene ontologies”
• Cover three domains:
  • Cellular component
  • Molecular function
  • Biological process
• Hierarchy:
  • Broad/general (e.g. “catalytic activity”)
  • Specific (e.g. “leukotriene-C4-synthase activity”)
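The broad-to-specific hierarchy means that a protein annotated with a specific term is implicitly annotated with every broader term above it. The sketch below illustrates that idea on a toy hierarchy. Only "catalytic activity", "leukotriene-C4-synthase activity", and the "molecular function" domain come from the slide; the intermediate term, the `PARENTS` table, and the function names are placeholders invented for illustration and are not taken from GO itself.

```python
# Toy child -> parents table. The intermediate term is a hypothetical
# placeholder; the real GO graph has its own (different) intermediate terms.
PARENTS = {
    "leukotriene-C4-synthase activity": ["intermediate term (hypothetical)"],
    "intermediate term (hypothetical)": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand a set of specific terms with all of their broader ancestors."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded


if __name__ == "__main__":
    for term in sorted(propagate({"leukotriene-C4-synthase activity"})):
        print(term)
```

A term may have more than one parent, which is why the table maps each term to a list and the traversal tracks visited terms.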
Outline of Our Approach
[Pipeline diagram. Input: CAFA targets (FASTA sequences). Methods: BLAST, Pfam, SMURF?, Betawrap Pro?, other secondary structure predictor? Output: GO ids for each CAFA target]
Pfam: Protein Family Database
• Collection of protein families represented by:
  • Multiple sequence alignments
  • Hidden Markov Models
• Two sections of Pfam:
  • A: high-quality, manually curated
  • B: large, automatically generated
[Figures: sample multiple sequence alignment; sample Hidden Markov Model]
BLAST: Basic Local Alignment Search Tool
• Goal: find homologous (i.e. derived from a common ancestor) sequences in a database
• Various BLAST programs:
  • blastp = query: protein, database: protein
  • blastn = query: nucleotide, database: nucleotide
  • blastx = query: translated nucleotide, database: protein
  • tblastn = query: protein, database: translated nucleotide
  • tblastx = query: translated nucleotide, database: translated nucleotide
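The program list above is effectively a lookup table keyed on the query type and the database type. A minimal sketch of that lookup follows; the dictionary contents are copied from the slide, while the names `BLAST_PROGRAMS` and `choose_blast_program` are made up for illustration and are not part of any BLAST distribution.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_blast_program(query, database):
    """Return the BLAST program name for a query/database type pair."""
    try:
        return BLAST_PROGRAMS[(query, database)]
    except KeyError:
        raise ValueError(
            f"no BLAST program listed for query={query!r}, database={database!r}"
        ) from None


if __name__ == "__main__":
    # Protein targets searched against a protein database.
    print(choose_blast_program("protein", "protein"))
```

For protein targets searched against protein structures, as in the PDB search reported later in the deck, the lookup gives blastp; the slides do not state which program was actually run.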
SMURF: Structural Motifs Using Random Fields
• Determines whether a protein sequence contains one of the following supersecondary structures:
  • 6-bladed propeller
  • 7-bladed propeller
  • 8-bladed propeller
  • Double blades (e.g. 6-6, 6-7, 6-8…)
• Developed at Tufts!
• Some propeller functions:
  • Often WD40 repeats: protein-protein interaction
  • Signaling, transcription, cell cycle
[Figures: Smurf!; 7-bladed propeller]
Final Database Structure
[Database diagram with sections: INPUT, MAPPING, OUTPUT, RESULTS]
Final Results Statistics
Of 8,904 unknown sequences…
• 4,265 had at least one hit in PDB BLAST
• 4,824 had at least one hit in Pfam
• 104 had at least one hit in SMURF
In total, 5,694 unique sequences had at least one hit, a 63.9% success rate
[Venn diagram: distribution of sequence hits by method. Region counts: 789, 3,445, 12, 19, 4, 1,356, 69]
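The seven loose numbers on this slide appear to be the region counts of a three-set diagram (BLAST, Pfam, SMURF). Under that assumption, the per-method totals constrain which count belongs to which region. The sketch below brute-forces every assignment and keeps those consistent with the three totals; it also rechecks the overall total and the success rate. All names (`COUNTS`, `TOTALS`, `REGIONS`, `consistent_assignments`) are illustrative.

```python
from itertools import permutations

# Numbers copied from the slide.
COUNTS = [789, 3445, 12, 19, 4, 1356, 69]
TOTALS = {"BLAST": 4265, "Pfam": 4824, "SMURF": 104}
N_SEQUENCES = 8904

# The seven regions of a three-set diagram, named by the methods that hit.
REGIONS = [
    ("BLAST",),
    ("Pfam",),
    ("SMURF",),
    ("BLAST", "Pfam"),
    ("BLAST", "SMURF"),
    ("Pfam", "SMURF"),
    ("BLAST", "Pfam", "SMURF"),
]


def consistent_assignments(counts=COUNTS, totals=TOTALS, regions=REGIONS):
    """Return every count-to-region assignment that matches the method totals."""
    solutions = []
    for perm in permutations(counts):
        assignment = dict(zip(regions, perm))
        if all(
            sum(n for region, n in assignment.items() if method in region) == total
            for method, total in totals.items()
        ):
            solutions.append(assignment)
    return solutions


solutions = consistent_assignments()
any_hit = sum(COUNTS)
success_rate = 100 * any_hit / N_SEQUENCES

if __name__ == "__main__":
    for region, n in solutions[0].items():
        print(" + ".join(region), n)
    print(any_hit, round(success_rate, 1))
```

Exactly one assignment is consistent with the totals: BLAST only 789, Pfam only 1,356, SMURF only 69, BLAST and Pfam 3,445, BLAST and SMURF 12, Pfam and SMURF 4, all three 19. The counts sum to 5,694, which is 63.9% of 8,904, matching the slide.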
Example Result
T38114
MDLDMNGGNKRVFQRLGGGSNRPTTDSNQKVCFHWRAGRCNRYPCPYLHRELPGPGSGPVAASSNKRVADESGFAGPSHR
RGPGFSGTANNWGRFGGNRTVTKTEKLCKFWVDGNCPYGDKCRYLHCWSKGDSFSLLTQLDGHQKVVTGIALPSGSDKLY
TASKDETVRIWDCASGQCTGVLNLGGEVGCIISEGPWLLVGMPNLVKAWNIQNNADLSLNGPVGQVYSLVVGTDLLFAGT
QDGSILVWRYNSTTSCFDPAASLLGHTLAVVSLYVGANRLYSGAMDNSIKVWSLDNLQCIQTLTEHTSVVMSLICWDQFL
LSCSLDNTVKIWAATEGGNLEVTYTHKEEYGVLALCGVHDAEAKPVLLCSCNDNSLHLYDLPSFTERGKILAKQEIRSIQ
IGPGGIFFTGDGSGQVKVWKWSTESTPILS
• BLAST: matches with PDB structures 2OVP, 3MKS, 2CNX, 1P22, 1NEX, 3N0E
  • Transcription, mitosis, methylation, protein binding
• Pfam: match to family PF00642
  • Zinc ion binding, nucleic acid binding
• SMURF: match to 7-bladed β-propeller template
  • WD domain (protein binding)
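One way to read the T38114 result is to pool the functions suggested by each method and note which ones are supported by more than one source. The sketch below does that for the functions listed on the slide. The merging and ranking rule (count of supporting methods) is an assumption chosen for illustration; the slides do not say how the three methods' outputs were combined.

```python
# Method -> suggested functions, copied from the T38114 slide.
EVIDENCE = {
    "BLAST": ["transcription", "mitosis", "methylation", "protein binding"],
    "Pfam": ["zinc ion binding", "nucleic acid binding"],
    "SMURF": ["protein binding"],
}


def merge_evidence(evidence):
    """Map each function to the sorted list of methods that suggest it."""
    support = {}
    for method, functions in evidence.items():
        for function in functions:
            support.setdefault(function, set()).add(method)
    return {function: sorted(methods) for function, methods in support.items()}


def rank_by_support(evidence):
    """Order functions by number of supporting methods, then alphabetically.

    Ranking by method count is a hypothetical rule, not the authors' method.
    """
    merged = merge_evidence(evidence)
    return sorted(merged.items(), key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    for function, methods in rank_by_support(EVIDENCE):
        print(f"{function}: {', '.join(methods)}")
```

For this target, protein binding is the only function suggested by two methods (BLAST and SMURF); the remaining five each come from a single method.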
Possible Future Directions
• Improving functional annotation for β-propellers identified by SMURF
  • Analyze a training set of propeller proteins with known function to build a probabilistic model of protein function based on propeller type
• Addition of other structural prediction tools for motifs with known function
  • G protein-coupled receptors, membrane-bound proteins
• Expansion of BLAST search to include the full nr database