
Theoretical methods for predicting gene function II. predicting protein domains

Theoretical methods for predicting gene function II. predicting protein domains and their function from sequence analysis. S. Wodak, ULB Inter-university DEA/DES in Bioinformatics.

Presentation Transcript


  1. Theoretical methods for predicting gene function II: predicting protein domains and their function from sequence analysis. S. Wodak, ULB, Inter-university DEA/DES in Bioinformatics

  2. The main steps
     - [3.1] Predict domains
     - [3.2] Predict function of individual domains
     - Diagram labels: Family G, Funct(s) Y; Family A, Funct(s) X; Family M, Funct(s) Z; Family F, Funct(s) W

  3. Domain analysis
     Proteins tend to be modular, i.e. built from domains. A first step in functional prediction/annotation can be a scan for known domains in a newly sequenced protein.
     Scan databases of ‘fingerprints’ of classified domains:
     - PROSITE (Bairoch et al., 1997): consensus sequence strings for more than 1000 domains
     - PROFILESCAN
     - BLOCKS (Henikoff et al., 1998): ungapped alignments and pattern matching
     - PRINTS (Attwood et al., 1998): a set of multiple sequence motifs separated along the sequence
     - PFAM (Bateman et al.): HMMs from multiple alignments
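To make the idea of scanning a sequence against a consensus string concrete, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and reports where it matches. The function names `prosite_to_regex` and `scan`, the toy pattern and the toy sequence are all illustrative assumptions; the pattern is not an actual PROSITE entry, and only the most common syntax elements are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles: x (any residue), [ABC] (allowed), {ABC} (forbidden),
    (n) / (n,m) repeat counts, and the < / > terminus anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(sequence: str, pattern: str):
    """Return (1-based start, matched substring) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

# Toy pattern and toy sequence, for illustration only.
hits = scan("MACAACGHLLCPPCPK", "C-x(2)-C-{P}-[HK].")
print(hits)
```

Profile- and HMM-based resources such as PFAM score matches probabilistically rather than with an exact pattern, so this sketch covers only the consensus-string case.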

  4. Example: the alcohol dehydrogenase domain (demo), PDB code 8ADH
     - CATH: http://www.biochem.ucl.ac.uk/bsm/cath_new/domains/8adh02.html
     - PDBsum
     - Swiss-Prot
     - PROSITE pattern associated with zinc binding/active site
     - PFAM, PROSITE, etc.

  5. Zinc binding constellation in carbonic anhydrase

  6. Predicting function of individual domains based on sequence similarity
     1- Intrinsic feature analysis
        - compositional biases
        - transmembrane regions (stretches of hydrophobic residues)
        - coiled-coil segments (heptad repeats of polar/hydrophobic residues)
        - Pro-rich, Glu-rich regions
        If not eliminated first, these can lead to spurious hits, and thus to erroneous inference of function.
     2- Sequence alignments
        - Pairwise alignments (Blast, Fasta): >40% sequence identity
        - Multiple alignments: <40% sequence identity; more sensitive
          - Psi-Blast
          - SAM-98 (HMM)/PFAM
        Erroneous inference of function can still be made, because sequence similarity does not guarantee structural similarity.
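The intrinsic-feature step above can be sketched as a sliding-window filter that flags segments dominated by a chosen residue set and masks them before a database search. This is a simplified illustration, not the method used in the presentation: the names `biased_segments` and `mask`, the window size, the threshold and the residue sets are all assumptions chosen for the example.

```python
def biased_segments(seq, residues, window=20, min_fraction=0.8):
    """Return merged 1-based (start, end) segments covered by windows in which
    the fraction of residues from `residues` is at least `min_fraction`."""
    residues = set(residues)
    segments = []
    for i in range(len(seq) - window + 1):
        count = sum(1 for aa in seq[i:i + window] if aa in residues)
        if count / window >= min_fraction:
            start, end = i + 1, i + window
            if segments and start <= segments[-1][1] + 1:
                # Overlapping or adjacent window: extend the previous segment.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments

def mask(seq, segments, char="X"):
    """Replace every residue inside the given 1-based segments with `char`."""
    out = list(seq)
    for start, end in segments:
        out[start - 1:end] = char * (end - start + 1)
    return "".join(out)

# Toy example: a Pro-rich stretch (residue set and thresholds are assumptions).
toy = "AAPPPPPAPPAA"
pro_rich = biased_segments(toy, "P", window=5, min_fraction=0.75)
print(pro_rich, mask(toy, pro_rich))
```

The same function can be pointed at a hydrophobic residue set (for example `"AILMFVWC"`, again an assumption) to flag candidate transmembrane stretches, or at `"E"` for Glu-rich regions.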

  7. Predicting function based on sequence alignments
     Sequence identity >40%:
     - Pairwise alignments (Blast, Fasta) can be used to ‘safely’ infer function for orthologs: close homologs, genes that evolved as a result of speciation (not duplication) and are likely to perform the same function in different species.
       -> Comparisons of the sequence tree and the species tree can help identify orthologs.
     - Inferring function for non-orthologous homologs is much more error prone.
     Sequence identity <40%:
     - 7/10 genes will have a homolog in the sequence DBs, and some fraction of those will have a known 3D structure.
       -> But the structural and functional features of the homolog cannot be transferred without additional analysis.
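As a small illustration of the 40% threshold, the sketch below computes percent identity from two already-aligned sequences and picks the class of method the slide associates with each regime. The function names are hypothetical, and the choice of denominator (columns where neither sequence has a gap) is an assumption; other conventions exist and give different values.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)

def suggested_method(identity: float, threshold: float = 40.0) -> str:
    """Map a percent identity onto the two regimes described in the slide."""
    if identity > threshold:
        return "pairwise alignment (Blast, Fasta)"
    return "multiple alignment / profile method (Psi-Blast, HMM)"

# Toy aligned pair, for illustration only.
pid = percent_identity("ACDEFG-HIK", "ACDQFGLH-K")
print(pid, suggested_method(pid))
```

A high identity alone does not establish orthology; as the slide notes, comparing the sequence tree with the species tree is the additional step that helps there.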

  8. Detection of remote homologs
     - Multiple alignments:
       - Psi-Blast: Position-Specific Iterated Blast
       - HMM: Hidden Markov Models
     - Other:
       - ISS: intermediate sequence search (diagram: sequences A, B and C)
     Sequence comparisons using multiple sequence alignments detect 3x as many homologs as pairwise alignments. Park et al. (1998) J. Mol. Biol. 284, 1201-1210
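The idea behind an intermediate sequence search can be sketched as follows: if A matches B and B matches C, then C becomes a candidate remote homolog of A even when A and C show no significant direct match. The sketch assumes the significant pairwise hits have already been computed and stored in a dictionary; the function name and data layout are hypothetical, and any further checks applied by the published procedure are not reproduced here.

```python
def intermediate_sequence_hits(query, direct_hits):
    """Find candidate homologs of `query` reachable through one intermediate.

    `direct_hits` maps a sequence id to the set of ids it matches
    significantly in a pairwise search. Returns {candidate: {intermediates}}.
    """
    first_shell = direct_hits.get(query, set())
    candidates = {}
    for intermediate in first_shell:
        for target in direct_hits.get(intermediate, set()):
            # Skip the query itself and anything already found directly.
            if target != query and target not in first_shell:
                candidates.setdefault(target, set()).add(intermediate)
    return candidates

# Toy hit table matching the A, B, C labels of the slide.
toy_hits = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(intermediate_sequence_hits("A", toy_hits))
```

Chaining through intermediates also propagates any false positive in the first step, so candidates found this way would normally need stricter confirmation than direct hits.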

  9. Sequence comparisons using multiple sequence alignments detect 3x as many homologs as pairwise alignments. Park et al. (1998) J. Mol. Biol. 284, 1201-1210
     - PDBD40-J: database of 935 sequences with ≤40% sequence identity and known evolutionary relationships from SCOP
     - NRDB90: database of 152,228 non-redundant sequences (<90% sequence identity) from other sequence DBs
     % homologs recognised (at error rate 1/100,000; at error rate 1/1000):
     - Gap-Blast: 14; 19
     - Fasta: 16; 23
     - Psi-Blast: 27; 44
     - SAM-98: 29; 50
     - ISS: 24; 34
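A figure such as "% homologs recognised at a given error rate" can be computed from a ranked list of hits whose true relationships are known, as they are for a SCOP-derived benchmark. The sketch below is a generic coverage-versus-errors calculation under stated assumptions: the function name, the input layout and the use of an absolute false-positive budget (rather than the paper's exact normalisation of the error rate) are all illustrative.

```python
def coverage_at_errors(hits, total_true_pairs, max_false_positives):
    """Percent of true homolog pairs recovered before the false-positive
    budget is exceeded.

    `hits` is an iterable of (score, is_true_homolog) pairs, where a higher
    score means a more significant match.
    """
    true_found = 0
    false_found = 0
    for _score, is_true in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_true:
            true_found += 1
        else:
            if false_found + 1 > max_false_positives:
                break                      # budget exhausted: stop accepting hits
            false_found += 1
    return 100.0 * true_found / total_true_pairs

# Toy ranked hits; 5 true pairs exist in total, one of which is never reported.
toy_ranked = [(9.0, True), (8.0, True), (7.0, False),
              (6.0, True), (5.0, False), (4.0, True)]
print(coverage_at_errors(toy_ranked, 5, 0), coverage_at_errors(toy_ranked, 5, 1))
```

Allowing more errors moves the cut-off further down the ranked list, which is why each method recognises a larger percentage of homologs at the looser error rate.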

  10. Structural proteomics: extending structure information to sequences
      Diagram labels: library of known folds; new sequence; assign known fold from library; function; build detailed atomic model

  11. Detection of remote homologs across genomes Pfam... Slide incomplete
