
Identifying functional residues of proteins from sequence info


wes

Presentation Transcript


  1. Identifying functional residues of proteins from sequence info
     • Using MSA (multiple sequence alignment)
       - search for remote homologs using HMMs or profiles
     • Remote homologs with no known structure
     • Given a large, diverse superfamily
       - a protein may evolve a different function or subtype
       - different substrate specificity or activity
       - proteins with similar fold but different function
     • Past methods used phylogenetic trees
       - map the unknown protein to one of the branches of the tree produced
       - but it may have diverged too long ago to be clearly identified
       - co-evolution of multiple features
       - possible convergent evolution of molecular function at the amino acid (aa) level

  2. Other methodologies:
     • Analysis/prediction of subtype from sequence alignments
       - characterization of aa residues, looking for significant substitutions
       - gathering sequences into subgroups, comparing each subgroup
     • Principal component analysis (Casari et al., 1995)
       - looks for functional residues conserved in protein families
     • Evolutionary Trace (Lichtarge et al.)
     • Phylogenetic Inference (Sjolander et al.)

  3. Goal: identify regions conferring subfamily specificity
     • Secondary goal: predict subtypes of orphan sequences
     • Input to algorithm:
       - multiple sequence alignment (MSA) of sequences in a protein family
       - classification of the sequences in that MSA into subfamilies
     • For the given subtypes (or subfamilies) provided:
       - get the MSA subalignment for each subfamily
       - build an HMM profile for each subfamily MSA
     • Rationale: generate pseudocounts and account for statistical bias
     • For each subalignment profile
       - the profile value for amino acid x at position i in subfamily j is the probability of finding x at position i in subfamily j
       - over all amino acids at a given position, the profile values sum to 1
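
The profile-building step can be illustrated with a small sketch. It is not the HMM construction the slide refers to: it substitutes plain column frequencies with a flat pseudocount, which is enough to show how each subfamily gets a per-position distribution that sums to 1. The function names, the pseudocount value and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Return {amino acid: probability} for one alignment column.

    Gaps and unknown characters are ignored. The flat pseudocount is an
    assumption; it keeps every probability strictly positive.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def subfamily_profiles(msa, labels, pseudocount=1.0):
    """Build one profile (a list of per-column distributions) per subfamily.

    msa    : list of equal-length aligned sequences
    labels : subfamily label of each sequence, in the same order as msa
    """
    length = len(msa[0])
    profiles = {}
    for label in sorted(set(labels)):
        rows = [seq for seq, lab in zip(msa, labels) if lab == label]
        profiles[label] = [
            column_profile([row[i] for row in rows], pseudocount)
            for i in range(length)
        ]
    return profiles

# Toy alignment (hypothetical data): two subfamilies, three columns.
msa = ["ACD", "ACD", "AKD", "AKE"]
labels = ["s1", "s1", "s2", "s2"]
profiles = subfamily_profiles(msa, labels)
```

Here `profiles["s1"][1]` is the distribution at the second column of subfamily s1, where C is favored over K.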

  4. Relative Entropy
     • Measure of “distance” between two probability distributions
     • Relative entropy produces a value >= 0 (0 for two identical distributions)
       - computed for each position i in a subfamily s
     • For each position, one RE value for a subfamily s vs s-bar (all other subfamilies)
     • Cumulative Relative Entropy
       - given a set of relative entropies, one per subfamily, for each position
     • These are combined to produce one CRE for a given position i in the MSA across all subfamilies
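
The slide's formulas did not survive in the transcript, so the following is a sketch under stated assumptions. It uses the standard definition of relative entropy (Kullback-Leibler divergence), RE(p, q) = sum over x of p(x) log(p(x) / q(x)), and assumes that the cumulative value for a position is the plain sum of the per-subfamily values. The function names and the toy distributions are hypothetical.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence: sum over x of p[x] * log(p[x] / q[x]).

    p and q map the same keys to probabilities; q must be strictly
    positive wherever p is (pseudocounts in the profiles ensure this).
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def cumulative_relative_entropy(profiles, complements):
    """For each position, sum the RE of every subfamily s against s-bar.

    profiles[s][i]    : distribution at position i in subfamily s
    complements[s][i] : distribution at position i in all other subfamilies
    Summation over subfamilies is an assumption of this sketch.
    """
    n_positions = len(next(iter(profiles.values())))
    return [
        sum(relative_entropy(profiles[s][i], complements[s][i]) for s in profiles)
        for i in range(n_positions)
    ]

# Toy two-letter alphabet (hypothetical values): position 0 is identical
# in both subfamilies, position 1 differs strongly.
profiles = {
    "s1": [{"A": 0.5, "C": 0.5}, {"A": 0.9, "C": 0.1}],
    "s2": [{"A": 0.5, "C": 0.5}, {"A": 0.1, "C": 0.9}],
}
# With only two subfamilies, the complement of one is simply the other.
complements = {"s1": profiles["s2"], "s2": profiles["s1"]}
cre = cumulative_relative_entropy(profiles, complements)
```

Position 0 gets a value of zero because the two distributions are identical, while position 1 gets a clearly positive value.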

  5. Given this set of cumulative relative entropy measures
     • One for each position in the MSA: take the Z-score
       - standard statistical measure: the number of standard deviations above or below the mean
       - tells you which residue positions vary strongly in aa distribution between subfamilies
       - empirically, Z > 3 correlates with functional residues
     • For position i, which amino acid is dominant in a given subfamily?
       - find the probability of observing aa x at the position in subfamily s vs not-s
       - take the aa with probability >= 0.5
     • We now have a small set of aa residues which differ strongly between subfamilies of a protein family.
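
A minimal sketch of the Z-score step and the dominant-residue rule follows. The choice of the population standard deviation, the function names and the CRE values are assumptions for illustration; only the Z > 3 cutoff and the 0.5 probability threshold come from the slide.

```python
import statistics

def z_scores(values):
    """Standardize values: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population std dev: an assumption
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

def dominant_residue(distribution, threshold=0.5):
    """Return the amino acid with probability >= threshold, or None."""
    aa, prob = max(distribution.items(), key=lambda item: item[1])
    return aa if prob >= threshold else None

# Hypothetical CRE values for 20 alignment positions; the last stands out.
cre = [0.1] * 19 + [5.0]
z = z_scores(cre)

# Positions whose Z-score exceeds the empirical cutoff of 3.
candidates = [i for i, score in enumerate(z) if score > 3]
```

In this toy input only the last position passes the cutoff, and `dominant_residue` would then be applied to each subfamily's distribution at that position.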

  6. Subfamily data
     • What exactly constitutes a family or subfamily?
       - not always clear
       - automated tree generation could not separate data into clear subfamilies
       - use of PFAM alignments and SWISSPROT data
     • Subfamilies are not clearly defined in databases
       - divided proteins from the PFAM database into subfamilies based on SWISSPROT data
       - keyword search limited to the enzymatic activity string in SWISSPROT
       - put into groups, then checked for obvious mistakes
       - also eliminated divisions “easily discernable by sequence comparison”
       - 62 groupings from 42 alignments remained
       - randomly picked one grouping per alignment to produce 42 groups over 42 alignments

  7. Subfamilies
     • Four very large families to test their results on
       - nucleotidyl cyclases
       - eukaryotic protein kinases
       - lactate/malate dehydrogenases
       - trypsin-like serine proteases
     • Nucleotidyl cyclases
       - membrane-attached or cytosolic, cyclize (GTP -> cGMP) or (ATP -> cAMP)
       - found residues 1018 and 938, which correlate with previous results
       - also identified residues which have not been tested experimentally
     • Protein kinases
       - phosphorylate serine/threonine or tyrosine residues
       - compared to experimental results, some ser/thr vs tyr kinase differences were not detected
       - inconsistency (no conservation) within the subfamily
       - residues which were common to both ser/thr and tyr kinases

  8. Subfamilies (cont)
     • Lactate/Malate Dehydrogenases
       - common to a very wide variety of organisms, highly divergent
       - results mostly as expected, but a few residues identified outside of the active site
     • Serine Proteases
       - cut the protein backbone, with differing specificity as to where (what aa precedes the cut)
       - specificity pocket determines where the protease can bind
       - identified 2 out of 3 of the experimentally determined pocket residues
       - (the third had a low Z-score because of tolerance in one protein family)
       - also identified a few residues outside of the active site

  9. Prediction of Protein Subfamily
     • Sequence Similarity
       - straight % similarity with other sequences (ignoring gaps)
     • BLAST
       - database search, assign to the subfamily of the best-aligning sequence
     • HMM method
       - align a sequence of unknown subtype to the HMMs of all subfamilies and assign it to the best alignment
       - will attempt to do iterative optimization of match…
     • Profile method
       - take original HMM, and probability profile
     • Sub-profile method
       - only use residues in the profile formula that have a positive Z-score
       - to reduce noise, restrict to values that have above-average positive relative entropy
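
The profile formula itself is missing from the transcript, so the sub-profile idea can only be sketched under assumptions. The example below scores an aligned sequence against each subfamily with a log-odds sum (subfamily probability over the probability in all other subfamilies) and skips positions whose Z-score is not positive. The log-odds form, the function names and the toy numbers are all hypothetical; only the restriction to positive Z-scores comes from the slide.

```python
import math

def subprofile_score(sequence, profile, background, z, z_cutoff=0.0):
    """Log-odds score of an aligned sequence against one subfamily.

    Only positions whose Z-score exceeds z_cutoff contribute. The
    log-odds form is an assumption, not the formula from the slides.
    """
    score = 0.0
    for i, aa in enumerate(sequence):
        if z[i] <= z_cutoff or aa not in profile[i]:
            continue  # skip uninformative positions and gap characters
        score += math.log(profile[i][aa] / background[i][aa])
    return score

def predict_subfamily(sequence, profiles, backgrounds, z):
    """Assign the sequence to the subfamily with the highest score."""
    return max(
        profiles,
        key=lambda s: subprofile_score(sequence, profiles[s], backgrounds[s], z),
    )

# Toy two-letter alphabet and two positions (hypothetical values).
profiles = {
    "s1": [{"A": 0.5, "C": 0.5}, {"A": 0.9, "C": 0.1}],
    "s2": [{"A": 0.5, "C": 0.5}, {"A": 0.1, "C": 0.9}],
}
# With two subfamilies, each one's background is the other's profile.
backgrounds = {"s1": profiles["s2"], "s2": profiles["s1"]}
z = [-0.5, 1.2]  # only position 1 has a positive Z-score

prediction = predict_subfamily("AC", profiles, backgrounds, z)
```

Because position 0 has a negative Z-score, the decision rests entirely on position 1, where C points to s2.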

  10. Casari et al. (1995): A method to predict functional residues in proteins
     • Input: a multiple sequence alignment
       - each sequence is converted to a vector of size (20 * l), where l is the length of the alignment
     • Generation of an N x (20*l) matrix
       - one sequence produces a vector of dimension 20*l
       - N sequences produce N vectors of dimension 20*l
     • Use Principal Component Analysis
       - get the covariance matrix, which tells you how factors are correlated with one another
       - eliminate covariance by finding eigenvectors/eigenvalues of the covariance matrix
       - largest eigenvalues and corresponding eigenvectors give you the principal components
       - i.e. the largest factors determining the distribution of your dataset
       - they take the three largest (the largest of which represents the consensus sequence)
       - project their 20*l dimensional data onto those 3 dimensions
       - this can be used to predict a protein subfamily for a given protein
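
The encoding and projection steps can be sketched in plain Python. This is a simplified illustration, not the procedure of Casari et al.: it centers the data, extracts only the first principal component by power iteration, and projects onto that single axis, whereas the slide describes three components with the largest representing the consensus sequence. All function names and the toy alignment are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(sequence):
    """One-hot encode an aligned sequence into a 20*l vector (gaps -> zeros)."""
    vector = []
    for aa in sequence:
        vector.extend(1.0 if aa == letter else 0.0 for letter in AMINO_ACIDS)
    return vector

def center(matrix):
    """Subtract each column's mean from that column."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def covariance(centered):
    """Sample covariance matrix of the columns of a centered matrix."""
    n = len(centered)
    cols = list(zip(*centered))
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
        for ci in cols
    ]

def top_eigenvector(matrix, iterations=50):
    """Power iteration: approximates the eigenvector of the largest eigenvalue."""
    vec = [1.0 / (k + 1) for k in range(len(matrix))]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec

def project(centered, axis):
    """Coordinate of every sequence along the given axis."""
    return [sum(v * a for v, a in zip(row, axis)) for row in centered]

# Toy alignment (hypothetical data): two obvious subfamilies.
msa = ["ACD", "ACD", "AKE", "AKE"]
centered = center([encode(seq) for seq in msa])
pc1 = top_eigenvector(covariance(centered))
coords = project(centered, pc1)
```

In this toy case the first two sequences land on one side of the axis and the last two on the other, which is the kind of separation that would let a new sequence be assigned to a subfamily by its projected coordinates.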

  11. General Weirdness
     • Construction of a “comparison matrix”
       - multiply the matrix by its transpose
       - solve for eigenvectors and eigenvalues as before
     • Columns of f represent amino acid values and positions
       - becomes possible to examine individual amino acid residues and positions
       - plotted on a graph, shows residue correlation to type of protein subfamily
       - does this actually work?
