Identifying functional residues of proteins from sequence info

Identifying functional residues of proteins from sequence info • Using MSA (multiple sequence alignment) • search for remote homologs using HMMs or profiles • Remote homologs with no known structure • Given a large, diverse superfamily • protein may evolve different function or subtype • different substrate specificity or activity • proteins with similar fold but different function • Past methods used phylogenetic trees • map unknown protein to one of the branches of the tree produced • but the protein may have diverged too long ago to be clearly placed • co-evolution of multiple features • possible convergent evolution of molecular function at the amino acid (aa) level

Other methodologies: • Analysis/prediction of subtype from sequence alignments • characterization of aa residues, looking for significant substitutions • gathering sequences into subgroups, comparing each subgroup • Principal component analysis (Casari et al., 1995) • looks for functional residues conserved in protein families • Evolutionary Trace (Lichtarge et al.) • Phylogenetic Inference (Sjolander et al.)

Goal: identify regions conferring subfamily specificity • Secondary goal: predict subtypes of orphan sequences • Input to algorithm: • multiple sequence alignment (MSA) of sequences in a protein family • classification of the sequences in that MSA into subfamilies • For the given subtypes (or subfamilies) provided: • get the MSA subalignment for each subfamily • build an HMM profile for each subfamily MSA • Rationale: generate pseudocounts and account for statistical bias • For each subalignment profile • the profile values for subfamily j at position i, taken over all amino acids x, sum to 1 (each value is the probability of finding amino acid x at position i in subfamily j)
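The per-subfamily profile step can be sketched in a few lines of Python. This is a minimal illustration only: it swaps the HMM profile named on the slide for plain column frequencies with an add-one pseudocount, and the names `column_profile`, `subfamily_profiles` and the `pseudocount` parameter are assumptions made for this sketch.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Return {aa: probability} for one alignment column.

    Gaps and unknown symbols are ignored. A flat pseudocount (an assumption,
    standing in for HMM-derived pseudocounts) keeps every probability > 0.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def subfamily_profiles(alignment, labels, pseudocount=1.0):
    """Build one profile per subfamily.

    alignment: list of equal-length aligned sequences (strings)
    labels:    subfamily label for each sequence, in the same order
    Returns {subfamily: [column profile for each alignment position]}.
    """
    length = len(alignment[0])
    profiles = {}
    for sub in sorted(set(labels)):
        rows = [seq for seq, lab in zip(alignment, labels) if lab == sub]
        profiles[sub] = [
            column_profile([row[i] for row in rows], pseudocount)
            for i in range(length)
        ]
    return profiles

# Toy alignment with two subfamilies (hypothetical data).
profiles = subfamily_profiles(["ACD", "ACE", "GCD", "GCE"], ["a", "a", "b", "b"])
```

Each column profile sums to 1 over the 20 amino acids, which is the property stated on the slide.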

Relative Entropy • measure of “distance” between two probability distributions • always >= 0 (exactly 0 for two identical distributions) • computed for each position i in a subfamily s • gives, for each position, an RE value for subfamily s vs s-bar (all other subfamilies) • Cumulative Relative Entropy (CRE) • given the set of relative entropies for each subfamily at each position • accumulate them across all subfamilies to produce one CRE for each position i in the MSA
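The slide's formulas did not survive, so the following is a sketch under stated assumptions: relative entropy is taken in its standard form, sum over x of p(x) * log(p(x) / q(x)), and the cumulative value at a position is assumed to be the plain sum of the per-subfamily relative entropies. The function names and the input layout are hypothetical.

```python
import math

def relative_entropy(p, q):
    """Standard relative entropy: sum_x p[x] * log(p[x] / q[x]).

    p and q are {aa: probability} dicts. Assumes q[x] > 0 wherever
    p[x] > 0 (pseudocounts in the profiles guarantee this).
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def cumulative_relative_entropy(profiles_s, profiles_not_s):
    """One CRE value per alignment position.

    profiles_s[s][i]     : distribution at position i in subfamily s
    profiles_not_s[s][i] : distribution at position i in all other
                           subfamilies pooled (s-bar)
    Assumption: CRE at position i is the sum of RE(s vs s-bar) over s.
    """
    n_positions = len(next(iter(profiles_s.values())))
    return [
        sum(relative_entropy(profiles_s[s][i], profiles_not_s[s][i])
            for s in profiles_s)
        for i in range(n_positions)
    ]

# Toy distributions over a two-letter alphabet (hypothetical data).
p = {"A": 0.9, "G": 0.1}
q = {"A": 0.1, "G": 0.9}
cre = cumulative_relative_entropy(
    {"s1": [p, p], "s2": [q, p]},
    {"s1": [q, p], "s2": [p, p]},
)
```

In the toy data the first position separates the two subfamilies and gets a positive CRE, while the second position has identical distributions everywhere and gets zero.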

Given this set of cumulative relative entropy measures • one for each position in the MSA: take the Z-score • standard statistical measure: the number of standard deviations above/below the mean • tells you which residue positions vary strongly in aa distribution between subfamilies • empirically, Z > 3 correlates with functional residues • For position i, which amino acid is dominant in a given subfamily? • find probability of observing aa x at position i in subfamily s vs not-s • take the aa with probability >= 0.5 • We now have a small set of aa residues which differ strongly between subfamilies of a protein family.
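The Z-score filter and the dominant-residue rule can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption; the slide does not say which estimator is used.

```python
import statistics

def z_scores(values):
    """Z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population std dev (an assumption)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def specificity_positions(cre, threshold=3.0):
    """Indices of positions whose CRE Z-score exceeds the threshold."""
    return [i for i, z in enumerate(z_scores(cre)) if z > threshold]

def dominant_residue(profile, min_prob=0.5):
    """Amino acid with probability >= min_prob in a column profile, else None."""
    aa = max(profile, key=profile.get)
    return aa if profile[aa] >= min_prob else None

# Hypothetical CRE values: 19 unremarkable positions and one outlier.
cre_values = [0.1] * 19 + [5.0]
hits = specificity_positions(cre_values)
```

With a population standard deviation the largest attainable Z-score among n positions is sqrt(n - 1), so the Z > 3 cutoff can only fire on alignments with more than ten positions.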

Subfamily data • What exactly constitutes a family or subfamily? • not always clear • automated tree generation could not separate data into clear subfamilies • use of PFAM alignments and SWISSPROT data • Subfamilies are not clearly defined in databases • divided proteins from PFAM database into subfamilies based on SWISSPROT data • keyword search limited to enzymatic activity string in SWISSPROT • put into groups, then checked for obvious mistakes • also eliminated divisions “easily discernable by sequence comparison” • 62 groupings from 42 alignments remained • randomly picked one grouping per alignment, giving 42 groupings over 42 alignments

Subfamilies • Four very large families used to test the results • nucleotidyl cyclases • eukaryotic protein kinases • lactate/malate dehydrogenases • trypsin-like serine proteases • Nucleotidyl cyclases • membrane-attached or cytosolic, cyclize (GTP -> cGMP) or (ATP -> cAMP) • found residues 1018 and 938, which agree with previous results • also identified residues which have not been tested experimentally • Protein kinases • phosphorylate serine/threonine or tyrosine residues • compared to experimental results: some ser/thr vs tyr kinase differences not detected • inconsistency (no conservation) within the subfamily • residues which were common to both ser/thr and tyr kinases

Subfamilies (cont) • Lactate/Malate Dehydrogenases • common to a very wide variety of organisms; highly divergent • results mostly as expected, but a few residues identified outside of the active site • Serine Proteases • cut protein backbone, with differing specificity as to where (what aa precedes the cut) • specificity pocket determines where protease can bind • identified 2 out of 3 experimentally determined pocket residues • (third had a low Z-score because of tolerance in one protein family) • also identified a few residues outside of the active site

Prediction of Protein Subfamily • Sequence Similarity • straight % similarity with other sequences (ignoring gaps) • BLAST • database search, assign to nearest subfamily with best alignment • HMM method • align sequence of sub-type to all HMMs of subfamilies and assign it to best alignment • will attempt to do iterative optimization of match… • Profile method • take original HMM, and probability profile • Sub-profile method • only use residues in above formula that have a positive Z-score • to reduce noise, restrict to values that have above average positive relative entropy
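The scoring formula for the profile and sub-profile methods is not shown on the slide, so the sketch below substitutes an assumed score: the sum of log profile probabilities of the query's residues, optionally restricted to positions with a positive Z-score. The names `profile_score`, `predict_subfamily` and the `floor` parameter are hypothetical.

```python
import math

def profile_score(seq, profile, positions=None, floor=1e-6):
    """Assumed score: sum of log P(residue) under a subfamily profile.

    seq:       aligned query sequence (string, '-' for gaps)
    profile:   list of {aa: probability} dicts, one per alignment position
    positions: optional subset of positions to score (sub-profile method)
    floor:     lower bound so residues missing from a profile stay finite
    """
    idx = range(len(seq)) if positions is None else positions
    return sum(
        math.log(max(profile[i].get(seq[i], 0.0), floor))
        for i in idx
        if seq[i] != "-"  # gap positions contribute nothing
    )

def predict_subfamily(seq, profiles, z=None):
    """Assign seq to the subfamily whose profile scores it highest.

    If z (one Z-score per position) is given, only positions with a
    positive Z-score are used, mirroring the sub-profile idea.
    """
    positions = None if z is None else [i for i, zi in enumerate(z) if zi > 0]
    return max(profiles, key=lambda s: profile_score(seq, profiles[s], positions))

# Toy profiles for two subfamilies over two positions (hypothetical data).
toy_profiles = {
    "a": [{"A": 0.9, "G": 0.1}, {"C": 1.0}],
    "b": [{"A": 0.1, "G": 0.9}, {"C": 1.0}],
}
label_full = predict_subfamily("AC", toy_profiles)
label_sub = predict_subfamily("GC", toy_profiles, z=[1.5, -0.5])
```

Restricting the score to informative positions is meant to keep uninformative columns, which score alike under every subfamily, from diluting the difference between candidates.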

Casari et al. (1995) A method to predict functional residues in proteins • Input: a multiple sequence alignment • each sequence is converted to a vector of size (20 * l), where l is the length of the alignment • Generation of an N x (20*l) matrix • one sequence produces a vector of dimension 20*l • N sequences produce N vectors of dimension 20*l • Use Principal Component Analysis • get the covariance matrix, which tells you how factors are correlated with one another • eliminate covariance by finding eigenvectors/eigenvalues of the covariance matrix • largest eigenvalues and corresponding eigenvectors give you the principal components • i.e. the largest factors determining the distribution of your dataset • they take the three largest (the largest of which represents the consensus sequence) • project their 20*l dimensional data onto those 3 dimensions • this can be used to predict a protein subfamily for a given protein
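The encoding and projection steps can be sketched in plain Python. This is an illustration under assumptions, not the published procedure: each residue is one-hot encoded into 20 slots, the data are mean-centered, and only the leading component is found, by power iteration. Because of the centering, the first component here is the main axis of variation rather than the consensus direction described on the slide. All function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(seq):
    """One-hot encode an aligned sequence into a vector of length 20 * l."""
    vec = [0.0] * (20 * len(seq))
    for i, aa in enumerate(seq):
        k = AMINO_ACIDS.find(aa)
        if k >= 0:  # gaps and unknown symbols stay all-zero
            vec[20 * i + k] = 1.0
    return vec

def first_component(rows, iterations=200):
    """Mean vector and leading covariance eigenvector via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Apply the covariance matrix X^T X / n without building it.
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[k] * centered[k][j] for k in range(n)) / n
             for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break  # no variance left in the direction of v
        v = [c / norm for c in w]
    return mean, v

def project(vec, mean, component):
    """Coordinate of one encoded sequence along a component."""
    return sum((a - m) * c for a, m, c in zip(vec, mean, component))

# Toy alignment: two sequences from each of two subfamilies (hypothetical data).
matrix = [encode(s) for s in ["AC", "AC", "GC", "GC"]]
mean_vec, pc1 = first_component(matrix)
coord_a = project(encode("AC"), mean_vec, pc1)
coord_g = project(encode("GC"), mean_vec, pc1)
```

In the toy alignment the two groups land on opposite sides of zero along the first component, which is the kind of separation the projection is meant to reveal.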

General Weirdness • Construction of a “comparison matrix” • multiply the matrix by its transpose • solve for eigenvectors and eigenvalues as before • Columns of f represent amino acid values and positions • becomes possible to examine individual amino acid residues and positions • plotted on a graph, shows residue correlation to type of protein subfamily • does this actually work?
