370 likes | 824 Views
Function Prediction from Protein Sequence. Basic definitions. Primary structure the linear sequence of amino acids in a protein Secondary structure regions of local regularity i.e., a -helices, b -strands, -sheets & -turns. Definitions contd. Super-secondary structure
E N D
Basic definitions • Primary structure • the linear sequence of amino acids in a protein • Secondary structure • regions of local regularity • i.e.,a-helices, b-strands, -sheets & -turns
Definitions contd. • Super-secondary structure • the packing of secondary structure elements into stable units • e.g., b-barrels, bab units, Greek keys, etc..
Definitions contd. • Tertiary structure • the overall chain fold that results from packing of secondary structure elements
Definitions contd. • Quaternary structure • the arrangement of separate chains within a protein that has more than one subunit • e.g., haemoglobin
Definitions contd. • Quinternary structure • the arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions
Importance of sequence analysis • Feb 2008 >67 million sequences available in Genbank • & millions more (including ESTs) in proprietary dbs • these #s will snowball with completion of more genomes • so what? • Locked up in sequences is a huge amount of structural, functional & evolutionary info • they're ahighly valuable resource • By contrast, the # of solved protein structures is ~45000 • a hugeinformation deficit
The legacy of the genome projectsSequence-structure deficit • Non-redundant growth of sequences during 1988-2002 ( ) & the corresponding growth in the number of structures ( ). 800 700 600 500 400 300 200 100 1988 2002
Challenges for bioinformatics • Spurred on by the seq/structure deficit, the challenges • rationalisethe mass of sequence data • derive moreefficientmeans of data storage • design moreincisive&reliableanalysis tools • The imperative - to convert sequence information into biochemical & biophysical knowledge • to decipher the structural, functional & evolutionary clues encoded in thelanguageof biological sequences
The Holy Grail of bioinformatics • ...to be able to understand thewordsin a sequence sentence that form a particular protein structure
Pattern recognition & prediction • In investigating the meaning of sequences, two distinct analytical approaches have emerged • pattern recognitionis used to detect similarity between sequences & hence to infer related structures & functions • ab initio predictionis used to deduce structure, & to infer function, directly from sequence • These methods are quite different! • pattern recognition methodsdemandthat some characteristic has beenseenbefore & housed in a db • prediction methods remove the need for template dbs, because deductions are madedirectlyfrom sequence
Science fact & fiction • Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition • which is ~50% reliable even in expert hands • Prediction is still not possible • & is unlikely to be so for decades to come (if ever) • But, to debunk a popular myth, knowing structure alone does not inherently tell us function
The Twilight Zone • Prediction methods don’t work because we don’t fully understand the Folding Problem • we can’t read the language sequences use to create their folds • But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs • whose structures & functions we hope have been elucidated • This is straightforward at high levels of identity, but below 50% it is difficult to establish relationships reliably • Analyses can be pursued with decreasing certainty towards the Twilight Zone • ~20% identity, where results may look plausible to the eye, but areno longer statistically significant
Beyond the Twilight Zone • To penetrate deeper into the Twilight Zone is the aim of most analytical methods • whether using single sequences, motifs, complex weighting schemes or raw amino acid frequencies • Each offers a different perspective, depending on the type of information used in the search • none gives therightanswer • It is good practice to devise an analysis protocol that uses a variety of methods • but don’t expect the impossible – no method is infallible!
Application areas of analysis tools The scale indicates % identity between aligned sequences Alignment of 2 random seqs can produce ~20% identity less than 20% does not constitute a significant alignment around this threshold is theTwilight Zone, where alignments may appear plausible to the eye, but can’t be proved by conventional methods
Homology & analogy • The term homology is confounded & abused! • sequences arehomologousif they are related bydivergence from a common ancestor • analogyrelates to the acquisition of common features from unrelated ancestors viaconvergentevolution • e.g., b-barrels occur in soluble & membrane proteins; enzymes chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities
The Midnight Zone • In the genome era, prediction of function from sequence is of more immediate value than is the prediction of structure • However, between distantly-related proteins, structure is more conserved than the underlying sequences • thus, some relationships are only apparent at the structural level • Such relationships can’t be detected by even the most sensitive sequence comparison methods • the region of identity where sequence comparisons fail completely to detect structural similarity is theMidnight Zone – there is thus atheoretical limitto the effectiveness of sequence analysis methods
Family (pattern) databases • SWISS-PROT is emerging as a standard, & most pattern dbs use it as their basis • PROSITESWISS-PROTRegular expressions (patterns) • PRINTSSWISS-PROT/TrEMBLAligned motifs (fingerprints) • PfamSWISS-PROT/TrEMBLHidden Markov Models (HMMs) • ProfilesSWISS-PROTWeight matrices (profiles) • Blocks InterPro/PRINTSWeighted motifs (blocks) • eMOTIFBlocks/PRINTSPermissive regular expressions
Why create pattern databases? • Pattern dbs arise from the need to make more specific functional diagnoses than are possible simply by searching • They are built on the principle that homologous sequences may be gathered together in multiple alignments, within which are regions (motifs) that show little variation • these motifs usually reflect some vital biological role in terms of either structure or function • Motifs are exploited in different ways to build diagnostic patterns for protein families • new sequences can be searched against dbs of such patterns to see if they can be assigned to known families • hence they offer a fast track to the inference offunction
Methods for family analysis Single motif methods Fuzzy regex (eMOTIF) Full domain alignment methods Exact regex (PROSITE) Profiles (PROFILE LIBRARY) HMMs (Pfam) Identity matrices (PRINTS) Multiple motif methods Weight matrices (Blocks)
PROSITE • The first pattern db • based on the idea that a protein family can be characterised by apattern of conserved residues within asingle motif • Sequence information in motifs is reduced to consensus or regular expressions (regexs) & the seed regex used to search SP • results are inspected manually to achieve optimal results • Some families can’t be characterised by single motifs • here, additional regexs are created until an optimal set is achieved that captures most or all of the family • results are thenmanually annotatedfor inclusion in the db
Regular expressions (patterns) • These are derived from single conserved regions in alignments • they are minimal expressions, so sequence information is lost • the more divergent the sequences used, the more fuzzy & poorly discriminating the regex becomes • Alignment Regex • GAVDFIALCDRYF • GPIDFVCFCERFYG-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2 • GRVEFLNRCDRYY • Regexs do not tolerate similarity • sequences either match or not, regardless of how similar they are • matching is abinary‘on-off’ event & frequently misses true matches • single-motif methods are veryhit-or-miss – how do you know if you've encoded the ‘best’ region?
Regular expressions (rules) • Regex patterns are most effective when applied to highly-conserved, family-specific motifs • It is often possible to identify, shorter generic patterns within sequences, characteristic of common functional sites • Functional site Rule • N-glycosylationN-{P}-[ST]-{P} • Protein kinase C phosphorylation[ST]-X-[RK] • Casein kinase II phosphorylation[ST]-X2-[DE] • Such features result from convergence to a common property • glycosylation sites, phosphorylation sites, etc. • They cannot be used for family diagnosis & don’t discriminate • they can only be used to suggest whether a certain functional sitemightexist (which must then be tested by experiment) • such patterns are normally termedrules
Residue groups for fuzzy regexs • It is possible to assign residues to groups based on various biochemical properties – e.g., charge & size • using such groups theoretically ensures that resulting regexs havesensible biochemical interpretations • small Ala, Gly • small hydroxyl Ser, Thr • basic His, Lys, Arg • aromatic Phe, Tyr, Trp • aliphatic Val, Leu, Ile, Met • acidic/amide Asp, Glu, Asn, Gln • small/polar Ala, Gly, Ser, Thr, Pro • This is more flexible than exact regex matching
Profiles & Pfam • An alternative to motif-based methods exploits regions between motifs, which also contain valuable information • the full alignment effectively becomes the discriminator • A complex scoring scheme allowing for substitutions & INDELs is used to create family-specific profiles • These profiles can be used to detect distant relation-ships, where only few residues are conserved • this is the basis of the Profile library • In an extension of this approach, alignments are encoded as probabilistic models termed HMMs • this is the basis of Pfam
Profiles • Profiles are scoring tables derived from full domain alignments • these define which residues are allowed at given positions • which positions are conserved & which degenerate • which positions, or regions, can tolerate insertions • the scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment • variable penalties are specified to weight against INDELs occurring in core 2' structure elements • Within a profile, the I & M fields contain position-specific scores for insert & match positions • in conserved regions, INDELs aren't totally forbidden, but are stronglyimpededby large penalties defined in the DEFAULT field • these are superseded by morepermissivevalues in gapped regions • the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive
A HMM I1 I2 I3 B S N M1 M2 M3 M4 C T E W L R L Y C D C E D2 D3 D4 D1 G A J E L Y C D L W C
Fingerprints • Fingerprints are groups of conserved (ungapped) motifs excised from alignments & used for iterative db searching • no weighting scheme is used • searches depend only onresidue frequencies • resulting scoring matrices are thussparse • Each motif trawls the db independently • search results are correlated to determine which sequences match all the motifs & which match only partially • no information is thrown away • The iterative process refines the fingerprint & increases its power • potency is gained from themutual contextof motif neighbours • results are biologically more meaningful than those from single motifs
TM domain loop region TM domain
Composite pattern databases • To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource– InterPro • release 4.0 contains4691entries • a central annotation resource, with pointers to its satellite dbs • initial partners were PRINTS, PROSITE, profiles & Pfam • new partners include ProDom, TIGRfam, SMART & hopefully others (e.g., Blocks, MetaFam) • lags behind its sources • major role in fly & human genome annotation