PATTERNS

PATTERNS Kim Henrick

Terri Attwood School of Biological Sciences University of Manchester, Oxford Road Manchester M13 9PT, UK http://www.bioinf.man.ac.uk/dbbrowser/

Motifs and domains
• Motif: a simple combination of a few consecutive secondary structure elements with a specific geometric arrangement (e.g., helix-loop-helix). May have a specific biological function.
• Domain: the fundamental unit of structure folding and evolution. It combines several secondary structure elements and motifs packed in a compact globular structure. A domain can fold independently into a stable 3D structure, and may have a specific function.
• Domain family: proteins that share a domain (possibly in combination with other domains).
• Protein family: proteins that have the same combination of domains.

Profiles & Motifs are Useful
• Helped identify active site of HIV protease
• Helped identify SH2/SH3 class of STP’s
• Helped identify important GTP oncoproteins
• Helped identify hidden leucine zipper in HGA
• Used to scan for lectin binding domains
• Regularly used to predict T-cell epitopes
Domains are More Useful

Rules of Thumb
• Sequence pattern-based motifs should be determined from no fewer than 5 multiply aligned sequences
• A good degree of sequence divergence is needed. If S is the similarity (the % similarity expressed as a fraction) and N is the number of sequences, then $1 - S^N > 0.95$
• A good sequence pattern should have no fewer than 8 defined amino acid positions
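The divergence rule above can be checked numerically. The following is a minimal sketch, assuming S is given as a fraction (0.8 for 80% similarity); the function names are illustrative and not part of the original slides.

```python
def diverse_enough(similarity: float, n_sequences: int, threshold: float = 0.95) -> bool:
    """Return True if 1 - S**N exceeds the threshold (S as a fraction, 0 to 1)."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be a fraction between 0 and 1")
    return 1.0 - similarity ** n_sequences > threshold


def min_sequences(similarity: float, threshold: float = 0.95) -> int:
    """Smallest N for which 1 - S**N > threshold."""
    if not 0.0 <= similarity < 1.0:
        raise ValueError("similarity must be a fraction in [0, 1)")
    n = 1
    while not diverse_enough(similarity, n, threshold):
        n += 1
    return n
```

Under this reading, five sequences at 50% similarity satisfy the rule, whereas sequences at 80% similarity would need a larger set (`min_sequences(0.8)` gives the count).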

Representations of protein families
• Regular expression
• Position specific scoring matrices (profiles)
• Hidden Markov Models
• Probabilistic suffix trees
• Sparse Markov transducers

Pattern recognition methods
• These methods classify proteins into families
• The basis of the methods is multiple sequence alignment
• They depend on developing a representation of conserved elements of alignments that may be diagnostic of structure or function, whether from homologous sequence families or from sequences that share some structural/functional domains

Regular expressions/patterns
• These are derived from single conserved regions, which are reduced to consensus expressions for db searches
• They are minimal expressions, so sequence information is lost
• The more divergent the sequences used, the more fuzzy & poorly discriminating the pattern becomes
Alignment        Pattern
GAVDFIALCDRYF
GPIDFVCFCERFY    G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
GRVEFLNRCDRYY
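To make the alignment-to-pattern step concrete, here is a small sketch that translates a pattern in the slide's notation into a Python regular expression and checks it against the three aligned sequences. The converter `pattern_to_regex` is a hypothetical helper written for this illustration; it accepts repeats written either as `X2` (as on the slide) or as `X(2)` (as in PROSITE entries).

```python
import re

# One pattern element: a residue, X, [allowed] or {forbidden}, with an optional
# repeat written either as a bare number (X2) or PROSITE-style (X(2), X(2,4)).
ELEMENT = re.compile(
    r"^(?P<core>[A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<rng>\d+(?:,\d+)?)\)|(?P<num>\d+))?$"
)


def pattern_to_regex(pattern: str) -> str:
    """Translate a hyphen-separated sequence pattern into regex syntax."""
    parts = []
    for element in pattern.strip().split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unrecognised pattern element: {element!r}")
        core = m.group("core")
        if core in ("X", "x"):          # any residue
            core = "."
        elif core.startswith("{"):      # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = m.group("rng") or m.group("num")
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


ALIGNMENT = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
PATTERN = "G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2"

regex = re.compile(pattern_to_regex(PATTERN))
matches = [bool(regex.fullmatch(seq)) for seq in ALIGNMENT]
```

All three aligned sequences match, while a sequence that differs at a single constrained position (for example a W in the final position) does not, which illustrates the all-or-nothing behaviour of this representation.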

Regular expressions/patterns
Patterns do not tolerate similarity
• sequences either match or not, regardless of how similar they are
• matching is a binary ‘on-off’ event & frequently misses true matches
• single-motif methods are very hit-or-miss: how do you know if you've encoded the ‘best’ region?

PROSITE
G_PROTEIN_RECEPTOR; PATTERN
PS00237;
G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=919(919); /POS=869(869); /FALSE_POS=50(50); /FALSE_NEG=70; /PARTIAL=49; UNKNOWN=0(0)
• This represents an apparent 18% error rate; the actual rate is probably higher
• Thus, a match to a pattern is not necessarily true & a mis-match is not necessarily false!
• False-negatives are a fundamental limitation to this type of pattern matching: if you don't know what you're looking for, you'll never know you missed it!
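The slide does not say how the 18% figure was derived. One reading that reproduces it, shown here as an assumption rather than the author's stated method, counts false positives, false negatives and partial matches against the total number of hits. The precision and sensitivity helpers are additional illustrations, not figures from the slide.

```python
# Statistics copied from the PS00237 entry quoted on the slide.
STATS = {"total": 919, "pos": 869, "false_pos": 50, "false_neg": 70, "partial": 49}


def apparent_error_rate(s: dict) -> float:
    """(false positives + false negatives + partials) / total hits (assumed formula)."""
    return (s["false_pos"] + s["false_neg"] + s["partial"]) / s["total"]


def precision(s: dict) -> float:
    """Fraction of reported hits that are true family members."""
    return s["pos"] / (s["pos"] + s["false_pos"])


def sensitivity(s: dict) -> float:
    """Fraction of true members that were found (partials not counted here)."""
    return s["pos"] / (s["pos"] + s["false_neg"])


error = apparent_error_rate(STATS)
```

With these numbers `error` rounds to 0.18. Whether partial matches should count as errors is a judgment call, which is one reason the slide calls the rate "apparent".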

Regular expressions/rules
• Regular expression patterns are most effective when applied to highly-conserved, family-specific motifs
• It is often possible to identify shorter generic patterns that are characteristic of common functional sites
Functional site                      Rule
N-glycosylation                      N-{P}-[ST]-{P}
Protein kinase C phosphorylation     [ST]-X-[RK]
Casein kinase II phosphorylation     [ST]-X2-[DE]
• Such features result from convergence to a common property (glycosylation sites, phosphorylation sites, etc.)
• They cannot be used for family diagnosis & don't discriminate; they can only be used to suggest whether a certain functional site might exist (which must then be tested by experiment)
• Such patterns are termed rules
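The three rules above translate directly into regular expressions. The sketch below scans a sequence for candidate sites; the function name and test sequences are made up for illustration, and a hit only suggests that a site might exist.

```python
import re

# The slide's rules rewritten in regex syntax: {P} becomes [^P], X becomes "."
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def scan_sites(sequence: str) -> dict:
    """Return {rule name: [(1-based position, matched segment), ...]}."""
    hits = {}
    for name, rule in RULES.items():
        # A lookahead lets overlapping candidate sites be reported as well.
        lookahead = re.compile(f"(?=({rule}))")
        hits[name] = [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]
    return hits
```

Because the rules are so short, almost any protein of reasonable length will produce hits, which is why the slide stresses that they cannot be used for family diagnosis.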

Diagnostic limitations of short motifs
• Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)
• Results of db searching for such a sequence will differ, depending on whether we search for exact or permissive ‘fuzzy’ matches
Pattern                              Matches (in OWL31.1)
D-A-V-I-D                            99
D-A-V-I-[DEQN]                       252
[DEQN]-A-V-I-[DEQN]                  925
[DEQN]-A-[VLI]-I-[DEQN]              2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]       51,506
D-A-V-E                              1,493
• Use of fuzzy regular expressions has the potential advantage of being able to recognise more distant relationships & the inherent disadvantage that more matches will be made by chance, making it difficult to separate out true matches from noise
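A rough way to see why fuzzier patterns attract more chance matches is to count how many distinct peptides each pattern accepts. The sketch below assumes, purely for illustration, that all 20 amino acids are equally likely at every position; real databases such as OWL are not uniform, so these numbers will not reproduce the observed counts.

```python
from math import prod


def allowed_counts(pattern: str) -> list:
    """Number of residues permitted at each position of a hyphen-separated pattern."""
    counts = []
    for element in pattern.split("-"):
        if element.startswith("["):
            counts.append(len(element) - 2)   # residues listed inside the brackets
        elif element in ("X", "x"):
            counts.append(20)                 # any residue
        else:
            counts.append(1)                  # a single fixed residue
    return counts


def degeneracy(pattern: str) -> int:
    """How many distinct peptides the pattern accepts."""
    return prod(allowed_counts(pattern))


def chance_probability(pattern: str) -> float:
    """Probability that a random window matches, under the uniform assumption."""
    return prod(c / 20 for c in allowed_counts(pattern))
```

Under this simplification the most permissive pattern in the table accepts 288 times as many peptides as D-A-V-I-D. The observed counts grow even faster (99 to 51,506, roughly 520-fold), presumably because residue frequencies in real sequences are far from uniform.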

Fingerprints
• Fingerprints are groups of motifs excised from alignments & used for iterative db searching
  • no weighting scheme is used
  • searches depend only on residue frequencies
  • resulting scoring matrices are thus sparse
• Each motif trawls the database independently
  • search results are correlated to determine which sequences match all the motifs & which match only partially
  • no information is thrown away
• Iteration refines the fingerprint & increases its potency
  • fingerprints are diagnostically more powerful than regular expressions
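The idea of an unweighted, sparse frequency matrix can be sketched as follows. This is an illustration of scoring by raw residue counts only, not the actual PRINTS search procedure; the motif sequences and function names are assumptions chosen for the example.

```python
from collections import Counter


def frequency_matrix(motif_seqs: list) -> list:
    """One Counter of observed residues per motif position (unseen residues count 0)."""
    length = len(motif_seqs[0])
    if any(len(s) != length for s in motif_seqs):
        raise ValueError("motif sequences must all have the same length")
    return [Counter(column) for column in zip(*motif_seqs)]


def score_window(matrix: list, window: str) -> int:
    """Sum of raw residue counts; no weighting or background correction."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix: list, sequence: str):
    """Return (score, 0-based start) of the best-scoring window, or None if too short."""
    width = len(matrix)
    scores = [(score_window(matrix, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scores, key=lambda t: t[0]) if scores else None


MOTIF = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
matrix = frequency_matrix(MOTIF)
```

A full fingerprint would hold several such matrices, scan with each independently, and then correlate which sequences were hit by all motifs and which by only some.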

Profiles
• Profiles are scoring tables derived from full alignments; these define
  • which residues are allowed at given positions
  • which positions are conserved & which degenerate
  • which positions, or regions, can tolerate insertions
• The scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment
• Variable penalties are specified to weight against INDELs occurring in core secondary structure elements

Profiles
Within a profile, fields contain position-specific scores for insert & match positions
• in conserved regions, INDELs aren't totally forbidden, but are strongly impeded by large penalties defined in a DEFAULT field
• these are superseded by more permissive values in gapped regions
• the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive
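As a much simplified illustration of position-specific match scores, the sketch below builds log-odds scores from an ungapped alignment. It assumes a uniform background and a flat pseudocount, and it omits the insert scores and position-specific gap penalties that real profiles carry; all names and parameters are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: list, pseudocount: float = 1.0) -> list:
    """Per-column log2-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: list, segment: str) -> float:
    """Sum of match scores for an ungapped segment of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(profile, segment))


profile = build_profile(["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"])
```

Unlike a regular expression, every residue receives some score at every position, so a sequence with one unusual substitution is penalised rather than rejected outright.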

Hidden Markov Models
• HMMs are similar in concept to profiles
• They are probabilistic models consisting of inter-connecting states: essentially, linear chains of match, delete or insert states

Hidden Markov Models
• Match states are assigned to conserved columns in an alignment
• Insert states allow for insertions relative to match states
• Delete states allow match positions to be skipped
Thus, building an HMM requires each position in an alignment to be assigned to match, delete or insert states
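The assignment of alignment positions to states can be sketched with a simple heuristic: treat a column as a match column when no more than half of its entries are gaps. That threshold is an assumption made for this example, not something stated on the slide, and the function names are hypothetical.

```python
def match_columns(alignment: list, max_gap_fraction: float = 0.5) -> list:
    """True for columns treated as match columns, False for insert columns."""
    n = len(alignment)
    width = len(alignment[0])
    return [
        sum(seq[i] == "-" for seq in alignment) / n <= max_gap_fraction
        for i in range(width)
    ]


def state_path(aligned_seq: str, is_match: list) -> str:
    """State string for one aligned sequence: M = match, D = delete, I = insert."""
    path = []
    for residue, match_col in zip(aligned_seq, is_match):
        if match_col:
            path.append("M" if residue != "-" else "D")
        elif residue != "-":
            path.append("I")   # a residue in a non-match column is an insertion
        # a gap in an insert column uses no state at all
    return "".join(path)


ALIGNMENT = ["AC-DE", "AC-DE", "ACWD-", "A--DE"]
is_match = match_columns(ALIGNMENT)
paths = [state_path(seq, is_match) for seq in ALIGNMENT]
```

Counting the residues emitted in each match state and the transitions between states along these paths would then give the emission and transition probabilities of the model.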

Hidden Markov Models
HMMs usually perform well, but can be over-trained
• they may also suffer if created from automatic iterative processes
• once it accepts a false match, an HMM becomes corrupted

Probabilistic Suffix Trees
• Identify short significant contiguous segments
• Do not require multiple alignment
• Induce a probability distribution on the next symbol to appear right after the segment (short term memory)
• Variable memory length
• More efficient than order L Markov chains
• Longer memory length compared to first-order HMMs, and easier to learn
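The variable-memory idea can be illustrated with a toy model that stores next-symbol counts for every context up to a maximum length and predicts from the longest context seen often enough. This is a simplified sketch: a real probabilistic suffix tree prunes contexts with statistical criteria and smooths its probabilities, and the names and thresholds here are assumptions.

```python
from collections import Counter, defaultdict


def train(sequences: list, max_depth: int = 3) -> dict:
    """Count next symbols after every context (preceding segment) up to max_depth long."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(0, min(max_depth, i) + 1):
                counts[seq[i - depth:i]][symbol] += 1
    return counts


def next_symbol_probability(counts: dict, history: str, symbol: str,
                            max_depth: int = 3, min_count: int = 2) -> float:
    """P(symbol | history) using the longest context observed at least min_count times."""
    for depth in range(min(max_depth, len(history)), -1, -1):
        context = history[len(history) - depth:]
        total = sum(counts[context].values()) if context in counts else 0
        if total >= min_count:
            return counts[context][symbol] / total
    return 0.0


model = train(["ABABAB", "ABABAC"], max_depth=2)
```

When the recent history has not been seen before, the model falls back to shorter contexts and finally to overall symbol frequencies, which is the sense in which the memory length is variable.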

Which method is best?
• The range of methods available leads to familiar problems
  • which should we use?
  • which is the most reliable?
  • which is the most comprehensive?
• None of the pattern-recognition techniques is infallible
  • each has its optimum area of application
• None of the resulting pattern databases is complete
  • none is the best
• Bearing in mind the diagnostic strengths & weaknesses of the different approaches, & keeping biological significance in mind, the best strategy is to use them all

Pattern recognition & prediction
• In investigating the meaning of sequences, 2 distinct analytical approaches have emerged
  • pattern recognition is used to detect similarity between sequences & hence to infer related structures & functions
  • ab initio prediction is used to deduce structure, & to infer function, directly from sequence
• These methods are different & shouldn’t be confused!

Pattern recognition & prediction
• Sequence- & structure-based pattern recognition methods demand that some characteristic has been seen before & housed in a db
• Prediction methods remove the need for template dbs because deductions are made directly from sequence

fact & fiction
• Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition, which is only ~40-50% reliable even in expert hands
• Prediction is still not possible & is unlikely to be so for decades to come (if ever)
• Structural genomics will yield representative structures for more proteins in future
  • structures of new sequences will be determined by modelling
  • prediction will become an academic exercise
• But, to debunk a popular myth, knowing structure alone does not inherently tell us function

fact & fiction
• Prediction methods don’t work because we don’t fully understand the Folding Problem: we can’t read the language sequences use to create their folds
• But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs whose structures & functions we hope have been elucidated
• This is straightforward at high levels of identity, but below 50% it is difficult to establish relationship reliably
• Analyses can be pursued with decreasing certainty down to ~20% identity, where results may look plausible to the eye, but are no longer statistically significant
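The identity levels mentioned above refer to pairwise percent identity. A minimal sketch of that calculation for two already-aligned sequences follows. The choice to ignore gapped columns in the denominator is an assumption (several conventions exist), and the zone labels simply restate the slide's approximate 50% and 20% boundaries.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        identical += a == b
    return 100.0 * identical / compared if compared else 0.0


def rough_zone(identity_percent: float) -> str:
    """Label an identity level using the slide's approximate boundaries."""
    if identity_percent >= 50:
        return "straightforward"
    if identity_percent > 20:
        return "difficult"
    return "not statistically significant"


pid = percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY")
```

The two example sequences share 5 of 13 positions, which places them in the range where a relationship is hard to establish from identity alone.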

TERMINOLOGY

Homology & analogy
• The term homology is confounded & abused in the literature!
  • sequences are homologous if they’re related by divergence from a common ancestor
  • analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution
  • e.g., β-barrels occur in soluble serine proteases & integral membrane porins; chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities

Homology & analogy
Homology is not a measure of similarity & is not quantifiable
• it is an absolute statement that sequences have a divergent rather than a convergent relationship
• the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!
This is not just a semantic issue
• loose use muddies thinking about evolutionary relationships

A terminology muddle
• In comparing 3D structures, exactly the same arguments apply
  • structures may be similar, as denoted by RMS positional deviation between compared atomic positions
  • common evolutionary origin remains a hypothesis, until supported by other evidence
  • homology among similar structures is a hypothesis
• This may be correct or mistaken, but their similarity is a fact, no matter how it is interpreted
• Similarity of sequence or structure is just that: similarity
• Homology connotes a common evolutionary origin

Classification of homologs
• Orthologs: two genes from two different species that derive from a single gene in the last common ancestor of the species.
• Paralogs: two genes that derive from a single gene that was duplicated within a genome.

Orthology & paralogy
Among homologous sequences we can distinguish
• orthologues, which largely perform the same function in different species
• paralogues, which perform different but related functions in one organism

Orthology & paralogy
Studying orthologues opens the way to molecular palaeontology
• e.g., using phylogenetic trees to show cross-species relationships
Paralogues shed light on underlying evolutionary mechanisms
• paralogous proteins are thought to have arisen from single genes via successive duplication events
• duplicated genes follow separate evolutionary pathways & new specificities evolve through variation & adaptation
Such complexity presents real challenges for sequence analysis

Classification of homologs
• Inparalogs: paralogs that evolved by gene duplication after the speciation event.
• Outparalogs: paralogs that evolved by gene duplication before the speciation event.

Challenges for sequence analysis
• Much of the challenge is in getting the biology right
  • this is complicated by orthology vs paralogy
• Following a db search, it may be unclear how much functional annotation can be legitimately inherited by a query
  • this is a source of numerous annotation errors in dbs
  • propagation could lead to an error catastrophe

Challenges for sequence analysis
• Further complications result from the modular nature of proteins
  • modules are autonomous folding units, used as protein building blocks
  • like Lego bricks, they can confer a variety of functions on the parent protein, either by multiple combinations of the same module, or via different modules to form mosaics
• Automatic systems don’t distinguish orthologues from paralogues & don’t consider the modular nature of proteins

Challenges for sequence analysis
• Identifying evolutionary links between sequences is useful: this often implies a shared function
• Arguably, prediction of function from sequence is of more immediate value than the prediction of structure
• However, between distantly-related proteins, structure is more conserved than the underlying sequences; thus, some relationships are only apparent at the structural level
• Such relationships can't be detected by even the most sensitive sequence comparison methods
• There is thus a theoretical limit to the effectiveness of sequence analysis methods, and a region of identity where sequence comparisons fail completely to detect structural similarity

What can we learn from them?
• Orthologous proteins are evolutionary, and typically functional, counterparts in different species.
• Paralogous proteins are important for detecting lineage-specific adaptations.
• Both can reveal information on a specific species or a set of species.

Databases of protein domains
Prosite     http://www.expasy.ch/prosite/
Pfam        http://www.sanger.ac.uk/Software/Pfam/
Blocks      http://www.blocks.fhcrc.org/
ProDom      http://prodes.toulouse.inra.fr/prodom/doc/prodom.html
Prints      http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Domo        http://www.infobiogen.fr/services/domo/
InterPro    http://www.ebi.ac.uk/interpro/
Smart       http://smart.embl-heidelberg.de/
eMotif      http://dna.stanford.edu/identify

Integrating Pattern Databases
• MetaFam
• IProClass
• CDD
• InterPro

Reference
Jinfeng Liu & Burkhard Rost. Domains, motifs, and clusters in the protein universe. Current Opinion in Chemical Biology, Vol 7 No 1, 2003
