Christophe Roos - MediCel ltd
christophe.roos@medicel.fi
Good solutions are advantageous
Evolution changes sequences
Motifs, profiles, structures
Part 5: modular proteins
Similarity is a tool in understanding the information in a sequence
Proteins share similar domains
By comparing several related sequences to each other, one can distinguish segments with a higher level of conservation. These segments usually have a key role in the function of the protein. BLAST identifies related sequences fast, but only roughly.
Christophe Roos - 5/6 Profiles, motifs, structures
Refine the comparison
• A multiple sequence alignment of the best-scoring sequences found by BLAST (or in some other way) is built with a more sensitive algorithm.
• Example: the eyeless gene of the fruit fly is also found in several other groups: birds, mammals, reptiles, fish and invertebrates. There it is called PAX6.
Visualise the relationship
• Once a multiple sequence alignment is done, it can also be used to estimate relationships (evolutionary distances).
• The distance is calculated as the number of mutations needed to evolve from a putative ancestor to all of the ‘present-day’ sequences used. A path (tree) including all sequences is then computed. Different criteria can be used (maximum parsimony, maximum likelihood, etc.).
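As a rough illustration of the distance idea, the sketch below computes the simplest possible measure from an existing alignment: the fraction of aligned positions at which two sequences differ. This is only a proxy for the number of mutations, and it is the kind of input a distance-based tree method would use; it is not parsimony or maximum likelihood. The function names and the toy alignment are assumptions made for this example, not real PAX6 data.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise distances for a dict mapping sequence name to aligned sequence."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Hypothetical toy alignment (invented for illustration only).
toy = {
    "fly": "MKRPC-AVS",
    "mouse": "MRRPCGAVS",
    "fish": "MRRPCGTVS",
}

if __name__ == "__main__":
    for pair, dist in distance_matrix(toy).items():
        print(pair, round(dist, 3))
```

In this toy data the fly sequence is further from the fish sequence than from the mouse sequence, which is the kind of information a tree-building step would then turn into branch lengths.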
Visualise the output of aligned domains
First, all sequence pairs are aligned and scored; in a second round, a multiple sequence alignment is built up. In this case (PAX6 proteins from vertebrates and the fruit fly), two domains are more conserved than the rest of the sequence. The most conserved areas are highlighted with white text on a black or gray background. Only part of the alignment is shown.
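The shading of conserved areas can be approximated with a per-column conservation score. The sketch below is a minimal version under stated assumptions: conservation is taken as the fraction of sequences that share the most common residue in a column, and the threshold and the toy alignment are hypothetical. Real shading tools typically also treat chemically similar residues as partial matches, which is not done here.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column, the fraction of sequences carrying the most common residue.

    Gaps ('-') never count as the consensus residue.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def shade(aligned, threshold=1.0):
    """Mark columns at or above the threshold with '*' and the rest with '.'."""
    return "".join(
        "*" if score >= threshold else "."
        for score in column_conservation(aligned)
    )


# Hypothetical toy alignment (invented for illustration only).
toy = ["MKRPC-AVS", "MRRPCGAVS", "MRRPCGTVS"]

if __name__ == "__main__":
    for seq in toy:
        print(seq)
    print(shade(toy))
```

Runs of marked columns in the last line correspond to the segments that would be drawn with a dark background in the slide.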
Profiles and motifs
• A sequence motif is a locally conserved region of a sequence, or a short sequence pattern shared by a set of sequences.
• The term motif refers to any sequence pattern that is predictive of a molecule’s function, a structural feature, or a family membership.
• Motifs can be detected in protein, DNA and RNA sequences, but the term most commonly refers to protein motifs.
• Motifs can be represented for computational purposes as
  • Flexible patterns, e.g. [K,R]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the Prosite database at www.expasy.org)
  • Position-specific scoring matrices (PSSM, see next page)
  • Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.
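A flexible pattern of this kind maps directly onto a regular expression. The sketch below converts the notation used on the slide and searches a sequence with it. It handles only single residues, bracketed alternatives and x(n) spacers; the comma inside the brackets is accepted because the slide writes it that way, whereas Prosite's own syntax writes alternatives without commas. The function names and the test sequence are assumptions made for this example.

```python
import re

PAIRED_BOX = "[K,R]-R-P-C-x(11)-C-V-S"  # pattern as written on the slide


def pattern_to_regex(pattern):
    """Translate a simple flexible pattern into a regular expression string."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"([A-Zx]|\[[A-Z,]+\])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        # 'x' means any residue; commas inside brackets are only separators.
        core = "." if core == "x" else core.replace(",", "")
        parts.append(core + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, end) spans of non-overlapping pattern matches."""
    regex = re.compile(pattern_to_regex(pattern))
    return [m.span() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(pattern_to_regex(PAIRED_BOX))
    # Hypothetical sequence built to contain exactly one match.
    print(find_motif(PAIRED_BOX, "MGRRPC" + "A" * 11 + "CVSK"))
```

Because the pattern is unweighted, a sequence either matches or it does not; a single substitution at a fixed position is enough to lose the hit, which is the limitation that scoring matrices address.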
Position specific scoring matrix
• This corresponds to the flexible pattern of the paired box: [K,R]-R-P-C-x(11)-C-V-S (one row per pattern position, one column per residue symbol)

   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0
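To show how such a matrix is used, the sketch below scores sequence windows against the first four rows of the matrix above (the pattern positions [K,R], R, P and C). The score of a window is the sum of the entries selected by its residues, and a scan slides the window along the sequence. Restricting the example to four rows, the function names and the test sequence are choices made for this illustration; a full scan would use all 18 rows.

```python
COLUMNS = "ABCDEFGHIKLMNPQRSTVWXYZ*-"  # column order of the matrix

# First four rows of the matrix shown above.
ROWS = [
    [-22, -22, -35, -26, -15, -37, -30, -9, -38, 35, -36, -23, -16,
     -34, -5, 53, -23, -24, -35, -40, -19, -31, -9, 0, 0],
    [-51, -52, -62, -57, -46, -64, -59, -33, -66, -16, -63, -49, -44,
     -64, -34, 70, -51, -53, -63, -64, -46, -57, -40, 0, 0],
    [-42, -58, -59, -55, -53, -68, -59, -54, -63, -51, -65, -57, -62,
     73, -54, -56, -50, -53, -59, -72, -51, -69, -54, 0, 0],
    [-42, -69, 99, -75, -84, -49, -66, -72, -43, -76, -54, -53, -62,
     -79, -74, -75, -51, -48, -42, -65, -58, -59, -79, 0, 0],
]


def score_window(window):
    """Sum the matrix entries selected by each residue of the window."""
    if len(window) != len(ROWS):
        raise ValueError("window length must equal the number of rows")
    return sum(row[COLUMNS.index(res)] for row, res in zip(ROWS, window))


def best_hit(sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(ROWS)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    return max(
        (score_window(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    print(score_window("RRPC"), score_window("KRPC"), score_window("AAAA"))
    print(best_hit("MGQNRRPCAV"))  # hypothetical test sequence
```

Unlike the flexible pattern, the matrix grades its answers: both K and R are rewarded at the first position, but to different degrees, and a mismatch lowers the score instead of eliminating the hit.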
Motifs and databases: mode of use
• Motifs can be used to search sequence databases
  • take a family of related sequences
  • align them and define motifs
  • use the motifs to search a database of sequences to find novel family members
  • motifs can also be generated from unaligned sequences (e.g. with MEME, see next page)
• Motif databases can be searched with sequences
  • take one sequence and ask what known motifs it contains
  • deduce its function using knowledge about those motifs in other sequences
• Databases
  • Blocks, Fred Hutchinson Cancer Research Center (ungapped alignments)
  • COG, clusters of orthologous groups, NCBI (21 complete genomes)
  • Pfam, Sanger Center (gapped profiles, curated)
  • Prints, Univ. Manchester (fingerprints, i.e. more than one pattern)
  • Prosite, Univ. Geneva (consensus patterns, expert-curated)
  • SMART, EMBL-Heidelberg
  • InterPro, EBI (multiple, curated), includes Pfam, SMART, etc. [2 pages forward]
Motif discovery tools and PSSM creators
• The MEME tool takes unaligned sequences as input and searches for patterns according to several parameters, such as
  • Minimum and maximum length
  • Number of occurrences per sequence
  • Number of occurrences per set
• MEME also generates a PSSM for each domain found.
• MAST is a tool for searching databases with PSSMs.
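The last step, turning a set of motif occurrences into a PSSM, can be sketched as a log-odds calculation. This is not MEME's algorithm: finding the occurrences in unaligned sequences is the hard part and is not shown. The sketch assumes the occurrences are already aligned and of equal length, and the uniform background frequency, the pseudocount and the toy occurrences are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(instances, pseudocount=1.0):
    """Build a log-odds matrix (one dict per position) from aligned motif instances."""
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for column in zip(*instances):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


# Hypothetical motif occurrences (invented for illustration only).
toy_instances = ["KRPC", "RRPC", "RRPC", "KRPC"]

if __name__ == "__main__":
    for position, scores in enumerate(build_pssm(toy_instances), start=1):
        best = max(scores, key=scores.get)
        print(position, best, round(scores[best], 2))
```

The pseudocount keeps residues that were never observed from receiving a score of minus infinity, at the cost of flattening the matrix when only a few occurrences are available.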
The InterPro database of motifs at EBI
• This release (Nov 2001) was built from Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom 2001.2, SMART 3.1, TIGRFAMs 1.2, and the current SWISS-PROT + TrEMBL data. It contains 4691 entries, representing 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.
Scan the InterPro database - example
• The InterPro database was scanned with the PAX6 sequence from the fruit fly.
Protein 3D structure
• 3D is better than linear strings of letters...
• Protein folding is critical for function
• Protein folding is ordered
• Structures consist of folds
• 3D structure can be measured, but computational ab initio structure prediction is a tough task and nearly impossible above a certain protein size (limited by CPU time and by the available folding rules)
Protein 3D structure building blocks
• Primary structure: the linear array of amino acids
• Secondary structures
  • Alpha helix
  • Beta-strand
• Tertiary structures
Figure: DNA-binding protein (DNA helix, white; helices, pink; sheets of beta-strands, ochre)