910 likes | 943 Views
Learn about genome rearrangements like reversals, fusions, and translocations. Explore hurdles in rearrangement problems, median problems, and phylogeny analysis. Discover protein biochemistry tools for in-silico analysis.
E N D
1 2 3 9 10 8 4 7 5 6 Reversals • Blocks represent conserved genes. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Reversals 1 2 3 9 10 8 4 7 5 6 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 • Blocks represent conserved genes. • In the course of evolution or in a clinical context, blocks 1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.
Types of Rearrangements Reversal 1 2 3 4 5 6 1 2 -5 -4 -3 6 Translocation 1 2 3 45 6 1 2 6 4 5 3 Fusion 1 2 3 4 5 6 1 2 3 4 5 6 Fission
Sorting by reversals: 4 steps What is the reversal distance for this permutation? Can it be sorted in 3 steps?
From Signed to Unsigned Permutation (Continued) • Construct the breakpoint graph as usual • Notice the alternating cycles in the graph between every other vertex pair • Since these cycles came from the same signed vertex, we will not be performing any reversal on both pairs at the same time; therefore, these cycles can be removed from the graph 0 5 610 915 1612 117 814 1317 183 41 219 2022 21 23
Reversal Distance with Hurdles • Hurdles are obstacles in the genome rearrangement problem • They cause a higher number of required reversals for a permutation to transform into the identity permutation • Let h(π) be the number of hurdles in permutation π • Taking into account of hurdles, the following formula gives a tighter bound on reversal distance: d(π) ≥ n+1 – c(π) + h(π)
Median Problem Goal: find Mso that DAM+DBM+DCMis minimized NP hard for most metric distances
Genome Enumeration for Multichromosome Genomes Genome Enumeration For genomes on gene {1,2,3} 2 2 2 -2 -2 -2
Experimental Results (Equal Content) 80% inversion, 20% transposition
An Example—New Genomes 1 2 3 4 5 6 7 8 9 10 1 -4 5 2 8 10 9 -7 -6 3 … 1 3 5 7 9 1 5 9 -7 3 …
Support Value Threshold - FP Up to 90% FP can be identified with 85% as the threshold
Jackknife Properties • Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified • 40% jackknifing rate is reasonable • 85% is a conservative threshold, 75% can also be used • Low support branches should be examined in detail
In-silico Biochemistry • Online servers exist to determine many properties of your protein sequences • Molecular weight • Extinction coefficients • Half-life • It is also possible to simulate protease digestion • All these analysis programs are available on • www.expasy.ch
Analyzing Local Properties • Many local properties are important for the function of your protein • Hydrophobic regions are potential transmembrane domains • Coiled-coiled regions are potential protein-interaction domains • Hydrophilic stretches are potential loops • You can discover these regions • Using sliding-widow techniques (easy) • Using prediction methods such as hidden Markov Models (more sophisticated)
Sliding-window Techniques • Ideal for identifying strong signals • Very simple methods • Few artifacts • Not very sensitive • Use ProtScale on www.expasy.org • Make the window the same size as the feature you’re looking for
www.expasy.org/cgi-bin/protscale.pl Hphob. / Eisenberg
Transmembrane Domains • Discovering a transmembrane domain tells you a lot about your protein • Many important receptors have 7 transmembrane domains • Transmembrane segments can be found using ProtScale • The most accurate predictions come from using TMHMM
Using TMHMM • TMHMM is the best method for predicting transmembrane domains • TMHMM uses an HMM • Its principle is very different from that of ProtScale • TMHMM output is a prediction
>sp|P78588|FREL_CANAX Probable ferric reductase transmembrane component OS=Candida albicans GN=CFL1 PE=3 SV=1 MTESKFHAKYDKIQAEFKTNGTEYAKMTTKSSSGSKTSTSASKSSKSTGSSNASKSSTNA HGSNSSTSSTSSSSSKSGKGNSGTSTTETITTPLLIDYKKFTPYKDAYQMSNNNFNLSIN YGSGLLGYWAGILAIAIFANMIKKMFPSLTNNLSGSISNLFRKHLFLPATFRKKKAQEFS IGVYGFFDGLIPTRLETIIVVIFVVLTGLFSALHIHHVKDNPQYATKNAELGHLIADRTG ILGTFLIPLLILFGGRNNFLQWLTGWDFATFIMYHRWISRVDVLLIIVHAITFSVSDKAT GKYKNRMKRDFMIWGTVSTICGGFILFQAMLFFRRKCYEVFFLIHIVLVVFFVVGGYYHL ESQGYGDFMWAAIAVWAFDRVVRLGRIFFFGARKATVSIKGDDTLKIEVPKPKYWKSVAG GHAFIHFLKPTLFLQSHPFTFTTTESNDKIVLYAKIKNGITSNIAKYLSPLPGNTATIRV LVEGPYGEPSSAGRNCKNVVFVAGGNGIPGIYSECVDLAKKSKNQSIKLIWIIRHWKSLS WFTEELEYLKKTNVQSTIYVTQPQDCSGLECFEHDVSFEKKSDEKDSVESSQYSLISNIK QGLSHVEFIEGRPDISTQVEQEVKQADGAIGFVTCGHPAMVDELRFAVTQNLNVSKHRVE YHEQLQTWA Search with Accession number P78588 http://www.uniprot.org/uniprot/
Predicting Post-translational Modifications • Post-translational modifications often occur on similar motifs in different proteins • PROSITE is a database containing a list of known motifs, each associated with a function or a post-translational modification • You can search PROSITE by looking for each motif it contains in your protein (the server does that for you!) • PROSITE entries come with an extensive documentation on each function of the motif
Searching for PROSITE Patterns • Search your protein against PROSITE on ExPAsy • www.expasy.org/tools/scanprosite • PROSITE motifs are written as patterns • Short patterns are not very informative by themselves • They only indicate a possibility • Combine them with other information to draw a conclusion • Remember: Not everything is in PROSITE !
Interpreting PROSITE Patterns • Check the pattern function: Is it compatible with the protein? • Sometimes patterns suggest nonexistent protein features • For instance : If you find a myristoylation pattern in a prokaryote, ignore it; prokaryotic proteins have no myristoylation ! • Short patterns are more informative if they are conserved across homologous sequences • In that case, you can build a multiple-sequence alignment • This slide shows an example
Patterns and Domains • Patterns are usually the most striking feature of the more general motifs (called domains) • Domains are less conserved than patterns but usually longer • In proteins, domain analysis is gradually replacing pattern analysis
Protein Domains • Proteins are usually made of domains • A domain is an autonomous folding unit • Domains are more than 50 amino acids long • It’s common to find these together: • A regulatory domain • A binding domain • A catalytic domain
Discovering Domains • Researchers discover domains by • Comparing proteins that have similar functions • Aligning those proteins • Identifying conserved segments • A domain is a multiple-sequence alignment formulated as a profile • For each column, a domain indicates which amino acid is more likely to occur
Domain Collections • Scientists have been discovering and characterizing protein domains for more than 20 years • 8 collections of domains have been established • Manual collections are very precise but small • Automatic collections are very extensive but less informative • These collections • Overlap • Have been assembled by different scientists • Have different strengths and weaknesses • We recommend using them all!
The Magnificent 8 • Pfam is the most extensive manual collection • Pfam is often used as a reference