260 likes | 348 Views
Alignments and phylogeny Peter Hantz EMBL Heidelberg. Evolution. Mutations: changes in the DNA sequence chemicals, physical conditions, replication errors (cca 10/replication/human genome) coding region mutations: >silent results in the same AA > mis-sense
E N D
Alignments and phylogeny Peter Hantz EMBL Heidelberg
Evolution Mutations: changes in the DNA sequence chemicals, physical conditions, replication errors (cca 10/replication/human genome) coding region mutations: >silent results in the same AA >mis-sense results in a different AA >nonsense results in a stop codon non-coding region mutations: e.g. Transcription factor binding sites - developmental deficiencies gene dublications + "neofunctionalization" of the extra copy new function - new evolutionary pressure
Evolution Homology, orthology, paralogy Homologous: genes having a common ancestor orthologous: common ancestor, in different species paralogous: common ancestor, gene dublication in the same species wikipedia
Alignments Aims: -to compare DNA or protein sequences that diverged during evolutionary processes Why is it good for? -a new piece of DNA/protein: what's it? (proteins: structural relationships) -to trace evolutionary relationships (phylogenetic trees) -to find mutations that cause diseases -for gene cloning: degenerate PCR primer design -forensics, whale protection...
Some applications... Forensics Alignments are the first step of building evolutionary trees A. Budd Malaria - Sickle Cell Anemia snp 23andme.com
Global and Local Alignments Global: assumed to be similar overall (e.g. closely related ones) "forces" the alignment to span the entire length of the sequences e.g. to find evolutionary relationships e.g. to find mutations in a gene Local: if short pieces (motifs, domains) are similar to identify regions of similarity within long sequences e.g. To find coding regions in a genomic dna full with introns Source:wikipedia
How a very simple pair-wise alignent works: the DotPlot method Source: Gene Cloning
Global/local alignmnets: Dynamic Programing G: Needleman-Wunsch algorithm L: Smith-Waterman algorithm Aligning DNA |: identical nothing: mismatch -: gap How do we quantify their similarity?
Aligning DNA, continued, concepts for both global and local alignments Measure of similarity: f(#of matches, mismatches, gaps) How do we count for identities, mismatches, gaps? A simple model (standard BLAST scoring matrix): Identity +2 Mismatch -3 Gap creation -5 Gap extension -2 Terminal Mismatch 0 All mismatches are "similar"? Kimura model: purine-purine and pirimidine-pirimidine : more probable Score of two aligned DNA sequences (or parts of them): TCCTAGGACTCATCGTAAGGTCCTAG - - AACCTCGTAAGG +2+2+2 +2+2+2 -5-2 -3 -3 +2 -2 -3 +2+2 +2 +2+2+2+2 =28-7-9=14
Aligning Proteins some subtitutions less relevant than other ones change to a similar AA: not too much change in the protein structure change to sg. else: unprobable, it will be selkected against letter =match, +=conservative substitution, Ø= non-cons. substitution, - =gap (software: Blast) Identities: same AA Positives: "similar" AA-s, incl. identities
How "similar" are these sequences? Measure of similarity: f(#of matches, # of cons/non-cons ch, #gaps) AA changes: quantification by the "substitution matrix" >the simplest one:the identity matrix >A more realistic one: BlockSUbstitutionMatrix (BLOSUM62): +probable, -unprobable, diagoinal:prob. let like this Gap penalties: usually -10 for gap open and -2 for gap extension The score: measure of similarity of a segment/the enitre length of two aligned protein sequences high score means: good alignment/real similarity score for a given alignment = symbol-wise score total (matrix) + gap penalty total A A B B C C D D - - E E F A A - - - - D D K K K E F G G 4+4 -10-2-2-2+6+6 -10-1+1+5+6 0 0 =
Significance: E-value Why score S is not enough? To what do we compare it? Statistics is needed. Ex. Given our query sequence (DNA or protein) Let's take a random sequence from a hypothetical database of size D: Can an alignment as good or better that this occur BY CHANCE? (calculated from a random database sequence) From the score S a so-called "Expectation value" is calculated E=P(S(random)>S(query))D If the chance is tiny (E<10^-6): unlikely that the observed alignment is due to chance alone Note: D is very high, the E-value increases
Multiple alignments: more complex than the pairwise ones "Progressive alignment methods" (e.g. Clustal) Iterative methods (e.g. Muscle) Multiple alignment of Nitric Oxide Synthase protein sequences (P. Hantz) Viagra! If two sequences have a large % of identity, they can be interpreted to be homologous (y/n) histons - very conservative MHC gene pool - evolves like crazy
The BLAST (Basic Local Alignment Search Tool) Scored pairwise local alignments are generated very fast "Is my new sequence related to sg I know about?" Also used for aligning sequences How does it work? In a nutshell: -List the words of length 3 (by default) of the query: PQGEFG >> PQG, GGE, GEF, EFG -Scan the database sequences with "the relevant ones" of these -try extending the exact matches, via local alignments, until there are not too much mismatches (until a score level) >> HSP-s (High-Scoring Segment Pairs) -Evaluate the significance and E-value of the HSP-s
Using it: BLAST sub-programs (beside the pairwise alignments) Blastn: Search a nucleotide database using a nucleotide queryWhat can my sequence be? – close relatives (What are the sequences similar to my sequence?) BlastP: Search protein database using a protein queryWhat can my protein be? (What are the proteins similar to my sequence?) Find conserved domains in the query Find members of the protein family Blastx: Search protein database using a translated nucleotide query Find coding sequences in a piece of genomic DNA Protein sequences: more conserved! Tblastn: Search translated nucleotide database using a protein query Find similar proteins e.g. not annotated DNA sequences Tblastx: Search translated nucleotide database using a translated nucleotide query Find coding sequences in a piece of genomic DNA Note: translation is done in all 6 frames, and all of these are locally analyzed
Making Phylogenetic Trees Aims: to show evolutinary relationships of: genes, species (they evolve ALL the time) even computer softwares... A. Budd Ji et al., 2008
The new rRNA-based animal phylogeny (1995) Annelida Mollusca Platyhelm. (1995) Deuterostomia Protostomia rRNA: present in all creatures slow/fast evolving parts secondary structure Halanych et al, Aduotte et al., after 1995
The new rRNA-based Tree of Life Woese et al., after 1990
Description of Evolutionary Trees Internal nodes: hypothetical ancestral organisms/genes Terminal nodes: existing organisms/genes (Operational Taxonomy Units, OTU) Root: the last common ancestor of the entire group Sister groups: on either side of a split, with a common ancestor and no additional descendents Monophyletic group A group containing an ancestor and all of its descendants most recent common ancestor of the group Only the terminal nodes (OTU) exist right now They all evolve in time!
Description of Phylogenetic Trees Cladogram: branch length unscaled Phylogram: branch length=amounts of evol. divergence (horizontals doesn't count) Biology: sequences, organisms evolve all the time time Molecular clocks: Can the # of mutations of a DNA or protein sequence correlated with the time lapsed after the Last Common Ancestor? If the rate is cca constant: YES Different rates - different advantages GENE PHYLOGENY ≠ SPECIES PHYLOGENY too fast: saturation - back mutations - underestimation of the distance: A-B-C-D-E-A-... calibration: some dated fossil records are needed
Building Phylogenetic trees: A very simple example: One step-one mutation (a) MSATHC (b) ITATHC (c) ITAGHC (d) LTAAHC Mutations (a)<>(b)<>(c) Mutations (a)<>(d)<>(b) A "rooted" tree: an assumption for the commn ancestor source: Gene Cloning
Building Phylogenetic Trees Distance-based methods "Distance" of two sequences: "metric" in mathematics (4 axioms) several ones: euclidean... non-euclidean... Green: Euclidean distance Others: "Manhattan" distance A distance between two strings: Leveshtein distance d(IJ) the minimum number of edits (insertion, deletion, or substitution of a single character) needed to transform one string into the other Its calculation: intuitively easy, practically complicated (dynamic programing) Example: d(kitten/6/, sitting/7/)=3 Leveshtein distance: a special case of the score: gaps/mismatches/missing ends: 1; matches: 0 kitten → sitten ( 's' for 'k') sitten → sittin ('i' for 'e') sittin → sitting (insert 'g').
The Building of a Tree: the UPGMA method Sequential clustering method Starting point: distance matrix [d(IJ)] of the sequences (triangular m.) -grouping the pairwise distances corresponding to the pairs of strings with the with the smallest pairwise distance: -node "placed" at the half of the distance -creating a reduced matrix: these two are "joined" distances are re-calculated:
FIRST DO AN ALIGNMENT Pre-processing: Eliminate obiously wrong sequence regions e.g. "forgotten" introns when investigating proteins, AG|GT...AG|G e.g. obvious sequencing errors e.g. bad sequences Correct the distances for multiple substitutions (homoplasy) (measure of distances: change/site, 0…1) Building a tree use a program... Rooting a tree Rooting induces a directionality >"automatically" done by several software: midpoint rooting (root on the branch on equal distance from the most distant OTU-s) >"by hand" by choosing an "outgroup": a homologous, but "quite far" sequence root: on the branch between the tree and the outgroup Problems might still appear!...
Sample3 A A A A A A A A A A A A A A A A A A A A Bootstrap values: how roboust is our tree Randomizing a bit the sequences.... does the tree/subtree persist? Larger% - better Sample1 Sample2 Resample datasets (with replacement) Taxa1 Taxa2 Taxa3 Taxa4 G G C C T A A T A A A A T A A T G G C C A A A A C C G G A A T T A A T T G G C C A A A A C C G G T A A T G G C C A A T T 1 3 2 4 Sample99 Sample100 The result: G G C C T A A T G G C C A A A A T A A T T A A T T A A T A A A A C C G G C C G G ... 60% T. Larsson