mauve methods for multiple genome alignment. Aaron Darling UPC, Barcelona 25/07/2005. Living vicariously through the wild XXX lives of bacteria. Aaron Darling UPC, Barcelona 25/07/2005. 1960’s “sexual revolution” in the USA.
Mauve: methods for multiple genome alignment. Aaron Darling, UPC, Barcelona, 25/07/2005
Living vicariously through the wild XXX lives of bacteria Aaron Darling UPC, Barcelona 25/07/2005
1960’s “sexual revolution” in the USA • Dr. Alfred Kinsey sparks the revolution in the early ’50s with the “Kinsey reports”
The microbial “sexual revolution”
• Early ’50s: Joshua Lederberg and Bill Hayes discover bacterial sex! DNA can be transferred between bacteria during conjugation
• Lederberg wins Nobel Prize
Bacteria lead promiscuous lives Conjugation – transfer of genetic content from an F+ donor Adsorption – transfer of genetic content mediated by a virus. Phage transduction
Bacteria lead promiscuous lives The most promiscuous of all: • Transformation – acquisition of floating DNA from the environment Occurs frequently during high stress
Genome evolution is complicated
Simple mutations
• SNPs: changes in a single nucleotide
• Indels: small insertions or deletions of DNA
Large-scale mutations
• Inversion: reversal of a region of DNA
• Horizontal Transfer: Transformation, Transduction and Conjugation (bacterial sex)
• Homologous recombination – a special case of HT
• Gene Duplication/Loss
Problem: Genetic elements may not have conserved order and orientation in other genomes
Bacterial comparative genomics: Why bother?
• Global biomass is largely microbial
• Microbes can be pathogenic
• Possible renewable energy source from photosynthetic bacteria
• Human gut contains thousands of bacterial species
• Unique genetic content of gut microbes is over 100 times the human genome!
Case Study: 9 Enterobacteria
Genome sequences for these 9 Enterobacteria have been published recently. Their phenotypes are diverse, pathogenic and non-pathogenic; some can kill. Why do they differ?
Pairwise global alignment
Scales O(n^2), where n is the sequence length. Multiple alignment scales O(n^m), where m is the number of sequences. Problem: too time-consuming, doesn’t consider rearrangements
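To make the O(n^2) cost concrete, here is a minimal sketch of pairwise global alignment scoring by dynamic programming (the classic Needleman-Wunsch recurrence). The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not parameters taken from the talk.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b.

    Fills an (len(a)+1) x (len(b)+1) table one row at a time, so the
    running time is proportional to len(a) * len(b).
    Scoring values are illustrative assumptions.
    """
    # Row 0: aligning a prefix of b against nothing costs one gap per base.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


print(global_alignment_score("ACCATGGTAT", "ACCCTGGTAT"))
```

Every cell of the table is visited once, which is why two genomes of several million bases each are out of reach for the unrestricted method.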
The anchored alignment idea
Restrict the search to parts of the DP matrix that are very likely to be part of the optimal path. Each diagonal ‘band’ is a high-scoring local alignment of the sequences. The highest-scoring chain of local alignments becomes the set of anchors
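Choosing the highest-scoring chain of local alignments can be sketched as a small dynamic program over the candidate anchors. The anchor tuple layout `(start_a, start_b, length, score)` and the function name are assumptions made for illustration; the quadratic loop below is the simplest formulation, not necessarily what any of the tools discussed here implement.

```python
def best_chain(anchors):
    """Return (total_score, chain) for the highest-scoring collinear chain.

    anchors: list of (start_a, start_b, length, score) tuples (hypothetical layout).
    An anchor may follow another only if it starts at or after the end of
    the previous one in BOTH sequences.
    """
    anchors = sorted(anchors)               # order by position in sequence a
    best = [a[3] for a in anchors]          # best chain score ending at anchor i
    back = [None] * len(anchors)            # predecessor of anchor i in that chain
    for i, (sa, sb, _, score) in enumerate(anchors):
        for j in range(i):
            ja, jb, jlen, _ = anchors[j]
            if ja + jlen <= sa and jb + jlen <= sb and best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    if not anchors:
        return 0, []
    # Walk back from the best-scoring chain end.
    i = max(range(len(anchors)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:
        chain.append(anchors[i])
        i = back[i]
    return total, chain[::-1]


candidates = [(0, 0, 10, 10), (12, 50, 5, 5), (20, 15, 10, 10), (35, 30, 8, 8)]
print(best_chain(candidates))
```

In the example, the anchor at `(12, 50)` is off the main diagonal and is left out of the chain; the remaining three anchors bound the regions of the DP matrix that still need to be searched.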
Anchored genome alignment tools
• Multi-LAGAN – align two or more heavily diverged genomes, assuming no differential gene content and no rearrangements (Brudno et al. 2003)
• MAVID – like Multi-LAGAN, but also infer the branching structure of the organisms’ phylogeny (Bray et al. 2004)
• Shuffle-LAGAN – align two genomes that may contain repeats and rearrangements, no differential gene content (Brudno et al. 2003)
• Mauve – align two or more closely related genomes that have rearrangements, differential content in conserved order and orientation (Darling et al. 2004)
• Mulan – align two or more closely related genomes, possibly with differential gene content (Ovcharenko et al. 2005)
• M-GCAT – align two or more closely related genomes with rearrangements and other changes (Treangen et al. 2005)
The two-component architecture of Mauve
We use each language for what it does best: C++ for efficient algorithm implementation, Java for a cross-platform GUI
• C++ command-line aligner
• Java 1.4+ interactive visualization
• Input: GenBank or FastA sequences; output: alignments
• Windows, Linux, Mac OS X
• 100% Free/Open Source Software
The Mauve alignment approach Step 1. Compute local multi-alignments Each set of linked boxes is a high-scoring local alignment. Boxes below a genome’s center line are in the reverse-complement orientation (inverted)
The Mauve alignment approach Need to filter out matches that arise due to random sequence similarity (or paralogy)
The Mauve alignment approach Use breakpoint analysis to identify Locally Collinear Blocks – groups of anchors with conserved order and orientation
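For two genomes, grouping anchors into Locally Collinear Blocks can be sketched as follows. The anchor layout `(pos1, pos2, strand)` and the function name are hypothetical, and the multi-genome case handled by Mauve is more involved; this only illustrates what "conserved order and orientation" means.

```python
def collinear_blocks(anchors):
    """Group anchors into locally collinear blocks (two-genome sketch).

    anchors: distinct (pos1, pos2, strand) tuples, strand is +1 or -1
    (hypothetical layout). Two anchors that are neighbours in genome 1 stay
    in the same block when they share a strand and are also neighbours in
    genome 2, in the direction given by that strand. Anything else is a
    breakpoint and starts a new block.
    """
    if not anchors:
        return []
    by_genome1 = sorted(anchors, key=lambda a: a[0])
    rank2 = {a: r for r, a in enumerate(sorted(anchors, key=lambda a: a[1]))}
    blocks = [[by_genome1[0]]]
    for prev, cur in zip(by_genome1, by_genome1[1:]):
        same_strand = prev[2] == cur[2]
        adjacent = rank2[cur] - rank2[prev] == cur[2]
        if same_strand and adjacent:
            blocks[-1].append(cur)
        else:
            blocks.append([cur])    # breakpoint between prev and cur
    return blocks


example = [(10, 100, 1), (20, 110, 1), (30, 300, -1), (40, 290, -1), (50, 400, 1)]
for block in collinear_blocks(example):
    print(block)
```

The example contains one inverted segment, so the five anchors fall into three blocks: two forward anchors, two inverted anchors, and a final forward anchor.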
Greedy Breakpoint Elimination Remove matches caused by random similarity: Block 3 (yellow) is small and has a weight less than w so it is removed.
Greedy Breakpoint Elimination A breakpoint is eliminated: - When block 3 is removed, blocks 2 and 4 coalesce. - Final step: align grey regions progressively (with MUSCLE or Clustal-W)
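The greedy step described on the last two slides can be sketched for two genomes as: recompute the blocks, drop the lightest block if its weight is below w, and repeat. The anchor layout `(pos1, pos2, strand, length)`, the use of summed anchor length as the block weight, and all function names are assumptions for illustration only.

```python
def collinear_blocks(anchors):
    """Split anchors (pos1, pos2, strand, length) into collinear blocks."""
    if not anchors:
        return []
    by1 = sorted(anchors, key=lambda a: a[0])
    rank2 = {a: r for r, a in enumerate(sorted(anchors, key=lambda a: a[1]))}
    blocks = [[by1[0]]]
    for prev, cur in zip(by1, by1[1:]):
        if prev[2] == cur[2] and rank2[cur] - rank2[prev] == cur[2]:
            blocks[-1].append(cur)
        else:
            blocks.append([cur])
    return blocks


def block_weight(block):
    # Assumed weight: total length of the anchors in the block.
    return sum(a[3] for a in block)


def greedy_breakpoint_elimination(anchors, min_weight):
    """Repeatedly remove the lightest block while it weighs less than min_weight.

    Blocks are recomputed after every removal, so neighbours of a removed
    block coalesce automatically when they are collinear with each other.
    """
    anchors = list(anchors)
    while True:
        blocks = collinear_blocks(anchors)
        if not blocks:
            return blocks
        lightest = min(blocks, key=block_weight)
        if block_weight(lightest) >= min_weight:
            return blocks
        anchors = [a for a in anchors if a not in lightest]


anchors = [
    (0, 0, 1, 1000),
    (2000, 2000, 1, 1500),
    (4000, 9000, 1, 50),     # small stray match, likely random similarity
    (5000, 4000, 1, 1200),
]
print(len(greedy_breakpoint_elimination(anchors, min_weight=100)))
```

With the stray 50 bp match removed, the blocks on either side of it merge into a single block, which mirrors blocks 2 and 4 coalescing once block 3 is removed.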
Results: Homology structure of 9 Enterobacteria An alignment
Alignment of 9 Enterobacteria
Genomes: E. coli K12 MG1655; E. coli O157:H7 EDL933; E. coli O157:H7 VT-2 Sakai; E. coli CFT073; S. flexneri 2A; S. flexneri 2A 2457T; S. enterica Typhimurium; S. enterica Typhi CT18; S. enterica Typhi Ty2
• 45 locally collinear blocks (LCBs)
• 2.86 Mbp of backbone sequence – only 58% of average genome size
• Backbone is any region shared among all genomes
• Diverse phenotypes caused by horizontal gene transfer
• 3 hours compute time on a 1.6 GHz Linux PC
Evaluating alignment quality
If we had a ‘correct’ alignment, we could compare the calculated alignments to it. Problem: how can we know the correct alignment? Collecting ancestral DNA sequences is often impossible. Evolutionary changes do not become fixed in populations quickly enough to be observable. Fundamental limitation of the science
Evaluating alignment accuracy Let’s play God! • Design a model of genome evolution based on the types of changes observed in bacteria, • Simulate the evolution of a set of genomes according to the model, • Track the conserved nucleotides during the simulation • End up with a set of evolved genomes, and a correct alignment
The simple genome evolver
Supports:
- Nucleotide substitution, HKY model (via SeqGen)
- Indels, Poisson distributed around 3 bp
- Small H.T., length exponentially distributed, mean 200
- Large H.T., uniformly distributed 10kbp-60kbp
- Inversions, length exponentially distributed around 50kbp
Doesn’t support:
- Duplication (gene gain/loss)
- An uneven distribution of indel, H.T. or inversion sites
Input and output of the evolver
Given ancestral sequence: ACCATGGTAT
And tree: [Figure: guide tree with internal nodes A, B, C and leaves 1 to 4; branch lengths range from 0.1 to 0.3]
Internal node sequences:
A: ACCATGGTAT
B: ACCCTGGTAT
C: ACCCTGCTAT
sgEvolver would produce a sequence alignment:
1: ACCCTGGTAT
2: AGCCTGCTAT
3: CCCCTGGTAA
4: ATTCTGGTAT
Indels, H.T., and inversions are applied in a similar manner
Scoring alignment accuracy
Evolver outputs the correct alignment; Mauve calculates an alignment:
A: ACCCTGGTATA-CCCTGGTAT    A: ACCCTGGTATACCCTGGTAT
B: ACCCTGCT----CCCTGCTAT    B: ACCCTGCT---CCCTGCTAT
C: ACCCTGGTAT--CCCTGGTAT    C: ACCCTGGTAT-CCCTGGTAT
D: AGCCTGCTAT--GCCTGCTAT    D: AGCCTGCTAT-GCCTGCTAT
E: CCCCTGGTA-ACCCCTGGTAA    E: CCCCTGGTAACCCCTGGTAA
F: ATTCTGGTA---TTCTGGTAT    F: ATTCTGGTA--TTCTGGTAT
Each time Mauve aligns a pair of sequence positions that are also aligned in the correct alignment, it gets a point! It also gets a point for correctly aligning a character to a gap.
Accuracy = total points / total possible
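The scoring rule can be sketched as a comparison of aligned position pairs. This is one plausible reading of the rule on the slide, with hypothetical function names; the exact bookkeeping used in the actual experiments is not given here.

```python
def aligned_pairs(rows):
    """Collect what an alignment asserts, as a set of tuples.

    rows: equal-length gapped strings, one per sequence ('-' is a gap).
    (i, p, j, q)    : residue p of sequence i is aligned to residue q of sequence j
    (i, p, j, None) : residue p of sequence i is aligned to a gap in sequence j
    """
    pairs = set()
    next_index = [0] * len(rows)        # next ungapped position per sequence
    for column in zip(*rows):
        idx = []
        for s, ch in enumerate(column):
            if ch == "-":
                idx.append(None)
            else:
                idx.append(next_index[s])
                next_index[s] += 1
        for i, p in enumerate(idx):
            if p is None:
                continue
            for j, q in enumerate(idx):
                if q is None:
                    pairs.add((i, p, j, None))
                elif i < j:
                    pairs.add((i, p, j, q))
    return pairs


def accuracy(correct, computed):
    """Fraction of the correct alignment's pairs that the computed one recovers.

    Assumes the correct alignment contains at least one pair.
    """
    truth = aligned_pairs(correct)
    return len(truth & aligned_pairs(computed)) / len(truth)


correct = ["ACGT", "A-GT"]
computed = ["ACGT", "AG-T"]
print(accuracy(correct, computed))
```

In the toy example the computed alignment places the gap one column too far to the right, so it recovers two of the four correct pairs.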
Experiments
1. No large scale events
- nucleotide substitutions and indels
- Mauve vs. Multi-LAGAN
2. Inversions and nucleotide substitutions
- pairwise, Mauve vs. Shuffle-LAGAN
3. Similar to the 9 E. coli we are studying
Simulation parameters
For multiple alignments:
• 9 taxa
• 1 MB sequence length
• Data evolved according to a midpoint-rooted version of the Enterobacterial guide tree
• Scores averaged over 3 simulations for each set of parameters
For pairwise alignments:
• Same but with 2 taxa
No genome rearrangement or HT • Mauve is less sensitive to nucleotide substitution than Multi-LAGAN. • The more genomes aligned, the less sensitive Mauve becomes. • Using inexact seed alignments helps Mauve tremendously
2. Inversions and NT substitutions
• Mauve clearly outperforms Shuffle-LAGAN when the nucleotide substitution rates are tolerably low
• Surprise! Shuffle-LAGAN improves with NT substitution.
[Plot legend: Mauve, Shuffle-LAGAN]
4. Enterobacteria-like data
• Small HT events have little effect compared to large HT events
• When scored on regions conserved in all 9 taxa, accuracy is always > 98%
[Plot legend: Mauve]
Using the alignment Can we infer a species phylogeny on the conserved regions? NO! Can anybody guess why? Horizontal transfer has replaced genes with a homologous copy having a different phylogenetic history
Acknowledgements The giants upon whose shoulders I stand: • Paul Infield-Harm (Java wizard) • Bob Mau (collaborator) • Nicole Perna (my Ph.D. advisor) • Mark Craven • Carla Kuiken (LANL collaborator, parallelization/scalability) The people who brought me here: • Todd Treangen • Xavier Messeguer Countless Mauve users who have sent in bug reports Google Image search and the forces of Nature
Finding Matches with Sorted Mer Lists
A Sorted Mer List (SML) is a data structure that allows us to simultaneously find matches on forward and reverse strands of DNA
SML construction using 3-mers for the sequence:
Forward: ATTCTATTCGGT
Reverse: TAAGATAAGCCA

Before sorting:
Position  Mer  Strand
1         AAT  -
2         GAA  -
3         AGA  -
4         CTA  +
5         ATA  -
6         AAT  -
7         GAA  -
8         CGA  -
9         CCG  -
10        ACC  -

After sorting:
Position  Mer  Strand
1         AAT  -
6         AAT  -
10        ACC  -
3         AGA  -
5         ATA  -
9         CCG  -
8         CGA  -
4         CTA  +
7         GAA  -
2         GAA  -
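The construction in the tables above can be reproduced with a short sketch. The function names are hypothetical; the rule it applies (keep whichever of a k-mer and its reverse complement sorts first, and record "+" or "-" accordingly) is inferred from the table entries rather than stated on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(mer):
    return mer.translate(COMPLEMENT)[::-1]


def sorted_mer_list(seq, k):
    """Build a sorted mer list as (position, mer, strand) tuples.

    Positions are 1-based, as on the slide. For each k-mer the
    lexicographically smaller of the forward mer and its reverse
    complement is stored; strand is '+' if the forward mer was kept,
    '-' otherwise.
    """
    entries = []
    for i in range(len(seq) - k + 1):
        forward = seq[i:i + k]
        reverse = reverse_complement(forward)
        if forward <= reverse:
            entries.append((i + 1, forward, "+"))
        else:
            entries.append((i + 1, reverse, "-"))
    # Stable sort by mer only; equal mers keep their positional order.
    entries.sort(key=lambda e: e[1])
    return entries


for position, mer, strand in sorted_mer_list("ATTCTATTCGGT", 3):
    print(position, mer, strand)
```

The order of entries with equal mers (the two GAA rows, for example) is arbitrary, so this sketch lists position 2 before position 7 while the slide shows them the other way round.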
Match Seeds
Using one SML for each sequence in a comparison, seed matches are found.
Finding seed matches in the sequences:
1: ATTCTATTCGGT
2: ATATTCGTTCAA
3: GGTATTCGGTA

Sequence 1              Sequence 2              Sequence 3
Position  Mer  Strand   Position  Mer  Strand   Position  Mer  Strand
1         AAT  -        7         AAC  -        4         AAT  -
6         AAT  -        3         AAT  -        1         ACC  -
10        ACC  -        6         ACG  -        8         ACC  -
3         AGA  -        1         ATA  +        3         ATA  -
5         ATA  -        2         ATA  -        7         CCG  -
9         CCG  -        10        CAA  +        6         CGA  -
8         CGA  -        5         CGA  -        5         GAA  -
4         CTA  +        4         GAA  -        2         GTA  +
7         GAA  -        8         GAA  -        9         GTA  +
2         GAA  -        9         TCA  +

Stepping through the mers in sorted order:
- AAC: No matches with other sequences
- AAT: Mer seeds are not unique
- ACC: Mer seeds are not unique
- ACG: No matches with other sequences
- AGA: No matches with other sequences
- ATA: Mer seeds are not unique
- CAA: No matches with other sequences
- CCG: Unique matching mers in sequences 1 and 3: Seed a subset match.
Match Seed Extension
If a unique seed is not part of a known match, it is extended into the surrounding region until a mismatch occurs.
Each MUM can be written in the form: < Length, Start1, … , Startn >
Currently known MUMs: <none>
The extension for match seed < 3, 9, 0, 7 > in the sequences ATTCTATTCGGT, ATATTCGTTCAA, GGTATTCGGTA:
[Figure legend: Initial Seed, Extended Subset Match]
The extended subset match would be: < 8, 5, 0, 3 >
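Seed extension in the < Length, Start1, … , Startn > notation can be sketched as below. The function name is hypothetical, a start of 0 is taken to mean "sequence not part of the match" as in the seed above, and only same-orientation (forward strand) extension is handled, which is a simplifying assumption.

```python
def extend_seed(seqs, seed):
    """Extend a seed left and right until any participating sequence disagrees.

    seqs: list of sequences.
    seed: (length, start_1, ..., start_n) with 1-based starts; 0 marks a
          sequence that does not take part in the match.
    Returns the extended match in the same notation.
    """
    length, *starts = seed
    members = [i for i, s in enumerate(starts) if s > 0]

    def column_agrees(offset):
        """True if every member has the same base at seed start + offset."""
        bases = set()
        for i in members:
            p = starts[i] - 1 + offset          # 0-based index into seqs[i]
            if p < 0 or p >= len(seqs[i]):
                return False                    # ran off the end of a sequence
            bases.add(seqs[i][p])
        return len(bases) == 1

    left = 0
    while column_agrees(-(left + 1)):
        left += 1
    right = 0
    while column_agrees(length + right):
        right += 1
    new_starts = [s - left if s > 0 else 0 for s in starts]
    return (length + left + right, *new_starts)


sequences = ["ATTCTATTCGGT", "ATATTCGTTCAA", "GGTATTCGGTA"]
print(extend_seed(sequences, (3, 9, 0, 7)))
```

Applied to the seed < 3, 9, 0, 7 >, the sketch extends four bases to the left and one to the right, stopping at a mismatch on the left and at the end of sequence 1 on the right.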
Match Seeds
Using one SML for each sequence in a comparison, seed matches are found.
Finding seed matches in the sequences: ATTCTATTCGGT, ATATTCGTTCAA, GGTATTCGGTA
- CGA (positions 8, 5, and 6): Unique matching mers in sequences 1, 2, and 3: Seed a match.
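The seed-finding walk over the three SMLs can be approximated with a dictionary instead of a simultaneous scan of sorted lists. This is a sketch under assumptions: the function names are hypothetical, a mer is accepted only if it occurs at most once in every sequence and in at least two sequences (the rule the slide captions suggest), and strand orientation is ignored for brevity.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_mers(seq, k):
    """Yield (mer, 1-based position), using the smaller of each k-mer
    and its reverse complement, as in a sorted mer list."""
    for i in range(len(seq) - k + 1):
        forward = seq[i:i + k]
        reverse = forward.translate(COMPLEMENT)[::-1]
        yield min(forward, reverse), i + 1


def find_seeds(seqs, k):
    """Return {mer: (k, start_1, ..., start_n)} for mers that seed a match.

    A start of 0 means the mer is absent from that sequence, so the seed
    is a subset match.
    """
    positions = defaultdict(lambda: [[] for _ in seqs])
    for s, seq in enumerate(seqs):
        for mer, pos in canonical_mers(seq, k):
            positions[mer][s].append(pos)
    seeds = {}
    for mer, per_seq in sorted(positions.items()):
        if any(len(p) > 1 for p in per_seq):
            continue                            # mer seed is not unique
        if sum(1 for p in per_seq if p) < 2:
            continue                            # no match with other sequences
        seeds[mer] = (k, *[p[0] if p else 0 for p in per_seq])
    return seeds


sequences = ["ATTCTATTCGGT", "ATATTCGTTCAA", "GGTATTCGGTA"]
for mer, seed in find_seeds(sequences, 3).items():
    print(mer, seed)
```

For the three example sequences only CCG and CGA survive both filters, giving the subset match seed in sequences 1 and 3 and the three-way match seed used on the following slides.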
Match Seed Extension
Is the seed < 3, 8, 5, 6 > part of a currently known match? No.
Each MUM can be written in the form: < Length, Start1, … , Startn >
Currently known MUMs: < 8, 5, 0, 3 >
The extension for match seed < 3, 8, 5, 6 > in the sequences ATTCTATTCGGT, ATATTCGTTCAA, GGTATTCGGTA:
[Figure legend: Initial Seed, Extended Match, Subset Match Seed]
The extended match would be: < 6, 5, 2, 3 >
The subset match seed would be: < 7, 5, 0, 3 >
Subset Match Seed Extension
Is the seed < 7, 5, 0, 3 > part of a currently known match? Yes.
Currently known MUMs: < 8, 5, 0, 3 >, < 6, 5, 2, 3 >
Subset Linking
Because the subset match < 8, 5, 0, 3 > and the 3-way match < 6, 5, 2, 3 > share some sequence coverage, they are “linked” to each other.