1 / 35

Whole Genome Alignment

Whole Genome Alignment. MUMmer and Alignment October 2 nd , 2007 Adam M Phillippy amp@cs.umd.edu. CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA. CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA. Goal of WGA.

Download Presentation

Whole Genome Alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Whole Genome Alignment MUMmer and Alignment October 2nd, 2007 Adam M Phillippy amp@cs.umd.edu

  2. CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA CCGGTAGGCTATTAAACGGGGTGAGGAGCGTTGGCATAGCA Goal of WGA • For two genomes, A and B, find a mapping from each position in A to its corresponding position in B 41 bp genome

  3. CCGGTAGGATATTAAACGGGGTGAGGAGCGTTGGCATAGCA CCGCTAGGCTATTAAAACCCCGGAGGAG....GGCTGAGCA Not so fast... • Genome A may have insertions, deletions, translocations, inversions, duplications or SNPs with respect to B (sometimes all of the above)

  4. Visualization • How can we visualize alignments? • With an identity plot • XY plot • Let x = position in genome A • Let y = %similarity of Ax to corresponding position in B • Plot the identity function • This can reveal islands of conservation, e.g. exons

  5. Identity plot example

  6. WGA visualization • How can we visualize whole genome alignments? • With an alignment dot plot • N x M matrix • Let i = position in genome A • Let j = position in genome B • Fill cell (i,j) if Ai shows similarity to Bj • A perfect alignment between A and B would completely fill the positive diagonal

  7. B A B A Translocation Inversion Insertion http://mummer.sourceforge.net/manual/AlignmentTypes.pdf

  8. Global vs. Local • Global pairwise alignment ...AAGCTTGGCTTAGCTGCTAGGGTAGGCTTGGG... ...AAGCTGGGCTTAGTTGCTAG..TAGGCTTTGG... ^ ^ ^^ ^ • Whole genome alignment • Often impossible to represent as a global alignment • We will assume a set of local alignments • This works great for draft sequence

  9. global ok global no way Global vs. Local

  10. Alignment tools • Whole genome alignment • MUMmer* (nucmer) • Developed, supported and available at TIGR • LAGAN*, AVID • VISTA identity plots • Multiple genome alignment • MGA, MLAGAN*, DIALIGN, MAVID • Multiple alignment • Muscle*, ClustalW* • Local sequence alignment • BLAST*, FASTA, Vmatch *open source

  11. MUMmer • Maximal Unique Matcher (MUM) • match • exact match of a minimum length • maximal • cannot be extended in either direction without a mismatch • unique • occurs only once in both sequences (MUM) • occurs only once in a single sequence (MAM) • occurs one or more times in either sequence (MEM)

  12. Fee Fi Fo Fum,is it a MAM, MEM or MUM? MUM : maximal unique match MAM : maximal almost-unique match MEM : maximal exact match R Q

  13. Seed and extend • How can we make MUMs BIGGER? • Find MUMs using a suffix tree • Cluster MUMs using size, gap and distance parameters • Extend clusters using modified Smith-Waterman algorithm

  14. Seed and extend FIND all MUMs CLUSTER consistent MUMs EXTEND alignments R Q

  15. atgtgtgtc$ $ c$ t gt 10 9 1 gt c$ c$ gt 8 7 c$ gtc$ c$ gt 5 3 6 c$ gtc$ 4 2 Suffix Tree for atgtgtgtc$ Drawing credit: Art Delcher

  16. R m1 m2 m3 A B Q C Clustering cluster length = ∑mi gap distance = C indel factor = |B – A| / B or |B – A|

  17. break point = B B A break length = A Extending R score ~70% Q

  18. B ^ T T G C A G ^ 1 2 3* 5 6 4 T 1 0 1 2 3 4 5 4 G 2 1 1 1 2 3 A 2 C 3* 2 2 2 1 3 2 T 2 3* 2 3* 3 4 2 G 5 4 3 3 3* 2 Banded alignment 0

  19. mummer exact matching nucmer DNA multi-FastA input whole genome alignment promer DNA multi-FastA input whole genome alignment run-mummer1 FastA input global alignment run-mummer3 FastA input w/ draft whole genome alignment exact-tandems FastA input exact tandem repeats NUCmer / PROmer utilities mapview alignment plotter draft sequence mapping mummerplot alignment visualization show-coords alignment summary delta-filter alignment filter show-aligns pairwise alignments System utilities gnuplot xfig MUMmer suite

  20. mummer • Primary uses • exact matching (seeding) • dot plotting • Pros • very efficient O(n) time and space • ~17 bytes per bp of reference sequence • E. coli K12 vs. E. coli O157:H7 (~5Mbp each) • 17 seconds using 77 MB RAM • multi-FastA input • Cons • exact matches only

  21. nucmer • Primary uses • whole genome alignment and analysis • draft sequence alignment • Pros • multi-FastA inputs • well suited for genome and contig mapping • convenient helper utilities • show-coords, delta-filter, mummerplot • Cons • low sensitivity (w\ default parameters) with respect to BLAST

  22. WGA example • Yersina pestis CO92 vs. Yersina pestis KIM • High nucleotide similarity, 99.86% • Two strains of the same species • Extensive genome shuffling • Global alignment will not work • Highly repetitive • Will confuse local alignment (e.g. BLAST)

  23. nucmer –maxmatch CO92.fasta KIM.fasta -maxmatch Find maximal exact matches (MEMs) delta-filter –m out.delta > out.filter.m -m Many-to-many mapping -1 One-to-one mapping show-coords -r out.delta.m > out.coords -r Sort alignments by reference position mummerplot --large --fat out.delta.m --large Large plot --fat Nice layout for multi-fasta files --x11 Default, draw using x11 (--postscript, --png) *requires gnuplot COMMANDwhole genome alignment

  24. show-coords output • [S1] start of the alignment region in the reference sequence • [E1] end of the alignment region in the reference sequence • [S2] start of the alignment region in the query sequence • [E2] end of the alignment region in the query sequence • [LEN 1] length of the alignment region in the reference sequence • [LEN 2] length of the alignment region in the query sequence • [% IDY] percent identity of the alignment • [% SIM] percent similarity of the alignment • [% STP] percent of stop codons in the alignment • [LEN R] length of the reference sequence • [LEN Q] length of the query sequence • [COV R] percent alignment coverage in the reference sequence • [COV Q] percent alignment coverage in the query sequence • [FRM] reading frame for the reference and query sequence alignments respectively • [TAGS] the reference and query FastA IDs respectively. • All output coordinates and lengths are relative to the forward strand

  25. Comparative assembly • Assembly • Orient and place sequencing reads • Using overlaps and mate-pair information • Scaffolding • Order and orient draft contigs • Using mate-pair information and experimental validation • Comparative assembly and scaffolding • Orient and place reads and contigs • Using a reference genome and alignment mapping • e.g. AMOScmp (nucmer)

  26. Comparative assembly mate-pairs physical map reference genome homology map

  27. Comparative assemblycaveats Finished Un-finished A B A B A B

  28. nucmer –maxmatch REF.fasta QRY.fasta -maxmatch Find maximal exact matches (MEMs) REF Reference sequence (genome) QRY Query sequence to be mapped (reads, contigs) delta-filter –q out.delta > out.delta.q -q Best one-to-one mapping for each query show-coords –rcl out.delta.q > out.coords -r Sort alignments by reference -c Display alignment coverage percentage -l Display sequence length COMMANDnucmer read/contig mapping

  29. Arachne vs. CA Drosophila virilis Multiple CA contigs (Y) mapping to a single Arachne contig (X) 9kb insertion 5kb translocation

  30. Gene reassembly* • nucmer -maxmatch genes.fasta reads.fasta • delta-filter -q out.delta.q • show-coords -THqcl out.delta.q > out.coords • Define matching criteria • %identity, %coverage, max gap size • Assemble matching reads • AMOS, minimus, hawkeye

  31. References • Documentation • http://mummer.sourceforge.net • publication listing • http://mummer.sourceforge.net/manual • thorough documentation • http://mummer.sourceforge.net/examples • walkthroughs • Email • mummer-help (at) lists.sourceforge.net • mummer-users (at) lists.sourceforge.net

  32. Acknowledgements Art Delcher Steven Salzberg Mihai Pop Mike Schatz Stefan Kurtz

More Related