130 likes | 263 Views
MCB3895-004 Lecture # 14 Oct 21/14. Genome alignment. Genome alignment. Basic problem: align long DNA sequences to each other Difficult to do computationally efficiently Note: different from BLAST, which is designed to find homologs between a query sequence and a database
E N D
MCB3895-004 Lecture #14Oct 21/14 Genome alignment
Genome alignment • Basic problem: align long DNA sequences to each other • Difficult to do computationally efficiently • Note: different from BLAST, which is designed to find homologs between a query sequence and a database • Poorly adapted to long sequences
Basic approach • Find small regions between two genomes that are (near-)perfectly matched • Extend these seed alignments in both directions until alignment quality drops • Join overlapping alignments • Calculate summary stats
MUMmer • The first widely-used genome alignment program, still current • Delcher et al. 2002 Nuc Acids Res 30:2478-2483 • Kurtz et al. 2004 Genome Biol. 5:R12 • http://mummer.sourceforge.net/
MUMmer • nucmer: nucleotide-nucleotide comparisons • promer: genome comparisons using translated protein sequences • Useful for comparing evolutionarily more divergent genomes • Protein sequences are more conserved than nucleotide sequences
Step #1: perform alignment • Command: $ nucmer <ref genome> <query genome> • e.g., $ nucmer GCF_000006765.1_ASM676v1_genomic.fna GCF_000152525.1_ASM15252v1_genomic.fna • produces out.delta output file by default (can be changed using -p flag)
Step #2: restrict output to non-overlapping hits • Command: $ delta-filter -1 <input .delta file> > <output file> • e.g., $ delta-filter -1 out.delta > filtered.delta
Step #3: create table of aligned regions • Command: $ show-cords <filtered delta file> > <output file> • e.g., $ show-cords filtered.delta > filtered.coords [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS] ===================================================================================== 1 51709 | 4285564 4233865 | 51709 51700 | 99.42 | NC_002516.2 NZ_CH482383.1 54035 54174 | 1574855 1574716 | 140 140 | 92.86 | NC_002516.2 NZ_CH482383.1 54038 54839 | 4232157 4231356 | 802 802 | 99.38 | NC_002516.2 NZ_CH482383.1 54114 54838 | 4229635 4228911 | 725 725 | 99.04 | NC_002516.2 NZ_CH482383.1 56087 68711 | 4228361 4215746 | 12625 12616 | 99.10 | NC_002516.2 NZ_CH482383.1 56086 56366 | 4232157 4231877 | 281 281 | 92.17 | NC_002516.2 NZ_CH482383.1 69537 88265 | 4215638 4196909 | 18729 18730 | 99.70 | NC_002516.2 NZ_CH482383.1 89162 155482 | 4196024 4129674 | 66321 66351 | 99.51 | NC_002516.2 NZ_CH482383.1 155580 245829 | 4129573 4039375 | 90250 90199 | 99.53 | NC_002516.2 NZ_CH482383.1 245830 260263 | 4038886 4024462 | 14434 14425 | 99.52 | NC_002516.2 NZ_CH482383.1 260434 262436 | 4024296 4022294 | 2003 2003 | 98.55 | NC_002516.2 NZ_CH482383.1 262478 264652 | 4022300 4020126 | 2175 2175 | 98.80 | NC_002516.2 NZ_CH482383.1 264639 264753 | 4020242 4020126 | 115 117 | 94.87 | NC_002516.2 NZ_CH482383.1 264841 286245 | 4020242 3998837 | 21405 21406 | 99.59 | NC_002516.2 NZ_CH482383.1 286332 291173 | 3998245 3993402 | 4842 4844 | 96.17 | NC_002516.2 NZ_CH482383.1 291398 370537 | 3993171 3914014 | 79140 79158 | 99.40 | NC_002516.2 NZ_CH482383.1 294592 297831 | 506430 509669 | 3240 3240 | 99.14 | NC_002516.2 NZ_CH482383.1
Step #4: create table of SNPs and indels in aligned regions • SNP: "single nucleotide polymorphism" • indel: "insertion or deletion" • Command: $ show-snps <filtered delta file> > <output file> • e.g., $ show-snpsfiltered.delta > filtered.snps [P1] [SUB] [P2] | [BUFF] [DIST] | [R] [Q] | [FRM] [TAGS] ================================================================================ 1553 C T 4284012 | 72 1553 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 1625 T C 4283940 | 72 1625 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 1798 . C 4283766 | 173 1798 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 2412 A G 4283152 | 597 2412 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3009 A C 4282555 | 399 3009 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3408 C T 4282156 | 30 3408 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3438 C A 4282126 | 18 3438 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3456 C T 4282108 | 18 3456 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3480 G A 4282084 | 15 3480 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3495 G A 4282069 | 12 3495 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3507 C T 4282057 | 12 3507 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3567 G C 4281997 | 60 3567 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3684 C T 4281880 | 108 3684 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3792 A C 4281772 | 81 3792 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1 3873 T C 4281691 | 18 3873 | 0 0 | 1 -1 NC_002516.2 NZ_CH482383.1
Applications • MUMmer is great for speed and easy-to-parse output tables • Not so much for graphical outputs • Mauve is another genome aligner that I prefer for graphics • Especially good for 1-to-many alignments • Darling et al. 2004 Genome Res. 14:1394-1403 • Darling et al. 2010 PLoSONE 5:e11147 • http://gel.ahabs.wisc.edu/mauve/
Mauve example • Collinear gene blocks linked between genomes • Darling et al. 2004 Genome Res. 14:1394-1403
Assignment #6 • Align the P. aeruginosa C3719 genome to PAO1 using nucmer • Write a Perl script that calculates the proportion of the C3719 genome that aligns to PAO1 • Write a second Perl script that separately counts the number of SNPs and indels between the C3719 and PAO1 genomes.
Also: To be done right now! • Register for a PacBio SMRT portal account on the Biotech server at this link: http://bbcsrv3.biotech.uconn.edu:8080/smrtportal/