340 likes | 612 Views
Multiple alignment method. Find homologous sequences Place the sequences in a relevant format (usually FASTA), and edit to similar length. Example
E N D
Multiple alignment method • Find homologous sequences • Place the sequences in a relevant format (usually FASTA), and edit to similar length. • Example • >ACTB cggcctccagatggtctgggagggcagttcagctgtggctgcgcatagcagacatacaacggacggtgggcccagacccaggctgtgtagacccagcccccccgccccgcagtgcctaggtcacccactaacgccccaggccttgtcttggctgggcgtgactgttaccctcaaaagcaggcagctccagggtaaaaggtgccctgccctgtagagcccaccttccttcccagggctgcggctgggtaggtttgtagccttcatcacgggccacctccagccactggaccgctggcccctgccctgtcctggggagtgtggtcctgcgacttctaagtggccgcaagcca • >AGPAT1 tctgcctctccacagtgcccttataccagccccctcccagatctcatctgaatgtgatccatatttcctggttctccccgactcaactgatgcgtgcctcccttaacctttgtgtctcacttgtttccacctgcacagctaagacccctcacttctctggggtaaggtggctcgggtctcacattgtcctgccactccccgccccaccttctcttctcagcacatcacgtgcctcagctcctggttcctaagacctttctttccacagatctcgaccgttatactcccacccacacataccagcaaagtcttatgtctcctgtcgggcttcacctatgggaacgtgccct • You can use a list of accession numbers if you already know that the sequences are of similar lengths.
Multiple alignment method • Find homologous sequences • Place the sequences in a relevant format (usually FASTA), and edit to similar length. • 3. Run a multiple alignment program • ClustalW- oldest, flexible, robust • ClustalΩ- latest version, scalable, more accurate with addition of HMM • MUSCLE - fast, good for finding short motifs in small datasets • PRALINE - includes secondary structure information • T-Coffee - good for small datasets of shorter sequences, has a module for checking input seqs against the PDB • COBALT - uses domain conservation information (from BLAST page) which by definition has some structural information
Clustal family • Clustal X Uses progressive global alignment algorithm Graphic user interface only • Clustal W and W2 Command line tool, W2 also had a web interface Has a parallelized version, to cope with larger datasets • ClustalΩ HMM searches added to algorithm Command line and web interface Scalable to very large datasets
Input of Data • 3 or more sequences are needed, nucleic or amino acids, several formats are accepted: eg FASTA text files • Remove any white space or empty lines • The analysis will fail if two sequences have the same name • Can copy/paste sequences into Clustal or upload a txt file
Sanger Sequencing • DNA is fragmented • Cloned to a plasmid vector • Cyclic sequencing reaction • Separation by electrophoresis • First generation: S35 isotope • Second generation: fluorescent tags • Third generation: automation
“Next” generation sequencing • Not Sanger based biochemistry • DNA is fragmented and sequenced directly • Much more chemistry, physics, and computer science • Characterized by • Parallel Sequencing • High Throughput • Cost • Generates billions of sequences per experiment, highly computational
Human Genome Project ENCODE project HapMap project SNP consortium Individual human genomes James Watson, Craig Venter, 3 asian gentlemen
Workflow (Illumina) Input sequence is cleaved in reads of 100-150 bp, ligated to generic adaptors and annealed to a slide using the adaptors. These nucleotides are fluorescently labeled, with the color corresponding to the base. They also have a terminator, so that only one base is added at a time. These nucleotides are fluorescently labeled, with the color corresponding to the base. They also have a terminator, so that only one base is added at a time.
Workflow (Illumina) • An image is taken of the slide. In each read location, there will be a fluorescent signal indicating the base that has been added. • The terminators are removed, allowing the next base to be added, and the fluorescent signal is removed, preventing the signal from contaminating the next image
Data output • Different technologies, though essentially similar • Differences in number, length and quality of the sequences • and in format of output • Making data handling across platforms and analysis a challenge
Step 4b. Read mapping • Align your sequences onto a reference genome or other region • (unless you are de novo sequencing a new species) • Determine the coordinates and annotations if known GAAACACAAAGTG GTCTAGGGAAGAAGG ATCTTGTAGG .. TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC .. Genome
Sequence Coverage Deep coverage = more accurate results Deep coverage => detecting variants Deep coverage = more expensive!
It’s all about the alignment Meyerson et al. Nature Reviews Genetics11, 685-696 (October 2010)
NGS pipeline Sequence Alignment/Map Binary Alignment/Map Variant Call Format from Dolled-Filhart et al, 2013
NGS File formats • Raw data from various vendors => various formats (FASTQ) • Different quality metrics (some more stringent than others) • As data analysis proceeds, end up with even more formats: • GenBank formats (SRA) • Alignments are in SAM/BAM • Genome Browser formats (wig, bed, gff, etc) • Variants in vcf files (SNPs, indels, etc)
FASTQ format: sequence + quality sequence identifier header 1 2 raw sequence sign 3 optional description 4 quality string c.f. S. Brown NYU @SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152 NTCTTTTTCTTTCCTCTTTTGCCAACTTCAGCTAAATAGGAGCTACACTGATTAGGCAGAAACTTGATTAACAGGGCTTAAGGTAACCTTGTTGTAGGCCGTTTTGTAGCACTCAAAGCAATTGGTACCTCAACTGCAAAAGTCCTTGGCCC +SRR350953.5 MENDEL_0047_FC62MN8AAXX:1:1:1646:938 length=152 +50000222C@@@@@22::::8888898989::::::<<<:<<<<<<:<<<<::<<:::::<<<<<:<:<<<IIIIIGFEEGGGGGGGII@IGDGBGGGGGGDDIIGIIEGIGG>GGGGGGDGGGGGIIHIIBIIIGIIIHIIIIGII
Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same c.f. S. Brown NYU MCL-SRR350952.1 99 chr13 28330526 29 76M = 28330636 183 NAAAGACACAGTTACATGAAGAACATACTCCTCTCTCAGACTGCCCAGGTTCAGTGATTCATTCAACAAACTTTAT #,())*3--.@@@@@@@@@@<<<::87999<<<<<@@@@@@@:@@<:<<<<<8<<::7::::::::@@@22::::@ XT:A:U NM:i:1 SM:i:29 AM:i:29 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0T75 XA:Z:chr13,+28330526,76M,1; MCL-SRR350952.1 147 chr13 28330636 29 73M3S = 28330526 -183 TATAGGATTCAACTGTGAGAAAGACATATTAATCTCTTCCATTGTGCAGACTACATTCTTTTTTTTTTTTTTTGAG #####################B7:?2,?8;;@+=+A@3DEBB?2?5B7=A?=?4<8;;AIGIDIHFAIIGHIIBII XT:A:M NM:i:2 SM:i:29 AM:i:29 XM:i:2 XO:i:0 XG:i:0 MD:Z:18A18G35 MCL-SRR350952.2 99 chr9 16437227 60 76M = 16437304 153 NGAAATGCAAGGCTGTTTGGGATGTTTTCGAAGTGATGAATGCTGGAAGGATTGCTGTTCTCTAAGTGAGCAAGGA ############################################################################ XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0A75 XA:Z:chr9,+16437227,76M,1; MCL-SRR350952.2 147 chr9 16437304 60 76M = 16437227 -153 GCTGGGACTCCTGGTGCGATTATTGCTCTCAATGAAAGTCCTTATATCTGAGTCTGTCTTTGAAGATGGTACAGCC DBAEA<>G>GGDGG<DD3E<CDCC>E?D?E@DGDDDG8BGGGEGEG@E@@BCCFEE,IHIIGEGGEFCCF<FFFFD XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XA:Z:chr9,-16437304,76M,0;
Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same Coor12345678901234 5678901234567890123456789012345 Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT header seq info @HD VN:1.5 SO: coordinate @SQ SN:ref LN:45 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 c.f. S. Brown NYU
Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same Coor12345678901234 5678901234567890123456789012345 Ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT @HD VN:1.5 SO: coordinate @SQ SN:ref LN:45 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 Query template NAME POSition QUERY Sequence bitwise FLAG CIGAR MateReference NaMe query QUAL MAPing quality Inferred Insert SIZE Reference seq. NAME c.f. S. Brown NYU
Sequence Alignment Map standard file format for sequences aligned to reference genomeBAM is the compressed binary version of the same @HD VN:1.5 SO: coordinate @SQ SN:ref LN:45 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0; r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1; r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1 bitwise FLAG CIGAR c.f. S. Brown NYU
VCF format: variants c.f. S. Brown NYU chr1 901559 . G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15 chr1 905974 . C T 187 . DP=22;VDB=0.0249;AF1=0.5;AC1=1;DP4=5,4,9,3;MQ=59;FQ=170;PV4=0.4,0.2,0.2,0.49 GT:PL :GQ 0/1:217,0,197:99 chr1 907622 . G C 82.3 . DP=8;VDB=0.0474;AF1=1;AC1=2;DP4=0,0,2,3;MQ=54;FQ=-42 GT:PL:GQ 1/1:115,15,0: 27 chr1 909073 . C T 39 . DP=16;VDB=0.0504;AF1=0.5;AC1=1;DP4=0,9,0,7;MQ=59;FQ=42;PV4=1,7.5e-09,0.14,0.49 GT:PL :GQ 0/1:69,0,156:72 chr1 909238 . G C 115 . DP=10;VDB=0.0503;AF1=1;AC1=2;DP4=0,0,4,4;MQ=60;FQ=-51 GT:PL:GQ 1/1:148,24,0: 45 chr1 909309 . T C 3.01 . DP=6;VDB=0.0018;AF1=0.4998;AC1=1;DP4=0,2,0,3;MQ=60;FQ=4.72;PV4=1,0.02,1,0.0071 GT:PL :GQ 0/1:30,0,47:28 chr1 909768 . A G 137 . DP=21;VDB=0.0530;AF1=1;AC1=2;DP4=0,0,11,9;MQ=60;FQ=-87 GT:PL:GQ 1/1:170,60,0: 99 chr1 935222 . C A 4.61 . DP=4;VDB=0.0216;AF1=1;AC1=2;DP4=0,0,0,2;MQ=60;FQ=-33 GT:PL:GQ 1/1:34,6,0:5
VCF format: variants ALT QUAL #CHROM POS ID REF QUAL INFO AA ancestral allele AC allele count in genotypes, for each ALT allele, in the same order as listed AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes AN total number of alleles in called genotypes BQ RMS base quality at this position CIGAR cigar string describing how to align an alternate allele to the reference allele DB dbSNP membership DP combined depth across samples, e.g. DP=154 END end position of the variant described in this record (esp. for CNVs) H2 membership in hapmap2 MQ RMS mapping quality, e.g. MQ=52 MQ0 Number of MAPQ == 0 reads covering this record NS Number of samples with data SB strand bias at this position SOMATIC indicates that the record is a somatic mutation, for cancer genomics VALIDATED validated by follow-up experiment c.f. S. Brown NYU chr1 901559 . G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15
Alignment or NGS What are the challenges? c.f. S. Brown NYU
NGS alignment algorithms Enter BWT Smith Waterman BLAST BLAT is precomputed BLAST