Mapping NGS sequences to a reference genome
Why? • Resequencing studies (DNA) • Structural variation • SNP identification • RNAseq • Mapping transcripts to a genome sequence • Genome annotation • Transcript enumeration • Identification of splice junctions/variants
BLAST is too slow • Different alignment algorithms are necessary • Burrows-Wheeler alignment • The sequence database (genome) is transformed to produce an index • Individual sequence reads are searched against this index • STAR Aligner (Dobin et al. 2012, Bioinformatics) • Uncompressed suffix arrays
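The index-then-search idea can be sketched in a few lines of Python. This is a toy illustration only, not how Bowtie or any production aligner is implemented: the function names `bwt_index` and `find_read` are hypothetical, the index is built by naive suffix sorting, and only exact matches are reported.

```python
def bwt_index(genome):
    """Build a toy Burrows-Wheeler index: the suffix array and the BWT string."""
    text = genome + "$"  # "$" marks the end of the text and sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix (text[-1] is "$" for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


def find_read(read, suffix_array, bwt):
    """Return sorted 0-based genome positions where `read` matches exactly."""
    alphabet = sorted(set(bwt))
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # Backward search: extend the match one base at a time, from the read's last base.
    lo, hi = 0, len(bwt)
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


suffix_array, bwt = bwt_index("ACGTACGTTAGC")
print(find_read("ACGT", suffix_array, bwt))
```

The genome is indexed once, and each read is then searched in time proportional to its own length rather than the genome's, which is the reason this family of aligners outpaces BLAST on millions of short reads.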
TopHat2 • Based on the Bowtie alignment engine • Bowtie: matching with no gaps • TopHat2: gapped matches • Aligns reads to a Burrows-Wheeler transformed index of the genome • 1st pass: non-gapped matches • 2nd pass: splits unmapped reads and attempts to align the fragments
The STAR Aligner • Start at the first base of the sequence read • Find the Maximal Mappable Prefix (MMP) • Repeat the process using the unmapped portion of the read • Reported to be more than 50x faster than other aligners (Dobin et al. 2012)
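The MMP steps above can be sketched as follows. This is a minimal illustration assuming exact matching and a naive scan of the genome; STAR itself searches a suffix-array index and also handles mismatches, so the hypothetical functions `maximal_mappable_prefix` and `split_read` show the idea only.

```python
def maximal_mappable_prefix(read, genome):
    """Return (length, positions) of the longest prefix of `read` found exactly in `genome`."""
    length, positions = 0, []
    for k in range(1, len(read) + 1):
        prefix = read[:k]
        hits = [i for i in range(len(genome) - k + 1) if genome.startswith(prefix, i)]
        if not hits:
            break  # the prefix cannot be extended any further
        length, positions = k, hits
    return length, positions


def split_read(read, genome):
    """Cut a read into consecutive maximal mappable prefixes."""
    segments, start = [], 0
    while start < len(read):
        length, positions = maximal_mappable_prefix(read[start:], genome)
        if length == 0:
            start += 1  # this base occurs nowhere in the genome; skip it
            continue
        segments.append((read[start:start + length], positions))
        start += length  # repeat on the unmapped portion of the read
    return segments


# Toy genome with two "exons" separated by Ns; the read spans the junction.
genome = "GATTACANNNNNCCGGTT"
print(split_read("TTACACCGG", genome))
```

On this toy input the read splits into one segment per exon, which is how a spliced read reveals a junction.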
OUTPUTS • TopHat (Bowtie) • .bam file (binary alignment/map) • .sam (sequence alignment/map) • Single .sam file entry: I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT
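To make the layout of a .sam entry concrete, here is a small sketch that splits the entry above into its eleven mandatory columns plus optional tags. It assumes tab-separated fields, as in real SAM files (the slide shows them with spaces); `parse_sam_line` is a hypothetical helper, not part of any library.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_COLUMNS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into named fields."""
    parts = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_COLUMNS, parts):
        record[name] = int(value) if name in INT_COLUMNS else value
    # Optional fields look like TAG:TYPE:VALUE, e.g. NM:i:7
    tags = {}
    for item in parts[11:]:
        tag, kind, value = item.split(":", 2)
        tags[tag] = int(value) if kind == "i" else value
    record["tags"] = tags
    return record


# The entry from the slide, rebuilt with tabs between fields.
example = "\t".join([
    "I8MVR:53:837", "0", "17_dna:chromosome", "14090858", "255", "21M",
    "*", "0", "0", "TAACTACGAATACCTGTCGAT", "**%-**,00%-*-%---*-*-",
    "NM:i:7", "XX:Z:C5T3C2T2CT2C", "XM:Z:h..H......h.H...x...h",
    "XR:Z:CT", "XG:Z:CT",
])
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"], record["tags"]["NM"])
```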
.sam flags • 1 • 2 • 1+2 • 0+4 • 1+4 • 0+2+4 • 1+2+4 • 0+8 • 1+8 • 0+2+8 • 1+2+8 • 0+4+8 • 1+4+8 • 0+2+4+8 • 1+2+4+8 • …etc.
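The sums above are bitwise combinations: each power of two is one yes/no property of the read, and the flag value is their total. A short sketch of decoding a flag follows; the bit meanings come from the SAM specification, while `decode_flag` is a hypothetical helper name.

```python
# Bit values and meanings as defined in the SAM specification.
SAM_FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "secondary alignment",
    512: "failed quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bits are set in a SAM flag value."""
    return [label for bit, label in SAM_FLAG_BITS.items() if flag & bit]


for flag in (0, 1 + 2, 1 + 4 + 8):
    print(flag, decode_flag(flag))
```

A flag of 0, as in the example entries on these slides, means none of the bits are set: an unpaired read mapped to the forward strand.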
CIGAR format I8MVR:104:144 0 7_dna:chromosome 120102744 255 62M1I14M * 0 0 GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA
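The CIGAR string in this entry, 62M1I14M, reads as 62 aligned bases, 1 base inserted relative to the reference, then 14 more aligned bases. The sketch below parses a CIGAR string and derives the read length and the span covered on the reference; the operation letters follow the SAM specification, and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn '62M1I14M' into [(62, 'M'), (1, 'I'), (14, 'M')]."""
    return [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]


def read_length(cigar):
    """Bases of the read consumed: matches, insertions, soft clips."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def reference_span(cigar):
    """Bases of the reference covered: matches, deletions, skipped regions."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(parse_cigar("62M1I14M"), read_length("62M1I14M"), reference_span("62M1I14M"))
```

The read length of 77 agrees with the 77-base sequence in the entry, while the alignment covers only 76 reference bases because the inserted base has no counterpart in the genome.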
Quantifying alignments • How many reads overlap a given interval on a chromosome (scaffold)? • How do these regions correspond to known genes? • .gtf file • How many transcripts from my gene of interest? • How confident can I be about a variant call?
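The first question above, how many reads overlap an interval, reduces to a simple coordinate comparison. The sketch below assumes alignments have already been reduced to (chromosome, start, end) tuples with 1-based inclusive coordinates; `count_overlaps` and the second read in the list are hypothetical, and real tools work from sorted, indexed .bam files rather than Python lists.

```python
def count_overlaps(alignments, chrom, start, end):
    """Count alignments on `chrom` sharing at least one base with [start, end]."""
    return sum(
        1
        for aln_chrom, aln_start, aln_end in alignments
        if aln_chrom == chrom and aln_start <= end and aln_end >= start
    )


# First tuple: the 21M read at position 14090858 (end = start + 21 - 1).
alignments = [
    ("17_dna:chromosome", 14090858, 14090878),
    ("17_dna:chromosome", 14090870, 14090890),  # hypothetical second read
    ("7_dna:chromosome", 120102744, 120102819),
]
print(count_overlaps(alignments, "17_dna:chromosome", 14090875, 14090900))
```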
Annotate regions - GTF files • GTF fields • Sequence ID • Source • Feature • Start • End • Score • Strand • Frame • Attribute
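A GTF line carries these nine fields separated by tabs, with the attribute column holding `key "value";` pairs. The following sketch parses one line; the example line, its gene identifier, and the helper `parse_gtf_line` are made up for illustration.

```python
GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Split one tab-separated GTF line into its nine named fields."""
    record = dict(zip(GTF_FIELDS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The attribute column is a list of: key "value";
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Hypothetical annotation line, not taken from a real GTF file.
line = "\t".join([
    "17", "example_source", "exon", "14090800", "14091000", ".", "+", ".",
    'gene_id "GENE0001"; transcript_id "TRANSCRIPT0001";',
])
exon = parse_gtf_line(line)
print(exon["feature"], exon["start"], exon["end"], exon["attribute"]["gene_id"])
```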
Variant Calling • .bam/.sam file contains the read-level information required to call variants • Variant calls can’t be extracted from the .bam file alone • The reference genome sequence must also be provided I8MVR:53:837 0 17_dna:chromosome 14090858 255 21M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT
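Why the genome sequence is needed can be seen in a minimal sketch: the alignment gives a position and the read's bases, and a difference only becomes visible once the reference bases at that position are laid alongside. The code assumes an ungapped alignment (a CIGAR such as 21M) and a 1-based POS; the reference string and `ungapped_mismatches` are hypothetical, and real variant callers additionally weigh base qualities and coverage across many reads.

```python
def ungapped_mismatches(reference, pos, read_seq):
    """Compare an ungapped read with the reference at 1-based `pos`.

    Returns a list of (reference_position, reference_base, read_base).
    """
    window = reference[pos - 1:pos - 1 + len(read_seq)]
    return [
        (pos + offset, ref_base, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read_seq))
        if ref_base != read_base
    ]


# Hypothetical 10-base reference and a 4-base read aligned at position 3.
reference = "ACGTACGTAC"
print(ungapped_mismatches(reference, 3, "GTTC"))
```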
Variant Analysis • Extract variant information from provided .bam file • Examine output file and learn about the information contained in the various fields
Introducing… Dr. Eric Rouchka • Bioinformatics Core Director • Department of Computer Engineering and Computer Science • University of Louisville • Kentucky Biomedical Research Infrastructure Network