410 likes | 427 Views
Explore biases in NGS technology, from sequencing errors to mappability bias, and learn about quality control and error correction methods like trimming and read mapping. Discover useful tools for addressing known biases. Improve your NGS data analysis!
E N D
Mapping & Tools Katia Khrameeva Research scientist, Skoltech
Plan • Biases in NGS technology • Quality control • Error correction • Read mapping • Useful tools
Known biases in NGS technology Sequencing errors at 3'-end of reads
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads G -> T and A -> C errors
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Sequences preceding errors are G-rich G -> T and A -> C errors
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Higher coverage of the 3'-end
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Higher coverage of the 3'-end Contamination by under-spliced RNAs
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Influence of RNA secondary structure Higher coverage of the 3'-end Contamination by under-spliced RNAs
Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Coverage non-uniformity across transcripts Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Influence of RNA secondary structure Higher coverage of the 3'-end Contamination by under-spliced RNAs
FASTQ format Raw sequence • FASTQ format is a standard format for storing reads. • Illumina sequence identifier: Quality values https://en.wikipedia.org/wiki/FASTQ_format
FASTQ format • Quality values are encoded as ASCII symbols. The range of values depends on the technology: • A quality value Q(Phred Quality Score) is an integer mapping of p(the probability that the corresponding base call is incorrect): https://en.wikipedia.org/wiki/FASTQ_format
Quality control FastQC program is the most popular tool to check quality of reads. • Number of reads • Quality of reads • Nucleotide content • Percent of duplicated reads • Overrepresented sequences
FastQC output Looks like the base calls after 50 have bad quality. The best solution is trimming.
FastQC output The presence of a spike around 40% indicates that sequences with 40% GC-content are over-represented.
FastQC output • Overrepresented sequences seem to consist of repeated runs of “AGAGA” – which has 40% GC-content. That probably explains the a spike around 40% at the GC plot. • Possible reasons: • Poor sequencing quality or bad base calling. • Sequencing library is contaminated with PCR-amplified microsatellite sequences. • The studied species has a lot of repetitive sequences in its genome.
Error correction • Trimming • Remove adapters • Trim low-quality nucleotides at the 3’-end of reads • Trimmomatic software • Correction of errors • Algorithms of error correction are based on k-mer counts: ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATTTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG • Bloocoo, Lighter software
Read mapping • The genome of the sequenced organism is known (‘reference genome’). • D. melanogaster, H. sapiens, S. cerevisiae, … Quality control & Error correction FASTQ file FASTQ file SAM/BAM file Read mapping
Read mapping • Aim: to find coordinates of reads in the reference genome. • Challenges: • Millions of short sequences • Sequences are often paired • Errors are not randomly distributed • Most popular programs are bowtie and bwa (both use Burrows-Wheeler Transform algorithm). Two-step approach: • Create an index for the reference genome (one time for one genome). • Map reads to the reference genome using this index.
bowtie • The first version of bowtie[Langmead et al. 2009] is optimal for: • short reads (under50 bp) • reads without indels (insertions/deletions) • bowtie2 [Langmead &Salzberg2012] is optimal for: • long reads (more than 50 bp) • reads with indels • various alignment options • Each version has its own index file format (bowtie-build/bowtie2-build tools). • A popular RNA-seq analysis toolset (tophat, cufflinks) is based on bowtie / bowtie2. http://bowtie-bio.sourceforge.net
bwa • bwabacktrack[Li, Durbin 2009]: • for short reads (< 100bp) • bwabwasw[Li, Durbin 2010]: • for long reads (70bp - 1Mbp) • short indels • bwamem[Li2013]: • for long reads (70bp-1Mbp) • faster and more efficient http://bio-bwa.sourceforge.net
SAM/BAM format • Suppose we have an alignment: • The corresponding SAM format is: CIGAR string http://samtools.github.io/hts-specs/SAMv1.pdf
SAM/BAM format • Each line represents the alignment of one read. • Each line has 11 mandatory fields: • FLAG: next slide. • MAPQ: MAPping Quality. It equals -10 log10Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.
FLAG field Paired-end reads Multiple mapped read, and this is not the best alignment Chimeric alignment (different chromosomes) For example, FLAG = 99 is equivalent to (64 + 32 + 2 + 1).
SamTools package • A set of utilities for processing SAM/BAM files. • view Convert a bam file into a sam file. samtoolsview sample.bam > sample.sam Convert a sam file into a bam file. samtoolsview -bSsample.sam > sample.bam Extract all the reads aligned to the range specified. An index of the input file is required. samtoolsview sample_sorted.bam "chr1:10-13" The same reads as above, but instead of displaying, write the reads to a new bam file. samtoolsview -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam • sort samtoolssort unsorted_in.bamsorted_out • index samtoolsindex sorted.bam(creates an index file, sorted.bam.bai) -b: output BAM -S: read SAM http://samtools.sourceforge.net add a proper header
SamTools package • flagstat – report basic statistics samtoolsflagstatsample.bam An example of output: 4198456 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 4022089 + 0 mapped (95.80%:-nan%) 4198456 + 0 paired in sequencing2099228 + 0 read1 2099228 + 0 read2 3796446 + 0 properly paired (90.42%:-nan%)4013692 + 0 with itself and mate mapped8397 + 0 singletons (0.20%:-nan%)167574 + 0 with mate mapped to a different chr72008 + 0 with mate mapped to a different chr (mapQ>=5) • faidx– index a FASTA file samtoolsfaidxref.fasta(creates an index file ref.fasta.fai) • merge – merge several BAM files into one samtools merge out.bam in1.bam in2.bam
UCSC genome browser http://genome.ucsc.edu/
BedTools package • bamtobed - Convert BAM alignments to BED (& other) formats. • bamtofastq- Convert BAM records to FASTQ records. • bedtobam - Convert intervals to BAM records. • closest - Find the closest, potentially non-overlapping interval. • complement - Extract intervals _not_ represented by an interval file. • coverage - Compute the coverage over defined intervals. • genomecov - Compute the coverage over an entire genome. • getfasta - Use intervals to extract sequences from a FASTA file. • intersect - Find overlapping intervals in various ways. • shuffle- Randomly redistribute intervals in a genome. • sort - Order the intervals in a file. http://bedtools.readthedocs.org/
bedtools “intersect” • Report the intervals that represent overlaps between your two files: bedtoolsintersect -a cpg.bed -b exons.bed • Report the original feature in each file: bedtoolsintersect -a cpg.bed -b exons.bed -wa–wb • How many base pairs of overlap were there? bedtoolsintersect -a cpg.bed -b exons.bed–wo • Counting the number of overlapping features: bedtoolsintersect -a cpg.bed -b exons.bed–c • Find features that DO NOT overlap: bedtoolsintersect -a cpg.bed -b exons.bed -v
How to get help? • Read the manual! • Can’t find the answer? Read it again. • Use Google • Most likely your problem was encountered before, and solved by someone else. • Ask for help at specialized forums: • seqanswers.com – the most popular • biostars.com
Exercise • There are paired reads of some DNA sequencing experiment of the human sample: http://makarich.fbb.msu.ru/khrameeva/bioinf2015/ • You will study some particular region of the human genome • Map reads to the human reference genome (version hg19) • Extract reads that map to your region only • Upload the reads to UCSC genome browser as a custom track • Make a screenshot, count the number of insertions and deletions in SAM file • Send me: ekhrameeva@gmail.com E-mail subject: Practice2
How To:Mapping • Install bowtie2 program: http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.6/ • Download sequence of a chromosome your region is located at as a FASTA file http://genome.ucsc.edu/ Downloads Genome Data Human hg19 Data set by chromosome chr*.fa.gz • Decompress the file using gunzip • Map reads to this chromosome using bowtie2 with the standard parameters Don’t forget to make an index (bowtie-build2) of the chromosome before mapping! • You will get a SAM file as an output. • Count the number of insertions and deletions in SAM file (use cut)
How To:Extracting reads • Install BedTools: http://bedtools.readthedocs.org/en/latest/content/installation.html • Create a tab-delimited BED file with the coordinates of your region: chr21 1000000 2000000 • Convert SAM file with mapped reads to BAM file using samtools view • Use bedtools intersect to extract the reads from the BAM file You’ll need a BED file to upload the result to UCSC genome browser, so figure out how to make bedtools intersect to produce an output in BED format.
How To:UCSC custom track • Upload the BED file to UCSC genome browser ‘Add custom track’ button Choose file Submit