1 / 41

NGS Biases and Error Correction: Mapping Tools and Techniques

Explore biases in NGS technology, from sequencing errors to mappability bias, and learn about quality control and error correction methods like trimming and read mapping. Discover useful tools for addressing known biases. Improve your NGS data analysis!

allenb
Download Presentation

NGS Biases and Error Correction: Mapping Tools and Techniques

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mapping & Tools Katia Khrameeva Research scientist, Skoltech

  2. Plan • Biases in NGS technology • Quality control • Error correction • Read mapping • Useful tools

  3. Known biases in NGS technology Sequencing errors at 3'-end of reads

  4. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads

  5. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads G -> T and A -> C errors

  6. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Sequences preceding errors are G-rich G -> T and A -> C errors

  7. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read

  8. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read

  9. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read

  10. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Higher coverage of the 3'-end

  11. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Higher coverage of the 3'-end Contamination by under-spliced RNAs

  12. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Influence of RNA secondary structure Higher coverage of the 3'-end Contamination by under-spliced RNAs

  13. Known biases in NGS technology CG-rich regions are higher covered Sequencing errors at 3'-end of reads Mappability bias PCR biases Coverage non-uniformity across transcripts Sequences preceding errors are G-rich G -> T and A -> C errors Nucleotide content bias across the read Influence of RNA secondary structure Higher coverage of the 3'-end Contamination by under-spliced RNAs

  14. Quality control and error correction!

  15. FASTQ format Raw sequence • FASTQ format is a standard format for storing reads. • Illumina sequence identifier: Quality values https://en.wikipedia.org/wiki/FASTQ_format

  16. FASTQ format • Quality values are encoded as ASCII symbols. The range of values depends on the technology: • A quality value Q(Phred Quality Score) is an integer mapping of p(the probability that the corresponding base call is incorrect): https://en.wikipedia.org/wiki/FASTQ_format

  17. Quality control FastQC program is the most popular tool to check quality of reads. • Number of reads • Quality of reads • Nucleotide content • Percent of duplicated reads • Overrepresented sequences

  18. FastQC output

  19. FastQC output Looks like the base calls after 50 have bad quality. The best solution is trimming.

  20. FastQC output The presence of a spike around 40% indicates that sequences with 40% GC-content are over-represented.

  21. FastQC output • Overrepresented sequences seem to consist of repeated runs of “AGAGA” – which has 40% GC-content. That probably explains the a spike around 40% at the GC plot. • Possible reasons: • Poor sequencing quality or bad base calling. • Sequencing library is contaminated with PCR-amplified microsatellite sequences. • The studied species has a lot of repetitive sequences in its genome.

  22. Error correction • Trimming • Remove adapters • Trim low-quality nucleotides at the 3’-end of reads • Trimmomatic software • Correction of errors • Algorithms of error correction are based on k-mer counts: ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATTTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG ATCGTAGCTATTCGATCTAGATGCATCG • Bloocoo, Lighter software

  23. Read mapping • The genome of the sequenced organism is known (‘reference genome’). • D. melanogaster, H. sapiens, S. cerevisiae, … Quality control & Error correction FASTQ file FASTQ file SAM/BAM file Read mapping

  24. Read mapping • Aim: to find coordinates of reads in the reference genome. • Challenges: • Millions of short sequences • Sequences are often paired • Errors are not randomly distributed • Most popular programs are bowtie and bwa (both use Burrows-Wheeler Transform algorithm). Two-step approach: • Create an index for the reference genome (one time for one genome). • Map reads to the reference genome using this index.

  25. bowtie • The first version of bowtie[Langmead et al. 2009] is optimal for: • short reads (under50 bp) • reads without indels (insertions/deletions) • bowtie2 [Langmead &Salzberg2012] is optimal for: • long reads (more than 50 bp) • reads with indels • various alignment options • Each version has its own index file format (bowtie-build/bowtie2-build tools). • A popular RNA-seq analysis toolset (tophat, cufflinks) is based on bowtie / bowtie2. http://bowtie-bio.sourceforge.net

  26. bwa • bwabacktrack[Li, Durbin 2009]: • for short reads (< 100bp) • bwabwasw[Li, Durbin 2010]: • for long reads (70bp - 1Mbp) • short indels • bwamem[Li2013]: • for long reads (70bp-1Mbp) • faster and more efficient http://bio-bwa.sourceforge.net

  27. SAM/BAM format • Suppose we have an alignment: • The corresponding SAM format is: CIGAR string http://samtools.github.io/hts-specs/SAMv1.pdf

  28. SAM/BAM format • Each line represents the alignment of one read. • Each line has 11 mandatory fields: • FLAG: next slide. • MAPQ: MAPping Quality. It equals -10 log10Pr{mapping position is wrong}, rounded to the nearest integer. A value 255 indicates that the mapping quality is not available.

  29. FLAG field Paired-end reads Multiple mapped read, and this is not the best alignment Chimeric alignment (different chromosomes) For example, FLAG = 99 is equivalent to (64 + 32 + 2 + 1).

  30. SamTools package • A set of utilities for processing SAM/BAM files. • view Convert a bam file into a sam file. samtoolsview sample.bam > sample.sam Convert a sam file into a bam file. samtoolsview -bSsample.sam > sample.bam Extract all the reads aligned to the range specified. An index of the input file is required. samtoolsview sample_sorted.bam "chr1:10-13" The same reads as above, but instead of displaying, write the reads to a new bam file. samtoolsview -h -b sample_sorted.bam "chr1:10-13" > tiny_sorted.bam • sort samtoolssort unsorted_in.bamsorted_out • index samtoolsindex sorted.bam(creates an index file, sorted.bam.bai) -b: output BAM -S: read SAM http://samtools.sourceforge.net add a proper header

  31. SamTools package • flagstat – report basic statistics samtoolsflagstatsample.bam An example of output: 4198456 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 4022089 + 0 mapped (95.80%:-nan%) 4198456 + 0 paired in sequencing2099228 + 0 read1 2099228 + 0 read2 3796446 + 0 properly paired (90.42%:-nan%)4013692 + 0 with itself and mate mapped8397 + 0 singletons (0.20%:-nan%)167574 + 0 with mate mapped to a different chr72008 + 0 with mate mapped to a different chr (mapQ>=5) • faidx– index a FASTA file samtoolsfaidxref.fasta(creates an index file ref.fasta.fai) • merge – merge several BAM files into one samtools merge out.bam in1.bam in2.bam

  32. UCSC genome browser http://genome.ucsc.edu/

  33. BedTools package • bamtobed - Convert BAM alignments to BED (& other) formats. • bamtofastq- Convert BAM records to FASTQ records. • bedtobam - Convert intervals to BAM records. • closest - Find the closest, potentially non-overlapping interval. • complement - Extract intervals _not_ represented by an interval file. • coverage - Compute the coverage over defined intervals. • genomecov - Compute the coverage over an entire genome. • getfasta - Use intervals to extract sequences from a FASTA file. • intersect - Find overlapping intervals in various ways. • shuffle- Randomly redistribute intervals in a genome. • sort - Order the intervals in a file. http://bedtools.readthedocs.org/

  34. bedtools “intersect” • Report the intervals that represent overlaps between your two files: bedtoolsintersect -a cpg.bed -b exons.bed • Report the original feature in each file: bedtoolsintersect -a cpg.bed -b exons.bed -wa–wb • How many base pairs of overlap were there? bedtoolsintersect -a cpg.bed -b exons.bed–wo • Counting the number of overlapping features: bedtoolsintersect -a cpg.bed -b exons.bed–c • Find features that DO NOT overlap: bedtoolsintersect -a cpg.bed -b exons.bed -v

  35. How to get help? • Read the manual! • Can’t find the answer? Read it again. • Use Google  • Most likely your problem was encountered before, and solved by someone else. • Ask for help at specialized forums: • seqanswers.com – the most popular • biostars.com

  36. Exercise • There are paired reads of some DNA sequencing experiment of the human sample: http://makarich.fbb.msu.ru/khrameeva/bioinf2015/ • You will study some particular region of the human genome • Map reads to the human reference genome (version hg19) • Extract reads that map to your region only • Upload the reads to UCSC genome browser as a custom track • Make a screenshot, count the number of insertions and deletions in SAM file • Send me: ekhrameeva@gmail.com E-mail subject: Practice2

  37. How To:Mapping • Install bowtie2 program: http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.6/ • Download sequence of a chromosome your region is located at as a FASTA file http://genome.ucsc.edu/  Downloads  Genome Data  Human  hg19  Data set by chromosome  chr*.fa.gz • Decompress the file using gunzip • Map reads to this chromosome using bowtie2 with the standard parameters Don’t forget to make an index (bowtie-build2) of the chromosome before mapping! • You will get a SAM file as an output. • Count the number of insertions and deletions in SAM file (use cut)

  38. How To:Extracting reads • Install BedTools: http://bedtools.readthedocs.org/en/latest/content/installation.html • Create a tab-delimited BED file with the coordinates of your region: chr21 1000000 2000000 • Convert SAM file with mapped reads to BAM file using samtools view • Use bedtools intersect to extract the reads from the BAM file You’ll need a BED file to upload the result to UCSC genome browser, so figure out how to make bedtools intersect to produce an output in BED format.

  39. How To:UCSC custom track • Upload the BED file to UCSC genome browser ‘Add custom track’ button  Choose file  Submit

More Related