Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module 3: Mapping and Genome Rearrangement. Jared Simpson, Ph.D. Bioinformatics for Cancer Genomics, May 25-29, 2015.

Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  3. Module 3: Mapping and Genome Rearrangement. Jared Simpson, Ph.D. Bioinformatics for Cancer Genomics, May 25-29, 2015. [Figure: paired-end reads ATCAA and CTAAG at the two ends of a DNA fragment]

  4. Learning Objectives of Module • Understand mapping reads to a reference genome • Understand FASTQ and SAM/BAM file formats • Learn common terminology used to describe alignments • Learn how to find genome rearrangements using read pairs • Run a mapper and rearrangement caller

  5. Sequencing platforms [Figure: sequencing platforms placed by data per run and run time. Axes: Increasing Data Per Run, Increasing Run Time. Point labels: 14TB/run, 600Gb/10d, 100Gb/15d, 120Gb/1d, 90Gb/10d, 2Gb/27h, 700Mb/23h, 150Mb/3h, 100Mb/1h]

  6. Illumina Sequencing

  7. Basecalling • Translation of image data to base calls

  8. Sources of error • Illumina: Pre-phasing & Phasing

  9. What is a base quality score? • Phred quality scores: • Estimate of probability the base call is incorrect
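To make the Phred scale concrete, here is a minimal sketch of the standard conversion Q = -10 log10(P). The function names are illustrative, and the ASCII offset of 33 (the Sanger / Illumina 1.8+ encoding) is an assumption that the slides do not state.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Turn a FASTQ/SAM quality string into integer Phred scores.

    offset=33 is an assumption (Phred+33 encoding); older Illumina
    files used an offset of 64.
    """
    return [ord(c) - offset for c in qual]

for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):g}")
```

Under this scale every 10 points of quality corresponds to a tenfold drop in the estimated error probability.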

  10. Error Profiles • Illumina • Low error rate (~0.5%), mainly substitutions • 454/Ion Torrent • Mainly insertions/deletions in homopolymer runs • PacBio • Higher error rate, mixture of insertions, deletions, substitutions

  11. Illumina Error Profile

  12. FASTQ files reads.fastq Read ID

  13. FASTQ files reads.fastq Sequence

  14. FASTQ files reads.fastq Quality ID

  15. FASTQ files reads.fastq Quality
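The four slides above highlight the four lines of a FASTQ record in turn: read ID, sequence, quality ID, and quality. As a rough sketch, such a file can be read with the standard library alone. The record below is made up for illustration, and the parser assumes the common unwrapped layout of exactly four lines per record.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        header = header.rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual

# Hypothetical record, not taken from the slides.
example = io.StringIO("@read1\nATCAA\n+\nIIIII\n")
records = list(read_fastq(example))
print(records)
```

For a real file, `open("reads.fastq")` would replace the `io.StringIO` object.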

  16. Reference Mapping • Goal: find out where in the genome the read came from • Issues: • The human genome is large and repetitive • NGS instruments produce huge amounts of data • The sequenced genome will differ from the reference due to SNPs, indels and structural variation

  17. Choosing a Mapper • Needs to be accurate: misaligned reads are a source of false positive variant calls • Needs to be sensitive: must allow for differences between the individual and reference • Needs to be fast: informatics cost of NGS analysis is significant

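  18. Reference Mapping [Figure: a sequence read whose position on the reference genome is unknown (?)]

  19. Reference Mapping [Figure: a sequence read placed against the reference genome, with three positions marked x]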

  20. Mapping Quality • Phred-scaled estimate of the probability that the chosen mapping is wrong • 1 in 1000 reads with “Q30” alignment will be placed incorrectly • What makes accurate mapping difficult? • Short reads • High error rate • Repetitive sequence
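The "1 in 1000" figure follows from reading mapping quality on the Phred scale. A small sketch of that arithmetic, with illustrative function names:

```python
def mapq_to_error_prob(mapq):
    """Probability that the chosen mapping is wrong, P = 10^(-MAPQ/10)."""
    return 10 ** (-mapq / 10)

def expected_misplaced(mapqs):
    """Expected number of wrongly placed reads, given each read's MAPQ."""
    return sum(mapq_to_error_prob(q) for q in mapqs)

# 1000 reads, all reported with mapping quality 30.
print(expected_misplaced([30] * 1000))
```

This treats the mapper's reported qualities as well calibrated, which is an assumption; in practice calibration varies between mappers.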

  21. What are Paired Reads? [Figure: paired-end reads ATCAAGA and CTACATG at the two ends of a DNA fragment; the insert size (IS) spans the fragment] Slides by M. Brudno

  22. Paired Reads Reference genome ? Sequence read pair

  23. Paired Mapping Reference genome x x Sequence read pair

  24. Paired Mapping Reference genome x x x x x x x x Sequence read pair

  25. Sequence Alignment/Map Format • SAM/BAM is the standardized format for working with alignments • SAM is tab-delimited text representation • BAM is a compressed binary representation SRR013667.1 99 19 8882171 60 76M = 8882214 119 NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT #>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77

  26. SAM Format: Read ID (SRR013667.1) and Flag (99) • Flag indicates the reference strand, pairing information
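The flag is a bit field, so the value 99 in the example record can be unpacked bit by bit. The sketch below uses the bit values from the SAM specification; the short names attached to each bit are illustrative labels, not official identifiers.

```python
# (bit value, illustrative label) pairs for the SAM FLAG field.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]

print(decode_flag(99))
```

For flag 99 (1 + 2 + 32 + 64) this reports a paired read in a proper pair, first in the pair, whose mate maps to the reverse strand; since the reverse-strand bit is not set, the read itself is on the forward strand.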

  27. SAM Description: Chromosome (19) and Position (8882171)

  28. SAM Description: Mapping Quality (60)

  29. SAM Description: CIGAR (76M) • Example: Ref ACGATACATAC, Read ACGA-ACATAC, CIGAR: 4M1D6M • Example: Ref GACA-AACC, Read GTCATAACC, CIGAR: 4M1I4M
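The two CIGAR examples on this slide can be checked mechanically: M consumes both read and reference bases, D consumes only reference bases, and I consumes only read bases. The following is a minimal sketch with illustrative names; the operation sets follow the SAM specification.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference
READ_CONSUMING = set("MIS=X")  # operations that advance along the read

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuilding the string catches characters the regex silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def cigar_lengths(cigar):
    """Return (reference bases covered, read bases covered)."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    read_len = sum(n for n, op in ops if op in READ_CONSUMING)
    return ref_len, read_len

for cigar in ("76M", "4M1D6M", "4M1I4M"):
    print(cigar, cigar_lengths(cigar))
```

The deletion example spans 11 reference bases with a 10-base read, and the insertion example spans 8 reference bases with a 9-base read, matching the aligned strings on the slide.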

  30. SAM Description: Mate chromosome (=, the same chromosome as the read), mate position (8882214) and insert size (119) [Figure: reads ATCAA and CTAAG at the two ends of a DNA fragment; the insert size (IS) spans the fragment]
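Putting the fields together, the example record can be split on tabs and the reported insert size cross-checked against the two mapping positions. This is a sketch only: the tuple and function names are illustrative, and the check assumes the mate is also 76 bases long and aligned end to end, which the slides do not state.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

def parse_sam_line(line):
    """Split one tab-delimited SAM alignment line into its 11 mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])

# The example record from the slides, rebuilt with tab separators.
line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGTGTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>8AB685C26091:77",
])
record = parse_sam_line(line)

MATE_LENGTH = 76  # assumption: the mate is a fully aligned 76-base read
implied_insert_size = record.pnext + MATE_LENGTH - record.pos
print(record.rnext, record.pnext, record.tlen, implied_insert_size)
```

Under that assumption the distance from the start of this read to the end of its mate is 119 bases, which agrees with the insert size field.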

  31. Resources • samtools: toolkit for working with SAM/BAM files • Convert between SAM/BAM • Sort alignments • Extract alignments for a given genomic location • SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf • Questions/Help: https://lists.sourceforge.net/lists/listinfo/samtools-help, http://www.biostars.org/, http://seqanswers.com/

  32. Viewing Alignments - IGV

  33. Alignment Problems

  34. Alignment Problems

  35. Alignment Problems

  36. We are now going to start a read mapping exercise

  37. We are on a Coffee Break & Networking Session

  41. What kinds of variation are there? • Single Nucleotide Variants (SNVs) • Short indels (< read length) • Structural variations: large insertions and deletions, inversions, translocations, copy number variation

  42. Structural variants • Mate-pair and paired-end reads can be used to detect structural variants • Paired-ends workflow: genomic DNA; fragmentation (200-500bp); add amplification and sequencing adaptors; sequence

  43. Structural variants • Mate-pair and paired-end reads can be used to detect structural variants • Mate-pairs workflow: genomic DNA; fragmentation (1-20kb) & circularization to an internal adaptor; shear; isolate internal adaptors and fragment ends; add amplification and sequencing adaptors; sequence

  44. Read pair orientation [Figure: a sequence read pair aligned to the reference genome] • The expected orientation is one read on the forward strand and one read on the reverse strand for paired-end reads

  45. Read pair alignment [Figure: fragment number plotted against fragment size] • Fragment/insert size is determined by library preparation • Pairs that match the expected orientation and distance are called concordant • Discordant read pairs give evidence of structural variation
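One way to turn "expected distance" into a test is to estimate a size window from the observed fragment sizes and then check each pair against it. The sketch below uses the median plus or minus a multiple of the median absolute deviation; that estimator, the cutoff of 5, and all names are assumptions for illustration, not the method of any particular caller.

```python
import statistics

def insert_size_bounds(sizes, n_mads=5):
    """Estimate the expected insert-size window as median +/- n_mads * MAD."""
    med = statistics.median(sizes)
    mad = statistics.median([abs(s - med) for s in sizes])
    return med - n_mads * mad, med + n_mads * mad

def is_concordant(left_strand, right_strand, insert_size, bounds):
    """True if the leftmost read is forward, the rightmost is reverse,
    and the mapped insert size falls inside the expected window."""
    low, high = bounds
    return (left_strand, right_strand) == ("+", "-") and low <= insert_size <= high

# Hypothetical insert sizes from a library with fragments around 300 bp.
bounds = insert_size_bounds([290, 300, 310, 295, 305])
print(bounds)
print(is_concordant("+", "-", 300, bounds))
print(is_concordant("+", "-", 900, bounds))
```

A robust estimator is used here because discordant pairs themselves would inflate a plain mean and standard deviation.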

  46. SV Signatures: Deletion [Figure: read pair from the donor genome (don) mapped to the reference (ref)] Slides by M. Brudno

  47. SV Signatures: Deletion [Figure: read pair from the donor genome (don) mapped to the reference (ref)] • Deletion signature: mapped insert size larger than expected. Slides by M. Brudno

  48. SV Signatures: Insertion [Figure: read pair from the donor genome (don) mapped to the reference (ref)] • Insertion signature: mapped insert size smaller than expected. Slides by M. Brudno

  49. SV Signatures: Tandem Duplication [Figure: read pair from the donor genome (don) mapped to the reference (ref)] • Tandem duplication signature: wrong orientation
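The three signatures above can be collected into one small decision rule. This is a simplified sketch, not a real rearrangement caller: the function name and labels are illustrative, the size window would come from the library's fragment size distribution, and reading "wrong orientation" as an outward-facing pair (leftmost read on the reverse strand, rightmost on the forward strand) is an assumption beyond what the slide states.

```python
def classify_pair(left_strand, right_strand, insert_size, low, high):
    """Label one read pair by the SV signature it matches.

    left_strand / right_strand: '+' or '-' for the leftmost and rightmost
    read on the reference; low / high: expected insert-size window.
    """
    if (left_strand, right_strand) == ("+", "-"):
        if insert_size > high:
            return "deletion"    # mapped insert size larger than expected
        if insert_size < low:
            return "insertion"   # mapped insert size smaller than expected
        return "concordant"
    if (left_strand, right_strand) == ("-", "+"):
        return "tandem_duplication"  # assumed meaning of "wrong orientation"
    return "other_discordant"        # same-strand pairs; not covered by the slides above

print(classify_pair("+", "-", 5000, 275, 325))
print(classify_pair("+", "-", 120, 275, 325))
print(classify_pair("-", "+", 300, 275, 325))
```

A single discordant pair is weak evidence on its own; callers generally look for clusters of pairs that support the same event.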
