
Indel Mappers

An overview of three indel-aware read mappers: Pindel (Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler and Zemin Ning), a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, together with Stampy and LAST.

trevet

Presentation Transcript


  1. Indel Mappers

  2. Indel Mapper • Pindel – A Pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads • Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler and Zemin Ning

  3. The programs • Stampy – A statistical algorithm for sensitive and fast mapping of Illumina sequence reads • Gerton Lunter and Martin Goodson (Genome Research, Oct 2010) • LAST – Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection • Michiaki Hamada, Edward Wijaya, Martin C. Frith and Kiyoshi Asai (Bioinformatics, Oct 2011)

  4. Flow of Pindel • Aim: compute precise breakpoints, as well as the fragments inserted or deleted relative to the reference, from paired-end reads • Use SSAHA2 to map all reads to the reference • If both ends are uniquely mapped, keep them • If only one end is uniquely mapped, it serves as the anchor (no mismatch allowed for this anchoring end) • The other end must be mapped with an alignment-score threshold of at least 20 (for ~36bp reads)

  5. Finding the unmapped end • Given a unique anchor for one end, find the locus of its unmapped mate and the fragments the mate splits into • 2 fragments if it is a deletion • 3 fragments if it is an insertion

  6. Finding the unmapped end • Due to a deletion (must be supported by >=2 reads) • The user specifies the max. deletion size, Min_F & Min_C
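The two-fragment idea for deletions can be illustrated with a brute-force search. This is a minimal sketch, not Pindel's pattern-growth algorithm: `find_deletion`, `window_start` and `min_flank` are hypothetical names introduced here, and only the maximum deletion size corresponds to a parameter named on the slide. The read is split at every position; the left fragment is placed on the reference and the right fragment is looked for up to `max_del` bases further downstream.

```python
def find_deletion(reference, read, window_start, max_del, min_flank=3):
    """Brute-force split-read search for a single deletion (illustrative only).

    Returns a sorted list of (deletion_start, deletion_length) candidates,
    where deletion_start is the 0-based index of the first deleted base.
    """
    hits = set()
    n = len(read)
    # Try every split that leaves at least `min_flank` bases on each side.
    for split in range(min_flank, n - min_flank + 1):
        left, right = read[:split], read[split:]
        pos = reference.find(left, window_start)
        while pos != -1:
            del_start = pos + split
            # The right fragment must resume 1..max_del bases downstream.
            for d in range(1, max_del + 1):
                if reference.startswith(right, del_start + d):
                    hits.add((del_start, d))
            pos = reference.find(left, pos + 1)
    return sorted(hits)


ref = "TTGACCGTAAGCTAGGCATCCGATTGCAACGGTACTGAAC"
read = ref[5:13] + ref[20:28]          # read spanning a 7 bp deletion
print(find_deletion(ref, read, 0, 10))  # -> [(13, 7)]
```

In real data the same deletion can yield several shifted candidates when the flanking bases repeat, so a production tool also needs a rule for reporting one canonical breakpoint and for counting supporting reads.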

  7. Finding the unmapped end • Due to an insertion (<=20bp for 36bp reads) • The insertion must be supported by >=2 reads • Compute the minimum and maximum unique substrings (US) at both the 5’ and 3’ ends of the unmapped read • Check whether minUS_5’ is adjacent to maxUS_3’ and vice versa • The region between minUS_5’ and minUS_3’ is the inserted fragment

  8. Outline of Stampy • Scanning the read • Phred scores • Similarity Filtering • Single End - Mapping Posterior • Paired-end reads: paired-end candidates

  9. Scanning the read • Overlapping 15mers are considered • Including 1-mismatch ‘neighbours’ • For reads >34bp and <50bp long, 1-mismatch neighbours are considered for half of the 15mers • For reads >=50bp long, only a third of the 15mers are considered
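The scanning step can be sketched in a few lines. This is an illustration only, not Stampy's implementation; `kmers` and `one_mismatch_neighbours` are hypothetical helper names. It shows how overlapping 15mers are enumerated and why neighbours are expensive: each 15mer has 15 × 3 = 45 one-mismatch neighbours.

```python
def kmers(read, k=15):
    """Return (offset, k-mer) for every overlapping k-mer in the read."""
    return [(i, read[i:i + k]) for i in range(len(read) - k + 1)]


def one_mismatch_neighbours(kmer, alphabet="ACGT"):
    """Return all sequences that differ from `kmer` at exactly one position."""
    neighbours = []
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                neighbours.append(kmer[:i] + alt + kmer[i + 1:])
    return neighbours


read = "ACGTTGCAAGGCTTAACCGGATCCGATTACAGGCAT"  # a 36 bp read
print(len(kmers(read)))                          # 22 overlapping 15mers
print(len(one_mismatch_neighbours(read[:15])))   # 45 neighbours per 15mer
```

Restricting neighbours to a fraction of the 15mers, as the slide describes for longer reads, cuts the number of hash lookups roughly in proportion.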

  10. Phred scores • Corresponding positions of the read are marked with a Phred score • 0, if it is repetitive (>200 occurrences in the reference); for its 1-neighbor, it is marked by the Phred quality of the mutated base • All positions of non-repetitive 15mers are retrieved • The scores are used to calculate the mapping posterior later

  11. Similarity Filtering • Three 4-nt words close to, but not overlapping, the 15mer are chosen • Count A, C, G and T over these 12 read positions • Count A, C, G and T over the corresponding 12 positions at the putative genomic location • Sum the absolute differences between the two sets of counts (read and reference) to obtain a score • Candidate positions whose score exceeds the threshold T are discarded
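The composition check above amounts to comparing two base-count vectors. The sketch below is illustrative; the function names and the example threshold are assumptions, not values from Stampy.

```python
from collections import Counter


def base_counts(seq):
    """Counts of A, C, G, T in `seq`, in that order."""
    counts = Counter(seq)
    return [counts[b] for b in "ACGT"]


def composition_distance(read_bases, ref_bases):
    """Sum of absolute differences between the two A/C/G/T count vectors."""
    return sum(abs(r - g) for r, g in
               zip(base_counts(read_bases), base_counts(ref_bases)))


def passes_filter(read_bases, ref_bases, threshold):
    """Keep a candidate location only if the distance does not exceed the threshold."""
    return composition_distance(read_bases, ref_bases) <= threshold


# Twelve read positions (three 4-nt words) versus the putative genomic location.
print(composition_distance("AAAACCCCGGGG", "AAAACCCCGGGT"))  # 2
```

Because the comparison ignores base order, it is cheap and tolerant of small shifts, at the cost of letting through some candidates that a full alignment would reject.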

  12. Single End - Mapping Posterior • Probability that a mapping is incorrect: $1 - P(\mathrm{read} \mid L_{opt}) / \sum_i P(\mathrm{read} \mid L_i)$ • $L_{opt}$ is the maximum-likelihood mapping location • The sum runs over all considered locations $L_i$ • This is only an approximation, as the correct location may not be among the considered $L_i$ when: (1) the read contains highly repetitive 15mers, (2) the read is of low quality or highly diverged from the reference, (3) the sequence is not represented in the reference • The final mapping Phred score sums the contributions of 1, 2 and 3
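A small numerical sketch of the posterior on this slide follows. It covers only the ratio of likelihoods over the considered locations and not the three extra error sources; the function names and the Phred cap of 60 are assumptions made for illustration.

```python
import math


def mismap_probability(likelihoods):
    """1 - P(read | L_opt) / sum_i P(read | L_i) over the considered locations."""
    total = sum(likelihoods)
    return 1.0 - max(likelihoods) / total


def phred(p_error, cap=60.0):
    """Convert an error probability to a Phred-scaled score, capped at `cap`."""
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


# Two candidate locations: one strong, one weak.
p_wrong = mismap_probability([0.9, 0.1])
print(round(p_wrong, 3), round(phred(p_wrong), 1))
```

A read with a single candidate gets an error probability of 0 under this formula, which is exactly why the slide adds separate terms for locations that were never considered.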

  13. Paired-end reads: paired-end candidates • The pair is unmapped if no candidates are found for both reads • Report the pair coordinates if • the best locations for the pair are within 4 SD of the mean insert length, OR • Phred score >=2 in (1 & 2) • Else • Candidates that constitute 99.9% of the posterior mapping score of the single read are extracted • The mate is then mapped against the reference region implied by the insert-length distribution
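Two pieces of this slide lend themselves to a short sketch: selecting the candidates that carry 99.9% of the posterior mass, and testing whether two positions lie within 4 SD of the mean insert length. The names below are hypothetical, and the insert length is simplified to the distance between the two mapping coordinates.

```python
def top_candidates(likelihoods, mass=0.999):
    """Return the fewest locations whose normalised likelihoods reach `mass`.

    `likelihoods` maps a location to P(read | location).
    """
    total = sum(likelihoods.values())
    kept, cumulative = [], 0.0
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    for location, lik in ranked:
        kept.append(location)
        cumulative += lik / total
        if cumulative >= mass:
            break
    return kept


def within_insert_window(pos1, pos2, mean_insert, sd, n_sd=4):
    """True if the distance between the two positions is within n_sd SDs of the mean."""
    return abs(abs(pos2 - pos1) - mean_insert) <= n_sd * sd


print(top_candidates({"a": 0.9, "b": 0.0995, "c": 0.0005}))  # ['a', 'b']
print(within_insert_window(100, 400, mean_insert=300, sd=20))  # True
```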

  14. Paired-end reads: paired-end candidates • Final mapping quality • Product of the top-scoring single-end hits selected as the pair • Or the single-end posterior score of the anchoring hit

  15. LAST • Uses probabilistic alignment instead of maximum score-based alignment • Based on posterior decoding technique which uses marginal probabilities that incorporate all possible alignments with quality scores

  16. Outline of LAST • Incorporating quality scores into a score matrix • Probabilistic model for alignment • Marginal Probabilities • Probabilistic alignments with quality scores • γ-centroid alignment • LAMA alignment

  17. Incorporating quality scores into a score matrix • Old Method: Sa,b is the substitution score of aligning nucleotide reference-a onto read-b • Incorporate quality-score, q, into S • T is a scaling factor

  18. Probabilistic model for alignment • Let S(A) be the score for alignment A. • For a local alignment A, the probability of A • x is the genome region • y is the read-base with a quality score • S(A) is computed from the ‘new’ substitution score matrix
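The formula for the probability of an alignment is missing from this transcript. A form commonly used for score-based probabilistic alignment, given here as an assumption and not as the slide's exact equation, turns scores into probabilities with a Boltzmann-style normalisation. $T$ is the scaling factor mentioned on the previous slide, and $Z$ is a normalising constant (a symbol introduced here) summing over all candidate local alignments $A'$ of $x$ and $y$:

```latex
P(A \mid x, y) = \frac{\exp\left(S(A)/T\right)}{Z},
\qquad
Z = \sum_{A'} \exp\left(S(A')/T\right)
```

Under this form, an alignment whose score is higher by $T \ln 2$ is twice as probable, which is what makes the marginal probabilities on the next slide well defined.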

  19. Marginal Probabilities • Pik is the marginal probability that a base xi (i-th base of x) aligns with a base yk (k-th base of y) • qi is the marginal probability that a base xi aligns with a gap • Ui is the marginal probability that xi belongs to an un-aligned region that is not contained in the local alignment

  20. Probabilistic alignments with quality scores • Two methods that take quality scores into account by using the marginal probabilities • γ-centroid alignment • LAMA alignment

  21. γ-centroid alignment • Maximizing S(A) for alignment A • γ is a parametric input • xi~yk is an aligned column (without gaps) in A • Computed with the Needleman-Wunsch (NW) algorithm

  22. Parameter γ • Adjusts the sensitivity and precision of the aligned columns • When γ is low, LAST is conservative and only aligns bases with high probabilities • When γ is high, more columns are aligned at the cost of more false positives • Drawback of γ-centroid: even with a low γ, LAST alignments may still contain many gaps
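The trade-off controlled by this parameter can be sketched with a small Needleman-Wunsch style dynamic programme. This is an illustration under an assumption: it uses the column score `(gamma + 1) * p_ik - 1` from the centroid-estimator literature, with gaps scoring zero, and `gamma` stands for the slide's tuning parameter. It is not LAST's code, and the function name is hypothetical.

```python
def centroid_alignment(p, gamma):
    """Pick aligned columns (i, k) maximising sum((gamma + 1) * p[i][k] - 1).

    p[i][k] is the marginal probability that x_i aligns with y_k.
    Columns must be ordered consistently (no crossings); gaps score 0.
    """
    n = len(p)
    m = len(p[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            match = best[i - 1][k - 1] + (gamma + 1.0) * p[i - 1][k - 1] - 1.0
            best[i][k] = max(match, best[i - 1][k], best[i][k - 1])

    # Trace back, preferring gaps on ties so that only useful columns are kept.
    pairs = []
    i, k = n, m
    while i > 0 and k > 0:
        if best[i][k] == best[i - 1][k]:
            i -= 1
        elif best[i][k] == best[i][k - 1]:
            k -= 1
        else:
            pairs.append((i - 1, k - 1))
            i -= 1
            k -= 1
    return pairs[::-1]


marginals = [[0.90, 0.05],
             [0.05, 0.60]]
print(centroid_alignment(marginals, gamma=1.0))  # [(0, 0), (1, 1)]
print(centroid_alignment(marginals, gamma=0.5))  # [(0, 0)]
```

With this score a column is only worth including when its probability exceeds 1 / (gamma + 1), so lowering the parameter drops the 0.60 column and leaves that base unaligned, which is also how gaps can accumulate.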

  23. LAMA alignment • Considers aligned columns and gaps explicitly • For the gaps: deletions in the alignment and insertions in the alignment
