Aligning Reads
Ramesh Hariharan
Strand Life Sciences, IISc

What is Read Alignment?
Subject’s Genome: AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC
Reference Genome: AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
Reads come from the Subject’s Genome. Where do these match in the Reference?
The Reference Genome is close to, but not quite the same as, the Subject’s Genome.
Three kinds of read-to-reference match:
Exact Match: GCTACGCA
With Mismatches: CATAAAGAC
With Gaps: CACTT_AGT
Reference Genome: AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
Mismatches and Gaps
(Figure: Reads aligned against the Reference Genome, showing a SNP and a Deletion)
Short reads (~50 bases): few mismatches and gaps
Long reads (~1000 bases): many more mismatches and gaps
Separate handling for RNASeq
No handling of adaptor trimming for small RNA
BWA: very few mismatches and gaps
BowTie: only mismatches, no gaps
CoBWeb
BWA-SW: many mismatches and gaps
BowTie2
No paired read handling
For each read, scan the entire reference genome sequence SLOW!!!!
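The scan described above can be sketched as follows. This is a minimal illustration, not any aligner's actual code: the function name `naive_scan` and the `max_mismatches` parameter are hypothetical, and gaps are not handled at all. It slides the read across every position of the reference, which is why the cost grows with (reference length × read length) for every single read.

```python
def naive_scan(reference, read, max_mismatches=0):
    """Return (position, mismatches) for every placement within the budget.

    Hypothetical helper for illustration; positions are 0-based.
    """
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        # Count positions where the read and the reference window differ.
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Reference and reads taken from the earlier slides.
reference = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
print(naive_scan(reference, "GCTACGCA"))       # exact match only
print(naive_scan(reference, "CATAAAGAC", 1))   # allow one mismatch
```

With a genome-sized reference and millions of reads, this per-read full scan is the cost that an index is meant to avoid.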
Index the Reference
(Figure: a tree-shaped index built over the Reference; the edge labels are not recoverable from the slide text)
How can we find Exact Matches of a read quickly with this index?
(Figure: the tree-shaped index over the Reference; the edge labels are not recoverable from the slide text)
The Burrows-Wheeler based Index
The Reference: C G A C $
All its circular shifts, sorted lexicographically (here $ sorts after the letters). The last column is the BWT:
A C $ C G
C G A C $
C $ C G A
G A C $ C
$ C G A C
The Index: now an array instead of a tree
Sampled to reduce memory at the expense of speed (Ferragina and Manzini)
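A minimal sketch of this construction and of exact-match counting with it, assuming (as the sorted rows on the slide suggest) that $ sorts after the letters. The names `bwt_rotations` and `count_matches` are hypothetical. For clarity the sketch sorts full rotations and recounts characters on every step; a real index would store sampled counts instead, which is the memory/speed trade-off the slide mentions.

```python
def bwt_rotations(reference, end="$"):
    """Sort all circular shifts of reference+end; return (rows, BWT string)."""
    text = reference + end
    # (ch == end, ch) makes the end marker sort after every other character.
    order = lambda row: [(ch == end, ch) for ch in row]
    rows = sorted((text[i:] + text[:i] for i in range(len(text))), key=order)
    return rows, "".join(row[-1] for row in rows)


def count_matches(bwt, pattern, end="$"):
    """Count exact occurrences of pattern using backward search on the BWT."""
    alphabet = sorted(set(bwt), key=lambda ch: (ch == end, ch))
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total            # first sorted row that starts with ch
        total += bwt.count(ch)
    lo, hi = 0, len(bwt)             # current row range [lo, hi)
    for ch in reversed(pattern):     # extend the match one character leftwards
        if ch not in first:
            return 0
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


rows, bwt = bwt_rotations("CGAC")
print(rows)                          # same row order as on the slide
print(bwt)                           # G$ACC
print(count_matches(bwt, "AC"))      # 1
print(count_matches(bwt, "C"))       # 2
```

The search loop runs once per pattern character, independent of the reference length, which is what makes exact matching against the index fast.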
BWA, BWA-SW and BowTie force mismatches and gaps into the BW Index searching procedure
CoBWeb uses the BW Index to find a ‘seed’ exact match and does Smith-Waterman around this seed
This 15-mer occurs at locations x1, x2…
This 15-mer occurs at locations x3, x4…
This whole 30-mer occurs at location x5
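The seeding idea can be sketched as below. This is an illustration under stated assumptions, not CoBWeb's implementation: a plain dictionary of k-mers stands in for the BW Index, the seed length `k` is shortened so the toy reference from the earlier slides can be used, and the names `build_kmer_index` and `candidate_locations` are hypothetical. Each seed hit votes for the reference position where the whole read would start; positions supported by several seeds are the ones worth passing on to Smith-Waterman.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer to its 0-based positions (stand-in for the BW Index)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_locations(read, index, k):
    """Split the read into non-overlapping k-mer seeds and tally implied starts."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            votes[pos - offset] += 1     # where the read would begin
    # Most-supported candidate first; ties broken by position.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


reference = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
index = build_kmer_index(reference, k=4)
# A read from the Subject's Genome; its last seed contains a mismatch.
print(candidate_locations("TCCCATAAAGAC", index, k=4))   # [(12, 2)]
```

Here two of the three seeds agree on position 12, while the seed that overlaps the mismatch finds no exact hit, so the mismatch is left for the dynamic programming step to resolve.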
Dynamic Programming
• Given a location in the reference with a read anchor, how well does the read match here? (Figure: the Read aligned to the Reference around a 14-mer anchor)
• Smith-Waterman (optimized for large gaps)
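For reference, a minimal sketch of textbook Smith-Waterman local alignment scoring. It uses a simple linear gap penalty, so it is not the large-gap-optimized variant the slide refers to; the scoring values and the function name `smith_waterman` are assumptions chosen for illustration. In practice the `window` argument would be a slice of the reference around the anchor rather than the whole reference.

```python
def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Return (best local score, (read_end, window_end)) using 1-based ends."""
    rows, cols = len(read) + 1, len(window) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == window[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a fresh local alignment
                score[i - 1][j - 1] + step, # match or mismatch
                score[i - 1][j] + gap,      # read base against a gap
                score[i][j - 1] + gap,      # reference base against a gap
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    return best, best_cell


reference = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"
print(smith_waterman("GCTACGCA", reference))    # exact read: 8 matches
print(smith_waterman("CATAAAGAC", reference))   # read with one mismatch
```

The table has (read length × window length) cells, which is why it pays to restrict the window to the neighbourhood of a seed instead of running this over the entire reference.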
Comparison with BWA
BWA: 2 mismatches + 1 gap, possibly longer than one base
CoBWeb: 3 mismatches and 2 gaps
(Figure panels: Read Length 50, Read Length 150)
20% faster than BWA with comparable results
Comparison with BWA-SW
8 mismatches plus 10 gaps
Read Length 400
5650 mapped incorrectly by BWA-SW
The remainder has poor BWA mapping quality
Avadis NGS: Alignment, DNA Variant Detection, RNASeq, ChIPSeq, Small RNASeq