Biostatistics-Lecture 16 • Sequence alignment based on Burrows-Wheeler Transformation • Ruibin Xi • Peking University School of Mathematical Sciences
Burrows-Wheeler Transformation • BWT: ACGGTACA$ ($<A<C<G<T)
Burrows-Wheeler Transformation • BWT: T=ACGGTACA$ ($<A<C<G<T) • BWT(T) is the last column of the matrix of sorted rotations of T (the BWM); here BWT(T) = ACT$AACGG
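The rotation-matrix definition can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code; the function name `bwt_by_rotation` is an assumption, and it relies on '$' sorting before A, C, G, T in ASCII, which matches the order $<A<C<G<T.

```python
def bwt_by_rotation(text):
    """Return the BWT of `text` by sorting all of its cyclic rotations.

    Assumes `text` ends with a unique sentinel '$' (hypothetical helper,
    written only to illustrate the definition).
    """
    # Build every cyclic rotation and sort them lexicographically (the BWM).
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The BWT is the last column of the sorted matrix.
    return "".join(row[-1] for row in rotations)


print(bwt_by_rotation("ACGGTACA$"))  # ACT$AACGG
```

Sorting the full rotations keeps the sketch close to the definition, at the cost of quadratic memory.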
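Burrows-Wheeler Transformation • Last-First Mapping (LF): the i-th occurrence of a character c in the last column BWT(T) and the i-th occurrence of c in the first column correspond to the same character of T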
Burrows-Wheeler Transformation • Last-First Mapping (LF) • We may recover the original sequence using the LF mapping
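One way to see how the LF mapping recovers the sequence is the sketch below. It is an illustration under stated assumptions, not the lecture's code: `lf_mapping` and `inverse_bwt` are hypothetical names, the LF mapping is obtained from a stable sort of the last column (which pairs the i-th c in the last column with the i-th c in the first column), and the input is assumed to contain exactly one '$'.

```python
def lf_mapping(bwt):
    """Return a list lf where lf[k] is the first-column row matching L[k]."""
    # Python's sort is stable, so equal characters keep their order in L;
    # order[f_row] is the row of L holding the character found at F[f_row].
    order = sorted(range(len(bwt)), key=lambda k: bwt[k])
    lf = [0] * len(bwt)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    return lf


def inverse_bwt(bwt):
    """Recover the text (ending in '$') by walking the LF mapping backwards."""
    lf = lf_mapping(bwt)
    row = 0  # row 0 of the BWM starts with '$', so L[0] is the character before '$'
    chars = []
    for _ in range(len(bwt) - 1):
        chars.append(bwt[row])  # characters come out from right to left
        row = lf[row]
    return "".join(reversed(chars)) + "$"


print(inverse_bwt("ACT$AACGG"))  # ACGGTACA$
```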
BWT via the suffix array • Suffix Array (SA): SA[i] is the starting position in T of the i-th smallest suffix of T
BWT via the suffix array • Relationship of BWT and suffix array (0-based index): BWT(T)[i] = T[SA[i]-1] if SA[i] > 0, and BWT(T)[i] = $ if SA[i] = 0
BWT via the suffix array • Construction of the BWT by matrix rotation is slow (it sorts n rotations of length n) • There are O(n) algorithms for constructing the suffix array • We may construct the BWT via the suffix array
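The relationship between the suffix array and the BWT can be sketched as follows. The suffix array here is built by naively sorting suffixes, which stands in for the O(n) construction algorithms and is only suitable for small examples; the names `suffix_array` and `bwt_from_sa` are assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix they begin.

    This is NOT a linear-time construction; it is a stand-in for illustration.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] = text[SA[i] - 1], with '$' when SA[i] == 0 (0-based)."""
    # For SA[i] == 0, Python's index -1 wraps around to the final '$'.
    return "".join(text[i - 1] for i in sa)


text = "ACGGTACA$"
sa = suffix_array(text)
print(sa)                     # [8, 7, 5, 0, 6, 1, 2, 3, 4]
print(bwt_from_sa(text, sa))  # ACT$AACGG
```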
FM-index • C(c): # of occurrences in T of the characters smaller than c, i.e. {$,1,…,c-1} • 1=A, 2=C, 3=G, 4=T • C(c) is the 0-based position of the first occurrence of c in F (the 1st column of the BWM) • Occ(c,0,k): # of occurrences of c in BWT(T)[0..k] (0-based, inclusive), with Occ(c,0,-1) = 0
FM-index • LF(k) = C(L[k]) + Occ(L[k],0,k) - 1, where L = BWT(T) and k is a 0-based row index
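The two tables and the LF formula can be sketched as below, using the BWT of the running example T = ACGGTACA$, which is ACT$AACGG. The helper names `build_c`, `build_occ` and `lf` are hypothetical, and the Occ table is stored in full (one counter per character per row), which a practical FM-index would typically avoid by sampling.

```python
def build_c(bwt):
    """C[c] = number of characters in the text that are smaller than c."""
    c_table, total = {}, 0
    for ch in sorted(set(bwt)):
        c_table[ch] = total
        total += bwt.count(ch)
    return c_table


def build_occ(bwt):
    """occ[c][k + 1] = Occ(c, 0, k); occ[c][0] = 0 covers the case k = -1."""
    occ = {ch: [0] * (len(bwt) + 1) for ch in set(bwt)}
    for k, sym in enumerate(bwt):
        for ch in occ:
            occ[ch][k + 1] = occ[ch][k] + (ch == sym)
    return occ


def lf(k, bwt, c_table, occ):
    """LF(k) = C(L[k]) + Occ(L[k], 0, k) - 1, all indices 0-based."""
    ch = bwt[k]
    return c_table[ch] + occ[ch][k + 1] - 1


bwt = "ACT$AACGG"
c_table = build_c(bwt)
occ = build_occ(bwt)
print(c_table)  # {'$': 0, 'A': 1, 'C': 4, 'G': 6, 'T': 8}
print([lf(k, bwt, c_table, occ) for k in range(len(bwt))])
# [1, 4, 8, 0, 2, 3, 5, 6, 7]
```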
Searching a pattern P using the FM-index • Note that the rows of the BWM that begin with a given pattern P are always contiguous, because the rows are sorted lexicographically (e.g. AC, ba, ab)
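A quick check of the contiguity property on the running example T = ACGGTACA$, written as a small sketch (the variable names are illustrative only):

```python
text = "ACGGTACA$"
# Rows of the BWM: all cyclic rotations in sorted order.
rows = sorted(text[i:] + text[:i] for i in range(len(text)))
# Indices of the rows that begin with the pattern AC.
hits = [r for r, row in enumerate(rows) if row.startswith("AC")]
print(hits)  # [2, 3]
# The hits form one unbroken range of row indices.
print(hits == list(range(hits[0], hits[-1] + 1)))  # True
```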
Searching a pattern P using the FM-index • P = ACA • Start with the suffixes that begin with A, the last character of P
Searching a pattern P using the FM-index • P = ACA • The suffixes starting with A are the rows [sp,ep] = [C(A),C(C)-1]
Searching a pattern P using the FM-index • P = ACA • Next, find the suffixes starting with CA
Searching a pattern P using the FM-index • From the last step, among the rows [sp,ep] of the A section, the first A preceded by C is the C of rank Occ(C,0,sp-1) in BWT(T), and the last is the C of rank Occ(C,0,ep)-1 (0-based ranks)
Searching a pattern P using the FM-index • The suffixes starting with CA must be in the C section, at these ranks counted from its first row C(C)
Searching a pattern P using the FM-index • The suffixes starting with CA are the rows [sp,ep]=[Occ(C,0,sp-1)+C(C),Occ(C,0,ep)+C(C)-1]
Searching a pattern P using the FM-index • The suffixes starting with ACA must be in the A section
Searching a pattern P using the FM-index • The suffixes starting with ACA are the rows [sp,ep]=[Occ(A,0,sp-1)+C(A), Occ(A,0,ep)+C(A)-1], where sp and ep on the right-hand side are those of the CA step
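The three steps can be checked by hand on T = ACGGTACA$. The values below are worked out for this example and are not taken from the slides: BWT(T) = ACT$AACGG, SA = [8,7,5,0,6,1,2,3,4], C(A) = 1 and C(C) = 4, with 0-based, inclusive Occ.

```latex
\begin{align*}
\text{A:}   \quad & [sp, ep] = [C(A),\; C(C) - 1] = [1,\; 3] \\
\text{CA:}  \quad & [sp, ep] = [\mathrm{Occ}(C,0,0) + C(C),\; \mathrm{Occ}(C,0,3) + C(C) - 1]
                            = [0 + 4,\; 1 + 4 - 1] = [4,\; 4] \\
\text{ACA:} \quad & [sp, ep] = [\mathrm{Occ}(A,0,3) + C(A),\; \mathrm{Occ}(A,0,4) + C(A) - 1]
                            = [1 + 1,\; 2 + 1 - 1] = [2,\; 2]
\end{align*}
```

The final interval contains one row, and SA[2] = 5, so ACA occurs once in T, starting at 0-based position 5.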
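Searching a pattern P using the FM-index • Algorithm BW_Search(P[0,p-1])
    c = P[p-1]; i = p-1                  // start from the last character of P
    sp = C[c]; ep = C[c+1] - 1           // c+1 is the next character in the alphabet; C[c+1] = n if c is the largest
    while (sp ≤ ep and i ≥ 1) do
        c = P[i-1]                       // extend the match one character to the left
        sp = C[c] + Occ(c,0,sp-1)
        ep = C[c] + Occ(c,0,ep) - 1
        i = i - 1
    if (ep < sp) then return "not found" else return "found (ep-sp+1) occurrences"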
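A runnable Python sketch of the backward search is given below. It follows the BW_Search pseudocode but is a toy written under several assumptions: the function name `bw_search` is hypothetical, the suffix array is built by naive sorting, the Occ table is stored in full, the pattern is assumed to be non-empty, and the function returns the 0-based match positions (read from the suffix array) instead of only a count.

```python
def bw_search(pattern, text):
    """Return sorted 0-based start positions of `pattern` in `text`.

    `text` must end with the sentinel '$'; `pattern` must be non-empty.
    """
    # Toy index construction (naive suffix sort, full Occ table).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    occ = {ch: [0] for ch in alphabet}  # occ[c][k + 1] = Occ(c, 0, k)
    for sym in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (ch == sym))

    # Backward search, starting from the last character of the pattern.
    c = pattern[-1]
    if c not in c_table:
        return []
    sp = c_table[c]
    ep = sp + text.count(c) - 1  # equals C[c+1] - 1 in the pseudocode
    i = len(pattern) - 1
    while sp <= ep and i >= 1:
        c = pattern[i - 1]
        if c not in c_table:
            return []
        new_sp = c_table[c] + occ[c][sp]      # C[c] + Occ(c, 0, sp-1)
        ep = c_table[c] + occ[c][ep + 1] - 1  # C[c] + Occ(c, 0, ep) - 1
        sp = new_sp
        i -= 1
    # An empty interval (ep < sp) yields an empty list, i.e. "not found".
    return sorted(sa[r] for r in range(sp, ep + 1))


print(bw_search("ACA", "ACGGTACA$"))  # [5]
print(bw_search("AC", "ACGGTACA$"))   # [0, 5]
print(bw_search("TT", "ACGGTACA$"))   # []
```

The number of occurrences reported by the pseudocode, ep-sp+1, is the length of the returned list.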
Aligners Based on BWT • Bowtie • BWA