200 likes | 370 Views
Hobbes. A. Ahmadi et al, “Hobbes: optimized gram-based methods for efficient read alignment”, Nucleic Acids Research 2011 J. Kim et al, “Improving read mapping using additional prefix grams”, BMC Bioinformatics 2014 Hobbes software package, http://hobbes.ics.uci.edu. 전북대학교 컴퓨터공학부 김종익.
E N D
Hobbes • A. Ahmadi et al, “Hobbes: optimized gram-based methods for efficient read alignment”, Nucleic Acids Research 2011 • J. Kim et al, “Improving read mapping using additional prefix grams”, BMC Bioinformatics 2014 • Hobbes software package, http://hobbes.ics.uci.edu 전북대학교 컴퓨터공학부 김종익
Table of Contents • Genome, read, and read mapping • Distance measure • Overview of gram based approaches • Optimized gram selection • Refinement techniques • Bit vector approach • Additional prefix gram approach • Handling indels and verification • Implementation Details • Experimental Results
Genome, read, read mapping • In the viewpoint of computation, genome is a sequence of A,C,G,T • For example: ... AAAACAGCAACTGATATCAGAA … (C. Elegans - a worm) • A read is a short (35bp ~ 150bp) fragment of a genome generated by a NGS (next generation sequencing) sequencer • Read mapping (or read alignment) problem • Given a reference sequence g and a read r, find all locations (the best location/up to k locations/uniquely mapped location) in g where r appears. • Consideration: finding an exact mapping location of a read may not be possible due to sequencing errors and/or some genetic variations. Reference sequence ... AAAACAGCAACTGATATCAGAA … ATATCAGCA Read
Distance measure Hamming distance between two strings • The number of substituted characters • H(r, s) = 2 Edit distance between two strings • The number of insertion, deletion, and substitution of characters • ED(r, s) = 3 r = ACCCGT s = AGCCCT r = A CCGTA s = AGCC TC
Overview of gram based approaches (1/2) q-gram: a subsequence of genome/read of length q e.g. s= TGCCCTA 3-grams of s, denoted by G(s), is {TGC, GCC, CCC, CCT, CTA} Count filtering:H(r, s) ≤ d |G(r) ∩ G(s)| ≥ T = max(|G(r)|, |G(s)|) – d * q d substitutions could affect (or mutate) d *qgrams u n i v e r s a l 2 substitutions mutates 2 * 2 bi-grams
Overview of gram based approaches (2/2) In our problem, H(r, s) ≤ d |G(r) ∩ G(s)| ≥ T = |r| – q + 1 – d * q 105 112 118 490 Reference … ACGGT CGGTC ACCCT 5-grams … … … CGGTC: 106 – 1 = 105 ACGGT: 105 – 0 = 105 118 – 0 = 118 ACGGT: 105 – 13 = 92 118 – 13 = 105 112490 105 118 106 Inverted lists … … … Read ACGGTCTTCCCTACGGT Location 105 is a candidate Hamming distance ≤ 2 T = 17 – 5 + 1 – 2*5 = 3
Optimized gram selection (1/2) Consider we select d non-overlapping q-grams from a read r. If a genome subsequence s does not have any of selected q-grams, then H(r, s) > d e.g. ACGGTCTTCCCTACGGT Given a read r with a Hamming distance threshold d, select d + 1 non-overlapping q-grams from r. find those genome subsequences that contains at least 1 selected q-gram of r. ACGGT CCCTA … … 105 118 113 candidate locations = {105– 0, 118 – 0, 113 – 8} = {105, 118} … …
Optimized gram selection (2/2) To minimize the number of candidate locations, we need to select d + 1 non-overlappinggrams from a read such that the union size of inverted lists of the selected grams is minimized.How do we know the union size of inverted lists without performing an actual union operation? Assumption: the union size of inverted lists is proportional to the sum of the sizes of the inverted lists. (d – i – 1) * q |G(r)| –d*q q * (i – 1) + j … … … … j Recurrence GoalM(d + 1, |G(r)| –d*q) M(i, j – 1) M(i– 1, j) + L[j + (i – 1)*q].len j M(i, j) = min select i - 1 grams select d - i - 1 grams select 1 gram (1≤ i≤ d + 1 and 1 ≤j≤ |G(r)| –d*q) q
*Dynamic Programming Example: 5-grams and a hamming distance threshold is 2
Refinement techniques (1/4)Bit vector based approach (Hobbes v. 1.x) Bit vector (default size = 16) construction use 1 bit per character (e.g. A,T = 1 and C,G = 0) for each q-gram in a reference sequence,encode left 8 characters and right 8 characters of the q-gram to a bit vector for each position in an inverted list, attach a bit vector (default size = 16 bit) Pruning using bit vectors make a bit vector for a read shift away the bits of the matching q-gram build a mask to remove invalid bits count the number of different bits between bit vectors of a candidate and the read
Refinement techniques (2/4)Additional prefix q-gram based approach (Hobbes 2) Given a read r with a Hamming distance threshold d, select d + 2 non-overlapping q-grams from r. find those genome subsequences that contains at least 2 selected q-gram of r. IN(g1) IN(g2) Union of pairwise intersections of inverted lists IN(g1) IN(g2) Intuition Questions E A F D B C • How do we select optimald + 2non-overlapping q-grams? • How about using d + 3 or d + 4 q-grams instead of d + 2? IN(g3) If we use an additional q-gram g3, it plays an important role of filtering candidates generated by taking union of inverted lists of g1 and g2.
Refinement techniques (3/4)Additional prefix q-gram based approach (Hobbes 2) Selecting optimald + 2 non-overlapping q-grams • Two assumptions • The union size of inverted lists is proportional to the sum of the sizes of the inverted lists. • Locations in an inverted list are independently distributed and uniformly random. Probability of the occurrence of a q-gram g in a subsequence: by the second assumption Therefore, Our goal is to minimize Dropping the second term still gives us an upper bound of the number of candidates. Hence, our simplified problem is to minimize Reuse the recurrence and compute !!!
Refinement techniques (3/4)Additional prefix q-gram based approach (Hobbes 2) Generalization: generating candidates using ‘d + c’ non-overlapping prefix q-grams We observed that only the first additional prefix q-gram brings a substantial improvement and the effects of additional q-grams other than the first one are not significant. As we increase c, we may lose chances to select low frequency q-grams and as a result, the cost of scanning inverted lists would eventually outweigh the savings from reducing the number of candidate locations. Moreover, supporting indelswill be more complicated. Thus, we believe that it is not worthwhile to try to find an optimal c value in practice.
Handling Indels and Verification Indels (insertion or deletion operations) can cause the following problems. (a) indel between matched q-grams Read We allow gaps between two matched q-grams up to the edit distance threshold. That is, we treat two locations appearing only once as candidate locations if their difference is within the edit distance threshold. 0 1 2 3 4 5 6 7 9 10 11 12 13 14 … 8 Genome (b) deletion before matched q-grams Read Using a banded semi-global alignment algorithm verification window of [a, d] verification window of [l, d] + (optionally) verification window of [a, c] 0 3 4 5 6 7 8 9 10 11 12 13 14 … 1 2 Genome (c) insertion before matched q-grams Read Read k k k 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 … a l b c d Genome Genome
Implementation Details (1/2) The effect of the gram length small large q = ? If we use large q, we reduce sizes of inverted lists and as a result we generate fewer candidates. As we increase q size, however, the number of different grams grows sharply and we need much time to look up an inverted index to find an inverted list of each gram. Increasing q size also affects selecting good grams. H/W acceleration Other filtering techniques • Cache prefetching of verification data • CPU popcount instruction • Letter count filter • Chunk-by-chunk comparison(for Hamming distance)
Implementation Details (2/2) The structure of the Hobbes software package (http://hobbes.ics.uci.edu) search/searcher.cc hobbes.cc main create a runner object set parameter values in runner.conf runner.run() searcher.findMappings() call either findAllMappings() or findKMappings() gather statistic information hobbesconf.cc search/singleendsearcher.cc searcher.findAllMappings() figure out the query type (forward, reverse, or both) call fillRead(), which determines prefix grams using DP call mergeAndVerifyAll() search/searcher.cc hobbesrunner.cc runner.run() create an input stream for the read file(s) - record the length of the first read. load the reference sequence into memory. load the index into memory. create progress indicator create a thread for input - input queue is protected by a mutex create a thread for output - output queue is protected by a mutex create threads for searchers - searchworker.run(), which calls searcher.findMappings() for each read search/singleendsearcher.cc search/singleendsearcher.cc searcher.mergeAndVerifyAll () call merger.findCandidateAll(), which generates candidates by merging the inverted lists of prefix grams call verifier.verifyCandidates(), which verifies candidates search/singleendmerger.ccsearch/singleendmergerplus.cc search/singleendverifier.cc thread/searchworker.cc Hobbes determines a filtering scheme (either bit vectors or additional prefix)using the read length and distance threshold.It calls different findCandidateAll functions based on its filtering scheme.
Experimental Results (1/2) Experiments on simulated data
Experimental Results (2/2) Experiments on real data
Conclusions We developed a fast and accurate read mapper !!!