450 likes | 555 Views
Index-based search of single sequences. Omkar Mate CS 374 Stanford University. Motivation. A newly discovered gene …. (may hold clues to evolution of human brain capacity ). Occurrence in other species Mutation …. Sequence Alignment. New Query. Existing Genome Database. …………………………
E N D
Index-based search of single sequences Omkar Mate CS 374 Stanford University
Motivation A newly discovered gene … (may hold clues to evolution of human brain capacity ) • Occurrence in other species • Mutation • …
Sequence Alignment New Query Existing Genome Database ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… gacatta Easy??? Think again………..
State of Biological Databases ~300 sequenced genomes
Alignment Problem • Assume we try Smith-Waterman: [running time = O(MN)] The entire genomic database 1011 Our new gene 104 Time and Space complexity: O(1015) Huge number! But we can do better …
Indexing-based Local Alignment BLAST- Basic Local Alignment Search Tool Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman! query DB hits
BLAST Step 1: Construct dictionary of query words (Query indexed by all words of size, k = 4) Query: AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… Index of query words
BLAST • Step 2 – Generate all the relatives of a word (Relative: a word with alignment score greater than a threshold, T) Query Word: ATGC, T: 50 Candidates Score Relatives! Update the index … (Query: ) AATGCCGATAGCATCG …
BLAST • Step 3: Searching • Search through database linearly, one word at a time • Initiate alignment with all occurrences of that word in query Database: ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC … ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC … Query: Index:
BLAST • Step 4 – Alignment Extension Once we find an alignment, extend to left and right with no gaps until alignment score falls below a certain threshold. ATGCCGATACGATCAGCTACGATCAG… ATGCCGATACGATCAGCTACGATCAG…
Sensitivity-Speed Tradeoff Longer words => Fewer alignments => Faster but Low chance of a match Shorter words => More alignments => High chance of a match but Slower
BLAT - The BLAST-Like Alignment Tool Similarities: • Rapid scans for relatively short matches (hits) • Extend these hits into high-scoring pairs (HSP) Differences: BLAST BLAT • Index of query sequence Index of database • Scan through database Scan through query sequence • Gaps not allowed in ext. Gaps allowed in extension • Returns smaller alignments Returns larger alignments
BLAT Strategies - I • Single perfect matches We do not allow any mismatch. Common intuition : fewer matches for longer words
Sensitivity and SpecificitySingle Perfect Nucleotide K-mer Matches as a Search Criterion
BLAT Strategies - II • Allow one mismatch Intuition: Higher number of matches for same word length => Better sensitivity (Caution: Keep k higher, else no. of matches will be huge)
Sensitivity and SpecificitySingle Near-Perfect Nucleotide K-mer Matches as a Search Criterion (one mismatch allowed)
BLAT Strategies - III Allow multiple perfect matches Two parameter: N: no. of matches, K: word size Practically: Same sensitivity, higher speed
Sensitivity and SpecificityMultiple Perfect Nucleotide K-mer Matches as a Search Criterion (2 and 3 perfect matches)
Seeded AlignmentA dominant paradigm for fast comparisons • Seed: A common pattern of positions used for efficient large scale comparison of genomic DNA …G A T T A C C A G A T T A C C A G A T T A … Seed: {0,1,2,3} => Comparison sequences: {G A T T} {A T T A} {T T A C} … Seed: {0,2,4,6} => Comparison sequences: {G T A C} {A T C A} {T A C G} …
Similarity Detection Sequence 1: Sequence 2: Seed: {0,2,4,5} A T C G A C T A T C G AC T A T C G A C T C T A G T C T C T A G TC T C T A G T C T Offset = 0 => Mismatch Offset = 1 => Match We can have multiple seeds (patterns)!
Seed Design A (hazy) problem definition Collection of ungapped genomic sequence similarities Parameters: length of seeds, resource limits ….. Algorithm A set of seeds that will give “optimum performance” (What are the parameters? How do you define optimum performance?)
Tasks in Seed Designing • Define a measure of goodness for a seed. Easy ! Sensitivity to interesting biosequence similarities. • Show how to evaluate goodness for a seed. Hard ! No efficient algorithm.
Terminology related to a Seed A seed, P = a set of ordered list of w positions i.e. P = {x1, x2, …, xw} w = weight of P = |P| s = span of P = xw – x1 + 1 Ex: P = {0, 1, 4, 5} w = 4 s = 5 – 0 + 1 = 6
Computational Cost Seed weight w f Computational Cost No. of seeds n
Performance Measurement Optimum performance => Maximum sensitivity (i.e. detection probability) to the similarities S (Currently, Markov Models are used to measure these probabilities!)
Markov Models • A kth order Markov model M: Given k bits, predict (k+1)th bit 4th order Markov Model
(Exact) Problem Definition Inputs: • Number of seeds: n • Weight of each seed: w • Markov Model: M • Similarities: S Output: A set of seeds (ordered positions), P = {x11, …, x1w}, {x21,…,x2w},…,{xn1,…,pnw} that maximizes detection probability for S
Computing Detection Probabilities • Challenge: The probability of at least one match varies because the probabilities of matches at different offsets are not independent. Ex: Seed = {0,2,3,5} This similarity has 2 matches, at offsets 0 and 2, which share two of four positions in common
DFA to compute Detection Probability(Deterministic Finite Automaton) • Construct a DFA that accepts a string of 1’s and 0’s defined by the seed P. • Ex: P = {0,2} i.e. for a substring of length 3, we need a match in 1st and 3rd position. • Then the DFA should accept strings given by the regular expression: “(0+1)*1(0+1)1(0+1)*”
Dynamic Programming Algorithm To compute the detection probabilities recursively! Complexity analysis: • Size of DFA <= • Time to construct a DFA = • Time for each step = • No. of Steps = l • Total time complexity = This is faster by a factor of s2/wthan the best previous algorithm for detection probabilities
Remarks about the algorithm • Can be extended to work with a set of seeds • The DFA need not be minimal • Time complexity can be further reduced
Structure in Seed Space • Addressing the problem: When is one seed more sensitive than another? • Factors: • Parameters of the Markov model M • Similarity length : l • Smaller length: irregular behavior => We can generalize only for asymptotic cases
Asymptotic Result Let, El (P) = Event that P detects S at some offset Elc (P) = complementary event Then, A seed P is asymptotically worse than a seed P’, P < P’, if Liml Pr[ Elc (P) / Elc (P’) ] > 1 (P’ has more chances of detecting S, than P does!)
Mandala: Fast, Practical Seed Design • Seed selection: • No efficient algorithm to find optimum w, s given M (except Brute Force) • Applies local search method; global efficiency sacrificed • Training a Markov model: • Training set is adaptively selected to suit the intended application • Samples training set using LSH-ALL-PAIRS algorithm
Experimental Results - I • Avg. detection probabilities given by theoretical models for random seeds (w = 11) Solid line: M0 Dashed line: M5
Experimental Results - II • Detection probabilities for best seeds found by Mandala (k = 5) Solid: noncoding DNA model Dashed: coding DNA model
Directions for further research • Extend the model to evaluate seeds • Extend similarity models to distinguish between different classes of substitution • Construct models of multiple alignment: to compare 3 or more genomes at once
Timing - BLAT vs WU-TBLASTX • Dataset: 1000 Mouse Reads and a RepeatMasked Human Chromosome 22
Sensitivity – BLAT vs WU-TBLASTX • Dataset: 13 million Mouse Shotgun Reads and Human Chromosome 22
Sensitivity and SpecificitySingle Perfect Amino Acid K-mer Matches as a Search Criterion
Sensitivity and SpecificitySingle Near-Perfect Amino Acid K-mer Matches as a Search Criterion (one mismatch allowed)
Sensitivity and SpecificityMultiple Perfect Amino Acid K-mer Matches as a Search Criterion (2 and 3 perfect matches)
Mathematical Formula • To compute the probability that the DFA associated with a seed accepts a string randomly chosen from a Markov Model M S = similarity, l = length of S, k = order of M, delta = bit string of length k, q = a state, Phi(q) = set of states that transition to q on bit b t = no. of bits read, k’ = min {k, t}
Dynamic Programming • Initialize the recurrence: P (q0, 0, 0) = 1 • After l steps, return the sum over all k+1-mer bit strings delta.1 of P (qa, l , delta.1)