Index-based search of single sequences

Index-based search of single sequences Omkar Mate CS 374 Stanford University

Motivation A newly discovered gene … (may hold clues to evolution of human brain capacity ) • Occurrence in other species • Mutation • …

Sequence Alignment New Query Existing Genome Database ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… ………………………… gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta ………………………… gacatta Easy??? Think again………..

State of Biological Databases ~300 sequenced genomes

Alignment Problem • Assume we try Smith-Waterman: [running time = O(MN)] The entire genomic database 1011 Our new gene 104 Time and Space complexity: O(1015) Huge number! But we can do better … 

Indexing-based Local Alignment BLAST- Basic Local Alignment Search Tool Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman! query DB hits

BLAST Step 1: Construct dictionary of query words (Query indexed by all words of size, k = 4) Query: AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… Index of query words

BLAST • Step 2 – Generate all the relatives of a word (Relative: a word with alignment score greater than a threshold, T) Query Word: ATGC, T: 50 Candidates Score Relatives! Update the index … (Query: ) AATGCCGATAGCATCG …

BLAST • Step 3: Searching • Search through database linearly, one word at a time • Initiate alignment with all occurrences of that word in query Database: ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC … ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC … Query: Index:

BLAST • Step 4 – Alignment Extension Once we find an alignment, extend to left and right with no gaps until alignment score falls below a certain threshold. ATGCCGATACGATCAGCTACGATCAG… ATGCCGATACGATCAGCTACGATCAG…

Sensitivity-Speed Tradeoff Longer words => Fewer alignments => Faster but Low chance of a match Shorter words => More alignments => High chance of a match but Slower

BLAT - The BLAST-Like Alignment Tool Similarities: • Rapid scans for relatively short matches (hits) • Extend these hits into high-scoring pairs (HSP) Differences: BLAST BLAT • Index of query sequence Index of database • Scan through database Scan through query sequence • Gaps not allowed in ext. Gaps allowed in extension • Returns smaller alignments Returns larger alignments

BLAT Strategies - I • Single perfect matches We do not allow any mismatch. Common intuition : fewer matches for longer words

Sensitivity and SpecificitySingle Perfect Nucleotide K-mer Matches as a Search Criterion

BLAT Strategies - II • Allow one mismatch Intuition: Higher number of matches for same word length => Better sensitivity (Caution: Keep k higher, else no. of matches will be huge)

Sensitivity and SpecificitySingle Near-Perfect Nucleotide K-mer Matches as a Search Criterion (one mismatch allowed)

BLAT Strategies - III Allow multiple perfect matches Two parameter: N: no. of matches, K: word size Practically: Same sensitivity, higher speed

Sensitivity and SpecificityMultiple Perfect Nucleotide K-mer Matches as a Search Criterion (2 and 3 perfect matches)

Seeded AlignmentA dominant paradigm for fast comparisons • Seed: A common pattern of positions used for efficient large scale comparison of genomic DNA …G A T T A C C A G A T T A C C A G A T T A … Seed: {0,1,2,3} => Comparison sequences: {G A T T} {A T T A} {T T A C} … Seed: {0,2,4,6} => Comparison sequences: {G T A C} {A T C A} {T A C G} …

Similarity Detection Sequence 1: Sequence 2: Seed: {0,2,4,5} A T C G A C T A T C G AC T A T C G A C T C T A G T C T C T A G TC T C T A G T C T Offset = 0 => Mismatch Offset = 1 => Match We can have multiple seeds (patterns)!

Seed Design A (hazy) problem definition Collection of ungapped genomic sequence similarities Parameters: length of seeds, resource limits ….. Algorithm A set of seeds that will give “optimum performance” (What are the parameters? How do you define optimum performance?)

Tasks in Seed Designing • Define a measure of goodness for a seed. Easy ! Sensitivity to interesting biosequence similarities. • Show how to evaluate goodness for a seed. Hard ! No efficient algorithm.

Terminology related to a Seed A seed, P = a set of ordered list of w positions i.e. P = {x1, x2, …, xw} w = weight of P = |P| s = span of P = xw – x1 + 1 Ex: P = {0, 1, 4, 5} w = 4 s = 5 – 0 + 1 = 6

Computational Cost Seed weight w f Computational Cost No. of seeds n

Performance Measurement Optimum performance => Maximum sensitivity (i.e. detection probability) to the similarities S (Currently, Markov Models are used to measure these probabilities!)

Markov Models • A kth order Markov model M: Given k bits, predict (k+1)th bit 4th order Markov Model

(Exact) Problem Definition Inputs: • Number of seeds: n • Weight of each seed: w • Markov Model: M • Similarities: S Output: A set of seeds (ordered positions), P = {x11, …, x1w}, {x21,…,x2w},…,{xn1,…,pnw} that maximizes detection probability for S

Computing Detection Probabilities • Challenge: The probability of at least one match varies because the probabilities of matches at different offsets are not independent. Ex: Seed = {0,2,3,5} This similarity has 2 matches, at offsets 0 and 2, which share two of four positions in common

DFA to compute Detection Probability(Deterministic Finite Automaton) • Construct a DFA that accepts a string of 1’s and 0’s defined by the seed P. • Ex: P = {0,2} i.e. for a substring of length 3, we need a match in 1st and 3rd position. • Then the DFA should accept strings given by the regular expression: “(0+1)*1(0+1)1(0+1)*”

Dynamic Programming Algorithm To compute the detection probabilities recursively! Complexity analysis: • Size of DFA <= • Time to construct a DFA = • Time for each step = • No. of Steps = l • Total time complexity = This is faster by a factor of s2/wthan the best previous algorithm for detection probabilities

Remarks about the algorithm • Can be extended to work with a set of seeds • The DFA need not be minimal • Time complexity can be further reduced

Structure in Seed Space • Addressing the problem: When is one seed more sensitive than another? • Factors: • Parameters of the Markov model M • Similarity length : l • Smaller length: irregular behavior => We can generalize only for asymptotic cases

Asymptotic Result Let, El (P) = Event that P detects S at some offset Elc (P) = complementary event Then, A seed P is asymptotically worse than a seed P’, P < P’, if Liml Pr[ Elc (P) / Elc (P’) ] > 1 (P’ has more chances of detecting S, than P does!)

Mandala: Fast, Practical Seed Design • Seed selection: • No efficient algorithm to find optimum w, s given M (except Brute Force) • Applies local search method; global efficiency sacrificed • Training a Markov model: • Training set is adaptively selected to suit the intended application • Samples training set using LSH-ALL-PAIRS algorithm

Experimental Results - I • Avg. detection probabilities given by theoretical models for random seeds (w = 11) Solid line: M0 Dashed line: M5

Experimental Results - II • Detection probabilities for best seeds found by Mandala (k = 5) Solid: noncoding DNA model Dashed: coding DNA model

Directions for further research • Extend the model to evaluate seeds • Extend similarity models to distinguish between different classes of substitution • Construct models of multiple alignment: to compare 3 or more genomes at once

Timing - BLAT vs WU-TBLASTX • Dataset: 1000 Mouse Reads and a RepeatMasked Human Chromosome 22

Sensitivity – BLAT vs WU-TBLASTX • Dataset: 13 million Mouse Shotgun Reads and Human Chromosome 22

Sensitivity and SpecificitySingle Perfect Amino Acid K-mer Matches as a Search Criterion

Sensitivity and SpecificitySingle Near-Perfect Amino Acid K-mer Matches as a Search Criterion (one mismatch allowed)

Sensitivity and SpecificityMultiple Perfect Amino Acid K-mer Matches as a Search Criterion (2 and 3 perfect matches)

Mathematical Formula • To compute the probability that the DFA associated with a seed accepts a string randomly chosen from a Markov Model M S = similarity, l = length of S, k = order of M, delta = bit string of length k, q = a state, Phi(q) = set of states that transition to q on bit b t = no. of bits read, k’ = min {k, t}

Dynamic Programming • Initialize the recurrence: P (q0, 0, 0) = 1 • After l steps, return the sum over all k+1-mer bit strings delta.1 of P (qa, l , delta.1)

Index-based search of single sequences

Index-based search of single sequences

Presentation Transcript

A single-factor security market The single-index model Estimating the single-index model

15.6 Index Based Algorithms

15.6 Index Based Algorithms

“Transforming Cells into Automata” “ Index-based search of single s equences ”

Search Server Index

Database Index to Large Biological Sequences

Single Top Search

In Search of a Single Standard

Single Top Search

Phosphorus Index Based Management

Chapter Five Index and Search

Single Top search

Phosphorus Index Based Management

Single Top Search

Single Top search

Single Top search

An analysis framework for search sequences

Why use single index model?

Single level index

Index-based NDN Repository

Formation of Optimum Portfolio using Single Index Model

Array-Based Sequences