String Matching
String matching: definition of the problem (text, pattern); the approach depends on what we have: text or patterns.
• Exact matching:
  • The patterns ---> Data structures for the patterns
    • 1 pattern ---> The algorithm depends on |p| and |Σ|
    • Regular Expressions
    • Extensions
    • k patterns ---> The algorithm depends on k, |p| and |Σ|
  • The text ----> Data structure for the text (suffix tree, ...)
• Approximate matching:
  • Dynamic programming
  • Sequence alignment (pairwise and multiple)
  • Sequence assembly: hash algorithm
  • Probabilistic search: Hidden Markov Models
Sequence assembly
It is applied to the following topics: DNA sequencing and EST assembly. In recent years, however, new lab technologies called “next-generation sequencing” have emerged.
DNA sequencing
There are two techniques:
• Hybridization: provides information about the l-tuples (l-mers) present in the DNA.
• Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.
Hybridization
Let xxxxxxxxxxxxx be the sequence we want to know. The hybridization technique gives us the set of 3-mers that belong to it:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
How can the sequence be reconstructed?
Hybridization
Given the 3-mers of the sequence:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
As AAC and ACG belong to the sequence, AACG belongs to the sequence too, because the longest proper suffix of AAC (AC) matches the longest proper prefix of ACG (AC).
This relation can be represented with a directed graph: AAC ---> ACG
Hybridization
Construction of the complete suffix-prefix graph of
AAC GAT TGC ACG CGG GCC TTG GGA ATT
gives us the unknown sequence: AACGGATTGCC
But is this a realistic case?
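The construction above can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the graph is a single chain, so every 3-mer has at most one successor. The function names overlap_graph and spell_chain are hypothetical, not part of the slides.

```python
def overlap_graph(kmers):
    """Edge a -> b when the proper suffix of a equals the proper prefix of b."""
    return {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}


def spell_chain(kmers):
    """Follow the chain from the only k-mer without a predecessor.

    Assumption (not in the slides): the graph is a simple chain.
    """
    graph = overlap_graph(kmers)
    with_predecessor = {b for successors in graph.values() for b in successors}
    node = next(k for k in kmers if k not in with_predecessor)
    sequence, visited = node, {node}
    while graph[node] and graph[node][0] not in visited:
        node = graph[node][0]
        visited.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(spell_chain(three_mers))
```

For the nine 3-mers of the slide the chain spells the sequence given above.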
Hybridization
Let us introduce a more realistic case:
AAC CAA GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT
The sequence is given by the Hamiltonian path, that is, the path that traverses every node exactly once. Finding it is an NP-complete problem!
What is the cost of the hybridization method?
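To make the cost of the Hamiltonian path concrete, here is a minimal backtracking sketch. It tries every possible extension, so its running time grows exponentially with the number of l-mers in the worst case. The name hamiltonian_path is hypothetical, and the demonstration uses the nine 3-mers of the simpler example, given in a different order.

```python
def hamiltonian_path(kmers):
    """Return a list of k-mers visiting every node exactly once, or None."""
    successors = {
        a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers
    }

    def extend(path, seen):
        if len(path) == len(kmers):
            return path
        for nxt in successors[path[-1]]:
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None  # dead end: backtrack

    for start in kmers:
        path = extend([start], {start})
        if path:
            return path
    return None


def spell(path):
    """Glue a path of overlapping k-mers into one sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


path = hamiltonian_path(["GCC", "ATT", "AAC", "TTG", "CGG", "GAT", "ACG", "TGC", "GGA"])
print(spell(path))
```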
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ possible l-mers of length L that must be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path: NP-complete.
Excursus: cost
Let t = 1 ms be the time needed for an input of size m.

Cost                         m           10m                      1000m
Linear, $O(m)$               t = 1 ms    10 t = 10 ms             1000 t = 1 s
Quadratic, $O(m^2)$          t = 1 ms    100 t = 100 ms           1000000 t ≈ 17 min
Exponential, $O(2^m)$        t = 1 ms    $2^{10}$ t ≈ 1 s         $2^{1000}$ t ≈ $10^{301}$ t, more than $10^{290}$ years
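The figures of the excursus can be checked with a short script. It is a sketch under the same convention as the slide: an input of size m takes t = 1 ms, and an input k times larger takes k, k^2 or 2^k times t. The names GROWTH and scaled_time_ms are hypothetical.

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365

# Growth of the running time when the input becomes k times larger.
GROWTH = {
    "linear": lambda k: k,
    "quadratic": lambda k: k ** 2,
    "exponential": lambda k: 2 ** k,
}


def scaled_time_ms(kind, k, t_ms=1):
    """Time in ms for an input k times larger, if size m takes t_ms."""
    return GROWTH[kind](k) * t_ms


for kind in GROWTH:
    for k in (10, 1000):
        ms = scaled_time_ms(kind, k)
        print(f"{kind:12s} {k:5d}m  {ms / 1000:.3g} s  {ms / MS_PER_YEAR:.3g} years")
```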
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ possible l-mers of length L that must be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path: NP-complete.
How can the NP-completeness be avoided?
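Hybridization:
Graph with the 3-mers as nodes:
AAC GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT
Graph with the 2-mers as nodes, where each 3-mer is an edge from its prefix to its suffix:
AA GA TG AC GC TT CG GG CC AT
Search for the Hamiltonian path in the first graph (NP-complete) or search for the Eulerian path in the second one (linear).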
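A minimal sketch of the second graph, assuming that every 3-mer becomes one edge from its 2-mer prefix to its 2-mer suffix. The name two_mer_graph is hypothetical.

```python
from collections import defaultdict


def two_mer_graph(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # one edge per k-mer
    return dict(graph)


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC",
              "TTG", "GGC", "GGA", "CCG", "ATT"]
graph = two_mer_graph(three_mers)
nodes = set(graph) | {v for targets in graph.values() for v in targets}
print(sorted(nodes))
for source, targets in sorted(graph.items()):
    print(source, "->", ", ".join(targets))
```

The node set obtained this way is the list of ten 2-mers shown on the slide, and the number of edges equals the number of 3-mers.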
Hybridization: Eulerian path
Search for the Eulerian path of the graph:
• Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes)
• Balanced nodes: indegree = outdegree (traversed nodes)
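The classification into balanced and unbalanced nodes only needs the degrees. The sketch below counts them directly from the 3-mers of the first example (AACGGATTGCC); classify_nodes is a hypothetical name.

```python
from collections import Counter


def classify_nodes(kmers):
    """Return (balanced, unbalanced) lists of (k-1)-mer nodes."""
    outdegree, indegree = Counter(), Counter()
    for kmer in kmers:
        outdegree[kmer[:-1]] += 1  # edge leaves the prefix
        indegree[kmer[1:]] += 1    # edge enters the suffix
    nodes = set(outdegree) | set(indegree)
    balanced = sorted(n for n in nodes if indegree[n] == outdegree[n])
    unbalanced = sorted(n for n in nodes if indegree[n] != outdegree[n])
    return balanced, unbalanced


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
balanced, unbalanced = classify_nodes(three_mers)
print("balanced:  ", balanced)
print("unbalanced:", unbalanced)
```

In this example the two unbalanced nodes are the first and the last 2-mer of the sequence.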
Hybridization: Eulerian path Algorithm: 1. Construct a random path between starting and ending nodes. 2. Add cycles from balanced nodes while possible.
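One common way to implement this idea is the stack-based form of Hierholzer's algorithm, sketched below. It does not build the first path and the cycles as separate steps: the stack walks until it gets stuck, and the cycles are spliced in while it unwinds. This is an assumption about how the two steps could be coded, not the slides' own implementation. The 3-mers used are a hypothetical example (those of ATGGCGTGCA), chosen because its graph contains a cycle.

```python
from collections import defaultdict


def eulerian_sequence(kmers):
    """Spell a sequence that uses every k-mer exactly once as an edge.

    Assumes an Eulerian path exists (at most one start and one end node).
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # outdegree - indegree
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1
    starts = [node for node, diff in balance.items() if diff == 1]
    start = starts[0] if starts else kmers[0][:-1]

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # node finished: add to path
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


example = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(eulerian_sequence(example))
```

The result need not be the sequence the 3-mers came from: any Eulerian path spells a sequence with exactly the same 3-mers. Each edge is pushed and popped once, which is where the linear cost comes from.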
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ possible l-mers of length L that must be generated.
2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Eulerian path: linear cost.
Now, what is the limiting factor?
Hybridization: limiting factor
Given the graph of the l-mers:
AAC CAA GAT TGC ACG CGG GCC TTG GGA ATT GAC
Repeated l-mers: how many sequences can be assembled?
CAACGGATTGCC
CAACGGACGGATTGCC
What is the probability of a repeat?
Hybridization: statistical model
How can the probability of a repeat be computed?
Model: a random sequence of length N with identically distributed bases (probability 1/4 each).
• Given 2 l-mers, the probability that they match is $4^{-L}$.
• Given 3 l-mers, the expected number of matching pairs is $\binom{3}{2} 4^{-L}$.
• Given m l-mers, the expected number of matching pairs is $\binom{m}{2} 4^{-L}$.
If $\binom{m}{2} 4^{-L} < 1$, then approximately $m < \sqrt{2 \cdot 4^L}$. For L = 8 this gives m < 362; with m = 512 the expected number of matching pairs is already about 2.
Conclusion: this technique can be applied only to short sequences.
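The bound can be evaluated numerically. The sketch below uses the same model (independent l-mers, each pair matching with probability $4^{-L}$); the names expected_matching_pairs and max_lmers are hypothetical.

```python
from math import comb


def expected_matching_pairs(m, L):
    """Expected number of equal pairs among m random l-mers of length L."""
    return comb(m, 2) / 4 ** L


def max_lmers(L):
    """Largest m for which the expected number of matching pairs is below 1."""
    m = 1
    while comb(m + 1, 2) < 4 ** L:
        m += 1
    return m


for L in (4, 8, 12):
    print(f"L = {L:2d}: at most {max_lmers(L)} l-mers")
print(f"m = 512, L = 8: {expected_matching_pairs(512, 8):.2f} expected matching pairs")
```

The bound grows only as $2^L$, which is why longer sequences quickly run into repeats.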
DNA sequencing
There are two techniques:
• Hybridization: provides information about the l-mers present in the DNA.
• Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.
Shotgun
With the unknown sequence
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
it is possible:
• to make some copies
• to break it into random, unsorted short segments
What can we do?
Shotgun: algorithm
Assume the copies are broken as follows:
xxxxx|xxxxxxx|xxxxxxx|xxxx
xxxxxxxx|xxxxxx|xxxxxx|xxx
xxxx|xxxxxx|xxxxxx|xxxxxxx
The algorithm is:
1st. Compare all pairs, searching for approximate suffix-prefix matches.
2nd. Construct the suffix-prefix graph.
3rd. Find the path.
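The first step can be sketched with dynamic programming. The function below scores the best alignment between a suffix of one segment and a prefix of another, so a few mismatches or gaps are tolerated. The scoring values (+1 match, -1 mismatch, -1 gap) and the name overlap_score are assumptions made for the illustration.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of aligning some suffix of a with some prefix of b.

    The suffix of a may start anywhere (first column costs nothing),
    the prefix of b must start at b[0], and the alignment must reach
    the end of a. Cost: O(len(a) * len(b)).
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [0]  # starting the suffix of a here is free
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return max(previous)  # the prefix of b may end anywhere


print(overlap_score("accgtacc", "tacctt"))   # exact overlap "tacc"
print(overlap_score("acgatac", "gatgca"))    # overlap with one mismatch
```

A pair of segments would become an edge of the suffix-prefix graph when this score exceeds a chosen threshold.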
Shotgun
Given the three copies
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
the shotgun method breaks them into the following segments:
accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt
Shotgun
The pairwise comparison that searches for approximate suffix-prefix matches can be done with:
• Dynamic programming (quadratic cost)
• Two steps:
  • Find the pairs suspected to overlap (linear cost with the hash algorithm)
  • Assemble them with dynamic programming.
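A minimal sketch of the first of the two steps, assuming that "suspected" pairs are those sharing at least one exact k-mer. Building the hash index is linear in the total length of the segments; listing the pairs can still be expensive when a k-mer is very frequent. The word length k = 3 and the name candidate_pairs are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(segments, k=3):
    """Return index pairs (i, j), i < j, of segments sharing a k-mer."""
    index = defaultdict(set)  # k-mer -> ids of the segments containing it
    for seg_id, segment in enumerate(segments):
        for pos in range(len(segment) - k + 1):
            index[segment[pos:pos + k]].add(seg_id)
    pairs = set()
    for seg_ids in index.values():
        pairs.update(combinations(sorted(seg_ids), 2))
    return pairs


segments = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
            "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
pairs = candidate_pairs(segments)
print(len(pairs), "candidate pairs out of", len(segments) * (len(segments) - 1) // 2)
```

Only the candidate pairs would then be passed to the dynamic programming step.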
Shotgun
Given the graph with nodes
tacctt accttta tttaac taacga accgtacc acgatac accgt accg gataca tacaggt
the assembled sequence is accgtacctttaacgatacaggt, but finding the Hamiltonian path has exponential cost!
Shotgun: new problems arise
• Consecutive repeats
• Lack of coverage
• …
Shotgun: properties of the coverage
Given the coverage, some questions arise:
• What is the percentage of coverage?
• How many contigs should we expect?
• What is the mean length of the contigs?
Shotgun: percentage of coverage
Given the model: a sequence of length L covered by N segments of length d.
Degree of coverage: $N d / L$
We assume that segments are randomly distributed.
The probability that a base is covered by k segments is given by the binomial distribution $B(N, d/L)$:
$\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$
Shotgun: percentage of coverage
What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$ with $np = \lambda$? The Poisson distribution $P(\lambda)$:
$\mathrm{Prob}\{X = k\} = e^{-\lambda} \lambda^k / k!$
Then the probability that at least one segment covers a base is
$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$
Then, with $N d / L = 4.6$ we obtain 99% coverage and with $N d / L = 6.9$ we obtain 99.9% coverage.
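The two coverage figures can be checked numerically. The sketch also compares the Poisson approximation with the binomial model for one set of values; N = 4600, d = 500 and L = 500000 are hypothetical numbers chosen only so that N d / L = 4.6, and the function names are assumptions.

```python
from math import exp, log


def covered_fraction_poisson(coverage):
    """Prob{X > 0} = 1 - e^(-N d / L), with coverage = N d / L."""
    return 1 - exp(-coverage)


def covered_fraction_binomial(N, d, L):
    """Prob{X > 0} = 1 - (1 - d/L)^N under the binomial model."""
    return 1 - (1 - d / L) ** N


def coverage_needed(target):
    """Degree of coverage N d / L needed to cover a fraction `target`."""
    return -log(1 - target)


for c in (4.6, 6.9):
    print(f"N d / L = {c}: {100 * covered_fraction_poisson(c):.2f}% covered")
print(f"binomial, N=4600, d=500, L=500000: "
      f"{100 * covered_fraction_binomial(4600, 500, 500000):.2f}% covered")
print(f"coverage needed for 99%: {coverage_needed(0.99):.2f}")
```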