
Combinatorial Pattern Matching


Presentation Transcript


  1. Combinatorial Pattern Matching Shashank Kadaveru

  2. Introduction • In the Motif Finding problem, no particular pattern is given to search for; we infer it from the sample. • In Combinatorial Pattern Matching, we look for exact or approximate occurrences of given patterns in a long text.

  3. Pattern Matching • Might seem simpler than motif finding because we know what we are looking for. • But due to the large size of genomes, it is difficult in practice. • Here, we develop a number of ways to speed up pattern matching in a long string.

  4. Repeat Finding • Genetic diseases are associated with deletions, duplications, and rearrangements of long chromosomal regions. • For example, DiGeorge syndrome, which results in an impaired immune system and heart defects, is associated with a 3 Mb deletion on human chromosome 22. • The deletion removes important genes, which leads to the disease.

  5. Repeat Finding • Repeats in DNA hold many evolutionary secrets. • A striking and still unexplained phenomenon is the large number of repeats in many genomes. • For example, repeats account for about 50% of the human genome.

  6. Duplicate Removal Problem • Find all unique entries in a list of integers. • For example, the list (8, 1, 5, 1, 0, 4, 5, 10, 1) has the elements 1 and 5 repeated multiple times, so the resulting list would be (8, 1, 5, 0, 4, 10)

  7. Duplicate Removal Algorithm
  DUPLICATEREMOVAL(a, n)
  1 m ← largest element of a
  2 for i ← 0 to m
  3   b[i] ← 0
  4 for i ← 1 to n
  5   b[a[i]] ← 1
  6 for i ← 0 to m
  7   if b[i] = 1
  8     output i
  9 return
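The binning pseudocode above can be sketched directly in Python (note that, unlike the example on slide 6, binning reports the unique values in sorted order rather than in order of first appearance):

```python
def duplicate_removal(a):
    """Binning sketch of DUPLICATEREMOVAL: mark each value's bin,
    then report every value whose bin was marked."""
    m = max(a)                 # largest element of a
    b = [0] * (m + 1)          # one bin per possible value 0..m
    for value in a:
        b[value] = 1           # mark the value as present
    return [i for i in range(m + 1) if b[i] == 1]

print(duplicate_removal([8, 1, 5, 1, 0, 4, 5, 10, 1]))
# [0, 1, 4, 5, 8, 10] -- duplicates removed, in sorted order
```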

  8. Duplicate Removal Algorithm • Sorting the list instead would take O(n log n) time for a list of size n. • The binning approach is difficult to use when the entries can be arbitrarily large. • Creating a bin for every possible value is then inefficient and may not even be possible.

  9. Duplicate Removal Algorithm • If we can map arbitrarily large entries to integers between, say, 1 and 1000, then we can use the binning strategy efficiently. • The function used for this mapping is called a hash function. • The resulting table of bins is called a hash table.

  10. Hash Function • Should be easy to calculate and integer-valued. • A simple hash function is h(x) = x mod |b|, the remainder when x is divided by the size of the hash table b. • We want different integers in the array to map to different bins of b. • But this is not possible when the size of array a is larger than the size of b.

  11. Hash Function • Example: if h(x) = x mod 1000, then 7, 1007, and 2007 all map to the same bin. • This is called a collision. • To deal with collisions, we use chaining.

  12. Chaining • Elements hashing to the same bin are organized into a linked list. • Comparing the elements that fall into the same bin reveals duplicates, and hence solves the Duplicate Removal problem.

  13. Chaining • This concept applies to many other problems, for example, finding exact repeats of l-mers in a genome. • To find exact repeats for large values such as l = 40, we need not construct a table with 4^40 entries. • We can design a suitable hash function and use a table of much smaller size.
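A minimal sketch of the repeat-finding idea in Python, using a built-in dict as the hash table (Python's dict handles collisions internally, playing the role of chaining):

```python
from collections import defaultdict

def find_repeats(genome, l):
    """Bin every l-mer of the genome by its starting positions;
    any l-mer whose bin holds more than one position is an exact repeat."""
    bins = defaultdict(list)
    for i in range(len(genome) - l + 1):
        bins[genome[i:i + l]].append(i)
    return {lmer: pos for lmer, pos in bins.items() if len(pos) > 1}

print(find_repeats("ATGGTCGGT", 3))
# {'GGT': [2, 6]} -- GGT repeats at 0-based positions 2 and 6
```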

  14. Exact Pattern Matching • Suppose we have a pattern p = p1 p2 . . . pn and a text t = t1 t2 . . . tm. • The Pattern Matching problem is to find any or all occurrences of the pattern p in the text t. • A brute force algorithm can solve the Pattern Matching problem directly.

  15. Exact Pattern Matching
  PATTERNMATCHING(p, t)
  1 n ← length of pattern p
  2 m ← length of text t
  3 for i ← 1 to m − n + 1
  4   if ti ti+1 . . . ti+n−1 = p
  5     output i
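The pseudocode translates directly into Python, with slicing standing in for the substring comparison on line 4:

```python
def pattern_matching(p, t):
    """Brute-force PATTERNMATCHING: slide p across t and report every
    1-based position where p occurs exactly."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        if t[i:i + n] == p:
            positions.append(i + 1)   # 1-based, as on the slides
    return positions

print(pattern_matching("GGT", "ATGGTCGGT"))
# [3, 7]
```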

  16. Exact Pattern Matching • The worst-case running time is O(nm). • Such a worst-case scenario arises when searching for p = AAAAT in t = AAAA . . . AAAAA. • Peter Weiner invented a data structure called the suffix tree that solves the Pattern Matching problem in linear time, O(n + m), for any text and pattern. • To understand it, we first need to learn about keyword trees.

  17. Keyword Trees • Multiple Pattern Matching Problem: • Given a set of patterns and a text, find all occurrences of any of the patterns in the text. • For example, if t = ATGGTCGGT and p1 = GGT, p2 = GGG, p3 = ATG, and p4 = CG, then the result of an algorithm that solves the Multiple Pattern Matching problem would be positions 1, 3, 6, and 7.

  18. Keyword Trees • The Multiple Pattern Matching problem with k patterns can be reduced to k individual Pattern Matching problems and solved in O(knm) time. • However, there exists an even faster way to solve this problem, in O(n + m) time, where n is the total length of the patterns p1, p2, . . . , pk.

  19. Keyword Trees • The keyword tree for the set of patterns apple, apropos, banana, bandana, and orange is shown in the figure.

  20. Keyword Trees • The keyword tree has at most N vertices where N is the total length of all patterns, but may have fewer. • One can construct the keyword tree in O(N) time by progressively extending the keyword tree.
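A minimal Python sketch of the keyword tree: nested dicts serve as vertices, and a "$" key marks a vertex where a pattern ends. Threading the text through the tree from every start position gives a correct, though naive, multiple pattern matcher. The Aho-Corasick failure links that yield the O(n + m) bound are omitted, and this sketch reports only the first pattern found at each start position:

```python
def build_keyword_tree(patterns):
    """Build the keyword tree (trie) by progressively extending it with
    each pattern; a '$' key marks a vertex where a pattern ends."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def multiple_pattern_matching(patterns, t):
    """Thread the text through the keyword tree from every start
    position; naive threading is O(nm) in the worst case."""
    tree = build_keyword_tree(patterns)
    hits = []
    for i in range(len(t)):
        node = tree
        for ch in t[i:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                hits.append(i + 1)    # 1-based start position
                break
    return hits

print(multiple_pattern_matching(["GGT", "GGG", "ATG", "CG"], "ATGGTCGGT"))
# [1, 3, 6, 7] -- the positions from slide 17
```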

  21. Suffix Trees • Suffix trees allow one to preprocess a text in such a way that, for any pattern of length n, one can answer whether or not it occurs in the text using only O(n) time. • The query time does not depend on the length of the text.

  22. Suffix Trees • Patterns: proud, perfect, muggle, ugly, rivet • t = “mr and mrs dursley of number 4 privet drive were proud to say that they were perfectly normal thank you very much”

  23. Suffix Trees

  24. Suffix Trees

  25. Suffix Trees • A suffix tree for a text of length m can be constructed in O(m) time, which leads immediately to a linear, O(n + m), algorithm for the Pattern Matching problem. • Construct the suffix tree for t in O(m) time, and then check whether p occurs in the tree, which requires O(n) time.

  26. Suffix Trees • The suffix tree for a text t = t1 . . . tm is a rooted labeled tree with m leaves (numbered from 1 to m) satisfying the following conditions: • Each edge is labeled with a substring of the text, • Each internal vertex (except possibly the root) has at least 2 children, • Any two edges out of the same vertex start with different letters, • Every suffix of the text t is spelled out on a path from the root to some leaf.

  27. Suffix Trees

  28. Suffix Trees • SUFFIXTREEPATTERNMATCHING(p, t) • Build the suffix tree for text t • Thread pattern p through the suffix tree. • if threading is complete • output positions of every p-matching leaf in the tree • else • output “pattern does not appear anywhere in the text”
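A true suffix tree compresses non-branching paths and can be built in O(m) time; the uncompressed suffix trie below takes O(m^2) to build but illustrates the threading step of SUFFIXTREEPATTERNMATCHING:

```python
def build_suffix_trie(t):
    """Naive suffix trie: insert every suffix t[i:]; each vertex stores
    the 1-based starting positions of the suffixes passing through it."""
    root = {"pos": [], "kids": {}}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node["kids"].setdefault(ch, {"pos": [], "kids": {}})
            node["pos"].append(i + 1)     # 1-based suffix start
    return root

def suffix_trie_pattern_matching(p, trie):
    """Thread p from the root; if threading completes, every position
    stored at the final vertex is a p-matching position."""
    node = trie
    for ch in p:
        if ch not in node["kids"]:
            return []                     # pattern not in the text
        node = node["kids"][ch]
    return sorted(node["pos"])

trie = build_suffix_trie("ATGGTCGGT")
print(suffix_trie_pattern_matching("GGT", trie))
# [3, 7]
```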

  29. Heuristic Similarity Search Algorithms • The suffix tree algorithm above, while fast, can only find exact, rather than approximate, occurrences of a gene in a database. • When we are trying to find an approximate match to a gene of length 10^3 in a database of size 10^10, the dynamic programming algorithms (like the Smith-Waterman local alignment algorithm) may be too slow.

  30. Approximate Pattern Matching • The Approximate Pattern Matching problem is to find all approximate occurrences of a pattern in a text. • The naive brute force algorithm for approximate pattern matching runs in O(nm) time. • The following algorithm outputs all positions in t where p occurs with no more than k mismatches.

  31. Approximate Pattern Matching
  APPROXIMATEPATTERNMATCHING(p, t, k)
  1 n ← length of pattern p
  2 m ← length of text t
  3 for i ← 1 to m − n + 1
  4   dist ← 0
  5   for j ← 1 to n
  6     if ti+j−1 ≠ pj
  7       dist ← dist + 1
  8   if dist ≤ k
  9     output i
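The same algorithm in runnable Python:

```python
def approximate_pattern_matching(p, t, k):
    """Report every 1-based position where p occurs in t with at most
    k mismatches (Hamming distance <= k)."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        dist = sum(1 for j in range(n) if t[i + j] != p[j])
        if dist <= k:
            positions.append(i + 1)
    return positions

print(approximate_pattern_matching("GGG", "ATGGTCGGT", 1))
# [2, 3, 6, 7]
```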

  32. Approximate Pattern Matching • In 1985 Gadi Landau and Udi Vishkin found an algorithm for approximate string matching with O(km) worst-case running time. • Although this algorithm yields the best known worst-case performance, it is not necessarily the best in practice.

  33. Heuristic Similarity Search Algorithms • Many heuristics for fast database search in molecular biology use the same filtration idea. • In 1973 Donald Knuth suggested a method for pattern matching with one mismatch based on the observation that strings differing by a single mismatch must match exactly in either the first or second half.

  34. Heuristic Similarity Search Algorithms • In 1985 the idea of filtration in computational molecular biology was used by David Lipman and Bill Pearson, in their FASTA algorithm. • It was further developed in BLAST, now the dominant database search tool in molecular biology.

  35. Heuristic Similarity Search Algorithms • Biologists frequently depict similarities between two sequences in the form of dot matrices. • A dot matrix is simply a matrix with each entry either 0 or 1, where a 1 at position (i, j) indicates some similarity between the ith position of the first sequence and the jth position of the second sequence.
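For the simplest notion of similarity, exact letter equality, a dot matrix can be computed in one line of Python:

```python
def dot_matrix(s1, s2):
    """Entry (i, j) is 1 when the i-th letter of s1 equals the j-th
    letter of s2, and 0 otherwise."""
    return [[1 if a == b else 0 for b in s2] for a in s1]

for row in dot_matrix("ATG", "ATGA"):
    print(row)
# [1, 0, 0, 1]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
```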

  36. Heuristic Similarity Search Algorithms

  37. BLAST: Comparing a Sequence against a Database • Using shared l-mers for finding similarities, as FASTA does, has some disadvantages. • For example, two proteins can have different amino acid sequences but still be biologically similar. • A common construct in many heuristic similarity search algorithms, including BLAST, is that of a scoring matrix similar to the scoring matrices introduced in chapter 6. • These scoring matrices reveal similarities between proteins even if they do not share a single l-mer.

  38. BLAST: Comparing a Sequence against a Database • BLAST uses scoring matrices to improve the efficiency of filtration and to introduce more accurate rules for locating potential matches. • Another powerful feature of BLAST is the use of Altschul-Dembo-Karlin statistics for estimating the statistical significance of found matches.

  39. BLAST: Comparing a Sequence against a Database • A segment pair is just a pair of l-mers, one from each sequence. • The maximal segment pair is a segment pair with the best score over all segment pairs in the two sequences. • A segment pair is locally maximal if its score cannot be improved either by extending or by shortening both segments.

  40. BLAST: Comparing a Sequence against a Database • BLAST attempts to find all locally maximal segment pairs in the query and database sequences with scores above some set threshold. • A fast algorithm for finding such l-mers is the key ingredient of BLAST.

  41. BLAST: Comparing a Sequence against a Database • An important observation is that if the threshold is high enough, then the set of all l-mers with scores above the threshold is not too large. • In this case the database can be searched for exact occurrences of the strings from this set using a Multiple Pattern Matching algorithm. • After the potential matches are located, BLAST attempts to extend them to see whether the resulting score is above the threshold.
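A toy seed-and-extend sketch of this locate-then-extend idea (not BLAST itself: real BLAST seeds on high-scoring l-mer neighborhoods under a scoring matrix and extends in both directions; here seeds are exact shared l-mers, extension is rightward only, and the match/mismatch scores are illustrative placeholders):

```python
def seed_and_extend(query, db, l, threshold, match=1, mismatch=-1):
    """Index the database's l-mers, use exact shared l-mers as seeds,
    extend each seed to the right while tracking the best score, and
    report (query_pos, db_pos, best_score) for seeds whose best
    extension meets the threshold."""
    index = {}
    for j in range(len(db) - l + 1):
        index.setdefault(db[j:j + l], []).append(j)
    hits = []
    for i in range(len(query) - l + 1):
        for j in index.get(query[i:i + l], []):
            qi, dj = i + l, j + l
            score = best = l * match          # the seed scores l matches
            while qi < len(query) and dj < len(db):
                score += match if query[qi] == db[dj] else mismatch
                best = max(best, score)
                qi, dj = qi + 1, dj + 1
            if best >= threshold:
                hits.append((i, j, best))
    return hits

print(seed_and_extend("GGTCG", "ATGGTCGGT", 3, 4))
# [(0, 2, 5), (1, 3, 4)]
```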

  42. BLAST: Comparing a Sequence against a Database • In recent years BLAST has been further improved by allowing insertions and deletions and combining matches on the same and close diagonals. • BLAST reports matches to sequences that either have one segment score above the threshold or that have several closely located segment pairs that are statistically significant when combined.

  43. BLAST: Comparing a Sequence against a Database • According to the Altschul-Dembo-Karlin statistics, the number of matches with scores above θ is approximately Poisson-distributed, with mean

  E(θ) = K m n e^(−λθ)

  where m and n are the lengths of the compared sequences and K is a constant. The parameter λ is the positive root of the equation

  Σx∈A Σy∈A px py e^(λ δ(x,y)) = 1

  44. BLAST: Comparing a Sequence against a Database • where px and py are the frequencies of amino acids x and y from the twenty-letter alphabet A and δ is the scoring matrix. • The probability that there is a match of score greater than θ between two “random” sequences of length n and m is computed as 1 − e^(−E(θ)).
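These formulas are easy to evaluate numerically. The sketch below finds λ by bisection for a toy two-letter alphabet; the frequencies, the scoring matrix, and K = 0.1 are illustrative placeholders, not values from the statistics. It assumes the expected score Σ px py δ(x, y) is negative, so a unique positive root exists:

```python
import math

def solve_lambda(freqs, delta, hi=10.0, iters=100):
    """Bisection for the positive root of sum_{x,y} p_x p_y e^(lam*delta(x,y)) = 1."""
    def f(lam):
        return sum(freqs[x] * freqs[y] * math.exp(lam * delta[x][y])
                   for x in freqs for y in freqs)
    lo = 1e-9
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 1:
            lo = mid          # still below 1: the root is to the right
        else:
            hi = mid
    return (lo + hi) / 2

def e_value(theta, m, n, K, lam):
    """Expected number of matches scoring above theta: E = K m n e^(-lam*theta)."""
    return K * m * n * math.exp(-lam * theta)

# toy alphabet: equal frequencies, match +1, mismatch -2 (placeholders)
freqs = {"A": 0.5, "B": 0.5}
delta = {"A": {"A": 1, "B": -2}, "B": {"A": -2, "B": 1}}
lam = solve_lambda(freqs, delta)
E = e_value(50, 10**3, 10**7, 0.1, lam)
p_match = 1 - math.exp(-E)    # probability of a match scoring > theta
print(lam, E, p_match)
```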

  45. Any Questions? Thank You
