

  1. An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs. Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai, and Satoru Miyano. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 4, October-December 2004. Presented by Sivaramakrishnan Subramanian, Graduate Student, CPSC, TAMU. siv@tamu.edu

  2. Motive
     • Finding patterns conserved across a set of biologically related sequences, in order to extract meaning, is a common topic in bioinformatics.
     • More than one sequence element can affect the biological characteristics of the sequences.
     • Past work on finding composite patterns: Structured Motifs, MITRA, Bioprospector, …

  3. Overview
     • Given a set of sequences and a numeric attribute value for each sequence, the problem is to find the optimal (with respect to a scoring function) pair of patterns combined with any Boolean function.
     • Past work finds combinations of two patterns $p$ and $q$ where $p \wedge q$ occurs in each string.
     • This paper's formulation allows all possible combinations, such as $p \wedge \neg q$, so conditions like "presence of one element but absence of the other" can be specified.
     • Thus the method can be used to find cooperative as well as competing sequence elements.
     • The main contributions of the paper are an $O(N^2)$ algorithm and an implementation based on suffix arrays (this is the homework!).

  4. Preliminaries
     • Let $\Sigma$ be a finite alphabet and let $\varepsilon$ denote the empty string.
     • Let $\Psi(p, s)$ be a Boolean matching function that is true if and only if $p$ is a substring of $s$.
     • Boolean pattern pair: a triplet $\langle F, p, q \rangle$, where $p$ and $q$ are patterns and $F$ is a 2-ary Boolean function.
     • The matching function value for a pattern pair is defined as $\Psi(\langle F, p, q \rangle, s) = F(\Psi(p, s), \Psi(q, s))$.
     • All possible choices of $F$ are defined in the following table.

  5. All Candidate Boolean Operations on <F,p,q>
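The table for this slide is not reproduced in the transcript. As a minimal sketch, the 16 candidate functions can be enumerated by truth table. The bit encoding used here (8 for (true, true), 4 for (true, false), 2 for (false, true), 1 for (false, false)) is an assumption inferred from how F1, F2, F4, and F8 are used on slide 22, and the names `boolean_function` and `matches_pair` are illustrative, not taken from the paper.

```python
def boolean_function(index):
    """Return F_index (0..15) as a Python function of two booleans."""
    if not 0 <= index <= 15:
        raise ValueError("index must be in 0..15")

    def f(a, b):
        # Select the truth-table bit for this input combination
        # (assumed encoding, see the lead-in).
        if a and b:
            bit = 8
        elif a:
            bit = 4
        elif b:
            bit = 2
        else:
            bit = 1
        return bool(index & bit)

    return f


def matches_pair(index, p, q, s):
    """Psi(<F_index, p, q>, s) = F_index(Psi(p, s), Psi(q, s))."""
    # Psi(p, s) is plain substring containment.
    return boolean_function(index)(p in s, q in s)
```

Under this encoding F8 is "p and q", F4 is "p and not q", F14 is "p or q", and F6 is "exactly one of p, q".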

  6. Preliminaries
     • A pattern or a Boolean pattern pair $\Pi$ matches a string $s$ if and only if $\Psi(\Pi, s)$ is true. The pattern $\varepsilon$ matches any string.
     • For a given set of strings $S = \{s_1, \ldots, s_m\}$, let $M(\Pi, S) = \{ i \mid \Psi(\Pi, s_i) = \text{true} \}$ denote the set of indices of the strings in $S$ that $\Pi$ matches, and let its complement be $M'(\Pi, S) = \{ i \mid \Psi(\Pi, s_i) = \text{false} \}$.
     • For each $s_i \in S$ we are given an associated numeric attribute value $r_i$. Let $R(\Pi, S) = \sum_{i \in M(\Pi, S)} r_i$ denote the sum of $r_i$ over all $s_i$ that $\Pi$ matches.
     • Let $M(\Pi)$ and $R(\Pi)$ be shorthand for $M(\Pi, S)$ and $R(\Pi, S)$, respectively. Note that $|M(\varepsilon)| = m$ and $R(\varepsilon) = \sum_{i=1}^{m} r_i$.
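A small sketch of these definitions for a single pattern, using the three strings from the GST example on slide 11. The attribute values and the function names `matched_indices` and `attribute_sum` are made up for illustration, and indices are 0-based here rather than 1-based as on the slide.

```python
def matched_indices(pattern, strings):
    """M(pattern, S): indices of the strings that contain the pattern."""
    return {i for i, s in enumerate(strings) if pattern in s}


def attribute_sum(pattern, strings, attrs):
    """R(pattern, S): sum of r_i over the strings the pattern matches."""
    return sum(attrs[i] for i in matched_indices(pattern, strings))


strings = ["facct", "gctt", "ctctg"]
attrs = [1.0, 2.0, 4.0]  # hypothetical attribute values r_1, r_2, r_3

# The empty pattern matches every string, so |M| = m and R is the total.
assert len(matched_indices("", strings)) == len(strings)
assert attribute_sum("", strings, attrs) == sum(attrs)
```

Note that "ct" occurs twice in "ctctg" but contributes its attribute value only once, which is exactly the redundancy the correction factors in the first step of the algorithm remove.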

  7. Scoring Function
     • The objective is to find a pattern that maximizes a suitable scoring function $\mathit{score}$.
     • The paper concentrates on scoring functions whose value for a pattern $\Pi$ depends on values accumulated over the strings in $S$ that match $\Pi$.
     • The scoring function takes the parameters $|M(\Pi)|$ and $R(\Pi)$.
     • It is also assumed that the score can be computed in constant time once the parameter values are known.
     • The specific choice of scoring function depends heavily on the particular application.

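  8. Problem Definition
     • Given a set $S = \{s_1, \ldots, s_m\}$ of strings, where each string $s_i$ is assigned a numeric attribute value $r_i$, and a scoring function $\mathit{score}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, find the Boolean pattern pair $\Pi \in \{ \langle F, p, q \rangle \mid p, q \in \Sigma^*,\ F \in \{F_0, \ldots, F_{15}\} \}$ that maximizes $\mathit{score}(|M(\Pi)|, R(\Pi))$.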

  9. Suffix tree & GST
     • A suffix tree for a string $s$ is a rooted tree whose edges are labeled with substrings of $s$.
     • For a node $v$, $l(v)$ is the string obtained by concatenating the edge labels on the path from the root to $v$.
     • For each leaf $v$, $l(v)$ is a distinct suffix of $s$, and for each suffix of $s$ there is a leaf $v$.
     • Each internal node has at least two children, and the first characters of the labels on the edges to its children are distinct.
     • GST: given a set $S = \{s_1, \ldots, s_m\}$, the GST is a suffix tree for the string $s_1 \$_1 \cdots s_m \$_m$, where each $\$_i$ is a distinct character that does not belong to $\Sigma$.
     • Every path is ended at the first appearance of any $\$_i$, and each leaf is labeled with $\mathit{id}_i$.
     • The GST takes $O(N)$ space and can be built in $O(N)$ time.

  10. Suffix tree for $s$ = caggaggaccat
     • The paths of the suffix tree from the root to the leaves (suffixes) are sorted in lexicographic order from left to right, each leaf corresponding to a position in the suffix array.
     • The integer in the suffix array gives the position in the string at which the corresponding suffix starts: $A_s[i] = j$ indicates that $s[j:n]$ is the $i$-th suffix in lexicographic order.
     • The lcp array gives the length of the longest common prefix (the shared path from the root) of consecutive suffixes in the suffix array.
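As a sketch of the two arrays for this example string, the code below builds them the slow, obvious way: by sorting the suffixes directly. This is for illustration only and is not the linear-time construction; positions are 0-based, whereas the slide's figure may use 1-based positions, and no terminator character is appended.

```python
def suffix_array(s):
    """Start positions of the suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda j: s[j:])


def lcp_array(s, sa):
    """lcp[k] = length of the longest common prefix of suffixes sa[k] and sa[k + 1]."""
    def common_prefix_length(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n

    return [common_prefix_length(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


s = "caggaggaccat"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
print(sa)   # [7, 4, 1, 10, 0, 9, 8, 6, 3, 5, 2, 11]
print(lcp)  # [1, 4, 1, 0, 2, 1, 0, 2, 1, 3, 0]
```

For instance, the second and third suffixes in sorted order are "aggaccat" and "aggaggaccat", which share the prefix "agga", giving the lcp value 4.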

  11. GST (Generalized Suffix Tree) A Generalized Suffix Tree and its corresponding suffix array for the strings {facct, gctt, ctctg}.

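  12. A Naïve $O(N^3)$ Algorithm
     • Let $N = \sum_{i=1}^{m} \mathrm{length}(s_i)$.
     • There are $O(N)$ candidates for a single pattern: the patterns of the form $l(v)$, where $v$ is a node of the GST over the set $S$. (Why? A substring that ends in the middle of an edge matches exactly the same strings as $l(v)$ for the node $v$ at the lower end of that edge.)
     • Hence there are $O(N^2)$ candidate pattern pairs.
     • For a given pair $\langle F, l(v_1), l(v_2) \rangle$, the values $|M(\Pi)|$ and $R(\Pi)$ can be computed in $O(N)$ time with any linear-time string matching algorithm.
     • The scoring function value is then calculated in constant time given $|M(\Pi)|$ and $R(\Pi)$.
     • Time: $O(N^3)$. Space: $O(N)$ for the suffix tree.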
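A brute-force baseline in the spirit of this slide can be sketched as follows. It is not the slide's algorithm as stated: for simplicity it enumerates every distinct substring (plus the empty pattern) instead of only the GST node labels, and it tests containment with Python's `in` operator, so it is slower than $O(N^3)$. The truth-table bit encoding of $F_0, \ldots, F_{15}$ is an assumption inferred from slide 22, and the function names and sample scoring function are hypothetical.

```python
def candidate_patterns(strings):
    """All distinct substrings of the input strings, plus the empty pattern."""
    subs = {""}
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                subs.add(s[i:j])
    return sorted(subs)


def truth_bit(a, b):
    """Assumed encoding: 8 = (T, T), 4 = (T, F), 2 = (F, T), 1 = (F, F)."""
    return 8 if a and b else 4 if a else 2 if b else 1


def best_pair_bruteforce(strings, attrs, score):
    """Return (best score, F index, p, q) over all candidate pattern pairs."""
    best = None
    candidates = candidate_patterns(strings)
    for p in candidates:
        for q in candidates:
            # One truth-table bit per string for this (p, q).
            bits = [truth_bit(p in s, q in s) for s in strings]
            for index in range(16):
                matched = [i for i, bit in enumerate(bits) if index & bit]
                value = score(len(matched), sum(attrs[i] for i in matched))
                if best is None or value > best[0]:
                    best = (value, index, p, q)
    return best


# Hypothetical data: reward matching strings 0 and 2 while avoiding string 1.
strings = ["facct", "gctt", "ctctg"]
attrs = [3.0, -2.0, 1.0]
best = best_pair_bruteforce(strings, attrs, lambda count, total: total)
```

With this toy scoring function the best achievable value is 4.0, reached by any pair that matches the first and third strings but not the second, for example "contains the empty pattern and does not contain tt".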

  13. $O(N^2)$ Algorithm
     • Two steps:
     • Step 1: find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of the GST in $O(N)$ time and space.
     • Step 2: solve the optimal pair of substring patterns problem in $O(N^2)$ time and $O(N)$ space, for any scoring function $\mathit{score}$ that can be calculated in constant time given its inputs.

  14. Algorithm- First step
     • If $R(l(v))$ can be found for all $v$ in $O(N)$ time, then so can $|M(l(v))|$: when $r_i = 1$ for all $i$, $R(l(v)) = |M(l(v))|$.
     • Let $LF(v)$ be the set of all leaf nodes in the subtree rooted at node $v$.
     • Let $c_i(v)$ denote the number of leaves in $LF(v)$ that have the label $\mathit{id}_i$.
     • Let $\sum_{LF(v)} r_i$ denote the sum of the attribute values over the leaves in $LF(v)$, where a leaf labeled $\mathit{id}_i$ contributes $r_i$.

  15. Algorithm- First step
     • $\sum_{LF(v)} r_i = \sum_{i \in M(l(v))} c_i(v)\, r_i$
     • $R(l(v)) = \sum_{i \in M(l(v))} r_i = \sum_{LF(v)} r_i - \sum_{i \in M(l(v))} (c_i(v) - 1)\, r_i$ … (1)
     • Let the correction factor be $\mathit{corr}(l(v), S) = \sum_{i \in M(l(v))} (c_i(v) - 1)\, r_i$.
     • In (1), $\sum_{LF(v)} r_i$ can be calculated for all $v$ with a linear-time post-order traversal, since $\sum_{LF(v)} r_i = \sum_{v'} \sum_{LF(v')} r_i$, where $v'$ ranges over the child nodes of $v$.

  16. Algorithm- First step
     • How do we remove the redundancies (the correction factors) in (1)?
     • Let $I(\mathit{id}_i)$ be the list of all leaves with the label $\mathit{id}_i$, in the order they appear in a post-order traversal of the tree. The lists $I$ can be constructed for all labels $\mathit{id}_i$ in linear time.
     • The leaves in $LF(v)$ with the label $\mathit{id}_i$ form a contiguous interval of length $c_i(v)$ in the list $I(\mathit{id}_i)$.
     • If $c_i(v) > 0$, a length-$c_i(v)$ interval in $I(\mathit{id}_i)$ contains $c_i(v) - 1$ pairs of adjacent leaves.
     • If $x, y \in LF(v)$, the node $\mathrm{lca}(x, y)$ belongs to the subtree rooted at $v$.
     • For any $s_i \in S$, $\Psi(l(v), s_i) = \text{true}$, that is, $i \in M(l(v))$, if and only if there is a leaf $x \in LF(v)$ with the label $\mathit{id}_i$.

  17. Algorithm- First step
     • Initially, the correction value is 0 for all $v$.
     • For each pair of adjacent leaves $(x, y)$ in $I(\mathit{id}_i)$, add $r_i$ to the correction value of the node $\mathrm{lca}(x, y)$.
     • Afterwards, for each $v$ with $c_i(v) > 0$, the sum of the correction values in the nodes of the subtree rooted at $v$ is $(c_i(v) - 1)\, r_i$.
     • Repeating this for all lists $I(\mathit{id}_i)$ brings that total to $\sum_{i \in M(l(v))} (c_i(v) - 1)\, r_i = \mathit{corr}(l(v), S)$.
     • Perform a linear-time bottom-up (post-order) traversal to find $R(l(v))$.
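The correction-value idea can be sketched on a small example. The code below is an illustration under simplifying assumptions, not the paper's implementation: it uses an uncompressed suffix trie instead of a linear-size GST, and a naive parent-climbing `lca` instead of a constant-time LCA structure, so it does not achieve the $O(N)$ bound. All class and function names are hypothetical.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = {}    # edge symbol -> child Node
        self.leaf_id = None   # string index i for a leaf, otherwise None
        self.corr = 0.0       # correction value accumulated at this node


def build_suffix_trie(strings):
    root = Node()
    for i, s in enumerate(strings):
        for start in range(len(s) + 1):
            node = root
            # The terminator ("$", i) is unique to string i, so every
            # suffix of every string ends in its own leaf.
            for symbol in list(s[start:]) + [("$", i)]:
                if symbol not in node.children:
                    node.children[symbol] = Node(node)
                node = node.children[symbol]
            node.leaf_id = i
    return root


def lca(x, y):
    """Lowest common ancestor by climbing parent pointers (naive)."""
    while x.depth > y.depth:
        x = x.parent
    while y.depth > x.depth:
        y = y.parent
    while x is not y:
        x, y = x.parent, y.parent
    return x


def attribute_sums(strings, attrs):
    """Map each substring l(v) to R(l(v)) using the correction values."""
    root = build_suffix_trie(strings)
    # I(id_i): leaves of string i in depth-first order, so the leaves
    # below any node form a contiguous run in each list.
    leaves_by_id = [[] for _ in strings]
    stack = [root]
    while stack:
        node = stack.pop()
        if node.leaf_id is not None:
            leaves_by_id[node.leaf_id].append(node)
        stack.extend(node.children.values())
    # Add r_i at the lca of every pair of adjacent leaves in I(id_i).
    for i, leaves in enumerate(leaves_by_id):
        for x, y in zip(leaves, leaves[1:]):
            lca(x, y).corr += attrs[i]
    result = {}

    def visit(node, label):
        # Bottom-up: sum over the children, minus this node's correction.
        if node.leaf_id is not None:
            return attrs[node.leaf_id]
        value = 0.0
        for symbol, child in node.children.items():
            child_label = label + symbol if isinstance(symbol, str) else label
            value += visit(child, child_label)
        value -= node.corr
        result[label] = value
        return value

    visit(root, "")
    return result
```

Calling `attribute_sums(strings, [1] * len(strings))` yields the counts $|M(l(v))|$, as noted on slide 14. The recursive traversal is fine for short strings; long inputs would need an iterative post-order pass.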

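  18. Algorithm- First step
     • The correction values at $v_1, v_2, v_3$ are set to $r_3, r_2, r_3$, respectively.
     • $v_3$: $r_3 + r_2 + r_3 - r_3 = r_2 + r_3 = R(l(v_3))$
     • $v_2$: $R(l(v_3)) + r_2 - r_2 = r_2 + r_3 = R(l(v_2))$
     • $v_1$: $r_1 + R(l(v_2)) + r_3 - r_3 = r_1 + r_2 + r_3 = R(l(v_1))$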

  19. Pseudo code for Step 1

  20. $O(N^2)$ Algorithm
     • Two steps:
     • Step 1: find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of the GST in $O(N)$ time and space.
     • Step 2: solve the optimal pair of substring patterns problem in $O(N^2)$ time and $O(N)$ space, for any scoring function $\mathit{score}$ that can be calculated in constant time given its inputs.

  21. Algorithm- Second step
     • There are $O(N)$ choices for the first pattern $l(v_1)$.
     • For each $l(v_1)$, use a modified version of the previous algorithm to handle the $O(N)$ choices for the second pattern $l(v_2)$.
     • Given a fixed $l(v_1)$, additionally label each string $s_i \in S$, and the corresponding leaves in the GST, with the Boolean value $\Psi(l(v_1), s_i)$. This takes $O(N)$ time.
     • Accumulate the sums and correction values separately for the true and false values of the additional label.

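  22. Algorithm- Second step
     • $\sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}) = \sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true},\ \Psi(l(v_2), s_i) = \text{true}) = R(\langle F_8, l(v_1), l(v_2) \rangle)$
     • $\sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}) = \sum_{i \in M(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false},\ \Psi(l(v_2), s_i) = \text{true}) = R(\langle F_2, l(v_1), l(v_2) \rangle)$
     • $\sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true}) = \sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{true},\ \Psi(l(v_2), s_i) = \text{false}) = R(\langle F_4, l(v_1), l(v_2) \rangle) = R(l(v_1)) - R(\langle F_8, l(v_1), l(v_2) \rangle)$
     • $\sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false}) = \sum_{i \in M'(l(v_2))} (r_i \mid \Psi(l(v_1), s_i) = \text{false},\ \Psi(l(v_2), s_i) = \text{false}) = R(\langle F_1, l(v_1), l(v_2) \rangle) = R(\varepsilon) - R(l(v_1)) - R(\langle F_2, l(v_1), l(v_2) \rangle)$
     • Here $R(\varepsilon)$ and $R(l(v_1))$ can be computed in linear time.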

  23. Algorithm- Second step
     • All cumulative values of the form $\sum_i (r_i \mid \Psi(l(v_1), s_i) = b_1,\ \Psi(l(v_2), s_i) = b_2)$, where $b_1, b_2 \in \{\text{true}, \text{false}\}$, can be computed in linear time.
     • Thus $R(\langle F, l(v_1), l(v_2) \rangle)$, and hence the score, can be computed in linear time for all pairs of the form $\langle F, l(v_1), l(v_2) \rangle$, given a fixed $l(v_1)$.
     • Thus the total is $O(N^2)$ time for all pattern pairs.
     • Since the $O(N)$ calculations for each $l(v_1)$ are independent, the same GST can be reused. Hence the space complexity is $O(N)$.
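The bookkeeping behind these identities can be sketched for one fixed pair of patterns. In this illustration the two directly accumulated cells, (true, true) and (false, true), are computed with plain substring tests instead of the labeled GST traversal, so the sketch shows only how the four cells and $R(\langle F, p, q \rangle)$ follow from them. The function names, the sample data, and the truth-table bit encoding (8, 4, 2, 1) are assumptions for illustration.

```python
def quadrant_sums(p, q, strings, attrs):
    """Attribute sums for the four (Psi(p, s), Psi(q, s)) combinations, keyed by bit."""
    r_all = sum(attrs)                                           # R(epsilon)
    r_p = sum(r for s, r in zip(strings, attrs) if p in s)       # R(p)
    r_tt = sum(r for s, r in zip(strings, attrs) if p in s and q in s)
    r_ft = sum(r for s, r in zip(strings, attrs) if p not in s and q in s)
    # The remaining two cells follow from the identities on slide 22.
    r_tf = r_p - r_tt
    r_ff = r_all - r_p - r_ft
    return {8: r_tt, 4: r_tf, 2: r_ft, 1: r_ff}


def attribute_sum_for(index, cells):
    """R(<F_index, p, q>): add up the cells whose truth-table bit is set in index."""
    return sum(value for bit, value in cells.items() if index & bit)


strings = ["facct", "gctt", "ctctg"]
attrs = [1.0, 2.0, 4.0]  # hypothetical attribute values
cells = quadrant_sums("ct", "tt", strings, attrs)
```

Here "ct" occurs in all three strings and "tt" only in "gctt", so the (true, true) cell is 2.0, the (true, false) cell is 5.0, and the other two cells are 0. Once the four cells are known, each of the 16 functions costs only a constant number of additions; the same decomposition applied to counts gives $|M|$.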

  24. Algorithm- Second step

  25. The rest of the paper in a nutshell
     • Extension to k-ary Boolean functions.
     • Implementation using suffix arrays.
     • Computational experiments and results.
     • Algorithm variations: multiple string attributes, distance restrictions.

  26. Homework
     • Explain the implementation of the Optimal Boolean Pattern Pair problem using suffix arrays in your own words. Also explain why it is more efficient than the suffix tree approach. Email: siv@tamu.edu

  27. THANK YOU
