
Erice - Structured Pattern Detection and Exploitation




Presentation Transcript


  1. Erice - Structured Pattern Detection and Exploitation Deterministic Algorithms

  2. Outline • Structured patterns - weird things happen - chance or necessity? • Structured pattern detection - suffix trees, viruses and integer linear programming • More on suffix arrays - computing LCP in linear time; finding close repeats • Detecting and exploiting patterns of recombination in binary (SNP) sequences • Adding in the complication of gene conversion

  3. Kmer frequency • Kmers in a string S over all K • Ex. S = abxabcxab • K = 2: the Kmers are ab, bx, xa, ab, bc, cx, xa, ab; five distinct 2-mers, ab and xa repeat • K = 3: abx, bxa, xab, abc, bcx, cxa, xab; six distinct 3-mers, xab repeats • K = 4: abxa, bxab, xabc, abcx, bcxa, cxab; six distinct 4-mers, no repeats • K = 5, 6, 7, 8, 9 have 5, 4, 3, 2, 1 distinct Kmers, no repeats

  4. Weird (non-obvious) Patterns? • K* = Maximum K such that some Kmer repeats in S. • K’ = K where the number of distinct Kmers is maximum (the largest such K if several tie). • D = number of distinct Kmers for K = K’ • Observations from Data: K’ = K* + 1, and D + K* = |S| • Chance or necessity? - Deterministic Necessity! • What would a statistical approach conclude?
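
The two observations can be checked by direct counting. The sketch below is only an illustration; the function names are made up here and are not from the talk. The reason the pattern is forced: for K > K* all |S| - K + 1 Kmers are distinct, while for K <= K* a repeated K*-mer makes at least K* - K + 1 positions redundant, so the distinct count never exceeds |S| - K*. In the example, K = 3 and K = 4 tie at six distinct Kmers, so the sketch takes the largest K that attains the maximum.

```python
from __future__ import annotations


def distinct_kmers(s: str, k: int) -> set[str]:
    """Return the set of distinct length-k substrings (Kmers) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def kmer_summary(s: str) -> tuple[int, int, int]:
    """Return (K*, K', D) for s (hypothetical helper, for illustration).

    K* is the largest K for which some Kmer occurs at least twice (0 if none).
    K' is the K with the most distinct Kmers; ties go to the largest K.
    D  is the number of distinct Kmers at K = K'.
    """
    n = len(s)
    counts = {k: len(distinct_kmers(s, k)) for k in range(1, n + 1)}
    # Some Kmer repeats exactly when there are fewer distinct Kmers than positions.
    k_star = max((k for k in counts if counts[k] < n - k + 1), default=0)
    d = max(counts.values())
    k_prime = max(k for k in counts if counts[k] == d)
    return k_star, k_prime, d


S = "abxabcxab"
k_star, k_prime, d = kmer_summary(S)
print(k_star, k_prime, d)
print(k_prime == k_star + 1, d + k_star == len(S))
```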

  5. String Barcoding Uncovering Optimal Virus Signatures Sam Rash, Dan Gusfield University of California, Davis.

  6. Motivation • Need for rapid virus detection • Given • unknown virus • database of known viruses • Problem • identify unknown virus quickly based on a small set of substrings.

  7. Motivation • Real World • only have sequence for pathogens in database • not possible to quickly sequence an unknown virus • can test for presence of small (<= 50 bp) strings in unknown virus • substring tests • Another Idea • String Barcoding • use substring tests to uniquely identify each virus in the database • acquire unique barcode for each virus in database

  8. Problem Definition • Formal Definition • given • set of strings S • goal • find set of strings S’, the testing set • such that for each pair s1, s2 in S, there exists at least one u in S’ where u is a substring of exactly one of s1, s2 • u is a signature substring • minimize |S’| • result • barcode for each element of S
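
The definition amounts to requiring that all barcodes be pairwise distinct. A minimal sketch of that check follows, assuming a barcode is the 0/1 vector of substring-test outcomes; the function names are illustrative, and the three strings are the ones used in the example later in the talk.

```python
from itertools import combinations


def barcode(s, tests):
    """0/1 vector: entry k is 1 when tests[k] occurs as a substring of s."""
    return tuple(int(u in s) for u in tests)


def is_testing_set(strings, tests):
    """True when every pair of strings differs on at least one substring test."""
    codes = [barcode(s, tests) for s in strings]
    return all(a != b for a, b in combinations(codes, 2))


strings = ["cagtgc", "cagttc", "catgga"]
print([barcode(s, ["gt", "tg"]) for s in strings])
print(is_testing_set(strings, ["gt", "tg"]))   # two tests separate all three
print(is_testing_set(strings, ["tg", "tt"]))   # strings 1 and 3 collide
```

Two strings receive different barcodes exactly when some test occurs in one of them and not in the other, which is the pairwise condition on the slide.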

  9. Example Figure 1.5 - signatures

  10. Problem Complexity • Complexity • unknown if NP-hard when the length of any u in S’ is unbounded • Max-Length String Barcoding • additional parameter k, a maximum length of any u in S’ • this variant is NP-hard • reduction from Minimum Testing Set (Garey, Johnson, 1979) • means all real-world uses, where substring length is bounded, have to deal with an NP-hard problem

  11. Implementation • Basic Idea: Formulate problem as an ILP • Enumerate some “useful” set of substrings from S • variable in ILP for each substring • Constraint for each pair of strings in S • means that at least one substring will be chosen to distinguish each pair • Objective Function • Minimize sum of variables in ILP
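
Written out, the formulation described on this slide takes roughly the following form. The symbols are assumptions introduced here for illustration: C is the enumerated candidate substring set, x_u is the 0/1 variable for candidate u, and C(i,j) is the set of candidates that occur in exactly one of the strings s_i, s_j.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{u \in C} x_u \\
\text{subject to}\quad & \sum_{u \in C(i,j)} x_u \ge 1 && \text{for all } 1 \le i < j \le |S|, \\
& x_u \in \{0,1\} && \text{for all } u \in C.
\end{aligned}
```

The chosen testing set is then S’ = { u : x_u = 1 }.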

  12. Implementation • Key point: complexity of ILP primarily a function of the number of variables • reducing number of candidate substring tests reduces the number of variables in ILP • how to reduce? • Key to our method: suffix trees • finds minimum cardinality set of “useful” substrings for use as candidate signature substrings

  13. Implementation: Suffix Trees • Key Properties of Suffix Tree built for set of strings S • tree with character sequences labeling edges • nodes labeled with a subset of original string IDs • every substring of original input set appears as a root-edge walk exactly once • a root-node walk is considered a root-edge walk into the node’s in-edge from its parent

  14. Implementation: Suffix Trees • root-edge walk • Creates string • appears in exactly the strings that label the node at which it ends • 2 root-edge walks ending on the same edge • Both strings created by the walks occur in exactly the same set of original strings • Can use either string • [Figure: example of a root-edge walk]

  15. Implementation: Solving • If two substrings occur in exactly the same set of original strings, only one need be considered • Use strings from suffix tree for each uniquely labeled node • Build ILP as discussed • Solve ILP using CPLEX • Acquire barcode and signatures for each original string • signature is the set of substring tests occurring in a string
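
The pipeline on this slide can be imitated on toy inputs without a suffix tree or CPLEX. The sketch below is not the authors' implementation: it swaps in brute-force substring enumeration for the suffix tree and exhaustive subset search for the ILP solver, so it is only usable on very small inputs. The function names and the `max_len` parameter are assumptions made for illustration.

```python
from itertools import combinations


def candidate_tests(strings, max_len=None):
    """Map each distinct occurrence set to one representative substring.

    Stand-in for the suffix-tree step: substrings occurring in exactly the
    same set of input strings are interchangeable, so one is kept per set.
    """
    occurs = {}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            limit = len(s) if max_len is None else min(len(s), i + max_len)
            for j in range(i + 1, limit + 1):
                occurs.setdefault(s[i:j], set()).add(idx)
    reps = {}
    # Visit shorter substrings first so each set keeps a shortest representative.
    for sub, ids in sorted(occurs.items(), key=lambda kv: (len(kv[0]), kv[0])):
        key = frozenset(ids)
        if len(key) < len(strings):  # a test present in every string separates nothing
            reps.setdefault(key, sub)
    return reps


def min_testing_set(strings, max_len=None):
    """Smallest set of tests separating every pair (exhaustive search, not an ILP)."""
    reps = candidate_tests(strings, max_len)
    keys = list(reps)
    pairs = list(combinations(range(len(strings)), 2))
    for size in range(1, len(keys) + 1):
        for chosen in combinations(keys, size):
            # Each pair needs a chosen test occurring in exactly one of the two.
            if all(any((a in k) != (b in k) for k in chosen) for a, b in pairs):
                return [reps[k] for k in chosen]
    return None  # no testing set exists, e.g. two identical input strings


strings = ["cagtgc", "cagttc", "catgga"]
print(len(candidate_tests(strings)))
print(min_testing_set(strings))
```

For the three example strings the enumeration yields five useful occurrence sets, which agrees with the five variables in the ILP shown in the talk's example.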

  16. Implementation: Example • strings: 1. cagtgc 2. cagttc 3. catgga • Each node in the suffix tree has a corresponding set of string IDs below it
      Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga
      Figure 1.2 - table of string labels for each node in suffix tree from figure 1.1
        v1 - {1,2,3}    v2 - {1,2,3}    v3 - {3}        v4 - {1}        v5 - {3}
        v6 - {1,2}      v7 - {2}        v8 - {1}        v9 - {1,2,3}    v10 - {1,2,3}
        v11 - {1,2}     v12 - {1}       v13 - {2}       v14 - {3}       v15 - {1,2,3}
        v16 - {2}       v17 - {2}       v18 - {1,3}     v19 - {1}       v20 - {3}
        v21 - {1,2,3}   v22 - {3}       v23 - {2}       v24 - {1,2}     v25 - {1}

  17. Implementation: Example
      minimize
        V18 + V22 + V11 + V17 + V8        #objective function
      st
        V18 + V22 + V11 + V17 + V8 >= 2   #this is the theoretical minimum
        V18 + V17 + V8 >= 1               #constraint to cover pair 1,2
        V22 + V11 + V8 >= 1               #constraint to cover pair 1,3
        V18 + V22 + V11 + V17 >= 1        #constraint to cover pair 2,3
      binaries                            #all variables are 0/1
        V18 V22 V11 V17 V8
      end
      Figure 1.3 - ILP constructed for suffix tree in figure 1.1 using no additional constraints (length, etc)
      Figure 1.4 - barcodes
      Figure 1.5 - signatures

  18. Implementation: Extensions • minimum and maximum lengths on signature substrings • acquire barcodes/signatures for only a subset of input strings (with respect to the whole set) • minimum string edit distance between chosen signature substrings • redundancy • require r signature substrings to differentiate each pair • adds a higher level of confidence that signatures remain valid even with mutations

  19. Results: Summary • Works quickly on most moderately sized datasets (especially when redundancy >= 2) • dataset properties • ~50k virus genomes taken from NCBI (Genbank) • 50-150 virus genomes • average length of each genome ~1000 characters • total input size ranged from approximately 50,000 – 150,000 characters • increasing dataset size scaled approximately linearly • reach 25% gap (at most 1/3 more than optimum) in just a few minutes • reach small gap (often < 1%) in 4 hours
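
The "at most 1/3 more than optimum" figure follows from the 25% gap by a short calculation. This assumes the gap is measured relative to the best solution found, which is how the relative MIP gap is conventionally defined; the symbols z (size of the best testing set found), \ell (the solver's lower bound) and z^* (the optimum) are introduced here for illustration.

```latex
\frac{z - \ell}{z} \le \frac{1}{4}
\;\Longrightarrow\;
z \le \frac{4}{3}\,\ell \le \frac{4}{3}\,z^{*}
\;\Longrightarrow\;
z - z^{*} \le \frac{1}{3}\,z^{*}.
```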

  20. Results: Summary • increasing redundancy greatly decreases run time and % gap at 4 hours in all cases tested Figure 2.1 - effect of redundancy on avg 25% gap Figure 2.2 - effect of redundancy on avg gap at 4 hours

  21. Conclusion • Practical sized testing sets obtained on reasonable sized input datasets • testing set consisting of 50 – 270 substring tests on input sets of ~100 genomes • works well with reactions that have high number of assays (substring tests) per reaction • GeneChip – 400 assays per reaction • Redundancy • Good concept in theory • Reduces solution space and hence computation time • GeneChip makes higher number of assays needed cost-effective

  22. Recognizing Patterns of Historical Recombination Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yun Song, Yufeng Wu.

  23. Sequence Recombination • P = 10100, S = 01011 • Single crossover recombination of P and S at recombination point 5 gives 10101 • The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).

  24. Network with Recombination: Deriving a Set of Sequences • Given set: 10100 10000 01011 01010 00010 10101 • Only one mutation per position is allowed. • [Figure: network over sites 12345 rooted at 00000, with mutations 1, 4, 3, 2, 5 on its edges and one recombination node with parents P and S and recombination point 5, deriving 10101]
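
Single-crossover recombination is a prefix/suffix splice. A minimal sketch, assuming sites are numbered from 1 as on the slide; the function name is illustrative.

```python
def recombine(p: str, s: str, r: int) -> str:
    """Single-crossover recombination at point r (1-based).

    Sites 1..r-1 are copied from the prefix parent p and sites r..m from
    the suffix parent s.
    """
    if len(p) != len(s):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= r <= len(p):
        raise ValueError("recombination point must lie between 1 and m")
    return p[:r - 1] + s[r - 1:]


# The slide's example: P = 10100, S = 01011, recombination point 5.
print(recombine("10100", "01011", 5))
```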

  25. The biological Problem Given a set of binary sequences derived by one mutation per position and possibly many recombinations, find the positions where clusters of historical recombinations likely occurred. These are called recombination hotspots. Applications: 1) Insight into the mechanics of recombination Science article October 14, 2005, and Nature article this week: in humans and chimps most recombinations occur in hotspots, but in different places in humans compared to chimps. 2) Association mapping: A major strategy being developed for finding genes that influence disease - the whole strategy relies on the historical effects of recombination.

  26. Two Approaches • Stochastic models of recombination and mutation - maximum likelihood - very intensive computations. • Deterministic approaches based on minimizing the number of needed recombinations, or bounding that number. Regions where the minimum number is large, or where close bounds on the minimum are large, indicate regions of hotspots. Science article.

  27. Reconstructing the Evolution of Binary Bio-Sequences • Perfect Phylogeny (tree) model • Phylogenetic Networks (DAG) with recombination • Phylogenetic Networks with disjoint cycles: Galled-Trees • Phylogenetic Networks with unconstrained cycles: Blobbed-Trees • Combinatorial Structure and Efficient Algorithms • Efficiently Computed Lower and Upper bounds on the number of recombinations needed

  28. The Perfect Phylogeny Model for binary sequences • Sites: 12345 • Ancestral sequence: 00000 • Site mutations on edges • Extant sequences at the leaves • The tree derives the set M: 10100 10000 01011 01010 00010 • [Figure: tree rooted at 00000 with mutations 1, 4, 3, 2, 5 on its edges]

  29. When can a set of sequences be derived on a perfect phylogeny? Classic necessary and sufficient condition (NASC): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1. This is the 4-Gamete Test.
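
The 4-gamete test is a check over all pairs of columns. A minimal sketch, assuming each sequence is a string over the characters '0' and '1' and all sequences have the same length; the function names are illustrative. It checks only whether a perfect phylogeny exists and ignores the duplicate-column condition, which matters only for uniqueness.

```python
from itertools import combinations


def incompatible(rows, i, j):
    """True when columns i and j (0-based) contain all four pairs 00, 01, 10, 11."""
    return len({(r[i], r[j]) for r in rows}) == 4


def passes_four_gamete_test(rows):
    """True when no pair of columns is incompatible."""
    m = len(rows[0])
    return not any(incompatible(rows, i, j) for i, j in combinations(range(m), 2))


M = ["10100", "10000", "01011", "01010", "00010"]
print(passes_four_gamete_test(M))              # the set derived on the tree
print(passes_four_gamete_test(M + ["10101"]))  # adding 10101 breaks it
print(incompatible(M + ["10101"], 3, 4))       # sites 4 and 5 conflict
```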

  30. A richer model • 10100 10000 01011 01010 00010, with 10101 added • Pair 4, 5 fails the three-gamete test. The sites 4, 5 “conflict”. • Real sequence histories often involve recombination. • [Figure: the tree over sites 12345 rooted at 00000, with mutations 1, 4, 3, 2, 5 on its edges]

  31. Network with Recombination • 10100 10000 01011 01010 00010, with 10101 new • The previous tree with one recombination event now derives all the sequences. • [Figure: the tree rooted at 00000 extended by a recombination node with parents P and S and recombination point 5, deriving 10101]

  32. Elements of a Phylogenetic Network (single crossover recombination) • Directed acyclic graph. • Integers from 1 to m written on the edges. Each integer written only once. These represent mutations. • A choice of ancestral sequence at the root. • Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it. • A node with two edges into it is a “recombination node”, with a recombination point r. One parent is P and one is S. • The network derives the sequences that label the leaves.

  33. A Phylogenetic Network • Leaves: a:00010 b:10010 c:00100 d:10100 e:01100 f:01101 g:00101 • [Figure: network rooted at 00000 with mutations 1 to 5 on its edges and two recombination nodes (parents marked P and S), with recombination points 3 and 4]

  34. Minimizing recombinations • Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. • However, the number of (observable) recombinations is small in realistic sets of sequences. “Observable” depends on n and m relative to the number of recombinations. • Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem).

  35. Minimization is NP-hard The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004) Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time. We can solve the minimization problem in polynomial time, when node disjoint recombination cycles are possible.

  36. Recombination Cycles • In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. • The cycle specified by those two paths is called a “recombination cycle”.

  37. Galled-Trees A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle. A phylogenetic network is called a “galled-tree” if every recombination cycle is a gall.

  38. A galled-tree generating the sequences generated by the prior network. • Leaves: a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101 • [Figure: galled-tree with mutations 1 to 5 on its edges and two galls (parents marked p and s), with recombination points 3 and 4]

  39. Sales pitch for Galled-Trees Galled-trees represent a small deviation from true trees. There are sufficient applications where it is plausible that a galled tree exists that generates the sequences. Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable recombinations may be a galled-tree. The number of recombinations is never more than m/2. Moreover, when M can be derived on a galled-tree, the number of recombinations used is the minimum number over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination. A galled-tree for M is “almost unique” - implications for reconstructing the correct history.

  40. Old (Aug. 2003) Results • O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence. • Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution. • Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic-networks with all-0 ancestral sequence.

  41. New work We derive the galled-tree results in a more general setting that addresses unconstrained recombination cycles and multiple crossover recombination. This also solves the problem of finding the “most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral sequence is known in advance.

  42. Blobbed-trees: generalizing galled-trees • In a phylogenetic network a maximal set of intersecting cycles is called a blob. • Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. • So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.

  43. Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship? Ugly tangled network inside the blob.

  44. Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.

  45. Incompatibility Graph • THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph. • Two nodes are connected iff the pair of sites are incompatible, i.e., fail the 4-gamete test.
      M     1 2 3 4 5
      a     0 0 0 1 0
      b     1 0 0 1 0
      c     0 0 1 0 0
      d     1 0 1 0 0
      e     0 1 1 0 0
      f     0 1 1 0 1
      g     0 0 1 0 1
      [Figure: incompatibility graph of M on nodes 1, 2, 3, 4, 5]

  46. The connected components of G(M) are very informative • The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson). • When each blob is a single-cycle (galled-tree case) all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al 2003-5)
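
The incompatibility graph and the connected-component lower bound can be computed directly from the matrix. The sketch below uses the seven sequences a to g from the example network; the function names are illustrative, and sites are numbered from 1 to match the slides.

```python
from itertools import combinations


def incompatibility_graph(rows):
    """Adjacency sets over sites 1..m; an edge joins two sites failing the 4-gamete test."""
    m = len(rows[0])
    adj = {site: set() for site in range(1, m + 1)}
    for i, j in combinations(range(m), 2):
        if len({(r[i], r[j]) for r in rows}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj


def nontrivial_components(adj):
    """Connected components containing at least one edge, found by depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen or not adj[start]:
            continue  # already visited, or an isolated (compatible) site
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
G = incompatibility_graph(M)
comps = nontrivial_components(G)
print(G)
print(comps, "lower bound on recombinations:", len(comps))
```

For this matrix the incompatible pairs are (1, 3), (1, 4) and (2, 5), giving two non-trivial components and hence a lower bound of two recombinations.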

  47. Simple Fact If two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j. (This is a general fact for all phylogenetic networks.) Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.

  48. A Phylogenetic Network • Leaves: a:00010 b:10010 c:00100 d:10100 e:01100 f:01101 g:00101 • [Figure: network rooted at 00000 with mutations 1 to 5 on its edges and two recombination nodes (parents marked P and S), with recombination points 3 and 4]

  49. Simple Consequence of the simple fact All sites on the same (non-trivial) connected component of the incompatibility graph must be on the same blob in any blobbed-tree. Follows by transitivity. So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the incompatibility graph.
