Enumerating, Sampling and Counting: some illustrative cases in biology

Olivier Martin, Laboratoire de Physique Théorique et Modèles Statistiques and UMR de Génétique Végétale, University of Paris-Sud. Outline: enumerating discrete structures; sampling and searching; nested sampling for counting.


Presentation Transcript


  1. Enumerating, Sampling and Counting: some illustrative cases in biology. Olivier Martin, Laboratoire de Physique Théorique et Modèles Statistiques and UMR de Génétique Végétale, University of Paris-Sud. Outline: • Enumerating discrete structures • Sampling and searching • Nested sampling for counting

  2. [ I ]: Enumerating Discrete Structures. Illustrative case: trees describing pedigrees. O. Martin and F. Hospital, Genetics (2004). In a breeding program, one wants to (optimally) cross a collection of “parents” to produce an ideal genome, but the mixing of the genes (Mendelian genetics) is probabilistic and depends on their mutual distances. General framework: each individual in the parental population has one good gene (resistance to one disease) and the “ideotype” must accumulate all these into one genome. The crossing of 2 parents should pass on their good genes to at least one offspring.

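  3. Transmission of genes. We impose that a gamete cumulate all the good genes of the 2 chromosomes of its parent. [Figure: parent H(1)(2), with chromosomes s1 = 1 and s2 = 2, gives the gamete s = 1,2; parent H(3)(4), with s1 = 3 and s2 = 4, gives the gamete s = 3,4; their offspring H(12)(34), with s1 = 1,2 and s2 = 3,4, gives the gamete s = 1,2,3,4.]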

  4. Example of a simple pedigree. P1, P2, P3: founder parents. I*: ideotype.

  5. Representation of a pedigree. Pedigrees differ by: • the tree structure • the choice of parents. [Figure: pedigree trees over the founder parents P1, P2, P3, P4, shown with the leaf orders P1 P2 P3 P4 and P1 P4 P2 P3.]

  6. Particular cases of pedigrees: the cascade and the regular pyramid. For the n = 8 example shown: min height = log_2(n) = 3 (regular pyramid); max height = n - 1 = 7 (cascade).

  7. Pedigree = binary leaf-labeled tree. [Figure: leaves P1, P2, P3, P4, P5, P6 at level 0; nodes H(1)(2), H(3)(4), H(5)(6) at level 1; H(12)(34) at level 2; H(1234)(56) at level 3.]

  8. Questions • How to count the number of distinct pedigrees? • How to enumerate them by computer for further use? • How to sample them uniformly? • How to find the “optimal” pedigree given that each pedigree has a cost?

  9. Counting the number of pedigrees. For n genes, one has A(n) = (2n - 3)!! pedigrees (by recurrence equations):
     n      3    4    5     8        10           20
     A(n)   3    15   105   135135   3.4 x 10^7   8.2 x 10^21
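The closed form above can be checked numerically. The following is a minimal sketch; the function name `count_pedigrees` is an illustrative choice, not something taken from the slides. It evaluates the double factorial (2n - 3)!! as the product of the odd integers from 2n - 3 down to 1, which is the same as iterating the recurrence A(n) = (2n - 3) A(n - 1) with A(2) = 1.

```python
def count_pedigrees(n):
    """Return A(n) = (2n - 3)!!, the number of pedigrees cumulating n genes (n >= 2)."""
    result = 1
    # Multiply the odd integers 2n-3, 2n-5, ..., 3, 1.
    for k in range(2 * n - 3, 0, -2):
        result *= k
    return result


for n in (3, 4, 5, 8, 10, 20):
    print(n, count_pedigrees(n))
```

The values for n = 3, 4, 5 and 8 are 3, 15, 105 and 135135, matching the table; Python integers are arbitrary precision, so n = 20 is computed exactly rather than in floating point.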

  10. Enumeration of all pedigrees. One fuses two sub-pedigrees, one cumulating p genes and the other cumulating (n - p) genes, to obtain a pedigree cumulating n genes.

  11. An algorithm for constructing all pedigrees. Suppose all sub-pedigrees of height at most h are known; one can generate all those of height h+1: • Examine all pairs of sub-pedigrees {P1, P2} of heights h1 = h and h2 ≤ h • If P1 and P2 have no good gene in common, fuse them to form a sub-pedigree P of height (h+1) • If P cumulates all the good genes, keep it as a complete pedigree; otherwise add it to the list of sub-pedigrees of height h+1. Repeat for the next height until h+1 = n-1.
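The height-by-height construction can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: a leaf is represented by its gene number, a fused sub-pedigree by a `frozenset` of its two children (so the order of the two parents does not matter), and the names `enumerate_pedigrees`, `genes_of` and `by_height` are hypothetical.

```python
from itertools import chain


def enumerate_pedigrees(n):
    """Return all pedigrees cumulating genes 1..n (n >= 2), built height by height."""
    full = frozenset(range(1, n + 1))
    # genes_of maps each sub-pedigree to the set of good genes it cumulates.
    genes_of = {g: frozenset([g]) for g in range(1, n + 1)}
    by_height = [set(genes_of)]  # by_height[h] = incomplete sub-pedigrees of height h
    complete = []
    for h in range(n - 1):
        new_level = set()
        lower = list(chain.from_iterable(by_height))  # all heights <= h
        for p1 in by_height[h]:          # one child has height exactly h
            for p2 in lower:             # the other has height <= h
                if genes_of[p1] & genes_of[p2]:
                    continue             # a good gene in common: no fusion
                fused = frozenset((p1, p2))
                if fused in genes_of:
                    continue             # same pair already seen in the other order
                genes_of[fused] = genes_of[p1] | genes_of[p2]
                if genes_of[fused] == full:
                    complete.append(fused)   # cumulates all good genes
                else:
                    new_level.add(fused)     # sub-pedigree of height h+1
        by_height.append(new_level)
    return complete


print(len(enumerate_pedigrees(5)))
```

Because every tree has a unique height and a unique pair of children, each pedigree is produced exactly once, so the length of the returned list should agree with A(n) = (2n - 3)!!. Memory grows very quickly with n, which is consistent with plain enumeration being limited to small numbers of genes.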

  12. Working of the algorithm h=0

  13. Working of the algorithm ... h=0 h=1

  14. Working of the algorithm ... h=0 h=1

  15. Working of the algorithm ... h=0 h=1 etc ...

  16. Working of the algorithm ... h=0 h=1 etc ... h=2

  17. Working of the algorithm ... h=0 h=1 etc ... h=2

  18. Working of the algorithm ... h=0 h=1 etc ... h=2 etc ... h=3

  19. Example : cascade with 4 genes

  20. Optimal pedigrees: search by pruning the enumeration (branch and bound). Of all the ways to produce a given combination of good genes, keep only the best sub-pedigree. Enumeration: one can treat up to 14 genes. Branch and bound: up to 22 genes. Case of “adjacent” cascades: dynamic programming determines the optimal pedigree in O(n^2) operations.
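The idea of keeping only the best sub-pedigree for each combination of good genes can be illustrated with a dynamic program over gene subsets. The slides do not specify the cost model, so this sketch assumes one: the cost of a pedigree is the sum, over its crosses, of a user-supplied `cross_cost(a, b)`, where `a` and `b` are bitmasks of the genes carried by the two parents. Both `best_pedigree_cost` and `cross_cost` are hypothetical names, and this is neither the branch and bound of the slide nor its O(n^2) procedure for adjacent cascades.

```python
from functools import lru_cache


def best_pedigree_cost(n, cross_cost):
    """Minimum total cost of a pedigree cumulating n genes, under an additive cost model."""
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def best(mask):
        if mask & (mask - 1) == 0:
            return 0.0                      # a single gene: a founder parent, no cross
        low = mask & -mask                  # lowest gene, used to count each split once
        best_cost = float("inf")
        sub = (mask - 1) & mask             # iterate over proper non-empty submasks
        while sub:
            if sub & low:
                other = mask ^ sub
                cost = best(sub) + best(other) + cross_cost(sub, other)
                best_cost = min(best_cost, cost)
            sub = (sub - 1) & mask
        return best_cost

    return best(full)


# With a unit cost per cross, every pedigree on n genes costs n - 1.
print(best_pedigree_cost(5, lambda a, b: 1))
```

Memoising on the gene subset is what discards all but the best sub-pedigree for each combination; the work is of order 3^n subset splits rather than (2n - 3)!! trees.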

  21. [ II ]: Sampling and searching. This problem is ubiquitous: • Physics: equilibrium configurations • Operations research: feasible solutions of CSP • Statistics: estimating p-values

  22. The royal road: Markov chain Monte Carlo. To obtain samples with a given probability distribution or measure, use the Metropolis algorithm (1953). It is simple, and very effective if there are no bottlenecks. If the measure is fragmented, one needs large “moves”, but such moves almost always fail.
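For readers unfamiliar with the Metropolis rule, here is a minimal sketch on a toy discrete space. It assumes a symmetric proposal, so that the acceptance probability reduces to min(1, w(y)/w(x)); the five-state ring and its weights are invented for illustration, and the names `metropolis`, `weight` and `propose` are not from the slides.

```python
import random


def metropolis(weight, propose, x0, n_steps, rng):
    """Sample states with probability proportional to weight(x), for symmetric proposals."""
    x = x0  # must have weight(x0) > 0
    samples = []
    for _ in range(n_steps):
        y = propose(x, rng)
        # Accept with probability min(1, w(y) / w(x)); otherwise stay at x.
        if rng.random() < min(1.0, weight(y) / weight(x)):
            x = y
        samples.append(x)
    return samples


# Toy target: five states on a ring with unnormalised weights 1, 2, 3, 2, 1.
weights = [1, 2, 3, 2, 1]
rng = random.Random(0)
samples = metropolis(
    weight=lambda s: weights[s],
    propose=lambda s, r: (s + r.choice((-1, 1))) % 5,  # step left or right
    x0=0,
    n_steps=100_000,
    rng=rng,
)
print([round(samples.count(s) / len(samples), 3) for s in range(5)])
```

The empirical frequencies should approach 1/9, 2/9, 3/9, 2/9, 1/9. In this toy example every state is reachable by small moves; the bottleneck problem mentioned in the slide arises when regions of non-zero weight are separated by regions of zero or very small weight.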

  23. The case of biological networks: some computational challenges
     (1) Generate a genotype of given phenotype (oriented search)
     (2) Sample uniformly genotypes of a given phenotype: use symmetries to reduce exponentially the space size
     (3) Determine the connectivity of the neutral network: do guided search to go from one random genotype to another
     (4) Sample uniformly a connected component of the neutral network: use random walks
     (5) Sample uniformly the surface of a “ball” around a point: use Metropolis with asymmetric rates
     (6) Get the infinite population limit of a population under Darwinian selection: use variance reduction and 1/N extrapolation

  24. Viable genotypes are rare. S. Ciliberti, O. Martin and A. Wagner, PLoS Comp. Bio. (2007). If one allows for M interactions (M non-zero entries of W) between N genes, what fraction of the genotypes (regulatory networks) are viable? By smart sampling: illustration when M = 0.25 N^2

  25. Showing connectivity properties of biological networks. We want to check with a high level of confidence that a certain space S is connected. We do this in three steps:
     (1) Use the Metropolis MC algorithm to produce random pairs of points (P1, P2) in the space S
     (2) Generate an “equilibrium” cloud of points in S around P1 by a biased Monte Carlo and store these
     (3) Produce a MC chain of points in S, starting from P2, using for instance the same Monte Carlo rule as above; check for collisions with the stored set. If a collision arises, P1 is connected to P2
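Steps (2) and (3) can be sketched on a toy space of bit strings. The assumptions here are loud ones: the space is given by a membership test `in_space`, moves are single bit flips, and an unbiased random walk stands in for the biased Monte Carlo of the slide. The names `walk` and `appear_connected` are hypothetical.

```python
import random


def walk(start, in_space, n_steps, rng):
    """Yield the points visited by a single-bit-flip random walk confined to the space."""
    x = list(start)
    for _ in range(n_steps):
        i = rng.randrange(len(x))
        x[i] ^= 1
        if not in_space(x):
            x[i] ^= 1  # the move would leave the space: undo it
        yield tuple(x)


def appear_connected(p1, p2, in_space, cloud_steps, chain_steps, rng):
    """Return True if a chain started at p2 collides with a stored cloud around p1."""
    cloud = {tuple(p1)}
    cloud.update(walk(p1, in_space, cloud_steps, rng))   # step (2): store the cloud
    if tuple(p2) in cloud:
        return True
    for point in walk(p2, in_space, chain_steps, rng):   # step (3): look for a collision
        if point in cloud:
            return True
    return False


# Toy space: 10-bit strings with at most five 1s (connected under bit flips).
rng = random.Random(0)
p1 = (0,) * 10
p2 = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
print(appear_connected(p1, p2, lambda x: sum(x) <= 5, 2000, 5000, rng))
```

A collision proves that P1 and P2 lie in the same component. The absence of a collision within the step budget proves nothing by itself, which is why the slide speaks of a high level of confidence obtained over many random pairs.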

  26. The viable genotypes form a connected network. Very few viable networks are not in the giant connected component, and the few such networks are usually isolated. Example: for M = 0.25 N^2, the fraction of viable networks not belonging to the giant component is: 2.3×10^-3 at N = 8; 1.7×10^-3 at N = 12; 1.4×10^-3 at N = 20

  27. Structure in the neighborhood of a viable genotype

  28. Neutral network topology. S. Ciliberti, O. Martin and A. Wagner, PNAS (2007)

  29. Constructive samplers. When the measure is fragmented, resort to creating samples ab initio and use weights. One needs to “guide” the construction, otherwise the weights have huge variance. Some cases are “easy” (Sinclair et al.): Polynomial Randomized Approximation Scheme. Some difficult cases have been treated (PERM of Grassberger), but it is an art.

  30. Other samplers • Choose at random sufficiently small sub-regions and apply branch and bound in each to get configurations (very slow) • Perform nested sampling (multiple measures interpolating to the desired one) • Accept an incorrect distribution and just get “some” configurations by guided stochastic search; this is OK in the context of search or “design”

  31. What makes a regulatory network robust and how can one “design” functional networks? Q is a “quality” factor which measures the synergy of the W_ij S_j. The mutational robustness and our measure Q have a strong association.

  32. [ III ]: Nested sampling for counting. Sometimes it is not enough to sample feasible solutions; one may want to know their number or frequency: • Physics: entropy • Statistics: small p-values • Operations research: size of set of feasible solutions of CSP • Biology: computing neutral network sizes

  33. Nested sampling. In a discrete space, we want to sample configurations having an unusual property, forming a fraction of say 1 in a trillion. Randomly sampling the full space won't do. Often Monte Carlo won't work because the desired sub-space is fragmented. Introduce a family of measures interpolating between the full space and the desired sub-space, and use exchange Monte Carlo on the replicas.

  34. Example: cardinality of the “neutral” network in RNA modeling. T. Jorg, O. Martin and A. Wagner, submitted to BMC Bioinformatics. Discrete space of sequences, of which only a tiny fraction have the correct folding. Changing the sequence just a bit sometimes changes the folding a lot, so the space is fragmented. A simple choice for the measures: increasing distances to the target fold. At very short distances the measure is fragmented, but use of larger distances restores connectivity, thereby allowing the use of the Metropolis approach. Even with this simple choice, one can efficiently sample the space of interest uniformly in spite of its rarity. Extra bonus: one can both sample and count stochastically, in contrast to standard Monte Carlo.
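The counting idea can be sketched on a toy problem where the answer is known. The assumptions are substantial and worth stating: Hamming distance to a target bit string stands in for distance to the target fold, the nested sets S_d are the strings within distance d of the target, and each level is sampled by its own independent Metropolis chain instead of the exchange Monte Carlo on replicas described in the talk. The size of the rare set is then the size of the full space times a product of estimated ratios |S_(d-1)| / |S_d|. The names `ratio_estimate` and `estimate_ball_size` are hypothetical.

```python
import random
from math import comb


def ratio_estimate(target, d, n_steps, burn_in, rng):
    """Estimate |S_(d-1)| / |S_d| with a Metropolis chain uniform on S_d."""
    length = len(target)
    x = list(target)   # the target lies in every S_d, so it is a valid start
    dist = 0           # current Hamming distance to the target
    hits = 0
    for step in range(burn_in + n_steps):
        i = rng.randrange(length)
        new_dist = dist + (1 if x[i] == target[i] else -1)
        if new_dist <= d:          # symmetric proposal, uniform target: accept iff in S_d
            x[i] ^= 1
            dist = new_dist
        if step >= burn_in and dist <= d - 1:
            hits += 1
    return hits / n_steps


def estimate_ball_size(target, d_min, n_steps, burn_in, rng):
    """Estimate |S_(d_min)| as 2^L times the product of the nested ratios."""
    length = len(target)
    size = float(2 ** length)
    for d in range(length, d_min, -1):
        size *= ratio_estimate(target, d, n_steps, burn_in, rng)
    return size


rng = random.Random(0)
target = [0] * 16
estimate = estimate_ball_size(target, 2, 20_000, 1_000, rng)
exact = sum(comb(16, k) for k in range(3))  # 1 + 16 + 120 = 137
print(estimate, exact)
```

Each ratio is of order one and therefore cheap to estimate, whereas direct uniform sampling of the 2^16 strings would hit the target set only about once in 500 draws, and far less often in realistic cases. In this toy example the sets S_d are connected at every distance, so it does not exercise the fragmentation issue that motivates exchanging configurations between replicas.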

  35. Some conclusions. In the most favourable cases, one can enumerate, sample, search (design/optimize) and count. Sophisticated algorithmic approaches based on Markov chains allow one to sample even in intricate spaces, though at a significant computational cost. The use of nested sampling allows for approximate counting in many realistic cases. Except for enumeration, these techniques are perfectly applicable to continuous spaces.