Enumerating, Sampling and Counting: some illustrative cases in biology

Outline:
• [ I ] Enumerating discrete structures
• [ II ] Sampling and searching
• [ III ] Nested sampling for counting

Olivier Martin, Laboratoire de Physique Théorique et Modèles Statistiques and UMR de Génétique Végétale, University of Paris-Sud

[ I ]: Enumerating Discrete Structures
Illustrative case: trees describing pedigrees (O. Martin and F. Hospital, Genetics (2004))
In a breeding program, one wants to (optimally) cross a collection of “parents” to produce an ideal genome, but the mixing of the genes (Mendelian genetics) is probabilistic and depends on their mutual distances.
General framework: each individual in the parental population has one good gene (resistance to one disease) and the “ideotype” must accumulate all these into one genome. The crossing of 2 parents should pass on their good genes to at least one offspring.

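Transmission of genes
We impose that a gamete cumulate all the good genes of the 2 chromosomes of its parent.
[Figure: parent H(1)(2), with chromosomes s1 = 1 and s2 = 2, gives a gamete s = 1,2; parent H(3)(4), with s1 = 3 and s2 = 4, gives a gamete s = 3,4; their offspring H(12)(34), with s1 = 1,2 and s2 = 3,4, gives a gamete s = 1,2,3,4.]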

Example of a simple pedigree
[Figure legend: P1, P2, P3: founder parents; I*: ideotype]

Representation of a pedigree
Pedigrees differ by:
• A tree structure
• The choice of parents
[Figure: example pedigrees on the founder parents P1, P2, P3, P4.]

Particular cases of pedigrees
• Regular pyramid: min height = log2(n) = 3
• Cascade: max height = n - 1 = 7
(values shown for n = 8)

Pedigree = binary leaf-labeled tree
[Figure: leaves P1 to P6 at level 0; nodes H(1)(2), H(3)(4), H(5)(6) at level 1; node H(12)(34) at level 2; node H(1234)(56) at level 3.]

Questions
• How to count the number of distinct pedigrees?
• How to enumerate them by computer for further use?
• How to sample them uniformly?
• How to find the « optimal » pedigree given that each pedigree has a cost?

Counting the number of pedigrees
For n genes, one has A(n) = (2n - 3)!! pedigrees (by recurrence equations)

n      3    4    5     8        10           20
A(n)   3    15   105   135135   3.4 x 10^7   8.2 x 10^21
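The table can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `count_pedigrees` is an assumption, and the recurrence used, A(n+1) = (2n - 1) A(n), is the standard one for rooted binary leaf-labeled trees (a new leaf can be attached on any of the 2n - 2 edges or above the root).

```python
def count_pedigrees(n):
    """Return A(n) = (2n - 3)!!, the number of pedigrees cumulating n genes.

    Uses the recurrence A(2) = 1, A(k + 1) = (2k - 1) * A(k).
    """
    if n < 2:
        raise ValueError("a pedigree needs at least 2 founder parents")
    total = 1
    for k in range(2, n):
        total *= 2 * k - 1  # factors 3, 5, ..., 2n - 3
    return total


if __name__ == "__main__":
    for n in (3, 4, 5, 8, 10, 20):
        print(n, count_pedigrees(n))
```

Python integers have arbitrary precision, so the value for n = 20 is exact rather than a floating-point approximation.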

Enumeration of all pedigrees
One fuses two sub-pedigrees:
- one cumulating p genes
- one cumulating (n-p) genes
to obtain a pedigree cumulating n genes.
[Figure: a sub-pedigree with p genes and a sub-pedigree with n-p genes joined at the root.]

An algorithm for constructing all pedigrees
Suppose all sub-pedigrees of height at most h are known; one can generate all those of height h+1:
• Examine all pairs of sub-pedigrees {P1,P2} of heights h1 = h and h2 ≤ h
• If P1 and P2 have no good gene in common, fuse them to form a sub-pedigree P of height (h+1)
• If P cumulates all good genes, keep it as a complete pedigree; otherwise add it to the list of sub-pedigrees of height h+1
Repeat for the next height until h+1 = n-1
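The height-by-height construction above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: a leaf is represented by its gene number, an internal node by a `frozenset` of its two children (so that the pair {P1,P2} is unordered and duplicates collapse automatically), and the name `enumerate_pedigrees` is hypothetical.

```python
def enumerate_pedigrees(n):
    """Return the set of all pedigrees cumulating genes 1..n.

    by_height[h] maps each incomplete sub-pedigree of height h to the
    frozenset of good genes it cumulates.
    """
    full = frozenset(range(1, n + 1))
    by_height = [{g: frozenset([g]) for g in full}]  # height 0: the founders
    complete = set()
    for h in range(n - 1):  # builds height h + 1, up to h + 1 = n - 1
        next_level = {}
        for tree1, genes1 in by_height[h].items():       # h1 = h
            for level in by_height[: h + 1]:              # h2 <= h
                for tree2, genes2 in level.items():
                    if genes1 & genes2:
                        continue  # a good gene in common: do not fuse
                    fused = frozenset((tree1, tree2))
                    genes = genes1 | genes2
                    if genes == full:
                        complete.add(fused)      # cumulates all good genes
                    else:
                        next_level[fused] = genes
        by_height.append(next_level)
    return complete


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, len(enumerate_pedigrees(n)))
```

For small n the number of pedigrees returned can be checked against A(n) = (2n - 3)!!. The memory needed grows very quickly with n, since every incomplete sub-pedigree is stored.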

Working of the algorithm
[Figure sequence: the sub-pedigrees generated level by level, for h = 0, h = 1, h = 2 and h = 3.]

Example: cascade with 4 genes [figure]

Optimal pedigrees: search by pruning the enumeration (branch and bound)
• Of all the ways to produce a given combination of good genes, keep only the best sub-pedigree
• Enumeration: one can treat up to 14 genes; branch and bound: up to 22 genes
• Case of « adjacent » cascades: dynamic programming determines the optimal pedigree in O(n^2) operations
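The pruning principle "keep only the best sub-pedigree for each combination of good genes" can be illustrated with a small dynamic program over gene subsets. This is a sketch under strong assumptions, not the authors' branch and bound: the cost of a pedigree is assumed to be additive over crosses, the per-cross cost function `cross_cost` is hypothetical (the slides do not define the cost), and gene sets are encoded as bitmasks with genes numbered from 0.

```python
from functools import lru_cache


def best_pedigree(n, cross_cost):
    """Return (cost, tree) of a cheapest pedigree cumulating genes 0..n-1.

    cross_cost(mask1, mask2) is an assumed cost for crossing two individuals
    carrying the gene sets mask1 and mask2 (bitmasks). A tree is a gene index
    (leaf) or a pair (left_tree, right_tree).
    """

    @lru_cache(maxsize=None)
    def best(mask):
        if mask & (mask - 1) == 0:           # a single gene: founder parent
            return 0.0, mask.bit_length() - 1
        lowest = mask & -mask
        best_cost, best_tree = float("inf"), None
        sub = (mask - 1) & mask              # iterate over proper submasks
        while sub:
            if sub & lowest:                 # count each unordered split once
                other = mask ^ sub
                cost1, tree1 = best(sub)     # only the best sub-pedigree
                cost2, tree2 = best(other)   # per gene set is ever reused
                cost = cost1 + cost2 + cross_cost(sub, other)
                if cost < best_cost:
                    best_cost, best_tree = cost, (tree1, tree2)
            sub = (sub - 1) & mask
        return best_cost, best_tree

    return best((1 << n) - 1)


if __name__ == "__main__":
    # Toy cost: a cross costs the number of genes it must cumulate.
    toy_cost = lambda a, b: bin(a | b).count("1")
    print(best_pedigree(4, toy_cost))
```

With the toy cost used here, a regular pyramid on 4 genes costs 2 + 2 + 4 = 8 while a cascade costs 2 + 3 + 4 = 9, so the pyramid is selected. The work grows as 3^n because all splits of all subsets are examined.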

[ II ]: Sampling and searching
This problem is ubiquitous:
• Physics: equilibrium configurations
• Operations research: feasible solutions of CSP
• Statistics: estimating p-values

The royal road: Markov chain Monte Carlo
• To obtain samples with a given probability distribution or measure, use the Metropolis algorithm (1953)
• Simple, and very effective if there are no bottlenecks
• If the measure is fragmented, one needs large « moves », but such moves almost always fail
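A minimal sketch of the Metropolis rule for a discrete space is given below. It assumes a symmetric proposal (the probability of proposing y from x equals that of proposing x from y) and a starting point of non-zero weight; the names `metropolis`, `log_weight` and `propose` are illustrative and do not come from the talk.

```python
import math
import random


def metropolis(log_weight, propose, x0, steps, rng):
    """Return a list of `steps` states sampled by the Metropolis rule.

    log_weight(x): log of the unnormalized target measure at x.
    propose(x, rng): a symmetric random move from x (assumption).
    """
    x, lw = x0, log_weight(x0)
    samples = []
    for _ in range(steps):
        y = propose(x, rng)
        lw_y = log_weight(y)
        # Accept with probability min(1, w(y) / w(x)).
        if lw_y >= lw or rng.random() < math.exp(lw_y - lw):
            x, lw = y, lw_y
        samples.append(x)
    return samples


if __name__ == "__main__":
    # Toy target on the ring {0, ..., 9} with weight proportional to x + 1.
    rng = random.Random(0)
    ring_move = lambda x, r: (x + r.choice((-1, 1))) % 10
    chain = metropolis(lambda x: math.log(x + 1), ring_move, 0, 50000, rng)
    print(sum(chain) / len(chain))  # the exact mean of the target is 6
```

A rejected move repeats the current state in the output, which is essential for the chain to have the correct stationary distribution.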

The case of biological networks: some computational challenges
(1) Generate a genotype of given phenotype (oriented search)
(2) Sample uniformly genotypes of a given phenotype: use symmetries to reduce exponentially the space size
(3) Determine the connectivity of the neutral network: do guided search to go from one random genotype to another
(4) Sample uniformly a connected component of the neutral network: use random walks
(5) Sample uniformly the surface of a “ball” around a point: use Metropolis with asymmetric rates
(6) Get the infinite population limit of a population under Darwinian selection: use variance reduction and 1/N extrapolation

Viable genotypes are rare (S. Ciliberti, O. Martin and A. Wagner, PLoS Comp. Bio. (2007))
If one allows for M interactions (M non-zero entries of W) between N genes, what fraction of the genotypes (regulatory networks) are viable?
By smart sampling: illustration when M = 0.25 N^2 [figure]

Showing connectivity properties of biological networks
We want to check with a high level of confidence that a certain space S is connected. We do this in three steps:
(1) Use the Metropolis MC algorithm to produce random pairs of points (P1,P2) in the space S
(2) Generate an “equilibrium” cloud of points in S around P1 by a biased Monte Carlo and store these
(3) Produce a MC chain of points in S, starting from P2, using for instance the same Monte Carlo rule as above; check for collisions with the stored set. If a collision arises, P1 is connected to P2
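Steps (2) and (3) can be sketched as below for one given pair (P1,P2). This is a simplified illustration under stated assumptions: the cloud is produced by a plain (unbiased) random walk rather than the biased Monte Carlo mentioned above, and the callables `neighbors` and `in_space`, as well as both function names, are hypothetical.

```python
import random


def random_walk(start, neighbors, in_space, steps, rng):
    """Random walk restricted to S: moves that leave S are rejected."""
    x = start
    visited = [x]
    for _ in range(steps):
        y = rng.choice(neighbors(x))
        if in_space(y):
            x = y
        visited.append(x)
    return visited


def find_collision(p1, p2, neighbors, in_space, cloud_steps, chain_steps, rng):
    """Return a point reached from both p1 and p2, or None if none was found.

    A returned point proves that p1 and p2 are connected within S;
    None is inconclusive (the walks may simply have been too short).
    """
    cloud = set(random_walk(p1, neighbors, in_space, cloud_steps, rng))
    for x in random_walk(p2, neighbors, in_space, chain_steps, rng):
        if x in cloud:
            return x
    return None


if __name__ == "__main__":
    # Toy space: the integers 0..19 with the point 10 removed (two components).
    rng = random.Random(0)
    line = lambda x: [x - 1, x + 1]
    allowed = lambda x: 0 <= x < 20 and x != 10
    print(find_collision(0, 19, line, allowed, 500, 500, rng))  # None
    print(find_collision(0, 5, line, allowed, 500, 500, rng))
```

Storing the cloud in a `set` makes each collision check a constant-time lookup, so the chain from P2 can be long at little extra cost.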

The viable genotypes form a connected network
Very few viable networks are not in the giant connected component, and the few such networks are usually isolated.
Example: for M = 0.25 N^2, the fraction of viable networks not belonging to the giant component is:
• 2.3×10^-3 at N=8
• 1.7×10^-3 at N=12
• 1.4×10^-3 at N=20

Structure in the neighborhood of a viable genotype

Neutral network topology (S. Ciliberti, O. Martin and A. Wagner, PNAS (2007))

Constructive samplers
• When the measure is fragmented, resort to creating samples ab-initio and use weights
• Need to « guide » the construction, otherwise weights have huge variance
• Some cases are « easy » (Sinclair et al.): Polynomial Randomized Approximation Scheme
• Some difficult cases have been treated (PERM of Grassberger) but it is an art
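To illustrate what "creating samples ab-initio with weights" means, here is a sketch of Rosenbluth-style growth of self-avoiding walks on the square lattice. The choice of self-avoiding walks is an assumption made for this example (it is the setting in which PERM was developed, not a case treated in the talk), and the function names are hypothetical. Each walk is grown step by step among the free neighbouring sites, and its weight is the product of the numbers of free choices; the mean weight is an unbiased estimate of the number of walks.

```python
import random

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def grow_walk(n_steps, rng):
    """Grow one self-avoiding walk; return (walk, weight).

    The weight is 0 if the walk gets trapped before reaching n_steps.
    """
    pos = (0, 0)
    visited = {pos}
    walk = [pos]
    weight = 1
    for _ in range(n_steps):
        free = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES
                if (pos[0] + dx, pos[1] + dy) not in visited]
        if not free:
            return walk, 0          # dead end: this sample contributes 0
        weight *= len(free)         # compensates for the biased construction
        pos = rng.choice(free)
        visited.add(pos)
        walk.append(pos)
    return walk, weight


def estimate_count(n_steps, n_samples, rng):
    """Estimate the number of n_steps-step self-avoiding walks."""
    total = sum(grow_walk(n_steps, rng)[1] for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(1)
    print(estimate_count(4, 20000, rng))  # the exact count is 100
```

For long walks the weights fluctuate over many orders of magnitude, which is the variance problem noted above and the motivation for guided schemes such as PERM.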

Other samplers
• Choose at random sufficiently small sub-regions and apply branch and bound in each to get configurations (very slow)
• Perform nested sampling (multiple measures interpolating to the desired one)
• Accept an incorrect distribution and just get « some » configurations by guided stochastic search; this is OK in the context of search or “design”

What makes a regulatory network robust and how can one « design » functional networks?
• Q is a « quality » factor which measures the synergy of the W_ij S_j
• The mutational robustness and our measure Q have a strong association

[ III ]: Nested sampling for counting
Sometimes it is not enough to sample feasible solutions; one may want to know their number or frequency…
• Physics: entropy
• Statistics: small p-values
• Operations research: size of set of feasible solutions of CSP
• Biology: computing neutral network sizes

Nested sampling
In a discrete space, we want to sample configurations having an unusual property, forming a fraction of say 1 in a trillion…
• Randomly sampling the full space won't do
• Often Monte Carlo won't work because the desired sub-space is fragmented
Introduce a family of measures interpolating between the full space and the desired sub-space and use exchange Monte Carlo on the replicas

Example: cardinality of ‘neutral’ network in RNA modeling (T. Jorg, O. Martin and A. Wagner, submitted to BMC Bioinformatics)
• Discrete space of sequences, only a tiny fraction have the correct folding…
• Changing the sequence just a bit sometimes changes the folding a lot, so the space is fragmented
• A simple choice for the measures: increasing distances to the target fold. At very short distances the measure is fragmented, but use of larger distances restores connectivity, thereby allowing the use of the Metropolis approach.
• Even with this simple choice, one can efficiently sample the space of interest uniformly in spite of its rarity.
• Extra bonus: one can both sample and count stochastically, in contrast to standard Monte Carlo.
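The counting idea can be sketched on a toy problem: the size of a rare set is written as the size of the full space times a product of ratios between successive nested sets, and each ratio is estimated by uniform sampling inside the larger set. The toy below is an assumption made for illustration, not the RNA computation: sequences are bit strings, the "distance to the target" is the Hamming distance to the all-zero string, and each level is sampled by its own Metropolis chain instead of the exchange Monte Carlo on replicas mentioned above. The function name `estimate_ball_size` is hypothetical.

```python
import random
from math import comb


def estimate_ball_size(length, radii, steps, rng):
    """Estimate the number of bit strings within Hamming distance radii[-1]
    of the all-zero target.

    radii must be decreasing and start at `length` (the full space).
    """
    size = 2.0 ** length
    for r_outer, r_inner in zip(radii, radii[1:]):
        x = [0] * length        # the target belongs to every nested set
        dist = 0                # current Hamming distance to the target
        hits = 0
        for _ in range(steps):
            i = rng.randrange(length)
            new_dist = dist + (1 - 2 * x[i])   # effect of flipping bit i
            if new_dist <= r_outer:            # stay inside the outer set
                x[i] ^= 1
                dist = new_dist
            hits += dist <= r_inner
        size *= hits / steps    # estimated ratio |inner set| / |outer set|
    return size


if __name__ == "__main__":
    rng = random.Random(0)
    estimate = estimate_ball_size(12, [12, 6, 4, 3, 2], 50000, rng)
    exact = sum(comb(12, k) for k in range(3))
    print(estimate, exact)
```

Choosing the radii so that each ratio is of order 0.1 to 0.5 keeps every factor easy to estimate; a single direct ratio would require hitting the small set by chance.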

Some conclusions
• In the most favourable cases, one can enumerate, sample, search (design/optimize) and count.
• Sophisticated algorithmic approaches based on Markov Chains allow one to sample even in intricate spaces, though at a significant computational cost.
• The use of nested sampling allows for approximate counting in many realistic cases.
• Except for enumeration, these techniques are perfectly applicable to continuous spaces.
