Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Erik van Nimwegen et al. Presented by Lyndsy Kron

Goal • To derive a unique probability distribution over assignments of binding sites to clusters, in order to identify regulons • Based on sequence similarity • Partitioned so that each cluster corresponds to the sites targeted by the same transcription factor (TF)

PROCSE Algorithm • Uses Monte Carlo sampling of this distribution to partition and align thousands of short DNA sequences into clusters • Determines the number of clusters • Assigns significance to the resulting clusters • The weight matrices (WMs) are unknown, which is the limiting factor

WM Unknown • A set of sites sampled with unknown WMs is clusterable if it is possible to infer which sites were sampled from the same WM • If WMs are known, this task is trivial

Problem • Input: a set D of short DNA sequences • Output: the most probable clustering C of the input sequences

Assumptions • Sequences in a cluster come from the same motif • Use a weight matrix (WM) model for motifs • Consider only evolutionarily conserved non-coding regions of orthologous genes • Consider bacterial genomes

Model • WM: $w^i_\alpha$ is the prob. of finding base $\alpha$ at position i • Information score I: scores the quality of an alignment of putative binding sites • $b_\alpha$ is the background frequency of base $\alpha$ • The $w^i_\alpha$ are the WM probs. estimated from the aligned sequences
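
Sketch of the information score, assuming it takes the usual relative-entropy form I = sum over positions i and bases alpha of w * log(w / b). The function name `information_score`, the log base 2, the plain frequency estimate of w (no pseudocounts), and the uniform default background are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def information_score(sites, background=None):
    """Relative-entropy score of an ungapped alignment of equal-length sites.

    Assumed form: I = sum_i sum_a w[i][a] * log2(w[i][a] / b[a]),
    with w[i][a] estimated as the observed frequency of base a in column i.
    """
    if background is None:
        # Assumption: uniform background over the four bases.
        background = {base: 0.25 for base in "ACGT"}
    n = len(sites)
    score = 0.0
    for column in zip(*sites):
        for base, count in Counter(column).items():
            w = count / n
            score += w * math.log2(w / background[base])
    return score
```

A perfectly conserved column contributes 2 bits under a uniform background, and a column that matches the background contributes 0.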

Model • Need to cluster a set of binding sites of an unknown number of TFs • Consider all ways to partition into clusters and assign prob. to each – prob. of partition is product of probs. for each cluster

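Model • Prob. that a set S of n sequences of length l was drawn from a given WM w: $P(S|w) = \prod_{i=1}^{l} \prod_{\alpha} (w^i_\alpha)^{n^i_\alpha}$, where $n^i_\alpha$ is the number of sequences in S with base $\alpha$ at position i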

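Model • Prob. P(S) that all sequences in S came from one and the same unknown w: integrate over all possible WMs, $P(S) = \int P(S|w)\,\pi(w)\,dw$, where $\pi(w)$ is the prior over WMs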
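
Sketch of P(S), assuming a uniform prior over the WM probabilities in each column. Under that assumption the integral over w has the closed form 3! * prod_alpha(n_alpha!) / (n + 3)! per column, where n_alpha counts base alpha in the column. The function name `log_marginal_likelihood` is hypothetical; the result is returned as a log to avoid underflow.

```python
import math
from collections import Counter


def log_marginal_likelihood(sites):
    """log P(S) for equal-length sites, integrating out the unknown WM.

    Assumes an independent uniform (Dirichlet(1,1,1,1)) prior per column,
    which gives  P(column) = 3! * prod_a n_a! / (n + 3)!.
    """
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        # lgamma(k + 1) == log(k!), so lgamma(4) is log(3!).
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in Counter(column).values())
    return total
```

For a single sequence this reduces to (1/4) per position. Two identical bases in a column score 1/10, two different bases 1/20, so similar sequences are more probable together than apart.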

Model • From this we can define, for any partition C of a data set of sequences D into clusters, the likelihood P(D|C) that all sequences in each cluster were drawn from the same WM: $P(D|C) = \prod_{S_c \in C} P(S_c)$, with each $P(S_c)$ given by P(S) on the previous slide

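Model • Posterior prob. P(C|D) for partition C given the data D, by Bayes' theorem: $P(C|D) = P(D|C)\,\pi(C) / \sum_{C'} P(D|C')\,\pi(C')$, where $\pi(C)$ is the prior over partitions • Allows calculation of any statistic of interest by summing over the appropriate partitions C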
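
Sketch of the posterior by brute-force enumeration, feasible only for a handful of sequences. It assumes a uniform prior over partitions and a uniform prior over WMs; the names `log_p_cluster`, `partitions`, `posterior`, and `cocluster_probability` are hypothetical.

```python
import math
from collections import Counter


def log_p_cluster(sites):
    """log P(S) for one cluster, assuming a uniform prior over WMs."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in Counter(column).values())
    return total


def partitions(items):
    """Yield every set partition of a list, as a list of clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for k in range(len(partial)):
            yield partial[:k] + [[first] + partial[k]] + partial[k + 1:]
        # ... or let it start a new cluster.
        yield [[first]] + partial


def posterior(sites):
    """Return [(partition, P(C|D))] under a uniform prior over partitions."""
    scored = []
    for part in partitions(list(range(len(sites)))):
        log_lik = sum(log_p_cluster([sites[i] for i in c]) for c in part)
        scored.append((part, log_lik))
    peak = max(score for _, score in scored)
    weights = [math.exp(score - peak) for _, score in scored]
    norm = sum(weights)
    return [(part, w / norm) for (part, _), w in zip(scored, weights)]


def cocluster_probability(result, i, j):
    """Sum P(C|D) over all partitions in which sites i and j share a cluster."""
    return sum(p for part, p in result
               if any(i in c and j in c for c in part))
```

`cocluster_probability` is one example of a statistic obtained by summing over the appropriate partitions.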

Classifiability vs. Clusterability • Classification: • Associating TFs with WMs • P(s|w) – prob. that w binds to s • Implies that for a sample s from w, we have that P(s|w) > P(s|w') for all other TFs w'

Clusterability: • Assume we are clustering nG sequences obtained by sampling n times from each of G different WMs • For a WM w, can calculate the prob. that m of its n samples cocluster by summing P(C|D) over all partitions in which m samples of w occur together • The set is clusterable if, for more than ½ of the G WMs, the avg. of m is > n/2

Monte Carlo Implementation • Monte Carlo random walk to sample the distribution P(C|D) • At each step • Choose a mini-WM at random • Consider reassigning it to a randomly chosen cluster • The move is accepted or rejected using a Metropolis-Hastings scheme

Metropolis-Hastings Scheme • Moves that increase P(C|D) are always accepted • Moves from the current partition C to a proposed partition C' that lower P(C|D) are accepted with prob. P(C'|D)/P(C|D)
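
Sketch of the acceptance rule in log space, assuming the proposal is symmetric so that no extra correction factor is needed. The function name `metropolis_accept` and its arguments are hypothetical; the normalization of P(C|D) cancels in the ratio, so unnormalized log probabilities can be passed.

```python
import math
import random


def metropolis_accept(log_p_current, log_p_proposed, rng=random):
    """Decide whether to accept a proposed partition.

    Uphill moves are always accepted; downhill moves are accepted with
    probability P(C'|D) / P(C|D) = exp(log_p_proposed - log_p_current).
    Assumes a symmetric proposal distribution.
    """
    if log_p_proposed >= log_p_current:
        return True
    return rng.random() < math.exp(log_p_proposed - log_p_current)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.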

Result of Monte Carlo • Generates “dynamic” clusters, membership fluctuates over time • Clusters can disappear altogether • New clusters can appear when a pair of mini-WMs is moved together • Find “significant” clusters by finding sets of mini-WMs that are persistent

Solutions to Lack of Persistence • First approach: search for the maximum-likelihood (ML) partition, which maximizes P(C|D), through simulated annealing • Raise P(C|D) to the power β, increasing β over time • Provides candidate clusters • Significance of the ML clusters is tested by sampling P(C|D)
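
Sketch of the annealing search on a toy scale. Assumptions not taken from the slides: a uniform prior over WMs and over partitions (so maximizing P(C|D) is the same as maximizing P(D|C)), a linear β schedule, single-site relabeling moves, and the names `log_likelihood` and `anneal`. It is meant as an optimizer illustration, not an exact sampler.

```python
import math
import random
from collections import Counter


def log_p_cluster(sites):
    """log P(S) for one cluster, assuming a uniform prior over WMs."""
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        total += math.lgamma(4) - math.lgamma(n + 4)
        total += sum(math.lgamma(c + 1) for c in Counter(column).values())
    return total


def log_likelihood(sites, labels):
    """log P(D|C) for the partition encoded by one cluster label per site."""
    clusters = {}
    for site, label in zip(sites, labels):
        clusters.setdefault(label, []).append(site)
    return sum(log_p_cluster(members) for members in clusters.values())


def anneal(sites, steps=5000, beta_start=0.5, beta_end=5.0, seed=0):
    """Return (labels, log-likelihood) of the best partition visited."""
    rng = random.Random(seed)
    n = len(sites)
    labels = list(range(n))  # start with every site in its own cluster
    current = log_likelihood(sites, labels)
    best_labels, best = labels[:], current
    for step in range(steps):
        # Linear schedule (assumption): beta rises from beta_start to beta_end.
        beta = beta_start + (beta_end - beta_start) * step / max(steps - 1, 1)
        proposal = labels[:]
        # Any of n labels may be drawn, so empty clusters can be reopened.
        proposal[rng.randrange(n)] = rng.randrange(n)
        proposed = log_likelihood(sites, proposal)
        # Acceptance ratio raised to the power beta.
        delta = beta * (proposed - current)
        if delta >= 0 or rng.random() < math.exp(delta):
            labels, current = proposal, proposed
            if current > best:
                best_labels, best = labels[:], current
    return best_labels, best
```

Each step here recomputes the full likelihood, which is wasteful; only the two clusters touched by a move actually change.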

Complications: • Computationally prohibitive for large data sets

Solutions to Lack of Persistence • Second approach: • Use several Monte Carlo random walks • Measure the prob. that each pair of mini-WMs coclusters • Construct a graph in which each node corresponds to a mini-WM and an edge between mini-WMs i and j exists if their coclustering prob. is > ½

Second Approach Cont. • Candidate clusters are now given by the connected components of the graph • Pairwise stats. are processed to obtain probabilistic cluster membership • Yields the prob. that mini-WM i belongs to cluster j • Also calc. for each cluster the prob. distribution p(k) that k of its members cocluster • Cluster significance is judged from p(k)
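
Sketch of the graph step: estimate pairwise coclustering frequencies from sampled partitions, link pairs above the threshold, and read off connected components as candidate clusters. The input format (one list of cluster labels per sampled partition) and the names `cocluster_frequencies` and `candidate_clusters` are assumptions for illustration. The later steps (membership probabilities and p(k)) are not shown.

```python
from itertools import combinations


def cocluster_frequencies(samples):
    """Fraction of sampled partitions in which each pair shares a cluster."""
    n = len(samples[0])
    freq = {}
    for i, j in combinations(range(n), 2):
        together = sum(1 for labels in samples if labels[i] == labels[j])
        freq[(i, j)] = together / len(samples)
    return freq


def candidate_clusters(samples, threshold=0.5):
    """Connected components of the graph with edges where frequency > threshold."""
    n = len(samples[0])
    neighbours = {i: set() for i in range(n)}
    for (i, j), p in cocluster_frequencies(samples).items():
        if p > threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal from each unvisited node.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components
```

For example, with four sampled partitions of four mini-WMs:

```python
samples = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 2, 2]]
print(candidate_clusters(samples))  # [[0, 1], [2, 3]]
```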

Finally • Once clusters are inferred, a WM can be estimated for each cluster • Then search all regulatory regions for additional sites that match the cluster WMs
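
Sketch of this last step: estimate a WM from a cluster's aligned sites and scan a region for the best-matching window. The pseudocount of 1 per base, the log-odds score against a uniform background, and the names `estimate_wm`, `log_odds`, and `scan` are assumptions; the slides do not say how matches are scored or thresholded.

```python
import math
from collections import Counter


def estimate_wm(sites, pseudocount=1.0):
    """Per-column base probabilities, (count + pseudocount) / (n + 4 * pseudocount)."""
    n = len(sites)
    wm = []
    for column in zip(*sites):
        counts = Counter(column)
        wm.append({base: (counts[base] + pseudocount) / (n + 4 * pseudocount)
                   for base in "ACGT"})
    return wm


def log_odds(wm, window, background=0.25):
    """Log2 odds of a window under the WM versus a uniform background."""
    return sum(math.log2(column[base] / background)
               for column, base in zip(wm, window))


def scan(wm, region):
    """Return (best score, start position) over all windows of the WM's width."""
    width = len(wm)
    scores = [(log_odds(wm, region[k:k + width]), k)
              for k in range(len(region) - width + 1)]
    return max(scores)
```

Only the forward strand is scanned here; a fuller search would also score the reverse complement.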

Data set • 15-25 bp sequences

Ex. Alignment

Thank you!

Results • Found that likelihood P(C|D) for the partition obtained in annealing runs is higher than that obtained when the sites are partitioned by annotation • Algorithm recovers almost ½ of all regulons for which binding sites are known and the large majority of regulons for which there are more than 3 sites known • Most E. coli binding sites are in the unclusterable regime

Discussion • The algorithm assumes that all WMs have the same fixed length, so prior information about site lengths and their dimeric nature still needs to be incorporated • Could also extend the hypothesis by assuming that only some fraction, rather than all, of the sequences are WM samples, with the rest coming from a background model
