Counting position weight matrices in a sequence & an application to discriminative motif finding

Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois, Urbana-Champaign

GENE ACAGTGA PROTEIN Transcriptional Regulation TRANSCRIPTION FACTOR

Binding sites and motifs • Transcription factor binding sites in a gene’s neighborhood are the fundamental units of the regulatory network • Transcription factor binding is specific, hence binding sites are similar to each other, but variability is often seen • A motif is the common sequence pattern among binding sites of transcription factor

Motif models • Consensus string, e.g., ACGWGT • Position Weight Matrix (PWM)

Binding sites ACCCGTT ACCGGTT ACAGGAT ACCGGTT ACATGAT Position Weight Matrix PWM

Databases of PWMs • Transfac has ~100s of PWMs for human • Jaspar: a smaller, perhaps better curated database of PWMs • Organism specific databases coming up frequenctly • PWMs in databases often derived from experimentally validated binding sites

Bioinformatics of PWMs • Popular motif model • i.e., several motif finding algorithms that attempt to find PWMs from sequences • Gibbs sampling: one of the earliest; tries to sample PWMs with high “relative entropy” • MEME: another early algorithm; uses expectation maximization to find PWMs that best “model the sequences” • Many more algorithms to find PWMs from a set of sequences

Problem: counting motifs • Given DNA sequence, and a consensus motif (say “ACGWGT”), count the motif in the sequence • Trivial solution • What if the motif is a Position Weight Matrix (PWM) ? • Why hasn’t this problem been looked at? • Because previous algorithms used different scores of PWMs: how “sharp” they are, how well they explain data, etc.

Counting matches to a PWM: A possibility • For each site s in sequence, compute • If Pr(s | W) > some threshold, call s a site • Count number of sites in sequence • No distinction between strong and weak sites, as long as they are above threshold • binary scheme, not realistic

A wish-list (for the score) • Score should consider both strong and weak occurrences of motif • Score should assign appropriate weights to strong and weak occurrences • Score should be aware that there may also be sites of other known motifs in the sequence • The list goes on: score should be efficiently computable, score should be differentiable, score should …

The “w-score” • Defined by a probabilistic model of sequence generation • Given one or more motifs, and a background distribution, defines a probability space on sequences • A simple (zeroth order) Hidden Markov model (HMM)

W1 Wb W2 Probabilistic Model: toy example • Given two motifs W1,W2, a “background” motif Wb, and a sequence length L • Pr(Wi Wj) = pj • transition probability • When in state Wi, emit a substring s chosen with probability Pr(s | Wi) • emission probability • Stop when length of emitted sequence is L A stochastic process generating sequences of length L

W1 W2 W2 • Another possible path T2 W2 Wb Wb Wb Wb Wb W1 A “path” through the HMM • One possible path T1

W1 W2 W2 W2 Wb Wb Wb Wb Wb W1 Likelihood of sequence & paths • A path of the HMM defines the locations of motif matches • For a sequence S & a path T, the joint probability Pr(S,T) is easy to compute • Conditional probability of a path T, given the data S, is: • Strong matches make the probability higher • Paths with weak matches have lower conditional probabilities

The “w-score” • Let the number of occurrences of a motif (say W1) in path T be • Compute: • In words: An average of the motif count , with weights equal to the probability of T given S

The “w-score” (Cont’d) • Score depends both on number and quality of matches to motif. • Every substring is a potential binding site, and paths placing the motif there will contribute to the count • Pr(T | S) depends on the match strength of all motifs, not just the one being counted

An exciting new feature of this motif score The wish-list (again)  • Score should give consider both strong and weak occurrences of motif • Score should assign appropriate weights to strong and weak occurrences • Score should be aware that there may also be sites of other known motifs in the sequence  

Computational pros and cons • The w-score computation takes time, where L is sequence length, and lm is the motif length. This is relatively expensive • The w-score can be differentiated with respect to all of the PWM parameters in time • Important feature for search algorithms

Using the “w-score” in discriminative motif finding

Discriminative motif finding • Suppose we have a set of co-regulated genes, i.e., we believe they have binding sites of the same transcription factor (in their regulatory control regions) • Traditionally, motif finding tries to find these binding sites, based on over-representation, conservation etc. • Often we also know a set of genes that should NOT have binding sites of that transcription factor • Examples: ChIP-on-chip, In situ hybridization pictures of Drosophila embryo, etc.

Problem formulation • Given two sets of sequences S+ and S- • Find a motif that has many occurrences in S+ and few occurrences in S- • Maximize the difference in the average counts of the motif in the two sets • Let W(S) = count of a motif W in sequence S • Maximize:

Optimization problem • Find motif W that maximizes

Derivatives of objective function • Let Wk be the PWM entry for base  in column k • We can efficiently compute • We can efficiently differentiate our objective function

Algorithm • Search space: Set of n = 20 substrings of sequences in S+ (called “site set”) • Objective function: Construct PWM W from site-set, compute score • Length of sites is user-defined

Algorithm Current site-set C S+

Algorithm Replace one site with any site from sequence S+ Pick a replacement that improves objective function

Algorithm • Current solution (site-set): C • Candidate new solution: C • Many possibilities for C (every substring of every sequence in S+ is a possible replacement) • Evaluate objective function on each candidate C • Too slow ! • Use derivative information !

Algorithm • Estimate the objective function value for each candidate C using partial derivatives and first order approximation • Examine each candidate in decreasing order of estimated score • If a candidate C found with greater score than C, choose it.

Estimated scores 10 Accurate score Accurate score 13 Accurate score 11 Algorithm illustration Current score = 12

Algorithm Properties • Objective function has many desirable properties, but is an expensive operation • Derivative computation has the same time complexity, and is used to guide search • Avoids local optima by searching in a discretized PWM space • Performs significantly better and/or faster than Gibbs sampling and Conjugate Gradients, for this particular score

Discriminative PWM Search (DIPS) • Software available • Can easily handle data sets of ~100 sequences • Can find multiple motifs iteratively, but without masking: • Find a PWM, then include it in the model as a known PWM, find another PWM, and so on

Performance tests • Tested on synthetic data • Compared to traditional motif finder as well as two discriminative motif finders • Superior performance in the presence of “distractor” motifs • it really helps to be able to count a motif in the presence of other known motifs

Tests on Drosophila Enhancers BICOID (ACTIVATOR) Protein Concentration HEAD TAIL

Tests on Drosophila Enhancers CAUDAL (ACTIVATOR) Protein Concentration HEAD TAIL

DIPS runs • S+ = promoters of genes expressed in anterior half of embryo • S- = promoters of genes expressed in posterior half of embryo • Top motif: Bicoid ! BICOID (ACTIVATOR) Protein Concentration TAIL HEAD

DIPS runs • S+ = promoters of genes expressed in posterior half of embryo • S- = promoters of genes expressed in anterior half of embryo • Top motif: Caudal ! CAUDAL (ACTIVATOR) Protein Concentration TAIL HEAD

Summary of results

Social regulation in honey bee • Transition from nursing in the hive to foraging for food is age related, but also regulated by the needs of the colony • 32 genes demonstrated to be significantly differentially expressed in brains of nurses and foragers (21 active in foragers only, 11 active in nurses only) • DIPS run on 2Kbp promoters of these social behavior-related genes

Results on honey bee genes

Conclusion • Discriminative motif finding increasingly becoming a necessary analysis • Motif finding in the presence of other known motifs also becoming relevant • A search algorithm that maximizes any objective function of the motif counts in the sequences • (as long as its differentiable) • Several extensions and variations possible

Acknowledgements • Eric Siggia, Eran Segal • Yoseph Barash (“LearnPSSM”) • Andrew Smith (“DME”)

Reference • ISMB 2006 (Brazil); Bioinformatics journal.

Counting position weight matrices in a sequence & an application to discriminative motif finding