160 likes | 267 Views
A Likely(hood) Story: Statistical Inference and the Problem of Understanding Hi-Throughput Transcription Factor Binding Assays. Curtis G. Callan, Jr. Physics Department Princeton University. Reporting on work with Justin Kinney and Gasper Tkacik. Acknowledgements.
E N D
A Likely(hood) Story: Statistical Inference and the Problem of Understanding Hi-Throughput Transcription Factor Binding Assays Curtis G. Callan, Jr. Physics Department Princeton University Reporting on work with Justin Kinney and Gasper Tkacik NIH Center Retreat
Acknowledgements • Collaborators (responsible for most of the good ideas): • Support: Burroughs-Wellcome Program in Quantitative Biology NIH Center for Systems Biology at Princeton University • Reference: J Kinney, G Tkacik & C Callan PNAS104:501(2007) Justin Kinney (PU Grad Student) Gasper Tkacik (PU Grad Student) NIH Center Retreat
TF-DNA Energy Models from Binding Assays • Transcription factors (TFs) are DNA-binding proteins which regulate gene transcription: key mechanism in all organisms. • A quantitative understanding of gene regulation, especially its evolution, requires a quantitative understanding of TF-DNA interaction, i.e. sequence-dependent binding energy (SDBE). • High-throughput experiments can give massive amounts of (rather noisy) information on TF-DNA binding. Popular examples are • PBM: protein binding microarrays (in vitro) • ChIP-chip: chromatin immuno-precipitation microarrays (in vivo) • Usual goal: use the data to identify the TF binding sites (yes-no answer) • Our goal: infer quantitative SDBE models from this noisy data. We will take the statistical inference approach used by physicists to deal with WMAP/HEP data. Outcome is a probability distribution on model space. NIH Center Retreat
bind label scan PBM Assay Overview (Mukherjee et al) • Uses dsDNA microarrays to simultaneously assess TF binding to all intergenic regions of S. cerevisiae. • Fluorescence log-intensity ratios (LIRs) are filtered, averaged over replicates and normalized to taste. Each sequence siis assigned some best measured value zi (for i = 1,2, …, N intergenic regions). • Connection between these measured values and whether a TF is bound to the region (or not) is very noisy due to the complicated and loosely-controlled chemistry. How to interpret the data? • ChIP-chip assay (in vivo) produces similar-looking data. NIH Center Retreat
A 1.2 1.8 5.6 2.5 0.0 6.0 C 3.7 0.0 0.5 1.2 5.2 0.0 G 2.9 0.1 0.8 0.6 1.3 1.4 T 0.0 3.0 0.0 0.0 3.1 3.2 Simple Binding Energy (and Binding) Model • Bases within a site (length L) contribute additivelyto the binding energy. Model is a 4xL “energy matrix” M. • A stretch of DNA is “bound” if it contains a site with E<1 (other-wise “unbound”). Site occupancy modeled by step function. • A matrix model M predicts if any DNA sequence s isbound (x=1) unbound (x=0). • How does this compare with what is seen in the expt? Does any choice of M fit the data? L= 6 Site = TGTGAC Energy = 0.7 < 1 C A T G T G A C C T Region is “bound” ( binary x ) ( continuous z ) NIH Center Retreat
Connecting “Theory” and Experiment: • Fluorescence z of a bound (unbound) region is probabilistic (due to chemistry, etc.). Sum-marized by ``error models’’ for the two states: • Experiment sees only the histogram of net fluorescence N(z)=N0p(z|0)+N1p(z|1) due to N0“unbound” + N1 “bound” genes. Usually try to discriminate the two states by a “cut” on z. • How good is a model M ? If it predicts {xi}, the likelihood of data {zi} is: • Bayes’ Rule then gives the likelihood of the model, given the data: • Result is prob dist’n on model space: basis for statistical inference. Good! Small hitch: the actual error model is usually totally unknown! product over all regions with model prior p(M) NIH Center Retreat
Mutual information appears! Quenching the Error Model: EMA Likelihood • In ignorance of the true error model, we will average data likelihood over all error models to get an error-model-averaged (EMA) likelihood. • To actually do this average, we need to discretize the continuous data • Bin each region si according to fluorescence (discretize N(z)) • Find predictions {xi}of model M, record counts czx per bin (parse bins into separate binding states) • EMA likelihood is functional integral: • Our binned data yield a simple formula: m bins with equal #s of regions .. each bin splits into two states C1 1 Cm 1 C1 0 Cm 0 Practical algorithm for evaluatingp(M|{zi}) (up to normalization!) NIH Center Retreat
Monte Carlo (MCMC) Evaluation of p(M|{zi}) • Random walk in space of energy models (i.e. elements of matrix M) • Random start or … • Accept/reject steps by change in p(M|{zi}). • After “burn-in”, get model ensemble dist’d according to p(M|{zi}) ! • Do 107 steps, record 4x103 models at 103 step intervals. • Calculate statistics by ensemble averaging • Test for “burn-in” by comparing inter- and intra-run variances ChIP-chip PBM Key result: p(M|{zi}) has a single smooth peak, easily found by MCMC! NIH Center Retreat
MCMC Results for TF ABF1p (Yeast) RTCRYNNNNNACGW Y=C,T R=A,G • MCMC generates 40,000 matrices M sampled from p(M|{zi})using EMA likelihood. • All 80 matrix elements have well-sampled distributions (see insets). • Mean matrix makes perfect sense in terms of the known motif (more later) • Distributions are amazingly tight: most RMSDs ≤ 5% of functional range. • Meaningful structure, even in the middle of the binding site, where there is little specificity. • That the data imply a smooth prob-ability landscape in the high-dimen-sional model space is a surprise. • No one model is the “best” model. We can now treat model predictions as clean probabilistic statements. NIGMS Center Retreat
PBM vs. ChIP-chip Data Analysis • PBM and ChIP-chip data give very similar matrices (but with the ChIP-chip cutoff set to .75 instead of 1). • Cutoff approximates the chemical potential of the TF: can vary between experiments (but not the energy matrix) • Simple 2 test used to asess the overlap between the PBM and ChIP-chip distributions for each matrix element, i.e. test for consistency. • Most elements have overlapping distributions. Only 3 don’t, and those are outside of the main site. • Element by element match of mean and variance between the two analyses is impressive: No Free Parameters! • The error models of the two exp’ts are very different. We infer them from the data; that the same energy matrix is inferred is serious consistency check. NIH Center Retreat
PBM vs. ChIP-chip Binding Predictions • We declare sites flagged by > 50% of energy matrices to be putative binding sites. Can make it tighter. • Putative ChIP-chip sites are almost all predicted by the PBM matrices. As they should be! • Experimenters identify putative sites by cutoff. Overlap between ChIP-chip and PBM is poor. • Changing the cutoff will only admit more false positives: improvement only by decreasing expt’l noise. • By “understanding” the noise, we can flag more sites with little false positive penalty. General lesson is that noise can be “understood” if the data is “bottle-necked” through a “good” parametric model. That the difference between in vivo versus in vitro experiment is captured as noise is pleasant surprise. NIH Center Retreat
Evidence From Evolution • Direct experimental evidence about TF-DNA binding energy is limited. But energy is the focus of our method. We need direct hi-throughput energy measurements… • Functional properties of binding sites are governed in large part by energy: energy, not sequence, should be conserved in evolution if function is to be maintained • Suggests that comparing “orthologous” binding sites between species is an indirect way of assessing whether our energy model is doing the right thing. • If the model passes this test, our hundreds of binding sites are the raw material for a new kind of quantitative study of the dynamics of evolution. NIH Center Retreat
Binding Energy is Conserved (Yeast ABF1) Use the energy model derived by likelihood analysis from S.cerevisiae assays to assign energies to sites (an a priori genotype-phenotype map): • 676 (!) intergenic sites in S. cerevisiae with Abf1p binding energy < 1 have clear orthologs in S. bayanus • ~75% of these orthologs also have energy < 1. Conservation this strong is clearly not an accident! • Sites with energy > 1 have no correlation between energies in the two genomes: mutation randomizes energy. • Clear evidence of selection pressure on bound site energy; pressure on sequence is indirect. • Precise genotype-phenotype map is good starting point for a quantitative understanding of how TF binding sites and regulatory networks evolved. NIH Center Retreat
Energy Imprints Itself on Sequence Evolution We have a rich yeast phylogenetic tree. Plus a large collection of orthologous site pairs. Ask population average questions about the likelihood of base changes at diff’t locations in thte ABF1 site. Look for selection on the basis of energy ….not site-by-site, but on average in the population. Substitution probability pattern matches structure of energy matrix (bigger DE’s disfavored). Pattern evolves with time (from last common ancestor) as if under control of common `Hamiltonian’. ABF1 NIH Center Retreat
Comments and Conclusions • For lack of time I didn’t show you lots of nice stuff: hows ensemble are stable under bin size (m=20 to m=200 regions) and matrix width (L=14 to L=26 bp), and under discarding random halves of the data. • I also didn’t show you how it works for other TFs. We’ve looked at other “broad-acting” TFs (RAP1, REB1, ..) with “similar” results. • The opportunities for quantitative study of evolution with a large population of binding sites and a clean quantitative phenotype are pretty exciting. Its not data analysis, but you can’t resist. • That a minimalist energy model so accurately describes complex DNA binding data is a big surprise. Arbitrary sets of 100s of genes cannot be so regulated; how rare are large sets which could be? • Different binding sites have different affinities (for good reasons?). We make specific predictions about how affinities are ordered. How does this concord with biochemical reality (Maerkl/Quake)? • The output of this effort is meant to be input to an effort to construct proper stat mech models of interesting networks. Next step! NIH Center Retreat
F(E)) ABF1 site stats Pbkgd(E) Pfunc(E) “Equilibrium Thermodynamics” of Site Evolution Assume: a) site s1evolves over time into an ensemble {s1} thermal with respect to a fitness function F(E(s)) b) many equivalent sites [si] of a TF in one species can be treated as an equivalent ensemble (ergodicity?). Population genetics (Kimura): raw mutation rate m is modified by fitness change DF: Infer the actual fitness function F(E) from ABF1 site energy statistics in S.cerevisiae: Known background m determines dist’n P0(E) of non-func’l sites. Then Pfunc(E) = Ptrue(E) - P0(E) . F(E) = ln(Pfunc(E)) is the “Hamiltonian” governing stochastic evolution of ABF1 binding sites (Lassig-Mustonen) Do MC simulations agree with data provided by real genomes? NIH Center Retreat