Biophysics and Bioinformatics of Transcription Regulation
Marko Djordjevic, Dept. of Physics, Pupin Labs., Columbia U.
Outline
• Part I: Biophysics approach to transcription factor binding site discovery.
• Part II: Quantitative analysis of bacteriophage gene expression strategies.
PART I: Transcription factor binding site identification
• Introduction to transcription regulation
• Model for T.F.-DNA interaction
• Biophysics-based algorithm
• Application to E. coli T.F. binding sites
• Comparison with information theory algorithm
• Conclusion for Part I
Control of Gene Expression by Transcription Factors: lac Operon
[Figure: regulation of the lac operon; lacZ expression is OFF, OFF, OFF and ON in the four panels. From Alberts et al., Molecular Biology of the Cell.]
Examples of CAP factor binding sites
attcgtgatagctgtcgtaaag
ttttgttacctgcctctaactt
aagtgtgacgccgtgcaaataa
tgccgtgattatagacactttt
atttgcgatgcgtcgcgcattt
taatgagattcagatcacatat
taatgtgacgtcctttgcatac
gaaggcgacctgggtcatgctg
aggtgttaaattgatcacgttt
Half sites for a CAP dimer: similar “words”
From experiment, we know some, but NOT all, binding sites for a given transcription factor. Can we predict ALL of them?
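Biophysical model of T.F.-DNA interactions
Probability of DNA segment S to bind a protein c: $p(S) = \frac{1}{1 + e^{(E(S) - \mu)/k_B T}}$, with chemical potential $\mu = k_B T \ln([c]/K)$.
Note: p(S) is a Fermi function, which describes saturation of binding.
To parameterize the binding energy, we use the independent nucleotide approximation: $E(S) = \sum_{i=1}^{L} \sum_{\alpha} \epsilon_{i\alpha} S_{i\alpha}$, where $S_{i\alpha} = 1$ if base α is at position i and 0 otherwise.
A. Sengupta, M. Djordjevic and B. I. Shraiman, PNAS 2002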
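A minimal Python sketch of the model on this slide: the binding energy summed position by position (independent nucleotide approximation) and the Fermi-function binding probability. The 3-bp matrix `toy_matrix`, the value `toy_mu`, and the function names are hypothetical and chosen only for illustration; energies are in units of kT.

```python
import math

def binding_energy(site, energy_matrix):
    """Independent nucleotide approximation: E(S) = sum over positions of eps[i][base]."""
    if len(site) != len(energy_matrix):
        raise ValueError("site length must match the energy matrix length")
    return sum(column[base] for column, base in zip(energy_matrix, site.lower()))

def binding_probability(energy, mu, kT=1.0):
    """Fermi function: p = 1 / (1 + exp((E - mu) / kT))."""
    return 1.0 / (1.0 + math.exp((energy - mu) / kT))

# Hypothetical 3-bp energy matrix (illustrative values, not fitted to any data).
toy_matrix = [
    {"a": 0.0, "c": 2.0, "g": 2.0, "t": 1.0},
    {"a": 1.5, "c": 0.0, "g": 2.5, "t": 2.0},
    {"a": 2.0, "c": 2.0, "g": 0.0, "t": 1.0},
]
toy_mu = 1.0  # hypothetical chemical potential

best = binding_energy("acg", toy_matrix)   # lowest-energy site for this matrix
worse = binding_energy("ttt", toy_matrix)  # a higher-energy site
p_best = binding_probability(best, toy_mu)
p_worse = binding_probability(worse, toy_mu)
```

Sites with E well below μ bind with probability close to 1 regardless of how far below they are, which is the saturation effect noted on the slide.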
What do we want to do?
Experimentally known binding sites → Algorithm (we now look at this) → compute ε_iα and μ → find all transcription factor binding sites in the genome
(M. Djordjevic, A. M. Sengupta and B. I. Shraiman, Genome Res. 2003)
Problem: Mix the protein with a set containing all genomic sequences of length L. Then extract and sequence some of the DNA sequences bound by the factor. Given the set of extracted sequences {Sk}, determine the energy matrix and the chemical potential.
Solution: Maximize the likelihood Λ of having all {Sk} bound (at chemical potential μ) and extracted, and none of the other sequences. [Equations for Λ not preserved.]
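One way to write the likelihood described in words above, as a sketch rather than the slide's own formula. It assumes that sequences bind independently and that each bound sequence is extracted with the same probability γ; the symbol γ is an assumption introduced here, and p(S) is the Fermi-function binding probability of the model.

```latex
\Lambda \;=\; \prod_{S \in \{S_k\}} \gamma\, p(S)
\;\prod_{S' \notin \{S_k\}} \bigl(1 - \gamma\, p(S')\bigr),
\qquad
p(S) = \frac{1}{1 + e^{(E(S) - \mu)/k_B T}}
```

The first product rewards low energies for the observed sites; the second penalizes parameter choices under which many unobserved genomic sequences would also be bound.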
Quadratic Programming (QP) Algorithm
In the T→0 (“all-or-none”) approximation to p(S):
a) all examples Sk are bound by the T.F.
b) the number of bound random S’s is minimized
[Figure: quadratic programming illustration; recoverable labels: Examples, 2σ, (σ/μ)² minimized.]
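Conditions a) and b) can be written as a constrained minimization. This is a sketch under stated assumptions, not necessarily the form used on the original slide: energies are measured relative to the background mean, background positions are independent with nucleotide frequencies $q_\alpha$ (an assumed symbol), and σ² is the resulting variance of E over random sequences.

```latex
\min_{\epsilon,\,\mu}\ \left(\frac{\sigma}{\mu}\right)^{2}
\quad\text{subject to}\quad
E(S_k) \le \mu \quad \text{for all } k,
\qquad
\sigma^{2} = \sum_{i=1}^{L}\Bigl[\,\sum_{\alpha} q_\alpha\,\epsilon_{i\alpha}^{2}
 - \Bigl(\sum_{\alpha} q_\alpha\,\epsilon_{i\alpha}\Bigr)^{2}\Bigr]
```

If the background energies are roughly Gaussian, the fraction of random sequences with E < μ shrinks as |μ|/σ grows, so minimizing (σ/μ)² minimizes the number of bound random sequences. Because the overall scale of ε is otherwise free, μ can be fixed at a constant; the objective is then quadratic in ε and the constraints are linear, which is why a quadratic programming solver applies.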
Application to transcription factor binding site identification in E. coli
• Start with known sites for 50 T.F. (and RNA polymerase) in the DPInteract database (Church lab)
• Use the QP algorithm to define ε, μ for each factor
• Identify all (intergenic) DNA segments S satisfying E_ε(S) < μ for each (ε, μ) set
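The last step, finding every segment with E(S) < μ, amounts to a sliding-window scan. The sketch below assumes a hypothetical 3-bp energy matrix and threshold, and scans only the given strand; a real search would also scan the reverse complement and restrict itself to intergenic regions.

```python
def scan_for_sites(sequence, energy_matrix, mu):
    """Return (start, window, energy) for every window with E(window) < mu."""
    width = len(energy_matrix)
    seq = sequence.lower()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "acgt" for base in window):
            continue  # skip windows containing ambiguous bases
        energy = sum(column[base] for column, base in zip(energy_matrix, window))
        if energy < mu:
            hits.append((start, window, energy))
    return hits

# Hypothetical 3-bp energy matrix and threshold (illustrative values only).
toy_matrix = [
    {"a": 0.0, "c": 2.0, "g": 2.0, "t": 1.0},
    {"a": 1.5, "c": 0.0, "g": 2.5, "t": 2.0},
    {"a": 2.0, "c": 2.0, "g": 0.0, "t": 1.0},
]
hits = scan_for_sites("ttacgtt", toy_matrix, mu=1.0)
```

Here only the window `acg` at offset 2 falls below the threshold; every other window has an energy of 4.5 or more.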
Empirical distribution of estimated E
[Figure: distributions of estimated E; recoverable labels: Background, μ.]
Sample results for pleiotropic factors
RNAP (RpoD) site statistics
Note: 50% false negatives for promoter prediction; this can be reduced by combining with an activator-TF search (e.g. CAP activators 40%).
Information-theoretic weight matrix
• Maximize L with respect to q.
• Physically, equivalent to: [equation not preserved]
• This is not the correct binding probability: saturation effects are not properly described.
• A natural threshold does not exist.
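For comparison, a conventional log-odds weight matrix can be sketched as follows. This is the generic textbook construction, not necessarily the exact variant the slide compares against; the uniform background, the pseudocount of 1, and the toy sites are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "acgt"

def log_odds_matrix(sites, background=None, pseudocount=1.0):
    """Per-position log2(f_i(b) / q_b) from aligned sites of equal length."""
    background = background or {b: 0.25 for b in BASES}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(width):
        counts = Counter(s[i] for s in sites)
        matrix.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return matrix

def score(site, matrix):
    """Sum of per-position log-odds weights; higher means more site-like."""
    return sum(column[base] for column, base in zip(matrix, site))

toy_sites = ["acg", "acg", "atg", "ccg"]  # hypothetical aligned sites
weights = log_odds_matrix(toy_sites)
```

The score ranks candidate sequences, but nothing in the construction says which score separates binders from non-binders, which is the missing-threshold point made on the slide.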
False negative/positive trade-off curve (based on comparison with RegulonDB)
[Figure: trade-off curve; recoverable label: μ.]
Prediction of binding site modality
E.g. CAP, a dual-function transcription factor. Prediction is based on the position of predicted CAP sites relative to predicted promoters (i.e. RNAP sites).
[Figure: predicted sites grouped as “Class I” activators, “Class II” activators, and repressors.]
Part I: Conclusion
Thinking physically about protein-DNA recognition led to a new and improved bio-informatic algorithm.
The algorithm is designed to correctly estimate ε_iα and μ given data on protein binding to oligos under controlled conditions (i.e. a “bio-physical” experiment).
For bio-informatic data the algorithm provides:
1) A rational choice of the binding threshold μ
2) Minimization of expected FALSE POSITIVES
Note: The information-theoretic weight matrix approach does not estimate the threshold score for binding.
Acknowledgements: Anirvan M. Sengupta (Dept. of Physics, Rutgers U.) Boris I. Shraiman (KITP, UCSB)
Xp10 bacteriophage gene expression strategy Marko Djordjevic Columbia U., Department of Physics, Pupin Labs
Overview
• Introduction to bacteriophage biology
• Motivation
• Experimental setup
• Quantitative data analysis
• Bioinformatic analysis
• Kinetic modeling
• Conclusion and more general context
Lytic bacteriophage Xp10 infects Xanthomonas oryzae. The bacterium Xanthomonas oryzae causes bacterial leaf blight, a serious disease of rice.
From: Yuzenkova et al., J. Mol. Biol. (2003) 330, 735-748
Photo from: Mueller, K.E. 1983. Field Problems of Tropical Rice
Xp10 genome
[Figure: Xp10 genome map with ORFs labeled by number and orientation (R genes and L genes), transcription regions TR1 to TR5, and promoters P1, P2, P3. Legend: structural genes; HNH endonuclease genes; host lysis gene; viral DNA replication genes; Xp10 RNAP gene (32L); p7 gene (45L); promoter classes -10/-35 and ext -10.]
p7 functions:
• Inhibits transcription from most host RNAP promoters
• Acts as an anti-termination protein
Xp10 genome organization and its anti-termination mechanism are reminiscent of λ phage, which uses only host RNAP. However, this view leaves no role for Xp10 RNAP!
Motivation
Why is Xp10 RNAP needed? We have to understand the Xp10 transcription strategy.
Quantitative understanding? A well-defined problem: find the transcription activities (amount of transcript generated per unit time) of both RNAPs, for all Xp10 genes, throughout the infection.
Scheme of the experiment
• X. oryzae infected with Xp10
• Samples at t (time post-infection) = 0, 1, 3, 5, 7, 10, 15, 20, 25, 40, and 60 minutes
• In parallel: + rifampicin (a drug that inhibits bacterial RNAP), 20-minute incubation
• RNA isolation
• Reverse transcription with random hexamers, giving 32P-labeled cDNA probes
• Hybridization to a macroarray (30 Xp10-gene fragments + control spots)
• Result: quantitatively measured transcript abundances
Quantitative data analysis
(E. Semenova, M. Djordjevic, B. Shraiman and K. Severinov, in press, Mol. Microbiol.)
[Figure: abundance of the total Xp10 RNA and of the X. oryzae rpoC mRNA as a function of time.]
[Figure: clustering of genes into different expression classes; rows labeled by Xp10 gene name.]
[Figure: average abundance of Xp10 genes belonging to different expression classes.]
Transcript kinetic analysis when bacterial RNAP is inhibited
Change in transcript abundance at time t = transcript abundance (t + 20 min rifampicin) - transcript abundance (t)
• Xp10 RNAP has to transcribe R genes.
• L genes are transcribed exclusively by bacterial RNAP.
• Determine half-lives of L transcripts.
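A half-life can be read off from a pair of abundance measurements taken before and after the 20-minute rifampicin incubation. The sketch below assumes simple first-order decay with no new synthesis during the incubation, which is plausible only for transcripts made exclusively by the inhibited bacterial RNAP; the function name and the example numbers are hypothetical.

```python
import math

def half_life(abundance_before, abundance_after, chase_minutes=20.0):
    """Half-life in minutes, assuming A_after = A_before * exp(-rate * chase_minutes)."""
    if abundance_before <= 0 or abundance_after <= 0:
        raise ValueError("abundances must be positive")
    if abundance_after >= abundance_before:
        raise ValueError("no net decay observed; the no-synthesis assumption fails")
    decay_rate = math.log(abundance_before / abundance_after) / chase_minutes
    return math.log(2) / decay_rate

# Hypothetical example: abundance drops fourfold during the 20-minute incubation.
example = half_life(100.0, 25.0)
```

A fourfold drop corresponds to two half-lives, so the example evaluates to about 10 minutes. A transcript whose abundance rises during the incubation must be made by a rifampicin-resistant polymerase, and the function deliberately refuses to assign it a half-life.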
Bioinformatic search for Xp10 promoters
Promoters predicted by MLSA and QPMEME algorithms (QPMEME: M. Djordjevic, A. Sengupta and B. Shraiman, Genome Res. 13 (2003)):
[Figure: predicted promoters Pup, PL1, PL2, P3 (and PL1’, PL2’) in the region of genes 56L and 57R; recoverable labels: 43 bp, 43 bp; genome coordinates 42253, 42347, 42611, 42639, 43142.]
Experimental verification:
[Figure: gel panels for Pup, P3, PL1 and PL2, each with sequencing lanes and a PrE lane.]
What are the contributions of the two RNA polymerases to R gene transcription activity?
(M. Djordjevic, E. Semenova, B. I. Shraiman and K. Severinov, in preparation)
[Figure: measured transcript abundances; recoverable labels: 56L, 5R, +, 5, 20, 25.]
Kinetic model assumptions:
• Anti-termination efficiency given by: [equation not preserved]
• n0 and n are unknown constants, different for the two RNA polymerases
• Proteins are stable on the time scale of infection
Estimating transcription activities
Transcription activity from transcript abundance
[Figure panels: Early R genes; Late R genes.]
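The relation between abundance and activity is not preserved on this slide. A common minimal assumption, used here purely as an illustration and not as the authors' model (which also involves anti-termination), is the balance dA/dt = φ(t) - λA, so that the activity is φ = dA/dt + λA. The sketch estimates φ at interval midpoints by finite differences; the decay rate and the data are hypothetical.

```python
def transcription_activity(times, abundances, decay_rate):
    """Estimate phi = dA/dt + decay_rate * A at the midpoint of each time interval."""
    if len(times) != len(abundances):
        raise ValueError("times and abundances must have the same length")
    activities = []
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        a0, a1 = abundances[i], abundances[i + 1]
        slope = (a1 - a0) / (t1 - t0)        # finite-difference dA/dt
        mean_abundance = 0.5 * (a0 + a1)     # abundance at the interval midpoint
        activities.append((0.5 * (t0 + t1), slope + decay_rate * mean_abundance))
    return activities

# Hypothetical data: abundance rising linearly, decay rate 0.1 per minute.
estimates = transcription_activity([0.0, 10.0, 20.0], [0.0, 10.0, 20.0], decay_rate=0.1)
```

With a constant abundance the estimate reduces to λA, the synthesis needed to balance decay, which is a quick sanity check on the sign conventions.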
Modeling results
[Figure panels: Transcription of early R genes; Transcription of L genes; Transcription of late R genes.]
Is transcription by both RNA polymerases necessary for phage viability?
Experiment: If Rif is added at 15 min, the amount of progeny is reduced by 70%.
Our prediction from estimated transcription activities: Transcription of R genes by both host and phage RNAP is necessary for phage viability!
What have we learned about Xp10?
• We identified promoters recognized by Xp10 RNAP. This was a non-trivial problem!
• The joint transcription of the same set of genes by two types of RNA polymerases is unprecedented for a bacteriophage (but it occurs in chloroplasts)!
• Our results strongly suggest that transcription of R genes by both RNA polymerases is necessary for phage viability.
On a more general level
• We argue that a micro-array experiment, coupled to quantitative data analysis, bioinformatics and kinetic modeling, presents an efficient way to analyze phage transcription strategy.
• We introduce quantitative methods of data analysis that may be used to study gene expression strategies of novel viruses.
Acknowledgements: Ekaterina Semenova (Waksman Institute, Rutgers U.) Boris Shraiman (KITP, UCSB) Konstantin Severinov (Waksman Institute, Dept. of Molecular Biology and Biochemistry, Rutgers U.)