Achim Tresch Computational Biology

‘Omics’ - Analysis of high dimensional Data Achim TreschComputational Biology

Regulation of gene expression protein Translation Localization Stability Post-transcriptional mRNA 3’UTR Transcriptional Pol II DNA Activation Repression

Regulation of gene expression • Where does each transcription factor bind in the genome, in each cell type, at a given time? Near which genes ? • What is the “cis-regulatory code” of each factor ? Does it require any co-factors ? DNA Activation Repression

Chromatin Immunoprecipitation (ChIP) Transcription factor of interest Antibody Sequencing

Chromatin Immunoprecipitation (ChIP) Control: input DNA Sequencing

Chromatin Immunoprecipitation (ChIP) Sonication 25-40bp ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG Average length ~ 250bp

Chromatin Immunoprecipitation (ChIP) ChIP-Seq Analysis Workflow ELAND Bowtie SOAP SeqMap … FindPeaks CHiPSeq BS-Seq SISSRs QuEST MACS CisGenome … Alignment Peak Detection Motif Analysis Annotation Visualization

Read Alignment Read direction provides extra information Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Read Alignment genome Read count T A T T A A T T A T C C C C A T A T A T G A T A T genome Expected read count Expected read count = total number of reads * extended fragment length / chr length

Read Alignment

Peak Detection • We need to correct for input DNA reads (control) • - non-uniformly distributed (form peaks too) - vastly different numbers of reads between ChIP and input • Calculate read count at each position (bp) in genome • Determine if read count is greater than expected

Peak Detection Is the observed read count at a given genomic position greater than expected ? Frequency x = observed read count λ = expected read count Read count The Poisson distribution

Peak Detection Is the observed read count at a given genomic position greater than expected ? x = 10 reads (observed) λ = 0.5 reads (expected) genome P(X>=10) = 1.7 x 10-10 log10 P(X>=10) = -9.77 The Poisson distribution -log10 P(X>=10) = 9.77

Peak Detection Read count Expected read count -Log(p) Expected read count = total number of reads * extended fraglen / chrlen

Peak Detection Read count Expected read count -Log(p) Expected read count = total number of reads * extended frag len / chr len Input reads

Peak Detection ChIP INPUT Read count Read count Expected read count Expected read count Genome positions (bp) Genome positions (bp) -Log(Pc) -Log(Pi) Threshold Log(Pc) - Log(Pi)

Peak Detection Normalized Peak score (at each bp) P(XChIP) • Determine all genomic regions with R>=15 • Merge peaks separated by less than 100bp • Output all peaks with length >= 100b R = -log10 P(Xinput) Will detect peaks with high read counts in ChIP, low in Input

Peak Detection The constant rate assumption does not hold! Negative binomial model fits the data better! HongkaiJi et al. Nature Biotechnology 26: 1293-1300. 2008

Visualization ChIP reads Input reads Detected Peaks 80% are within <20kb of a known gene

Motif Search True TF binding peak? Yes Dependence is quantified using the mutual information Yes Target regions Yes True TF peak Yes Yes No Yes Yes … Absent Motif No Present No Random regions No No No No …

Highly informative k-mer MI CTCATCG 0.0618 TCATCGC 0.0485 AAAATTT 0.0438 GATGAGC 0.0434 AAAAATT 0.0383 ATGAGCT 0.0334 TTGCCAC 0.0322 TGCCACC 0.0298 ATCTCAT 0.0265 ... ... ACGCGCG 0.0018 CGACGCG 0.0012 TACGCTA 0.0011 ACCCCCT 0.0010 CCACGGC 0.0009 TTCAAAA 0.0005 AGACGCG 0.0004 CGAGAGC 0.0003 CTTATTA 0.0002 MI=0.081 MI=0.045 MI=0.040 ... Not informative Motif Search

A/G C/G/T T/G A/T/G C/G A/C/G Motif Search Optimizing k-mers into more informative degenerate motifs True TF binding peak? ATCCGTACA Yes Yes Target regions Yes Yes Yes Yes … No No Random regions ATCC[C/G]TACA No No No which character increases the mutual information by the largest amount ? No …

Motif Search change

Motif Analysis Motif co-occurrence anallysis Discovered Motifs Enrichment Depletion

The ENCODE Project Goal: Define all functional elements in the human genome How: Lots of groups Lots of assays Lots of cell lines Lots of communication/consortium analysis Standardization of methods, reagents, analysis Genome-wide A lot of money

The ENCODE Project • 2 Tier 1 cell lines • GM12878 (B cell) • K562 (CML cells) • 5 Tier 2 cells • HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC • Many Tier 3 cells RNA profiling (Scott Tenenbaum): Inter-cell line differences are greater than inter-lab differences

The ENCODE Project Lots of data and data types generated by RNA-seq RNA-array TF ChIP-seq Histonemodif ChIP-seq DNaseHS-seq Methyl-seq Methyl27-bisulfite 1M SNP genotyping

HMM segmentation PCA analysis Dynamic Bayesian Networks Integrative Data Analysis Open Chromatin Trans. Factor Chip-seq Histone Mod. Chip-seq RNA Std. Peaks Std. Peaks Region calls Active regions …… Biological interpretation

Integrative Data Analysis 12 Histone modifications 2 Transcription factors GM12878 K562 “Standard” EM Training Posterior Probability Decoding Genome Viterbi Path 25-state HMM State E State C State F State I State A Data: Entire ENCODE Consortium Analysis: Jason Ernst/Manolis Kellis

Metagene Analysis of RNA transcription initiation F pA Pol II B H Kin28 ChIP-chip profiles, averaged across ~300 expressed genes of medium length CTD Lidschreiber et al., NSMB 2010, Mayer et al., Science 2012.

Metagene Analysis of RNA transcription promotor escape initiation nascent RNA F pA Pol II Pol II B H Kin28 5‘ S5P P CTD S7P P Lidschreiber et al., NSMB 2010, Mayer et al., Science 2012.

Metagene Analysis of RNA transcription promotor escape elongation initiation Elf1 nascent RNA nascent RNA F pA Pol II Pol II Pol II B H Kin28 Spt4/5 * * m7G P 20 S5P CE P P CBP 80 S2P P Ctk1 CTD S7P P P Spt6 Lidschreiber et al., NSMB 2010, Mayer et al., Science 2012.

Metagene Analysis of RNA transcription promotor escape elongation termination initiation Elf1 nascent RNA nascent RNA F pA Pol II Pol II Pol II Pol II B H Kin28 Spt4/5 * * m7G P P 20 S5P CE P P CBP P Pcf11 80 S2P P Ctk1 CTD S7P P P P * Spt6 Lidschreiber et al., NSMB 2010, Mayer et al., Science 2012.

Metagene Analysis of RNA transcription promotor escape elongation termination initiation Elf1 F pA Pol II Pol II Pol II Pol II B H Kin28 Spt4/5 * * m7G P P 20 S5P CE P P CBP P Pcf11 80 S2P P Ctk1 CTD S7P P P P * Spt6 Is the sequence of binding, dissociation and modification events universal?

HMM Analysis of RNA transcription ChIP-chip occupancy profiles genomic position Ernst and Kellis (2012): ChromHMM: automating chromatin state discovery and characterization

HMM Analysis of RNA transcription ChIP-chip occupancy vectors

HMM Analysis of RNA transcription typical occupancy vector(s) state 3 state 1 state 2 transition matrix state 4 state 5

Textbook: Hidden Markov Models (HMMs) D: Data (occupancy vectors) • D2 • D2 • D1 • D1 • D3 • D3 X: Hidden (transcription) states • Ψ : Emission distributions ΨX1 ΨX1 ΨX2 ΨX2 ΨX3 ΨX3 Γ : Transition probabilities • X1 • X1 • X2 • X2 • X3 • X3 [less important: P(X1): Initial state distribution] ΓX2X3 ΓX1X2 genomic position Likelihood: Decoding: Viterbi algorithm Parameter Learning: Baum-Welch algorithm

Results on the S.cerevisiae data set Viterbi paths genes transcription start site

Results on the S.cerevisiae data set 2 5 1 Productive elongation Elf1, Ser2P high Initiation- elongation transition Nucl. high Ser2P low Initiation state: TFIIB high Nucl., Spt5, Ser2P, Elf1 low 8 Untranscribed regionsall low except Nucl. Termination Pcf11 high 8

Results on the S.cerevisiae data set initiation-elongation Observation: The transition matrix is almost symmetric, due to transcription in forward and reverse direction transition graph initiation transition matrix termination productive elongation early elongation intergenic/untranscribed

Sense vs. antisense transcription ChIP-chip tracks(multivariate Gaussian emissions) • D1 • D4 • D3 • D2 • D6 • D5 • X1 • X2 • X4 • X3 • X5 • X6 transcriptannotation Transcription on Crick strand Transcription on Watson strand Transcrpt. onCrick strand

The bidirectional Hidden Markov Model Additional constraint 1: Corresponding Watson and Crick states have identical emission distributions Ψ1 Ψ2 . . .Ψk Ψk . . . Ψ2Ψ1 “Watson“ transcription states Additional constraint 2: Γ12= P(Xt+1=Ψ2 | Xt=Ψ1 ) = P(Xt=Ψ1 |Xt+1=Ψ2) = Γ21• P(Xt=Ψ1 ) / P(Xt=Ψ2 ) „Crick“ transcription states Intergenic state

State transitions reflect biochemichal transitions 8 2 1 4 7 5 9 standard transcription Mayer et al. (2010): transition from initiation to elongation at +150bp 10 untranscribed genes

Different transcription cycles ?! 8 8 2 2 1 1 4 3 7 5 9 standard transcription (stepwise recruitment) highly transcribed genes (immediate recruitment)

A grammar of transcription • very low synthesis rate • high decay rate • Enrichment of stress response genes P I • low synthesis rate • very high decay rate • enrichment of genes involved in epigenetic regulation of gene expression, cell cycle PE EE2 EE1 E1 P T • medium synthesis rate • medium decay rate • Enrichment of genes involved in reproduction PE EE2 EE1 E2 P T • high synthesis rate • low decay rate • Enrichment of genes involved in ribosome biogenesis, rRNA processing PPE E3 P T

A grammar of transcription • medium synthesis rate • medium decay rate • Enrichment of genes involved ijn G1 phase of cell cycle PE EE1 EE2 P T • very high synthesis rate • Very low decay rate • Enrichment of ribosomal protein genes, intron containing genes P T PPE

Annotation of bidirectional promoters Viterbi sequence from directional HMM 696 bidirectional promoters ...........-pE-pE-pE-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-pE-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-T-eE2--eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE2-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-eE1-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-E3-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-I-.............. Text search (Regular Expression) PE- P+ P- PE+

Annotation of 45 unknown transcripts Strand-specific transcription data from Xu et al., Nature 2009 stable transcripts on the - strand cryptic transcripts on the - strand Viterbi sequence from directional HMM two new transcripts

Achim Tresch Computational Biology