More on TF Motif Finding ChIP-chip / seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
De novo Sequence Motif Finding • Goal: look for common sequence patterns enriched in the input data (compared to the genome background) • Regular expression enumeration • Pattern driven approach • Enumerate patterns, check significance in dataset • Oligonucleotide analysis, MobyDick • Position weight matrix update • Data driven approach, use data to refine motifs • Consensus, EM & Gibbs sampling • Motif score and Markov background
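The pattern-driven approach above (enumerate patterns, check significance in the dataset) can be illustrated with a minimal sketch. It counts every k-mer in the input and in a background set and ranks k-mers by a pseudocount-smoothed frequency ratio. The function names, the pseudocount, and the use of a plain ratio in place of a proper significance test are all assumptions made for illustration; this is not the method of any specific tool named on the slide.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enriched_kmers(inputs, background, k, pseudo=1.0):
    """Rank k-mers by (input frequency) / (background frequency).

    The pseudocount (an assumption) keeps k-mers that are absent
    from the background from producing a division by zero.
    """
    fg = count_kmers(inputs, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_kmers = 4 ** k  # number of possible k-mers over A, C, G, T
    scores = {}
    for kmer, c in fg.items():
        f = (c + pseudo) / (fg_total + pseudo * n_kmers)
        b = (bg[kmer] + pseudo) / (bg_total + pseudo * n_kmers)
        scores[kmer] = f / b
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A real analysis would replace the ratio with a statistical test against a Markov background, as the last bullet of the slide suggests.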
Position Weight Matrix Update • Advantages: • Can look for motifs of any width • Flexible with base substitutions • Disadvantages: • EM and Gibbs sampling: no guaranteed convergence time • No guaranteed global optimum
Motif Finding in Bacteria • Promoter sequences are short (200-300 bp) • Motifs are usually long (10-20 bases) • Some have two blocks with a gap, some are palindromes • Long motifs are usually very degenerate • A single microarray experiment sometimes already provides enough information to search for TF motifs
Motif Finding in Lower Eukaryotes • Upstream sequences are longer (500-1000 bp), with some simple repeats • Motif width varies (5 – 17 bases) • Expression clusters provide input sequences of decent quality for TF motif finding • Motif combination and redundancy appear, although single motifs are usually significant enough for identification
Yeast Promoter Architecture • Co-occurring regulators suggest physical interaction between the regulators
Motif Finding in Higher Eukaryotes • Upstream sequences are very long (3KB-20KB) with repeats; TF motifs can also appear downstream • Motifs can be short or long (6-20 bases), and appear in combinations and clusters • Gene expression clusters alone are not good enough input • Need: • Comparative genomics: PhastCons score • Motif modules: motif clusters • ChIP-chip/seq
UCSC PhastCons Conservation • Functional regulatory sequences are under stronger evolutionary constraint • Align orthologous sequences together • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC
Conserved Motif Clusters • First find conserved regions in the genome • Then look for repeated transcription factor (TF) binding sites • They form transcription factor modules
Outline • ChIP-chip on yeast • Technology and data analysis: MDscan motif finding, regulatory network • ChIP-X on human • Tiling microarrays and peak finding • High throughput sequencing and peak finding • Data analysis and examples • Analysis: peak finding, gene expression analysis, sequence motif finding, regulatory network • Holistic picture of gene regulation
Motivation • Motif finding works well in bacteria, OK in yeast, marginally in worm/fly, and almost never in mammals • Cistrome: genome-wide in vivo binding sites of DNA-binding proteins • ChIP-chip and ChIP-seq give cistrome results
ChIP-chip Technology • Chromatin ImmunoPrecipitation + microarray • ChIP-on-chip or ChIP-chip • Also known as Genome Scale Location Analysis • Detect genome-wide in vivo location of TF and other DNA-binding proteins • Find all the DNA sequences bound by TF-X? • Cook all the dishes with cinnamon • Can learn the regulatory mechanism of a transcription factor or DNA-binding protein much better and faster
Promoter Array Hybridization [Figure labels: Genes, Intergenic, ChIP]
ChIP-DNA chip Detection • Started in yeast, use promoter cDNA microarray • ~ 6000 spots, each 800-1000 bp • Two color assay • Control: no antibody, or chromatin (a little bit of everything) • Need triplicates to cancel noise • Applied to all yeast TFs • TF modified to contain a tag • Tag can be precipitated with immunoglobulin
ChIP-chip Motif Finding • ChIP-chip gives 10-5000 binding regions ~600-1000bp long. Precise binding motif? • Raw data is like perfect clustering, plus enrichment values • MDscan • High ChIP ranking => true targets, contain more sites • Search TF motif from highest ranking targets first (high signal / background ratio) • Refine candidate motifs with all targets • Used successfully in ChIP-chip motif finding
Similarity Defined by m-match • For a given w-mer and any other random w-mer, count the matched positions • m-matches for the 8-mer TGTAACGT: TGTAACGT matched 8 • AGTAACGT matched 7 • TGCAACAT matched 6 • TGACACGG matched 5 • AATAACAG matched 4 • Pick a reasonable m to call two w-mers similar
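The m-match definition above can be checked with a few lines of code. This is a minimal sketch; the function names `match_count` and `is_m_match` are chosen here for illustration and do not come from MDscan itself.

```python
def match_count(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same width")
    return sum(x == y for x, y in zip(a, b))

def is_m_match(a, b, m):
    """True when the two w-mers agree at m or more positions."""
    return match_count(a, b) >= m

# The slide's example: matches against the 8-mer TGTAACGT.
ref = "TGTAACGT"
for other in ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]:
    print(other, "matched", match_count(ref, other))
```

With m = 6, for instance, the first three w-mers on the slide would be called similar to TGTAACGT and the last two would not.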
MDscan Seeds [Figure: each w-mer in the ChIP-chip selected upstream sequences (for example the 9-mer ATTGCAAAT) is grouped with its m-matches into a seed motif pattern; seeds are ordered by enrichment ("Higher enrichment"). W-mers shown in the figure: TTTGCGAAT TTGCAAATC ATTGCAAAT TTTGCGAAT TTTGCAAAT GCCACCGT ACCACCGT ACCACGGT GCCACGGC … GCAAATCCA GCAAATTCG GCAAATCCA GGAAATCCA GGAAATCCT TTGCAAATC TTGCGAATA TTGCAAATT TTGCCCATC TTTGCAAAT CAAATCCAA CAAATCCAA GAAATCCAC TGCAAATCC TGCAAATTC]
Update Motifs With Remaining Seqs [Figure labels: Seed1 m-matches; Extreme High Rank; All ChIP-selected targets]
Refine the Motifs [Figure labels: Seed1 m-matches; Extreme High Rank; All ChIP-selected targets]
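The seed and update steps can be sketched roughly as follows. This is a simplified illustration, not MDscan's implementation: its motif scoring function and the refinement pass that removes poorly fitting sites are left out. All function names are hypothetical.

```python
from collections import Counter

def wmers(seq, w):
    """All overlapping w-mers of a sequence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def matches(a, b):
    """Number of agreeing positions between two w-mers."""
    return sum(x == y for x, y in zip(a, b))

def build_seeds(top_seqs, w, m):
    """Group each distinct w-mer in the top-ranked sequences with its m-matches."""
    pool = [x for s in top_seqs for x in wmers(s, w)]
    return {cand: [x for x in pool if matches(cand, x) >= m]
            for cand in set(pool)}

def consensus(sites):
    """Most frequent base in each column of the aligned sites."""
    width = len(sites[0])
    columns = [Counter(site[j] for site in sites) for j in range(width)]
    return "".join(col.most_common(1)[0][0] for col in columns)

def update_with_remaining(seed, sites, other_seqs, m):
    """Add m-matches of the seed found in the remaining ChIP-selected targets."""
    w = len(seed)
    extra = [x for s in other_seqs for x in wmers(s, w)
             if matches(seed, x) >= m]
    return sites + extra
```

In this sketch a seed is simply a w-mer plus its m-matches. Ranking seeds, which the slides describe in terms of enrichment, would need a scoring function that is not reproduced here.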
Yeast TF Regulatory Network [Figure labels: Protein, Transcribe, Regulate, Gene]
ChIP-chip Better Explains Expression [Figure panels: Ndt80 & Sum1 regulated genes; Sum1 regulated genes; Ndt80 regulated genes]
Genome Tiling Microarrays • Promoter arrays don't work for human ChIP-chip • Binding can appear in much more distant intergenic sequences, introns, exons, or downstream sequences [Figure labels: Tiling Probes; Genomic DNA on the chromosome]
ChIP-chip on Tiling Microarray [Figure labels: ChIP, Ctrl, Chromosome, ChIP-DNA, Noise]
ChIP-chip • Detect genome-wide location of transcription and epigenetic factors • Affymetrix genome tiling arrays are cheaper • $2000 7 arrays * 6 million probes * (3 ChIP + 3 Ctrl) • But data is noisier and less informative • Two peaks? How about ChIP alone? Over 42M probes? [Figure: ChIP and Ctrl log probe intensity along chromosome coordinates]
ChIP-chip Analysis: Mann-Whitney U-test • Affy TAS, Cawley et al (Cell 2004): • Assign 1 to all probe pairs with MM > PM • Each probe: rank probes within [-500bp, +500bp] window • Check whether sum of ChIP ranks is much smaller • Consider all probes equally • Half of the probes have MM > PM [Figure: histogram of (PM – MM)]
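The window-level rank test described above can be sketched as below. This is an illustrative rank-sum test with a normal approximation, not the TAS implementation. It assumes that rank 1 goes to the highest value (so ChIP enrichment gives a small ChIP rank sum, as the slide puts it), and it omits the tie correction to the variance. The names `probe_signal` and `rank_sum_test` are hypothetical.

```python
import math

def probe_signal(pm, mm):
    """PM - MM, with probe pairs where MM > PM assigned 1, as on the slide."""
    return max(pm - mm, 1)

def rank_sum_test(chip, ctrl):
    """One-sided Mann-Whitney test that ChIP values exceed control values.

    Returns (U, z, p). Ranks are assigned in descending order, so a small
    ChIP rank sum (small U, negative z, small p) indicates enrichment.
    """
    pooled = [(v, True) for v in chip] + [(v, False) for v in ctrl]
    pooled.sort(key=lambda t: t[0], reverse=True)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share their average rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(chip), len(ctrl)
    chip_rank_sum = sum(r for r, (_, is_chip) in zip(ranks, pooled) if is_chip)
    u = chip_rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # lower-tail normal probability
    return u, z, p
```

In use, `chip` and `ctrl` would hold the `probe_signal` values of all probes falling in the [-500bp, +500bp] window around the probe being tested.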
Affymetrix Tiling Array Peak Finding • Challenges: • Massive data, probe values noisy • Only 1/3 of researchers get it to work the first time • Previous algorithms only work by comparing 3 ChIP with 3 Ctrl • Model-based Analysis of Tiling arrays (MAT) • Works with a single ChIP (no rep, no ctrl) • Finds individual failed samples • More sensitive, specific, and quantitative with 3 ChIP & 3 Ctrl • MAT: Johnson et al, PNAS 2006
MAT • Most of the probes in ChIP-chip measure non-specific hybridization and background noise • Estimate probe behavior by checking other probes with similar sequence on the same array • Probe sequence plays a big role in signal value
Model Sequence-Specific Probe Effect • First detailed model of probe sequence on probe signal • Example 25-mer probe: AATGC ACTGT GCACA GATCG GCCAT (7 A, 7 C, 6 G, 5 T; maps to 2 places in the genome) • Use all the probes on the array to estimate the parameters [Equation labels: Probe signal; # of T's; intercept; Position-specific A, C, G effect; A, C, G, T count squared; 25-mer copy number]
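The equation labels on this slide suggest a linear model whose predictors are derived from the probe sequence. The sketch below builds those predictors for the example probe. The feature names, the dictionary layout, and the use of a logarithm on the copy number are assumptions for illustration. Fitting the coefficients, which the slide says uses all the probes on the array, is not shown.

```python
import math
from collections import Counter

def probe_features(probe, copy_number):
    """Sequence-derived predictors for one 25-mer probe (hypothetical layout).

    - n_T: number of T's
    - pos{j}_{k}: 1 if position j holds base k, for k in A, C, G
    - n_{k}_sq: squared count of each of A, C, G, T
    - log_copy: log of the number of genomic matches (assumed transform)
    """
    probe = probe.replace(" ", "").upper()
    counts = Counter(probe)
    feats = {"n_T": counts["T"]}
    for j, base in enumerate(probe, start=1):
        for k in "ACG":
            feats[f"pos{j}_{k}"] = 1 if base == k else 0
    for k in "ACGT":
        feats[f"n_{k}_sq"] = counts[k] ** 2
    feats["log_copy"] = math.log(copy_number)
    return feats

# The slide's example probe, which maps to 2 places in the genome.
f = probe_features("AATGC ACTGT GCACA GATCG GCCAT", 2)
print(f["n_T"], f["n_A_sq"], f["n_C_sq"], f["n_G_sq"], f["n_T_sq"])
```

For a 25-mer this layout gives 1 + 75 + 4 + 1 = 81 predictors per probe.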
Probe Standardization • Fit the probe model array by array [Figure and equation labels: 6M probes; 2K bins; Observed probe intensity; Model predicted probe intensity; Observed probe variance within each bin]
[Figure: raw probe values at two spike-in regions with concentration 2X (ChIP, Ctrl); sequence-based probe behavior standardization (ChIP standardized, Ctrl standardized); window-based neighboring probe combination for ChIP-region detection (ChIP window; ChIP – Ctrl; 3 ChIP – 3 Ctrl)]
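One way to read the standardization labels is: group probes into bins by their model-predicted intensity, then divide each probe's residual (observed minus predicted) by the spread observed within its bin. The sketch below follows that reading. Binning by sorted predicted value and dividing by the standard deviation of the observed values in the bin (the square root of the variance named on the slide) are assumptions, and `standardize` is a hypothetical name.

```python
import math
import statistics

def standardize(observed, predicted, n_bins):
    """Standardize probe values bin by bin.

    Probes are sorted by predicted intensity and split into n_bins groups
    of roughly equal size. Each probe's value becomes
    (observed - predicted) / (standard deviation of observed values in its bin).
    """
    order = sorted(range(len(observed)), key=lambda i: predicted[i])
    size = math.ceil(len(order) / n_bins)
    t = [0.0] * len(observed)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        sd = statistics.pstdev([observed[i] for i in idx])
        if sd == 0:
            sd = 1.0  # guard for a degenerate bin; an arbitrary choice
        for i in idx:
            t[i] = (observed[i] - predicted[i]) / sd
    return t
```

With the slide's numbers (6M probes, 2K bins) each bin would hold about 3000 probes.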
MA2C: Model-based Analysis of 2-Color Arrays • Normalize probes by GC bins within each array • How much variance is observed in the GC bin • Give high confidence probes more weight • Running window average or median for peak finding • MA2C: Song et al, Genome Biol 2007
Is a ChIP experiment working? • MAT window scores ~ normal with long tails • Estimate p-value of normal from left half of data • FDR = A / B (Ctrl/ChIP peaks are all FPs) • Spike-in shows MAT FDR estimate is accurate • Can find individual failed replicate [Figure labels: A, B]
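The two ideas on this slide can be sketched under stated assumptions: the null distribution is estimated by mirroring the left half of the window scores around their median, and A and B are taken to be the numbers of windows beyond the cutoff on the Ctrl-over-ChIP side and the ChIP-over-Ctrl side respectively. The slide does not define A and B explicitly, and real peak callers count merged peaks, not individual windows. All function names are hypothetical.

```python
import math
import statistics

def left_half_null(scores):
    """Estimate (center, sigma) of a normal null from the left half of the scores.

    The scores below the median are mirrored around it, so the long
    right tail of true signal does not inflate sigma.
    """
    center = statistics.median(scores)
    left = [s for s in scores if s < center]
    mirrored = left + [2 * center - s for s in left]
    mirrored += [s for s in scores if s == center]
    sigma = math.sqrt(sum((s - center) ** 2 for s in mirrored) / len(mirrored))
    return center, sigma

def normal_upper_p(score, center, sigma):
    """Upper-tail p-value of a score under the estimated normal null."""
    return 0.5 * math.erfc((score - center) / (sigma * math.sqrt(2)))

def empirical_fdr(scores, cutoff, center=0.0):
    """FDR = A / B: A counts scores at or below center - cutoff (assumed
    all false positives), B counts scores at or above center + cutoff."""
    a = sum(s <= center - cutoff for s in scores)
    b = sum(s >= center + cutoff for s in scores)
    return a / b if b else float("nan")
```

A failed replicate would show up as an FDR close to 1 at any cutoff, since the positive side would then contain no more windows than the negative side.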
ChIP-Seq • Sequence millions of 30-mer ends of fragments • Map 30-mers back to the genome [Figure labels: ChIP-DNA, Noise]
MACS: Model-based Analysis of ChIP-Seq • Use confident peaks to model shift size [Figure label: Binding]
Peak Calls • Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size) • ChIP-Seq shows local biases in the genome • Chromatin and sequencing bias • 200-300bp control windows have too few tags • But can look further • Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k) [Figure: ChIP and Control tracks with 300bp, 1kb, 5kb, 10kb windows] • http://liulab.dfci.harvard.edu/MACS/ • Zhang et al, Genome Biol, 2008
CEAS: Cis-regulatory Element Annotation System • Data Analysis Button for Biologists • http://ceas.cbi.pku.edu.cn
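The dynamic background rate above can be illustrated with a small sketch. This is not MACS code. It assumes that each control count is rescaled to the width of the candidate window and that the ChIP and control libraries have the same size. `poisson_sf` and `local_lambda` are hypothetical names, and the numbers in the usage lines are made up.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation.

    Adequate for illustration; very small p-values lose precision.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def local_lambda(total_tags, genome_size, window_bp, ctrl_counts):
    """Expected tags per candidate window: max of genome-wide and local rates.

    ctrl_counts maps a control window width in bp (e.g. 1000, 5000, 10000)
    to the control tag count in that window centered on the candidate.
    """
    lam_bg = total_tags * window_bp / genome_size
    rates = [lam_bg]
    for size_bp, n in ctrl_counts.items():
        rates.append(n * window_bp / size_bp)
    return max(rates)

# Hypothetical numbers: 1M tags, 1 Gb genome, 300 bp candidate window.
lam = local_lambda(1_000_000, 1_000_000_000, 300, {1000: 10, 5000: 20, 10000: 30})
print(lam, poisson_sf(3, lam))
```

In this made-up case the genome-wide rate alone (0.3 tags per window) would make 3 tags look significant, while the local rate of 3.0 does not, which is the kind of local bias the slide describes.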
Estrogen Receptor • Carroll et al, Cell 2005 • Overactive in > 70% of breast cancers • Where does it go in the genome? • ChIP-chip on chr21/22, motif and expression analysis found its partner FoxA1 [Figure labels: ER, TF??]
Estrogen Receptor (ER) Cistrome in Breast Cancer • Carroll et al, Nat Genet 2006 • ER may function far away (100-200KB) from genes • Only 20% of ER sites have PhastCons > 0.2 • ER has different effects depending on its collaborators [Figure labels: ER, AP1, NRIP]