ChIP-seq Xiaole Shirley Liu STAT115, STAT215
Outline • ChIP-chip on yeast • Technology and data analysis: MDscan motif finding, regulatory network • ChIP-X on human • Tiling microarrays and peak finding • High throughput sequencing and peak finding • Data analysis and examples • Analysis: peak finding, gene expression analysis, sequence motif finding, regulatory network • Holistic picture of gene regulation
Motivation • Motif finding works well in bacteria, OK in yeast, marginally in worm/fly, and almost never in mammals • Cistrome: genome-wide in vivo binding sites of DNA-binding proteins • ChIP-chip and ChIP-seq give cistrome results
ChIP-chip Technology • Chromatin ImmunoPrecipitation + microarray • ChIP-on-chip or ChIP-chip • Also known as Genome Scale Location Analysis • Detect genome-wide in vivo location of TF and other DNA-binding proteins • Find all the DNA sequences bound by TF-X? • Cook all the dishes with cinnamon • Can learn the regulatory mechanism of a transcription factor or DNA-binding protein much better and faster
Promoter Array Hybridization • [Figure: ChIP DNA hybridized to a promoter array; labels: genes, intergenic regions, ChIP]
ChIP-DNA chip Detection • Started in yeast, using promoter cDNA microarrays • ~ 6000 spots, each 800-1000 bp • Two-color assay • Control: no antibody, or chromatin (a little bit of everything) • Need triplicates to cancel noise • Applied to all yeast TFs • TF modified to contain a tag • Tag can be precipitated with immunoglobulin
ChIP-chip Motif Finding • ChIP-chip gives 10-5000 binding regions ~600-1000bp long. Precise binding motif? • Raw data is like perfect clustering, plus enrichment values • MDscan • High ChIP ranking => true targets, contain more sites • Search TF motif from highest ranking targets first (high signal / background ratio) • Refine candidate motifs with all targets • Used successfully in ChIP-chip motif finding
Similarity Defined by m-match • For a given w-mer and any other random w-mer, m is the number of positions at which the two agree • m-matches for the 8-mer TGTAACGT: • TGTAACGT matched 8 • AGTAACGT matched 7 • TGCAACAT matched 6 • TGACACGG matched 5 • AATAACAG matched 4 • Pick a reasonable m to call two w-mers similar
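The m-match count above can be sketched in a few lines of Python. This is a minimal illustration rather than MDscan's own code; the function names `m_match` and `is_similar` are chosen here for the example.

```python
def m_match(wmer_a: str, wmer_b: str) -> int:
    """Count the positions at which two equal-length w-mers agree."""
    if len(wmer_a) != len(wmer_b):
        raise ValueError("w-mers must have the same width")
    return sum(a == b for a, b in zip(wmer_a, wmer_b))


def is_similar(wmer_a: str, wmer_b: str, m: int) -> bool:
    """Call two w-mers similar when they agree in at least m positions."""
    return m_match(wmer_a, wmer_b) >= m


if __name__ == "__main__":
    reference = "TGTAACGT"
    # The 8-mers listed on the slide.
    for other in ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]:
        print(other, "matched", m_match(reference, other))
```

With m = 6, for instance, only the first three 8-mers on the slide would be called similar to TGTAACGT.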
MDscan Seeds • A 9-mer (e.g. ATTGCAAAT) from the ChIP-chip selected upstream sequences with higher enrichment, together with similar w-mers, gives a seed motif pattern • [Figure: example w-mers from the ChIP-chip selected upstream sequences: ATTGCAAAT, TTTGCGAAT, TTTGCAAAT, GCCACCGT, ACCACCGT, ACCACGGT, GCCACGGC, …, GCAAATCCA, GCAAATTCG, GCAAATCCA, GGAAATCCA, GGAAATCCT, TTGCAAATC, TTGCGAATA, TTGCAAATT, TTGCCCATC, TTTGCAAAT, CAAATCCAA, CAAATCCAA, GAAATCCAC, TGCAAATCC, TGCAAATTC]
Update Motifs With Remaining Seqs • [Figure: Seed1 m-matches; sequences ordered from extreme high rank to all ChIP-selected targets]
Refine the Motifs • [Figure: Seed1 m-matches; sequences ordered from extreme high rank to all ChIP-selected targets]
Motif Regressor • EM Conlon, XS Liu • Look for candidate motifs • Refine motifs • Regress between upstream motif match score and downstream expression • [Figure labels: MDscan; expression log ratio; genes]
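Motif Regressor Rationale • For each TF, pair the upstream sequence motif match score with the downstream gene expression: • Gene1: motif match 3.2, gene expression 1.8 • Gene2: motif match 2.8, gene expression 0.3 • Gene3… • Upstream sequence X motif matching score measures: • Number of sites • Strength of matching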
Motif Regressor Strategy • Rank genes by log2 (expression fold change) • Try MDscan (width 5-17) on induced and repressed genes separately • Find 50 candidate motifs from top 100 genes • Refine candidate motifs with top 500 genes • Report <= 30 distinct motifs • Score each upstream sequence with each motif • Linear regression to eliminate insignificant motifs
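The last step of the strategy, using linear regression to drop insignificant motifs, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the motif scores are made-up numbers, each motif is tested in a separate single regression, and the cutoff |t| >= 2 is arbitrary. It is not the Motif Regressor implementation.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, t_statistic_of_slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    t = slope / se if se > 0 else math.copysign(math.inf, slope)
    return slope, intercept, t


def filter_motifs(scores_by_motif, expression, t_cutoff=2.0):
    """Keep motifs whose match scores have a slope with |t| >= t_cutoff.

    t_cutoff is a hypothetical threshold chosen for this example.
    """
    kept = {}
    for motif, scores in scores_by_motif.items():
        slope, _, t = simple_regression(scores, expression)
        if abs(t) >= t_cutoff:
            kept[motif] = slope
    return kept


# Made-up data: log2 expression for six genes and two candidate motifs.
expression = [1.8, 0.3, 1.5, 0.1, 1.2, 0.2]
scores_by_motif = {
    "Mtf1": [3.2, 0.5, 2.8, 0.2, 2.5, 0.4],  # tracks expression closely
    "Mtf2": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0],  # little relation to expression
}
kept = filter_motifs(scores_by_motif, expression)
```

In this toy data only Mtf1 survives the filter, with a positive slope, which would suggest that genes with stronger Mtf1 matches are induced.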
Linear Regression Example • Columns: Person, IQ, Age, Education, Height, Eye color, Spend/week, # of CD • A: 120, 30, High, 171, blue, $4000, 30 • B: 250, 41, PhD, 155, brown, $1500, 18 • C: 150, 8, Grade10, 115, black, $100, 90 • D: 180, 16, Grade12, 140, gray, $200, 15 • E: 90, 4, Preschool, 88, green, $500, 26 • F: 130, 17, High, 178, black, $80, 500 • G: 110, 21, College, 182, blue, $800, 220 • … • Corresponding columns for genes: Gene, Express, Mtf1, Mtf2, Mtf3, Mtf4, Mtf5, Mtf6 • Single regression: Mtf1 X, Mtf2 X, Mtf3 X, Mtf4 --, Mtf5 --, Mtf6 --
Yeast TF Regulatory Network • [Figure labels: protein, transcribe, regulate, gene]
ChIP-chip Better Explains Expression • [Figure labels: Sum1 regulated genes; Ndt80 regulated genes; Ndt80 & Sum1 regulated genes]
Genome Tiling Microarrays • Promoter arrays don't work for human ChIP-chip • Binding can occur in much more distant intergenic sequences, introns, exons, or downstream sequences • [Figure: tiling probes along the genomic DNA on the chromosome]
ChIP-chip on Tiling Microarray • [Figure: ChIP and Ctrl signal along the chromosome; labels: ChIP-DNA, noise]
ChIP-chip • Detect genome-wide location of transcription and epigenetic factors • Affymetrix genome tiling arrays are cheaper • $2000 7 arrays * 6 million probes * (3 ChIP + 3 Ctrl) • But data is noisier and less informative • Two peaks? How about ChIP alone? Over 42M probes? • [Figure: log probe intensity for ChIP and Ctrl along chromosome coordinates]
Affymetrix Tiling Array Peak Finding • Challenges: • Massive data, probe values noisy • Only 1/3 of researchers get it to work the first time • Previous algorithms only work by comparing 3 ChIP with 3 Ctrl • Model-based Analysis of Tiling arrays (MAT) • Works with a single ChIP (no replicate, no control) • Finds individual failed samples • More sensitive, specific, and quantitative with 3 ChIP & 3 Ctrl • MAT: Johnson et al, PNAS 2006
MAT • Most of the probes in ChIP-chip measure non-specific hybridization and background noise • Estimate probe behavior by checking other probes with similar sequence on the same array • Probe sequence plays a big role in signal value
Model Sequence-Specific Probe Effect • First detailed model of the effect of probe sequence on probe signal • Example 25-mer: AATGC ACTGT GCACA GATCG GCCAT has 7 A, 7 C, 6 G, 5 T, and maps to 2 places in the genome • Use all the probes on the array to estimate the parameters • [Equation labels: probe signal; # of T's; intercept; position-specific A, C, G effect; A, C, G, T count squared; 25-mer copy number]
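One way to write a probe model with the terms named on this slide is given below. It is a sketch assembled from the slide's term labels, following the general form of the model in the MAT paper; the symbol names and the log transform of the signal are assumptions made here, not taken from the slide.

```latex
\log(PM_i) \;=\; \alpha\, n_{iT}
  \;+\; \sum_{j=1}^{25} \sum_{k \in \{A,C,G\}} \beta_{jk}\, I_{ijk}
  \;+\; \sum_{k \in \{A,C,G,T\}} \gamma_k\, n_{ik}^{2}
  \;+\; \delta \log(c_i) \;+\; \varepsilon_i
```

Here $PM_i$ is the signal of probe $i$, $n_{ik}$ is the count of nucleotide $k$ in the probe (so $\alpha\, n_{iT}$ is the # of T's term), $I_{ijk}$ indicates that position $j$ holds nucleotide $k$, $c_i$ is the number of places the 25-mer maps to in the genome, and $\varepsilon_i$ is the error term. For the example probe on the slide, $n_{iA}=7$, $n_{iC}=7$, $n_{iG}=6$, $n_{iT}=5$, and $c_i=2$.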
Probe Standardization • Fit the probe model array by array • [Figure: the 6M probes are grouped into 2K bins; each probe is standardized using the observed probe intensity, the model predicted probe intensity, and the observed probe variance within each bin]
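The standardization can be sketched as follows. The assumptions, which the slide does not spell out, are that probes are binned by their model-predicted intensity and that each probe's residual (observed minus predicted) is divided by the standard deviation of the observed intensities in its bin. The function name and the toy numbers are illustrative only.

```python
import math


def standardize_probes(observed, predicted, n_bins):
    """Return (observed - predicted) / within-bin standard deviation per probe.

    Probes are sorted by predicted intensity and cut into n_bins groups of
    roughly equal size (an assumption made for this sketch).
    """
    n = len(observed)
    order = sorted(range(n), key=lambda i: predicted[i])
    bin_size = math.ceil(n / n_bins)
    t_values = [0.0] * n
    for start in range(0, n, bin_size):
        members = order[start:start + bin_size]
        obs = [observed[i] for i in members]
        mean_obs = sum(obs) / len(obs)
        # Sample variance of the observed intensities within this bin.
        variance = sum((o - mean_obs) ** 2 for o in obs) / max(len(obs) - 1, 1)
        sd = math.sqrt(variance) or 1.0  # guard against a zero-variance bin
        for i in members:
            t_values[i] = (observed[i] - predicted[i]) / sd
    return t_values


# Toy example: four probes, two bins.
t_values = standardize_probes(
    observed=[1.0, 3.0, 10.0, 14.0],
    predicted=[2.0, 2.0, 12.0, 12.0],
    n_bins=2,
)
```

After this step, probes with very different sequence-driven baselines sit on a common scale, which is what lets neighboring probes be combined in windows.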
[Figure: three panels. (1) Raw probe values at two spike-in regions with concentration 2X, for ChIP and Ctrl. (2) Sequence-based probe behavior standardization: ChIP standardized and Ctrl standardized. (3) Window-based neighboring probe combination for ChIP-region detection: ChIP window, (ChIP - Ctrl), and (3 ChIP - 3 Ctrl)]
Is a ChIP experiment working? • MAT window scores ~ normal with long tails • Estimate the p-value from a normal fitted to the left half of the data • FDR = A / B, where A is the number of Ctrl/ChIP peaks (all FPs) and B is the number of ChIP/Ctrl peaks • Spike-in shows the MAT FDR estimate is accurate • Can find an individual failed replicate
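A minimal sketch of these two estimates, assuming the window scores are already computed as ChIP minus Ctrl. The null normal is centered at the median and its spread is taken from the left half only, and the FDR compares the counts in the two tails at the same distance from the center. The function names and the toy scores are made up for illustration; MAT's own procedure differs in detail.

```python
import math
from statistics import NormalDist, median


def null_from_left_half(scores):
    """Fit a normal null using only scores at or below the median."""
    center = median(scores)
    left = [s for s in scores if s <= center]
    # Spread of the left half around the center, as if mirrored to the right.
    sd = math.sqrt(sum((s - center) ** 2 for s in left) / len(left))
    return NormalDist(center, sd)


def upper_tail_pvalue(score, null):
    """P(null score >= observed score)."""
    return 1.0 - null.cdf(score)


def tail_fdr(scores, center, cut):
    """FDR = A / B at a cutoff `cut` away from the center.

    A counts windows in the negative tail (Ctrl over ChIP, taken as false
    positives); B counts windows in the positive tail (ChIP over Ctrl).
    """
    a = sum(s <= center - cut for s in scores)
    b = sum(s >= center + cut for s in scores)
    return a / b if b else 0.0


# Toy window scores: mostly near zero, with a long right tail.
scores = [-4, -1, -1, 0, 0, 0, 0, 0, 1, 1, 5, 6, 7]
null = null_from_left_half(scores)
fdr = tail_fdr(scores, null.mean, cut=3)
```

In a failed replicate the right tail would be no heavier than the left, so A / B would approach 1 at every cutoff.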
ChIP-Seq • Sequence millions of 30-mer ends of fragments • Map 30-mers back to the genome • [Figure labels: ChIP-DNA, noise]
MACS: Model-based Analysis of ChIP-Seq • Use confident peaks to model shift size • [Figure: tag distribution around a binding site]
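The idea of modeling the shift size from confident peaks can be sketched as below. The assumptions are that each confident peak supplies the 5' positions of its forward-strand and reverse-strand tags, and that the fragment length d is taken as the median distance between the two strand centers. MACS works from the modes of the strand-specific tag distributions, so the use of medians here is a simplification for illustration.

```python
from statistics import median


def estimate_shift(confident_peaks):
    """Estimate the tag shift size (d / 2) from confident peaks.

    confident_peaks: list of (forward_tag_positions, reverse_tag_positions).
    """
    distances = [
        median(reverse) - median(forward)
        for forward, reverse in confident_peaks
        if forward and reverse
    ]
    d = median(distances)
    return d / 2


def shift_tags(forward, reverse, shift):
    """Move forward tags downstream and reverse tags upstream by `shift`."""
    return sorted([p + shift for p in forward] + [p - shift for p in reverse])


# Made-up tag positions for three confident peaks.
peaks = [
    ([100, 110, 120], [200, 210, 220]),
    ([500, 505, 510], [590, 600, 625]),
    ([1000, 1010], [1100, 1110]),
]
shift = estimate_shift(peaks)
shifted = shift_tags([100], [200], shift)
```

After shifting, tags from both strands pile up at the likely binding site instead of flanking it.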
Peak Calls • Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size) • ChIP-Seq shows local biases in the genome • Chromatin and sequencing bias • 200-300bp control windows have too few tags • But can look further • Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k) • [Figure: ChIP and Control tags with 300bp, 1kb, 5kb, and 10kb windows] • http://liulab.dfci.harvard.edu/MACS/ • Zhang et al, Genome Biol 2008
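The dynamic λlocal and the resulting Poisson test can be sketched as follows. The assumptions are that the control tag counts come from windows centered on the candidate peak, that ChIP and control have the same sequencing depth (otherwise the control counts would need scaling), and that each rate is rescaled to the width of the candidate window. The function names and numbers are illustrative, not MACS's code.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def dynamic_lambda(total_tags, genome_size, window_bp, control_counts):
    """lambda_local = max(lambda_BG, control rates from wider windows).

    control_counts maps a control window size in bp (e.g. 5000, 10000)
    to the number of control tags in that window.
    """
    rates = [total_tags * window_bp / genome_size]  # lambda_BG for this window
    for size_bp, count in control_counts.items():
        rates.append(count * window_bp / size_bp)
    return max(rates)


# Made-up numbers: 1M tags, 3 Gb genome, a 300bp candidate window.
lam = dynamic_lambda(
    total_tags=1_000_000,
    genome_size=3_000_000_000,
    window_bp=300,
    control_counts={5000: 10, 10000: 12},
)
p_value = poisson_upper_tail(8, lam)  # 8 ChIP tags in the candidate window
```

Taking the maximum keeps the test conservative: a region that is tag-rich in the control raises λlocal and needs more ChIP tags to be called a peak.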
Cistrome: integrated analysis pipeline and data collection • Liu et al, Genome Biol 2011
http://cistrome.org/ap/ • Works for hg19, mm9, ce6, and dm4 • ChIP-chip / seq peak calling • Check correlation and overlap • Visualize signal across different elements • Annotate nearby genes • Motif analysis • Conservation analysis, lift over from one genome to another • Heatmap and clustering of many factors • Gene expression profiling analysis • Liu et al, Genome Biol 2011
Estrogen Receptor • Carroll et al, Cell 2005 • Overactive in > 70% of breast cancers • Where does it go in the genome? • ChIP-chip on chr21/22, motif and expression analysis found its partner FoxA1 • [Figure labels: ER, TF??]
Estrogen Receptor (ER) Cistrome in Breast Cancer • Carroll et al, Nat Genet 2006 • ER may function far away (100-200 kb) from genes • Only 20% of ER sites have PhastCons > 0.2 • ER has different effects depending on its collaborators • [Figure labels: ER, AP1, NRIP]