More on TF Motif Finding ChIP-chip / seq

More on TF Motif Finding ChIP-chip / seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

De novo Sequence Motif Finding • Goal: look for common sequence patterns enriched in the input data (compared to the genome background) • Regular expression enumeration • Pattern driven approach • Enumerate patterns, check significance in dataset • Oligonucleotide analysis, MobyDick • Position weight matrix update • Data driven approach, use data to refine motifs • Consensus, EM & Gibbs sampling • Motif score and Markov background

Position Weight Matrix Update • Advantage • Can look for motifs of any width • Flexible with base substitutions • Disadvantage • EM and Gibbs sampling: no guaranteed convergence time • No guaranteed global optimum

Motif Finding in Bacteria • Promoter sequences are short (200-300 bp) • Motifs are usually long (10-20 bases) • Some have two blocks with a gap, some are palindromes • Long motifs are usually very degenerate • A single microarray experiment sometimes already provides enough information to search for TF motifs

Motif Finding in Lower Eukaryotes • Upstream sequences are longer (500-1000 bp), with some simple repeats • Motif width varies (5 – 17 bases) • Expression clusters provide decent input sequence quality for TF motif finding • Motif combination and redundancy appear, although single motifs are usually significant enough for identification

Yeast Promoter Architecture • Co-occurring regulators suggest physical interaction between the regulators

Motif Finding in Higher Eukaryotes • Upstream sequences very long (3KB-20KB) with repeats, TF motif could appear downstream • Motifs can be short or long (6-20 bases), and appear in combination and clusters • Gene expression cluster not good enough input • Need: • Comparative Genomics: phastcons score • Motif modules: motif clusters • ChIP-chip/seq

Yeast Regulatory Sequence Conservation

UCSC PhastCons Conservation • Functional regulatory sequences are under stronger evolutionary constraint • Align orthologous sequences together • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC

Conserved Motif Clusters • First find conserved regions in the genome • Then look for repeated transcription factor (TF) binding sites • They form transcription factor modules

Outline • ChIP-chip on yeast • Technology and data analysis: MDscan motif finding, regulatory network • ChIP-X on human • Tiling microarrays and peak finding • High throughput sequencing and peak finding • Data analysis and examples • Analysis: peak finding, gene expression analysis, sequence motif finding, regulatory network • Holistic picture of gene regulation

Motivation • Motif finding works well in bacteria, OK in yeast, marginally in worm/fly, and almost never in mammals • Cistrome: genome-wide in vivo binding sites of DNA-binding proteins • ChIP-chip and ChIP-seq give cistrome results

ChIP-chip Technology • Chromatin ImmunoPrecipitation + microarray • ChIP-on-chip or ChIP-chip • Also known as Genome Scale Location Analysis • Detect genome-wide in vivo location of TF and other DNA-binding proteins • Find all the DNA sequences bound by TF-X? • Cook all the dishes with cinnamon • Can learn the regulatory mechanism of a transcription factor or DNA-binding protein much better and faster

Chromatin ImmunoPrecipitation (ChIP)

TF/DNA Crosslinking in vivo

Sonication (~500bp)

TF-specific Antibody

Immunoprecipitation

Reverse Crosslink and DNA Purification

Promoter Array Hybridization • Figure labels: genes, intergenic regions, ChIP

ChIP-DNA chip Detection • Started in yeast, use promoter cDNA microarray • ~ 6000 spots, each 800-1000 bp • Two color assay • Control: no antibody, or chromatin (a little bit of everything) • Need triplicates to cancel noise • Applied to all yeast TFs • TF modified to contain a tag • Tag can be precipitated with immunoglobulin

ChIP-chip Motif Finding • ChIP-chip gives 10-5000 binding regions ~600-1000bp long. Precise binding motif? • Raw data is like perfect clustering, plus enrichment values • MDscan • High ChIP ranking => true targets, contain more sites • Search TF motif from highest ranking targets first (high signal / background ratio) • Refine candidate motifs with all targets • Used successfully in ChIP-chip motif finding

Similarity Defined by m-match • For a given w-mer and any other random w-mer, count the matched positions • m-matches for the 8-mer TGTAACGT: • TGTAACGT matched 8 • AGTAACGT matched 7 • TGCAACAT matched 6 • TGACACGG matched 5 • AATAACAG matched 4 • Pick a reasonable m to call two w-mers similar
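The m-match counts on this slide can be reproduced with a short sketch. The function names `count_matches` and `is_m_match` are illustrative choices, not names from MDscan; the sketch assumes both w-mers have the same width and compares them position by position.

```python
def count_matches(wmer_a: str, wmer_b: str) -> int:
    """Number of positions at which two equal-width w-mers carry the same base."""
    if len(wmer_a) != len(wmer_b):
        raise ValueError("w-mers must have the same width")
    return sum(a == b for a, b in zip(wmer_a.upper(), wmer_b.upper()))


def is_m_match(wmer_a: str, wmer_b: str, m: int) -> bool:
    """True when the two w-mers agree at m or more positions."""
    return count_matches(wmer_a, wmer_b) >= m


if __name__ == "__main__":
    seed = "TGTAACGT"
    for other in ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]:
        print(other, "matched", count_matches(seed, other))
```

Choosing m is a trade-off: a larger m keeps seeds tight, while a smaller m tolerates the base substitutions seen in degenerate motifs.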

MDscan Seeds • Figure: a 9-mer (ATTGCAAAT) in the ChIP-chip selected upstream sequences with higher enrichment, and the seed motif patterns built from similar w-mers • w-mers shown in the figure: ATTGCAAAT, TTTGCGAAT, TTGCAAATC, ATTGCAAAT, TTTGCGAAT, TTTGCAAAT, GCCACCGT, ACCACCGT, ACCACGGT, GCCACGGC, …, GCAAATCCA, GCAAATTCG, GCAAATCCA, GGAAATCCA, GGAAATCCT, TTGCAAATC, TTGCGAATA, TTGCAAATT, TTGCCCATC, TTTGCAAAT, CAAATCCAA, CAAATCCAA, GAAATCCAC, TGCAAATCC, TGCAAATTC

Update Motifs With Remaining Seqs • Figure: Seed1 m-matches, from the extreme high rank targets to all ChIP-selected targets

Refine the Motifs • Figure: Seed1 m-matches, from the extreme high rank targets to all ChIP-selected targets

Yeast TF Regulatory Network • Figure labels: protein, transcribe, regulate, gene

ChIP-chip Better Explains Expression • Figure panels: Ndt80 & Sum1 regulated genes; Sum1 regulated genes; Ndt80 regulated genes

Genome Tiling Microarrays • Promoter arrays don't work for human ChIP-chip • Binding could appear in much more distant intergenic sequences, introns, exons, or downstream sequences • Figure: tiling probes along the genomic DNA on the chromosome

DNA Purification

ChIP-chip on Tiling Microarray • Figure labels: ChIP, Ctrl, chromosome, ChIP-DNA, noise

ChIP-chip • Detect genome-wide location of transcription and epigenetic factors • Affymetrix genome tiling arrays are cheaper • $2000 7 arrays * 6 million probes * (3 ChIP + 3 Ctrl) • But data is noisier and less informative • Two peaks? How about ChIP alone? Over 42M probes? • Figure: log probe intensity for ChIP and Ctrl along chromosome coordinates

ChIP-chip Analysis: Mann-Whitney U-test • Affy TAS, Cawley et al (Cell 2004): • Assign 1 to all probe pairs with MM > PM • Each probe: rank probes within [-500bp, +500bp] window • Check whether sum of ChIP ranks is much smaller • Consider all probes equally • Half of the probes have MM > PM • Figure: histogram of (PM – MM)
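The window-level rank comparison can be sketched as a rank-sum statistic. This is a minimal illustration, not the TAS implementation: it assumes the ChIP and Ctrl probe values inside one window are already available as lists, ranks them together with rank 1 for the largest value (so that an enriched ChIP gives a small rank sum), and uses the normal approximation without a tie correction. The function names are hypothetical.

```python
import math


def chip_rank_sum(chip, ctrl):
    """Sum of ChIP ranks after pooling ChIP and Ctrl values (rank 1 = largest).

    Tied values share the average of the ranks they span.
    """
    pooled = [(v, "chip") for v in chip] + [(v, "ctrl") for v in ctrl]
    pooled.sort(key=lambda item: -item[0])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return sum(r for r, (_, group) in zip(ranks, pooled) if group == "chip")


def rank_sum_z(chip, ctrl):
    """Normal-approximation z score; strongly negative means ChIP ranks are small."""
    n1, n2 = len(chip), len(ctrl)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    return (chip_rank_sum(chip, ctrl) - expected) / sd


if __name__ == "__main__":
    print(rank_sum_z([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]))
```

Because the test uses ranks only, every probe in the window counts equally regardless of how reliable it is, which is the limitation the slide points out.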

Affymetrix Tiling Array Peak Finding • Challenges: • Massive data, probe values noisy • Only 1/3 of researchers get it to work the first time • Previous algorithms only work by comparing 3 ChIP with 3 Ctrl • Model-based Analysis of Tiling arrays (MAT) • Work with single ChIP (no rep, no ctrl) • Find individual failed samples • More sensitive, specific, and quantitative with 3 ChIP & 3 Ctrl • MAT: Johnson et al, PNAS 2006

MAT • Most of the probes in ChIP-chip measure non-specific hybridization and background noise • Estimate probe behavior by checking other probes with similar sequence on the same array • Probe sequence plays a big role in signal value

Model Sequence-Specific Probe Effect • First detailed model of probe sequence on probe signal • Example probe: AATGC ACTGT GCACA GATCG GCCAT has 7 A, 7 C, 6 G, 5 T, and maps to 2 places in the genome • Use all the probes on the array to estimate the parameters • Model terms (labels from the equation figure): probe signal; intercept; # of T's; position-specific A, C, G effect; A, C, G, T count squared; 25-mer copy number
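The sequence-derived quantities named in the model can be computed per probe as below. This is only a feature-extraction sketch under stated assumptions: the function name `probe_features` and the dictionary layout are hypothetical, the copy number is passed in rather than looked up in a genome, and no model fitting is attempted.

```python
from collections import Counter


def probe_features(sequence: str, copy_number: int = 1) -> dict:
    """Sequence-derived features for one probe.

    copy_number is the number of genomic locations the probe maps to;
    here it is supplied by the caller (assumption).
    """
    seq = sequence.replace(" ", "").upper()
    counts = Counter(seq)
    base_counts = {base: counts.get(base, 0) for base in "ACGT"}
    # One row per probe position with indicators for A, C, G.
    # T is the reference base, so a T position gives [0, 0, 0].
    position_acg = [[1 if base == b else 0 for b in "ACG"] for base in seq]
    return {
        "t_count": base_counts["T"],
        "position_acg": position_acg,
        "count_squared": {b: n ** 2 for b, n in base_counts.items()},
        "copy_number": copy_number,
        "base_counts": base_counts,
    }


if __name__ == "__main__":
    features = probe_features("AATGC ACTGT GCACA GATCG GCCAT", copy_number=2)
    print(features["base_counts"])
```

For the example probe on the slide, the base counts come out as 7 A, 7 C, 6 G, and 5 T over 25 positions.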

Probe Standardization • Fit the probe model array by array • Figure: 6M probes, 2K bins; observed probe intensity vs. model predicted probe intensity; observed probe variance within each bin

Figure: raw probe values at two spike-in regions with concentration 2X (ChIP, Ctrl) • Sequence-based probe behavior standardization (ChIP standardized, Ctrl standardized) • Window-based neighboring probe combination for ChIP-region detection: ChIP window, (ChIP – Ctrl), (3 ChIP – 3 Ctrl)

MA2C: Model-based for 2-Color Arrays • Normalize probes by GC bins within each array • How much variance is observed in the GC bin • Give high confidence probes more weight • Running window average or median for peak finding • MA2C: Song et al, Genome Biol 2007
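The idea of normalizing within GC bins can be illustrated with a simplified sketch. It is not the MA2C algorithm: it assumes each probe is a (sequence, log-ratio) pair, groups probes by GC count, and standardizes each value by the mean and standard deviation of its own bin. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gc_count(sequence: str) -> int:
    """Number of G or C bases in a probe sequence."""
    return sum(1 for base in sequence.upper() if base in "GC")


def standardize_by_gc_bin(probes):
    """Standardize probe values within bins of equal GC count.

    probes: list of (sequence, value) pairs from a single array.
    Returns one standardized value per probe, in input order.
    Bins with zero variance are mapped to 0.0 (assumption).
    """
    bins = defaultdict(list)
    for sequence, value in probes:
        bins[gc_count(sequence)].append(value)
    stats = {gc: (mean(values), pstdev(values)) for gc, values in bins.items()}
    standardized = []
    for sequence, value in probes:
        center, spread = stats[gc_count(sequence)]
        standardized.append((value - center) / spread if spread > 0 else 0.0)
    return standardized


if __name__ == "__main__":
    example = [("GGCC", 1.0), ("GCGC", 3.0), ("ATAT", 5.0), ("TTAA", 5.0)]
    print(standardize_by_gc_bin(example))
```

Dividing by the bin's spread is one simple way to down-weight probes from noisy GC bins before a running-window average or median is taken.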

Is a ChIP experiment working? • MAT window scores ~ normal with long tails • Estimate p-value of normal from left half of data • FDR = A / B (Ctrl/ChIP peaks are all FPs) • Spike-in shows MAT FDR estimate is accurate • Can find individual failed replicate • Figure labels: A, B
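The two estimates on this slide can be sketched as follows. The sketch assumes the null center is the median of the window scores, that the null spread is estimated only from scores below that center, and that A is the count of negative (Ctrl/ChIP) peaks and B the count of positive (ChIP/Ctrl) peaks at the same cutoff. The mapping of A and B and all function names are assumptions, not taken from MAT.

```python
import math
from statistics import median


def left_half_null(scores):
    """Estimate the null center and spread from the left half of the scores."""
    center = median(scores)
    left = [s for s in scores if s < center]
    if not left:
        raise ValueError("need scores below the median to estimate the spread")
    sigma = math.sqrt(sum((s - center) ** 2 for s in left) / len(left))
    return center, sigma


def upper_tail_p(score, center, sigma):
    """One-sided normal p-value for a window score under the estimated null."""
    z = (score - center) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def estimate_fdr(negative_peaks: int, positive_peaks: int) -> float:
    """FDR estimate as negative peaks over positive peaks, capped at 1."""
    if positive_peaks == 0:
        return 0.0
    return min(1.0, negative_peaks / positive_peaks)


if __name__ == "__main__":
    window_scores = [-2.0, -1.0, 0.0, 1.0, 2.0, 10.0]
    center, sigma = left_half_null(window_scores)
    print(upper_tail_p(10.0, center, sigma), estimate_fdr(5, 100))
```

Using only the left half keeps true binding signal, which sits in the right tail, from inflating the estimated null spread.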

ChIP-Seq • Sequence millions of 30-mer ends of fragments • Map 30-mers back to the genome • Figure labels: ChIP-DNA, noise

MACS: Model-based Analysis for ChIP-Seq • Use confident peaks to model shift size • Figure label: binding

Peak Calls • Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size) • ChIP-Seq shows local biases in the genome • Chromatin and sequencing bias • 200-300bp control windows have too few tags • But can look further • Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k) • Figure: ChIP and Control windows of 300bp, 1kb, 5kb, 10kb • http://liulab.dfci.harvard.edu/MACS/ • Zhang et al, Genome Bio, 2008
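The dynamic background rate and the resulting Poisson tail probability can be sketched as below. This is an illustration, not the MACS code: it assumes every rate has already been scaled to the expected tag count in a window of the same size as the candidate peak, and the function names are hypothetical.

```python
import math


def lambda_local(lambda_bg: float, window_rates) -> float:
    """Largest of the genome-wide rate and the local control-window rates.

    window_rates holds the rates from the local windows (for example
    1kb, 5kb, 10kb), each scaled to the peak window size (assumption).
    """
    return max([lambda_bg, *window_rates])


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cumulative = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cumulative += term
    return max(0.0, 1.0 - cumulative)


if __name__ == "__main__":
    rate = lambda_local(2.0, [1.0, 3.5, 0.5])
    print(rate, poisson_upper_tail(12, rate))
```

Taking the maximum makes the test conservative: a candidate peak must exceed the highest plausible background, so regions with local chromatin or sequencing bias are less likely to be called.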

CEAS: Cis-regulatory Element Annotation System • Data Analysis Button for Biologists • http://ceas.cbi.pku.edu.cn

Estrogen Receptor • Carroll et al, Cell 2005 • Overactive in > 70% of breast cancers • Where does it go in the genome? • ChIP-chip on chr21/22, motif and expression analysis found its partner FoxA1 • Figure labels: ER, TF??

Estrogen Receptor (ER) Cistrome in Breast Cancer • Carroll et al, Nat Genet 2006 • ER may function far away (100-200KB) from genes • Only 20% of ER sites have PhastCons > 0.2 • ER has different effects based on different collaborators • Figure labels: ER, AP1, NRIP
