Starting Monday
Today: ChIP-chip and ChIP-seq analysis
M Oct 29: Back to BLAST and Orthology (readings posted). Will focus on the BLAST algorithm and the different types and applications of BLAST; in lab we will predict orthologs using reciprocal genome-scale BLAST searches
W Oct 31: Phylogenetic Profiles (an example of unsupervised machine learning) and supervised machine learning approaches and applications
M Nov 5: Phylogeny (Phylogeny Lab)
W Nov 7: Metabolic reconstruction and modeling ***2-3 pg paper on preliminary results due***
Chromatin immunoprecipitation (ChIP)
• Chemical or light-based crosslinking added to living cells
• Shear DNA by sonication or digestion
• IP by specific Ab or Ab against protein tag
ChIP on chip (tiled genomic microarrays)
Peak resolution is a function of:
- shearing size
- probe resolution
- ChIP enrichment
[Figure: signal intensity across array probes]
ChIP-seq
[Figure: read counts]
• Map reads to the reference genome
• Convert to ‘tag’ counts: sequence coverage at each base pair in the genome
• Find peaks of high tag count (using a fixed/sliding window with a count threshold) or based on the bimodal peak distribution
• Convert bimodal peaks into summits (by shifting tag positions in the 3’ direction OR by extending the tag signal to the estimated size of the fragments)
• Identify summits that represent fragment enrichment relative to control
• Assign a confidence score (p-value, enrichment score, and/or FDR)
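The first few steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular peak caller works: the `Read` tuple, the function names, the fragment length, and the threshold are all hypothetical, and the summit is simply taken as the middle of the highest-coverage positions in each peak.

```python
from collections import namedtuple

# Hypothetical read record: 5' end position and strand ('+' or '-').
Read = namedtuple("Read", "pos strand")

def tag_coverage(reads, genome_len, frag_len):
    """Per-base tag counts after extending each read 3' to the fragment size."""
    cov = [0] * genome_len
    for r in reads:
        if r.strand == "+":
            lo, hi = r.pos, r.pos + frag_len          # extend rightward
        else:
            lo, hi = r.pos - frag_len + 1, r.pos + 1  # extend leftward
        for i in range(max(lo, 0), min(hi, genome_len)):
            cov[i] += 1
    return cov

def call_peaks(cov, threshold):
    """Return (start, end, summit) for each run with coverage >= threshold.

    end is exclusive; threshold is assumed to be > 0.
    """
    peaks, start = [], None
    for i, c in enumerate(cov + [0]):  # trailing 0 closes a run at the end
        if c >= threshold and start is None:
            start = i
        elif c < threshold and start is not None:
            best = max(cov[start:i])
            tops = [j for j in range(start, i) if cov[j] == best]
            peaks.append((start, i, tops[len(tops) // 2]))
            start = None
    return peaks

# Toy data: '+' and '-' reads flanking one site, plus one background read.
reads = [Read(10, "+"), Read(12, "+"), Read(14, "+"),
         Read(29, "-"), Read(31, "-"), Read(33, "-"),
         Read(60, "+")]
cov = tag_coverage(reads, genome_len=100, frag_len=20)
peaks = call_peaks(cov, threshold=4)
```

Extending the tags makes the ‘+’ and ‘-’ strand read piles overlap over the bound site, so the two strand-specific modes merge into a single summit; the lone background read never reaches the threshold.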
Types of ‘control’ data for ChIP experiments
• ‘Input’ DNA = sheared but no IP
• No-antibody mock IP
• Untagged strain
There is almost always some background in the mock IP; the hope is to have enrichment of IP material over background.
* Certain artifacts can give the appearance of real peaks in control experiments.
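One simple way to score enrichment of IP material over a control is to treat the library-size-scaled control count in a region as the expected background and ask how surprising the ChIP count is under a Poisson model. The sketch below assumes that model; the function names, the minimum background of 1.0, and the example counts are illustrative assumptions, not values from the lecture.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0, summed over the upper tail."""
    if k <= 0:
        return 1.0
    # First term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i
        if i > lam and term <= total * 1e-17:
            break
    return min(total, 1.0)

def score_region(chip_count, control_count, chip_total, control_total,
                 min_background=1.0):
    """Return (fold enrichment, p-value) for one candidate region.

    min_background is an assumed floor that avoids dividing by zero when
    the control has no reads in the region.
    """
    scale = chip_total / control_total
    expected = max(control_count * scale, min_background)
    return chip_count / expected, poisson_sf(chip_count, expected)

# 30 ChIP reads vs. 10 control reads, control library twice as deep.
fold, p_value = score_region(30, 10, 1_000_000, 2_000_000)
```

With many candidate regions the p-values would still need a multiple-testing correction (for example an FDR procedure), which this sketch leaves out.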
The read count/tag profile is generally smoothed before peak calling (e.g. with a running average), and the ‘summit’ is then inferred from the dual read peaks.
* Using a method that incorporates a measured background model is probably very important.
Pepke et al. 2009
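As a rough illustration of smoothing followed by summit inference from the dual (strand-specific) read peaks, the sketch below applies a centered running average to each strand's tag profile and places the summit midway between the two modes. The function names, window size, and toy profiles are assumptions made for the example.

```python
def running_average(values, window):
    """Centered running average; windows are truncated at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def summit_from_strands(plus_profile, minus_profile, window=3):
    """Return (summit, shift) inferred from '+' and '-' strand tag profiles.

    The summit is the midpoint between the two smoothed modes; shift is
    the distance each strand's tags would be moved to reach it.
    """
    plus = running_average(plus_profile, window)
    minus = running_average(minus_profile, window)
    plus_mode = max(range(len(plus)), key=lambda i: plus[i])
    minus_mode = max(range(len(minus)), key=lambda i: minus[i])
    return (plus_mode + minus_mode) // 2, (minus_mode - plus_mode) // 2

# Toy profiles: '+' reads pile up left of the site, '-' reads to the right.
plus_profile = [0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0]
minus_profile = [0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 5, 3, 1, 0, 0]
summit, shift = summit_from_strands(plus_profile, minus_profile)
```

Here the ‘+’ mode sits at position 4 and the ‘-’ mode at position 10, so the inferred summit is position 7 with a shift of 3.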
3 Types of peaks
1. Sharp & narrow (100s bp) (e.g. site-specific TF)
2. Broader but defined (kb) (e.g. RNA Polymerase)
3. Very broad (regional, 1000s kb) (e.g. heterochromatin histone marks)
• Methods that identify bimodal peak profiles to identify summits work less well for biologically wider peaks/loci
Hidden Markov Models for Identifying Bound Fragments
HMMs are trained on known data to recognize different states (e.g. bound vs. unbound fragments) and the probability of moving between those states.
Once trained, an HMM can be used to identify the ‘hidden’ states in an unknown dataset, based on the known characteristics of each state (‘emission probabilities’) and the probability of moving between states (‘transition probabilities’).
Example: ChIP-chip data from a tiling microarray identifying regions bound to a transcription complex with a known 50 bp binding sequence. You expect that a bound fragment will have high signal on the array and that the bound fragment will be 2-3 probes long.
Example: “A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences.” Li, Meyer, Liu. 2005.
[Figure: HMM for the example above, with four states: Unbound 25mer, Bound 25mer, Bound 25mer, Bound 25mer]
I = intensity units > 10,000; i = intensity units < 10,000
Transition probabilities (arrow labels in the figure): P = 1.0, 0.5, 0.3, 0, 0.5, 1.0, 0.7
Emission probabilities:
Unbound 25mer: P(I) = 0.2, P(i) = 0.8
Bound 25mer (each of the three): P(I) = 0.8, P(i) = 0.2
Given the data, an HMM considers many different paths through these states and returns the most probable one.
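Finding the most probable state path is usually done with the Viterbi algorithm. The sketch below uses the emission probabilities from the slide, but the assignment of the transition values to specific arrows is an assumption (the figure's arrows are not recoverable here), as are the state names `U`, `B1`, `B2`, `B3` and the choice to always start in the unbound state.

```python
import math

STATES = ["U", "B1", "B2", "B3"]  # unbound probe, then up to 3 bound probes
START = {"U": 1.0}                # assumption: always start unbound
TRANS = {                         # assumed arrow assignment (see lead-in)
    "U": {"U": 0.7, "B1": 0.3},
    "B1": {"B2": 1.0},
    "B2": {"B3": 0.5, "U": 0.5},
    "B3": {"U": 1.0},
}
EMIT = {                          # 'I' = intensity > 10,000, 'i' = below
    "U": {"I": 0.2, "i": 0.8},
    "B1": {"I": 0.8, "i": 0.2},
    "B2": {"I": 0.8, "i": 0.2},
    "B3": {"I": 0.8, "i": 0.2},
}

def _log(p):
    """Log probability, with log(0) represented as -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs):
    """Most probable state path for a string of 'I'/'i' probe observations."""
    score = {s: _log(START.get(s, 0)) + _log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            # Best predecessor of state s at this probe.
            best = max(STATES, key=lambda r: prev[r] + _log(TRANS[r].get(s, 0)))
            score[s] = prev[best] + _log(TRANS[best].get(s, 0)) + _log(EMIT[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

path = viterbi("iiIIIii")
```

Under these assumed transitions, three consecutive high-intensity probes flanked by low-intensity probes decode as one bound fragment three probes long, while a run of low-intensity probes decodes as unbound throughout.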
Evaluated 11 different peak-calling algorithms using 3 real datasets and default parameters (mimicking “non-expert users”)
- Methods with smaller peak lists often return peaks identified by other methods (more stringent)
“many programs call similar peaks, though default parameters are tuned to different levels of stringency”
Output: list of peak locations (start & stop) and p-values
The challenge is that peaks do not show precisely where the protein binds.
Different programs vary in the width of the identified peaks.
The same type of motif finding can be applied to a set of IP’d regions to identify motifs shared by those regions.
Other approaches
• Micrococcal nuclease sensitive sites (nucleosome mapping)
• DNaseI hypersensitive sites
• ChIP-exo
What can you do with the data?
• Motif finding: look for motifs shared in bound regions (e.g. XX)
• Associate bound loci with neighboring genes and elements
• Functional enrichment of neighboring genes
• Other non-random associations among neighboring genes, e.g. shared expression profiles, expression dependency on the factor in question
• Locus distribution across the genome
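Two of these follow-up analyses are easy to sketch: assigning each peak summit to its nearest gene, and testing whether the resulting gene set is enriched for a functional category with a hypergeometric test. The gene names, coordinates, and function names below are made up for illustration, and a real analysis would also handle chromosomes, strand, and multiple-testing correction.

```python
import bisect
import math

def nearest_gene(summit, genes):
    """Return (gene name, distance) for the gene start closest to a summit.

    genes is a list of (name, start) tuples sorted by start on one
    chromosome; ties go to the gene on the left.
    """
    starts = [start for _, start in genes]
    i = bisect.bisect_left(starts, summit)
    candidates = genes[max(i - 1, 0):i + 1]
    name, start = min(candidates, key=lambda g: abs(g[1] - summit))
    return name, abs(start - summit)

def hypergeom_sf(k, population, annotated, drawn):
    """P(X >= k): chance of seeing at least k annotated genes among `drawn`
    genes sampled without replacement from `population` genes."""
    total = math.comb(population, drawn)
    upper = min(annotated, drawn)
    hits = sum(math.comb(annotated, x) * math.comb(population - annotated, drawn - x)
               for x in range(k, upper + 1))
    return hits / total

# Hypothetical gene starts on one chromosome.
genes = [("geneA", 100), ("geneB", 500), ("geneC", 900)]
assigned = [nearest_gene(s, genes) for s in (50, 520, 880)]

# 8 of 20 peak-associated genes fall in a category that covers 100 of
# 5000 genes genome-wide (numbers invented for the example).
p_enriched = hypergeom_sf(8, 5000, 100, 20)
```

A small `p_enriched` suggests the factor's neighboring genes share that function more often than expected by chance.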