400 likes | 589 Views
ChIP-seq Methods & Analysis. Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC gschnitzler@tuftsmedicalcenter.org 617-636-0615. ChIP-seq COURSE OUTLINE. Day 1: ChIP techniques, library production, USCS browser tracks
E N D
ChIP-seq Methods & Analysis Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC gschnitzler@tuftsmedicalcenter.org 617-636-0615
ChIP-seq COURSE OUTLINE Day 1: ChIP techniques, library production, USCS browser tracks Day 2: QC on reads, Mapping binding site peaks, examining read density maps. Day 3: Analyzing peaks in relation to genomic feature, etc. Day 4: Analyzing peaks for transcription factor binding site consensus sequences. Day 5: Variants & advanced approaches.
DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP
DAY 3 REMNANT Analyzing overlaps between peak & regulated genes in UNIX
How can we test the significance of binding site association w/ regulated genes? If you haven’t already, go to the cluster & move bed and txt files to your /cluster/shared/userID/chip folder (mkdir chip & cd chip if you don’t have this folder yet): cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* . The .txt files list the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver (by RNA-seq analysis).
Overlaps between peaks & genes Take a look at one of them using head [name].txt chr6 73171625 - Dnahc6 chr2 25356026 - C8g chr6 65540391 + Tnip3 … The file format is (tab-delimited) chromosome, TSS, transcription direction (+=sense) & geneID. You can get all this info easily from the UCSC browser, for individual genes (by hand)… … or you can get this information for all genes & extract what you want for your gene set of interest.. Check out the RNA-seq module for info on making & handling .gtf files.
Overlaps between peaks & genes 2 The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a –range variable. Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on: bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AoE.overlap (these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt) Note the number of overlaps (hits), number of genes (tests) and the number of overlaps expected by chance divided by the number of genes (background frequency) provides all the information you need for binomial tests. Note these numbers down for each comparison.
Accessing the R statistical language On the PCs in this room: Start->programs->R To get R for your PC (free): http://cran.r-project.org/ To get RStudio (allows for easier management of R projects): http://www.rstudio.com/ On the cluster type: module load R Then: bsub -Ip -q int_public6 R To exit use the R command q() For more info on using R & Unix see: http://sites.tufts.edu/cbi/resources/rna-seq-course/ UNIX resources & R resources
Binomial tests in R Use the R command: binom.test(hits, tests, bkg_freq) to address the significance of overlaps you see For Ao_down_TSS.txt vs. AoE.bed: binom.test(118,2, 1.03/118) Which comparisons show significant enrichment. Do any show anti-enrichment?
DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP
What is PWM? Pos 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 A 18 8 5 4 1 29 7 7 7 0 1 39 1 1 6 C 8 3 3 9 33 4 21 15 14 0 0 1 43 39 18 G 13 31 34 9 8 10 11 15 19 4 44 3 0 1 6 T 7 4 4 24 4 3 7 9 6 42 1 3 2 5 16 Con N G G T C A N N N T G A C C N • Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences. • A positional frequency matrix (PFM) specifies the probability that you will see a given base at each index position of the motif. • This is built from sequences known to bind the TF (e.g. 46 sequences for the PFM below). 3’ 5’ Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
PFM->normalized PFM->PWM Binding site data Position frequency matrix (PFM) (also known as raw count matrix) • acggcagggTGACCc • aGGGCAtcgTGACCc • cGGTCGccaGGACCt • tGGTCAggcTGGTCt • aGGTGGcccTGACCc • cTGTCCctcTGACCc • aGGCTAcgaTGACGt • . • . • . • cagggagtgTGACCc • gagcatgggTGACCa • aGGTCAtaacgattt • gGAACAgttTGACCc • cGGTGAcctTGACCc • gGGGCAaagTGACTg Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position (e.g. divide by 46). Position weight matrix (PWM) (also known as position-specific scoring matrix) The normalized PFM is converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM. Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
Position Weight Matrix for ERE Converting a PFM into a PWM For each matrix element do: – raw count (PFM matrix element) of nucleotide b in column i N – number of sequences used to create PFM (= column sum) - pseudocounts (correction for small sample size) p(b) - background frequency of nucleotide b Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
Scoring putative EREs by scanning the promoter w/ PWM G G G T C A G C A T G G C C A Absolute score of the site =11.57 This is also called “functional depth” Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
Estimating p. values for a match to the matrix G G G T C A G C A T G G C C A This sequence had a functional depth (f) of 0.86 The summed probabilities of all sequences with f >=.86 gives the p.value for this sequence = chance that f>=.86 would be achieved by a randomized DNA sequence. Short matrices can reach f > .9 but still have high p. values – thus f is the best measure for short seqs. Long matrices can have very low p. values but still have f< .9 – thus p.value is the best measure for long seqs.
DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP
Preparing for PWM search Lauch WinSCP (Start->programs->WinSCP) Navigate to: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_output_files Pull over the “ipvinput19_peaks.xls” file to the PC. (this is the MACS output file that we generated yesterday) Open it into Excel
Making .bed file w/ +/-200 bp around peak summit (where we expect TFBS enrichment will be greatest) =same row, chr column =start col+summit+200 =start col+summit-200 • Copy these 3 columns (without any header row). • In WinSCP click on any file on the PC, then on files->new->file & provide a name (“LiE_chr19_400bp.bed”) to edit a new simple text file. • Paste, save & close.
Making a file of control .bed regions peak ctrs. control regions chr start end chr start end … =peaks:chr =peaks:start-5000 =peaks:end-5000 • 5000 bp is far enough away to not be part of an enhancer composed of the ER binding site... but is close enough to likely be in the same general chromatin territory (e.g. accessible euchromatin vs. inaccessible heterochromatin) • Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP
CentDistA TFBS enrichment program designed for ChIP-seq data Assumes that TFBS-matrix hits will be most highly enriched at the centers of ChIP-seq peaks. Adds PWM score to “peakiness” score (e.g. how much more enriched the TF site is in the center of the peak) final p. val. Good enrichment poor shape (higher p.val.) Good enrichment OK shape Good enrichment good shape (best p.) Go to:http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.php?email=guest …or (more simply) just google centdist and click on the first link (should end in /centdist/)
Run CentDist Give centdist a name for your run Upload your +/-200 bp .bed file (CentDist does not need a separate background file, instead using flanking sequences at a case-specific optimized distance as background) Check “Jaspar”, “vertebrate” & set max-co-motif distance to 3000 Then click Submit Job On the new window that opens click “turn on autorefresh” so you will be notified when the job ends
Jaspar vs. Transfac Jaspar is a freely-available set of TFBS matrices that can be downloaded from jaspar.genereg.net Transfac is a commercial product that you need to pay for the latest release of. A version of Transfac (from ~2006) is available on the cluster (e.g. /cluster/home/g/s/gschni01/vertebrates.mat) Which to use? Both, ideally, but beware that programs like CentDist will give you results from Transfac matrices – and you won’t be able to find out details of what they are.
CentDist Results • View by factors, put in max number & hit go. • P. Values (based on Score compose of Z0 (enrichment) & Z1 (peakiness) • Distribution graph • Weblogo representation of Jaspar matrix Shows information content at each position. A,G,C&T 25% each-> 0 bits, only 1 base 100%->2 bits. Bases most highly over-represented relative to chance are largest.
How many enriched TF sites are there really? Matrix hit enrichment for many factors, are all of them real? V$jaspar_HNF4A V$jaspar_NR2F1 V$jaspar_ESR1 Maybe not, notice how similar top sites are to each other and to estrogen response elements (EREs) such as V$jaspar_ESR1
Downloading CentDist Results Click on download as text & save the file somewhere you remember. Open it into excel. Basic summary statistics & a few other things. Many questions unanswered: -What is the fold enrichment over background? -What are the peaks with the greatest enrichment for each factor? -What are the TFBS hit locations in each peak? -Which are the real enriched TFBSes & which are just showing up by homology? -Do certain factors group together into the same same peaks?
DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP
Storm Storm is a straightforward PWM scanning program that runs in UNIX. Its greatest advantage is that it gives you all of the unprocessed output data, which allows you to do much more powerful analyses! It also allows us to specify thresholds for matches to the matrix – allowing us to use functional depth as well as p. value
Getting DNA for Storm To run storm, we first need to get the actual DNA sequence for centers of our peaks (where we expect the greatest enrichment for TFBSes to be). Go to the UCSC genome browser at: genome.ucsc.edu Under genome choose mouse mm9 Then choose add custom track & upload your +/-200 bp .bed file. Click on Tools->Table Browser Select your new track Select output format “sequence” Provide a file name “LiE_chr19_400bp.fa” & hit “get output” Hit ‘get output’ again on the next page Now do the same for your “CTRL_chr19_400bp.bed” file. .fa denotes a simple ‘fasta’ format sequence file.
Cleaning up our .fa files Use WinSCP to move these .fa files and their corresponding .bed files to your …/chip directory. Each entry in the .fa file has a header with special characters in it that confuse storm. All of the commands below are in the file /cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt… cat this to your screen, to copy & paste commands. To fix this, go to your …/chip directory in Putty & do: perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa To see what has changed use: head *.fa Do the same for your “CTRL_chr19_400bp.fa” file.
Running storm First set some path variables: export CREAD=/cluster/home/g/s/gschni01/cread-0.84 export PATH=$PATH:$CREAD/bin Then run storm for your IP .fa file: bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t 0.0005 -s LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat And for your control .fa file: bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t 0.0005 -s CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat Use more to look at one of your .storm output files (space for next page ctrl c to exit)
DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP
Interpreting Storm data Run the dme_parse perl program to gather and tabulate your storm data: bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed peaks
dme_parse outputs …storm.bed file: Has USCS browser tracks for each TFBS matrix with locations of all hits in bed format. …storm.map file: Lists all input matrices followed by the PFM derived from all of the hits to this matrix from our data. …storm.info file: Summarizes a lot of information about matrix hits Move the .info files to your PC with WinSCP & open them into Excel. File provides summary statistics for # of peaks with 0,1,2,etc. hits, total hits, and normalized hits per 50 bp vs distance from peak center.
dme_parse outputs Using the .info file to plot relative density of TFBS hits in aorta IP, liver IP & offset controls:
dme_parse outputs Using the .info files to structure binomial tests Hits= # of matches to each matrix in IP data Tests=# of times storm tested for a match =(# of peaks) * (400 bp length of peaks - matrix length) Background freq= matches to offset conrol peak data/# tests (same as for IP) Using the .info files to determine fractional enrichment Hit frequency in IP data/Hit frequency in offset control
dme_parse outputs .freqs file: Number of hits to each matrix for each peak Distribution of hits per peak in offset background establishes # of hits to be p.<=.05 enriched over backgound • Allows identification of sites at which a given TFBS may be functionally targeted (candidates for further testing) • Can also look for significant overlaps between the peaks with enrichment for 2 different factors - to identify cooperative versus antagonistic interactions. Details on how to do these analyses are in ChIPseq_analysis_methods_2013_02_11 on the cbi website.
DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP
STAMP Go to www.benoslab.pitt.edu/stamp/index.php STAMP lets you compare matrices for evolutionary similarities to each other. Go to your CentDist output. Create a new column in which you change the names of the factors to fit with the names in the Jaspar_non_redundant_vertebrate.mat file you used for Storm. =substitute(b2,“V$jaspar_”,”Jaspar$”), & propogate down Select all matrix names w/ p.<.05 & paste them into a new file called “select_mats.txt” in your /chip folder on the cluster using WinSCP.
Getting STAMP to help classify our CentDist top hits perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat select_mats.txt select_mats.mat Now, open the select_mats.mat file with WinSCP, copy everything & paste it into STAMP. Keep all the STAMP defaults & hit submit.
STAMP Tree This indicates that enrichment of PPARG, RORA, NR4A2 could be just because of their similarity to EREs. Other enriche sites, such as SP1, FoxA2 & Myf fall in separate homology classes. To further distinguish which one is real, you can use the enrichment ratios & p. values (the “real” TFBS should be best in both of these.