220 likes | 358 Views
Combined analysis of ChIP-chip data and sequence data Harbison et al. CS 498 SS Saurabh Sinha. Outline. Transcription factors interpret the regulatory information encoded in DNA to induce or repress gene expression
E N D
Combined analysis of ChIP-chip data and sequence dataHarbison et al. CS 498 SS Saurabh Sinha
Outline • Transcription factors interpret the regulatory information encoded in DNA to induce or repress gene expression • Comparative genomics has been used to find the regulatory sites in yeast genome • Looking at sequence alone does not reveal if a putative site is actually functioning as a binding site • ChIP-chip data (also called “location data”) provides such information • Harbison et al combine these two types of data
Chip-on-chip Source: http://www.chiponchip.org/
Data • Genome-wide “location analysis” using ChIP-on-chip • Each experiment done with one TF • 203 TFs experimented with, in “rich media conditions” • 84 of these TFs also experimented with in at least one other condition • Why? • Binding is not just a function of the presence of the site. It is also a function of the presence of the TF • TF may not be present in every condition
Data • How were the 84 TFs (to be tested in additional conditions) chosen? • If there was prior evidence that they play a role in that additional condition
ChIP-on-chip results • 11,000 unique interactions between TFs and promoter regions identified • A matrix of (m x n), where m is the number of TFs (203), n is the number of yeast genes (~6000) • 11,000 of the entries were “1”, meaning the binding was significant • Need post-processing of binding affinities to assess if it is statistically significant
The next step: bring in the sequence • Genome-wide “location data” or “binding data” combined with sequence data • For each TF, collect all sequences bound by it • These are promoter length sequences, not exact binding sites • Apply motif finding programs to estimate what the binding motif is (where the binding sites are)
Motif finding • Only consider TFs that bound >= 10 sequences • 147 such TFs • Run 6 different motif-finders on the bound sequences • 68000 motifs discovered ! • A large number of these motifs are “variants” of the same motif, i.e., similar to each other
Motif finding • Using clustering of motifs, and stringent statistical tests, identify high confidence motifs from among these 68000 motifs • High confidence motifs found for 116 of the 147 TFs whose bound sequences were analyzed • Now require that the motif also be conserved across other related yeast species • 65 TFs with single, high-confidence, phylogenetically conserved motifs were found
Motif finding • The 65 motifs were a mix of “known” and novel motifs. • That is, some of the motifs were similar to already known motifs • 21 TFs’ motifs were new • Took these 65 motifs, as well as other known motifs from the literature to form a compendium of 102 motifs for further analysis
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Next step • We now have motifs for 102 TFs • Next step is to locate binding sites of each TF in the whole genome • Equivalent to finding matches to each motif in the whole genome • Finding matches: • Require a high sequence similarity • Require phylogenetic conservation • Require high binding to that region by TF
Mapping sites in the genome • “Map” gave 3353 sites (“interactions”) within 1296 promoters • This is different from simply locating matches to motif • Because TF binding information is also incorporated • Under different conditions, only a subset of the binding sites in the map are actually occupied
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Does the map make sense? • The map is telling us which TFs bind which actual sites in the genome, and hence which genes are being regulated • In many cases, the known functions of the genes predicted to be targeted by a TF are consistent with the known function of the TF
More insights from the map • Binding sites are not uniformly distributed over the promoter regions • Sharply peaked distribution • Very few sites in 100 bp immediately upstream of the genes • Most sites (74%) are between 100 and 500 bp of gene Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Arrangements of sites • Specific arrangements of binding sites in a promoter • Simple arrangement: one binding site for one TF • Another arrangement: Repeats of a particular binding site • Allows for “graded response” • Some TFs show a significant preference for repeated sites
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Arrangements of sites • Another arrangement: Binding sites for multiple TFs • “Combinatorial regulation”: In different conditions, different combinations of binding sites (and TFs) direct different gene expression • Genes whose promoters have such arrangement of sites are required for multiple pathways, and regulated in environment-specific fashion
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Arrangements of sites • Another arrangement: Binding sites for specific pairs of TFs occur more frequently in same promoter than expected by chance • The two TFs perhaps interact physically in doing their job
Source: Harbison et al. Nature 431, 99-104(2 September 2004)