Transcription factor binding sites and gene regulatory network
Victor Jin
Department of Biomedical Informatics
The Ohio State University
Transcription in higher eukaryotes
• Gene Expression
• Chromatin structure
• Initiation of transcription
• Processing of the transcript
• Transport to the cytoplasm
• mRNA translation
• mRNA stability
• Protein activity stability
Transcriptional Regulation
[Figure labels: nuclear membrane; binding site/motif CCG__CCG; genome-wide mRNA transcript data (e.g. microarrays)]
Learning problems:
• Understand which regulators control which target genes
• Discover motifs representing regulatory elements
Some common approaches
• Cluster-first motif discovery
• Cluster genes by expression profile, annotation, … to find potentially coregulated genes
• Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)
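The second step above (finding overrepresented motifs in the promoters of a gene cluster) can be illustrated with a toy k-mer count. This is only a minimal sketch and not how MEME, Consensus, Gibbs sampling, or AlignACE work: the function names, the scoring rule (fraction of cluster promoters containing a k-mer minus the fraction of background promoters containing it), and the sequences are all hypothetical.

```python
from collections import Counter

def kmers_present(seq, k):
    """Return the set of distinct k-mers that occur in one sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def enriched_kmers(cluster, background, k):
    """Rank k-mers by (fraction of cluster promoters containing the k-mer)
    minus (fraction of background promoters containing it).
    This scoring rule is an assumption made for illustration only."""
    in_cluster = Counter()
    in_background = Counter()
    for s in cluster:
        in_cluster.update(kmers_present(s, k))
    for s in background:
        in_background.update(kmers_present(s, k))
    scored = []
    for kmer, n in in_cluster.items():
        diff = n / len(cluster) - in_background[kmer] / len(background)
        scored.append((diff, kmer))
    # Highest enrichment first; ties broken alphabetically.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(kmer, diff) for diff, kmer in scored]

# Hypothetical promoters of a co-expressed cluster and of unrelated genes.
CLUSTER = ["AACCGGTT", "TTCCGGAA", "GACCGGTC"]
BACKGROUND = ["AAAATTTT", "TTTTAAAA", "ATATATAT"]

RANKED = enriched_kmers(CLUSTER, BACKGROUND, 4)
print(RANKED[0])
```

Real motif finders model degenerate positions and background composition statistically; exact k-mer counting is used here only because it fits in a few lines.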
Training data – Features
[Figure labels: regulator expression; promoter sequence; label; feature vector]
What is PWM?
• Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences.
• A positional weight matrix (PWM) specifies the probability that you will see a given base at each index position of the motif.

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A    18   8   5   4   1  29   7   7   7   0   1  39   1   1   6
C     8   3   3   9  33   4  21  15  14   0   0   1  43  39  18
G    13  31  34   9   8  10  11  15  19   4  44   3   0   1   6
T     7   4   4  24   4   3   7   9   6  42   1   3   2   5  16
Con   N   G   G   T   C   A   N   N   N   T   G   A   C   C   N
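As a small illustration of the "probability of a given base at each position" idea, the counts on this slide can be normalized column by column. The consensus rule used here (report a base only when it exceeds 50% of a column, otherwise N) is an assumption; the slide does not state its rule, although this threshold happens to reproduce the consensus row shown.

```python
# Count matrix copied from the slide (rows A, C, G, T; 15 positions).
COUNTS = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}

def column_probabilities(counts):
    """Divide each count by its column total so every column sums to 1."""
    length = len(counts["A"])
    probs = []
    for i in range(length):
        total = sum(counts[b][i] for b in "ACGT")
        probs.append({b: counts[b][i] / total for b in "ACGT"})
    return probs

def consensus(probs, threshold=0.5):
    """Assumed rule: keep the most frequent base if it exceeds the threshold,
    otherwise write N."""
    out = []
    for col in probs:
        base = max(col, key=col.get)
        out.append(base if col[base] > threshold else "N")
    return "".join(out)

PROBS = column_probabilities(COUNTS)
CONSENSUS = consensus(PROBS)
print(CONSENSUS)  # NGGTCANNNTGACCN
```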
PWM for ERE

Aligned binding-site sequences:
acggcagggTGACCc
aGGGCAtcgTGACCc
cGGTCGccaGGACCt
tGGTCAggcTGGTCt
aGGTGGcccTGACCc
cTGTCCctcTGACCc
aGGCTAcgaTGACGt
...
cagggagtgTGACCc
gagcatgggTGACCa
aGGTCAtaacgattt
gGAACAgttTGACCc
cGGTGAcctTGACCc
gGGGCAaagTGACTg

Position frequency matrix (PFM) (also known as raw count matrix)
Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position). A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position.

Position weight matrix (PWM) (also known as position-specific scoring matrix)
PFM should be converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
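Assembling a PFM from aligned sites amounts to counting bases column by column. The sketch below uses only the first three sequences listed on this slide (the full set is elided there), and `build_pfm` is a hypothetical helper name.

```python
def build_pfm(sites):
    """Count how often each base occurs at each position of aligned sites.
    Sites must have equal length and contain only A, C, G, T (case ignored)."""
    sites = [s.upper() for s in sites]
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * length for b in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1  # raises KeyError on a non-ACGT character
    return pfm

# First three aligned sites from the slide; the remaining ones are omitted.
SITES = ["acggcagggTGACCc", "aGGGCAtcgTGACCc", "cGGTCGccaGGACCt"]
PFM = build_pfm(SITES)

for base in "ACGT":
    print(base, PFM[base])
```

Each column of the result sums to the number of sites, which is the N used when the counts are later normalized or converted to weights.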
Position Weight Matrix for ERE
Converting a PFM into a PWM
For each matrix element do:
[Equation image not recovered; two symbol names are missing from the legend]
– [symbol missing]: raw count (PFM matrix element) of nucleotide b in column i
– N: number of sequences used to create PFM (= column sum)
– [symbol missing]: pseudocounts (correction for small sample size)
– p(b): background frequency of nucleotide b
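The conversion formula itself did not survive in this transcript, so the following is a sketch of one commonly used form rather than necessarily the slide's own: weight = log2( ((count + s * p(b)) / (N + s)) / p(b) ), with total pseudocount s defaulting to sqrt(N) and a uniform background p(b) = 0.25. The choice of s, the uniform background, and the function name are assumptions.

```python
import math

def pfm_to_pwm(pfm, background=None, pseudocount=None):
    """Convert a count matrix {base: [counts]} into log2-odds weights.
    Assumed formula: log2(((f + s * p(b)) / (N + s)) / p(b)),
    where s defaults to sqrt(N). Columns are assumed to be non-empty."""
    bases = "ACGT"
    if background is None:
        background = {b: 0.25 for b in bases}  # assumed uniform background
    length = len(pfm["A"])
    pwm = {b: [] for b in bases}
    for i in range(length):
        n = sum(pfm[b][i] for b in bases)  # N = column sum
        s = math.sqrt(n) if pseudocount is None else pseudocount
        for b in bases:
            corrected = (pfm[b][i] + s * background[b]) / (n + s)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm

# Toy two-column matrix: column 0 is all A, column 1 is uniform.
TOY_PFM = {"A": [4, 1], "C": [0, 1], "G": [0, 1], "T": [0, 1]}
TOY_PWM = pfm_to_pwm(TOY_PFM)
print(TOY_PWM["A"], TOY_PWM["C"])
```

In the toy matrix the zero counts yield finite negative weights instead of log(0), and the uniform column scores 0 for every base, which is the behavior the pseudocount correction is meant to give.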
Scoring putative EREs by scanning the promoter with PWM
Site: G G G T C A G C A T G G C C A
Absolute score of the site = 11.57
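Scanning can be sketched as sliding a window of the motif width along the promoter and summing the per-position weights. The code below builds weights from the 15-position count matrix shown on the "What is PWM?" slide using an assumed sqrt(N) pseudocount and uniform background, so it is not guaranteed to reproduce the score quoted above. The promoter string is hypothetical, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

# Count matrix from the "What is PWM?" slide (rows A, C, G, T).
COUNTS = {
    "A": [18, 8, 5, 4, 1, 29, 7, 7, 7, 0, 1, 39, 1, 1, 6],
    "C": [8, 3, 3, 9, 33, 4, 21, 15, 14, 0, 0, 1, 43, 39, 18],
    "G": [13, 31, 34, 9, 8, 10, 11, 15, 19, 4, 44, 3, 0, 1, 6],
    "T": [7, 4, 4, 24, 4, 3, 7, 9, 6, 42, 1, 3, 2, 5, 16],
}

def build_pwm(counts, bg=0.25):
    """Log2-odds weights per position; sqrt(N) pseudocount is an assumption."""
    pwm = []
    for i in range(len(counts["A"])):
        n = sum(counts[b][i] for b in BASES)
        s = math.sqrt(n)
        pwm.append({b: math.log2((counts[b][i] + s * bg) / (n + s) / bg)
                    for b in BASES})
    return pwm

def scan(sequence, pwm):
    """Score every forward-strand window; skip windows with non-ACGT letters."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(ch not in BASES for ch in window):
            continue
        score = sum(pwm[i][ch] for i, ch in enumerate(window))
        hits.append((start, window, score))
    return hits

PWM = build_pwm(COUNTS)
# Hypothetical promoter: a strong site flanked by poly-T filler.
PROMOTER = "TTTTTTTTTT" + "AGGTCACCGTGACCC" + "TTTTTTTTTT"
HITS = scan(PROMOTER, PWM)
BEST = max(HITS, key=lambda h: h[2])
print(BEST[0], BEST[1], round(BEST[2], 2))
```

In practice a score threshold would be applied to the list of hits, and the reverse complement would be scanned as well; both are left out to keep the sketch short.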
Yeast ESR: Biological Validation
[Figure labels: universal stress repressor motif; STRE element]
Previous work: “Structure learning”
• Graphical models (and other methods)
• Learn structure of “regulatory network”, “regulatory modules”, etc.
• Fit interpretable model to training data
• Model small number of genes or clusters of genes
• Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction
(Pe’er et al. 2001) (Segal et al. 2003, 2004)
Network inference
[Figure: diagrams with nodes labeled P, Mp, TF, MTF, M; panels: Direct binding, Indirect effect, Co-occurrence]
• Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, co-occurrence
• Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip)
• Still, can determine statistically significant regulator-target relationships from regulation program
Binding data for regulatory networks
• ChIP-chip: genome-wide protein-DNA binding data, i.e. what promoters are bound by TF?
• Investigate regulatory network model: use ChIP-chip data in place of motifs (no motif discovery)
• Features: (regulator, TF-occupancy) pairs
[Figure labels: P1, P2, TF]
Inferring regulatory networks from the combination of expression data and binding data
An extended ER regulatory network in MCF7 cells
[Figure: network diagram. Node labels: RUVBL1, GTF2I, ZNF500, TTF2, RFC1, RXRA, MKL2, ZKSCAN1, RAB18, HSF2, ASCC3, BHLHB2, MSX2, PNN, HIF1A, ZNF38, BAZ1B, HEY2, ER, STRAP, CEBP, DNMT1, XBP1, NRIP1, TLE3, LASS2, ZNF394, VPS72, ZNF239, THRAP1, FOXP4, HDAC1, TXNDC, ZBTB41, BRIP1, FOS, TBX2, TXNIP, MYC, PAWR, ELF3, IVNS1ABP, CHAF1B, PURB, DDX20, C140RF43, BATF, CSDE1, SP3, HES1, ADAR, CUTL1, CCNL1, BRF1]
Signaling molecules -- Networks
[Figure labels: TF, SM, mRNA, Glc7 phosphatase complex, Gac1, Hsf1, Sds22, Gip1]
• Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features
• e.g. Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (interaction supported in literature)
http://motif.bmi.ohio-state.edu/ChIPMotifs/
Input Data
• FASTA file
• Contact Info
• Control data (optional)
Ab initio Motif Discovery Programs
• Weeder
• MaMf
• MEME
Statistical Methods
• Bootstrap re-sampling
• Fisher test
STAMP Matching
Results
• SeqLog
• PWM
• P-value
• Known or novel motifs
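The slide lists a Fisher test among the statistical methods but does not say how the pipeline applies it. One plausible use, shown here purely as an assumption, is to ask whether a candidate motif occurs in more ChIP sequences than control sequences. The sketch computes a one-sided Fisher exact p-value from a 2x2 table using only the standard library; the function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].
    Returns P(X >= a) under the hypergeometric null, where
      a = ChIP sequences with the motif,    b = ChIP sequences without it,
      c = control sequences with the motif, d = control sequences without it."""
    row1 = a + b            # number of ChIP sequences
    col1 = a + c            # number of sequences containing the motif
    total = a + b + c + d
    denom = comb(total, row1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(total - col1, row1 - k) / denom
    return p

# Hypothetical counts: motif found in 3 of 4 ChIP and 1 of 4 control sequences.
P_VALUE = fisher_exact_greater(3, 1, 1, 3)
print(round(P_VALUE, 4))  # 0.2429
```

With such small hypothetical counts the result is not significant; real ChIP and control sets are far larger, and bootstrap re-sampling (also listed on the slide) would be a separate step not shown here.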
Software Demo
• W-ChIPMotifs
• HRTargetDB