Regulatory Motif Finding

Regulatory Motif Finding Wenxiu Ma CS374 Presentation 11/03/2005

Outline • Regulation of genes • Regulatory Motifs • Motif Representation • Current Motif Discovery Methods

Regulation of Genes • What turns genes on (producing a protein) and off? • When is a gene turned on or off? • Where (in which cells) is a gene turned on? • How many copies of the gene product are produced?

Overview of Gene Control • The mechanisms that control the expression of genes operate at many levels. source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

Transcriptional Regulation • The transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site (TSS). • Two types of fundamental components are involved: • short DNA regulatory elements • gene regulatory proteins that recognize and bind to them.

Regulation of Genes [Figure, three frames. Labels: transcription factor (protein), RNA polymerase (protein), DNA, regulatory element, gene; the last frame adds the new protein.] source: M. Tompa, U. of Washington

What is a motif? • A subsequence (substring) that occurs in multiple sequences and has biological importance. • Motifs can be totally constant or have variable elements. • Protein motifs often result from structural features. • DNA motifs (regulatory elements) • Binding sites for proteins • Short sequences (5-25 bp) • Up to 1000 bp (or farther) from the gene • Inexactly repeating patterns

daf-19 Binding Sites in C. elegans (figure axis: positions -150 to -1)
• che-2: GTTGTCATGGTGAC
• daf-19: GTTTCCATGGAAAC
• osm-1: GCTACCATGGCAAC
• osm-6: GTTACCATAGTAAC
• F02D8.3: GTTTCCATGGTAAC
source: Peter Swoboda

Motif Representation • Consensus sequence: a single string with the most likely sequence (+/- wildcards) • Regular expression: a string with wildcards, constrained selection • Profile: a list of the letter frequencies at each position • Sequence logo: • graphical depiction of a profile • shows the conservation of elements in a motif.
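The first three representations can be derived mechanically from a set of aligned sites. The following is a minimal sketch, assuming the five daf-19 sites listed in these slides are already aligned without gaps; the function names are illustrative and ties in the consensus are broken alphabetically, which is an arbitrary choice.

```python
import re
from collections import Counter

# The five daf-19 binding sites listed in these slides.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def profile(sites):
    """Letter frequencies at each position (one dict per alignment column)."""
    n = len(sites)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*sites)]

def consensus(sites):
    """Most frequent letter per column; ties are broken alphabetically."""
    out = []
    for col in zip(*sites):
        counts = Counter(col)
        out.append(max(sorted(counts), key=counts.get))
    return "".join(out)

def regex(sites):
    """Regular expression listing every letter observed in each column."""
    parts = []
    for col in zip(*sites):
        letters = "".join(sorted(set(col)))
        parts.append(letters if len(letters) == 1 else f"[{letters}]")
    return "".join(parts)

print(consensus(SITES))
print(regex(SITES))
print(profile(SITES)[1])  # frequencies in the second column

# Every input site should match the derived regular expression.
assert all(re.fullmatch(regex(SITES), s) for s in SITES)
```

The consensus discards the variability that the regular expression and the profile keep, which is why a profile (or its logo) is usually the more informative summary.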

Motif Logos: an Example (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Measure of Conservation • Relative heights of letters reflect their abundance in the alignment. • Total height = entropy-based measurement of conservation. • Entropy(i) = -SUM{ f(base, i) * log2[f(base, i)] } over all bases • Conservation(i) = 2 - Entropy(i) • Units of conservation = bits of information (hence the base-2 logarithm; 2 bits is the maximum for a 4-letter alphabet) • Entropy measures variability/disorder. • Highly conserved = low entropy = tall stack • Very variable = high entropy = low stack
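The conservation measure can be computed column by column. The sketch below applies it to the five daf-19 sites from these slides, using a base-2 logarithm so the result is in bits. It omits the small-sample correction that sequence logo software typically applies, so the values are only indicative for an alignment of five sequences.

```python
import math
from collections import Counter

# The five daf-19 binding sites listed in these slides.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def entropy_bits(column):
    """Shannon entropy of one alignment column, in bits."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def conservation(sites):
    """Conservation(i) = 2 - Entropy(i) for each column i."""
    return [2.0 - entropy_bits(col) for col in zip(*sites)]

for i, bits in enumerate(conservation(SITES), start=1):
    print(f"position {i:2d}: {bits:.2f} bits")
```

Fully conserved columns reach the 2-bit maximum, while columns with several different bases score lower and would appear as short stacks in the logo.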

Finding Regulatory Motifs • Given a collection of genes with common expression, find the (TF-binding) motif they have in common.

Identifying Motifs: Complications • We do not know the motif sequence • We do not know where it is located relative to the gene's start • Motifs can differ slightly from one gene to another • How do we discern it from “random” motifs?

Current Motif Discovery Methods • GOAL: comprehensive identification of all the regulatory motifs in genomes. • by overrepresentation • MEME, Gibbs sampling • by phylogenetic footprinting • Footprinter • Cross species comparative analysis • Combine structure information

Motif Finding: Comparative Analysis • Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. • Xie, X. et al., Nature (2005). • Identify motifs based on comparative analysis of human, mouse, rat and dog genomes • A systematic catalogue of human gene regulatory motifs • Short, functional sequences (6-10bp) used many times in a genome • Focus regions • Promoters • 3’ untranslated regions (3’ UTRs) • microRNAs (miRNAs) • post-transcriptional regulation

Motif Discovery Procedure • Alignment of promoters & 3’ UTRs • Motif conservation score (MCS) • Measure the extent of excess conservation • “Highly conserved motifs” • MCS>6 • Clustering

Alignment of promoters & 3’ UTRs • Construct a whole-genome alignment for the four mammalian genomes • Blastz and Multiz • Extract the aligned promoter and 3’ UTR portions, respectively. • Coordinates: the annotation of NCBI reference sequences (RefSeq)

Motif Conservation Score (MCS) • Consensus sequence representation • Alphabet size: 11 (A, C, G, T, [AC], [AG], [AT], [CG], [CT], [GT], [ACGT]) • A conserved occurrence of a motif m is an instance in which an exact match to this motif is found in all four species. • Conservation rate p = ratio of conserved occurrences to total occurrences in human • Expected conservation rate p0 = average conservation rate of 100 random motifs of the same length and redundancy.
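Counting conserved occurrences can be illustrated on a toy alignment. This is a minimal sketch, not the procedure from the paper: the four sequences are made up, the alignment is assumed to be gap-free so that column i means the same thing in every species, only one strand is scanned, and the function names are hypothetical.

```python
# Motif positions as strings of allowed bases, e.g. "CT" stands for [CT].
MOTIF = ["T", "G", "A", "CT"]

# Made-up, ungapped toy alignment (not real genomic sequence).
ALIGNMENT = {
    "human": "AATGACGGTGATCC",
    "mouse": "AATGATGGTGATCC",
    "rat":   "ACTGACGGTCATCC",
    "dog":   "AATGACGGTGATAC",
}

def matches(motif, kmer):
    """True if every base of kmer is allowed at its motif position."""
    return len(kmer) == len(motif) and all(
        base in allowed for base, allowed in zip(kmer, motif)
    )

def conservation_counts(motif, alignment, reference="human"):
    """Return (conserved, total) occurrence counts relative to the reference."""
    ref = alignment[reference]
    k = len(motif)
    conserved = total = 0
    for i in range(len(ref) - k + 1):
        if matches(motif, ref[i:i + k]):
            total += 1
            # Conserved: the motif matches at the same columns in all species.
            if all(matches(motif, seq[i:i + k]) for seq in alignment.values()):
                conserved += 1
    return conserved, total

conserved, total = conservation_counts(MOTIF, ALIGNMENT)
print(f"conservation rate p = {conserved}/{total}")
```

Note that a conserved occurrence only requires each species to match the motif, not to carry the identical base: at the first site the mouse sequence has T where the others have C, and both are allowed by [CT].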

MCS • MCS = number of standard deviations (s.d.) by which the observed conservation rate p of a motif exceeds the expected conservation rate p0. • p = k/n (k conserved occurrences out of n total occurrences in human) • Binomial probability of observing k conserved out of n • Estimated by way of the normal approximation to the binomial distribution.

Conservation Properties of Regulatory Motifs • Known 8-mer TGACCTTG • Conservation rate 37% (162 out of 434) • random rate 6.8% • MCS = 25.2 s.d. • Promoter Region • TRANSFAC: 446 motifs • MCS>3: 63% • MCS>5: ~50% • 3’ UTR • no database analogous to TRANSFAC • some known motifs
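The TGACCTTG numbers above can be checked against the MCS definition. The sketch below assumes the score is the usual z-score of a binomial count under the normal approximation, (k - n*p0) / sqrt(n*p0*(1 - p0)); the slides do not spell out the exact formula, so this is one plausible reading rather than the paper's implementation.

```python
import math

def mcs(k, n, p0):
    """Standard deviations by which k conserved occurrences out of n exceed
    the expected count n*p0, using the normal approximation to the binomial."""
    expected = n * p0
    sd = math.sqrt(n * p0 * (1.0 - p0))
    return (k - expected) / sd

# Values from the slide: 162 conserved out of 434, random rate 6.8%.
score = mcs(k=162, n=434, p0=0.068)
print(f"MCS = {score:.1f} s.d.")
```

With these rounded inputs the result is roughly 25.3, in line with the 25.2 s.d. quoted on the slide.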

Motif Discovery Procedure • Alignment of promoters & 3’ UTRs • Motif conservation score (MCS) • “Highly conserved motifs” • MCS>6 • Clustering

Results: motifs in promoters • 174 highly conserved motifs • 59 match known motifs strongly, 10 match more weakly. • 105 potential new regulatory motifs Xie, X. et al., Nature, 2005

Results: motifs in 3’ UTRs • 106 highly conserved motifs • Two unusual properties • Strand specificity • Unusual length distribution

Property 1: strand specificity [Figure] Xie, X. et al., Nature, 2005

Property 2: unusual length distribution [Figure] Xie, X. et al., Nature, 2005

Properties => miRNA • Strand specificity • 3’-UTR motifs act at the level of RNA rather than DNA • They have a role in post-transcriptional regulation • Length distribution • Many mature miRNAs start with U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs. • Hypothesis: many of the highly conserved 8-mer motifs might be binding sites for conserved miRNAs.

The microRNA pathway [Figure: pri-miRNA (7mG(5’)ppp(5’)G cap, 3’ poly-A tail) -> Drosha/Pasha -> pre-miRNA -> Dicer -> miR/miR* duplex -> mature miRNA -> miRNP] Adapted from Tomari & Zamore, Curr Biol 2004

Relationship with miRNA • 72 highly conserved 8-mer motifs • Contiguous, non-degenerate • ~46% of all 3’-UTR motifs • 207 distinct human miRNAs • From the current registry • Complementary matches • Exact match: ~43.5% • One mismatch: ~50% • 95% of matches begin at nucleotide 1 or 2 of the miRNA • Conclusion: 8-mer motifs represent target sites for miRNAs
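The complementary-match test can be sketched as a reverse-complement lookup. In the example below the motif is written as DNA, the miRNA as RNA, and only exact matches are handled. The let-7a sequence is used purely as a familiar illustration and is not taken from these slides; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def seed_match_start(motif, mirna):
    """1-based miRNA position at which the motif's exact complementary
    match begins, or None if the motif is not complementary to the miRNA."""
    mirna_dna = mirna.upper().replace("U", "T")
    idx = mirna_dna.find(reverse_complement(motif.upper()))
    return idx + 1 if idx >= 0 else None

# Mature human let-7a, written 5' to 3' (illustrative choice).
LET7A = "UGAGGUAGUAGGUUGUAUAGUU"

for motif in ("CTACCTCA", "ACTACCTC", "GGGGGGGG"):
    print(motif, seed_match_start(motif, LET7A))
```

A motif whose match starts at position 1 or 2 pairs with the 5’ end of the miRNA, which is the pattern reported for 95% of the matches.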

8-mer motifs -> new miRNA genes • RNAfold program • 242 conserved and stable stem-loop sequences • 113 known, 129 potential new miRNAs • Biological validation • 12 selected new miRNA genes • 6 (50%) show clear expression in tissues.

Prevalence of miRNA regulation • 20% of 3’ UTRs may be targets for conserved miRNA-based regulation at the 8-mer motifs. • Unbiased assessment of the relative importance of miRNA-based regulation in the human genome

Summary: comparative genome analysis • 4 mammalian species • An initial systematic catalogue • Promoters • 3’ UTRs • Importance of the new miRNA regulatory mechanism • Future directions: • genome-wide discovery • alignments of more genomes, e.g. primates

Now… • Motif Finding Methods • Cross species comparative analysis • Combine structure information

Motif Finding: Structural Knowledge • Ab initio prediction of transcription factor targets using structural knowledge • Kaplan T, et al., PLoS Comput Biol (2005) • Proposes a general framework for predicting the DNA binding site (BS) sequences of novel TFs from a known family • Structure-based approach • No prior TF binding data or target genes needed • Family-wise probabilistic model • Context-specific amino acid-nucleotide recognition preferences

Structure-based approach • Family-wise probabilistic model • Input: • pairs of TFs and their target DNA sequences • structural information • Output: Context-specific amino acid-nucleotide recognition preferences • Position specificity • Then, discover TFBSs of other TFs from the same family

Cys2His2 Zinc Finger protein family • largest known DNA-binding family in multicellular organisms • common, strict binding models source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

Cys2His2 Zinc Finger: Canonical DNA binding model Residues at positions 6, 3, 2, and -1 (relative to the beginning of the α-helix) at each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows). Kaplan et al., PLoS Comput Biol, 2005

Cys2His2 Zinc Finger: DNA Binding Model source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

Cys2His2 Zinc Finger: Compiling dataset • Goal: DNA-recognition preferences for each of the four key positions • every amino acid (AA) vs. every nucleotide (NT) • too few solved protein-DNA complexes • Known protein sequence data and their DNA targets • TRANSFAC: 455 protein-DNA pairs • Non-canonical model • Profile HMM • No exact binding locations • CX(2-4)CX(11-13)HX(3-5)H
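The pattern CX(2-4)CX(11-13)HX(3-5)H translates directly into a regular expression, where X is any amino acid and the numbers give the allowed spacer lengths. The sketch below is only a quick pattern scan, not the profile HMM mentioned on the slide, and the test protein is a made-up string built to satisfy the pattern.

```python
import re

# C, 2-4 residues, C, 11-13 residues, H, 3-5 residues, H
ZF_PATTERN = re.compile(r"C.{2,4}C.{11,13}H.{3,5}H")

def find_fingers(protein):
    """Return (start index, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in ZF_PATTERN.finditer(protein)]

# Made-up sequence: spacers of length 2, 12 and 3 between the C/C/H/H residues.
toy = "MK" + "C" + "PE" + "C" + "G" * 12 + "H" + "QRT" + "H" + "TGEKP"

for start, finger in find_fingers(toy):
    print(start, finger)
```

A plain pattern like this finds candidate fingers but gives no score and tolerates no deviations, which is one reason a profile HMM is the more flexible choice for locating the four key residue positions.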

Profile HMM • Build a model representing the consensus sequence for a family, rather than the sequence of any particular member • Find potential alignments for new sequences [Figure labels: match states, insertion states, “silent” deletion states]

Example: full profile HMM

Structure-based approach • Input: set of pairs of TFs and their target DNA sequences • Output: Context-specific amino acid-nucleotide recognition preferences • Iterative Expectation Maximization (EM) algorithm

Cys2His2 Zinc Finger: Probabilistic Model • Let A be the set of interacting residues at the 4 key positions of the k fingers • Let N1, …, NL be a target DNA sequence • The probability of an interaction starting at the jth position in the DNA is expressed in terms of Pp(N|A), • where Pp(N|A) is the conditional probability of nucleotide N given amino acid A at position p. Kaplan et al., PLoS Comput Biol, 2005
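A model of this kind can be sketched as a product of per-contact terms Pp(N|A). Everything specific in the example below is an assumption made for illustration: the mapping from residue position to nucleotide offset (OFFSETS), the order of the fingers along the DNA, the uniform background outside the site, and the made-up preference values. It is not the exact formula from the paper.

```python
# Assumed layout (illustrative only): finger i covers the triplet starting at
# j + 3*i; positions 6, 3, -1 read its three bases and position 2 reads the
# next base, so a k-finger site spans 3*k + 1 bases.
OFFSETS = {6: 0, 3: 1, -1: 2, 2: 3}

def site_probability(prefs, fingers, dna, j):
    """Product of P_p(N | A) over all contacts of a site starting at j."""
    prob = 1.0
    for i, finger in enumerate(fingers):
        for pos, aa in finger.items():
            nt = dna[j + 3 * i + OFFSETS[pos]]
            prob *= prefs[pos][aa][nt]
    return prob

def binding_posterior(prefs, fingers, dna):
    """Posterior over start positions, assuming exactly one site, a uniform
    prior over positions and a uniform background elsewhere."""
    width = 3 * len(fingers) + 1
    scores = [site_probability(prefs, fingers, dna, j)
              for j in range(len(dna) - width + 1)]
    total = sum(scores)
    return [s / total for s in scores]

def peaked(nt):
    """Made-up preference: 0.7 for one nucleotide, 0.1 for each other one."""
    return {b: (0.7 if b == nt else 0.1) for b in "ACGT"}

# Hypothetical one-finger TF and hypothetical preferences.
FINGERS = [{6: "R", 3: "E", -1: "R", 2: "D"}]
PREFS = {6: {"R": peaked("G")}, 3: {"E": peaked("C")},
         -1: {"R": peaked("G")}, 2: {"D": peaked("A")}}

posterior = binding_posterior(PREFS, FINGERS, "TTGCGATT")
print([round(p, 4) for p in posterior])
```

Under these toy preferences nearly all of the posterior mass falls on the start position of the GCGA window, which is the kind of distribution over binding locations an EM procedure would compute in its E-step.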

EM algorithm • Iterative EM algorithm • Exact binding locations for all protein-DNA pairs • recognition preferences: Pp(N|A) • E-step • Compute expected posterior probability of binding locations, based on current preferences • M-step • Update DNA-recognition preferences to maximize the likelihood of current binding locations based on the distribution of possible binding locations in previous E-step • Local optima
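The E-step and M-step can be sketched on toy data. This is a simplified illustration under stated assumptions, not the authors' implementation: each TF is flattened into an ordered list of (position, amino acid) contacts that read consecutive nucleotides, every target sequence is assumed to contain exactly one site, the prior over start positions and the background are uniform, and the TF/DNA pairs are made up.

```python
import math
from collections import defaultdict

NUCS = "ACGT"

def site_prob(prefs, contacts, dna, j):
    """Product of P_p(N | A) over the contacts of a site starting at j."""
    prob = 1.0
    for t, key in enumerate(contacts):      # key = (position p, amino acid A)
        prob *= prefs[key][dna[j + t]]
    return prob

def e_step(prefs, contacts, dna):
    """Posterior over start positions plus this pair's log-likelihood term."""
    scores = [site_prob(prefs, contacts, dna, j)
              for j in range(len(dna) - len(contacts) + 1)]
    total = sum(scores)
    return [s / total for s in scores], math.log(total / len(scores))

def m_step(pairs, posteriors):
    """Re-estimate each P_p(N | A) from posterior-weighted nucleotide counts."""
    counts = defaultdict(lambda: dict.fromkeys(NUCS, 0.0))
    for (contacts, dna), post in zip(pairs, posteriors):
        for j, w in enumerate(post):
            for t, key in enumerate(contacts):
                counts[key][dna[j + t]] += w
    return {key: {n: c[n] / sum(c.values()) for n in NUCS}
            for key, c in counts.items()}

def run_em(pairs, iterations=20):
    """Alternate E- and M-steps from uniform preferences; return the final
    preferences and the log-likelihood recorded before each M-step."""
    keys = {key for contacts, _ in pairs for key in contacts}
    prefs = {key: dict.fromkeys(NUCS, 0.25) for key in keys}
    history = []
    for _ in range(iterations):
        results = [e_step(prefs, contacts, dna) for contacts, dna in pairs]
        history.append(sum(ll for _, ll in results))
        prefs = m_step(pairs, [post for post, _ in results])
    return prefs, history

# Made-up TF/target pairs: contacts in DNA order and a short target sequence.
PAIRS = [
    ([(-1, "R"), (3, "E"), (6, "R")], "TTGCGTA"),
    ([(-1, "R"), (3, "D"), (6, "R")], "AGAGCT"),
    ([(-1, "Q"), (3, "E"), (6, "R")], "CACGTT"),
]

prefs, history = run_em(PAIRS)
print(f"log-likelihood: {history[0]:.3f} -> {history[-1]:.3f}")
```

The log-likelihood never decreases from one iteration to the next, but the result depends on the starting point, which is the local-optimum caveat noted on the slide.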

Estimate DNA-recognition preferences [Figure] Kaplan et al., PLoS Comput Biol, 2005

Apply to TFs from the same family [Figure] Kaplan et al., PLoS Comput Biol, 2005

Evaluation • Compatible with experimental results • 10-fold cross validation • Genome-wide scan of Drosophila melanogaster • 29 canonical Cys2His2 TFs • GO enrichment of predicted target genes • 21 enriched with at least one GO term. • mRNA expression profile of target genes • 21 showed significant associations in at least one embryogenesis experiment.
