DNA Motif Finding Katherina Kechris Introduction to Bioinformatics BIOI 7710/7711 Lecture 12 10/6/05
DNA Motifs • Short repeating sequence elements in the genome can have important regulatory function • Transcription, splicing, post-transcriptional processing, … • Motifs are representations of known examples • Local multiple sequence alignment
Genes to Proteins • DNA: actggtacgtggaccgttacg • RNA: acugguacguggaccguuacg • Protein: TGTWTVT
Examples: Transcription Factors • Environment: yeast, Gal4 (galactose-rich conditions) • Development: Drosophila, Hb, Bi, Kr, Gt (embryonic patterning) • Tissue-specific: mammals, C/EBP β (liver) • Figure: expression of even-skipped (eve)
Transcription Factor Binding Sites CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG CGGCGGCTTCTAATCCG CGGAGGGCTGTCGCCCG CGGAGGAGAGTCTTCCG CGGAGCAGTGCGGCGCG CGCGCCGCACTGCTCCG CGGAAGACTCTCCTCCG CGGGCGACAGCCCTCCG CGGATTAGAAGCCGCCG CGGGGCGGATCACTCCG CGGCGGTCTTTCGTCCG CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG
Databases • TRANSFAC (binding sites): http://www.gene-regulation.com/pub/databases.html#transfac
More Databases • Species-specific: • SCPD (yeast): http://rulai.cshl.edu/SCPD/ • DPInteract (E. coli): http://arep.med.harvard.edu/dpinteract/ • Drosophila DNase I Footprint Database (v2.0): http://www.flyreg.org/
Motif Representations • Aligned sites: CGGCGCACTCTCGCCCG CGGGGCAGACTATTCCG CGGCGGCTTCTAATCCG ... CGGGGCAGACTATTCCG • Consensus: CGGNGCACANTCNTCCG • Frequency Matrix • Logo
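The consensus and frequency-matrix representations can be illustrated with a minimal Python sketch. The function names and the majority `threshold` rule for writing `N` are assumptions made for illustration; the slide does not state which rule produced its consensus string.

```python
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Return one Counter of base counts per aligned column."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(width)]

def frequency_matrix(sites):
    """Return one {base: relative frequency} dict per column."""
    n = len(sites)
    return [{b: col[b] / n for b in BASES} for col in count_matrix(sites)]

def consensus(sites, threshold=0.5):
    """Most frequent base per column, or 'N' when no base exceeds threshold.

    The threshold rule is a hypothetical choice, not taken from the slides.
    """
    out = []
    for col in frequency_matrix(sites):
        base, freq = max(col.items(), key=lambda kv: kv[1])
        out.append(base if freq > threshold else "N")
    return "".join(out)

# A few of the binding sites listed on the slides.
sites = [
    "CGGCGCACTCTCGCCCG",
    "CGGGGCAGACTATTCCG",
    "CGGCGGCTTCTAATCCG",
    "CGGAGGGCTGTCGCCCG",
]
matrix = frequency_matrix(sites)
summary = consensus(sites)
```

With only four sites the consensus will differ from the one on the slide, which was presumably built from the full set.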
Logos • Graphical representation of nucleotide base (or amino acid) conservation in a motif (or alignment) • Information theory • Height of letters represents relative frequency of nucleotide bases http://weblogo.berkeley.edu/
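As a rough illustration of the information-theoretic idea behind logos, the sketch below computes the information content of one column in bits and scales each letter by its frequency, following the convention described by Schneider & Stephens (1990). It assumes a uniform background over four bases and omits the small-sample correction; the function names are hypothetical.

```python
import math

def information_content(column):
    """Bits of information for one column given as {base: frequency}.

    Assumes four equally likely bases as the reference (maximum 2 bits).
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(4) - entropy

def letter_heights(column):
    """Height of each letter: its frequency times the column's information."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
split = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
heights = letter_heights(split)
```

A fully conserved column carries 2 bits, a column split evenly between two bases carries 1 bit, and a uniform column carries none.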
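Position Weight Matrix (PWM) • Frequency Matrix: p_i(b), the frequency of base b at motif position i • Weight Matrix: log of p_i(b) / p0(b), where p0(b) are the background frequencies • NOTE: Use pseudo-counts for zero frequencies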
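Predicting Motif Occurrences: Sequence Scoring • Score a candidate segment (e.g., a g c g g t a) by summing the weight-matrix entries for its bases, one per motif position • Example sums from the slide: 13.5 and -15.6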
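A minimal sketch of weight-matrix scoring follows, assuming log base 2, a uniform background by default, and a pseudo-count of 1 per base. These settings and the function names are illustrative choices, not values taken from the lecture.

```python
import math

BASES = "ACGT"

def weight_matrix(sites, background=None, pseudocount=1.0):
    """Log-odds weights log2(p_i(b) / p0(b)) with pseudo-counts.

    The pseudo-count keeps zero counts from producing log(0).
    """
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        col = {}
        for b in BASES:
            count = sum(1 for s in sites if s[i] == b)
            freq = (count + pseudocount) / (n + 4 * pseudocount)
            col[b] = math.log2(freq / background[b])
        pwm.append(col)
    return pwm

def score(pwm, segment):
    """Sum the weights of the segment's bases, one per motif position."""
    return sum(col[b] for col, b in zip(pwm, segment.upper()))

def scan(pwm, sequence):
    """Score every window of motif width; returns (start, score) pairs."""
    w = len(pwm)
    return [(j, score(pwm, sequence[j:j + w]))
            for j in range(len(sequence) - w + 1)]

toy_pwm = weight_matrix(["ACG", "ACG", "ACG", "ACT"])
best_start, best_score = max(scan(toy_pwm, "ttACGtt"), key=lambda js: js[1])
```

High positive sums indicate segments that look more like the motif than the background; negative sums indicate the opposite.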
Novel Motif Prediction • Goal: Characterize and predict locations of a novel motif in sequences • Challenges: • Short (6-20 bases) • Degenerate • Locations not fixed • Signal to noise • e.g., yeast: 600-800 bp sequences
Problem
Data: Upstream sequences from co-regulated/co-expressed genes.
Assumption: Binding site occurs in most sequences
1: actcgtcggggcgtacgtacgtaacgtacgtacggacaactgttgaccg
2: cggagcactgttgagcgacaagtacggagcactgttgagcggtacgtac
3: ccccgtaggcggcgcactctcgcccgggcgtacgtacgtaacgtacgta
4: agggcgcgtacgctaccgtcgacgtcgcgcgccgcactgctccgacgct
Goals: 1) Estimate motif 2) Predict locations of motifs
1: actcgtcggggcgtacgtacgtaacgtacgtaCGGACAACTGTTGACCG
2: cggagcactgttgagcgacaagtaCGGAGCACTGTTGAGCGgtacgtac
3: ccccgtaggCGGCGCACTCTCGCCCGggcgtacgtacgtaacgtacgta
4: agggcgcgtacgctaccgtcgacgtcgCGCGCCGCACTGCTCCGacgct
Strategies • Deterministic • Regular expression representation A-C-[AG]-x(2,5)-T-x(2)-A • Enumerative • Probabilistic • Statistical model • Frequency matrix
Strategies • Deterministic • Enumerative • Regular expression representation (consensus) A-C-[AG]-x(2,5)-T-x(2) • Probabilistic • Statistical model • Frequency matrix
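The regular-expression representation can be tried directly. The sketch below converts the pattern notation used on the first Strategies slide, A-C-[AG]-x(2,5)-T-x(2)-A, into a Python regular expression. It assumes that `x` means any base and that `x(m,n)` means between m and n of them; the converter function is a hypothetical helper written for this example.

```python
import re

def pattern_to_regex(pattern):
    """Translate a dash-separated motif pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            parts.append(element)            # literal base or [..] class
        elif m.group(1) is None:
            parts.append("[ACGT]")           # bare x: exactly one base
        elif m.group(2) is None:
            parts.append("[ACGT]{%s}" % m.group(1))
        else:
            parts.append("[ACGT]{%s,%s}" % (m.group(1), m.group(2)))
    return "".join(parts)

def find_occurrences(pattern, sequence):
    """Return (start, matched_text) for non-overlapping matches."""
    regex = re.compile(pattern_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]

regex_text = pattern_to_regex("A-C-[AG]-x(2,5)-T-x(2)-A")
hits = find_occurrences("A-C-[AG]-x(2,5)-T-x(2)-A", "ttacaggtccatt")
```

Note that `finditer` reports non-overlapping matches only, which is a limitation if motif occurrences can overlap.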
Model • Example: cggagcactgttgagcgacaagtaCGGAGCACTGTTGAGCGgtacgtac (motif in uppercase) • Motif positions: independent, non-identically distributed • Background positions: independent, identically distributed • Motif start-positions are missing data • Assume one motif occurrence per sequence • Goals: 1) estimate motif and 2) predict locations of motifs
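Basics for Estimation
• Conditional on the frequency matrix: for each sequence k and position j, the probability of the sequence given that the motif starts at j is the product of the motif-column probabilities over the W motif positions and the background probabilities everywhere else
e.g., sequence = “ctCGTCggggc”, j = 3, motif width W = 4: p0(c) p0(t) · p1(C) p2(G) p3(T) p4(C) · p0(g) p0(g) p0(g) p0(g) p0(c)
• Conditional on the motif start-positions j: estimate the frequency of base b at motif position i as (number of b’s at motif position i) / N, where N = number of sequences
e.g., cgTACGtaacg acaagtaCGGA cCCCGtaggcg cgcgCGCCgca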
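The two conditional calculations can be sketched in Python as follows. This is an illustration under the model's independence assumptions, using 0-based start indices (so the slide's j = 3 becomes index 2); the function names and the optional `pseudocount` argument are assumptions added here.

```python
BASES = "ACGT"

def sequence_likelihood(seq, j, motif, background):
    """Pr(seq | motif starts at 0-based index j).

    motif is a list of {base: probability} columns; background is one
    {base: probability} dict shared by all non-motif positions.
    """
    w = len(motif)
    prob = 1.0
    for pos, base in enumerate(seq.upper()):
        if j <= pos < j + w:
            prob *= motif[pos - j][base]
        else:
            prob *= background[base]
    return prob

def estimate_motif(seqs, starts, width, pseudocount=0.0):
    """Frequency of each base at each motif position, given the starts."""
    n = len(seqs)
    motif = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for seq, j in zip(seqs, starts):
            counts[seq[j + i].upper()] += 1
        total = n + 4 * pseudocount
        motif.append({b: c / total for b, c in counts.items()})
    return motif

# The four example sequences from the slide, with 0-based motif starts
# read off from the uppercase segments.
seqs = ["cgTACGtaacg", "acaagtaCGGA", "cCCCGtaggcg", "cgcgCGCCgca"]
starts = [2, 7, 1, 4]
estimated = estimate_motif(seqs, starts, width=4)
```

For the slide's four sequences, the first motif column contains T, C, C, C, so the estimated frequency of C at motif position 1 is 3/4.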
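Estimation: Method I
• Gibbs Motif Sampler
• Bayesian model, prior distribution
• Algorithm (MCMC)
Initialization: Randomly select a motif start-position in each sequence
Iterations:
• Remove a randomly selected sequence k’
• Update the frequency matrix from the remaining sequences
• Randomly select a motif start-position j for k’ with probability proportional to the ratio of the motif-model probability to the background probability of the W-long segment starting at j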
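The iteration described above can be sketched as a simple site sampler. This is a simplified illustration and not the Gibbs Motif Sampler software: it assumes exactly one occurrence per sequence, uses fixed pseudo-counts in place of a full prior, estimates the background from overall base composition, and takes a hypothetical `seed` argument for reproducibility.

```python
import random

BASES = "ACGT"

def motif_without(seqs, starts, width, skip, pseudocount=0.5):
    """Frequency matrix from all sequences except index `skip`."""
    n = len(seqs) - 1
    cols = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for k, (seq, j) in enumerate(zip(seqs, starts)):
            if k != skip:
                counts[seq[j + i]] += 1
        total = n + 4 * pseudocount
        cols.append({b: c / total for b, c in counts.items()})
    return cols

def gibbs_motif_sampler(seqs, width, iterations=2000, seed=0):
    """Return sampled 0-based motif start-positions, one per sequence."""
    rng = random.Random(seed)
    seqs = [s.upper() for s in seqs]
    pooled = "".join(seqs)
    # Background from overall composition (a simplifying assumption).
    background = {b: (pooled.count(b) + 1) / (len(pooled) + 4) for b in BASES}
    # Initialization: random start-position in each sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        k = rng.randrange(len(seqs))          # remove a random sequence
        motif = motif_without(seqs, starts, width, skip=k)
        weights = []
        for j in range(len(seqs[k]) - width + 1):
            ratio = 1.0
            for i in range(width):
                b = seqs[k][j + i]
                ratio *= motif[i][b] / background[b]
            weights.append(ratio)             # motif vs. background odds
        # Sample the new start proportional to the odds ratios.
        starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the sampler is stochastic, different seeds can settle on different alignments; running several chains and comparing them is a common precaution.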
Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
Estimation: Method II
• MEME
• Missing data problem: Expectation-Maximization (EM) Algorithm to obtain maximum likelihood estimates
• EM Algorithm
Initialization: Set frequency matrix p and background p0
Iterations:
• E-step: Calculate the probability of each motif start-position. For each sequence k and position j, W_kj = Pr(motif start-position = j | sequence k, p, p0)
• M-step: Update the frequency matrix estimate, weighting each candidate start-position by W_kj
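The E-step and M-step can be sketched for the one-occurrence-per-sequence case. This is a simplified illustration and not the MEME implementation: it keeps the background p0 fixed, assumes a uniform prior over start-positions, adds a small pseudo-count in the M-step, and requires an initial matrix with no zero entries. The function and parameter names are hypothetical.

```python
BASES = "ACGT"

def em_motif(seqs, init_motif, background, iterations=50, pseudocount=0.1):
    """Return (motif, posteriors); posteriors[k][j] plays the role of W_kj."""
    seqs = [s.upper() for s in seqs]
    width = len(init_motif)
    motif = [dict(col) for col in init_motif]
    posteriors = []
    for _ in range(iterations):
        # E-step: W_kj is proportional to the motif/background odds of the
        # window starting at j (background terms outside the window cancel).
        posteriors = []
        for seq in seqs:
            ratios = []
            for j in range(len(seq) - width + 1):
                r = 1.0
                for i in range(width):
                    b = seq[j + i]
                    r *= motif[i][b] / background[b]
                ratios.append(r)
            total = sum(ratios)
            posteriors.append([r / total for r in ratios])
        # M-step: expected base counts per motif position, then normalize.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq, w_k in zip(seqs, posteriors):
            for j, w_kj in enumerate(w_k):
                for i in range(width):
                    counts[i][seq[j + i]] += w_kj
        motif = [{b: c / sum(col.values()) for b, c in col.items()}
                 for col in counts]
    return motif, posteriors
```

EM converges to a local maximum, so the result depends on the initial matrix; trying several starting points is a common workaround.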
Model Extensions • Multiple occurrences in sequence • Motif width • Multiple motifs • Alternative background models • Palindromes • Gapped motifs • Dependencies between positions Software: AlignACE (Roth et al., 1998), BioProspector (Liu et al., 2001) Sometimes predicted motifs do not look “real” because they do not reflect structural constraints.
Examples: Information Content (profiles: bi-modal, uni-modal) • Yeast: gal4, abf1, pho4 • E. coli: crp, purR
Goal: Incorporate Structural Constraints into the Model • The nature of transcription factor-DNA interactions imposes constraints on the motifs: not all motifs are equally likely! • The objective is to bias the search toward motifs that reflect these types of structural constraints. CGGACAACTGATGACCG CGGAGCACAGTTGAGCG CGGCGGCTTCTAATCCG CGGAGGGCTGTCGCCCG Methods: TFEM (Kechris et al., 2004), van Zwet et al. (2005)
TFEM: Blocks • For each position i = 1, 2, …, W, assign a prior distribution f on the multinomial parameters p_i • According to its block, the prior distribution is high (f_h) or medium (f_m) information • The prior distribution penalizes deviations from high or medium information • Bi-modal information: blocks HIGH, MEDIUM, HIGH (reverse for uni-modal)
TFEM: Change Points • May not know the change points between blocks • Include an unobserved random variable for all change point pairs (s, t): W(W+1)/2 + 1 possible pairs
TFEM: Results • Application • Sequences from co-regulated/co-expressed genes • Knowledge about transcription factor (family, structure) • Use method with expected motif shape (uni/bi-modal) • Results • Extended model with prior distribution helps bias the search • Evaluated with decoy motifs and “noisy” data (longer sequences)
Recent Directions • Experimental Data • Microarrays • ChIP-Chip • Phylogenetic Analysis • Cross-species comparisons • Higher organisms • Motif Modules
References • Reviews • Stormo GD (2000), Bioinformatics, 16:16-23 • Bulyk (2003), Genome Biology 5:201 • Logos • Schneider & Stephens (1990), Nucleic Acids Res. 18:6097-6100 • Enumerative • Jones and Pevzner (4.4-4.6) • Brazma et al. (1998), J. Comp. Bio, 5:279-305 • Probabilistic • GMS: Lawrence et al. (1993), Science, 262:208-214 • MEME: Bailey & Elkan (1995), Machine Learning, 21:51-80 • Structural Constraints • Kechris et al. (2004), Genome Biology, 5(7):R50. • van Zwet et al. (2005), Stat. Appl. Genet. & Mol. Biol., 4(1) Article 1.