
Pattern Discovery and Recognition for Understanding Genetic Regulation

Pattern Discovery and Recognition for Understanding Genetic Regulation. Timothy L. Bailey, Institute for Molecular Bioscience, University of Queensland. Recent work: identifying statistically significant regulatory modules; computing motif statistics; evaluation of motif discovery algorithms.


Presentation Transcript


  1. Pattern Discovery and Recognition for Understanding Genetic Regulation • Timothy L. Bailey, Institute for Molecular Bioscience, University of Queensland

  2. Recent Work • Identifying statistically significant regulatory modules • Computing motif statistics • Evaluation of motif discovery algorithms • Future directions: motif discovery in sets of orthologous sequences

  3. Identifying Statistically Significant Regulatory Modules • Overview of the problem • Previous research • The MCAST algorithm • Validation • Discussion

  4. Problem Statement • Given a set of one or more motifs, can we identify the genes that they regulate by searching a genomic database?

  5. The Problem is Hard • The futility theorem: the vast majority of potential TF binding sites are false positives (Wasserman). • This is because TF binding sites are short and degenerate, so they occur frequently at random in DNA.
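
A rough calculation illustrates why short, degenerate sites match so often by chance. The sketch below is not from the talk: it assumes independent, equiprobable bases, models a "degenerate" position as one that accepts 2 of the 4 bases, and ignores palindromic motifs. The function name and parameters are hypothetical.

```python
def expected_random_matches(width, degenerate_positions, sequence_length,
                            both_strands=True):
    """Expected number of chance matches to a motif in random DNA.

    Assumptions (illustrative only): bases are i.i.d. with probability 1/4,
    and each degenerate position accepts exactly 2 of the 4 bases.
    """
    if not 0 <= degenerate_positions <= width:
        raise ValueError("degenerate_positions must be between 0 and width")
    exact_positions = width - degenerate_positions
    # Probability that a random window of this width matches the motif.
    p_match = (0.25 ** exact_positions) * (0.5 ** degenerate_positions)
    # Number of windows that can be scanned on one strand.
    n_starts = max(sequence_length - width + 1, 0)
    strands = 2 if both_strands else 1
    return p_match * n_starts * strands
```

Under these assumptions, an 8-bp motif with two degenerate positions is expected to match on the order of a hundred times per megabase scanned on both strands, which is why a single hit carries little evidence on its own.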

  6. The Approach • Groups of transcription factors often operate in concert, binding near each other. • Multiple binding sites for the same TF often occur close together. • Whereas individual binding sites cannot be statistically significant, clusters may be.

  7. MCAST • Hybrid of Cisanalyst and COMET • Based on Meta-MEME (Grundy et al., CABIOS 13:397-406, 1997) • MCAST has two input parameters: • Motif p-value threshold (p) • Maximum gap size (L) • MCAST builds a motif-based HMM and uses the Viterbi algorithm to find clusters.

  8. Definition of a Motif Cluster • A “cluster” is a collection of “hits” (matches to motifs) with no gaps longer than L. • Hits are shown schematically as beads on a string (in the figure: +3 +1 +1 -2). The number is the motif identifier. +/- indicates which DNA strand the hit is on.
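
The cluster definition above can be illustrated with a simple greedy pass over sorted hits. This is only a sketch of the definition, not MCAST's method (which, per slide 7, uses a motif-based HMM and the Viterbi algorithm); the `Hit` type, its fields, and the function name are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    start: int     # 0-based start of the motif match
    end: int       # exclusive end of the motif match
    motif_id: int  # motif identifier (the number on the bead)
    strand: str    # "+" or "-"


def group_hits_into_clusters(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Group hits so that no gap inside a cluster is longer than max_gap (L)."""
    clusters: List[List[Hit]] = []
    cluster_end = 0  # rightmost coordinate covered by the current cluster
    for hit in sorted(hits, key=lambda h: h.start):
        if clusters and hit.start - cluster_end <= max_gap:
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            clusters.append([hit])
            cluster_end = hit.end
    return clusters
```

For example, the hits `+3 +1 +1 -2` would form one cluster if every gap between neighbours is at most L, and would split wherever a gap exceeds L.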

  9. Cluster Scoring Function • [Figure: one cluster on genomic DNA, labelled with hit scores h1 to h4, gap widths d2 to d4, and a gap penalty.]

  10. Performance metrics • ROC50 measures the area under a curve that plots true positive rate as a function of false positive rate, up to the 50th false positive. • KB60 is the average number of kilobases per false positive at a threshold that yields 60% sensitivity. • For both metrics, larger is better.
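
A minimal sketch of how an ROC50-style score might be computed from a ranked list of search results. The normalization (dividing by n times the number of positives, so a perfect ranking scores 1.0) is a common convention and an assumption here; the slide does not state which normalization was used. The function name is hypothetical.

```python
def roc_n(ranked_labels, n=50):
    """ROC_n for a list of booleans (True = positive), best-scoring first.

    For each of the first n false positives, count the true positives ranked
    above it; sum those counts and divide by n * (total positives).
    """
    total_pos = sum(1 for label in ranked_labels if label)
    if total_pos == 0 or n <= 0:
        return 0.0
    tp = fp = area = 0
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
        else:
            fp += 1
            area += tp
            if fp == n:
                break
    # Fewer than n negatives in the list: extend the curve at the final TP count.
    if fp < n:
        area += (n - fp) * tp
    return area / (n * total_pos)
```

With this convention a ranking that places every positive ahead of the first false positive scores 1.0, and one that places none ahead of the n-th false positive scores 0.0.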

  11. Four Data sets • Drosophila Eve regulators (Bcd, Cad, Hb, Kr, Kni). • 19 positives and 2039 putative negatives. • Human LSF-regulated promoters (LSF, Sp1, Ets, TATA). • 9 positives and 2005 putative negatives. • Human muscle-specific promoters (Mef-2, Myf, SRF, Tef, Sp1). • 27 positives and 2005 putative negatives. • Muscle* - motifs generated without muscle-specific genes.

  12. Comparison with COMET • [Results table not preserved in the transcript.] Red indicates better performance.

  13. Computing motif statistics • Looking for fast ways to compute the probability of a local, multiple alignment. • Objective function of the latest version of the MEME algorithm.

  14. Computing the statistics of random alignments • Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance. • Computing motif significance is therefore critical to any motif discovery approach.

  15. Measuring the goodness of DNA regulatory motifs: IC • [Figure: upstream sequences (labels include HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3) → alignment of N sites of width w → counts n_ij → frequencies f_ij = n_ij/N → information content IC = IC_1 + … + IC_w.]

  16. POP: product of IC p-values • IC is the sum of the information contents of the motif columns. • POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
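
The sum-of-columns definition of IC can be sketched as follows. The slides give f_ij = n_ij/N and IC = IC_1 + … + IC_w but not the per-column formula; the sketch assumes the usual relative-entropy form, IC_j = Σ_i f_ij log2(f_ij / b_i), with n_ij read as the count of base i in column j and a uniform background b_i = 0.25. Those choices and the function names are assumptions, not taken from the talk.

```python
import math
from collections import Counter


def column_information_contents(sites, background=None):
    """Per-column information content (bits) of equal-width aligned sites."""
    if background is None:
        # Assumed uniform background; a real analysis would estimate this.
        background = {base: 0.25 for base in "ACGT"}
    n_sites = len(sites)
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    column_ics = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)  # n_ij for column j
        ic = 0.0
        for base, n_ij in counts.items():
            f_ij = n_ij / n_sites                     # f_ij = n_ij / N
            ic += f_ij * math.log2(f_ij / background[base])
        column_ics.append(ic)
    return column_ics


def total_information_content(sites, background=None):
    """IC = IC_1 + ... + IC_w."""
    return sum(column_information_contents(sites, background))
```

Under the uniform background a perfectly conserved column contributes 2 bits and a column split evenly between two bases contributes 1 bit. POP would replace the sum of column ICs with the product of their p-values, which requires the null distribution of each column IC and is not shown here.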

  17. Statistics of IC scores • A large deviation method for computing the distribution of IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999). • Time to compute the p-value of one IC score is O(N^2). • MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive. • POP p-values can be computed efficiently.

  18. Correction factor for POP p-values • The p-value of a POP score, p, is roughly: [equation not preserved in the transcript] • Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values. • Empirically, the p-value error for POP, p, letting x = ln(p), is about: [equation not preserved in the transcript]
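
The slide's own expressions are not reproduced in this transcript. For orientation only, the sketch below implements the classical closed form for the p-value of a product of n independent, continuous Uniform(0,1) p-values, p · Σ_{i=0}^{n-1} (−ln p)^i / i!. Whether this is the expression shown on the slide is an assumption; the function name is hypothetical, and the empirical correction for discreteness is not modelled.

```python
import math


def product_pvalue(product, n):
    """P(product of n independent Uniform(0,1) variables <= product).

    Exact only for continuous, independent p-values; IC p-values are
    discrete, which is why the slide describes a correction.
    """
    if not 0.0 < product <= 1.0:
        raise ValueError("product must be in (0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    x = -math.log(product)
    term = 1.0   # holds x**i / i!, starting at i = 0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term
    return product * total
```

Because the sum has only n terms (here n would be the motif width w), evaluating it costs O(w) per motif, consistent with the slides' point that POP p-values are cheap to compute.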

  19. Estimating the POP p-value correction factor parameters • To estimate the correction factor parameters we: • estimate the right tail of the distribution using a convolution method, • fit the (non-linear) correction function to the tail of the distribution using a least squares approach. • The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.

  20. CPU time per motif using the large deviation (LD) method to compute p-values (w = 16)

  21. CPU time to estimate correction factor parameters (w = 16)

  22. Speedup using POP statistic

  23. Discovering regulatory elements in orthologous genes • De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003). • We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

  24. Evaluation of motif discovery algorithms • Joint work with Martin Tompa and others. • Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms. • Each algorithm was run by experts in that particular algorithm. • The ability of the algorithm to discover motifs in sets of DNA sequences was measured.

  25. Performance of Motif Discovery Algorithms Finding Regulatory Motifs

  26. Conservation of known regulatory elements in sets of orthologous genes • [Figure: two panels, “Human vs. Mouse” and “Four yeast species”, each comparing regulatory elements with background sequences.] • Source: Liu et al., Genome Res 14:451-458, 2004.

  27. Large-scale discovery of human regulatory elements • Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%). • The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species. • Large-scale motif discovery should be possible using human and mouse orthologous genes.
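
Window percent identity, the conservation measure named above, can be sketched as a sliding-window count over a pairwise alignment. The window size, step, and the treatment of gapped columns as mismatches are assumptions chosen for illustration; the slide does not give the settings used in the cited study, and the function name is hypothetical.

```python
def window_percent_identity(aligned_a, aligned_b, window=20, step=1):
    """Percent identity in sliding windows over two aligned sequences.

    Columns where either sequence has a gap ('-') count as mismatches
    (an assumption; other conventions skip gapped columns).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    matches = [a == b and a != "-"
               for a, b in zip(aligned_a.upper(), aligned_b.upper())]
    percents = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        percents.append(100.0 * identical / window)
    return percents
```

Comparing the distribution of these window scores inside known regulatory elements with the distribution in background sequence gives the kind of relative conservation difference the slide refers to.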
