410 likes | 621 Views
cisGreedy Motif Finder for Cistematic. Sarah Aerni Mentors: Ali Mortazavi Barbara Wold. cisGreedy. De novo motif finder which implements a greedy algorithm similar to Consensus motif finder
E N D
cisGreedy Motif Finder for Cistematic Sarah Aerni Mentors: Ali Mortazavi Barbara Wold
cisGreedy • De novo motif finder which implements a greedy algorithm similar to Consensus motif finder • Goal: To provide an efficient algorithm to be included in the Cistematic package that performs similarly to Consensus and meme
Cistematic • Integrate visualization, refinement of motifs and improve performance of multiple motif finders in a single package Mortazavi, 2006
Cistematic • cisGreedy becomes part of “Bottom Tier” • Motif finder would be included in the Cistematic package (prevents need for complicated installations) Image: Ali Mortazavi
What is a Motif? • cis-Regulatory elements • Transcription Factor Binding Sites(TFBS) • Binding by transcription factors may increase or decrease transcription of genes
What is a Motif? • GAL4 in Yeast • Activator of galactose-induced genes (convert galactose to glucose) • Protein structure determines motif • DNA-protein interactions require certain bases at specified locations • Motif reflects homodimer structure
What is a Motif? • cis-Regulatory elements • Transcription Factor Binding Sites(TFBS) • Binding by transcription factors may increase or decrease transcription of genes • Gene Regulation believed to be a major source of complexity • Plants may have more genes or larger genomes than humans – are they more complex? • Identification of cis-regulatory elements will help us understand gene regulatory networks (bigger picture)
How do we find motifs? • Hard to identify • Relatively short sequences (as small as 6 bases) • Many positions not well conserved • Factors improving identification • Usually localized in certain proximity of a gene (search within 3 kb upstream) • Some positions highly conserved • Use other data (Microarray?)
Motif Finders • Greedy • Maximizes similarity of motifs from sequences through a greedy approach • Eliminate background modeling by using Cistematic package preprocessing steps • Improves speed • Prevents false negatives • Implements multiple models (zoops, oops, TCM)
ConsensusScoring • Use equation similar to log likelihood called Information Content L columns in the matrix A = {A,C,G,T} frequency of each letter i at each position j a priori probability of letter i • Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577.
Removing Background • Goal of a background model: differentiate noise from signal • Issues with background: • What background should be used? • Whole genome? Conserved regions? • Selective pressures maintain conserved regions • Arguably searching in conserved regions guarantees there is little noise (it has been maintained) • Solution: • Search in conserved regions • Use simple repeat masking • Sequences which reoccur are likely TFBS
cisGreedy scoring • Scoring focuses on maximizing number of identical bases • Percent identity is dependent on number of deviations from the strict consensus • Background adds complexity that may lead to false negatives
cisGreedy • Input sequences are analyzed • Randomly select 2 sequences to be compared
cisGreedy • The two selected sequences are analyzed independently of the remaining sequences
cisGreedy • The two selected sequences are analyzed independently of the remaining sequences • Windows of motif size are scanned starting at the beginning of each sequence
cisGreedy • Sequences are scanned in an attempt to locate the highest scoring alignment • Alignments are ungapped • Score is established as the number of sequences containing the most frequently occurring base at each position
cisGreedy • Reverse Complements are analyzed (user specified) • Once start locations are established with a top alignment score, these are left unchanged (Greedy)
cisGreedy • Select an additional sequence in which to identify the location of the motif • Windows in the additional sequence are aligned to previously established windows (Greedy)
cisGreedy • Additional sequence scanned as before, reverse complement (user specified) • Alignment score established as before
cisGreedy • Final motif locations are used in order to build position specific frequency matrices • Reverse complement sequence used in building PSFM if used
Testing cis-Greedy • AIY • 16bp cis-regulatory motif drives expression • Experimentally verified • Gene battery consists of a set of genes bound by AIY • Orthologous genes contain highly specified binding sites • Individual binding sites of battery genes within a single species can vary considerably (Wenick and Hobert 759)
Cistematic Results for AIY regions of conservation orthologous genes hen-1 hen-1
Results for AIY AIY Identified AAATTGGCTTCCTCAAA cisGreedy TTTGAGGAAGCCAATTT (reverse comp) AAATTGGCTTCCTCAAA meme AAATTGGCTTCCTCAAA AIY- Battery Consensus
Results for AIY hen-1
Tompa Bakeoff • 3 benchmark datasets • Real • Markov Chain • Generic • 4 organisms • Human • Mouse • Fruitfly • Yeast • Each dataset contains 0-1 motifs. • Each sequence can have 0 or multiple motifs • Report 0-1 motif per dataset and locations of motifs • Use statistical tools provided by bakeoff to analyze runs
Bakeoff example (hm03) • Identify most reasonable motif based in each dataset independently
Real Real Interesting pattern appears between 3 of 10 sequences
Bakeoff example (hm03) • Identify most reasonable motif based in each dataset independently • Determine which motif appears most reasonable across 3 benchmarks and map motif in sequences using Cistematic • Compare results to actual locations (provided in bakeoff package)
Real Real
Bakeoff results • Correlation Coefficient: nCC = (nTP nTN - nFN nFP) / √((nTP+nFN)(nTN+nFP)(nTP+nFP)(nTN+nFN)) • Sensitivity (fraction of known sites that are predicted): sSn = sTP / (sTP + sFN) • Positive Predictive Value (fraction of predicted sites that are known.) sPPV = sTP / (sTP + sFP)
Bakeoff results • cisGreedy overall 7th best performer (excluding those with no data): • Overall top performer in fly • Worst performer in yeast • 3rd worst performer in mouse • 4th best performer in human
Bakeoff results Adapted from Tompa, 2005 When running programs in parallel, correlation of motif finder results to true binding sites improves
Future goals • Complete analysis of results for cisGreedy using benchmarks established by Tompa paper (Nature Biotech, 2005) • Document results and algorithm development • Continue improving cisGreedy
References • Bioalgorithms.info • Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms . : MIT Press , 2004. • Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 8 1999: 563-577. • Tompa, Martin et al. “Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005: 137-144. • Wenick, Adam S., and Oliver Hobert. "Genomic cis-Regulatory Architecture and trans-Acting Regulators of a Single Interneuron-Specific Gene Battery in C. elegans." Developmental Cell 6(2005): 757-770. • http://cistematic.caltech.edu
Acknowledgements • Ali Mortazavi • Barbara Wold • Wold Lab funding provided by DOE & NASA • Additional funding by NSF & NIH • SoCalBSI faculty, staff and fellow students