cisGreedy Motif Finder for Cistematic

cisGreedy Motif Finder for Cistematic • Sarah Aerni • Mentors: Ali Mortazavi, Barbara Wold

cisGreedy • De novo motif finder that implements a greedy algorithm similar to the Consensus motif finder • Goal: provide an efficient algorithm, included in the Cistematic package, that performs comparably to Consensus and MEME

Cistematic • Integrates visualization and refinement of motifs, and improves the performance of multiple motif finders, in a single package (Mortazavi, 2006)

Cistematic • cisGreedy becomes part of the “Bottom Tier” • The motif finder is bundled with the Cistematic package, which avoids complicated installations • Image: Ali Mortazavi

What is a Motif? • cis-Regulatory elements • Transcription Factor Binding Sites (TFBS) • Binding by transcription factors may increase or decrease transcription of genes

What is a Motif? • GAL4 in Yeast • Activator of galactose-induced genes (convert galactose to glucose) • Protein structure determines motif • DNA-protein interactions require certain bases at specified locations • Motif reflects homodimer structure

What is a Motif? • cis-Regulatory elements • Transcription Factor Binding Sites (TFBS) • Binding by transcription factors may increase or decrease transcription of genes • Gene regulation is believed to be a major source of complexity • Plants may have more genes or larger genomes than humans. Are they more complex? • Identifying cis-regulatory elements will help us understand gene regulatory networks (the bigger picture)

How do we find motifs? • Hard to identify: relatively short sequences (as small as 6 bases); many positions not well conserved • Factors improving identification: usually located near a gene (search within 3 kb upstream); some positions highly conserved; other data can be used (microarray?)

Motif Finders • Greedy • Maximizes similarity of motifs from sequences through a greedy approach • Eliminates background modeling by using Cistematic package preprocessing steps • Improves speed • Avoids false negatives introduced by background modeling • Implements multiple models (ZOOPS, OOPS, TCM)

Consensus Scoring • Uses an equation similar to a log likelihood, called Information Content: $I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i}$ • $L$: number of columns in the matrix • $A = \{A, C, G, T\}$ • $f_{ij}$: frequency of each letter $i$ at each position $j$ • $p_i$: a priori probability of letter $i$ • Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15.7-8 (1999): 563-577.
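The information-content score described on this slide can be sketched in a few lines. This is a minimal illustration, not the Consensus implementation: the function name `information_content`, the uniform default prior, and the convention that a zero frequency contributes nothing are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def information_content(sites, background=None):
    """Sum f_ij * ln(f_ij / p_i) over every column j and letter i.

    `sites` is a list of equal-length aligned strings. `background` maps each
    letter to its a priori probability (assumed uniform when not given).
    """
    if background is None:
        background = {base: 0.25 for base in ALPHABET}  # assumption: uniform prior
    n = len(sites)
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        for base in ALPHABET:
            f = counts[base] / n
            if f > 0:  # a letter that never occurs contributes 0
                total += f * math.log(f / background[base])
    return total
```

A perfectly conserved column scores ln 4 under a uniform prior, and a column containing each base equally often scores 0.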

Removing Background • Goal of a background model: differentiate noise from signal • Issues with background: what background should be used? Whole genome? Conserved regions? • Selective pressures maintain conserved regions • Arguably, searching in conserved regions means there is little noise (the sequence has been maintained) • Solution: search in conserved regions; use simple repeat masking • Sequences that recur are likely TFBSs

cisGreedy scoring • Scoring focuses on maximizing the number of identical bases • Percent identity depends on the number of deviations from the strict consensus • A background model adds complexity that may lead to false negatives
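One way to read "percent identity relative to the strict consensus" is sketched below. The helper names `strict_consensus` and `percent_identity` are hypothetical, and breaking ties by first occurrence is an assumption; the slides do not say how cisGreedy resolves ties.

```python
from collections import Counter

def strict_consensus(sites):
    """Most frequent base in each column (ties go to the base seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def percent_identity(sites):
    """Percentage of all aligned bases that agree with the strict consensus."""
    consensus = strict_consensus(sites)
    matches = sum(b == c for site in sites for b, c in zip(site, consensus))
    return 100.0 * matches / (len(sites) * len(consensus))
```

Each deviation from the consensus lowers the percentage by one base out of the total, so no background model is needed to compute it.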

cisGreedy • Input sequences are analyzed • Randomly select 2 sequences to be compared


cisGreedy • The two selected sequences are analyzed independently of the remaining sequences • Windows of motif size are scanned starting at the beginning of each sequence

cisGreedy • Sequences are scanned in an attempt to locate the highest scoring alignment • Alignments are ungapped • Score is established as the number of sequences containing the most frequently occurring base at each position
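The first greedy step, scanning two sequences for the best ungapped window pair, might look like the sketch below. It assumes the per-position scores are summed across the window; with only two sequences, counting matching positions ranks window pairs the same way as the slide's rule. `best_window_pair` is a hypothetical name, not a Cistematic function.

```python
def best_window_pair(seq_a, seq_b, width):
    """Return (score, start_a, start_b) for the best ungapped window pair.

    The score here is the number of positions at which the two windows carry
    the same base. The earliest best-scoring pair wins ties.
    """
    best = (-1, None, None)
    for i in range(len(seq_a) - width + 1):
        window_a = seq_a[i:i + width]
        for j in range(len(seq_b) - width + 1):
            window_b = seq_b[j:j + width]
            score = sum(x == y for x, y in zip(window_a, window_b))
            if score > best[0]:
                best = (score, i, j)
    return best
```

The scan is exhaustive over both sequences, so its cost grows with the product of the two sequence lengths times the motif width.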

cisGreedy • Reverse complements are analyzed if the user requests it • Once start locations with a top alignment score are established, they are left unchanged (Greedy)
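Optional reverse-complement scanning can be illustrated as follows. This is a sketch under assumptions: `reverse_complement` and `candidate_windows` are hypothetical names, and start positions on the minus strand are reported in the coordinates of the reverse-complemented string.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def candidate_windows(seq, width, use_reverse=False):
    """Yield (strand, start, window) for each window of the given width.

    Minus-strand windows are produced only when `use_reverse` is True
    (the user-specified option from the slide).
    """
    strands = [("+", seq)]
    if use_reverse:
        strands.append(("-", reverse_complement(seq)))
    for strand, text in strands:
        for start in range(len(text) - width + 1):
            yield strand, start, text[start:start + width]
```

For example, `reverse_complement("TTTGAGGAAGCCAATTT")` gives `"AAATTGGCTTCCTCAAA"`.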

cisGreedy • Select an additional sequence in which to identify the location of the motif • Windows in the additional sequence are aligned to previously established windows (Greedy)

cisGreedy • The additional sequence is scanned as before, including its reverse complement if the user requests it • The alignment score is established as before
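The greedy extension step can be sketched as below: windows already chosen stay fixed, and only the new sequence is scanned. Summing the per-column counts into one score is an assumption, and `column_score` and `extend_alignment` are hypothetical names.

```python
from collections import Counter

def column_score(windows):
    """Sum, over columns, the count of the most frequent base in that column."""
    return sum(max(Counter(col).values()) for col in zip(*windows))

def extend_alignment(fixed_windows, new_seq, width):
    """Pick the window of `new_seq` that best fits the already-fixed windows.

    Returns (score, start). The fixed windows are never revisited, which is
    what makes the procedure greedy.
    """
    best_score, best_start = -1, None
    for start in range(len(new_seq) - width + 1):
        candidate = new_seq[start:start + width]
        score = column_score(fixed_windows + [candidate])
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start
```

Because earlier choices are frozen, each added sequence costs a single linear scan rather than a re-optimization of the whole alignment.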

cisGreedy • Final motif locations are used to build position-specific frequency matrices (PSFMs) • If a site was found on the reverse complement, the reverse-complement sequence is used in building the PSFM
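Building a position-specific frequency matrix from the final sites is a per-column count followed by normalization. The sketch below assumes the sites are already oriented on the same strand; `build_psfm` is a hypothetical name and the list-of-dicts layout is chosen only for readability.

```python
from collections import Counter

def build_psfm(sites, alphabet="ACGT"):
    """Return one {base: frequency} dict per motif position."""
    n = len(sites)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({base: counts[base] / n for base in alphabet})
    return matrix
```

Each row sums to 1 as long as every site contains only letters from the alphabet.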

Testing cisGreedy • AIY • 16 bp cis-regulatory motif drives expression • Experimentally verified • Gene battery consists of a set of genes bound by AIY • Orthologous genes contain highly specified binding sites • Individual binding sites of battery genes within a single species can vary considerably (Wenick and Hobert 759)

Cistematic Results for AIY • [Figure: regions of conservation in orthologous genes, hen-1]

Results for AIY • AIY Identified • AAATTGGCTTCCTCAAA cisGreedy • TTTGAGGAAGCCAATTT (reverse comp) • AAATTGGCTTCCTCAAA MEME • AAATTGGCTTCCTCAAA AIY-Battery Consensus

Cistematic Results for AIY • [Figure: hen-1]

Results for AIY • [Figure: hen-1]

Tompa Bakeoff • 3 benchmark datasets: Real, Markov Chain, Generic • 4 organisms: Human, Mouse, Fruit fly, Yeast • Each dataset contains 0-1 motifs • Each sequence can have 0 or multiple motifs • Report 0-1 motif per dataset and locations of motifs • Use statistical tools provided by bakeoff to analyze runs

Bakeoff example (hm03) • Identify the most reasonable motif in each dataset independently

Real • Interesting pattern appears among 3 of 10 sequences

Markov

Generic

Bakeoff example (hm03) • Identify the most reasonable motif in each dataset independently • Determine which motif appears most reasonable across the 3 benchmarks and map the motif in sequences using Cistematic • Compare results to actual locations (provided in the bakeoff package)

Solution

Real

Solution

Markov

Bakeoff results • Correlation Coefficient: nCC = (nTP nTN - nFN nFP) / √((nTP+nFN)(nTN+nFP)(nTP+nFP)(nTN+nFN)) • Sensitivity (fraction of known sites that are predicted): sSn = sTP / (sTP + sFN) • Positive Predictive Value (fraction of predicted sites that are known): sPPV = sTP / (sTP + sFP)
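The three statistics on this slide can be computed directly from the true/false positive and negative counts. The sketch below is not the bakeoff's own tooling: the function names are hypothetical, and returning `None` when a denominator is zero is an assumption about how to handle undefined values.

```python
import math

def correlation_coefficient(tp, tn, fp, fn):
    """nCC = (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else None

def sensitivity(tp, fn):
    """Fraction of known sites that are predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else None

def positive_predictive_value(tp, fp):
    """Fraction of predicted sites that are known: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else None
```

With 10 true positives, 80 true negatives, 5 false positives and 5 false negatives, the correlation coefficient is 775 / 1275, about 0.61, and both sensitivity and positive predictive value are 2/3.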

Bakeoff results • cisGreedy overall 7th best performer (excluding those with no data): • Overall top performer in fly • Worst performer in yeast • 3rd worst performer in mouse • 4th best performer in human

Bakeoff results • Adapted from Tompa, 2005 • When running programs in parallel, the correlation of motif finder results with true binding sites improves

Future goals • Complete analysis of results for cisGreedy using benchmarks established by Tompa paper (Nature Biotech, 2005) • Document results and algorithm development • Continue improving cisGreedy

References • Bioalgorithms.info • Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004. • Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15.7-8 (1999): 563-577. • Tompa, Martin, et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology January 2005: 137-144. • Wenick, Adam S., and Oliver Hobert. "Genomic cis-Regulatory Architecture and trans-Acting Regulators of a Single Interneuron-Specific Gene Battery in C. elegans." Developmental Cell 6 (2004): 757-770. • http://cistematic.caltech.edu

Acknowledgements • Ali Mortazavi • Barbara Wold • Wold Lab funding provided by DOE & NASA • Additional funding by NSF & NIH • SoCalBSI faculty, staff and fellow students
