The DoOP database. Endre Sebestyén, Agricultural Research Institute of the Hungarian Academy of Sciences. Beyond Next Generation Sequencing Workshop, Budapest, July 20-23, 2011.
Transcription factors and binding sites • Transcription factors • Activator domain and DNA-binding domain • Recognize specific sequence motifs • Binding is influenced by various factors • Binding sites • Short sequence motifs (6-12 bp) • Usually in the promoter, but in principle anywhere (3’ and 5’ UTRs, introns, etc.) • Conserved and ambiguous positions
“Real” promoter structure • No general motifs • No TATA-box, GC-box, etc. • Lots of false-positive TFBSs • With both wet-lab and in silico methods • Sometimes no apparent common TFBSs between coregulated genes
Binding site search and promoter analysis • Wet-lab methods • DNase footprinting • Electrophoretic mobility shift assay • ChIP-chip, ChIP-seq • In silico methods • Experimentally verified sites • Consensus sequences • Consensus matrices • De novo motif discovery • Oligo frequency • Phylogenetic footprinting • Other methods
Representation of sites • Consensus sequence • IUPAC nomenclature • Ambiguous positions • ACACTSSNWTT • With repeats • ACACTS{1,4}N{1,2}WTT
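The consensus notation above can be turned into a regular expression mechanically. The following is a minimal sketch, assuming the standard IUPAC nucleotide codes and that repeat counts written as {m,n} can be passed through to the regex unchanged; the function name `iupac_to_regex` is hypothetical and not part of any tool mentioned in the slides.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(consensus: str) -> str:
    """Translate an IUPAC consensus into a regex string.

    Repeat counts such as {1,4} are copied through unchanged
    (hypothetical helper, for illustration only).
    """
    out = []
    in_braces = False
    for ch in consensus.upper():
        if ch == "{":
            in_braces = True
            out.append(ch)
        elif ch == "}":
            in_braces = False
            out.append(ch)
        elif in_braces:
            out.append(ch)  # digits and comma of a repeat count
        else:
            out.append(IUPAC[ch])  # raises KeyError on a non-IUPAC letter
    return "".join(out)


plain = iupac_to_regex("ACACTSSNWTT")
with_repeats = iupac_to_regex("ACACTS{1,4}N{1,2}WTT")
```

Here `plain` is `ACACT[CG][CG][ACGT][AT]TT` and `with_repeats` is `ACACT[CG]{1,4}[ACGT]{1,2}[AT]TT`; either can be handed to `re.search` or `re.finditer` to scan a promoter sequence.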
Representation of sites • Matrices • Position Frequency Matrix (1.) • Position Weight Matrix (3.) • (Position Scoring Matrix)
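How a position frequency matrix becomes a position weight matrix can be sketched as follows. This is one common construction (log2 odds against a background, with a pseudocount to avoid log of zero), not necessarily the exact formula shown on the slide; the function name, the pseudocount of 1 and the uniform background are assumptions, and the counts are toy values.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert base -> per-position counts into base -> per-position log2 odds.

    The pseudocount is split among bases according to the background
    (an assumption; other schemes exist).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + pseudocount
        for b in BASES:
            freq = (pfm[b][i] + pseudocount * background[b]) / total
            pwm[b].append(math.log2(freq / background[b]))
    return pwm


# Toy counts from 8 hypothetical aligned sites, 3 positions wide.
toy_pfm = {
    "A": [8, 0, 2],
    "C": [0, 8, 2],
    "G": [0, 0, 2],
    "T": [0, 0, 2],
}
toy_pwm = pfm_to_pwm(toy_pfm)
```

In the toy matrix the first two positions are fully conserved, so the preferred base gets a positive weight and the others a negative one; the third position matches the background and scores zero for every base.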
Search for sites • Known motifs represented as consensus seqs • Perl regular expressions • if ($seq =~ /[AT]{1,}CCT[CG]/) { print "got it!\n"; } • EMBOSS • http://emboss.sourceforge.net/ • Fuzznuc • [CG](5)TG{A}N(1,5)C
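A regular-expression search like the Perl one-liner above only looks at the strand it is given. The sketch below, in Python rather than Perl, scans both strands and reports hit positions; the helper names are hypothetical, and overlapping hits are not reported because `re.finditer` returns non-overlapping matches.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(pattern, seq):
    """Return (strand, start, site) tuples for non-overlapping regex hits.

    Start positions are 0-based and always given in forward-strand
    coordinates; the site text is shown as read on the matching strand.
    """
    seq = seq.upper()
    regex = re.compile(pattern)
    hits = []
    for m in regex.finditer(seq):
        hits.append(("+", m.start(), m.group(0)))
    for m in regex.finditer(reverse_complement(seq)):
        # Convert the reverse-strand offset back to forward coordinates.
        start = len(seq) - m.end()
        hits.append(("-", start, m.group(0)))
    return hits


# Same pattern as the Perl example on the slide.
forward_hits = find_sites(r"[AT]{1,}CCT[CG]", "GGATCCTCAA")
reverse_hits = find_sites(r"[AT]{1,}CCT[CG]", "GAGGAT")
```

The first call finds the site on the forward strand, the second only on the reverse strand, which a single-strand search would miss.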
Search for sites • Known motifs represented as matrices • TFBS module • Bio::Matrix module • MotifScanner • http://homes.esat.kuleuven.be/~thijs/Work/MotifScanner.html • Using a background model
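Why the background model matters when scanning with a matrix can be shown with a small sketch. It does not use the TFBS, Bio::Matrix or MotifScanner interfaces; all function names, the toy probabilities and the thresholds are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def log_odds_matrix(probs, background):
    """Build base -> per-position log2(p / background) weights.

    All probabilities must be non-zero (toy helper, no pseudocounts).
    """
    return {b: [math.log2(p / background[b]) for p in probs[b]] for b in BASES}


def scan(pwm, seq, threshold):
    """Slide the matrix along seq and return (start, window, score) hits."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in pwm for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy motif that prefers ATA (hypothetical values).
probs = {
    "A": [0.7, 0.1, 0.7],
    "C": [0.1, 0.1, 0.1],
    "G": [0.1, 0.1, 0.1],
    "T": [0.1, 0.7, 0.1],
}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}

hits_uniform = scan(log_odds_matrix(probs, uniform), "GGATACC", 4.0)
hits_at_rich = scan(log_odds_matrix(probs, at_rich), "GGATACC", 4.0)
```

With a uniform background the ATA window passes the threshold; against an AT-rich background the same window is much less surprising and scores below it, so no hit is reported.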
Search for sites • De novo motifs • Orthologous genes • Same function in different species • Organ-specific genes • Tissue-specific genes • Developmental-stage-specific genes • Etc.
Search for sites • De novo motifs • Short oligo frequency • Expected vs observed frequency • Over- or underrepresented oligos in various promoter groups
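The observed-versus-expected comparison can be sketched as below. The expected count here comes from a simple independent-base model estimated from the same sequences, which is only one possible choice (an assumption; a separate background promoter set or a higher-order model could be used instead). Function names are hypothetical.

```python
from collections import Counter

BASES = set("ACGT")


def kmer_counts(seqs, k):
    """Count every k-mer made only of A/C/G/T across all sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= BASES:
                counts[kmer] += 1
    return counts


def observed_vs_expected(seqs, k):
    """Return kmer -> (observed, expected, observed / expected).

    Expected counts assume independent bases with the sequences' own
    composition (zero-order model, an assumption of this sketch).
    """
    base_counts = Counter(b for s in seqs for b in s.upper() if b in BASES)
    total = sum(base_counts.values())
    freqs = {b: base_counts[b] / total for b in BASES}
    counts = kmer_counts(seqs, k)
    n_windows = sum(counts.values())
    result = {}
    for kmer, observed in counts.items():
        prob = 1.0
        for b in kmer:
            prob *= freqs[b]
        expected = prob * n_windows
        result[kmer] = (observed, expected, observed / expected)
    return result


stats = observed_vs_expected(["ACGTACGT"], 2)
```

Ratios well above 1 mark overrepresented oligos and ratios well below 1 underrepresented ones; in practice a statistical test would be needed before calling either significant.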
Search for sites • De novo motifs • Phylogenetic footprinting • Functional binding sites and regions should be conserved • Sequence alignment • Global/local alignment: ClustalW/Dialign • Dialign is useful where sequences share only local homologies
Search for sites • M.K. Das and H.-K. Dai, “A survey of DNA motif finding algorithms,” BMC Bioinformatics, vol. 8, 2007. • M. Tompa, N. Li, T.L. Bailey et al., “Assessing computational tools for the discovery of transcription factor binding sites,” Nature Biotechnology, vol. 23, Jan. 2005, pp. 137-44.
Promoter databases • EPD http://epd.vital-it.ch/ • Eukaryotic Promoter Database • Release 107 • One half is based on experimental results (4800) • Maize • Drosophila • Xenopus • Mouse • Human • Etc. • Bulk promoter annotation (13000) • Rice
Promoter databases • DBTSS http://dbtss.hgc.jp/ • Database of Transcriptional Start Sites • Release 7.0 • cDNA 5’ sequencing, exact transcription start sites • Alternative promoters too • Species • Mouse • Rat • Fugu • Etc.
Promoter databases • PlantProm http://mendel.cs.rhul.ac.uk/mendel.php?topic=plantprom • Plant promoters • PromoSer http://biowulf.bu.edu/zlab/PromoSer/ • Human, mouse, rat • SCPD http://rulai.cshl.edu/SCPD/ • Saccharomyces cerevisiae • DCPD http://www-biology.ucsd.edu/labs/Kadonaga/DCPD.html • Drosophila • CEPDB http://rulai.cshl.edu/cgi-bin/CEPDB/home.cgi • C. elegans • NAR database (January) and web server (July) issues
Transcription factor binding site databases • TRANSFAC http://www.gene-regulation.com/ • Transcription factors, binding sites, literature data • Matrices and consensus sequences
Transcription factor binding site databases • JASPAR http://jaspar.genereg.net/ • Smaller amount of data • Non-redundant • Downloadable in multiple formats • Free
Transcription factor binding site databases • ORegAnno http://www.oreganno.org/ • Open REGulatory ANNOtation database • cisRED http://www.cisred.org/ • Cis-regulatory element database • Based on ENSEMBL • Human, mouse, rat, C. elegans • PLACE http://www.dna.affrc.go.jp/PLACE/ • PlantCARE http://bioinformatics.psb.ugent.be/webtools/plantcare/html/ • Plant binding sites • Based on literature data
Database of Orthologous Promoters • Collection of orthologous eukaryotic promoters • Two sections • Plant: based on the Arabidopsis thaliana genome • Chordate: based on the Homo sapiens genome • Aims • Provide a comprehensive promoter collection • Define and annotate conserved regions • Create a search interface for analysis, wet-lab pre-screening, etc.
DoOP – Cluster subsets • Cluster > Subset • Subset: collection of evolutionarily monophyletic sequences in a cluster • Plant subsets • Brassicaceae • Arabidopsis thaliana • Brassicaceae species • Eudicotyledons • Grape, Solanum species, papaya, tobacco • Magnoliophyta • Maize, rice • Viridiplantae
DoOP – Cluster subsets • Chordate subsets • Primates: primates only • Euarchontoglires: rodents • Eutheria: placental mammals • Theria: marsupials (opossum) • Mammalia: all mammals, incl Prototheria (platypus) • Amniota: birds and reptiles • Tetrapoda: amphibians • Teleostomi: most of the fishes • Vertebrata: all vertebrates • Chordata: all chordates, incl Ciona sp.
Modified information content value • 80% of maximum value as seeds • Can drop to 65% • Min 5 nucleotides • Max 20% gaps in each contributing sequence • Max 40% gap or N letters
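The seed-and-extend idea behind these thresholds can be sketched as follows. The slides do not define the modified information content value, so this sketch substitutes plain Shannon information content per alignment column, scaled by the fraction of non-gap rows (an assumption). It applies the 80% seed, 65% extension and 5-nucleotide minimum from the slide, and leaves out the per-sequence and per-region gap limits. All names are hypothetical.

```python
import math


def column_ic(column):
    """Information content (bits) of one column, scaled by non-gap fraction.

    Stand-in for the slide's 'modified' value, whose exact definition
    is not given in the slides.
    """
    letters = [c for c in column if c in "ACGT"]
    if not letters:
        return 0.0
    ic = 2.0  # maximum for a four-letter alphabet
    for base in "ACGT":
        p = letters.count(base) / len(letters)
        if p > 0:
            ic += p * math.log2(p)
    return ic * len(letters) / len(column)


def conserved_regions(alignment, seed=0.80, extend=0.65, min_len=5, max_ic=2.0):
    """Return half-open (start, end) column ranges of conserved regions.

    A region is a maximal run of columns at or above the extension
    threshold that contains at least one seed column and is long enough.
    """
    ic = [column_ic(col) for col in zip(*alignment)]
    regions = []
    i = 0
    while i < len(ic):
        if ic[i] < extend * max_ic:
            i += 1
            continue
        j = i
        while j < len(ic) and ic[j] >= extend * max_ic:
            j += 1
        has_seed = any(v >= seed * max_ic for v in ic[i:j])
        if has_seed and j - i >= min_len:
            regions.append((i, j))
        i = j
    return regions


# Toy alignment of three hypothetical orthologous promoter fragments.
toy_alignment = [
    "TTACGTGCAAT",
    "GAACGTGCATC",
    "CCACGTGCAGA",
]
toy_regions = conserved_regions(toy_alignment)
```

For the toy alignment the seven fully conserved columns form a single region, while the variable flanks fall below the extension threshold and are excluded.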
Motif generation • [Figure: motif generation across the Brassicaceae, eudicotyledons and Magnoliophyta subsets]
DoOP – Web interface • Search in the promoter collections • Annotation • Name • Sequence IDs • Taxa • Sequence • Search in the motifs • Search in the promoters with your own motifs
References • E. Sebestyén, T. Nagy, S. Suhai, and E. Barta, “DoOPSearch: a web-based tool for finding and analysing common conserved motifs in the promoter regions of different chordate and plant genes,” BMC Bioinformatics, vol. 10, Jan. 2009. • E. Barta, E. Sebestyén, T.B. Pálfy, G. Tóth, C.P. Ortutay, and L. Patthy, “DoOP: Databases of Orthologous Promoters, collections of clusters of orthologous upstream sequences from chordates and plants,” Nucleic Acids Research, vol. 33, 2005, pp. D86-90.
Contact • http://doop.abc.hu • http://www.slideshare.net/razZ0r • sebestyene@mail.mgki.hu